We are seeking a highly skilled and proactive Infrastructure Architect to lead incident management and technical problem-solving efforts across our enterprise systems. This role requires a hands-on leader with deep technical expertise, strong communication skills, and the ability to operate under pressure in a fast-paced production environment.
Key Responsibilities:
- Incident Management & Resolution
  - Lead and coordinate high-severity incident response and root cause analysis.
  - Facilitate technical war rooms and drive resolution across cross-functional teams.
  - Provide clear, timely updates to stakeholders and leadership during outages or critical issues.
- Technical Leadership
  - Guide troubleshooting sessions involving AWS Cloud, Salesforce, databases, and networking infrastructure.
  - Analyze complex infrastructure issues and propose creative, actionable solutions.
  - Collaborate with vendors to evaluate options and recommend the best course of action.
- Infrastructure & Cloud Expertise
  - Design and support scalable, secure, and resilient cloud infrastructure (primarily AWS).
  - Understand and troubleshoot issues across systems, including:
    - AWS cloud services (required)
    - Azure cloud services (preferred)
    - Snowflake (nice to have)
    - Salesforce platform (preferred)
    - Relational and NoSQL databases
    - Datadog monitoring and observability tools
    - Cisco networking (switches, routing, connectivity)
- Operational Excellence
  - Bring strong production support experience, including after-hours availability when needed.
  - Monitor system health and performance, and proactively address potential issues.
  - Maintain and improve incident response playbooks and escalation procedures.
- Communication & Leadership
  - Communicate effectively with technical and non-technical stakeholders.
  - Provide leadership in planning, prioritizing, and executing infrastructure initiatives.
  - Mentor junior engineers and foster a culture of accountability and continuous improvement.
Required Qualifications:
- 10+ years of experience in IT infrastructure, cloud operations, or related roles.
- Proven experience leading incident response and technical troubleshooting.
- Strong hands-on knowledge of:
  - AWS (EC2, VPC, S3, CloudWatch, etc.)
  - Salesforce administration and integration
  - Databases (SQL, NoSQL)
  - Datadog or similar observability platforms
  - Cisco networking (switches, VLANs, routing)
- Familiarity with programming/scripting (Python, Bash, etc.) and infrastructure-as-code tools.
- Excellent analytical, problem-solving, and decision-making skills.
- Strong leadership and stakeholder management capabilities.
- Willingness to work extended hours during critical incidents.
Preferred Qualifications:
- Experience working in a hybrid cloud environment.
- Exposure to DevOps practices and CI/CD pipelines.
- Prior experience in a consulting or vendor-facing role.