
Senior Director, Cloud Operations (SRE, SDM)
- Bangalore, Karnataka
- Permanent
- Full-time
- Cloud Infrastructure Operations
- Oversee the daily operations of cloud platforms (AWS, Azure, GCP), ensuring high availability and performance across global regions.
- Lead the development and execution of operational runbooks, SOPs, and escalation paths.
- Incident Management & Response
- Own the end-to-end incident management lifecycle: detection, triage, escalation, resolution, and post-incident review.
- Lead a global incident response team with 24/7 coverage, ensuring seamless handoffs across time zones.
- Implement real-time monitoring, alerting, and automated remediation to reduce MTTD and MTTR.
- Use data analytics to identify incident trends, recurring issues, and systemic risks.
- Conduct blameless postmortems and ensure corrective actions are prioritized and tracked to closure.
- Data-Driven Operational Leadership
- Build and lead a global team of cloud engineers, SREs, and operations analysts using a metrics-first approach.
- Define and track operational KPIs (e.g., uptime, incident frequency, resolution time, change success rate) to drive accountability and performance.
- Leverage dashboards and analytics platforms (e.g., Datadog, Grafana, Splunk, ServiceNow) to provide real-time visibility into system health and team performance.
- Use data to inform staffing models, on-call rotations, and workload balancing across regions.
- Foster a culture of continuous improvement through data-backed retrospectives and operational reviews.
- AI-Enabled Focus
- Drive AI and ML adoption in operational workflows (e.g., predictive monitoring, incident pattern analysis) to improve uptime and automate repetitive tasks.
- Define and execute an AI-driven observability strategy, using AIOps platforms for intelligent alerting and root cause analysis.
- Collaborate with Engineering, Security, and Product teams to embed AI-enabled automation in deployment pipelines, change management, and related processes.
- Establish and maintain SLOs/SLAs, using AI-generated insights to prioritize engineering work that improves reliability and customer experience.
- Oversee incident management, postmortems, and continuous improvement, incorporating AI tools for impact analysis and knowledge retention.
- Operational Governance
- Define and enforce SLAs, SLOs, and operational KPIs.
- Ensure compliance with security, regulatory, and audit requirements.
- Manage change control, configuration management, and release processes to minimize operational risk.
- Cost & Vendor Management
- Monitor and optimize cloud spend through cost governance and usage analysis.
- Manage vendor relationships, contracts, and service-level agreements.
- Collaboration & Communication
- Partner with engineering, security, and business teams to align operations with product and service goals.
- Provide regular reporting and updates to executive leadership on operational health, risks, and incident trends.
- Education
- Bachelor's or Master's degree in Computer Science, Information Systems, or a related field.
- Experience
- 14+ years in IT operations, with 7+ years in cloud infrastructure and operations leadership.
- Proven experience leading global teams and managing high-severity incidents in large-scale environments.
- Skills
- Deep expertise in cloud operations, incident response, and service reliability.
- Strong knowledge of ITIL, SRE, and DevOps practices.
- Proficiency in operational analytics and observability tools.
- Excellent leadership, communication, and cross-functional collaboration skills.
- Strong presentation skills, including experience presenting to large global audiences.
- Certifications (Preferred)
- AWS Certified DevOps Engineer - Professional
- Azure Administrator Associate
- ITIL Foundation or Practitioner