
Site Reliability Engineering Manager
- Bangalore, Karnataka
- Permanent
- Full-time
- Lead by Example: Provide technical leadership and guidance to the SRE team by applying hands-on skills and continuous learning. Build and mentor a world-class engineering team that partners closely with platform teams to design scalable, reliable systems, while contributing actively to both platform and application code.
- Drive Automation for Data Platforms and Infrastructure: Manage Infrastructure as Code (IaC) and develop tooling to enhance engineering productivity. Lead initiatives for cost optimization and operational efficiency at scale.
- Incident Response and On-Call Engagement: Actively participate in on-call rotations and resolve critical production issues. Lead response efforts during major incidents and serve as the primary escalation point for complex problems.
- Drive Post-Incident Analysis: Perform root cause investigations and ensure follow-up with actionable postmortems and infrastructure hardening initiatives. Implement fixes (in code, infrastructure, or processes) to prevent recurrence.
- Active Collaboration with Cross-Functional Teams: Partner closely with engineering teams to troubleshoot issues, deploy fixes, and enhance system reliability. Champion operational excellence through direct technical contributions.
- Establish Production Readiness Standards: Take ownership of Application Security, Disaster Recovery, and Application Documentation, keeping them aligned with the latest system architecture and configurations.
- Hands-on experience supporting and maintaining applications in cloud or hybrid environments
- Expertise in cloud-native services, including ETL frameworks (Apache Spark, Flink) and messaging systems (Kafka)
- Strong knowledge of cloud infrastructure and services (e.g., AWS, GCP, Kubernetes) and observability tools (e.g., Prometheus, Grafana, CloudWatch)
- Programming experience in Python, Java, or Scala
- Proven ability to lead incident response, perform root cause analysis, and drive system reliability improvements
- Bachelor's degree or equivalent, with 10+ years of experience in the SRE domain and at least 2 years in a management role focused on leading, hiring, developing and building teams
- Hands-on experience supporting enterprise data systems on distributed architectures
- Exposure to data visualization tools such as Tableau, Business Objects, and ThoughtSpot, with experience supporting and troubleshooting issues related to dashboards and reports
- Experience with modern and distributed databases such as Snowflake, Cassandra, SingleStore, and SAP HANA
- Experience using GenAI or automation tools for issue detection, alerting, or remediation
- Solid understanding of system design, data structures, and incident management best practices