
Manager, Site Reliability Engineering
- Bangalore, Karnataka
- Permanent
- Full-time
- Lead and mentor a team of Site Reliability Engineers that runs the core systems each of our engineering teams relies on. The team focuses on cloud account lifecycle management (AWS, Azure and GCP) and access controls, platform reliability and operational excellence, infrastructure architecture and governance, and RBAC and compliance enforcement.
- Own the architecture and operational integrity of cloud-native platforms, including Databricks and Power BI.
- Define and enforce governance policies including tagging, RBAC, compliance, and security standards.
- Drive FinOps maturity through cost visibility, forecasting, anomaly detection, and optimisation.
- Collaborate with SRE, Security, Finance, and Engineering teams to align infrastructure with business and financial goals.
- Champion automation and Infrastructure-as-Code (IaC) to improve deployment velocity and reduce manual overhead.
- Partner with security and other “shared services” teams to align, automate, integrate and orchestrate specialist tooling into a common set of SRE best practices that supports the wider Software Delivery Lifecycle and Product Lifecycle.
- Plan and execute projects in support of the SRE objectives, and ensure projects are delivered with high quality, on time, and within budget
- Hire, develop and retain a highly skilled SRE team
- Evaluate hardware and software technologies to improve efficiency and performance
- Contribute to platform security
- Developer/DevOps/SRE/Platform experience and a strong interest in software delivery and ongoing operation.
- Experience owning and leading the architecture and rollout of automation, tools, technologies, patterns and guardrails across an organisation.
- Experience working in a globally distributed team.
- Deep and extensive public cloud knowledge and experience with AWS, Azure or GCP.
- Deep knowledge of containers (Docker) and orchestration (Kubernetes).
- Knowledge of tools and patterns around CI/CD (familiar with GitHub, Travis CI, CircleCI, Buildkite or similar).
- Cloud cost optimisation: using automation to keep cloud costs under control and within budget, and enabling individual engineering teams to optimise their own cloud costs.
- Knowledge of operations, including incident management, immutable infrastructure as code (esp. Terraform or CloudFormation), and problem-solving.
- Experience producing robust, well-tested code, preferably in Go (Golang); we will also consider Python, JavaScript, Ruby, Java or C# if you are happy to learn Go.
- Excellent communication skills, including experience in writing good documentation and running workshops.
- Vendor selection and management experience.
- Bachelor's or higher degree in Computer Science, Information Technology, or a related field.
- Background in centralized Site Reliability Engineering or Platform Engineering supporting globally distributed engineering teams
- At least 1 year of experience leading a team of Site Reliability Engineers
- At least 2 years of experience working as a senior member of a centralized Cloud enablement / Platform or a similar team
- At least 8 years of experience in SRE/DevOps/Platform Engineering in cloud environments
- Experience with IaC and Containers to achieve scalable, reliable, performant and secure SaaS platform infrastructure
- Python / Golang / Java / C# / C / C++ / Bash experience
- Experience with Big Data, Machine Learning and AI platforms (Databricks, Snowflake, etc.)
- Experience with monitoring systems such as New Relic, ELK, Prometheus, Datadog, X-Ray, etc.
- Security background
- SQL, NoSQL and graph databases
- Relevant certification, e.g. AWS, GCP or Azure (Professional level or higher)