Lead Site Reliability Engineer
CCTECH
- Pune, Maharashtra
- Permanent
- Full-time
- Own reliability, uptime, and production health of critical systems
- Design systems for scale, failure handling, and unpredictable load
- Lead architecture decisions balancing scalability, cost, and reliability
- Drive SRE best practices (SLOs, observability, incident prevention)
- Lead and mentor engineers; set technical direction and standards
- Act as the go-to person for complex system design and debugging
- Build and evolve infrastructure using Infrastructure-as-Code
- Drive automation for testing, validation, and recovery systems
- Collaborate with product and platform teams to ensure systems are:
Reliable and observable
Easy to integrate and adopt
- Lead critical incident resolution, root cause analysis, and long-term fixes
- Proactively reduce system fragility and improve platform resilience
- 8–12 years of experience in SRE / DevOps / Cloud Engineering
AWS production systems (large-scale preferred)
Infrastructure-as-Code (Terraform or equivalent)
CI/CD pipelines and deployment automation
- Proven experience in:
Handling and debugging production incidents at scale
Improving system reliability, performance, and cost
- Strong programming ability:
Ability to write production-grade, maintainable code
- Experience leading engineers and driving technical direction
Distributed systems
Observability and monitoring
Scalability, performance, and cost optimization
Good to Have
- Experience with identity and access management systems (OAuth2, OIDC, etc.)
- Exposure to API platforms / API gateways
- Experience defining and enforcing SLOs and error budgets
- Exposure to multi-cloud environments
- Experience with chaos engineering
- Background in backend engineering before moving to SRE
- High-ownership role with real impact on production systems
- Opportunity to shape SRE practices and platform direction
- Work on systems used by global engineering organizations
- Exposure to advanced areas such as identity platforms, digital twins, AI/ML systems, and large-scale cloud services