Principal Site Reliability Engineer
Cvent
- Gurgaon, Haryana
- Permanent
- Full-time
In This Role, You Will:
- Set long-term technical direction for complex problems; communicate timeline, scope, risks, and the technical roadmap to leadership and stakeholders.
- Continuously evaluate emerging cloud and AI/automation technologies; run POCs to assess fit and pioneer intelligent copilots for support, incident response, and developer workflows.
- Architect, standardize, and scale SRE frameworks and best practices; drive adoption and continual improvement of SLIs/SLOs/SLAs across business-critical platforms.
- Lead design and integration of CI/CD, containerization (Docker, Kubernetes), and IaC (Terraform, AWS CDK) for large-scale environments; ensure security and regulatory compliance.
- Define and implement observability, monitoring, and alerting strategies; conduct deep-dive RCAs using Datadog, Prometheus, Grafana, and ELK; lead blameless postmortems.
- Lead capacity planning, cost optimization, and disaster recovery to ensure scalability, reliability, and system resilience.
- Translate business risk and product goals into actionable reliability and observability strategies; partner closely with SRE, Product, and Engineering teams.
- Mentor and upskill SRE/DevOps engineers; foster a culture of ownership, continuous learning, and operational excellence.
- Pioneer the use of AI-powered automation and intelligent copilots for alert triage, event grouping, and developer/operations workflow efficiencies.
- Serve as a mentor and organizational leader, influencing technical direction, upskilling teams, and fostering a culture of shared reliability ownership and blameless postmortems.
- Bridge business and technology stakeholders, translating business risk and product goals into actionable reliability and observability strategies.
- Represent the technology perspective and priorities to leadership and other stakeholders by continuously communicating timeline, scope, risks, and the technical roadmap.
Here's What You Need:
- 10+ years in SRE, cloud engineering, or DevOps with significant time in an architect, staff, or principal role.
- Deep fluency in AWS across multi-account, multi-region, and high-traffic environments; strong foundation in distributed systems architecture and infrastructure as code.
- Demonstrable leadership scaling organizational SRE practices: CI/CD, observability, incident management, RCAs, and blameless postmortems.
- Proven track record driving adoption of AI, automation, and ML to improve reliability, operational efficiency, and developer productivity.
- Expert programming/scripting (Python, Go, or similar) with Linux internals depth and advanced troubleshooting of distributed systems.
- Validated breadth across networking, cloud, databases, and scripting; experience with multitier architectures.
- Exceptional ability to influence, coach, and communicate across engineering and product; acts as a pragmatic technical conscience with a strong bias for execution.
- Mastery of incident management, postmortem culture, and root cause analysis for distributed systems.
- Experience with Unix/Linux environments and a deep grasp of system internals.
- Experience working on large-scale distributed systems, including multi-tiered architectures.
- Validated breadth of understanding and development of solutions based on multiple technologies, including networking, cloud, database, and scripting languages.
- Strong leadership, communication, and interpersonal skills geared toward getting things done.