Principal Site Reliability Engineer

Gurgaon, Haryana
Permanent
Full-time

10 days ago

Job Description:

Overview:

Cvent is a leading meetings, events, and hospitality technology provider with more than 5,000 employees and 24,000+ customers worldwide, including 60% of the Fortune 500. Founded in 1999, Cvent delivers a comprehensive event marketing and management platform for marketers and event professionals and offers software solutions to hotels, special event venues and destinations to help them grow their group/MICE and corporate travel business. Our technology brings millions of people together at events around the world. In short, we’re transforming the meetings and events industry through innovative technology that powers the human connection.

Cvent's strength lies in its people, fostering a culture where everyone is encouraged to think like entrepreneurs, taking risks and making decisions confidently. We value diverse perspectives and celebrate differences, working together with colleagues and clients to build strong connections.

AI at Cvent: Leading the Future

Are you ready to shape the future of work at the intersection of human expertise and AI innovation? At Cvent, we’re committed to continuous learning and adaptation: AI isn’t just a tool for us, it’s part of our DNA. We’re looking for candidates who are eager to evolve alongside technology. If you love to experiment boldly, share your discoveries, and help define best practices for AI-augmented work, you’ll thrive here. Our team values professionals who thoughtfully integrate AI into their daily work, delivering exceptional results while relying on the human judgment and creativity that drive real innovation.

Throughout our interview process, you’ll have the chance to demonstrate how you use AI to learn, iterate, and amplify your impact. If you’re excited to be part of a team that’s leading the way in AI-powered collaboration, we’d love to meet you.

Disclaimer: Beware of Recruitment Scams – Legitimate Cvent recruiting communications will always come from an official ‘name@ ’ email.
We never request any payments or ask for sensitive personal or financial information via chat or social media platforms. For more information, please visit:
In This Role, You Will:

Set long-term technical direction for complex problems; communicate timeline, scope, risks, and the technical roadmap to leadership and stakeholders.
Continuously evaluate emerging cloud and AI/automation technologies; run POCs to assess fit and pioneer intelligent copilots for support, incident response, and developer workflows.
Architect, standardize, and scale SRE frameworks and best practices; drive adoption and continual improvement of SLIs/SLOs/SLAs across business-critical platforms.
Lead design and integration of CI/CD, containerization (Docker, Kubernetes), and IaC (Terraform, AWS CDK) for large-scale environments; ensure security and regulatory compliance.
Define and implement observability, monitoring, and alerting strategies; conduct deep-dive RCAs using Datadog, Prometheus, Grafana, and ELK; lead blameless postmortems.
Lead capacity planning, cost optimization, and disaster recovery to ensure scalability, reliability, and system resilience.
Translate business risk and product goals into actionable reliability and observability strategies; partner closely with SRE, Product, and Engineering teams.
Mentor and upskill SRE/DevOps engineers; foster a culture of ownership, continuous learning, and operational excellence.
Pioneer the use of AI-powered automation and intelligent copilots for alert triage, event grouping, and developer/operations workflow efficiencies.
Serve as a mentor and organizational leader, influencing technical direction, upskilling teams, and fostering a culture of shared reliability ownership and blameless postmortems.
Lead capacity planning, cost optimization, and disaster recovery initiatives to ensure seamless scalability and system resilience.
Bridge business and technology stakeholders, translating business risk and product goals into actionable reliability and observability strategies.
Represent the technology perspective and priorities to leadership and other stakeholders by continuously communicating timeline, scope, risks, and the technical roadmap.

Here's What You Need:

10+ years in SRE, cloud engineering, or DevOps with significant time in an architect, staff, or principal role.
Deep fluency in AWS across multi-account, multi-region, and high-traffic environments; strong foundation in distributed systems architecture and infrastructure as code.
Demonstrable leadership scaling organizational SRE practices: CI/CD, observability, incident management, RCAs, and blameless postmortems.
Proven track record driving adoption of AI, automation, and ML to improve reliability, operational efficiency, and developer productivity.
Expert programming/scripting (Python, Go, or similar) with Linux internals depth and advanced troubleshooting of distributed systems.
Validated breadth across networking, cloud, databases, and scripting; experience with multi-tier architectures.
Exceptional ability to influence, coach, and communicate across engineering and product; able to act as a pragmatic technical conscience with a strong bias for execution.
Mastery of incident management, postmortem culture, and root cause analysis for distributed systems.
Experience with Unix/Linux environments and a deep grasp of system internals.
Experience working on large-scale distributed systems, including multi-tier architectures.
Validated breadth of understanding and development of solutions based on multiple technologies, including networking, cloud, database, and scripting languages.
Strong leadership, communication and interpersonal skills geared to getting things done.

Cvent
