
Principal Site Reliability Engineer
- Hyderabad, Telangana
- Permanent
- Full-time

Infrastructure Design & Maintenance
- Lead the design, build, and maintenance of our core infrastructure using infrastructure-as-code (IaC) tools (e.g., Terraform, CloudFormation).
- Own the provisioning and lifecycle management of production, staging, and other critical environments.
- Architect and implement shared infrastructure components (e.g., logging, metrics, service mesh, load balancing).
- Drive continuous improvements to infrastructure scalability, availability, and performance.
- Act as a key partner to development teams, providing infrastructure primitives and strategic guidance on deployment needs.

Deployment Systems & CI/CD
- Design, own, and enhance our CI/CD pipelines (GitHub Actions, Argo CD) to maximize reliability, velocity, and automation.
- Establish and enforce best practices across all environments for deployment, rollback, and observability.
- Partner with developers to architect and streamline the testing and delivery of code to production.
- Champion the elimination of manual steps in deployment and operations workflows.

Reliability, Observability & Tooling
- Architect and manage our monitoring, alerting, and logging infrastructure (Kube-Prometheus-Grafana stack).
- Define, implement, and track SLOs/SLIs for core services, holding service owners accountable.
- Proactively identify and eliminate single points of failure, performance bottlenecks, and sources of instability.
- Lead reliability reviews, blameless post-incident analyses, and capacity planning initiatives.
- Perform basic debugging of Java applications to assist development teams in troubleshooting.

Documentation & Knowledge Sharing
- Ensure all systems and processes built or maintained by the SRE team are accompanied by thorough, up-to-date documentation.
- Mentor other engineers and contribute to shared knowledge bases, runbooks, and developer-facing materials.
- Lead internal training sessions, walkthroughs, and pairing sessions to cross-train teammates and reduce knowledge silos.

Collaboration & Culture
- Work closely with the SRE Lead to define team strategy, prioritize work, and execute on team goals.
- Mentor junior team members and act as a technical leader across engineering.
- Participate in on-call rotations, acting as an escalation point for complex issues.
- Champion a culture of blameless learning, transparency, and continuous improvement.

Requirements
- Experience: 8+ years in a senior SRE, DevOps, or related infrastructure role.
- Cloud: Deep, hands-on expertise with AWS, including services like ECS, EKS, Aurora (Postgres), EC2, S3, and VPC.
- Containers & Orchestration: Strong, production-level proficiency with Kubernetes and Helm. Deep understanding of container runtimes and networking.
- CI/CD: Extensive experience designing, building, and managing complex CI/CD pipelines using tools like GitHub Actions and Argo CD. Experience with container registries like GitHub Container Registry (GHCR).
- IaC: Expertise in Infrastructure as Code, with strong proficiency in Terraform or CloudFormation.
- Observability: Proven experience with observability stacks, particularly the Kube-Prometheus-Grafana stack, including custom metric instrumentation and advanced dashboarding.
- Debugging: Ability to perform basic performance analysis and debugging of applications (Java experience is a strong plus).
- Leadership: Demonstrated ability to mentor junior engineers, lead technical projects, and drive architectural decisions.
- Incident Management: Experience leading incident response, conducting blameless post-mortems, and driving resulting action items to completion.