Lead Site Reliability Engineer, DevOps
Qualys
- Pune, Maharashtra
- Permanent
- Full-time
- Design, implement, and maintain end-to-end observability using:
  - Prometheus for metrics collection
  - Alertmanager for alert routing, deduplication, and escalation
  - Grafana for visualization and dashboards
  - AppDynamics for APM, transaction tracing, and application health
- Build actionable dashboards for:
  - SLIs, SLOs, and error budgets
  - Application, infrastructure, and platform health
- Reduce alert fatigue by implementing signal-based alerting and proper severity models
- Manage and optimize ClickHouse for:
  - High-volume metrics, logs, or traces
  - Long-term retention and fast analytical queries
- Work on schema design, performance tuning, and cost optimization
- Define and measure SRE best practices (SLIs, SLOs, SLAs)
- Participate in incident response, postmortems, and root cause analysis
- Drive reliability improvements through automation and capacity planning
- Develop tooling and automation using at least one scripting/programming language
- Automate monitoring onboarding, alert generation, and dashboard creation
- Improve operational efficiencies across DevOps tooling
- Strong Linux fundamentals:
  - Troubleshooting, performance tuning, networking, system internals
- Scripting / Programming (any one or more):
  - Python (preferred), Bash, Go, or similar
- Observability Tools (hands-on):
  - Prometheus
  - Alertmanager
  - Grafana
  - AppDynamics
- Data Platform:
  - Hands-on experience with ClickHouse
- Metrics vs logs vs traces
- Golden signals (latency, traffic, errors, saturation)
- Alert thresholds, routing policies, escalation strategies
- Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
- Infrastructure as Code (Terraform, Helm)
- CI/CD observability
- Cloud platforms (AWS / Azure / GCP)
- Experience managing observability at scale (100+ services / platforms)
- Ability to architect observability solutions, not just operate them
- Strong production troubleshooting and incident ownership
- Mentoring junior engineers
- Influence DevOps and SRE best practices across teams
- Communicate clearly with developers and leadership
- 5-7 years of experience in SRE / DevOps / Production Engineering
- Experience operating high-availability, large-scale systems
- Proven background in observability-driven reliability improvements