Lead Site Reliability Engineer, DevOps

Qualys

  • Pune, Maharashtra
  • Permanent
  • Full-time
  • 2 months ago
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title
Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary
We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems. This role requires both deep technical expertise and a production ownership mindset.

Primary Responsibilities
Observability & Monitoring
  • Design, implement, and maintain end-to-end observability using:
      • Prometheus for metrics collection
      • Alertmanager for alert routing, deduplication, and escalation
      • Grafana for visualization and dashboards
      • AppDynamics for APM, transaction tracing, and application health
  • Build actionable dashboards for:
      • SLIs, SLOs, and error budgets
      • Application, infrastructure, and platform health
  • Reduce alert fatigue by implementing signal-based alerting and proper severity models
Data & Metrics Platform
  • Manage and optimize ClickHouse for:
      • High-volume metrics, logs, or traces
      • Long-term retention and fast analytical queries
  • Work on schema design, performance tuning, and cost optimization
Reliability & Operations
  • Define and track SLIs, SLOs, and SLAs in line with SRE best practices
  • Participate in incident response, postmortems, and root cause analysis
  • Drive reliability improvements through automation and capacity planning
Automation & Engineering
  • Develop tooling and automation using at least one scripting/programming language
  • Automate monitoring onboarding, alert generation, and dashboard creation
  • Improve operational efficiency across DevOps tooling
Required Technical Skills (Must-Have)
Core Skills
  • Strong Linux fundamentals:
      • Troubleshooting, performance tuning, networking, system internals
  • Scripting / Programming (any one or more):
      • Python (preferred), Bash, Go, or similar
  • Observability Tools (hands-on):
      • Prometheus
      • Alertmanager
      • Grafana
      • AppDynamics
  • Data Platform:
      • Hands-on experience with ClickHouse
Monitoring & Alerting Concepts
  • Metrics vs logs vs traces
  • Golden signals (latency, traffic, errors, saturation)
  • Alert thresholds, routing policies, escalation strategies
Preferred / Nice-to-Have Skills
  • Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
  • Infrastructure as Code (Terraform, Helm)
  • CI/CD observability
  • Cloud platforms (AWS / Azure / GCP)
  • Experience managing observability at scale (100+ services / platforms)
Senior-Level Expectations
  • Ability to architect observability solutions, not just operate them
  • Strong production troubleshooting and incident ownership
  • Mentoring junior engineers
  • Influence DevOps and SRE best practices across teams
  • Communicate clearly with developers and leadership
Experience & Qualification
  • 5-7 years of experience in SRE / DevOps / Production Engineering
  • Experience operating high-availability, large-scale systems
  • Proven background in observability-driven reliability improvements
