
Middle QA engineer (Networking, Linux) for scrum team IRC270498
- Hyderabad, Telangana
- Permanent
- Full-time
- Bachelor’s or Master’s degree in Computer Science, Information Systems, Engineering, or a related technical field.
- 12+ years of total experience in infrastructure, platform engineering, or software development roles, including at least 3 years in an SRE or DevOps leadership role.
- Deep understanding of Linux/Unix systems, networking fundamentals, and containerized environments (Docker, Kubernetes).
- Proven experience managing large-scale production systems, including high-availability, distributed, and event-driven architectures.
- Strong hands-on experience with cloud platforms such as AWS, GCP, or Azure and infrastructure-as-code tools (e.g., Terraform, CloudFormation).
- Proficiency in at least one scripting or programming language (e.g., Python, Go, Shell, or Java).
- Demonstrated experience building observability solutions (metrics, logs, traces) and integrating them into proactive monitoring and alerting systems.
- Solid understanding of incident response practices, runbook automation, on-call rotation management, and disaster recovery planning.
- Familiarity with modern CI/CD tools (Jenkins, GitLab CI, Argo CD, Spinnaker) and release automation best practices.
- Strong problem-solving and debugging skills, especially in high-pressure, production-critical environments.
- Excellent leadership, communication, and cross-functional collaboration skills.
- Lead the SRE function, owning end-to-end service reliability, observability, incident management, capacity planning, and production readiness.
- Establish SLOs, SLIs, and error budgets in collaboration with product and engineering teams to drive service quality goals.
- Build and maintain highly available, fault-tolerant, and self-healing infrastructure leveraging IaC, automation, and scalable architectures.
- Design and implement monitoring, alerting, and observability platforms using tools like Prometheus, Grafana, Datadog, ELK/EFK stack, or equivalent.
- Drive the evolution of CI/CD pipelines, release automation, and safe deployment practices using GitOps or similar methodologies.
- Lead and refine the incident management lifecycle, including root cause analysis (RCA), incident postmortems, and production runbooks.
- Optimize cost, performance, and scalability of cloud infrastructure across hybrid or multi-cloud environments (AWS, GCP, Azure).
- Champion DevSecOps and SRE best practices, advocating for early detection, chaos engineering, and continuous improvement in resilience engineering.
- Mentor and develop a team of SREs and platform engineers; conduct performance reviews and technical coaching.
- Serve as a key advisor in architectural reviews to ensure systems are built with reliability, scalability, and observability in mind.
- Maintain strong partnerships with Security, Product, QA, and Engineering teams to support agile development and delivery.