Sr Engineer, Site Reliability
TMUS Global Solutions
- Hyderabad, Telangana
- Permanent
- Full-time
- Design, build, and maintain large-scale, production-grade Kubernetes (K8s) clusters to ensure high availability, scalability, and security across T-Mobile's hybrid infrastructure.
- Develop and manage Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, and ARM templates, enabling consistent and automated infrastructure deployment across AWS, Azure, and on-premises data centers.
- Resolve platform-related customer tickets by diagnosing and addressing infrastructure, deployment, and performance issues to ensure reliability and seamless user experience.
- Lead incident response, root cause analysis (RCA), and post-mortems, implementing automation to prevent recurrence.
- Implement and optimize CI/CD pipelines leveraging GitLab, Argo, and Flux to support seamless software delivery, continuous integration, and progressive deployment strategies.
- Automate system operations through scripting and development in Go, Bash, and Python, driving efficiency, repeatability, and reduced operational overhead.
- Monitor, analyze, and enhance system performance, proactively identifying bottlenecks and ensuring reliability through data-driven observability and capacity planning.
- Troubleshoot complex issues across the full stack (network, storage, compute, and application layers) using advanced tools like pcap, telnet, and Linux-native diagnostics.
- Apply deep Kubernetes expertise to diagnose, resolve, and prevent infrastructure-related incidents while mentoring team members on container orchestration best practices.
- Drive a culture of automation, resilience, and continuous improvement, contributing to the evolution of T-Mobile's platform engineering and cloud infrastructure strategies.
- Drive innovation by recommending new technologies, frameworks, and tools.
- Perform additional duties and strategic projects as assigned.
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- 5-9 years of hands-on experience in Site Reliability Engineering roles supporting large-scale, production-grade systems.
- Extensive hands-on experience with Kubernetes (K8s), including cluster provisioning, scaling, upgrades, and performance tuning in both on-premises and multi-cloud environments.
- Proficiency in scripting and programming with Go, Bash, or Python to automate operational tasks and develop scalable infrastructure solutions.
- Hands-on experience with observability tools (monitoring, alerting, logging, and tracing) to maintain reliability and operational excellence.
- Solid understanding of CI/CD concepts and practical experience with GitLab pipelines or similar tools for automated deployments and continuous delivery.
- Strong understanding of cloud architecture and DevOps best practices.
- Strong analytical thinking and collaborative problem-solving skills.
- Excellent communication and documentation abilities.
- Advanced Kubernetes expertise in managing large-scale, production Kubernetes clusters
- Infrastructure as Code (IaC) and automation using Terraform, Ansible, CloudFormation, and ARM templates
- Cloud platform proficiency (AWS, Azure)
- Expert-level knowledge of Linux administration, performance tuning, and troubleshooting
- Strong programming and automation skills in Go, Python, or Bash, coupled with hands-on experience in CI/CD pipelines
- Experience with monitoring and logging tools such as Prometheus and Grafana
- Certifications in Kubernetes, cloud platforms, or DevOps practices.
- Performance Optimization and Capacity Planning