Sr Engineer, Site Reliability

TMUS Global Solutions

Hyderabad, Telangana
Permanent
Full-time

About TMUS Global Solutions

T-Mobile is America's supercharged Un-carrier, challenging conventions and setting new standards in wireless. With the nation's largest and fastest 5G network, T-Mobile delivers advanced connectivity and unmatched value to millions across the U.S. We're unwaveringly obsessed with providing the best possible service experience, driven by a spirit of disruption that fuels competition and innovation in wireless and beyond.

Job Description

About the Role:

The Senior Site Reliability Engineer for Container Platforms at T-Mobile plays a crucial role in shaping and maintaining the foundational infrastructure that powers our next-generation platforms and services. They design, implement, and manage large-scale Kubernetes clusters and related automation systems that ensure high availability, scalability, and reliability across T-Mobile's technology ecosystem. They also use their strong problem-solving and analytical skills to automate processes, reducing manual effort and preventing operational incidents. Their expertise in Kubernetes, scripting languages, incident response management, and various tech tools contributes to the robustness and efficiency of our systems. By continuously learning new skills and technologies, they adapt to changing circumstances and drive innovation. Their work and expertise contribute significantly to the stability and performance of T-Mobile's digital infrastructure. They are also responsible for diagnosing and resolving complex issues across networking, storage, and compute layers, driving continuous improvement through data-driven insights and DevOps best practices.

This engineer is also responsible for contributing to the overall architecture and strategy of technical systems, mentoring junior engineers, and ensuring solutions are aligned with T-Mobile's business and technical goals.

What You'll Do:

Design, build, and maintain large-scale, production-grade Kubernetes (K8s) clusters to ensure high availability, scalability, and security across T-Mobile's hybrid infrastructure.
Develop and manage Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, and ARM templates, enabling consistent and automated infrastructure deployment across AWS, Azure, and on-premises data centers.
Resolve platform-related customer tickets by diagnosing and addressing infrastructure, deployment, and performance issues to ensure reliability and seamless user experience.
Lead incident response, root cause analysis (RCA), and post-mortems, implementing automation to prevent recurrence.
Implement and optimize CI/CD pipelines leveraging GitLab, Argo, and Flux to support seamless software delivery, continuous integration, and progressive deployment strategies.
Automate system operations through scripting and development in Go, Bash, and Python, driving efficiency, repeatability, and reduced operational overhead.
Monitor, analyze, and enhance system performance, proactively identifying bottlenecks and ensuring reliability through data-driven observability and capacity planning.
Troubleshoot complex issues across the full stack (network, storage, compute, and application layers) using advanced tools like pcap, telnet, and Linux-native diagnostics.
Apply deep Kubernetes expertise to diagnose, resolve, and prevent infrastructure-related incidents while mentoring team members on container orchestration best practices.
Drive a culture of automation, resilience, and continuous improvement, contributing to the evolution of T-Mobile's platform engineering and cloud infrastructure strategies.
Drive innovation by recommending new technologies, frameworks, and tools.
Perform additional duties and strategic projects as assigned.

What You'll Bring:

Bachelor's degree in Computer Science, Software Engineering, or a related field.
5 to 9 years of hands-on experience in Site Reliability Engineering roles supporting large-scale, production-grade systems.
Extensive hands-on experience with Kubernetes (K8s), including cluster provisioning, scaling, upgrades, and performance tuning in both on-premises and multi-cloud environments.
Proficiency in scripting and programming with Go, Bash, or Python to automate operational tasks and develop scalable infrastructure solutions.
Hands-on experience with observability tools (monitoring, alerting, logging, and tracing) to maintain reliability and operational excellence.
Solid understanding of CI/CD concepts and practical experience with GitLab pipelines or similar tools for automated deployments and continuous delivery.
Strong understanding of cloud architecture and DevOps best practices.
Strong analytical thinking and collaborative problem-solving skills.
Excellent communication and documentation abilities.

Must Have Skills:

Advanced Kubernetes expertise in managing large-scale, production Kubernetes clusters
Infrastructure as Code (IaC) and automation using Terraform, Ansible, CloudFormation, and ARM templates
Cloud platform proficiency (AWS, Azure)
Expert-level knowledge of Linux administration, performance tuning, and troubleshooting
Strong programming and automation skills in Go, Python, or Bash, coupled with hands-on experience in CI/CD pipelines
Experience with monitoring and logging tools like Prometheus, Grafana

Nice To Have:

Certifications in Kubernetes, cloud platforms, or DevOps practices.
Performance Optimization and Capacity Planning

TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. That means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.

TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate's acceptance of a formal offer.