SRE Engineer--Lead I - DevOps Engineering

UST

  • Pune, Maharashtra
  • Permanent
  • Full-time
  • 8 days ago
Job Description: Site Reliability Engineer (SRE)
Role Summary
The Site Reliability Engineer (SRE) role combines software engineering and systems engineering to build, operate, and support large‑scale, distributed, fault‑tolerant systems. This position focuses on ensuring high availability, performance, security, and reliability across cloud‑native and hybrid environments through automation, observability, and operational excellence.
Key Responsibilities
  • Manage system uptime and reliability across cloud‑native (AWS, GCP) and hybrid architectures
  • Design and implement Infrastructure as Code (IaC) solutions that meet security and engineering standards using tools such as Terraform, cloud CLIs, and cloud SDKs
  • Build and maintain CI/CD pipelines for application and infrastructure deployment using tools like Jenkins and cloud‑native toolchains
  • Develop automated tooling to deploy production changes and manage service requests effectively
  • Create and maintain comprehensive runbooks to detect, remediate, and restore services
  • Troubleshoot and triage complex issues in distributed systems, including participation in on‑call rotations for high‑severity incidents
  • Continuously improve runbooks and operational processes to reduce Mean Time to Recovery (MTTR)
  • Lead blameless postmortems for availability incidents and own remediation actions to prevent recurrence
Key Skills to Develop
  • DevSecOps
  • Operational Excellence
  • Systems Thinking
  • Troubleshooting
  • Technical Communication and Presentation
Required Experience & Qualifications
  • Bachelor’s degree in Computer Science or a related technical field involving coding (or equivalent practical experience)
  • 5–7 years of experience across software engineering, systems administration, database administration, or networking
  • At least 2 years of experience developing or administering systems on public cloud platforms
  • Experience monitoring infrastructure and application availability to meet performance and reliability objectives
  • Proficiency in one or more programming/scripting languages such as Python, Bash, Java, Go, or JavaScript (Node.js)
  • Strong cross‑functional understanding of systems, networking, storage, security, and databases
  • System administration and automation experience using tools such as Terraform, Chef, and Ansible, and container technologies (Docker, Kubernetes)
  • Strong experience with CI/CD tools and practices
  • Cloud certifications are strongly preferred
What Could Set You Apart
DevSecOps
  • Applies DevSecOps principles to improve system resilience and service reliability
  • Designs, codes, tests, documents, and supports complex scripts and integrated services
  • Contributes to selecting development tools, methods, and SRE standards
  • Leads code reviews and participates in peer reviews to ensure quality and reliability
Operational Excellence
  • Develops and executes work plans for moderate‑complexity assignments
  • Continuously monitors system metrics to ensure availability and performance
  • Proactively improves processes to enhance efficiency, reliability, and scalability
Systems Thinking
  • Applies best practices to understand how systems interact and impact reliability
  • Maintains awareness of technology trends to improve system availability and performance
  • Mentors less experienced team members through architectural and operational insights
Technical Communication & Presentation
  • Clearly communicates complex technical concepts and operational impacts to stakeholders
  • Demonstrates strong written and verbal communication skills tailored to diverse audiences
  • Collaborates effectively across teams to resolve conflicts and achieve shared goals
Troubleshooting
  • Uses a structured approach to diagnose and resolve system and service issues
  • Coordinates investigation and implementation of corrective actions
  • Analyzes trends and recurring issues to drive long‑term preventive solutions
Skills: Terraform, AWS, CI/CD, Jenkins
About Company
UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact, touching billions of lives in the process.
