SRE Engineer - Lead I - DevOps Engineering
UST
- Pune, Maharashtra
- Permanent
- Full-time
- Manage system uptime and reliability across cloud‑native (AWS, GCP) and hybrid architectures
- Design and implement Infrastructure as Code (IaC) solutions that meet security and engineering standards using tools such as Terraform, cloud CLIs, and cloud SDKs
- Build and maintain CI/CD pipelines for application and infrastructure deployment using tools like Jenkins and cloud‑native toolchains
- Develop automated tooling to deploy production changes and manage service requests effectively
- Create and maintain comprehensive runbooks to detect, remediate, and restore services
- Troubleshoot and triage complex issues in distributed systems, including participation in on‑call rotations for high‑severity incidents
- Continuously improve runbooks and operational processes to reduce Mean Time to Recovery (MTTR)
- Lead blameless postmortems for availability incidents and own remediation actions to prevent recurrence
- DevSecOps
- Operational Excellence
- Systems Thinking
- Troubleshooting
- Technical Communication and Presentation
- Bachelor’s degree in Computer Science or a related technical field involving coding (or equivalent practical experience)
- 5–7 years of experience across software engineering, systems administration, database administration, or networking
- At least 2 years of experience developing or administering systems on public cloud platforms
- Experience monitoring infrastructure and application availability to meet performance and reliability objectives
- Proficiency in one or more programming/scripting languages such as Python, Bash, Java, Go, or JavaScript (Node.js)
- Strong cross‑functional understanding of systems, networking, storage, security, and databases
- System administration and automation experience using tools such as Terraform, Chef, and Ansible, as well as container technologies (Docker, Kubernetes)
- Strong experience with CI/CD tools and practices
- Cloud certifications are strongly preferred
- Applies DevSecOps principles to improve system resilience and service reliability
- Designs, codes, tests, documents, and supports complex scripts and integrated services
- Contributes to selecting development tools, methods, and SRE standards
- Leads code reviews and participates in peer reviews to ensure quality and reliability
- Develops and executes work plans for moderate‑complexity assignments
- Continuously monitors system metrics to ensure availability and performance
- Proactively improves processes to enhance efficiency, reliability, and scalability
- Applies best practices to understand how systems interact and impact reliability
- Maintains awareness of technology trends to improve system availability and performance
- Mentors less experienced team members through architectural and operational insights
- Clearly communicates complex technical concepts and operational impacts to stakeholders
- Demonstrates strong written and verbal communication skills tailored to diverse audiences
- Collaborates effectively across teams to resolve conflicts and achieve shared goals
- Uses a structured approach to diagnose and resolve system and service issues
- Coordinates investigation and implementation of corrective actions
- Analyzes trends and recurring issues to drive long‑term preventive solutions