
Site Reliability Engineer
- Pune, Maharashtra
- Permanent
- Full-time
- Design, implement, and maintain highly available and fault-tolerant systems in a financial environment.
- Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure system reliability and customer satisfaction.
- Identify, measure, and reduce toil, proactively eliminating repetitive manual tasks through automation.
- Lead incident response, post-mortems, and root cause analysis for production issues.
- Collaborate with development teams to embed reliability into the software development lifecycle.
- Integrate systems and services with observability platforms (e.g., Prometheus, Grafana, ELK, Datadog) to ensure end-to-end visibility.
We value transparency, shared responsibility, and continuous learning. You'll work alongside talented engineers who are passionate about building reliable systems and solving complex problems.
Your expertise
- Proven expertise in Site Reliability Engineering, with a background in software engineering, infrastructure, or operations.
- Hands-on experience with cloud platforms (e.g., Azure), Linux operating systems (e.g., RHEL 7+), and networking fundamentals.
- Solid understanding of networking and storage technologies (e.g., NFS, SAN, NAS).
- Strong working knowledge of authentication and naming services (e.g., DNS, LDAP, Kerberos, Centrify).
- Proficiency in scripting and automation (e.g., Python, Go, Bash).
- Practical experience with infrastructure as code tools (e.g., Terraform, Ansible).
- Demonstrated ability to define and manage SLIs, SLOs, and SLAs, and to systematically reduce toil.
- Ability to integrate systems and services with observability platforms to ensure end-to-end visibility.
- A metrics- and automation-driven mindset, with a strong focus on measurable reliability.
- Calm under pressure, especially during incidents and outages, with a structured approach to incident response and post-mortems.
- Strong collaboration and communication skills, with the ability to work across engineering and business teams.
- A proactive, ownership-driven attitude, always seeking opportunities to improve systems and processes.
- Experience with chaos engineering, resilience testing, or disaster recovery planning.
- Familiarity with financial transaction systems, real-time data pipelines, or core banking platforms.
- An understanding of CI/CD pipelines, containerization, and Kubernetes orchestration (e.g., AKS).