
Senior Site Reliability Engineer
- Bangalore, Karnataka
- Permanent
- Full-time
- Maintain our production infrastructure hosted on AWS via code.
- Create pipelines to deploy and manage global infrastructure.
- Analyze complex system behavior, performance and application issues.
- Develop observability, alerts and runbooks.
- Develop, maintain and administer modern infrastructure deployment tools.
- Linux systems administration, configuration, troubleshooting and automation.
- Capacity analysis and planning, traffic routing, and security policies for Ping's market-leading Single Sign-On SaaS applications.
- This position is part of an on-call rotation covering 8 hours a day, 7 days a week.
- 5-13 years of experience in Software Engineering, focusing on Site Reliability Engineering (SRE) or DevOps principles.
- 3-8 years of hands-on experience designing, deploying, and managing complex systems on Amazon Web Services (AWS).
- Expert-level proficiency in provisioning and managing public cloud infrastructure using Infrastructure as Code (IaC) frameworks such as AWS CloudFormation and Terraform.
- Proven ability to develop, test, and maintain robust automation scripts and tools that improve operational efficiency and reliability, including solid experience with Python scripting.
- Extensive hands-on experience with Containerization (Docker) and Container Orchestration (Kubernetes), including deployment, scaling, and troubleshooting of containerized applications.
- Proficiency in designing and implementing server configuration management using tools like Puppet, Chef, or SaltStack, with a focus on idempotent and declarative configurations.
- Strong experience with CI/CD pipeline design and implementation using tools such as GitLab CI/CD, Argo CD, Jenkins, or similar, promoting automated testing, deployment, and release strategies.
- In-depth knowledge of relational databases (e.g., PostgreSQL, MySQL).
- Solid understanding and practical application of Site Reliability Engineering (SRE) principles including SLOs, SLIs, error budgets, post-mortems, and incident response.
- Demonstrated experience in a high-volume, mission-critical production service environment, with a strong focus on system resilience, fault tolerance, and disaster recovery.
- Knowledge of observability tooling such as New Relic, Grafana, and Amazon CloudWatch.
- Knowledge of Cassandra.
- Experience with distributed data systems and their unique challenges in a cloud environment.
- Experience with security design principles and best practices for building secure, scalable, and resilient cloud-native applications.
- A company culture that empowers you to do your best work.
- Employee Resource Groups that create a sense of belonging for everyone.
- Regular company and team bonding events.
- Competitive benefits and perks.
- Global volunteering and community initiatives
- Generous PTO & Holiday Schedule
- Parental Leave
- Progressive Healthcare Options
- Retirement Programs
- Opportunity for Education Reimbursement
- Commuter Offset (Specific locations)