
Senior Site Reliability Engineer
- Bangalore, Karnataka
- Permanent
- Full-time
- Design, implement, and maintain highly available and scalable infrastructure systems, ensuring maximum uptime and performance.
- Collaborate with software engineering teams to build and deploy applications using best practices in reliability, scalability, and security.
- Develop and implement automation tools and frameworks to streamline operational processes, reduce manual intervention, and improve efficiency.
- Monitor and analyse system performance, identify bottlenecks, and implement solutions to optimize performance and scalability.
- Implement and maintain effective monitoring, alerting, and logging systems to proactively identify and resolve issues before they impact users.
- Hands-on experience building automated CI/CD pipelines using GitHub Actions, Jenkins, GitLab, or an equivalent platform.
- Excellent at automating workflows and solutions using Python, Go, or Shell.
- Lead incident response and root cause analysis efforts, driving continuous improvement and preventing future incidents.
- Collaborate with cross-functional teams to define and enforce best practices, standards, and guidelines for system reliability and performance.
- Participate in on-call rotations and respond to incidents, ensuring timely resolution and minimal impact to users, thereby meeting SLAs.
- Plan and devise Disaster Recovery (DR) strategies and implement DR Plans.
- Mentor and provide guidance to junior team members, fostering a culture of learning and growth.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Build software and systems to manage platform infrastructure and applications.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
- Proven experience as a Site Reliability Engineer or similar role, with a focus on designing and maintaining highly available and scalable systems.
- Strong programming and scripting skills (Python, Bash, etc.) to automate operational tasks and develop tooling.
- Experience with cloud platforms (AWS) and containerization technologies (Docker, EKS).
- Proficient in configuration management tools like Ansible and infrastructure-as-code frameworks such as Terraform and CloudFormation.
- Experience with monitoring and logging tools (Prometheus, Grafana, Loki, Sentry.io, CloudWatch, etc.) for proactive system monitoring and troubleshooting.
- Ability to program (structured and OOP) in one or more high-level languages, such as Java and JavaScript.
- Solid understanding of networking principles, protocols, and security best practices.
- Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.
- Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams.
- Experience with distributed storage technologies such as NFS and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN).
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
- Experience with Agile methodologies.
- Strong skills in software design and design patterns.
- Experience with architecture patterns such as client-server and serverless computing.
- Effective written, verbal, and presentation skills, with the ability to clearly articulate ideas and concepts.
- Self-directed and able to direct others.
- Experience setting up performance and load test environments.
- Familiarity with SOC 2 audit processes.
- BE/B.Tech/M.Tech/MCA/MSc in Computer Science Engineering.
- 7 to 11 years of experience in software application development, CloudOps, or SRE.