
Senior Site Reliability Engineer
- Pune, Maharashtra
- Permanent
- Full-time
- Ensure the reliability, performance, and scalability of large-scale, cloud-based applications and infrastructure.
- Create automated solutions to improve the site's operations.
- Ensure that applications and websites run smoothly and efficiently.
- Detect issues and automatically manage failures to keep systems up and running.
- Work with software developers, engineers, and operations teams to improve system performance.
- Analyse incidents to prevent future disruptions.
- A bachelor's degree in computer science, engineering, or a related field, or equivalent work experience.
- Relevant certifications (e.g., AWS or Azure cloud engineering, fundamentals, DevOps, or architect certifications) are beneficial.
- Knowledge of networking concepts, protocols, and tools, plus a willingness to learn new technologies and adapt to changing environments.
- Skilled in managing configuration, deployments, and observability; handling and resolving incidents, including root cause analysis; and operating complex systems for scalability, availability, and performance.
- Strong communication and collaboration skills for working effectively with development and operations teams.
- Proficient in TypeScript, C#, and Python; comfortable working across platforms.
- Skilled in writing secure, stable, testable, and maintainable code.
- Familiar with systems design principles.
- 5+ years of software development experience, ideally in platform or service engineering.
- Familiar with software engineering best practices across the full SDLC: coding standards, code reviews, source control, CI/CD, testing, and operations.
- Experience supporting and operating production systems, with exposure to monitoring, logging, alerting, and basic security practices.
- Strong knowledge of Linux/Unix systems, including system configuration, networking, and debugging.
- Expert in building and scaling infrastructure services using Amazon Web Services or Microsoft Azure.
- Skilled with infrastructure-as-code tools (e.g., Ansible, Puppet, Chef, Terraform), monitoring tools (e.g., Prometheus, Grafana), and logging systems (e.g., the ELK stack).
- Skilled in using core cloud application infrastructure services, including identity platforms, networking, storage, databases, containers, and serverless.
- Strong knowledge of databases (relational, graph, document, and key-value), including performance tuning and improvement.