
Engineering Manager SRE
- Bangalore, Karnataka
- Permanent
- Full-time
- Design, build, and maintain cloud infrastructure on AWS and/or GCP.
- Develop infrastructure as code using Terraform and configuration management tools like Ansible.
- Manage access controls and security configurations in the cloud.
- Implement and improve observability frameworks, leveraging VictoriaMetrics/Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash, Kibana) for monitoring, logging, and metrics.
- Deploy, manage, and scale Kubernetes clusters to ensure high availability and performance.
- Automate operational processes using Bash and Python scripts to enhance efficiency and reduce manual intervention.
- Troubleshoot, diagnose, and resolve complex system issues related to networking, operating systems, and distributed services.
- Collaborate with development and product teams to improve system reliability and release pipelines.
- Optimize performance and scalability of cloud environments to meet business requirements.
- Participate in on-call rotations to provide support for production systems.
- We are looking for a highly skilled SRE 3 professional with a strong background in both People Management and Project Management.
- The ideal candidate will have hands-on experience leading teams, mentoring engineers, and driving complex projects to successful completion.
- Coding Skills: Proficiency in Bash and Python for automation and scripting.
- Cloud Platforms: Hands-on experience with AWS and/or GCP.
- Infrastructure as Code (IaC): Strong knowledge of Terraform for automating infrastructure provisioning.
- Configuration Management: Experience with Ansible or equivalent tools.
- Observability: Hands-on experience with monitoring tools (VictoriaMetrics, Grafana) and logging systems (ELK Stack).
- Kubernetes: Practical experience deploying, managing, and troubleshooting Kubernetes clusters.
- Access Management: Strong understanding of AWS IAM for managing user permissions and security policies.
- Debugging & Troubleshooting: Strong problem-solving skills to debug and resolve complex system and network issues.
- Fundamentals: In-depth understanding of operating system concepts, networking protocols, and large-scale cloud infrastructure management.
- Experience: 9 to 13 years preferred in an SRE role, managing large-scale cloud infrastructure in a production environment.
- People & Project Management: Experience managing, mentoring, and growing a team of SREs and developers.
- Experience with additional tools like Prometheus, Loki, or other monitoring solutions.
- Familiarity with CI/CD pipelines and DevOps best practices.
- Certifications in AWS, GCP, or Kubernetes are a plus.
- Previous experience working in fast-paced environments with a focus on automation and reliability.