
Site Reliability Engineer
- Bangalore, Karnataka
- Permanent
- Full-time
- Self-Hosted Infrastructure Ownership: Design, deploy, and maintain self-hosted systems and services, ensuring reliability, scalability, and security.
- Kubernetes & Container Orchestration: Architect, manage, and scale Kubernetes clusters in production and self-hosted environments.
- Infrastructure as Code (IaC): Write and maintain Terraform modules to provision infrastructure across AWS/GCP and on-prem setups.
- CI/CD Automation: Build and maintain reliable CI/CD pipelines using tools like Jenkins, GitLab CI, or ArgoCD to ensure fast and safe deployments.
- Monitoring & Observability: Set up and fine-tune observability tools like Grafana, Prometheus, and Graylog to monitor infrastructure, detect anomalies, and meet uptime SLAs.
- Scripting & Engineering: Write clean, modular automation scripts in Python, Bash, or Go to support operational needs and improve team productivity.
- System Reliability & Incident Response: Own on-call responsibilities, drive root cause analysis, and continuously improve incident handling and system resilience.
- Security & Compliance: Implement security and access controls within infrastructure, focusing on hardened self-hosted environments.
- 4-6 years of experience
- Proven experience managing self-hosted systems and internal tooling at scale
- Deep hands-on knowledge of Kubernetes, including Helm, Ingress, scaling, and custom operators
- Solid experience with Terraform for IaC; Ansible for configuration management is a plus
- Expertise in monitoring/logging stacks: Grafana, Prometheus, Graylog, ELK
- Hands-on experience with AWS, GCP, or Azure; strong understanding of hybrid cloud-native and on-prem setups
- Strong scripting experience in Python, Bash, or Go
- Proficiency with version control systems such as Git for managing code repositories and collaborating across development teams