Infrastructure Lead (DevOps & Cloud)
Weekday AI
- Mumbai, Maharashtra
- Permanent
- Full-time
- Design, develop, and maintain scalable cloud infrastructure on AWS and Azure platforms.
- Lead architectural decisions to ensure high availability, fault tolerance, and optimal performance.
- Promote infrastructure automation through Infrastructure as Code (Terraform).
- Develop and enhance CI/CD pipelines using tools such as Jenkins, GitLab CI, CircleCI, and ArgoCD.
- Adopt GitOps methodologies for consistent and dependable deployments.
- Increase deployment frequency, shorten lead times, and reduce failure rates.
- Oversee and scale Kubernetes clusters across EKS, AKS, and on-premises environments.
- Implement container orchestration, service mesh solutions, and cluster optimization techniques.
- Ensure platform reliability and conduct performance tuning.
- Establish and uphold SLOs, SLAs, and reliability benchmarks.
- Deploy observability tools such as Prometheus, Grafana, Datadog, and the ELK stack.
- Lead incident management, including root cause analysis, and reduce mean time to recovery (MTTR).
- Promote automation across infrastructure provisioning, monitoring, and recovery workflows.
- Create reusable infrastructure modules and accelerators.
- Minimize manual tasks through scripting using Python and Bash, along with supporting tools.
- Apply cloud security best practices involving IAM, network security, and policy enforcement.
- Maintain compliance via Kubernetes policies and governance frameworks.
- Champion secure-by-design principles in infrastructure development.
- Monitor cloud resource consumption and implement cost-saving strategies.
- Apply right-sizing, auto-scaling, and other resource-efficiency methods.
- Lead and mentor DevOps and SRE teams.
- Collaborate effectively with engineering, product, and architecture teams.
- Promote infrastructure best practices across various projects and teams.
- Explore AI and machine learning-driven infrastructure enhancements and AIOps capabilities.
- Implement intelligent monitoring, anomaly detection, and automated root cause analysis.
- At least 8 years of experience in Infrastructure, DevOps, or SRE roles.
- Strong cloud platform expertise (AWS preferred).
- Hands-on experience with Terraform (Infrastructure as Code).
- Comprehensive knowledge of Kubernetes and containerization (Docker).
- Experience working with CI/CD tools such as Jenkins, GitLab CI, CircleCI, and ArgoCD.
- Strong understanding of monitoring and observability tools.
- Proficient in scripting languages including Python and Bash.
- Experience managing high-availability, large-scale systems.