Sr DevOps Engineer
HMH
- Pune, Maharashtra
- Permanent
- Full-time
- Cloud & Infrastructure: AWS EC2, Terraform Enterprise, Docker, Aurora, Mesos, Kubernetes, ELK (Elasticsearch, Logstash & Kibana).
- Observability & Automation: Grafana, Prometheus, Datadog, Telegraf, Runscope, Apollo, GraphQL.
- Development Stack: Microservices architecture, Spring, Java, Node.js, React, Express.js.
- Data & Storage: Amazon RDS, DynamoDB, Postgres, Oracle, MySQL, InfluxDB, Linux, Jenkins, GitHub.
- AI & Agentic Automation: AWS Bedrock LLMs and AWS Bedrock Engineer for building and integrating scalable, low-latency AI-driven automation capabilities.
- You can read more on our Engineering Blog.
- Identify and solve the most critical infrastructure challenges to improve system reliability, scalability, and performance.
- Design, test, and implement AI-enhanced DevOps workflows, including autonomous agents for monitoring, remediation, and optimization.
- Partner with SRE and development teams to build robust, self-service deployment pipelines and infrastructure tooling.
- Evaluate new technologies to continuously improve system automation, cost efficiency, and security.
- Work with AI-enhanced monitoring and self-healing infrastructure components powered by agentic patterns.
- Build, maintain, and evolve cloud infrastructure with Infrastructure as Code (Terraform, CloudFormation).
- Manage containerized workloads (Docker, Kubernetes) at scale, with a focus on extending capabilities through AI-driven orchestration.
- Implement and maintain advanced monitoring, observability, and alerting systems enhanced with agent-based analytics.
- Automate workflows to reduce manual intervention and accelerate delivery cycles.
- Collaborate with cross-functional teams to ensure infrastructure meets the needs of high-availability, low-latency applications.
- Regularly review and optimize existing architecture for cost, security, and performance improvements.
- 6 to 10 years of hands-on SRE/DevOps experience in an Agile environment.
- Proven ability to collaborate across engineering and operations, with pragmatic problem-solving.
- Deep experience with AWS and infrastructure design patterns, and in recommending appropriate AWS services, including newer AI-focused tools like Bedrock.
- Strong knowledge of, and hands-on skill with, AI-enhanced DevOps workflows and agentic infrastructure models.
- Able to diagnose and resolve outages with urgency, lead incident response, and restore service reliability.
- Infrastructure as Code expertise (Terraform, CloudFormation).
- Experience with containerization (Docker, Kubernetes).
- Familiarity with CI/CD tools, scripting languages, and observability platforms.
- Strong collaboration skills, with the ability to influence and guide best practices.
- Solid RDBMS experience (Postgres, MySQL, etc.), with tuning and performance expertise.
- Strong Linux fundamentals.
- Experience with event-driven systems and message queue management.
- Security, including firewalls, load balancing, and secrets management.