
Senior DevOps Engineer - AI/ML Infrastructure (7-10 Years)
- Bangalore, Karnataka
- Permanent
- Full-time

Key Responsibilities
- Design and implement CI/CD pipelines for AI applications including model deployment and agent workflows
- Build and maintain Kubernetes clusters optimized for AI workloads including GPU resource management
- Implement comprehensive monitoring and observability for AI systems including custom metrics for model performance
- Develop infrastructure-as-code solutions for scalable AI service deployments
- Establish reliability engineering practices including SLA management and incident response for AI systems
- Optimize cloud infrastructure costs with a focus on GPU utilization and LLM API usage
- Implement security and compliance frameworks for AI applications and data pipelines
- Collaborate with development teams to ensure production readiness of AI agents and RAG systems
- Manage multi-cloud deployments and vendor integrations for AI services

Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or related technical field
- 7-10 years of DevOps/Infrastructure experience with demonstrated production system ownership
- Strong expertise in Kubernetes orchestration and container management (Docker)
- Proficient in Python scripting and automation
- Extensive experience with Linux system administration and performance tuning
- Hands-on experience with Jenkins or similar CI/CD platforms
- Production experience with cloud platforms (AWS, GCP, or Azure)
- Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, or similar)
- Experience deploying and managing AI/ML workloads in production environments
- Understanding of RAG system infrastructure requirements and vector database operations
- Knowledge of LLM API integration patterns and rate limiting strategies
- Experience with GPU cluster management and resource optimization
- Familiarity with AI agent workflows and their operational characteristics
- Production monitoring and alerting experience with tools like Prometheus, Grafana, or DataDog
- Incident response and post-mortem experience with complex distributed systems
- Capacity planning and performance optimization for high-traffic applications
- Experience with log aggregation and distributed tracing systems
- Understanding of reliability patterns including circuit breakers and graceful degradation
- Experience with MLOps practices and model deployment pipelines
- Knowledge of AI-specific monitoring including model drift detection and performance metrics
- Experience with cost optimization strategies for AI workloads

Preferred Qualifications
- Background in financial services, gaming, or other high-availability environments
- Certification in major cloud platforms (AWS Solutions Architect, GCP Professional, etc.)
- Experience with service mesh technologies (Istio, Linkerd)

Technology Environment
- Multi-cloud infrastructure with primary focus on AWS/GCP
- Kubernetes-based container orchestration
- Modern observability stack with custom AI metrics
- GitOps workflows and infrastructure automation
- Integration with enterprise security and compliance frameworks