
Senior DevOps Engineer - AI/ML Infrastructure (7-10 Years)
- Bangalore, Karnataka
- Permanent
- Full-time

Key Responsibilities
- Design and implement CI/CD pipelines for AI applications including model deployment and agent workflows
- Build and maintain Kubernetes clusters optimized for AI workloads including GPU resource management
- Implement comprehensive monitoring and observability for AI systems including custom metrics for model performance
- Develop infrastructure-as-code solutions for scalable AI service deployments
- Establish reliability engineering practices including SLA management and incident response for AI systems
- Optimize cloud infrastructure costs with a focus on GPU utilization and LLM API usage
- Implement security and compliance frameworks for AI applications and data pipelines
- Collaborate with development teams to ensure production readiness of AI agents and RAG systems
- Manage multi-cloud deployments and vendor integrations for AI services

Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or related technical field
- 7-10 years of DevOps/Infrastructure experience with demonstrated production system ownership
- Strong expertise in Kubernetes orchestration and container management (Docker)
- Proficient in Python scripting and automation
- Extensive experience with Linux system administration and performance tuning
- Hands-on experience with Jenkins or similar CI/CD platforms
- Production experience with cloud platforms (AWS, GCP, or Azure)
- Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, or similar)
- Experience deploying and managing AI/ML workloads in production environments
- Understanding of RAG system infrastructure requirements and vector database operations
- Knowledge of LLM API integration patterns and rate limiting strategies
- Experience with GPU cluster management and resource optimization
- Familiarity with AI agent workflows and their operational characteristics
- Production monitoring and alerting experience with tools like Prometheus, Grafana, or DataDog
- Incident response and post-mortem experience with complex distributed systems
- Capacity planning and performance optimization for high-traffic applications
- Experience with log aggregation and distributed tracing systems
- Understanding of reliability patterns including circuit breakers and graceful degradation
- Experience with MLOps practices and model deployment pipelines
- Knowledge of AI-specific monitoring including model drift detection and performance metrics
- Experience with cost optimization strategies for AI workloads

Preferred Qualifications
- Background in financial services, gaming, or other high-availability environments
- Certification in major cloud platforms (AWS Solutions Architect, GCP Professional, etc.)
- Experience with service mesh technologies (Istio, Linkerd)

Technology Environment
- Multi-cloud infrastructure with primary focus on AWS/GCP
- Kubernetes-based container orchestration
- Modern observability stack with custom AI metrics
- GitOps workflows and infrastructure automation
- Integration with enterprise security and compliance frameworks