
Site Reliability Engineer II (Python, Pandas, GenAI)
- Bangalore, Karnataka
- Permanent
- Full-time
- Design, deploy, and maintain highly available RAG pipelines including vector databases, embedding services, and LLM inference infrastructure
- Ensure reliable operation of agentic AI systems including multi-agent orchestration platforms, tool integration frameworks, and decision-making workflows
- Implement comprehensive monitoring and observability for AI model performance, token usage, latency, and accuracy metrics
- Lead incident response for AI system outages, including model degradation, vector search failures, and agent execution issues
- Optimize and maintain vector database infrastructure (Pinecone, Weaviate, Chroma, or similar) for high-performance similarity search at scale
- Manage embedding model deployments and ensure consistent document ingestion pipelines with proper chunking and preprocessing
- Implement retrieval quality monitoring, including relevance scoring and context window optimization
- Design and maintain hybrid search systems combining vector and traditional search methodologies
- Build and maintain infrastructure for autonomous agent systems including planning, reasoning, and tool execution frameworks
- Implement robust error handling and fallback mechanisms for agent decision chains and multi-step workflows
- Monitor and optimize agent performance metrics including success rates, execution time, and resource utilization
- Ensure secure and reliable integration between agents and external APIs, databases, and services
- Develop Infrastructure as Code solutions for AI/ML workloads including GPU clusters, model serving infrastructure, and data pipelines
- Build automated deployment pipelines for LLM fine-tuning, RAG system updates, and agent workflow modifications
- Implement A/B testing frameworks for AI system improvements and model version management
- Design capacity planning and auto-scaling solutions for variable AI workloads and inference demands
- 5+ years of SRE/DevOps experience with 2+ years specifically focused on AI/ML production systems
- Deep hands-on experience with RAG architecture implementation including vector databases, embedding models, and retrieval systems
- Proven experience with agentic AI frameworks (LangChain, LlamaIndex, AutoGPT, CrewAI, or similar) and multi-agent orchestration
- Strong understanding of LLM deployment and optimization including model serving frameworks (vLLM, TensorRT-LLM, Triton) and GPU infrastructure management
- Proficiency with vector database technologies (PgVector, Pinecone, Weaviate, Qdrant, Chroma, Milvus) and their operational requirements
- Experience with embedding models (OpenAI, Sentence Transformers, Cohere) and semantic search optimization
- Knowledge of hybrid search implementations combining vector, keyword, and graph-based retrieval methods
- Understanding of chunking strategies, document preprocessing, and knowledge graph integration
- Experience implementing AI-specific monitoring including model drift detection, hallucination tracking, and response quality metrics
- Proficiency with MLOps tools (MLflow, Weights & Biases, Neptune) and experiment tracking systems
- Knowledge of AI system debugging including prompt tracing, agent execution visualization, and performance bottleneck identification
- Understanding of AI safety monitoring including content filtering, bias detection, and usage pattern analysis
- Proficiency with cloud AI services (AWS SageMaker, Google Vertex AI, Azure ML) and their operational aspects
- Advanced Kubernetes experience including GPU scheduling, resource quotas, and AI workload optimization
- Experience with container technologies optimized for ML workloads and model serving
- Proficiency in Python with a deep understanding of AI/ML libraries (transformers, langchain, llamaindex, torch, numpy)
- Experience with Infrastructure as Code tools (Terraform, Helm) specifically for AI infrastructure provisioning
- Strong API design and integration skills for AI service orchestration and tool integration
- Knowledge of streaming and async processing for real-time AI applications
- Production experience with document ingestion pipelines, chunking strategies, and metadata management
- Understanding of retrieval quality optimization including re-ranking, query expansion, and context selection
- Experience with multi-modal RAG systems incorporating text, images, and structured data
- Knowledge of RAG evaluation frameworks and automated quality assessment
- Hands-on experience with agent planning algorithms, tool selection mechanisms, and execution engines
- Understanding of multi-agent coordination, communication protocols, and distributed agent systems
- Experience with agent memory systems, state management, and long-running workflow orchestration
- Knowledge of agent safety mechanisms including execution sandboxing and output validation
- Experience with fine-tuning and RLHF (Reinforcement Learning from Human Feedback) infrastructure
- Knowledge of edge AI deployment and model optimization techniques
- Familiarity with AI governance, compliance frameworks, and ethical AI implementation
- Experience with conversational AI platforms and dialogue management systems
- Understanding of knowledge graphs and symbolic reasoning integration with neural systems
- Competitive base salaries
- Bonus incentives
- Support for financial well-being and retirement
- Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
- Flexible working model with hybrid, onsite, or virtual arrangements depending on role and business need
- Generous paid parental leave policies (depending on your location)
- Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
- Free and confidential counseling support through our Healthy Minds program
- Career development and training opportunities