
Machine Learning Engineer - ML Ops
- Pune, Maharashtra
- Permanent
- Full-time
- Monitor, troubleshoot, and maintain ML pipelines and services in production, ensuring high availability and minimal downtime.
- Work closely with Data Scientists and Engineers to operationalize ML/LLM models, from development through deployment.
- Build and maintain observability tools for tracking data quality, model performance, drift detection, and inference metrics.
- Support LLM and Agentic AI features in production, focusing on stability, optimization, and seamless integration into the platform.
- Develop and enhance internal ML tooling for faster experimentation, deployment, and feature integration.
- Collaborate with Product teams to roll out new ML-driven features and improve existing ones.
- Work with DevOps to improve CI/CD workflows for ML code, data pipelines, and models.
- Optimize resource usage and costs for large-scale model hosting and inference.
- Document workflows, troubleshooting guides, and best practices for ML systems support.
- B.E./
- Familiarity with containerized microservices (Docker, Kubernetes) and CI/CD pipelines.
- Experience monitoring ML systems using tools like Prometheus, Grafana, ELK, Sentry, or equivalent.
- Understanding of model packaging and serving frameworks (FastAPI, TorchServe, Triton Inference Server, Hugging Face Inference API).
- Strong collaboration skills with cross-functional teams.
- Exposure to LLM operations (prompt engineering, fine-tuning, inference optimization).
- Familiarity with Agentic AI workflows and multi-step orchestration (LangChain, LlamaIndex).
- Experience with data versioning (DVC, Delta Lake) and experiment tracking (MLflow, Weights & Biases).
- Knowledge of vector databases (Pinecone, Weaviate, FAISS).
- Experience with streaming data (Kafka) and caching (Redis).
- Skills in cost optimization for GPU workloads.
- Basic understanding of system design for large-scale AI infrastructure.