
ML Ops Engineer
- Pune, Maharashtra
- Permanent
- Full-time
- Design, build, and maintain scalable, secure ML pipelines for model training, validation, deployment, and monitoring
- Automate deployment workflows using CI/CD pipelines and infrastructure-as-code tools
- Partner with infrastructure teams to manage Azure cloud-based ML infrastructure, ensuring compliance with InfoSec and AI policies
- Ensure applications run at peak efficiency
- Develop rigorous testing frameworks for ML models, including clinical validation, traditional model performance measures, population segmentation, and edge-case analysis
- Build monitoring systems to detect model drift, overfitting, data anomalies, and performance degradation in real time
- Continuously analyze model performance metrics and operational logs to identify improvement opportunities
- Translate monitoring insights into actionable recommendations for data scientists to improve model precision, recall, fairness, and efficiency
- Maintain detailed audit trails, logs, and metadata for all model versions, training datasets, and configurations to ensure full traceability and support internal audits
- Ensure models meet transparency and explainability standards using tools like SHAP, LIME, or integrated explainability APIs
- Collaborate with data scientists and clinical teams to ensure models are interpretable, actionable, and aligned with practical applications
- Support corporate Compliance and AI Governance policies
- Advocate for best practices in ML engineering, including reproducibility, version control, and ethical AI
- Develop product guides, model documentation, and model cards for internal and external stakeholders
- Bachelor’s Degree in Computer Science, Machine Learning, Data Science, or a related field
- 4+ years of experience in MLOps, DevOps, or ML engineering
- Proficiency in Python and ML frameworks such as Keras, PyTorch, Scikit-Learn, TensorFlow, and XGBoost
- Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD tools
- Familiarity with healthcare datasets and privacy regulations
- Strong analytical skills to interpret model performance data and identify optimization opportunities
- Proven ability to optimize application performance, including improving code efficiency, right-sizing infrastructure usage, and reducing system latency
- Experience implementing rollback strategies, including version control, rollback triggers, and safe deployment practices across lower and upper environments
- 3+ years of experience developing in a cloud environment (AWS, GCP, Azure)
- 3+ years of experience with GitHub, GitHub Actions, CI/CD, and source control
- 3+ years working within an Agile environment
- Experience with MLOps platforms like MLflow, TFX, or Kubeflow
- Healthcare experience, particularly using administrative and prior authorization data
- Proven experience with developing and deploying ML systems into production environments
- Experience working with Product, Engineering, Infrastructure, and Architecture teams
- Proficiency using Azure cloud-based services and infrastructure such as Azure MLOps
- Experience with feature flagging tools and strategies