
Manager
- Gurgaon, Haryana
- Permanent
- Full-time
Location: Gurugram
Relevant Experience Required: 6+ years
Employment Type: Full-time

About the Role
We are seeking a Senior MLOps Engineer with deep expertise in Machine Learning Operations, Data Engineering, and Cloud-Native Deployments. This role requires building and maintaining scalable ML pipelines, ensuring robust data integration and orchestration, and enabling real-time and batch AI systems in production. The ideal candidate will be skilled in state-of-the-art MLOps tools, data clustering, big data frameworks, and DevOps best practices, ensuring high reliability, performance, and security for enterprise AI workloads.

Key Responsibilities

MLOps & Machine Learning Deployment
- Design, implement, and maintain end-to-end ML pipelines from experimentation to production.
- Automate model training, evaluation, versioning, deployment, and monitoring using MLOps frameworks.
- Implement CI/CD pipelines for ML models (GitHub Actions, GitLab CI, Jenkins, ArgoCD).
- Monitor ML systems in production for drift, bias, performance degradation, and anomalies.
- Integrate feature stores (Feast, Tecton, Vertex AI Feature Store) for standardized model inputs.
Data Engineering & Pipelines
- Design and implement data ingestion pipelines for structured, semi-structured, and unstructured data.
- Handle batch and streaming pipelines with Apache Kafka, Apache Spark, Apache Flink, Airflow, or Dagster.
- Build ETL/ELT pipelines for data preprocessing, cleaning, and transformation.
- Implement data clustering, partitioning, and sharding strategies for high availability and scalability.
- Work with data warehouses (Snowflake, BigQuery, Redshift) and data lakes (Delta Lake, Lakehouse architectures).
- Ensure data lineage, governance, and compliance with modern tools (DataHub, Amundsen, Great Expectations).
Cloud-Native Deployment & Infrastructure
- Deploy ML workloads on AWS, Azure, or GCP using Kubernetes (K8s) and serverless computing (AWS Lambda, GCP Cloud Run).
- Manage containerized ML environments with Docker, Helm, Kubeflow, MLflow, Metaflow.
- Optimize for cost, latency, and scalability across distributed environments.
- Implement infrastructure as code (IaC) with Terraform or Pulumi.
Real-Time & Generative AI Systems
- Build low-latency, real-time inference pipelines using gRPC, Triton Inference Server, or Ray Serve.
- Work on vector database integrations (Pinecone, Milvus, Weaviate, Chroma) for AI-powered semantic search.
- Enable retrieval-augmented generation (RAG) pipelines for LLMs.
- Optimize ML serving with GPU/TPU acceleration and ONNX/TensorRT model optimization.
Security, Monitoring & Reliability
- Implement robust access control, encryption, and compliance with SOC 2/GDPR/ISO 27001.
- Monitor system health with Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
- Ensure zero-downtime deployments with blue-green/canary release strategies.
- Manage audit trails and explainability for ML models.
Technical Skills
- Programming: Python (Pandas, PySpark, FastAPI), SQL, Bash; familiarity with Go or Scala a plus.
- MLOps Frameworks: MLflow, Kubeflow, Metaflow, TFX, BentoML, DVC.
- Data Engineering Tools: Apache Spark, Flink, Kafka, Airflow, Dagster, dbt.
- Databases: PostgreSQL, MySQL, MongoDB, Cassandra, DynamoDB.
- Vector Databases: Pinecone, Weaviate, Milvus, Chroma.
- Visualization: Plotly Dash, Superset, Grafana.
- Orchestration: Kubernetes, Helm, Argo Workflows, Prefect.
- Infrastructure as Code: Terraform, Pulumi, Ansible.
- Cloud Platforms: AWS (SageMaker, S3, EKS), GCP (Vertex AI, BigQuery, GKE), Azure (ML Studio, AKS).
- Model Optimization: ONNX, TensorRT, Hugging Face Optimum.
- Streaming & Real-Time ML: Kafka, Flink, Ray, Redis Streams.
- Monitoring & Logging: Prometheus, Grafana, ELK, OpenTelemetry.