
Manager
- Gurgaon, Haryana
- Permanent
- Full-time
Location: Gurugram
Relevant Experience Required: 6+ years
Employment Type: Full-time

About the Role
We are seeking a Senior MLOps Engineer with deep expertise in Machine Learning Operations, Data Engineering, and Cloud-Native Deployments. This role requires building and maintaining scalable ML pipelines, ensuring robust data integration and orchestration, and enabling real-time and batch AI systems in production. The ideal candidate will be skilled in state-of-the-art MLOps tools, data clustering, big data frameworks, and DevOps best practices, ensuring high reliability, performance, and security for enterprise AI workloads.

Key Responsibilities

MLOps & Machine Learning Deployment
- Design, implement, and maintain end-to-end ML pipelines from experimentation to production.
- Automate model training, evaluation, versioning, deployment, and monitoring using MLOps frameworks.
- Implement CI/CD pipelines for ML models (GitHub Actions, GitLab CI, Jenkins, ArgoCD).
- Monitor ML systems in production for drift, bias, performance degradation, and anomalies.
- Integrate feature stores (Feast, Tecton, Vertex AI Feature Store) for standardized model inputs.
Data Engineering & Pipelines
- Design and implement data ingestion pipelines for structured, semi-structured, and unstructured data.
- Handle batch and streaming pipelines with Apache Kafka, Apache Spark, Apache Flink, Airflow, or Dagster.
- Build ETL/ELT pipelines for data preprocessing, cleaning, and transformation.
- Implement data clustering, partitioning, and sharding strategies for high availability and scalability.
- Work with data warehouses (Snowflake, BigQuery, Redshift) and data lakes (Delta Lake, Lakehouse architectures).
- Ensure data lineage, governance, and compliance with modern tools (DataHub, Amundsen, Great Expectations).
Cloud-Native Deployment & Infrastructure
- Deploy ML workloads on AWS, Azure, or GCP using Kubernetes (K8s) and serverless computing (AWS Lambda, GCP Cloud Run).
- Manage containerized ML environments with Docker, Helm, Kubeflow, MLflow, Metaflow.
- Optimize for cost, latency, and scalability across distributed environments.
- Implement infrastructure as code (IaC) with Terraform or Pulumi.
Real-Time & Generative AI Systems
- Build low-latency, real-time inference pipelines using gRPC, Triton Inference Server, or Ray Serve.
- Work on vector database integrations (Pinecone, Milvus, Weaviate, Chroma) for AI-powered semantic search.
- Enable retrieval-augmented generation (RAG) pipelines for LLMs.
- Optimize ML serving with GPU/TPU acceleration and ONNX/TensorRT model optimization.
Security, Monitoring & Reliability
- Implement robust access control, encryption, and compliance with SOC 2/GDPR/ISO 27001.
- Monitor system health with Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
- Ensure zero-downtime deployments with blue-green/canary release strategies.
- Manage audit trails and explainability for ML models.
Technical Skills
- Programming: Python (Pandas, PySpark, FastAPI), SQL, Bash; familiarity with Go or Scala a plus.
- MLOps Frameworks: MLflow, Kubeflow, Metaflow, TFX, BentoML, DVC.
- Data Engineering Tools: Apache Spark, Flink, Kafka, Airflow, Dagster, dbt.
- Databases: PostgreSQL, MySQL, MongoDB, Cassandra, DynamoDB.
- Vector Databases: Pinecone, Weaviate, Milvus, Chroma.
- Visualization: Plotly Dash, Superset, Grafana.
- Orchestration: Kubernetes, Helm, Argo Workflows, Prefect.
- Infrastructure as Code: Terraform, Pulumi, Ansible.
- Cloud Platforms: AWS (SageMaker, S3, EKS), GCP (Vertex AI, BigQuery, GKE), Azure (ML Studio, AKS).
- Model Optimization: ONNX, TensorRT, Hugging Face Optimum.
- Streaming & Real-Time ML: Kafka, Flink, Ray, Redis Streams.
- Monitoring & Logging: Prometheus, Grafana, ELK, OpenTelemetry.