The purpose of this role is to lead the collaboration with ML Engineers and DevOps Engineers to formulate AI designs that can be built, tested, and deployed through the Route to Live and into Production using continuous integration/deployment.

Job Description:

Model Development & Deployment
- Model fine-tuning: use open-source libraries such as DeepSpeed, Hugging Face Transformers, JAX, PyTorch, and TensorFlow to improve model performance

Large Language Model Operations (LLMOps)
- Model deployment and maintenance: deploy and manage LLMs on cloud platforms
- Model training and fine-tuning: train and refine LLMs to improve their performance on specific tasks
- Scale LLMs up and down, perform blue/green deployments, and roll back bad releases

Data Management & Pipeline Operations
- Curate and prepare training data; monitor and maintain data quality
- Data preparation and prompt engineering: iteratively transform, aggregate, and de-duplicate data, and make the data visible and shareable across data teams
- Build vector databases to retrieve contextually relevant information

Monitoring & Evaluation
- Monitoring and evaluation: track LLM performance, identify errors, and optimize models
- Model monitoring with human feedback: create model and data monitoring pipelines with alerts for both model drift and malicious user behavior
- Establish monitoring metrics

Infrastructure & DevOps
- Continuous integration and delivery (CI/CD): build CI/CD pipelines that automate the model development process and streamline testing and deployment
- Develop and manage infrastructure for distributed model training (e.g., SageMaker, Ray, Kubernetes)
- Deploy ML models using containerization (Docker)

Required Technical Skills

Programming & Frameworks
- Open-source ML libraries such as DeepSpeed, Hugging Face Transformers, JAX, PyTorch, and TensorFlow
- LLM pipelines built with tools such as LangChain or LlamaIndex
- Python programming expertise for ML model development
- Experience with containerization technologies (Docker, Kubernetes)

Cloud Platforms & Infrastructure
- Familiarity with cloud platforms such as AWS, Azure, or GCP, including services like EC2, S3, SageMaker, or Google Cloud ML Engine for scalable and efficient model deployment
- Deploying large language models on Azure and AWS, or on services such as Databricks
- Experience with distributed training infrastructure

LLM-Specific Technologies
- Vector databases for RAG implementations
- Prompt engineering and template management
- Prompting techniques such as few-shot and chain-of-thought (CoT) that enhance model accuracy and response quality
- Fine-tuning and model customization techniques
- Knowledge graphs
- Relevance engineering

Location: DGS India - Pune - Baner M- Agile
Brand: Merkle
Time Type: Full time
Contract Type: Permanent