Machine Learning Engineer - Computer Vision & Vision-Language Models (VLMs)

About Sarvam AI

Sarvam.ai is a pioneering generative-AI startup headquartered in Bengaluru, India. We are dedicated to transformative R & D in language technologies, building scalable and efficient Large Language Models (LLMs) that serve a wide spectrum of languages, especially Indic languages. Our mission is to re-imagine human-computer interaction and craft novel AI-driven solutions that make language technology inclusive for diverse communities worldwide.

Role Overview

As a Machine Learning Engineer (MLE) in the Vision-Language team, you will build and refine vision, OCR, and language models for varied use-cases. Your work will span research, scalable training, and rigorous evaluation of cutting-edge computer-vision and VLM systems.

Key Responsibilities

Model R & D
- Prototype and fine-tune state-of-the-art vision architectures and vision-language models.
- Design and evaluate multimodal fusion strategies for robust image-text understanding.

Data & Training Pipelines
- Build distributed pipelines (PySpark / Ray) to curate and preprocess large-scale multimodal datasets (images, geospatial rasters, PDFs, video frames, captions).
- Implement efficient training loops in PyTorch/Lightning with mixed precision, gradient accumulation, and multi-GPU (≥ 4) parallelism.

Domain-Focused Applications
- Develop models for geospatial analysis, Indic document intelligence (OCR + layout), visual question answering (VQA), and broader computer-vision use-cases.

Evaluation & Benchmarking
- Define and automate task-specific metrics for OCR accuracy, retrieval, dense captioning, and VQA; maintain regression dashboards and ablation suites.

Required Qualifications

Experience: 2-3 years in ML engineering with emphasis on classical computer vision and modern vision-language models.

Education: Bachelor's or Master's in Computer Science, AI/ML, or related fields.

Technical Skills
- Strong Python & PyTorch; comfortable with CUDA profiling and tensor debugging.
- Hands-on experience training CV models (CNNs, ViTs) and/or VLMs on ≥ 4-GPU nodes.
- Proven ability to build, deploy, and monitor pipelines for OCR, object detection, and segmentation.
- Solid grasp of computer-vision fundamentals (detection, segmentation, representation learning) and transformer mechanics.

Software-Engineering Fundamentals:
- Proficiency with Git, unit tests, structured logging, Docker, and CI/CD.
- Ability to select and integrate appropriate databases (SQL, NoSQL, vector stores) for large-scale multimodal data.
- Experience designing scalable backend APIs/micro-services (gRPC/REST), including monitoring and observability best practices.

Preferred Qualifications
- Publications or submissions in CVPR/ICCV/ECCV, EMNLP, ACL.
- Prior work on multilingual or low-resource vision-language tasks.
- Experience with data-centric AI (active learning, synthetic augmentation).
- Contributions to open-source vision/NLP libraries (Hugging Face, OpenCV, Detectron2).
- Familiarity with distributed schedulers (KubeFlow, Slurm).