Machine Learning Intern
WadhwaniAI LEHS
- Delhi
- Contract
- Full-time
- Conducts experiments and reports results reliably, with guidance
- Experiments include (but are not limited to):
- Benchmark open-source LLMs (gpt-oss-120b/20b, Llama models, etc.) against proprietary LLMs (e.g., OpenAI's GPT-4o/GPT-4o-mini, Gemini)
- Experiment with LLMs to build a scalable, search-efficient KB, generate synthetic QA pairs from digital documents, and optimize prompts for scalable chatbots
- Evaluate language translation models/services for Indian languages (e.g., Bhashini, Sarvam, Google Translate)
- Assess speech models for Indian languages on ASR (speech-to-text) and TTS (text-to-speech) tasks (e.g., Amazon Polly, AI4Bharat's Conformer models, Sarvam)
- Improve the existing RAG-QA pipeline by identifying performance gaps and benchmarking different retrieval (embedding) and chunking techniques
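As a flavor of the chunking/retrieval benchmarking work above, here is a minimal toy sketch comparing two chunking strategies with a simple lexical retriever. The function names, chunk sizes, and term-overlap scorer are illustrative assumptions for this posting, not the team's actual pipeline:

```python
import re

def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_by_sentence(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def recall_at_k(chunks, query_terms, k=3):
    """Toy lexical retrieval: rank chunks by query-term overlap,
    then check whether any top-k chunk contains every query term."""
    scored = sorted(
        chunks,
        key=lambda c: sum(t.lower() in c.lower() for t in query_terms),
        reverse=True,
    )
    return any(all(t.lower() in c.lower() for t in query_terms)
               for c in scored[:k])
```

A real benchmark would swap the lexical scorer for embedding-based retrieval (e.g., FAISS or Chroma over an embedding model) and sweep chunk sizes and overlaps against a labeled QA set.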
- Gathers, cleans, analyzes, and processes text and speech data for building the knowledge base (KB) for conversational chatbots
- Learns to derive insights and next steps from experiments
- Monitors incoming data regularly and performs quality checks
- Collaborates with cross-functional teams to complete tasks on time
- Proactively seeks help and required information from peers
- Communicates research findings in a clear and concise manner
- Supports development of clean, well-documented codebases and works consistently to high standards
- Communicates and presents results effectively with peers
- Stays updated with recent advancements in GenAI/LLMs, ASR (STT and TTS), RAG-QA, LLM evaluation, etc., that can be applied in our product
- Develops expertise with typical ML tooling such as pandas, ML frameworks (PyTorch, scikit-learn), Excel (pivot tables), visualization libraries, experiment tracking (Weights & Biases), and GitHub
- Learns to work efficiently with tooling: Unix, VS Code, Google office suite, Calendar, Slack
- Ability to work in a fast-paced startup environment
- Eagerness to learn and apply the latest research in the domain to the solution
- Strong grasp of AI/ML fundamentals
- Strong grasp of LLM fundamentals such as RAG-QA, prompt engineering, evaluations, vector stores, retrieval and chunking, fine-tuning, synthetic data generation, model deployment, etc.
- Experience with GenAI tools such as (but not limited to) LangChain, LlamaIndex, LlamaParse, Langfuse, FAISS, Chroma, vLLM, and the OpenAI toolkits and SDKs
- Familiarity with OpenAI models and tools, and with open-source models such as gpt-oss, Llama, Gemma, Mistral, Wav2Vec2, and Bhashini's or AI4Bharat's language translation models
- Familiarity with Docker, AWS, GCP is a plus
- Strong Python coding and debugging skills, with hands-on experience with data science toolkits such as pandas, NumPy, and Matplotlib/Seaborn, and preferably at least one deep learning framework among PyTorch (preferred), Keras, and TensorFlow
- Should have completed coursework in Probability, Linear Algebra, and Calculus, and preferably have some exposure to AI / Machine Learning.
- Demonstrated experience of working in the field via an internship or project is highly preferred. Do provide links to some of your open-source projects.
- Prior exposure to Linux/Unix is expected before joining for the internship.