
Senior Platform Engineer
- Pune, Maharashtra
- Permanent
- Full-time
- Lead the design and implementation of scalable AI/ML infrastructure on Azure and AWS.
- Build and manage cloud-native infrastructure (Azure, AWS, Databricks) for AI workloads using Infrastructure-as-Code (IaC) tools like Terraform and Bicep.
- Create reusable self-service tooling, templates, and CI/CD workflows for data scientists and ML engineers.
- Govern AI systems with access control, audit trails, policy enforcement, and compliance monitoring (e.g., GDPR).
- Implement GenAI workloads using Azure AI Foundry, Azure AI Hub, Azure OpenAI, Amazon Bedrock, Anthropic Claude, Hugging Face, LangChain, etc.
- Implement infrastructure and DevOps practices for Agentic AI solutions using native Azure and AWS AI services.
- Collaborate with security and architecture teams to embed cloud security best practices in the AI platform.
- Contribute to incident response, troubleshooting, and root cause analysis of ML and GenAI workload failures and latency issues.
- Implement MLOps practices to manage and optimize the lifecycle of machine learning models, including monitoring, versioning, and retraining.
- Collaborate with data scientists, software engineers, and other stakeholders to ensure effective integration of AI solutions within the business.
- Stay up to date with the latest advancements in AI, cloud computing, and DevOps practices, and integrate relevant technologies into the platform.
- Review weekly/bi-weekly cloud cost reports, identify cost-saving opportunities, and lead the efforts to realize them.
- Mentor junior engineers, providing technical leadership and fostering a culture of continuous learning.
- Ensure compliance with industry standards and best practices for data security and privacy.
- 7+ years of experience in platform engineering, with a proven track record of designing, deploying, and managing scalable and secure cloud-based infrastructures, leveraging both Azure and AWS services.
- Experience with Azure services such as Azure AI services, Azure AI Search, Azure ML, Databricks, and Azure Kubernetes Service, and AWS services such as Amazon SageMaker, Amazon Bedrock, and AWS Lambda.
- Exposure to Generative AI and Agentic AI ecosystems such as Azure OpenAI, Azure AI Foundry, Azure AI Hub, Bedrock, Anthropic Claude, OpenAI API, LlamaCloud, LangChain.
- Understanding of token usage, LLM prompt injection risks, jailbreak attempts, and mitigation techniques.
- Strong knowledge of governance, audit, observability, and compliance in cloud-based GenAI and ML ecosystems.
- Understanding of the Azure AI Evaluation SDK and AI red teaming prompt security scans.
- Good to have: experience with code assistant tools such as GitHub Copilot, Cursor, and Claude Code.
- Expertise in Azure DevOps or AWS CodePipeline, including setting up and managing CI/CD pipelines.
- Advanced experience with Azure Blob Storage, Cosmos DB, SQL, Key Vault, AWS S3, DynamoDB, AWS RDS, etc., and their integrations with AI services.
- Advanced understanding of networking concepts, including DNS management, load balancing, VPNs, and virtual networks (VNets).
- Advanced understanding of security concepts, including IAM roles, identities, Azure Policy, and AWS service control policies (SCPs).
- Experience with advanced authentication and authorization concepts across various cloud providers and platforms.
- Must have experience with Azure Policy, AWS SCPs, AWS IAM, audit logging, Azure RBAC, etc.
- Mastery of infrastructure-as-code tools such as Azure ARM / Bicep, Terraform, CloudFormation, or equivalent.
- Proficiency in networking, DNS, load balancers, and cloud engineering services.
- Knowledge of Python programming and AI/ML libraries (TensorFlow, PyTorch, scikit-learn, etc.).
- Experience with containerization and orchestration tools such as Docker and Kubernetes.
- Good to have: knowledge of Azure Bot Framework, Azure API Management (APIM), and Application Gateway; M365 offerings such as M365 Copilot; and AWS CDK and the AWS SDK for Python (Boto3).
- Experience with monitoring tools like Grafana, Prometheus, Application Insights, Log Analytics Workspaces, and Azure Monitor.
- Strong problem-solving and analytical skills.
- Strong communication and collaboration skills to work effectively with diverse teams.
- Proven leadership abilities to guide and mentor junior engineers.