
AI Software System Engineer
- Hyderabad, Telangana
- Permanent
- Full-time
- Develop, implement, and maintain GPU-based clusters, ensuring optimal performance.
- Administer ML/AI platforms (distributed ML services, LLMs, and AI inference) by managing deployments, resource allocation, monitoring, and security.
- Automate system provisioning and cluster management end to end.
- Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise.
- Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards.
- Use AI/ML to continuously improve the internal processes and tools used in the team's end-to-end service delivery.
- 5+ years of experience developing Python-based AI applications and UIs
- 5+ years of experience in HPC infrastructure engineering for the AI/HPC domain
- 5+ years of experience in SLURM and Kubernetes management
- 2+ years of experience managing GPU clusters and optimizing GPU-based services, tools, and software
- Experience creating web services with an HPC backend (for example, AI workloads)
- Proficiency in RoCEv2, Kubernetes (K8s), KVM, Ubuntu, Python, shell scripting, GPU drivers, and cluster interconnects with 400G networking.
- Demonstrated experience with AI workload schedulers and allocation optimization.
- Experience with automation and monitoring tools: Ansible/SaltStack, Terraform, Prometheus, Grafana
- Strong organizational, problem-solving, and troubleshooting skills, with the ability to manage multiple projects simultaneously.
- Excellent verbal and written communication skills, with the ability to collaborate effectively with team members and stakeholders at all levels of the organization.