
Sr. AI System Infrastructure Engineer
- Bangalore, Karnataka
- Permanent
- Full-time
- Design, develop, and optimize algorithms for collective communication operations (e.g., All-Reduce, All-to-All, Broadcast) within AMD's RCCL.
- Analyze and tune the performance of collective communication libraries on large-scale GPU clusters, focusing on latency, bandwidth, and scalability over high-speed network fabrics.
- Integrate and validate RCCL with various network transport layers and protocols, such as UEC, RoCE (RDMA over Converged Ethernet), and custom interconnects.
- Collaborate closely with hardware, driver, and machine learning framework teams to co-design and debug system-level performance issues.
- Develop robust benchmarking and profiling tools to identify and resolve bottlenecks in the communication software stack.
- Contribute to the upstream open-source RCCL project and stay current with the latest advancements in the field.
- Provide expert guidance on GPU cluster network topology and configuration to maximize collective communication performance.
- 10+ years of experience in software development with a strong focus on high-performance computing or distributed systems.
- Highly proficient in C/C++ programming and debugging in a Linux environment.
- Experience with performance analysis, profiling, and debugging of complex, distributed systems in a Linux environment.
- Proven track record of optimizing software for specific hardware architectures.
- Strong analytical and problem-solving skills, with a proven ability to diagnose and resolve complex performance issues.
- Hands-on experience with AMD ROCm RCCL or similar GPU collective communication libraries (NVIDIA NCCL, MSCCL, oneCCL, Open MPI, MPICH, or other MPI implementations); experience with RoCEv2/RDMA is a huge plus.
- Effective communication and problem-solving skills.
- Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.