Sr. AI System Infrastructure Engineer

Bangalore, Karnataka
Permanent
Full-time

1 month ago

Job Description:

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

MTS SOFTWARE DEVELOPMENT ENGINEER

THE ROLE:

AMD is looking for a specialized software engineer who is passionate about improving the performance of key applications and benchmarks. You will be a member of a core team of incredibly talented industry specialists and will work with the very latest hardware and software technology.

THE PERSON:

The ideal candidate should be passionate about software engineering, possess the leadership skills to drive sophisticated issues to resolution, and be able to communicate effectively and work well with different teams across AMD.

KEY RESPONSIBILITIES:

Design, develop, and optimize algorithms for collective communication operations (e.g., All-Reduce, All-to-All, Broadcast) within AMD's RCCL.
Analyze and tune the performance of collective communication libraries on large-scale GPU clusters, focusing on latency, bandwidth, and scalability over high-speed network fabrics.
Integrate and validate RCCL with various network transport layers and protocols, such as UEC, RoCE (RDMA over Converged Ethernet), and custom interconnects.
Collaborate closely with hardware, driver, and machine learning framework teams to co-design and debug system-level performance issues.
Develop robust benchmarking and profiling tools to identify and resolve bottlenecks in the communication software stack.
Contribute to the upstream open-source RCCL project and stay current with the latest advancements in the field.
Provide expert guidance on GPU cluster network topology and configuration to maximize collective communication performance.

PREFERRED EXPERIENCE:

10+ years of experience in software development with a strong focus on high-performance computing or distributed systems.
Highly proficient in C/C++ programming and debugging in a Linux environment.
Experience with performance analysis, profiling, and debugging of complex, distributed systems in a Linux environment.
Proven track record of optimizing software for specific hardware architectures.
Strong analytical and problem-solving skills, with a proven ability to diagnose and resolve complex performance issues.
Hands-on experience with AMD ROCm RCCL or similar GPU collective communication libraries (NVIDIA NCCL, MSCCL, oneCCL, Open MPI, MPICH, or other MPI implementations) and with RoCEv2/RDMA is a huge plus.

Effective communication and problem-solving skills.

ACADEMIC CREDENTIALS:

Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.

Benefits offered are described: .

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Advanced Micro Devices
