Principal Software Architect - High Performance Computing (HPC)
Applied Materials
- Bangalore, Karnataka
- Permanent
- Full-time
- As a Software Architect, you will be responsible for the design and implementation of robust, scalable infrastructure solutions combining diverse processors (CPUs, GPUs, FPGAs).
- You will analyze workloads and partition them across the most appropriate compute units, ensuring that tasks like AI inference and parallel processing run on specialized accelerators while serial tasks run on CPUs.
- You will work closely with cross-functional teams, including algorithm engineers, product managers, and business stakeholders, to understand requirements and translate them into architectural and software designs that meet business needs.
- You will code quick prototypes to validate your designs with real code and data.
- You will be a subject matter expert who unblocks software engineers in the HPC domain.
- You will be expected to profile both the entire cluster and its individual nodes to understand bottlenecks, and to optimize workflows, code, and processes to improve cost of ownership.
- Conduct performance tuning and capacity planning, monitoring GPU metrics (e.g., using NVIDIA DCGM) for reliability.
- Evaluate and recommend appropriate technologies and frameworks to meet project requirements.
- Lead the design and implementation of complex software components and systems.
- Your primary focus will be on ensuring that software systems are scalable, reliable, maintainable, and cost-effective.
- 12 to 18 years of experience implementing robust, scalable, and secure infrastructure solutions combining diverse processors (CPUs, GPUs, FPGAs).
- Working experience with GPU inference servers such as NVIDIA Triton.
- Very good knowledge of C/C++, data structures, algorithms, and complexity analysis.
- Experience in developing distributed high-performance computing software using parallel programming frameworks like MPI, UCX, etc.
- Experience in GPU programming using CUDA, OpenMP, OpenACC, OpenCL etc.
- In-depth experience in multithreading, thread synchronization, inter-process communication, and distributed computing fundamentals.
- Experience in inter-process communication using shared memory and pipes.
- Experience in performance profiling at the application and system level (e.g., Intel VTune, OProfile, perf, NVIDIA Nsight).
- Experience in low-level code optimization techniques such as vectorization and intrinsics, cache-aware programming, and lock-free data structures.
- Familiarity with microservices architecture, containerization technologies (Docker/Singularity), and low-latency message queues.
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration abilities.
- Ability to mentor and coach junior team members.
- Experience in Agile development methodologies.
- Experience in HPC job scheduling and cluster management software (SLURM, Torque, LSF, etc.).
- Good knowledge of low-latency, high-throughput data transfer technologies (RDMA, RoCE, InfiniBand).
- Good knowledge of parallel processing and DAG execution frameworks like Intel TBB Flow Graph, OpenCL/SYCL, etc.