
R&D Engineer, HPC Systems
- Chennai, Tamil Nadu
- Permanent
- Full-time
- Expose limitations in existing solutions based on clusters of CPUs & GPUs when deploying AI-based solutions on on-prem & cloud infrastructures at scale.
- Develop distributed frameworks and system-level solutions that enable scaling out image processing & AI workloads from a single GPU to multi-node clusters with multiple GPUs.
- Install, benchmark, and evaluate pre-release hardware for early-stage evaluation and prototyping by identifying (or developing) relevant workloads.
- Master's / PhD in Computer Science or related fields; bachelor's degree holders with relevant experience and an extraordinary track record will also be considered.
- Deep understanding of operating systems, computer networks, and high-performance applications
- Good mental model of the architecture of modern distributed systems comprising CPUs, GPUs, and accelerators.
- Experience with deployments of deep-learning frameworks such as TensorFlow and PyTorch on large-scale on-prem or cloud infrastructures.
- Strong background in modern and advanced C++ concepts
- Strong scripting skills in Bash, Python, or similar.
- Good communication skills.
- Experience in heterogeneous programming languages like CUDA, Triton, etc.
- Experience with model development on DL frameworks such as TensorFlow and PyTorch
- Experience with building open-source operating systems and software stacks on pre-release hardware.
- Solid understanding of container infrastructure such as Docker, Singularity, and Kubernetes.
- Active participation in C++ standards bodies or similar