
Computer Scientist - I (SRE)
- Noida, Uttar Pradesh
- Permanent
- Full-time
- Dynamic GPU orchestration using Kubernetes
- Built-in support for training and inference workflows
- End-to-end observability and cost tracking
- High developer velocity via self-service tooling
- Build tools, APIs, and platforms to improve reliability, deployment, and performance.
- Architect and scale Kubernetes-based infrastructure for high-performance workloads, including GPU workloads.
- Write clean, efficient, and maintainable code in Python, Go, or similar languages.
- Own and evolve CI/CD pipelines, infrastructure-as-code systems (Terraform, Helm), and service observability.
- Troubleshoot and resolve complex system, network, and application-level issues.
- Participate in blameless incident response, root cause analysis, and reliability reviews.
- 4–5 years of experience in SRE, DevOps, or platform engineering roles.
- Strong development background with fluency in Python, Go, or similar languages.
- Solid understanding of Kubernetes internals, workload orchestration, and Helm.
- Deep knowledge of networking fundamentals: DNS, TCP/IP, routing, VPNs, firewalls.
- Experience with infrastructure automation and configuration management.
- Understanding of GPU scheduling, resource allocation, and NVIDIA ecosystem tools.
- Familiarity with service meshes, observability stacks (Prometheus, Grafana, OpenTelemetry), and cloud-native patterns.
- Experience supporting AI/ML pipelines, especially GPU-based training/inference.
- Contributions to open-source projects or internal developer platforms.
- You’ll write software, not just YAML.
- You’ll get to work on real AI infrastructure challenges (not just buzzwords).
- You’ll have impact across developer productivity, platform scalability, and service reliability.
- You’ll join a team that values code quality, systems thinking, and ownership.