
Principal SDET - Lustre
- Pune, Maharashtra
- Permanent
- Full-time
- Define, implement, and own the end-to-end QA architecture for distributed storage systems in HPC environments.
- Architect scalable test frameworks and automation pipelines to validate storage performance, throughput, IO behavior, and system reliability at scale.
- Design test plans that cover key areas such as metadata operations, object lifecycle, parallel IO, file system consistency, and failure scenarios.
- Lead performance benchmarking using industry-standard tools and custom workloads (e.g., IOR, MDTest, FIO, Vdbench).
- Validate Lustre integration with HPC compute clusters, job schedulers, and storage tiers (e.g., NVMe, SSD, HDD).
- Simulate large-scale distributed environments and execute fault-injection and resilience testing.
- Collaborate with product managers, architects, and DevOps teams to ensure test coverage across CI/CD pipelines and production-like environments.
- Mentor QA engineers in automation development, performance validation, and HPC-specific debugging techniques.
- Analyze test data, identify trends, bottlenecks, or regressions, and communicate findings clearly to engineering stakeholders.
- Design and implement automated test cases using BDD frameworks such as Cucumber (with Gherkin specifications) or similar.
- Develop test automation scripts and test utilities in Rust.
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
- 15+ years of experience in software QA or systems testing, with 5+ years in a QA Architect or technical lead role.
- Deep knowledge of distributed storage systems (e.g., Lustre, Ceph, GPFS/Spectrum Scale, BeeGFS, GlusterFS).
- Experience with HPC workloads and environments, including MPI, high-throughput clusters, InfiniBand, and RDMA.
- Strong understanding of POSIX file systems, object storage interfaces (e.g., S3), and parallel file systems.
- Proficiency in automation and scripting (Python, Bash, Rust).
- Hands-on experience with storage benchmarking and profiling tools: IOR, MDTest, FIO, Vdbench, perf, iostat, collectl.
- Familiarity with CI/CD tools and infrastructure-as-code (e.g., Jenkins, GitLab CI, Ansible, Terraform).
- Solid understanding of system-level debugging and analysis tools.
- Strong communication skills and ability to lead cross-functional quality initiatives.
- Ability to collaborate with development teams to understand specifications.
- Experience working with large-scale HPC clusters or supercomputing environments.
- Exposure to data-intensive applications like AI/ML pipelines, genomics, scientific simulations, or real-time analytics.
- Familiarity with Kubernetes, container storage interfaces (CSI), and containerized HPC workflows.
- Experience with hardware validation: NVMe, SSD, HDD tiering, network fabric performance tuning.
- Certifications in storage (e.g., SNIA) or HPC systems.
- Knowledge of cloud platforms (GCP, AWS).