
Mid QE Engineer
- Pune, Maharashtra
- Permanent
- Full-time
- Design, implement, and execute HA/DR testing strategies across distributed, cloud-native, and on-premises systems.
- Simulate failure scenarios (network outages, crashes, data corruption) to validate recovery and resilience.
- Automate failover, backup/restore, and disaster recovery test suites across services and databases.
- Embed chaos engineering and resilience testing into CI/CD pipelines alongside DevOps and SRE teams.
- Monitor and improve RTO (Recovery Time Objective) and RPO (Recovery Point Objective) across critical services.
- Build and maintain scalable test automation frameworks to support HA/DR and resilience testing.
- Analyze logs, metrics, and system behaviors to proactively identify weak points.
- Mentor engineers and share best practices in reliability, resiliency, and automation.
- Maintain and enhance manual and automated test cases and procedures.
- Conduct load, stress, and performance testing using enterprise-level tools.
- Interpret test results, perform root cause analysis, and prepare comprehensive reports.
- Participate in requirements, architecture, and design reviews with a focus on resilience.
- Serve as a mentor for other engineers on automation and test practices.
- Manage and maintain physical, virtualized, and simulated test environments.
- Identify and resolve integration issues across software components.
- Review and improve unit test coverage in collaboration with development teams.
- 5+ years of experience with test automation tools/frameworks (e.g., Selenium, SpecFlow, or similar), including 3+ years in HA/DR or reliability testing.
- 5+ years in software testing and test automation using Java, C#, JavaScript/TypeScript, and SQL.
- Strong expertise in cloud platforms (AWS, GCP, or Azure), containers (Docker), and container orchestration (Kubernetes).
- Experience with test automation frameworks built in Java, Python, or similar languages, and with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, Azure DevOps).
- Familiarity with chaos engineering tools (Gremlin, Chaos Monkey, LitmusChaos).
- Solid understanding of distributed systems, microservices architectures, and database replication/failover.
- Proficiency with observability and monitoring tools (Prometheus, Grafana, ELK, Splunk).
- Strong knowledge of test automation design, frameworks, and iterative development.
- Experience building custom automation frameworks for resilience and failover testing.
- Exposure to event-driven architectures (Kafka, Pub/Sub) with HA/DR validation.
- Background in performance and scalability testing under fault conditions.
- Familiarity with GitOps and Infrastructure-as-Code (Terraform, Helm, Ansible).
- Experience mentoring QA/Dev teams on resilience and automation practices.
- Knowledge of Identity and Access Management (IAM).
- Hands-on experience with JMeter for performance testing.
- Knowledge of accessibility testing standards and practices.
- Understanding of Behavior-Driven Development (BDD), including tools like Cucumber and the Gherkin syntax.
- Experience in leadership or team lead roles (bonus).
- Strong organizational and problem-solving skills.
- Curious, resourceful, and eager to tackle new challenges.
- Experience in planning and implementing testing strategies and automation infrastructure for large-scale systems.
- Proven ability to collaborate with cross-functional teams in fast-paced environments.
- Lifelong Learner: You are always seeking to improve your technical and nontechnical skills.
- Team Player: You are someone who wants to see everyone on the team succeed and is willing to go the extra mile to help a teammate in need.
- Communicator: You know how to communicate your design ideas to both technical and nontechnical stakeholders, prioritizing critical information and leaving out extraneous details.
- Customer-first thinking: You are passionate about ensuring uptime and reliability for end users.
- Curious & experimental: You enjoy chaos testing and “breaking things on purpose” to uncover weaknesses.
- Automation-first approach: You eliminate manual steps through scripting and frameworks.
- Collaborative: You work seamlessly with developers, SREs, and product teams to strengthen resilience.
- Proactive & accountable: You take ownership of quality and see failures as opportunities to improve.