
Lead Engineer - I/O
- Gurgaon, Haryana
- Permanent
- Full-time
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, learning, and continuous improvement.
- Maintain and support infrastructure services in development, integration, and production environments.
- Design, implement, and manage robust, scalable, and high-performance systems and infrastructure.
- Ensure the reliability, availability, and performance of critical services through proactive monitoring, incident response, and root cause analysis.
- Drive the adoption of automation, CI/CD practices, and infrastructure as code (IaC) to streamline operations and improve operational efficiency.
- Collaborate with development teams to ensure that applications are designed for scalability, reliability, and fault tolerance.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) to monitor and improve service health.
- Lead incident management, troubleshooting, and postmortems to identify and address operational challenges.
- Manage scaling strategies and disaster recovery for cloud-based environments (GCP, Azure).
- Drive improvements in operational tooling, monitoring, alerting, and reporting.
- Act as a subject matter expert in reliability engineering best practices and promote these practices across the organization.
- Review services before they go live in production
- Focus on automation to improve scale and reliability
- Identify and propose alternative technologies to create scalable implementations
- Identify and prioritize which technical debt to eliminate
- Identify opportunities to influence the roadmap of infrastructure services
- 8+ years of experience in an engineering role, with hands-on experience in the public cloud
- Strong experience in designing and managing large-scale, distributed systems.
- Expertise in cloud technologies (GCP, Azure) and infrastructure automation tools (Terraform, Ansible, Puppet, etc.).
- Proficiency in containerization and orchestration technologies such as Docker, Kubernetes, and Helm.
- Experience with monitoring and observability tools such as Prometheus, Grafana, New Relic, or similar.
- Strong knowledge of CI/CD pipelines and related automation tools.
- Proficient in scripting languages such as Python and Bash
- Proficient in programming languages such as Java, .NET, or Go.
- Strong troubleshooting and problem-solving skills.
- Experience leading and mentoring engineering teams, with a strong focus on collaboration and communication.
- Familiarity with incident management processes and tools (e.g., ServiceNow, xMatters).
- Ability to learn and adapt in a fast-paced environment while producing quality code
- Ability to work collaboratively on a cross-functional team with a wide range of experience levels
- Ability to find creative ways to execute even when there is no historical context or known path forward
- Ability to design roadmaps and relevant solutions that let end users access interfaces
- Ability to assess the benefits, risks and success factors of potential applications