
Lead Site Reliability Engineer (Expert LAN & Wireless plus Automation)
- Delhi
- Permanent
- Full-time
- Define, build, and maintain support systems to ensure high availability and performance.
- Work closely with Product, Engineering, and Service Support architects on productizing new products as the Operations technical expert, and review non-standard bids to assess operational feasibility.
- Ensure Operations teams are ready and trained to support new products effectively.
- SREs are responsible for making sure that the systems and services they support meet the non-functional requirements (NFRs) defined by the business, the users, and the organization. They are the guardians of reliability and availability, ensuring that systems perform as expected, scale appropriately, and are resilient to failures.
- Defining and Understanding NFRs: SREs collaborate with stakeholders to understand and define NFRs, such as performance targets (response time, throughput), scalability limits, security requirements (encryption, authentication), and maintainability goals (ease of updates, error handling). SREs also focus on building systems that are scalable, meaning they analyse network performance data and capacity requirements proactively to ensure the network can handle current and future demands without performance degradation.
- Monitoring and Incident Response: SREs implement robust monitoring systems to track key metrics related to NFRs. They set up alerts and notifications to proactively identify and address potential issues before they impact users.
- Define and maintain an event catalog specifying active event thresholds, propose and implement relevant remediation, and optimize it for efficiency.
- Develop event response protocols, provide training to teams, and ensure quick and efficient handling of incidents.
- Act as the highest technical escalation contact for complex cases in Portfolio service operations.
- Accountable within SGS for the in-scope product, ensuring high availability and performance of the product/solution.
- Technical expert/guru in the domain and point of contact for engineering, management, operations & product.
- Optimize network performance by analysing traffic patterns, identifying bottlenecks, and implementing solutions.
- Coordinate with incident management teams, operations experts, and the various application and platform Portfolio service operations and Engineering teams to develop and implement permanent solutions.
- Conduct thorough problem investigations via trend analysis to diagnose recurring incidents and find permanent solutions.
- Conduct the weekly problem review board, monitor the effectiveness of problem resolution activities, and provide regular reports on problem management activities to ensure continuous improvement.
- Deployment and Release Management: SREs implement strategies, clear processes, SOPs, and rollback plans to minimize risk and reduce downtime during releases. They ensure that new features and changes are deployed in a controlled manner, minimizing the impact on existing services.
- Track deployment progress, conduct operational readiness assessments for successful execution of deployments, and mitigate risks or improve deployment plans to ensure service stability.
- DevOps/NetOps Management: Manage continuous integration and deployment (CI/CD) pipelines, ensuring smooth integration between development and operational teams.
- Build scripts to automate network tasks, reducing manual effort, toil, and human error.
- Implement automation for system provisioning, self-healing (auto-recovery), deployment, system health checks, and event-to-incident monitoring with proper correlation.
- Implement and manage infrastructure as code, provide ongoing support for automation tools, and continuously improve DevOps/NetOps practices.
- Create and maintain documentation covering network configurations, SOPs, and troubleshooting guides.
- Airline experience and/or ATI know-how is good to have.
- Expertise in troubleshooting data center and cloud technology issues.
- Expertise in technologies like Cisco routing & switching, Cisco ACI, Cisco Nexus, Aruba, ClearPass, Juniper Mist, or any other wireless technology.
- Hands-on experience with CI/CD pipeline automation, performance monitoring, and the implementation of infrastructure as code.
- Experience in a NetOps working environment.
- Experience in automation and scripting.
- Proven experience in managing high-availability systems and ensuring operational reliability.
- Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
- At least one of Terraform, Python, or another language is a must for automation and scripting.
- Knowledge of Git processes is good to have.
- CI/CD pipeline tools such as GitHub are good to have.
- Experience implementing architectural standards into pipelines
- Cisco routing & switching is a must-have.
- Cisco ACI
- Load balancers
- Any wireless technology: Juniper Mist, Aruba AP, or Cisco.
- Cisco data center switches such as Nexus are a must-have.
- Aruba ClearPass knowledge is good to have.
- Palo Alto firewalls are good to have.
- Knowledge or experience with cloud platforms (AWS, Azure, Google Cloud) and their networking services is good to have.
- Familiarity with operating systems (Linux, Windows) and system-level troubleshooting is good to have.
- Understanding of ITIL or other incident management frameworks. Ability to effectively communicate technical information and collaborate with diverse teams.
- Adhering to SITA Principles & Values
- Good Communication
- Creating & Innovating
- Customer Focus
- Impact & Influence
- Leading Execution
- Results Orientation
- Teamwork
- Bachelor's or Master's degree in Computer Science, Information Technology, Engineering, or a related field.
- Relevant certifications such as CCIE (Data Center or Routing & Switching), or expert-level certification in Juniper Mist or Aruba, and Palo Alto Firewall.
- Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies (e.g. Certified DevOps Professional) are good to have.
- ITIL certification.