Sr Site Reliability Engineer
- Hyderabad, Telangana
- Permanent
- Full-time
- Experience solving problems related to large-scale distributed systems; able to break down complex problems and identify potential solutions, knowns, and unknowns.
- Works to drive continuous improvement and efficiency.
- Ability to write code in multiple languages, choosing the right statically or dynamically typed language for the job.
- Leads a project team, providing direction, issue resolution, and mentorship, as well as regular progress updates and reporting.
- Solve problems in mission-critical services; implement solutions to prevent recurrence; lead retrospectives to explore and understand root causes, define next steps to avoid future incidents, and document and report findings.
- Help shape SRE strategies by evaluating and contributing to product/service design. Participate in system design meetings, capacity planning, launch reviews, etc., to ensure support services/platforms are as efficient as possible before going live.
- Scale systems sustainably through mechanisms such as automation and evolve systems by fostering changes that improve reliability and velocity.
- Enhance data-driven engineering culture by providing statistical trends and analysis using real service data to increase service health and quality.
- Bachelor's (or higher level) degree in one or more of these disciplines: Computer Science, Computer Engineering, or related fields.
- 7+ years of professional experience in software engineering
- Experience setting up and using incident and on-call management systems.
- Experience setting up and building tools to collect and visualize data (logs, metrics, alerts), building dashboards, alerting, and monitoring systems.
- Experience deploying secure infrastructure and services in one or more cloud environments such as AWS or Azure.
- Experience with configuration management and deployment automation tools, such as Terraform, Ansible, Packer, etc.
- Proficiency in scripting languages such as Python and Bash.
- Experience with containers (Docker) and orchestration systems (Kubernetes).
- Solid understanding of the Linux OS and strong systems administration skills.
- Excellent analytical and troubleshooting skills.
- Dynamic collaborator who thrives in diverse, geographically distributed locales.
- Team player who demonstrates diplomacy and promotes sound ideas and concepts, paired with the desire to help others grow their skills.
- Strong verbal and written communication skills.
- Experience with NGINX technologies is a strong plus.
- Application build and deployment processes (git*, automation pipelines, infrastructure as code, etc.)
- Automated application delivery (load balancers, container orchestration, service mesh, high-availability architectures, frontend and backend technologies including databases, etc.)
- Service operation (defining, instrumenting, measuring, and managing service level objectives; experience with observability tooling including logging infrastructure, time series metrics databases, tracing systems, alert definitions, etc.)
- Incident management (service restoration, root cause analysis, postmortem authorship, definition of roles and responsibilities, etc.)
- Security awareness and competencies, including security as code.
- Configuration management
- Explores beyond the obvious to ensure Service Level Objectives (SLO) are met.
- Understands and measures system behaviors to quickly and efficiently diagnose, identify, and address needs.
- Proactively tests, automates, monitors outputs, and leverages signals to infer service needs.
- Data management to explore properties, patterns, and distributed tracing
- Constantly seeking ways to improve systems, making them more efficient and reducing toil.
- Understands the difference between short-term tactical fixes and long-term strategic fixes.
- Simplifies decisions and judgments by recognizing what to pay attention to and what to ignore; a proficient problem solver. Tenacious and resourceful with an inherent predisposition toward action; unafraid to try something new in the name of innovation.
- Possesses an inherent bias toward innovation and stays abreast of developing ideas and technologies. Thoughtfully and strategically considers future needs and opportunities, and advocates positive change.
- Technological creativity and capacity
- Conveys information, vision, and strategy in an accurate and timely manner, adjusting to ensure understanding based on the audience. Actively listens; seeks to understand rather than respond. Proactively solicits and values diverse perspectives, ideas, and opinions.