Site Reliability Engineer
Tecsys Inc.
- Bangalore, Karnataka
- Permanent
- Full-time
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Develop tools & automation on top of Azure & AWS to continuously reduce the need for manual intervention.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Be on-call.
- Practice sustainable incident response and blameless postmortems.
- Implement automated solutions for continuous integration and delivery (CI/CD).
- Implement monitoring, logging, alerting, and SLA reporting.
- Implement service monitoring dashboards displaying key metrics.
- Create and maintain technical documentation.
- Apply SRE best practices.
- Take command of high-severity incidents and facilitate their resolution.
- Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
- Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
- Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining a high level of performance, availability, and reliability for our users.
- Bachelor's degree in computer science or related technical discipline.
- At least 5 years’ experience in systems engineering, with demonstrable technical experience in new platform development, orchestration, product ownership, and iterative design and deployment.
- Experience designing and deploying large-scale systems, multi-vendor platforms and globally distributed infrastructure.
- Strong knowledge of system design; high-performance computing; file, block, and storage technologies; integration of compute, storage, and network technologies to deliver cohesive infrastructure solutions.
- Strong understanding of full-stack automation, with examples of projects you have executed; our scale requires a lot of it. As we grow, we aim to reduce manual intervention and use both internal and open-source tools to automate day-to-day activities.
- Self-organize, collaborate, and manage efforts with peers and teams across responsibility areas, languages, geography, and time zones.
- Be a self-starter, curious, and not afraid to ask questions and challenge the way things are done today.
- See a problem or opportunity, take ownership and act on it independently.
- Knowledge of Datadog (or a similar or equivalent product) preferred.
- Knowledge of Rapid7 Insight (or a similar or equivalent product) preferred.
- Knowledge and experience of AWS or Azure required.
- Basic knowledge of Java- or .NET-based development required.
- Knowledge of GitLab (enterprise license) preferred; at minimum, knowledge of Jenkins is required.
- Experience at a SaaS company is a strong asset.
- Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.
- Proficient English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues worldwide.
- Participate in the escalation on-call rotation.