
SRE_Director_Software Production Management & Reliability Engineering
- Bangalore, Karnataka
- Permanent
- Full-time
- Proactively detecting, troubleshooting, and resolving all issues affecting production applications. This involves coordination with and escalation to development and external teams where necessary. The team owns every issue escalated to it until the issue is resolved or a workaround is provided that allows end users to continue working.
- Responsible for maintaining clear, concise, and timely communications with affected parties during the investigation and resolution of any individual or system-wide outage. Responsible for the stability of the Production environment.
- Develop and continually revise (in partnership with other teams where necessary) suitable policies and procedures to ensure appropriate application development standards are available to guide development for systems deployed to Production.
- As the gatekeepers of the Production environment, responsible for ensuring the Change Implementation Management guidelines/policies are adhered to for all systems deployed to Production.
- Responsible for servicing all requests for data or other activities that require access to Production systems.
- Work with development teams at the appropriate stages in application development to ensure any new systems or projects meet the Production standard.
- Responsible for maintaining and growing a body of knowledge that is accessible to all team members. Ensure information regarding any support-related activities or issues is available and easily accessible. The goal is to improve self-reliance and reduce dependency on the availability of development or external team resources for the initial troubleshooting and resolution of problems.
- As a team member with expertise in deep analytical triage, you will provide subject matter expertise in debugging, issue analysis, and troubleshooting, working with business and technical colleagues to provide reviews and recommendations that help prevent future application issues. You will produce guidance documentation, standards and procedures, product assessments, and training material, and work with the various application and infrastructure support teams to ensure that they document every troubleshooting step in the Morgan Stanley knowledge base system so that issues can be resolved faster. You will serve as a fully seasoned, proficient technical resource, providing technical knowledge in outage management and proactive solutions to improve the user experience.
- At least 4 years of relevant experience is generally expected to provide the skills required for this role.
- Minimum 7 years of experience in developing and/or supporting Enterprise Applications
- Willingness to embrace Agile and DevOps/SRE concepts.
- Solid analytical skills and experience with problem determination, resolution, and recovery processes.
- Have experience with observability and platform tools such as Prometheus, Grafana, Loki, Kibana, Splunk, and Kubernetes.
- Ability to interface and cultivate excellent working relationships with technology teams, business analysts, and vendors
- Strong Unix Shell scripting experience required.
- Have administrative competence in at least one major programming language or platform (for example: Perl, PowerShell, Python, or Java).
- Should learn new technologies quickly in a fast-paced environment.
- Have strong organizational skills and the ability to manage multiple tasks and high-pressure situations for outage handling, management, or resolution.
- Is driven to learn new technologies, techniques and what it takes to be an integral member of this team
- Hands-on experience administering large-scale, high-availability systems and the tools to monitor performance and availability
- BS/MS or equivalent, preferably in a quantitative discipline (Computer Science, Computer Engineering, EE, Math, Physics).
- Experience with on-call incident response and the ability to respond to emergencies on a 24/7 basis.
- Experience in the Financial Services sector is a plus.