Specialist, Project Management, Lead Observability Management (SRE)
DBS Bank
- Chennai, Tamil Nadu
- Permanent
- Full-time
- Responsible for designing and deploying new ELK clusters (Elasticsearch, Logstash, Kibana, Beats, ZooKeeper, etc.) and proactively monitoring their performance.
- Design the infrastructure for Elasticsearch, Logstash and Kibana. Ensure the implementation meets security controls, complies with OS-level networking standards, and controls access on a least-privilege basis.
- Manage the cluster and integrate it with Logstash and Elasticsearch.
- Design and develop data engineering pipelines.
- Design and configure ETL data pipelines using Elastic Common Schema to onboard application logs and metrics. Configure index templates and index lifecycle management (ILM) for data retention.
- Automate repetitive tasks, optimize practices, and perform thorough testing to ensure product quality.
- Create and maintain software documentation.
- Provide monitoring and proactive support, including morning checks.
- Provide engineering solutions and frameworks to support machine learning and data-driven business activities at large scale.
- Perform R&D on new technologies and solutions to improve the accessibility, scalability, efficiency and capabilities of the machine learning and analytics platform.
- Establish, apply and maintain best practices and principles of machine learning engineering.
- Keep innovating and optimizing the machine learning workflow, from data exploration and model experimentation/prototyping to production.
- Onboard applications into monitoring tools and perform production support for the platform.
- Deploy, support and monitor existing and new services and application stacks.
- Automate repetitive tasks, optimize processes, and perform thorough testing to ensure quality.
- Design and develop data engineering pipelines and manage data lifecycle policies.
- Strong experience with the full ELK Stack: Elasticsearch, Logstash, Kibana, Beats agents, Machine Learning, APM, X-Pack and REST API integration.
- Experience with developing in multiple languages (Python, Bash, Painless, or other scripting languages).
- Develop Elastic alerting solutions using Watcher and Kibana or Grafana. Alerts must integrate with Teams and email.
- Develop Machine Learning (ML) jobs to dynamically monitor and alert on specific metrics.
- Basic knowledge of database systems (RDBMS, MariaDB, SQL, NoSQL).
- Experience in Node.js and Spring Boot would be a plus.
- Experience & skills in automation tools (e.g. Ansible) & DevOps pipelines are appreciated
- Implement Site Reliability Engineering principles covering performance, reliability, monitoring and alerting in the production environment.
- Self-driven, strong, committed, and reliable team player. Ability to contribute to discussions on design and strategy.
- Good problem diagnosis and creative problem-solving skills
- Working knowledge of Grafana, Prometheus, Confluent Kafka and the Elastic stack (Elasticsearch / Logstash / Kibana / Beats), including data ingestion, management, monitoring and analytics. Able to perform L1/L2 ELK-related tasks.
- Experience in designing and building highly scalable distributed ML models in production, and in creating and deploying machine learning jobs for anomaly detection in IT ecosystems.
- Create automated anomaly detection systems and continuously track their performance.
- Experience in anomaly detection or root cause analysis related to monitoring products is preferred.
- Familiarity with machine learning development frameworks such as ELK, PyTorch, etc., and experience in the practical application and optimization of algorithm projects.
- In-depth experience in Unix/Linux/shell, strong programming knowledge of Python, and use of design patterns in development.
- Knowledgeable and experienced in SRE (Site Reliability Engineering) practices covering monitoring, observability, performance management, automation, and resiliency.
- Adequate knowledge of database systems (RDBMS, MariaDB, SQL, NoSQL).
- All business units in India for application feature enhancement, support & maintenance
- Internally with all IT heads, Leaders & members
- International/local software developers/vendors.
- Coordinate with the on-shore team to resolve issues on the highest priority.
- Drive Performance Through Value Based Propositions
- Ensure Customer Focus by Delighting Customers & Reduce Complaints
- Build Pride and Passion to Protect, Maintain and Enhance DBS’ Reputation
- Enhance Knowledge Base, Build Skill Sets & Develop Competencies
- Invest in Team Building & Motivation through Ideation & Innovation
- Execute at Speed While Maintaining Error Free Operations
- Develop a Passion for Performance to Grow Talent Pool
- Maintain the Highest Standards of Honesty and Integrity