Observability & Monitoring Lead

Indore, Madhya Pradesh
Permanent
Full-time

1 month ago

Project description

Support clients in the operation, maintenance, and optimization of Oracle Cerner EHR environments. This role is designed for early-career professionals who are eager to grow their technical skills in healthcare IT while working under the mentorship of experienced consultants and technical leaders. You will gain hands-on exposure to Cerner infrastructure, system workflows, and healthcare technology best practices while contributing to meaningful client outcomes.

Responsibilities

1. Trend Analysis & Problem Identification
- Identify recurring incident patterns, anomalies, and signs of alert fatigue that may indicate deeper systemic issues.
- Collaborate with L2/L3 teams to review telemetry data and recommend improvements to alert thresholds, rules, and policies.
- Provide insights that support proactive issue prevention, noise reduction, and overall monitoring refinement.

2. Platform Management & Optimization
- Develop, update, and maintain dashboards that reflect real-time system health, performance metrics, and service behavior.
- Support the ongoing adoption and optimization of Dynatrace, enhancing dashboarding and visualization capabilities for cloud and on-prem observability.
- Assist in routine platform checks, ensuring monitoring tools remain accurate, stable, and aligned with business and operational requirements.

3. Leadership & Collaboration
- Organize the team's work, including planning, task breakdown, and ensuring clarity of priorities.
- Provide structured, timely updates to leadership on progress, risks, blockers, team capacity, and delivery timelines.
- Work closely with application teams, SRE groups, and infrastructure operations during incident triage, investigations, and routine monitoring reviews.
- Ensure clear, timely, and effective communication with stakeholders during service-impacting events, providing status updates and context as needed.
- Ensure adherence to engineering best practices, drive operational excellence, and maintain accountability for team delivery outcomes.

4. Operational Excellence
- Support platform stability and availability through adherence to lifecycle maintenance, patching schedules, and vulnerability management processes.
- Contribute to the improvement of monitoring workflows, alert routing logic, runbook effectiveness, and incident management practices.

5. Innovation & AI Enablement
- Assist in exploring and adopting AI-driven capabilities that improve observability, automate root-cause identification, and reduce manual effort.
- Contribute to internal knowledge sharing by documenting best practices, playbooks, AI reference materials, and usage guidelines (e.g., Copilot tips).

6. Collaboration & Leadership Support
- Partner with cross-functional teams to align monitoring practices with evolving business needs and operational priorities.
- Drive end-to-end delivery of monitoring initiatives: requirements gathering, planning, execution oversight, and delivery validation.
- Coordinate cross-team dependencies, ensure timelines are met, and proactively remove blockers for the team.
- Provide subject-matter support for ITSM processes, including incident, problem, and change management discussions.

SKILLS

Must have

- 6+ years in Site Reliability Engineering or Observability/Monitoring engineering roles.
- 5+ years of hands-on experience with monitoring/observability tools: New Relic, SolarWinds, WUG.
- 4+ years of scripting experience (JavaScript, Java, PowerShell, or others).
- 2+ years with Azure (architecture fundamentals, observability in cloud-native and lift-and-shift contexts).
- 4+ years of scripting with Python and Bash or PowerShell for automation.
- Experience troubleshooting complex distributed applications, leading/participating in war rooms, and performing code-level impact analysis (read logs/stack traces, correlate with deploys and infra changes).
- Solid understanding of observability best practices (metrics, logs, traces), ITSM processes, and alert hygiene.
- An "automate any task" mindset.
- Maintain associated documentation as it applies to our audit and certification requirements.
- Ensure platform stability, availability, and compliance through proactive vulnerability management and lifecycle maintenance.
- Drive process improvements for monitoring workflows and incident management.
- Participate in troubleshooting, capacity planning, and performance analysis activities.
- Research new monitoring requirements and, in many cases, write code to meet them.
- Solid expertise in setting up monitoring policies, rules, and templates, and in writing scripts to accomplish monitoring requirements.
- Excellent problem-solving, communication, and cross-team collaboration skills.

Nice to have

Certifications

Luxoft
