
Lead Systems Engineer
- Hyderabad, Telangana
- Permanent
- Full-time
- Competitive compensation, including base pay and annual incentive
- Comprehensive health and life insurance and well-being benefits, based on location
- Pension / Retirement benefits
- Paid time off, personal/family care leave, and other leaves of absence when needed to support your physical, financial, and emotional well-being.
- DTCC offers a flexible/hybrid model of 3 days onsite and 2 days remote (onsite Tuesdays, Wednesdays and a third day unique to each team or employee).
- Lead the migration from OpenText monitoring tools to Grafana and other open-source platforms.
- Design and deploy monitoring rules for infrastructure and business applications.
- Develop and manage alerting rules and notification workflows.
- Build real-time dashboards to visualize system health and performance.
- Configure and manage OpenTelemetry Collectors and Pipelines.
- Integrate observability tools with CI/CD, incident management, and cloud platforms.
- Deploy and manage observability agents across diverse environments.
- Perform upgrades and maintenance of observability platforms.
- Minimum of 7 years of related experience.
- Bachelor's degree preferred, or equivalent experience.
- Proven experience designing intuitive, real-time dashboards (e.g., in Grafana) that effectively communicate system health, performance trends, and business KPIs.
- Expertise in defining and tuning monitoring rules, thresholds, and alerting logic to ensure accurate and actionable incident detection.
- Strong understanding of both application-level and operating system-level metrics, including CPU, memory, disk I/O, network, and custom business metrics.
- Experience with structured log ingestion, parsing, and analysis using tools like Splunk, Fluentd, or OpenTelemetry.
- Familiarity with implementing and analyzing synthetic transactions and real user monitoring to assess end-user experience and application responsiveness.
- Hands-on experience with application tracing tools and frameworks (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and service dependencies.
- Proficiency in configuring and using AWS CloudWatch for collecting and visualizing cloud-native metrics, logs, and events.
- Understanding of containerized environments (e.g., Docker, Kubernetes) and how to monitor container health, resource usage, and orchestration metrics.
- Ability to write scripts or small applications in languages such as Python, Java, or Bash to automate observability tasks and data processing.
- Experience with automation and configuration management tools such as Ansible, Terraform, Chef, or SCCM to deploy and manage observability components at scale.