
Platform Engineering - III
- Hyderabad, Telangana
- Permanent
- Full-time
- Design and implement comprehensive observability frameworks to monitor performance, reliability, and availability of EDAA data platforms and services.
- Define and track key SLIs, SLOs, and KPIs across critical platform components and data pipelines.
- Lead the integration of monitoring, logging, tracing, and alerting tools to enable real-time insights and root cause analysis.
- Collaborate with platform engineering, SRE, and product teams to enhance observability coverage and automate incident responses.
- Drive the adoption of best practices in telemetry collection, dashboards, and visualization for operational excellence.
- Oversee incident management processes and post-mortem practices to ensure continuous reliability improvements.
- Provide leadership in tool evaluation and deployment across observability and performance management platforms.
- Partner with security, compliance, and data governance teams to ensure visibility into data usage, lineage, and policy adherence.
- Lead operational reviews and reporting to highlight system health, risks, and opportunities for optimization.
- Mentor and coach engineers and analysts on observability concepts and tools to build a culture of shared ownership and resilience.
- 8+ years of experience in platform engineering, site reliability engineering (SRE), DevOps, or observability roles
- Proficient with ETL pipelines
- Strong expertise with observability tools and platforms (e.g., Datadog, Grafana, Prometheus, ELK, OpenTelemetry, Splunk)
- Experience setting and managing SLIs/SLOs/SLAs in high-availability environments
- Strong background in incident response, performance tuning, and root cause analysis
- Proven ability to lead cross-functional collaboration and align technical work with business outcomes
- Bachelor's degree in Computer Science, Engineering, or a related field; advanced degree or certifications preferred