Reasoning - Back End - Observability Engineer

Sarvam AI

  • Bangalore, Karnataka
  • Permanent
  • Full-time
  • 23 days ago
Observability Engineer - Sarvam Reasoning AI

About Job

The Sarvam AI Reasoning team is building sophisticated reasoning capabilities for India's first sovereign AI platform. We are seeking a skilled Site Reliability Engineer to join our organization. In this role, you'll be responsible for designing and implementing comprehensive observability systems, incident management frameworks, and automated remediation solutions. You'll work closely with our development teams to ensure our AI platform maintains exceptional reliability and performance. Your expertise will be critical in building resilient systems that can handle the scale and complexity of our enterprise-grade AI reasoning capabilities.

This challenging position offers the opportunity to work at the intersection of AI technology and systems reliability. We seek candidates with strong expertise in observability, incident management, and chaos engineering who can demonstrate excellence in designing and maintaining complex distributed systems.

Skills & Qualification

  • Bachelor's/Master's Degree in Computer Science or related field from a top-tier institution
  • 3-5 years of experience in Site Reliability Engineering or DevOps roles
  • Demonstrated expertise in designing and implementing observability platforms for distributed systems
  • Strong experience with monitoring tools, log aggregation systems, and metrics collection
  • Proficiency in developing dashboards and alerting systems
  • Extensive knowledge of incident management processes and escalation policies
  • Experience implementing SLO/SLI frameworks and error budgets
  • Familiarity with chaos engineering practices and tools
  • Knowledge of AIOps principles and predictive monitoring
  • Experience with incident response automation and remediation playbooks
  • Strong programming skills in languages like Python, Go, or Java
  • Excellent communication skills and experience with post-incident analysis

Responsibilities

  • Design comprehensive observability data platforms for logs, metrics, and traces
  • Design and implement distributed tracing with correlation IDs across all systems
  • Develop dashboards and alerting systems
  • Own on-call processes, incident classification, and escalation policies
  • Own SLO/SLI frameworks with error budgets and automated remediation playbooks
  • Implement chaos engineering practices to validate system resiliency
  • Build AIOps systems for anomaly detection and predictive monitoring
  • Implement automated incident response with predefined playbooks
  • Design blameless post-mortem processes and continuous improvement cycles
  • Collaborate with development teams to improve system reliability and performance
  • Establish and maintain documentation for operational procedures and best practices
  • Mentor junior engineers on observability and reliability engineering practices
