Agentic Infrastructure Observability Engineer

XenonStack

  • Mohali, Punjab
  • Permanent
  • Full-time
  • 8 days ago
ABOUT XENONSTACK

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:

  • Agentic Systems for AI Agents
  • Vision AI Platform
  • Inference AI Infrastructure for Agentic Systems

Our mission is to accelerate the world’s transition to AI + Human Intelligence by building platforms that are scalable, reliable, and observable by design.

THE OPPORTUNITY

We are seeking an Agentic Infrastructure Observability Engineer to design and implement end-to-end observability frameworks for AI-native and multi-agent systems.

This role sits at the heart of AgentOps and Reliability Engineering, ensuring that agents, pipelines, and infrastructure are monitored, measurable, and continuously optimized.

If you thrive on metrics, monitoring, and making complex systems transparent and reliable, this role offers a chance to define observability for the next generation of enterprise AI.

KEY RESPONSIBILITIES

Observability Frameworks

  • Design and implement observability pipelines covering metrics, logs, traces, and cost telemetry for agentic systems.
  • Build dashboards and alerting systems to monitor reliability, performance, and drift in real-time.

Agentic AI Monitoring

  • Track LLM usage, context windows, token allocation, and multi-agent interactions.
  • Build monitoring hooks into LangChain, LangGraph, MCP, and RAG pipelines.

Reliability & Performance

  • Define and monitor SLOs, SLIs, and SLAs for agentic workflows and inference infrastructure.
  • Conduct root cause analysis of agent failures, latency issues, and cost spikes.

Automation & Tooling

  • Integrate observability into CI/CD and AgentOps pipelines.
  • Develop custom plugins/scripts to extend observability for LLMs, agents, and data pipelines.

Collaboration & Reporting

  • Work with AgentOps, DevOps, and Data Engineering teams to ensure system-wide observability.
  • Provide executive-level reporting on reliability, efficiency, and adoption metrics.

Continuous Improvement

  • Implement feedback loops to improve agent performance and reduce downtime.
  • Stay updated with state-of-the-art observability and AI monitoring frameworks.

SKILLS & QUALIFICATIONS

Must-Have

  • 3–6 years of experience in SRE, DevOps, or Observability Engineering.
  • Strong knowledge of observability tools (Prometheus, Grafana, ELK, OpenTelemetry, Jaeger).
  • Experience with cloud-native infrastructure (AWS, GCP, Azure) and Kubernetes monitoring.
  • Proficiency in Python, Go, or Bash for scripting and automation.
  • Understanding of AI/LLM pipelines, RAG systems, and vector databases.
  • Hands-on with CI/CD pipelines and monitoring-as-code.

Good-to-Have

  • Experience with AgentOps tools (LangSmith, PromptLayer, Arize AI, Weights & Biases).
  • Exposure to AI-specific observability (token usage, model latency, hallucination tracking).
  • Knowledge of Responsible AI monitoring frameworks.
  • Background in BFSI, GRC, SOC, or other regulated industries.

WHY SHOULD YOU JOIN US?

  • Agentic AI Product Company: Build observability frameworks for next-gen enterprise AI systems.
  • A Fast-Growing Category Leader: Be part of one of the fastest-growing AI Foundries, powering mission-critical agent deployments.
  • Career Mobility & Growth: Advance into roles like Reliability Architect, AgentOps Lead, or Head of Observability.
  • Global Exposure: Work on observability challenges across Fortune 500 enterprises and global innovators.
  • Create Real Impact: Ensure transparency, trust, and resilience in production-grade AI systems.
  • Culture of Excellence: Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) give you autonomy to innovate and accountability to deliver.
  • Responsible AI First: Help enterprises adopt AI that is not just powerful, but explainable and auditable.

XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values

  • Agency – Be self-directed and proactive.
  • Taste – Sweat the details and build with precision.
  • Ownership – Take responsibility for outcomes.
  • Mastery – Commit to continuous learning and growth.
  • Impatience – Move fast and embrace progress.
  • Customer Obsession – Always put the customer first.

Our Product Philosophy

  • Obsessed with Adoption – Making observability and trust an integral part of enterprise AI.
  • Obsessed with Simplicity – Turning complex monitoring into seamless, actionable insights.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making agentic AI systems transparent, observable, and reliable at scale.
