Sr Architect, Systems
TMUS Global Solutions
- Hyderabad, Telangana
- Permanent
- Full-time
Key Responsibilities:
- Lead real-time production triage for highly escalated incidents (app, platform, network, data) and drive mitigation or failover.
- Design and evolve end-to-end observability (structured logs, metrics, traces, events, correlation IDs) to cut MTTD and eliminate blind spots.
- Perform deep performance engineering (latency breakdown, GC/heap tuning, thread/async analysis, CPU/memory/I/O profiling) and eliminate tail latency.
- Analyze incident and alert trends to remove systemic failure modes and reduce repeat occurrences and noisy alert sources.
- Provide recommendations on optimizing Kubernetes workloads (resource requests/limits, HPA, pod disruption budgets, affinity/anti-affinity, ingress, service mesh traffic) for resilience and efficiency.
- Build automation and self-healing (runbook codification, dependency health probes, pre-flight deployment guards, drift and config integrity checks).
- Contribute to post-incident reviews, producing clear causal chains, durable remediation actions, and tracked ownership to closure.
- Enhance release and change safety with automated rollback and SLO guardrails.
- Drive capacity and scalability planning (forecast saturation, right-size clusters, assess quota limits, model concurrency vs throughput) to prevent resource exhaustion.
- Maintain authoritative runbooks, architecture dependency maps, DR playbooks, and reliability scorecards for transparency and onboarding speed.
- Partner with development, platform, security, and data teams to embed reliability patterns (idempotency, bulkheads, circuit breakers, backpressure) early in design.
- Proactively surface emerging risks (error budget degradation, scaling inflection points, capacity shortfalls, aging certificates) before they become incidents.
Qualifications:
- Production triage, troubleshooting, and problem-solving skills, plus clear incident communication (concise timeline narration, stakeholder updates, executive summaries, remediation advocacy).
- Strong production Kubernetes expertise (controllers, scheduling behaviour, networking, ingress, service mesh, resource tuning, multi-cluster operations); CKAD or CKA certification preferred.
- Proficiency in one language (Java, Go, or Python) for building diagnostic tooling, automation services, performance harnesses, and reliability utilities.
- Solid database and SQL capability (query tuning, indexing, execution plan analysis) plus familiarity with at least one NoSQL or caching layer (e.g., DynamoDB, MongoDB).
- Deep observability stack usage (Splunk, Prometheus, Grafana, OpenTelemetry, tracing systems, APM tools) and alert noise reduction techniques.
- Performance profiling mastery (async-profiler, flame graphs, thread and heap dumps, network and syscall analysis).
- Strong Linux/Unix internals knowledge (process scheduling, cgroups, kernel signals, network stack, filesystem and I/O, perf/strace/tcpdump/iostat/sar tooling).
- Automation and infrastructure-as-code experience (Ansible, Helm, GitOps pipelines, CI/CD gating, self-heal workflows).
- Strong log, metric, and trace correlation skills for root cause isolation across microservices, queues, caches, and external dependencies.
- Messaging and event streaming familiarity (Kafka, SQS, RabbitMQ) including lag analysis, consumer scaling, ordering, and replay strategies.
- Ownership mindset with collaborative influence, mentoring peers in production debugging, reliability principles, and continuous improvement discipline.
- Practical SRE framework implementation (SLI taxonomy, SLO lifecycle, error budget policies, toil reduction, reliability scorecards).
- Distributed systems resilience patterns (circuit breakers, retries with jitter, timeouts, bulkheading, idempotent semantics, backpressure, graceful degradation).
- Hands-on multi-region AWS and/or Azure experience (load balancing, autoscaling, Route53/DNS/Azure DNS, storage replication, DR and failover orchestration).
- Demonstrated proactive risk identification (capacity hotspots, noisy dependencies, cascading failure precursors, config drift, expiring certs/secrets).