Staff Database Reliability Engineer - (PostgreSQL + Cloud)
Rackspace Technology
- Gurgaon, Haryana
- Permanent
- Full-time
- Someone who can work from the office (Hyderabad location)
- 8-10+ years in DBA / Platform Engineering
- Strong multi-cloud experience (Azure / AWS / GCP – at least two)
- Deep HA/DR & performance tuning expertise
- Automation-first mindset (Terraform, scripting, CI/CD)
- Experience in SaaS/DBaaS environments preferred
- Primary Database: PostgreSQL
- Secondary Databases: MySQL, Oracle, MS SQL Server
- Database Backup & Recovery: Tools and strategies for database backups and disaster recovery.
- Performance Tuning: Query optimization, indexing strategies, and database performance troubleshooting.
- Database Security: User management, roles, access control, and auditing.
- Cloud Platforms: AWS (RDS, Aurora), Azure (Cosmos DB, SQL Database), GCP (Cloud SQL, Firestore).
- Infrastructure as Code (IaC): Terraform, CloudFormation, Kubernetes.
- Kubernetes & Containers: Running databases in containers and on orchestrators such as Kubernetes.
- Observability Tools: ELK stack (Elasticsearch, Logstash, Kibana).
- Database Migration: Migrating databases across different platforms or cloud environments.
- Database Scaling: Vertical and horizontal scaling techniques in cloud environments.
- Incident Management: Handling database outages, incident response, and on-call rotations.
- Monitoring and Alerting: Tools like Prometheus, Grafana, Datadog, CloudWatch.
- Service Level Objectives (SLOs) / Service Level Agreements (SLAs): Ensuring uptime and performance targets.
- Disaster Recovery Planning: Designing and maintaining high availability (HA) and disaster recovery (DR) solutions.
- Scripting Languages: Python, shell scripting (Bash), PowerShell.
- Automation Tools: Ansible, Puppet, Chef.
- Infrastructure Automation: Automating database deployment, patching, and scaling.
- Networking Basics: TCP/IP, DNS, Firewall, Load Balancers.
- Database Connectivity: Connection pooling, failover strategies, and multi-region deployment.
- Storage and Disk Management: Understanding IOPS, latency, and throughput.
- Understanding of file systems (ext4, XFS, etc.), permissions, and ownership (chmod, chown, ACLs).
- Knowledge of process monitoring, management, and troubleshooting (ps, top, htop, kill, pkill, etc.).
- Proficiency with tools like top, htop, vmstat, iostat, sar, and dstat to monitor CPU, memory, disk I/O, and network usage.
- Ability to analyze system logs (/var/log/, journalctl, dmesg) for troubleshooting.
- Understanding of resource limits (CPU, memory, disk, network) and how they impact database performance.
- Knowledge of partitioning tools (fdisk, parted) and file system management (mkfs, mount, umount).
- Understanding of RAID configurations and Logical Volume Management (LVM) for storage scalability.
- Log Analysis: Reading and analysing database and system logs.
- Root Cause Analysis (RCA): Performing in-depth analysis after major incidents.
- Query Performance: Analysing slow queries, deadlocks, and resource contention.
- Communication Skills: Clear communication with stakeholders and engineering teams.
- Problem-Solving: Ability to troubleshoot complex database issues under pressure.
- Collaboration: Working closely with DevOps, Infrastructure, and Engineering teams.