
Site Reliability Engineer II (Spark, Python, AWS/GCP)
- Bangalore, Karnataka
- Permanent
- Full-time

Responsibilities:
- Design, implement, and maintain highly available Apache Spark clusters and big data infrastructure across cloud and on-premises environments
- Monitor and optimize performance of distributed data processing workloads, ensuring SLA compliance and minimal downtime
- Implement comprehensive monitoring, alerting, and observability solutions for big data pipelines and infrastructure components
- Lead incident response and post-mortem analysis for data platform outages, implementing preventive measures to avoid recurrence
- Develop and maintain Infrastructure as Code (IaC) solutions using tools like Terraform, Ansible, or CloudFormation for big data infrastructure provisioning
- Build automated deployment pipelines and CI/CD workflows for Spark applications and data platform components
- Create and maintain runbooks, operational procedures, and disaster recovery plans for critical data systems
- Implement capacity planning and auto-scaling solutions to handle varying data processing workloads efficiently
- Collaborate with data engineering teams to optimize Spark job configurations, cluster sizing, and resource allocation
- Design and implement data platform governance, security, and compliance measures
- Evaluate and integrate new big data technologies and tools to improve platform capabilities and performance
- Establish best practices for code deployment, configuration management, and system maintenance

Qualifications:
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles with a focus on distributed systems
- Deep hands-on experience with Apache Spark (Scala, Python/PySpark) and Spark cluster management (YARN, Kubernetes, or standalone)
- Proficiency with big data ecosystem technologies including Hadoop, HDFS, Hive, Kafka, Airflow, and data lakes/warehouses
- Strong experience with cloud platforms (AWS, GCP, or Azure) and their big data services (EMR, Dataproc, HDInsight, etc.)
- Advanced knowledge of containerization technologies (Docker, Kubernetes) and orchestration in data processing contexts
- Experience with infrastructure monitoring and observability tools (Prometheus, Grafana, ELK stack, Datadog, or similar)
- Proficiency in Infrastructure as Code tools (Terraform, CloudFormation, Ansible) for managing big data infrastructure
- Strong Linux/Unix system administration skills and experience with configuration management tools
- Knowledge of networking, security, and performance tuning in distributed computing environments
- Proficient in at least one programming language (Python, Scala, Java, or Go) for automation and tooling development
- Experience with CI/CD pipelines and version control systems (Git, Jenkins, GitLab CI, or similar)
- Strong scripting skills (Bash, Python) for automation and operational tasks
- Understanding of software engineering best practices including testing, code review, and documentation
- Experience with stream processing frameworks (Kafka Streams, Apache Flink, or Spark Streaming)
- Knowledge of data governance, data quality, and data lineage tools
- Familiarity with machine learning operations (MLOps) and model deployment at scale
- Experience with database technologies (SQL, NoSQL) and data warehouse solutions
- Relevant certifications in cloud platforms or big data technologies

Benefits:
- Competitive base salaries
- Bonus incentives
- Support for financial well-being and retirement
- Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
- Flexible working model with hybrid, onsite, or virtual arrangements, depending on role and business need
- Generous paid parental leave policies (depending on location)
- Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
- Free and confidential counseling support through our Healthy Minds program
- Career development and training opportunities