Apache Airflow & AWS (S3/EMR/Bedrock) Data Platform Administrator / Operations Engineer

Zensar

  • Pune, Maharashtra
  • Permanent
  • Full-time
  • 17 days ago
Job Description:
  • Strong experience in Airflow DAG monitoring, including tracking task states, resolving DAG execution delays, and ensuring reliability across distributed environments.
  • Expertise in failure recovery, including retry strategies, SLA miss handling, backfilling, re-running failed task instances, and ensuring consistent pipeline execution across environments.
  • Hands-on experience providing SLA-based job execution support, ensuring time-critical pipelines meet business deadlines and production SLAs are continuously maintained.
  • Skilled in performing root cause analysis (RCA) for pipeline failures, including dependency failures, task-level exceptions, scheduler issues, and platform-level bottlenecks.
  • Experience in managing S3 storage optimization, including lifecycle policies, intelligent tiering, storage class transitions, versioning, and cost-effective data retention strategies.
  • Expertise in securing S3 environments using IAM policies, bucket policies, encryption (KMS), access logging, and object-level permissions.
  • Skilled in conducting cost usage analysis for S3 storage and recommending optimization strategies to reduce operational spend.
  • Strong background in administering Amazon EMR clusters, including cluster provisioning, configuration, autoscaling, and lifecycle management.
  • Experience supporting Amazon Bedrock environments, including model endpoint configuration, invocation monitoring, access control, and cost governance.
Responsibilities:
  • Monitor and manage Apache Airflow DAGs, ensuring timely execution, resolving delays, and maintaining reliability across distributed environments.
  • Perform failure recovery activities, including retries, SLA-miss handling, backfilling, and rerunning failed task instances for consistent pipeline execution.
  • Provide SLA-driven operational support to ensure critical data pipelines meet business timelines and production availability targets.
  • Conduct in-depth RCA for pipeline issues such as dependency failures, task exceptions, scheduler disruptions, and platform bottlenecks.
  • Optimize AWS S3 storage through lifecycle policies, intelligent tiering, storage class transitions, and cost-effective data retention strategies.
  • Implement strong S3 security using IAM roles, bucket policies, KMS encryption, access logging, and object-level access controls.
  • Analyze S3 usage patterns and recommend cost-optimization measures to minimize storage and operational spend.
  • Administer Amazon EMR clusters, including provisioning, configuration management, autoscaling, and end-to-end lifecycle operations.
  • Support Amazon Bedrock environments with model endpoint configuration, monitoring invocations, managing access controls, and ensuring cost governance.
Qualifications:
  • Strong hands-on experience administering Apache Airflow in distributed, production-grade environments.
  • Deep understanding of DAG orchestration, task execution states, scheduler behavior, and pipeline reliability practices.
  • Proven expertise in workflow recovery techniques: retries, backfilling, SLA handling, and task reruns.
  • Solid background in troubleshooting and performing RCA for pipeline, platform, and dependency failures.
  • Practical experience managing and optimizing AWS S3 storage, lifecycle rules, tiering, and cost management.
  • Strong knowledge of S3 security controls including IAM roles, bucket policies, KMS encryption, and access logging.
  • Experience conducting S3 usage/cost analysis and recommending optimization strategies.
  • Hands-on expertise in provisioning, configuring, and managing Amazon EMR clusters and autoscaling policies.
  • Working experience with Amazon Bedrock: model endpoint setup, usage monitoring, access governance, and cost oversight.
  • Ability to support 24×7 production environments with a strong focus on operational excellence and SLAs.
  • Strong analytical, problem-solving, and cross-team coordination skills for cloud and data platform operations.
About Us: At Zensar, we're “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client, a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus.

Part of the $4.8 billion RPG Group, we're a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. Explore and join us to be the best version of yourself.

We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.
