Apache Airflow & AWS (S3/EMR/Bedrock) Data Platform Administrator / Operations Engineer
Zensar
- Pune, Maharashtra
- Permanent
- Full-time
Requirements:
- Strong experience in Airflow DAG monitoring, including tracking task states, resolving DAG execution delays, and ensuring reliability across distributed environments.
- Expertise in failure recovery, including retry strategies, SLA miss handling, backfilling, re-running failed task instances, and ensuring consistent pipeline execution across environments.
- Hands-on experience providing SLA-based job execution support, ensuring time-critical pipelines meet business deadlines and production SLAs.
- Skilled in performing root cause analysis (RCA) for pipeline failures, including dependency failures, task-level exceptions, scheduler issues, and platform-level bottlenecks.
- Experience in managing S3 storage optimization, including lifecycle policies, intelligent tiering, storage class transitions, versioning, and cost-effective data retention strategies.
- Expertise in securing S3 environments using IAM policies, bucket policies, encryption (KMS), access logging, and object-level permissions.
- Skilled in conducting cost and usage analysis for S3 storage and recommending optimization strategies to reduce operational spend.
- Strong background in administering Amazon EMR clusters, including cluster provisioning, configuration, autoscaling, and lifecycle management.
- Experience supporting Amazon Bedrock environments, including model endpoint configuration, invocation monitoring, access control, and cost governance.
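The retry strategies and failed-task reruns called out above can be sketched in plain Python. This is a minimal, hypothetical helper mirroring Airflow-style per-task retries with exponential backoff; the function name and defaults are illustrative, not Airflow's API:

```python
import time

def run_with_retries(task, retries=3, retry_delay=5.0, backoff=2.0):
    """Rerun a failing task instance, mirroring Airflow-style retry settings.

    task        -- zero-argument callable representing one task instance
    retries     -- extra attempts allowed after the first failure
    retry_delay -- seconds to wait before the first retry
    backoff     -- multiplier applied to the delay after each failure
    """
    delay = retry_delay
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure for RCA
            time.sleep(delay)
            delay *= backoff
```

In Airflow itself the equivalent knobs live in a task's `default_args` (`retries`, `retry_delay`, `retry_exponential_backoff`), with `sla` covering the SLA-miss handling described above.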
Responsibilities:
- Monitor and manage Apache Airflow DAGs, ensuring timely execution, resolving delays, and maintaining reliability across distributed environments.
- Perform failure recovery activities, including retries, SLA-miss handling, backfilling, and rerunning failed task instances for consistent pipeline execution.
- Provide SLA-driven operational support to ensure critical data pipelines meet business timelines and production availability targets.
- Conduct in-depth RCA for pipeline issues such as dependency failures, task exceptions, scheduler disruptions, and platform bottlenecks.
- Optimize AWS S3 storage through lifecycle policies, intelligent tiering, storage class transitions, and cost-effective data retention strategies.
- Implement strong S3 security using IAM roles, bucket policies, KMS encryption, access logging, and object-level access controls.
- Analyze S3 usage patterns and recommend cost-optimization measures to minimize storage and operational spend.
- Administer Amazon EMR clusters, including provisioning, configuration management, autoscaling, and end-to-end lifecycle operations.
- Support Amazon Bedrock environments with model endpoint configuration, monitoring invocations, managing access controls, and ensuring cost governance.
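The lifecycle-policy work above comes down to rule documents of the shape accepted by boto3's `put_bucket_lifecycle_configuration`. A minimal sketch, assuming the standard S3 lifecycle rule schema; the prefix, day counts, and helper name are illustrative:

```python
def lifecycle_rule(prefix, transitions, expire_days):
    """Build one S3 lifecycle rule dict: tier objects under `prefix`
    through the given (days, storage_class) transitions, then expire them."""
    return {
        "ID": f"manage-{prefix.rstrip('/') or 'bucket'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in transitions
        ],
        "Expiration": {"Days": expire_days},
    }

# Example: logs move to Infrequent Access at 30 days, Glacier at 90,
# and are deleted after a year.
config = {
    "Rules": [
        lifecycle_rule("logs/", [(30, "STANDARD_IA"), (90, "GLACIER")], 365)
    ]
}
```

Applying `config` with boto3 would replace the bucket's existing lifecycle configuration, so in practice existing rules are fetched and merged first.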
Skills & Qualifications:
- Strong hands-on experience administering Apache Airflow in distributed, production-grade environments.
- Deep understanding of DAG orchestration, task execution states, scheduler behavior, and pipeline reliability practices.
- Proven expertise in workflow recovery techniques: retries, backfilling, SLA handling, and task reruns.
- Solid background in troubleshooting and performing RCA for pipeline, platform, and dependency failures.
- Practical experience managing and optimizing AWS S3 storage, lifecycle rules, tiering, and cost management.
- Strong knowledge of S3 security controls including IAM roles, bucket policies, KMS encryption, and access logging.
- Experience conducting S3 usage/cost analysis and recommending optimization strategies.
- Hands-on expertise in provisioning, configuring, and managing Amazon EMR clusters and autoscaling policies.
- Working experience with Amazon Bedrock: model endpoint setup, usage monitoring, access governance, and cost oversight.
- Ability to support 24×7 production environments with a strong focus on operational excellence and SLAs.
- Strong analytical, problem-solving, and cross-team coordination skills for cloud and data platform operations.
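The S3 cost/usage analysis mentioned throughout can be sketched as a simple per-class estimate. The prices below are placeholders, not current AWS list prices; a real analysis would pull figures from Cost Explorer or S3 Storage Lens:

```python
# Hypothetical per-GB monthly prices (illustrative only).
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
}

def monthly_storage_cost(gb_by_class):
    """Estimate monthly spend from GB stored in each storage class."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())

def savings_if_tiered(gb, from_class, to_class):
    """Monthly saving from moving `gb` between two storage classes."""
    return (PRICE_PER_GB_MONTH[from_class] - PRICE_PER_GB_MONTH[to_class]) * gb
```

Estimates like these feed the tiering recommendations above: quantify the spend per class, then rank prefixes by the saving a storage-class transition would yield.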