
Senior Big Data Engineer - Assistant Vice President
- Pune, Maharashtra
- Permanent
- Full-time
Responsibilities:
- Design, develop, and maintain robust, scalable, and efficient big data pipelines, primarily using PySpark, for data ingestion, transformation, and processing.
- Implement and manage data workflows using Apache Airflow, including designing DAGs (Directed Acyclic Graphs), configuring operators, and optimizing task dependencies for reliable, scheduled pipeline execution (a minimal DAG sketch follows this list).
- Optimize PySpark jobs and data workflows for performance, cost-efficiency, and resource utilization across distributed computing environments.
- Collaborate closely with data scientists, AI/ML engineers, and other stakeholders to translate analytical and machine learning requirements into highly performant and automated data solutions.
- Develop and implement data quality checks, validation rules, and monitoring mechanisms within PySpark jobs and Airflow DAGs to ensure data integrity and consistency (see the data quality sketch after this list).
- Troubleshoot, debug, and resolve issues in PySpark code and Airflow pipeline failures, ensuring high availability and reliability of data assets.
- Contribute to the architecture and evolution of our data platform, advocating for best practices in data engineering, automation, and operational excellence.
- Ensure data security, privacy, and compliance throughout the data lifecycle within the pipelines.
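
As a rough illustration of the Airflow work described above, here is a minimal sketch of a daily DAG that submits two PySpark jobs with an explicit task dependency. It assumes a recent Airflow 2.x release with the Apache Spark provider installed; the DAG ID, script paths, connection ID, and schedule are hypothetical.

```python
# Minimal sketch: a daily DAG that runs two hypothetical PySpark jobs in sequence.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_orders_pipeline",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = SparkSubmitOperator(
        task_id="ingest_raw_orders",
        application="/opt/jobs/ingest_raw_orders.py",      # hypothetical PySpark script
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],
    )

    transform = SparkSubmitOperator(
        task_id="transform_to_curated",
        application="/opt/jobs/transform_to_curated.py",   # hypothetical PySpark script
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],
    )

    # Task dependency: the transform runs only after ingestion succeeds.
    ingest >> transform
```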
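
In the same spirit, a minimal sketch of the kind of in-pipeline data quality gate mentioned above, run as its own step so that integrity violations fail the Airflow task rather than propagating downstream; the dataset path, column names, and rules are illustrative assumptions.

```python
# Minimal sketch: fail the job if basic integrity rules are violated.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_dq_check").getOrCreate()

df = spark.read.parquet("s3://example-bucket/curated/orders/")  # hypothetical path

total_rows = df.count()
null_keys = df.filter(F.col("order_id").isNull()).count()
duplicate_keys = total_rows - df.dropDuplicates(["order_id"]).count()

if total_rows == 0:
    raise ValueError("Data quality check failed: no rows found")
if null_keys > 0:
    raise ValueError(f"Data quality check failed: {null_keys} rows with null order_id")
if duplicate_keys > 0:
    raise ValueError(f"Data quality check failed: {duplicate_keys} duplicate order_id values")

print(f"Data quality check passed for {total_rows} rows")
```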
Qualifications:
- 7+ years of experience, including expert-level proficiency in PySpark for building and optimizing large-scale data processing applications.
- Strong hands-on experience with Apache Airflow, including DAG development, custom operators/sensors, connections, and deployment strategies.
- Proven experience in designing, building, and operating production-grade distributed data pipelines.
- Solid understanding of big data architectures, distributed computing principles, and data warehousing concepts.
- Proficiency in data modeling, schema design, and common data storage formats (e.g., Parquet, ORC, Delta Lake); a partitioned Parquet write sketch follows this list.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP), specifically their big data services (e.g., EMR, Databricks, HDInsight, Dataflow) and object storage (S3, ADLS, GCS).
- Demonstrated experience with version control systems, particularly Git.
- Excellent problem-solving, analytical, and debugging skills.
- Ability to work effectively both independently and as part of a collaborative, agile team.
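
To illustrate the storage formats and object stores listed above, a small sketch of reading raw CSV with an explicit schema and writing a partitioned Parquet dataset; the bucket, schema, and partition column are assumptions, not a prescribed layout.

```python
# Minimal sketch: schema-explicit read, then a partitioned Parquet write.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("orders_parquet_write").getOrCreate()

schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
])

orders = (
    spark.read
    .schema(schema)                                   # explicit schema avoids costly inference
    .option("header", "true")
    .csv("s3://example-bucket/raw/orders/")           # hypothetical landing zone
)

(
    orders
    .repartition("order_date")                        # group rows by partition value before the write
    .write
    .mode("overwrite")
    .partitionBy("order_date")                        # directory partitioning enables partition pruning
    .parquet("s3://example-bucket/curated/orders/")   # hypothetical curated zone
)
```

The same write could target Delta Lake by replacing `.parquet(...)` with `.format("delta").save(...)`, provided the Delta Lake package is available on the cluster.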
Preferred qualifications:
- Experience with containerization technologies (e.g., Docker, Kubernetes) for deploying PySpark applications or Airflow.
- Familiarity with CI/CD practices for data pipelines.
- Understanding of machine learning concepts and experience with data preparation for AI/ML models.
- Knowledge of other orchestration tools or workflow managers.
Education:
- Bachelor’s degree/University degree or equivalent experience.