
Lead Data Engineer (PySpark)
- Hyderabad, Telangana
- Permanent
- Full-time
In this role, you will be instrumental in designing, developing, and optimizing our next-generation data pipelines and data platforms. You will work with large-scale datasets, solve complex data challenges, and contribute to building robust, scalable, and efficient data solutions that drive business value.
This is an exciting opportunity for someone passionate about big data technologies, performance optimization, and building resilient data infrastructure. As a Data Engineer, you will:
- Performance Optimization: Identify, diagnose, and resolve complex performance bottlenecks in PySpark jobs and Spark clusters, leveraging Spark UI, query plans, and advanced optimization techniques (e.g., partitioning, caching, broadcasting, AQE, UDF optimization).
- Design & Development: Lead the design and implementation of highly scalable, fault-tolerant, and optimized ETL/ELT pipelines using PySpark for batch and potentially real-time data processing.
- Data Modeling: Collaborate with data scientists, analysts, and product teams to understand data requirements and design efficient data models (e.g., star/snowflake schemas, SCDs) for analytical and operational use cases.
- Data Quality & Governance: Implement robust data quality checks, monitoring, and alerting mechanisms to ensure the accuracy, consistency, and reliability of our data assets.
- Architectural Contributions: Contribute to the overall data architecture strategy, evaluating new technologies and best practices to enhance our data platform's capabilities and efficiency.
- Code Review & Best Practices: Promote and enforce engineering best practices, including code quality, testing, documentation, and version control (Git). Participate actively in code reviews.
- Mentorship & Leadership: Mentor junior data engineers, share knowledge, and contribute to a culture of continuous learning and improvement within the team.
- Collaboration: Work closely with cross-functional teams including software engineers, data scientists, product managers, and business stakeholders to deliver impactful data solutions.
What you'll bring:
- 8+ years of professional experience in data engineering, with at least 4 years specifically focused on PySpark development and optimization in a production environment.
- Expert-level proficiency in PySpark including Spark SQL, DataFrames, RDDs, and understanding of Spark's architecture (Driver, Executors, Cluster Manager, DAG).
- Strong hands-on experience with optimizing PySpark performance on large datasets, debugging slow jobs using Spark UI, and addressing common issues like data skew, shuffles, and memory management.
- Excellent programming skills in Python with a focus on writing clean, efficient, and maintainable code.
- Proficiency in SQL for complex data manipulation, aggregation, and querying.
- Basic understanding of data warehousing concepts (dimensional modeling, ETL/ELT processes, data lakes, data marts).
- Experience with distributed data storage formats such as Delta Lake and Apache Parquet.
- Familiarity with version control systems (Git).
- Strong problem-solving abilities, analytical skills, and attention to detail.
- Excellent communication and interpersonal skills, with the ability to explain complex technical concepts to both technical and non-technical audiences.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related quantitative field.
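One common remedy for the data skew mentioned above is key salting. The idea can be sketched in plain Python, independent of Spark; the function name, key names, and salt count here are all illustrative:

```python
from collections import Counter

def salt_rows(rows, hot_keys, num_salts=8):
    """Spread rows of skewed ("hot") keys across num_salts sub-keys so a
    downstream group-by or join shuffles them to many partitions instead
    of one. Non-hot keys keep a fixed salt of 0."""
    out = []
    for i, (key, value) in enumerate(rows):
        salt = i % num_salts if key in hot_keys else 0
        out.append(((key, salt), value))
    return out

# 90% of rows share one key — the classic skew shape.
rows = [("hot", i) for i in range(90)] + [("cold", i) for i in range(10)]
salted = salt_rows(rows, hot_keys={"hot"})

counts = Counter(k for k, _ in salted)
# The hot key's 90 rows now split across 8 sub-keys of ~11 rows each,
# while "cold" stays on a single sub-key.
print(max(counts.values()))  # 12
```

When salting one side of a Spark join, the other (small) side must be replicated once per salt value so every salted sub-key still finds its match.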
What we offer:
- Competitive salary
- Self & Family Health Insurance
- Term & Life Insurance
- OPD Benefits
- Mental wellbeing support through Plumm
- Learning & Development Budget
- WFH Setup allowance
- 15 days of privilege leave
- 12 days of casual leave
- 12 days of sick leave
- 3 paid days off for volunteering or L&D activities
- Stock Options