Description

Responsibilities:

• Design, develop, and maintain scalable data pipelines and systems for data processing.

• Utilize Hadoop and related technologies to manage large-scale data processing.

• Perform data ingestion using Kafka, Spark, Sqoop, and various file formats, and process the data into Hive using Beeline/Spark (an illustrative ingestion sketch follows this list).

• Develop and maintain shell scripts to automate data processing tasks.

• Implement full and incremental data loading strategies to ensure data consistency and availability (see the incremental-load sketch after this list).

• Orchestrate and monitor workflows using Apache Airflow (a minimal DAG sketch follows this list).

• Ensure code quality and version control using Git.

• Troubleshoot and resolve data-related issues in a timely manner.

• Stay up-to-date with the latest industry trends and technologies to continuously improve our data infrastructure.
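
For illustration, a minimal PySpark sketch of the file-to-Hive part of this ingestion; the path, database, and table names below are placeholders, not project specifics.

# Read a landing-zone file and persist it as a Hive table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ingest_orders")      # hypothetical job name
    .enableHiveSupport()           # needed so saveAsTable writes a Hive table
    .getOrCreate()
)

# Spark reads the common file formats (CSV, JSON, Parquet, ORC, Avro) directly.
raw = spark.read.option("header", "true").csv("/landing/orders/")

# Light cleanup before the write, e.g. normalising column names.
cleaned = raw.toDF(*[c.strip().lower() for c in raw.columns])

# The resulting table is queryable from Beeline as staging.orders_raw.
cleaned.write.mode("append").format("parquet").saveAsTable("staging.orders_raw")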

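A full load simply rewrites the target, while an incremental load keeps a high-water mark and appends only newer rows. Below is a sketch of that pattern; the table and column names (etl_batch_control, updated_at) are illustrative only.

# Incremental load driven by a stored high-water-mark timestamp.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Timestamp reached by the previous successful run (None on the first run).
last_loaded = spark.sql(
    "SELECT MAX(loaded_until) AS ts FROM etl_batch_control WHERE source = 'orders'"
).collect()[0]["ts"]

source = spark.table("staging.orders_raw")

if last_loaded is None:
    delta = source                                                    # full load
else:
    delta = source.filter(F.col("updated_at") > F.lit(last_loaded))   # incremental load

delta.write.mode("append").saveAsTable("curated.orders")

# Advance the watermark so the next run resumes where this one stopped.
new_mark = delta.agg(F.max("updated_at")).collect()[0][0]
if new_mark is not None:
    spark.sql(f"INSERT INTO etl_batch_control VALUES ('orders', TIMESTAMP '{new_mark}')")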
 
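With Airflow in the stack, jobs like the two above would typically hang off a small DAG such as the following; the dag_id, schedule, and spark-submit paths are assumptions, and the syntax targets Airflow 2.x.

# Daily pipeline: ingest first, then run the incremental load.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # run once a day at 02:00
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_to_staging",
        bash_command="spark-submit /opt/jobs/ingest_orders.py",
    )
    load = BashOperator(
        task_id="incremental_load",
        bash_command="spark-submit /opt/jobs/incremental_load.py",
    )

    ingest >> load   # ingestion must finish before the load starts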

Requirements:

• Proven experience as a Data Engineer (ETL, data warehousing).

• Strong knowledge of Hadoop and its ecosystem (HDFS, YARN, MapReduce, Tez, and Spark).

• Proficiency in Kafka, Spark, Sqoop, and Hive.

• Experience with shell scripting for automation.

• Expertise in full and incremental data loading techniques.

• Excellent problem-solving skills and attention to detail.

• Ability to work collaboratively in a team environment and communicate effectively with stakeholders.

 

Good to have:

• Understanding of PySpark and its application in real-time data processing (see the streaming sketch after this list).

• Hands-on experience with Apache Airflow for workflow orchestration.

• Proficiency with Git for version control.

• Experience with PostgreSQL, SQL Server, or MSBI.
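
As a sketch of what the PySpark real-time item above might look like in practice; the broker address, topic, and checkpoint path are placeholders, and the spark-sql-kafka connector package is assumed to be available.

# Real-time processing with Structured Streaming: count Kafka events per minute.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

# Consume a Kafka topic as an unbounded DataFrame.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders_events")
    .load()
)

# Kafka values arrive as bytes; cast to string and aggregate per 1-minute window.
counts = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

# Console sink keeps the example self-contained; a real job would write to Hive or Kafka.
query = (
    counts.writeStream.outputMode("append")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/orders_stream")
    .start()
)
query.awaitTermination()
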

Education

Any Graduate