Description

Responsibilities:

  • Lead the development and architecture of scalable data processing systems using PySpark.
  • Design and implement efficient and reliable data pipelines, data lakes, and ETL workflows.
  • Fine-tune Spark applications for optimal performance, including configuration tuning, memory management, and resource allocation.
  • Collaborate with data engineers, data scientists, and stakeholders to understand data processing requirements and deliver robust solutions.
  • Manage and optimize Spark clusters, ensuring high availability and performance, utilizing tools like Kubernetes, YARN, and Mesos.
  • Work with big data storage solutions such as HDFS, S3, Parquet, and ORC to manage data storage and retrieval efficiently.
  • Utilize Spark SQL, DataFrames, and Dataset APIs to perform complex data transformations and analytics.
  • Apply best practices in distributed computing principles and stay current with the latest technologies and trends in big data processing.

Requirements:

  • 10+ years of experience as a Lead Spark Developer, Data Engineer, or in a similar role, with extensive hands-on PySpark experience.
  • Strong proficiency in Python and Spark APIs.
  • Deep understanding of distributed computing principles, architectures, and best practices.
  • Expertise in designing and developing fault-tolerant and scalable data processing systems.
  • Strong skills in tuning Spark applications, including configuration, memory, and resource management.
  • Experience with cluster management tools such as Kubernetes, YARN, or Mesos.
  • Practical knowledge of big data storage solutions including HDFS, S3, and formats like Parquet and ORC.
  • Demonstrated ability to design and implement efficient data pipelines and data lakes.
  • Excellent problem-solving, communication, and collaboration skills.
Education

Any Graduate