Required Qualifications (Must-Have)
· Programming Skills: Advanced proficiency in Python, particularly with libraries such as NumPy and Pandas for data manipulation and analysis.
· Parquet Experience: Strong experience with Parquet files, including reading, writing, and optimizing them for performance and storage efficiency (example sketched after this list).
· Data Structure Manipulation: Ability to construct and manipulate core Python data structures such as lists, strings, dictionaries, and tuples.
· Data Exploration: Familiarity with exploring and visualizing data, and with comparing metrics across large CSV and Parquet files, including partitioned Parquet datasets.
· Advanced Data Techniques: Strong skills in joins, merges, pivot tables, grouping, and window functions in Python or SQL (example sketched after this list).
· Version Control: Strong working knowledge of Git, including everyday commands such as git push and git clone, for collaborative development.
· Linux Proficiency: Experience with Linux commands and shell scripting for data operations.
· Data Pipeline Experience: Proven experience in building and managing data ingestion pipeline scripts, including batch and real-time processing.
· REST API Knowledge: Familiarity with building REST APIs and securing them through API key validation and other authentication mechanisms (example sketched after this list).
· Debugging Skills: Excellent debugging skills, with a demonstrated ability to troubleshoot complex data pipeline architectures.
· Leadership Experience: Prior experience leading a technical team and mentoring junior engineers.
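To give candidates a concrete sense of the Parquet work described above, here is a minimal pandas/pyarrow sketch of the read/write/partitioning tasks involved; the paths and column names (events.csv, event_date, region, amount) are placeholders, not a prescribed dataset.

```python
import pandas as pd

# Load a large CSV, then write it back as partitioned, compressed Parquet.
df = pd.read_csv("events.csv", parse_dates=["event_date"])
df.to_parquet(
    "events_parquet/",          # directory dataset, one folder per partition
    engine="pyarrow",
    partition_cols=["region"],  # partition pruning speeds up per-region scans
    compression="snappy",       # a common balance of speed and size
)

# Read back only the columns needed; column pruning is where Parquet
# typically beats CSV on both I/O and memory.
slim = pd.read_parquet("events_parquet/", columns=["event_date", "amount"])
print(slim.describe())
```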
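Likewise, for the joins, grouping, pivot, and window-function skills, a small self-contained pandas sketch; the orders/customers frames are invented for illustration, and the cumulative sum mirrors SQL's SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_id).

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 20],
    "amount": [50.0, 75.0, 20.0, 130.0],
})
customers = pd.DataFrame({"customer_id": [10, 20], "region": ["east", "west"]})

# Join orders to customers, then aggregate per region.
merged = orders.merge(customers, on="customer_id", how="left")
per_region = merged.groupby("region", as_index=False)["amount"].sum()

# Pivot table: total amount by region.
pivot = merged.pivot_table(index="region", values="amount", aggfunc="sum")

# Window-style operation: running total per customer.
merged["running_total"] = merged.groupby("customer_id")["amount"].cumsum()
print(per_region, pivot, merged, sep="\n\n")
```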
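And for the REST API item, one possible shape of API-key validation, sketched here with FastAPI (an assumed framework choice; any stack with equivalent hooks would do). The X-API-Key header name and SERVICE_API_KEY environment variable are placeholders.

```python
import os
from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")  # placeholder header name

def require_api_key(key: str = Security(api_key_header)) -> str:
    # Compare against a key held in the environment, never hard-coded.
    if key != os.environ.get("SERVICE_API_KEY"):
        raise HTTPException(status_code=401, detail="invalid API key")
    return key

@app.get("/health")
def health(_: str = Depends(require_api_key)) -> dict:
    return {"status": "ok"}
```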
Preferred Qualifications (Good-to-Have)
· Cloud Platform Knowledge: Experience with cloud platforms, preferably AWS (S3, Lambda, Redshift), for data storage and processing (example sketched at the end of this posting).
· Workflow Orchestration: Familiarity with Apache Airflow or similar workflow orchestration tools for scheduling and monitoring workflows (example sketched at the end of this posting).
· Containerization: Knowledge of containerization technologies (Docker, Kubernetes) for deploying data pipelines in a scalable manner.
· Object-Oriented Programming: Solid experience with object-oriented programming patterns, multithreading, and multiprocessing (example sketched at the end of this posting).
· Spark Applications: Experience developing Apache Spark applications in Python (PySpark), including Spark SQL, Spark Streaming, DataFrames, and RDDs (example sketched at the end of this posting).
· Communication Skills: Excellent verbal and written communication skills, with the ability to convey technical concepts to non-technical stakeholders.
· Education: Bachelor's degree in Computer Science.
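The sketches below illustrate, under the same caveats, several of the preferred qualifications. First, the AWS item: a minimal boto3 interaction with S3, where the bucket and key names are placeholders and credentials are assumed to come from the environment or an IAM role.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local Parquet file, then list what sits under the prefix.
s3.upload_file("events.parquet", "example-data-bucket", "raw/events.parquet")
resp = s3.list_objects_v2(Bucket="example-data-bucket", Prefix="raw/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```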
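For the workflow-orchestration item, a minimal Airflow DAG written against the Airflow 2.x API (where the schedule argument replaced schedule_interval); the task callables and daily cadence are illustrative only.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw files")           # stand-in for a real extract step

def transform():
    print("clean and write Parquet")  # stand-in for a real transform step

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2  # extract runs before transform
```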
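For the multithreading/multiprocessing item, a short contrast of thread pools (suited to I/O-bound work) versus process pools (CPU-bound work that must sidestep the GIL); fetch() and crunch() are stand-in functions.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch(url: str) -> int:   # I/O-bound: threads are usually sufficient
    return len(url)           # placeholder for a real network call

def crunch(n: int) -> int:    # CPU-bound: processes bypass the GIL
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        sizes = list(pool.map(fetch, urls))
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(crunch, [10_000, 20_000]))
    print(sizes, results)
```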
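Finally, for the Spark item, a PySpark sketch touching DataFrames, Spark SQL, and a window function; the input path events_parquet/ and its columns are placeholders.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.parquet("events_parquet/")

# Spark SQL over a temporary view.
df.createOrReplaceTempView("events")
spark.sql("SELECT region, SUM(amount) AS total FROM events GROUP BY region").show()

# Equivalent DataFrame API plus a per-region ranking window.
w = Window.partitionBy("region").orderBy(F.col("amount").desc())
df.withColumn("rank_in_region", F.row_number().over(w)).show()
spark.stop()
```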