Description

Must have:

· Strong hands-on experience with Apache Spark for data processing and analytics.
· Proficiency in writing advanced SQL queries, including complex joins, aggregations, and window functions (a brief illustrative sketch follows this list).
· Familiarity with Spark APIs and components such as Spark SQL, Spark Streaming, and PySpark.
· Understanding of distributed computing concepts and Spark architecture (e.g., RDDs, DAGs, partitions).
· Experience working with large datasets, data lakes, and data warehouses.
· Knowledge of file formats like Parquet, Avro, and ORC.
· Proven ability to optimize Spark jobs and SQL queries for efficiency and scalability.
· Strong problem-solving skills with attention to detail.
· Ability to collaborate effectively with cross-functional teams.
· Excellent communication skills for sharing insights and progress with stakeholders.
· Knowledge of Python, Scala, or Java for Spark application development.
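
For context on the SQL depth expected, here is a minimal, illustrative PySpark sketch (not part of any existing codebase) that combines a join, an aggregation, and a window function; the table and column names (orders, customers, customer_id, amount, region) are hypothetical placeholders.

    # Illustrative sketch only: a join, an aggregation, and a window function
    # expressed in Spark SQL. All table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("advanced-sql-sketch").getOrCreate()

    # Assumes "orders" and "customers" are already registered as temp views.
    spark.sql("""
        SELECT
            c.customer_id,
            c.region,
            SUM(o.amount) AS total_spend,
            RANK() OVER (PARTITION BY c.region
                         ORDER BY SUM(o.amount) DESC) AS spend_rank_in_region
        FROM orders o
        JOIN customers c
          ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.region
    """).show()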

Good to have:

· Experience with big data ecosystems like Hadoop, Hive, or HBase.
· Familiarity with workflow orchestration tools such as Apache Airflow or Luigi.
· Knowledge of NoSQL databases like MongoDB, Cassandra, or Elasticsearch.
· Experience deploying Spark jobs on cloud platforms (e.g., AWS EMR, Azure Synapse, or Google Dataproc).
· Familiarity with cloud data platforms like Snowflake, BigQuery, or Redshift.
· Scripting experience for automating repetitive tasks.
· Familiarity with monitoring tools like Prometheus, Grafana, or Spark’s built-in UI.
· Hands-on experience with debugging tools for Spark and SQL processes.
· Relevant certifications in big data (e.g., Databricks Certified Associate, Cloudera Certified Developer).
· Understanding of industry-specific data needs, such as finance, healthcare, or retail analytics.


What You'll Do

· Build and maintain distributed data processing pipelines using Apache Spark.
· Write efficient SQL queries to extract, transform, and analyze large datasets.
· Perform data cleansing, validation, and enrichment to ensure high-quality datasets.
· Optimize Spark jobs for performance, including tuning Spark configurations and improving query efficiency (see the sketch after this list).
· Implement partitioning, caching, and indexing strategies for large-scale data processing.
· Develop and manage ETL workflows to process data from various sources into data lakes or warehouses.
· Collaborate with data engineers to integrate data from structured and unstructured sources.
· Monitor Spark jobs and cluster performance, addressing bottlenecks and failures.
· Troubleshoot SQL queries and Spark processes to resolve performance and accuracy issues.
· Work closely with data engineers, analysts, and stakeholders to understand data requirements.
· Present findings and insights derived from large datasets to business teams.
· Document workflows, best practices, and troubleshooting guides for Spark and SQL usage.
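
To illustrate the optimization work described above, the sketch below shows common levers in PySpark: configuration tuning, repartitioning by a join/filter key, caching a reused DataFrame, and writing date-partitioned output. The paths, column names, and partition count are hypothetical placeholders, not recommendations for any specific workload.

    # Illustrative sketch only: typical Spark optimization levers.
    # All paths, column names, and settings are hypothetical examples.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("optimization-sketch")
        .config("spark.sql.shuffle.partitions", "200")  # tune shuffle parallelism
        .config("spark.sql.adaptive.enabled", "true")   # enable adaptive query execution
        .getOrCreate()
    )

    # Hypothetical input: a Parquet dataset in a data lake.
    events = spark.read.parquet("s3://example-bucket/events/")

    # Repartition by a frequently joined/filtered key to reduce skew,
    # then cache because the DataFrame is reused by downstream queries.
    events = events.repartition(200, "customer_id").cache()

    daily_counts = events.groupBy("event_date", "customer_id").count()

    # Write date-partitioned output so downstream readers can prune by date.
    daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/curated/daily_counts/"
    )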

Education

Any Graduate