Description

Overview

We are looking for an experienced ETL developer to design, build, and optimize PySpark-based data pipelines on Google Cloud Platform (GCP).

Key Responsibilities

  • Design, develop, and optimize PySpark-based ETL pipelines for large-scale data processing on Google Cloud Platform (GCP); a minimal pipeline sketch follows this list.
  • Work with BigQuery, Cloud Dataflow, Cloud Composer (Apache Airflow), and Cloud Storage for data transformation and orchestration.
  • Implement best practices for data governance, security, and monitoring in a cloud environment.
  • Collaborate with data engineers, analysts, and business stakeholders to understand data requirements.
  • Troubleshoot performance bottlenecks and optimize Spark jobs for efficient execution.
  • Automate data workflows using Apache Airflow or Cloud Composer.
  • Ensure data quality, validation, and consistency across pipelines.
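
The following is a minimal sketch of the kind of PySpark ETL pipeline described in the first bullet above: it reads raw files from Cloud Storage, applies basic cleansing and aggregation, and writes the result to BigQuery. All bucket, dataset, and table names are hypothetical placeholders, and the BigQuery write assumes the spark-bigquery-connector is available on the cluster (as it is on Dataproc).

    # Minimal PySpark ETL sketch (hypothetical names; assumes spark-bigquery-connector).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Extract: read raw CSV files landed in a Cloud Storage bucket.
    raw = (spark.read
           .option("header", True)
           .csv("gs://example-raw-bucket/orders/*.csv"))

    # Transform: type casting, basic cleansing, and deduplication.
    orders = (raw
              .withColumn("order_ts", F.to_timestamp("order_ts"))
              .withColumn("amount", F.col("amount").cast("double"))
              .dropna(subset=["order_id", "amount"])
              .dropDuplicates(["order_id"]))

    # Aggregate to a daily revenue figure.
    daily_revenue = (orders
                     .groupBy(F.to_date("order_ts").alias("order_date"))
                     .agg(F.sum("amount").alias("revenue")))

    # Load: write the aggregate to BigQuery via the Spark-BigQuery connector.
    (daily_revenue.write
     .format("bigquery")
     .option("table", "example_dataset.daily_revenue")
     .option("temporaryGcsBucket", "example-temp-bucket")
     .mode("overwrite")
     .save())

    spark.stop()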

Required Skills & Experience

  • 5+ years of experience in ETL development with a focus on PySpark.
  • Strong hands-on experience with Google Cloud Platform (GCP) services, including:
      • BigQuery
      • Cloud Dataflow / Apache Beam
      • Cloud Composer (Apache Airflow); a sample DAG sketch follows this list
      • Cloud Storage
  • Proficiency in Python and PySpark for big data processing.
  • Experience with data lake architectures and data warehousing concepts.
  • Knowledge of SQL for data querying and transformation.
  • Experience building CI/CD pipelines to automate data pipeline deployments.
  • Strong debugging and problem-solving skills.
  • Experience with Kafka or Pub/Sub for real-time data processing.
  • Knowledge of Terraform for infrastructure automation on GCP.
  • Experience with containerization (Docker, Kubernetes).
  • Familiarity with DevOps and monitoring tools like Prometheus, Stackdriver, or Datadog.
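
As referenced under Cloud Composer above, the sketch below shows a minimal Airflow DAG that could schedule the PySpark job from the earlier sketch on a Dataproc cluster. Project, region, cluster, and file names are hypothetical placeholders, and it assumes an Airflow 2.x environment with the Google provider package installed (standard on Cloud Composer 2).

    # Minimal Cloud Composer / Airflow DAG sketch (hypothetical names).
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocSubmitJobOperator,
    )

    PYSPARK_JOB = {
        "reference": {"project_id": "example-project"},
        "placement": {"cluster_name": "example-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://example-code-bucket/orders_etl.py"},
    }

    with DAG(
        dag_id="orders_etl_daily",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # run the ETL job once per day
        catchup=False,
    ) as dag:
        # Submit the PySpark ETL job to an existing Dataproc cluster.
        DataprocSubmitJobOperator(
            task_id="run_orders_etl",
            job=PYSPARK_JOB,
            region="us-central1",
            project_id="example-project",
        )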

Skills:

GCP, PySpark, ETL

Education

Any Graduate