Description

Key Responsibilities:

  • Design and implement scalable data processing pipelines for ML training and validation
  • Build and maintain feature stores with support for both batch and real-time features
  • Develop data quality monitoring, validation, and testing frameworks (see the sketch after this list)
  • Create systems for dataset versioning, lineage tracking, and reproducibility
  • Implement automated data documentation and discovery tools
  • Design efficient data storage and access patterns for ML workloads
  • Partner with data scientists to optimize data preparation workflows
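
As a rough illustration of the pipeline and data-quality responsibilities above, the sketch below shows a minimal batch validation gate in Python (the role's primary language). The column names, thresholds, and storage path are hypothetical placeholders, not a description of this team's actual checks.

    # Minimal sketch of a batch data-quality gate; names and thresholds are hypothetical.
    import pandas as pd

    def validate_training_data(df: pd.DataFrame) -> list[str]:
        """Return human-readable validation failures (an empty list means the frame passed)."""
        failures = []

        # Schema check: required feature columns must be present.
        required_columns = {"user_id", "event_timestamp", "label"}  # hypothetical schema
        missing = required_columns - set(df.columns)
        if missing:
            failures.append(f"missing columns: {sorted(missing)}")
            return failures  # later checks assume the schema is intact

        # Completeness check: the label column may not contain nulls.
        null_rate = df["label"].isna().mean()
        if null_rate > 0:
            failures.append(f"label null rate {null_rate:.2%} exceeds 0%")

        # Freshness check: the newest event must be recent enough for training.
        latest = pd.to_datetime(df["event_timestamp"], utc=True).max()
        age = pd.Timestamp.now(tz="UTC") - latest
        if age > pd.Timedelta(days=1):
            failures.append(f"data is stale by {age}")

        return failures

    if __name__ == "__main__":
        # Hypothetical path; reading gs:// URIs with pandas requires the gcsfs package.
        frame = pd.read_parquet("gs://example-bucket/training/daily.parquet")
        problems = validate_training_data(frame)
        if problems:
            raise ValueError("data quality gate failed: " + "; ".join(problems))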

Technical Requirements:

  • 7+ years of software engineering experience, with 3+ years in data infrastructure
  • Strong expertise in GCP's data and ML infrastructure:
    • BigQuery for data warehousing
    • Dataflow for data processing
    • Cloud Storage for data lakes
    • Vertex AI Feature Store for feature management and serving
    • Cloud Composer (managed Airflow) for pipeline orchestration
    • Dataproc for Spark workloads
  • Deep expertise in data processing frameworks (Spark, Beam, Flink)
  • Experience with feature stores (Feast, Tecton) and data versioning tools
  • Proficiency in Python and SQL
  • Experience with data quality and testing frameworks
  • Knowledge of data pipeline orchestration (Airflow, Dagster); see the DAG sketch after this list
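
For the orchestration requirement above, the sketch below shows a minimal Airflow DAG of the kind Cloud Composer runs, chaining extract, validate, and publish steps. The DAG id, schedule, and task callables are hypothetical placeholders rather than this team's actual pipeline.

    # Minimal sketch of an Airflow DAG (Airflow 2.4+ / Cloud Composer 2); all names are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_features():
        """Placeholder: pull raw events from BigQuery into a staging table."""

    def validate_features():
        """Placeholder: run data-quality checks before publishing."""

    def publish_features():
        """Placeholder: materialize validated features to the online feature store."""

    with DAG(
        dag_id="daily_feature_pipeline",   # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
        validate = PythonOperator(task_id="validate_features", python_callable=validate_features)
        publish = PythonOperator(task_id="publish_features", python_callable=publish_features)

        extract >> validate >> publish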

Nice to Have:

  • Experience with streaming systems (Pub/Sub, Kafka, Kinesis)
  • Experience with GCP-specific security and IAM best practices
  • Knowledge of Cloud Logging and Cloud Monitoring for data pipelines
  • Familiarity with Cloud Build and Cloud Deploy for CI/CD
  • Knowledge of ML metadata management systems
  • Familiarity with data governance and security requirements
  • Experience with dbt or similar data transformation tools

Education

Any graduate (bachelor's degree in any discipline)