Key Responsibilities
- Deliver real-time implementations of Data Lakehouse solutions.
- Develop data modelling and data architecture solutions.
- Design, implement, and maintain Data Lakehouse solutions, integrating structured and unstructured data sources.
- Develop scalable ETL/ELT pipelines using tools like Apache Iceberg, Trino, Apache Spark, Delta Lake, Databricks, or Snowflake.
- Optimize data storage formats and query performance across large datasets.
- Implement security and compliance best practices in data management (role-based access control, data masking, etc.) and ensure compliance with regulations such as CCPA and CASL.
- Build and optimize a distributed search system using Trino to enable fast, SQL-based querying across large-scale, heterogeneous datasets.
- Leverage Apache Iceberg for table format management, ensuring efficient data partitioning, versioning and schema evolution.
- Implement indexing strategies and query optimization techniques to enhance search performance and reliability.
- Architect a unified Data Lakehouse solution that combines the flexibility of a data lake with the structure of a data warehouse.
- Enable real-time and batch processing capabilities for analytics, machine learning, and reporting use cases.
- Ensure data consistency, ACID compliance, and scalability using Iceberg’s transactional capabilities.
- Establish data governance frameworks, including metadata management, data lineage, and access control policies.
- Monitor and tune the performance of Trino queries and Iceberg-based storage systems to ensure low latency and high throughput.
- Collaborate with cloud and DevOps teams to support data infrastructure automation and monitoring.
Required Skills & Qualifications
- Hands-on, real-time experience creating and deploying Data Lakehouse solutions.
- Hands-on experience with Apache Iceberg, Trino, Databricks, Delta Lake, or Snowflake.
- Proficiency in Apache Spark, Python/Scala, and SQL.
- Strong working experience in data modelling, data partitioning, and performance tuning.
- Familiarity with data governance, data lineage, and metadata management tools.
- Experience working in Agile/Scrum teams.
- Experience working with structured and semi-structured data stored in object storage systems such as S3 and GCS.
- Experience with Apache Iceberg, SQL, and Python.
- Familiarity with data orchestration tools like Apache Airflow.
- Experience with real-time data processing frameworks (e.g., Kafka, Artemis, Flink).
- Knowledge of machine learning pipelines and MLOps integration with data lakehouses.
- Must be eligible for up to a Top Secret security clearance.