Job Description:
Key Responsibilities:
1. Data Profiling
- Develop repeatable Python/SQL scripts to compute column statistics and null/unique distributions, run outlier and referential-integrity checks, and apply rule-based quality validations (see the sketch after this responsibility).
- Generate and publish standardized profiling reports/dashboards for stakeholder review.
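By way of illustration, a minimal pandas sketch of such a profiling script is shown below. The source file name and the particular set of statistics are assumptions for illustration, not a prescribed implementation.

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Compute per-column profiling statistics for a source extract."""
    stats = []
    for col in df.columns:
        series = df[col]
        stats.append({
            "column": col,
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "null_pct": round(series.isna().mean() * 100, 2),
            "unique_count": int(series.nunique(dropna=True)),
            # Min/max only make sense for numeric columns in this simple sketch.
            "min": series.min() if pd.api.types.is_numeric_dtype(series) else None,
            "max": series.max() if pd.api.types.is_numeric_dtype(series) else None,
        })
    return pd.DataFrame(stats)

if __name__ == "__main__":
    # Hypothetical source extract; the profiling report feeds the stakeholder dashboard.
    df = pd.read_csv("orders_extract.csv")
    profile_dataframe(df).to_csv("orders_profile_report.csv", index=False)
```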
2. Data Mapping (S2T)
- Create and maintain source-to-target mappings for ingestion and transformation layers, capturing business rules, lineage, assumptions, and edge cases.
- Maintain version control of mapping documents in GitLab.
3. ELT Development
- Extract/Load (Mage): Build and operate ingestion pipelines with retries, alerting, schema enforcement, and parameterized environment configurations (see the sketch after this responsibility).
- Transform (dbt): Develop staging, cleansing, and mart-level models with dbt tests (unique, not_null, accepted_values) and generate documentation.
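For the extract/load side, a minimal sketch of a Mage-style data loader block is shown below, assuming Mage's standard data_loader decorator template. The endpoint URL, expected columns, and retry count are illustrative assumptions, not project specifics.

```python
import time

import pandas as pd

if 'data_loader' not in globals():
    # Standard Mage block template import (assumed available in the Mage runtime).
    from mage_ai.data_preparation.decorators import data_loader

# Illustrative schema contract enforced at load time.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}


@data_loader
def load_orders(*args, **kwargs):
    """Fetch the source extract with simple retries and basic schema enforcement."""
    # Parameterized per environment via pipeline variables (illustrative key/default).
    url = kwargs.get("source_url", "https://example.com/orders.csv")
    last_error = None
    for attempt in range(3):  # retry transient failures with backoff
        try:
            df = pd.read_csv(url)
            missing = EXPECTED_COLUMNS - set(df.columns)
            if missing:
                raise ValueError(f"Schema check failed, missing columns: {missing}")
            return df
        except Exception as exc:  # an alerting hook would be triggered here
            last_error = exc
            time.sleep(2 ** attempt)
    raise last_error
```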
4. Versioning & CI/CD
- Use GitLab for branching, merge request reviews, linting, dbt tests, and automated CI/CD deployments.
5. Data Quality Management
- Implement and monitor data quality tests at every stage of the pipeline.
- Track SLAs and enforce merge blocking on failures to prevent regressions (see the sketch after this responsibility).
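A minimal sketch of a stage-level quality gate is shown below, assuming the checks run as a CI step whose nonzero exit status blocks the merge request. The table, columns, and SLA threshold are illustrative assumptions.

```python
import sys

import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures for this pipeline stage."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if df["amount"].lt(0).any():
        failures.append("amount contains negative values")
    null_pct = df["customer_id"].isna().mean()
    if null_pct > 0.01:  # illustrative SLA threshold: at most 1% nulls
        failures.append(f"customer_id null rate {null_pct:.2%} exceeds 1% SLA")
    return failures

if __name__ == "__main__":
    # Hypothetical staged output from the transform layer.
    df = pd.read_parquet("staging/orders.parquet")
    failures = run_quality_checks(df)
    for failure in failures:
        print(f"QUALITY CHECK FAILED: {failure}")
    # A nonzero exit makes the CI job fail, which blocks the merge request.
    sys.exit(1 if failures else 0)
```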
6. Documentation & Hand-offs
- Maintain runbooks, S2T documents, dbt docs, and pipeline diagrams.
- Ensure documentation is updated within 3 business days of any change.
7. Collaboration
- Partner with analysts, architects, and QA teams to clarify transformation rules, review designs, and meet acceptance criteria.
Requirements:
Required Qualifications:
- 3–6+ years of experience in Data Engineering, preferably with offshore/remote delivery exposure.
- Strong expertise in SQL (advanced queries, window functions, performance tuning) and Python (data processing with Pandas or PySpark).
- Hands-on experience with Mage orchestration, dbt modeling, and GitLab workflows.
- Solid understanding of data modeling, lineage tracking, and data quality frameworks.
- Excellent communication skills and disciplined documentation practices.
Preferred Skills:
- Experience with Snowflake, BigQuery, Redshift, or Azure Synapse.
- Exposure to PySpark, Databricks, or Airflow.
- Awareness of BI tools such as Power BI, Tableau, or Looker for downstream analytics integration.