Key Skills:
- Must-Have: Azure, Data Modelling, Data Engineering, MS SQL, Python, Airflow.
- Nice-to-Have: AWS.
Roles & Responsibilities:
- Design, develop, and maintain scalable and efficient data pipelines using tools such as Apache NiFi or Apache Airflow (see the pipeline sketch after this list).
- Develop robust Python scripts for data ingestion, transformation, and validation.
- Manage and optimize object storage systems (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage).
- Collaborate with Data Scientists and Analysts to understand data requirements and deliver high-quality, production-ready datasets.
- Implement data quality checks, monitoring, and alerting mechanisms to ensure data accuracy and reliability.
- Ensure data security, governance, and compliance with industry standards such as GDPR and HIPAA.
- Contribute to the architecture and design of data platforms and solutions, ensuring scalability and reliability.
- Design and implement ETL processes that align with business needs and deliver insights efficiently.
- Work closely with cross-functional teams to integrate data solutions into the broader ecosystem.
- Perform data profiling and assess data quality using statistical methods (see the profiling sketch after this list).
- Automate manual processes to improve efficiency and streamline data workflows.
- Optimize performance of data pipelines to handle large volumes of data with minimal latency.
- Mentor junior engineers and promote best practices in data engineering, including code reviews, version control, and testing.
- Troubleshoot and resolve data-related issues, ensuring minimal downtime and high availability.
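To make the pipeline responsibilities above concrete, the sketch below shows a minimal Airflow DAG (TaskFlow API, assuming Airflow 2.4+) covering ingestion, transformation, and a simple data quality gate. The DAG name, sample records, and validation rule are illustrative assumptions, not artifacts of this role.

```python
# Minimal sketch, assuming Airflow 2.4+ with the TaskFlow API.
# All names and data below are illustrative placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_ingest_pipeline():
    @task
    def ingest() -> list[dict]:
        # In a real pipeline this would read from an API, object store,
        # or relational source such as MS SQL Server.
        return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": None}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Drop records with missing amounts; production logic would be richer.
        return [r for r in rows if r["amount"] is not None]

    @task
    def validate(rows: list[dict]) -> None:
        # Basic data quality gate: failing the task marks the run as
        # failed, which downstream alerting can pick up.
        if not rows:
            raise ValueError("Data quality check failed: no valid rows")

    validate(transform(ingest()))


example_ingest_pipeline()
```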
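Similarly, the profiling responsibility might look like the following sketch, which uses pandas (an assumption; this posting does not mandate a specific library). The dataset, column names, and threshold are hypothetical.

```python
# Minimal statistical profiling sketch using pandas (an assumption;
# any dataframe library would do). Data and threshold are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {"order_id": [1, 2, 3, 4], "amount": [120.0, None, 87.5, 5000.0]}
)

profile = {
    "row_count": len(df),
    "null_rate": df["amount"].isna().mean(),  # share of missing values
    "mean": df["amount"].mean(),              # central tendency
    "p99": df["amount"].quantile(0.99),       # tail check for outliers
}
print(profile)

# A simple statistical quality rule: flag the batch if more than 30% of
# amounts are missing (the threshold is an illustrative assumption).
assert profile["null_rate"] <= 0.30, "Too many missing amounts"
```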
Experience Required:
- 7-10 years of experience building data lakes, data warehouses, and real-time streaming data solutions using cloud technologies (Azure mandatory; AWS a plus).
- Proven ability to work with large-scale structured and unstructured data using tools such as Apache NiFi, Apache Airflow, and Spark.
- Practical experience designing and implementing data models optimized for analytics and reporting using tools like Azure Synapse, Snowflake, or Redshift.
- Expertise in optimizing database queries and managing relational databases like MS SQL Server, PostgreSQL, or MySQL.
- Experience integrating third-party APIs and systems into data pipelines.
- Ability to perform root cause analysis on internal and external data and processes to answer specific business questions and identify opportunities for improvement.
- Experience in developing CI/CD pipelines for data workflows and familiarity with DevOps practices.
- Knowledge of data lineage, cataloging, and metadata management tools such as Azure Purview, Collibra, or Alation.
- Prior experience supporting data governance initiatives, including implementing row-level security, masking, and anonymization techniques (see the masking sketch after this list).
- Exposure to version control and CI tooling (e.g., Git, Bitbucket, Jenkins).
- Demonstrated success in working within Agile teams, participating in sprint planning, backlog grooming, and cross-functional collaboration.
- Hands-on experience with containerization technologies like Docker and orchestration tools such as Kubernetes for deploying scalable data services.
- Experience with real-time data streaming frameworks such as Kafka or Azure Event Hubs is an added advantage.
- Familiarity with monitoring tools (e.g., Grafana, Prometheus, Azure Monitor) to ensure data pipeline health and performance.
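As a concrete illustration of the masking and anonymization point above, here is a minimal Python sketch of deterministic pseudonymization using a keyed hash (HMAC). The secret key, field names, and truncation length are assumptions; real governance work would follow the organization's policies and key-management tooling.

```python
# Minimal sketch of deterministic pseudonymization with a keyed hash (HMAC).
# The key, field names, and truncation length are illustrative assumptions;
# production systems would manage keys in a secrets store.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical placeholder


def pseudonymize(value: str) -> str:
    # The same input always maps to the same token, so joins across tables
    # still work, but the original value cannot be read back directly.
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


record = {"customer_id": "C-1001", "email": "jane@example.com"}
masked = {**record, "email": pseudonymize(record["email"])}
print(masked)  # email replaced with a stable 16-hex-char token
```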
Education: B.Tech + M.Tech (dual degree), B.E., or B.Tech.