About the job
Key Skills: Scala Programming; Apache Spark (Spark SQL, DataFrames, and Datasets); Big Data & Distributed Systems; SQL & Databases; Cloud Platforms (AWS, Azure, or GCP)
Roles and Responsibilities:
Develop and maintain scalable data pipelines using Apache Spark and Scala to process and manage large data volumes efficiently.
Write, optimize, and maintain robust code in Scala for Spark applications.
Design, build, and deploy real-time data pipelines for low-latency processing environments.
Create, transform, and process structured data using Spark SQL, Datasets, and DataFrames (see the batch sketch after this list).
Develop distributed data processing solutions and optimize Spark jobs through partitioning, caching, and other performance enhancement techniques.
Maintain and deploy Spark-based processes for both batch and streaming data.
Implement streaming data architectures using Spark Streaming to handle continuous data flows.
Collaborate with cross-functional teams to ensure seamless data integration and pipeline efficiency.
Troubleshoot and debug distributed data processes and pipelines.
Maintain a codebase for data transformation processes and ensure data integrity throughout pipelines.
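To give a concrete picture of this kind of work, below is a minimal batch sketch in Scala that combines Spark SQL functions, DataFrames, and the repartitioning/caching techniques mentioned above. The input path, column names (customer_id, order_date, amount), and output locations are hypothetical placeholders for illustration only, not details of this role's actual pipelines.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyOrderTotals {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-order-totals")
      .getOrCreate()

    // Hypothetical input: CSV orders with customer_id, order_date, amount columns.
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/orders/")

    // Repartition by the aggregation key so downstream grouping on customer_id
    // reuses this partitioning; cache because the result feeds two outputs below.
    val dailyTotals = orders
      .repartition(col("customer_id"))
      .groupBy(col("customer_id"), col("order_date"))
      .agg(sum("amount").as("daily_total"))
      .cache()

    // Output 1: full daily totals.
    dailyTotals.write
      .mode("overwrite")
      .parquet("s3://example-bucket/reports/daily_totals/")

    // Output 2: top 100 customers by their best single day.
    dailyTotals
      .groupBy("customer_id")
      .agg(max("daily_total").as("best_day"))
      .orderBy(desc("best_day"))
      .limit(100)
      .write
      .mode("overwrite")
      .parquet("s3://example-bucket/reports/top_customers/")

    spark.stop()
  }
}
```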
Skills Required:
4-7 years of experience as a Data Engineer working with Apache Spark and Scala.
Proficiency in writing and optimizing code in Scala for distributed data processing.
Strong understanding of Spark SQL, Datasets, DataFrames, and running Spark on clusters.
Experience in processing continuous data streams using Spark Streaming (see the streaming sketch after this list).
Hands-on experience with partitioning, caching, and other Spark optimization techniques.
Familiarity with streaming data architectures and real-time processing technologies.
Proven experience with building scalable and fault-tolerant data solutions.
Certification in Spark, Big Data, or related technologies is mandatory.
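As an illustration of the streaming experience described above, here is a minimal sketch using Spark Structured Streaming (the DataFrame-based streaming API) reading from a Kafka source. The broker address, topic name, and payload format are assumptions made only for this example, and the job requires the spark-sql-kafka connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClickStreamCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-counts")
      .getOrCreate()

    import spark.implicits._

    // Hypothetical Kafka source; broker and topic are placeholders.
    // Requires the spark-sql-kafka-0-10 connector dependency.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "clicks")
      .load()

    // Kafka values arrive as bytes; assume a simple "page,userId" payload.
    val pages = events
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
      .withColumn("page", split($"payload", ",")(0))

    // Windowed count per page, with a watermark to bound state for late data.
    val counts = pages
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"page")
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```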
Preferred:
Knowledge of additional Scala libraries for data handling and processing.
Experience with cloud-based big data solutions and architectures.
Familiarity with data modeling techniques and SQL databases.
Strong problem-solving skills and ability to optimize resource usage in distributed systems.
Education: Bachelor's/Master's degree in Computer Science, Information Technology, or related fields.