Develop and Maintain Scrapers: Build, deploy, and maintain efficient and reliable web scrapers using Java and its core libraries to extract data from diverse websites and online sources.
Automate and Schedule: Design and implement scripts to automate repetitive scraping tasks, scheduling jobs using tools like cron or enterprise schedulers (e.g., Airflow) to ensure timely data collection.
Data Storage and Management: Store and manage scraped data effectively in various databases, including SQL and NoSQL solutions, as well as cloud-based storage platforms.
Overcome Scraping Hurdles: Employ various tools and techniques to successfully navigate and bypass common web scraping obstacles such as CAPTCHAs, dynamic content loading, and IP blocking.
Optimize for Performance: Ensure scrapers are optimized for performance and scalability, capable of handling large-scale data extraction tasks without compromising system stability.
Data Processing and Cleansing: Transform raw scraped data into clean, structured formats like CSV and JSON. Implement data validation and cleansing processes to guarantee data quality and integrity.
Ensure Compliance: Adhere to web scraping best practices and ensure all data acquisition activities comply with legal and ethical standards, including website terms of service.
Collaborate Effectively: Work closely with data analysts, product managers, and other developers to understand data requirements and deliver high-quality, actionable data.
A Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related technical field.
3+ years of professional experience in software engineering with a strong focus on Java development, including proven experience writing Java code that extracts data from websites efficiently, accurately, and in line with best practices.
2+ years of experience with web technologies, including a solid understanding of JavaScript, HTML, CSS, and XML for effective entity extraction, plus hands-on experience designing, querying, and managing data in both SQL and NoSQL databases.
2+ years of experience with core Java web scraping libraries, such as Jsoup for HTML parsing, and with browser automation tools like Selenium or HtmlUnit for handling dynamic, JavaScript-rendered content. This includes working with data formats like JSON and CSV and applying data cleaning and validation techniques.
English proficiency of B2 or higher.
Experience with cloud platforms such as AWS, Google Cloud, or Azure for deploying and managing scraping infrastructure.
A foundational understanding of network traffic analysis.
Familiarity with the full Software Development Life Cycle (SDLC), including testing and quality assurance.
Proficiency with version control systems, particularly Git, for collaborative development.
Experience with CI/CD pipelines and associated tools.
A keen understanding of the importance of respecting website terms of service and practicing ethical scraping.