A web scraper, or parser, is a program or script that collects data from open sources and converts it into a structured format. Most often it works with text, but it can also extract images and other media content that is not protected by copyright.
There are many web scraping tutorials on the internet. With so many frameworks and languages available, it can be difficult to choose the right one for your needs. Here is a brief overview of web scraping tools in Python.
Scrapy is an open source framework written in Python that takes care of everything from loading HTML pages to saving the extracted data in the desired format. Requests in Scrapy are scheduled and processed asynchronously, which speeds up the work: requests run concurrently, and none of them has to wait for the previous one to complete. Scrapy is a powerful and efficient tool for building complex data collection and processing solutions. To run applications built on this framework efficiently, you may need to rent a server with sufficient performance.
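To make the benefit of asynchronous requests concrete, here is a minimal sketch using only Python's standard `asyncio` module. It does not use Scrapy's own API (Scrapy has its own scheduler and engine); the `fetch`, `crawl`, and `timed_crawl` names are hypothetical, and the network call is simulated with a short sleep so the example runs without internet access.

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network request: wait, then return fake HTML."""
    await asyncio.sleep(delay)  # simulates network latency (an assumption)
    return f"<html><title>{url}</title></html>"


async def crawl(urls: list[str]) -> list[str]:
    # gather() starts every request at once and returns the results
    # in the same order as the input URLs.
    return await asyncio.gather(*(fetch(url) for url in urls))


def timed_crawl(urls: list[str]) -> tuple[list[str], float]:
    """Run the crawl and report how long it took in seconds."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_crawl(["a", "b", "c", "d", "e"])
    print(f"Fetched {len(pages)} pages in {elapsed:.2f} s")
```

With five simulated requests of 0.1 seconds each, a sequential loop would need about 0.5 seconds, while the concurrent version finishes in roughly the time of a single request.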
On traditional web pages, the job is to parse the HTML and extract the information we want. On modern websites, the initial response often contains little of that data in its HTML, because the data is loaded asynchronously after the first request. For such sites, many people use headless browsers: they can execute JavaScript and make the follow-up requests, after which the fully rendered page can be parsed, for example with regular expressions.
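For the traditional case, parsing HTML and extracting information can be sketched with the standard library's `html.parser` module. This is only an illustration: the `LinkExtractor` class, the `extract_links` helper, and the sample markup are hypothetical. For a JavaScript-heavy site, the HTML string would first have to be obtained from a headless browser, a step that is not shown here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    """Return all link targets found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Sample markup invented for this example
SAMPLE_HTML = """
<html><body>
  <h1>Catalog</h1>
  <a href="/page/1">First</a>
  <a name="anchor-without-href">Skipped</a>
  <a href="/page/2">Second</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

A dedicated HTML parser is used here rather than a regular expression because it copes with attribute order and whitespace variations; a regular expression may be enough for small, predictable fragments.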