

PHP scraper vs Python scraper. Episode 2

Read the first part of the article about PHP and Python scrapers here.

Saving results

Let us assume that we need to store the Facebook links we have collected, and that they currently sit in a facebook_links array of 10 elements.

Let us start with the PHP scraper. Most often, scraping results are stored in a database, and to work with it we have installed Doctrine/DBAL, a popular package. It has a lot of advantages: it does not create unnecessary connections, its methods let you forget which database and driver are in use, and it works quite fast. The drawback is that there is no multiple insert. This is a headache, because storing 10 FB links requires 10 inserts into the database instead of the desired single one (you will have to either switch from Doctrine/DBAL to the heavier Doctrine/ORM or live with the 10 inserts for now). Then we describe the function, method or class responsible for it, and saving is done.

There is another solution that we have been using when working with Python and Scrapy. At the beginning we installed mysqlclient, a small library that lets Python work with a MySQL database. Of course, we could use SQLAlchemy, which is essentially Doctrine/ORM for Python, but it brings a large number of wrappers, session handling, various locking and so on, while we need the scraper to be fast; that is why we use raw MySQL. In most cases, when creating a scraper you know in advance which database you are going to use: MySQL, Mongo or something else. So why not speak that dialect directly instead of waiting for a wrapper to do it? By the way, MySQL clients for Python have two methods, execute and executemany (for single and multiple inserts, respectively).
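A minimal sketch of this approach (the connection parameters and the facebook_links table are assumptions for illustration) shows how one executemany call replaces ten separate inserts:

import MySQLdb  # provided by the mysqlclient package

# Stand-in for the array of links collected by the scraper.
facebook_links = ['https://www.facebook.com/example1', 'https://www.facebook.com/example2']

# Hypothetical connection parameters and table layout.
connection = MySQLdb.connect(host='localhost', user='scraper', passwd='secret', db='scraping')
cursor = connection.cursor()
# One executemany() call instead of ten execute() calls.
cursor.executemany('INSERT INTO facebook_links (url) VALUES (%s)',
                   [(url,) for url in facebook_links])
connection.commit()
connection.close()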

One more detail worth mentioning is how Scrapy saves scraped data. This is done with the help of descendants of the scrapy.Item class, which are defined in items.py, for example:


import scrapy

class FacebookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

Further on in the spider code, whenever we have found another FB url, we simply yield a FacebookItem instance:


yield FacebookItem({'url': found_url, 'title': found_title})

With this action the spider puts the element into a queue for processing and does not care about it any further (nor should it); from here on the pipelines.py file takes over (this file stores the classes that handle elements). These classes contain the logic applied to the results returned by the scraper: they can connect to the database, collect elements for bulk processing, store results in a file or a log, discard results, and so on. More details here.

Pipeline classes are connected to the scraper through the settings (the settings.py file).
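A minimal sketch of such a pipeline, assuming the same MySQL setup as above and a hypothetical project name myproject, could collect items and write them in one bulk insert when the spider finishes:

# pipelines.py
import MySQLdb

class MySQLPipeline(object):
    def open_spider(self, spider):
        # Hypothetical connection parameters, as above.
        self.connection = MySQLdb.connect(host='localhost', user='scraper', passwd='secret', db='scraping')
        self.cursor = self.connection.cursor()
        self.items = []

    def process_item(self, item, spider):
        self.items.append((item['url'], item['title']))
        return item

    def close_spider(self, spider):
        # One bulk insert for everything collected during the run.
        self.cursor.executemany('INSERT INTO facebook_links (url, title) VALUES (%s, %s)',
                                self.items)
        self.connection.commit()
        self.connection.close()

It is then enabled in settings.py:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MySQLPipeline': 300,
}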

Here we touch on a very important topic: project structure. Being a framework, Scrapy provides rules and a model for how, where and when objects should be defined and processed:

  • http requests and responses in the spider class
  • the result format in the item class
  • element processing in pipelines
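For reference, this is the layout that scrapy startproject generates (with myproject as a placeholder name); each of these responsibilities gets its own file:

myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        __init__.py
        items.py          # result format (Item classes)
        middlewares.py    # middleware layers
        pipelines.py      # element processing
        settings.py       # project settings
        spiders/          # spider classes (requests and responses)
            __init__.py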

At the same time, any custom PHP scraper (even one with an accurate structure, where the database connection, result storage and page DOM processing are cleanly separated) will differ from any other good PHP-based scraper, let alone from bad ones. In this sense Scrapy has a significant advantage: if you follow the documentation and its clearly described methods, the code will be clear, easy to understand and easy to customize.

Multithreading

What we want to start with is that multithreading in scrapers can be of two kinds:

  • multithreading as asynchronous threads within the process
  • multi-processing as individual processes running in parallel

PHP is a synchronous language that has no built-in multithreading. There are libraries that emulate it, such as Icicle and React PHP: they work by spawning separate processes and waiting for their responses inside timer functions (multi-processing). Python, on the other hand, has built-in support both for multithreading and for multi-processing (the latter via the multiprocessing and subprocess modules). This gives more flexible options for managing scrapers, but a drawback of Python is that a single Python process will not run on multiple processor cores at the same time without significant extra effort, so multithreading in Python scrapers should be combined with multi-processing.

To better understand the pros and cons, let us imagine that our goal is to scrape 100 pages. For the PHP scraper this is a minor code change: just wrap the call to the scraper's entry point in a loop over the array of pages. Though quick to implement, it will take a while to run, because only one page is processed at a time. A better solution is to wrap it in React PHP, which lets us set the number of spawned child processes; they will work in parallel and speed up scraping. Basically, that is it.

Now the Scrapy case. The simplest solution is to fill the start_urls array with this set of 100 pages and adjust two spider settings:

  • CONCURRENT_REQUESTS – number of parallel threads within a spider
  • CONCURRENT_REQUESTS_PER_DOMAIN – number of parallel threads per domain
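In settings.py this might look as follows (the values are illustrative):

# settings.py
CONCURRENT_REQUESTS = 32              # parallel threads within the spider
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # parallel threads per domain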

These two settings manage multithreading within a single spider entity. These simple settings-level actions almost completely replicate the React PHP functionality, with one difference: a spider entity will occupy only one processor core. To increase efficiency we can manually split the array of incoming pages into several smaller ones and run multiple entities. The multiprocessing and subprocess modules fulfill this task. In simple cases the multiprocessing module is enough and makes multi-processing easy to organize. If you need more control over spawning and closing processes, custom handling of exit codes or command line parameters, you will need to implement a custom manager using subprocess.
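A minimal sketch of this idea (the spider name 'facebook', the start_urls.txt file and the way the chunk is passed via -a are all assumptions) launches one 'scrapy crawl' process per chunk of pages:

import subprocess
from multiprocessing import Pool

def run_spider(url_chunk):
    # Each worker starts its own 'scrapy crawl' process for its chunk of pages.
    # Assumes the spider splits the comma-separated start_urls argument itself,
    # for example in start_requests().
    subprocess.call(['scrapy', 'crawl', 'facebook',
                     '-a', 'start_urls=' + ','.join(url_chunk)])

if __name__ == '__main__':
    with open('start_urls.txt') as f:           # the list of 100 pages
        urls = [line.strip() for line in f if line.strip()]
    chunks = [urls[i:i + 10] for i in range(0, len(urls), 10)]
    with Pool(processes=10) as pool:            # 10 processes, 10 pages each
        pool.map(run_spider, chunks)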

In the end we can say that organizing multi-processing with React PHP is not dramatically different from doing the same with subprocess, but Python-based scrapers additionally have internal multithreading thanks to what Python provides. Besides flexible control over how many pages are processed at the same time, this allows database connections to be used more efficiently. Each PHP child process creates a separate database connection, while the internal threads of a Scrapy spider share a single connection. For instance, to scrape 100 pages with 100 threads, a PHP scraper will open 100 connections (and make 100 inserts), while a Python/Scrapy solution with 10 processes of 10 threads each will open 10 connections (and make 10 inserts) for the same 100 pages.

Middleware

This is an option that exists in Scrapy because it is a framework, and that is absent from simple PHP scrapers, where the closest equivalent is wiring something up inside an if block. Say we need to scrape through proxies. In the PHP case we will do the following: if the necessary parameters are present in the configs or in the command line arguments, we add the corresponding options to cURL (and this code sits directly in the spider body).

In Scrapy the same logic can work as an additional layer. That is, the code that modifies the outgoing request to go through a proxy and returns the result to the spider is defined in the middlewares.py file. In effect, the spider code does not know that the request went through a proxy; the spider deals only with the request and the response. Middleware layers from middlewares.py or from external modules, just like pipelines, have to be connected in the scraper settings.
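A minimal sketch of such a layer (the proxy address and the project name myproject are assumptions) is a downloader middleware that routes every outgoing request through a proxy:

# middlewares.py
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # The spider never sees this; the request is transparently sent via the proxy.
        request.meta['proxy'] = 'http://127.0.0.1:8080'

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
}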

Conclusion

To conclude, it does not matter much whether a given scraper is implemented in PHP or in Python + Scrapy; both solutions deliver the desired results. Still, the personal opinion of the author is that the Python/Scrapy solution has certain advantages in terms of flexibility of development and further use, namely:

  • convenient DOM scraping with the ability to chain and branch selectors
  • clean and easy to understand project structure
  • flexible combination of multi-processing with multithreading inside each process
  • middleware
  • no need to think about request queues, response promises, redirect support, conversion of responses into scraping objects and other routine operations

Post scriptum

For those who have decided to give Scrapy a chance, here are a few useful links and tips:

  1. Official documentation.
  2. A spider can be run not only by the 'crawl' command but also from another Python script: https://doc.scrapy.org/en/latest/topics/practices.html
  3. Related to the previous point: when run from a script, the spider will generally not pick up settings.py on its own, but a settings object can be used instead:

    from scrapy.utils.project import get_project_settings
    settings = get_project_settings()
    settings.set('LOG_LEVEL', 'DEBUG')
  4. Take a look at the argparse and configparser modules. They will be helpful for setting up command line parameters and reading configs.
  5. Try to avoid working with the database directly from the spider body; everything can be done in pipelines (it might seem like it cannot, but in fact it can).
  6. Take a look at the start_requests(self) method: if it is present in a spider, it replaces the standard generation of requests from the start_urls array. It gives more control and custom options for setting up the initial requests (see the sketch after this list).
  7. NEVER pass a response object into a generated request: this causes memory leaks, because the memory occupied by the response will never be released. The correct solution is to pass the necessary values through the meta parameter and let the garbage collector do its work (also shown in the sketch after this list).
  8. Be careful with the rules for allowed/forbidden domains and with following every link you find on a page, otherwise you risk scraping the whole internet.
  9. Be careful with the dont_filter request option. Set to True, it lets the url bypass all filters. Most often it is needed to scrape a page a second time (by default Scrapy does not re-scrape pages within the same run), but it can end up with the scraper looping. If you need to re-scrape, use a dedicated callback for handling the response, or pass a re-scrape marker in the meta parameter and handle it in the general callback.
  10. Use the logging module to output data while the scraper is running; the standard print() will not work correctly.
  11. It is possible to subscribe to the spider_closed signal directly in a pipeline.
  12. Don’t build too complicated xPath filters, they look bad and calculate slower, regular expressions work better in such cases.
  13. Scrapy has a middleware layer for Crawlera: pip install scrapy-crawlera
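As a small sketch of points 6 and 7 (the spider name and urls are illustrative), start_requests() gives full control over the initial requests, and meta carries values between callbacks instead of the response object itself:

import scrapy

class FacebookSpider(scrapy.Spider):
    name = 'facebook'

    def start_requests(self):
        # Replaces the default request generation from the start_urls array.
        for url in ['https://example.com/page1', 'https://example.com/page2']:
            yield scrapy.Request(url, callback=self.parse_list)

    def parse_list(self, response):
        for href in response.css('a::attr(href)').extract():
            # Pass plain values through meta, never the response object itself.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_page,
                                 meta={'source_url': response.url})

    def parse_page(self, response):
        yield {'url': response.url, 'found_on': response.meta['source_url']}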