Project Description

A custom scraper for generating leads from Yelp. The goal was to collect leads for an email marketing company targeting medical offices in the USA. For this purpose we used the data available in the Yelp.com directory.

Yelp is a directory that stores business reviews. It helps users find the best businesses in various industries and shows them on a map. The openly available data includes business name, address, phone number, website and reviews. However, this data is not enough to run a successful email marketing campaign, because Yelp does not provide business email addresses. Fortunately, most businesses have a website, and its address is listed in the business details on Yelp.

Our task consisted of two steps:

  1. Collect data on all medical businesses in the USA.
  2. Visit the website of every medical office and scrape email addresses.

The challenge of step 1 was getting past Yelp's anti-scraping protection, because Yelp blocks clients that send requests too frequently. To resolve this issue we routed our requests through proxy servers. We have our own tool that automatically finds free proxies and checks whether they are functional.
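The proxy check described above could look roughly like the sketch below. It is not our actual tool. The function names `proxy_works` and `filter_working`, the test URL, the timeout and the worker count are all assumptions made for illustration. A proxy counts as functional here if a simple request through it returns HTTP 200.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def proxy_works(proxy, test_url="https://example.com/", timeout=5.0):
    """Return True if a request sent through `proxy` succeeds.

    `proxy` is a URL such as "http://203.0.113.10:8080" (hypothetical address).
    The test URL and timeout are illustrative defaults, not tuned values.
    """
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, malformed reply: treat as dead.
        return False


def filter_working(proxies, check=proxy_works, max_workers=20):
    """Check all proxies in parallel and keep the functional ones, in order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

The `check` parameter is there so that the filtering logic can be exercised without network access, for example by passing a function that looks up a precomputed result.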

The challenge of step 2 was performance. We needed to process the websites as fast as possible. Say we have 800,000 websites with 20 pages each on average: that means 16 million pages to process. At that volume the job can take weeks, which we could not afford. For such tasks we have a server that is specifically equipped to handle this load at the highest possible speed. We ran 600 threads at the same time, which brought us more than 400,000 email addresses, with 84% accuracy in finding an email address on a website.
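As a rough illustration of step 2, the sketch below extracts email addresses from page HTML with a regular expression and spreads the pages over a thread pool. It is a simplified sketch, not our production code. The names `extract_emails` and `scrape_emails`, the regular expression and the caller-supplied `fetch` function are assumptions. Only the figure of 600 threads comes from the project itself.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A deliberately simple pattern; it will miss obfuscated addresses
# and can match strings that are not real mailboxes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return the set of lowercased email addresses found in `html`."""
    return {match.lower() for match in EMAIL_RE.findall(html)}


def scrape_emails(urls, fetch, max_workers=600):
    """Fetch every URL in parallel and map each URL to the emails found.

    `fetch` is a caller-supplied function (hypothetical here) that takes
    a URL and returns the page HTML as a string.
    """
    def process(url):
        try:
            return url, extract_emails(fetch(url))
        except Exception:
            # One failing page must not stop the whole run.
            return url, set()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process, urls))
```

Threads fit this workload because each worker spends most of its time waiting on the network, so a large pool keeps many downloads in flight at once.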