Project Description

Data augmentation for a marketing campaign. The goal was to use a database of email addresses to find information about their owners. Data augmentation is one of the most common tasks in data processing for marketing and advertising. Email addresses without owner details are not enough to create a high-quality marketing newsletter. The more real owner information an email contains, the better its chances of reaching the inbox instead of being filtered out as spam. If you have the owner’s first name, last name, and position, you can create a personalized email that the recipient is much more likely to open and read.

We used Google to find and pull additional information for the augmentation. When you run a Google search for a corporate email address, in the majority of cases the owner's details appear in one of the first three organic snippets. This is what we used to collect the data. If personal details are missing from the first three results, we can open the web page behind a result and scan it for those details.
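
The snippet check described above could look roughly like the sketch below. It is an illustration only, not the project's actual extraction logic: the function names, the assumption that snippets arrive as plain strings, and the heuristic of matching capitalized words against the email's local part are all hypothetical.

```python
import re

def name_tokens(email):
    """Split the local part of an email into lowercase alphabetic tokens."""
    local = email.split("@", 1)[0]
    return [t for t in re.split(r"[^a-z]+", local.lower()) if len(t) > 1]

def find_owner(email, snippets, top_n=3):
    """Return a likely owner name from the first top_n snippets, or None.

    Assumed heuristic: a snippet is relevant when it contains at least two
    capitalized words that also appear in the email's local part.
    """
    tokens = name_tokens(email)
    for snippet in snippets[:top_n]:
        words = re.findall(r"[A-Z][a-z]+", snippet)
        matched = [w for w in words if w.lower() in tokens]
        if len(matched) >= 2:
            return " ".join(matched[:2])
    return None  # caller may fall back to scanning the result page itself
```

A `None` result corresponds to the fallback case in the text, where the details are missing from the first three results and the page itself has to be scanned.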

To skip irrelevant snippets, we developed an intelligent system of filters with blacklisted and whitelisted resources. The challenge is that Google does its best to fight scraping, and a client IP address is blocked after a certain number of queries. To resolve this, we used our custom tool for finding free proxy servers. Queries are thus sent from multiple locations, and if a proxy is banned, we find another working one to replace it.
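
A minimal sketch of both ideas, under stated assumptions: the domain lists, the `rank_snippets` function, and the `ProxyPool` class are hypothetical names chosen for illustration, and `find_replacement` stands in for the custom proxy-finding tool, whose real interface is not described here.

```python
import threading
from urllib.parse import urlparse

# Hypothetical example lists; the real filters are not published.
BLACKLIST = {"junk.example"}
WHITELIST = {"trusted.example"}

def domain_of(url):
    """Return the host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def rank_snippets(urls):
    """Drop blacklisted results and move whitelisted ones to the front."""
    kept = [u for u in urls if domain_of(u) not in BLACKLIST]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(kept, key=lambda u: domain_of(u) not in WHITELIST)

class ProxyPool:
    """Round-robin pool that swaps out a proxy once it is reported banned."""

    def __init__(self, proxies, find_replacement):
        self.active = list(proxies)
        self.find_replacement = find_replacement  # assumed: returns a working proxy
        self._next = 0
        self._lock = threading.Lock()  # the pool is shared by many threads

    def get(self):
        with self._lock:
            proxy = self.active[self._next % len(self.active)]
            self._next += 1
            return proxy

    def report_banned(self, proxy):
        with self._lock:
            if proxy in self.active:
                index = self.active.index(proxy)
                self.active[index] = self.find_replacement()
```

Each worker would call `get()` before a query and `report_banned()` when the search engine rejects the request, so the pool size stays constant while banned proxies are replaced.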

Proxy servers are often slow, with delays of up to a minute, so to speed up the process we ran 800 threads at the same time, which allowed us to process the data in the shortest possible time. The relevance of the collected data reached 70%.
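
Because each request spends most of its time waiting on a slow proxy, a large thread pool keeps many lookups in flight at once. The sketch below shows one way this could be arranged with Python's standard library. The `process_emails` function and the `lookup` callable are hypothetical; `lookup` stands in for whatever performs one proxied search and returns the owner details.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_emails(emails, lookup, max_workers=800):
    """Run lookup(email) for every email concurrently.

    Returns a dict mapping each email to its result, or to None when the
    lookup raised (for example, on a proxy timeout).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(lookup, email): email for email in emails}
        for future in as_completed(futures):
            email = futures[future]
            try:
                results[email] = future.result()
            except Exception:
                results[email] = None  # failed lookups can be retried later
    return results
```

Threads suit this workload because the work is I/O-bound: a thread blocked on a one-minute proxy delay costs little while the others continue.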