The project goal was data enrichment: finding and collecting email addresses of people working at educational institutions in the USA, starting from a list of universities. We were given a list of US university websites, and our task was to find and scrape all available emails associated with these institutions. The next step was to determine the specific department to which each email belongs.
To collect and enrich the data, we selected the following technical stack: Python, Scrapy, and MySQL. The main challenge was linking each email to a specific department. When scraping with many threads, there is an issue that the threads share the same database but do not share information about the current crawl status or iteration. To handle this, we developed an algorithm based on search trees: after a department is found and the crawler walks its search branches and finds emails, those emails are connected to the department to which the current branches belong. In the worst case, an email is linked only to the university. In rare cases, the scraper found an email on department pages before the department itself had been identified. For such cases we created a data post-processing script that updates the connections between emails and departments based on matching URL parts.
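The URL-based post-processing can be sketched as follows. This is a minimal illustration, not the production script: the function name, the example URLs, and the longest-prefix heuristic are assumptions; the idea is simply that an email found on a page is relinked to the department whose URL shares the longest path prefix with that page, falling back to the university when nothing matches.

```python
from urllib.parse import urlparse

def link_email_to_department(email_page_url, department_urls):
    """Pick the department whose URL shares the longest path prefix with
    the page the email was found on; return None (i.e. fall back to
    linking the email to the university) when nothing matches."""
    page = urlparse(email_page_url)
    best, best_len = None, 0
    for dept_url in department_urls:
        dept = urlparse(dept_url)
        if dept.netloc != page.netloc:
            continue  # different (sub)domain, not a candidate
        prefix = dept.path.rstrip("/")
        if page.path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = dept_url, len(prefix)
    return best
```

For example, an email scraped from a page under /department/engineering/ would be relinked to the engineering department even if the department record was created after the email was stored.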
To search for departments (or to classify a link as a department link), we prepared a system of rules consisting of a few simple but effective steps.
1. To identify a potential department link, every link scraped from the page is checked for keywords in its URL text or tag attributes.
2. Popular, widely used keywords such as news, blog, and calendar are excluded.
3. If a keyword has been found in the URL, the link is cut right after the URL part that contains the keyword. For instance, for the keyword ‘engineering’, from the two links that were found we keep only /department/engineering/electrical-engineering.
4. If a potential department link points to a subdomain of the university website, we treat only the subdomain itself as the department. For instance, given http://art.qwe.edu/news/campus-art-gallery, where the university link is http://qwe.edu, the link http://art.qwe.edu is recorded as the department link.
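The four rules above can be sketched in one function. This is an illustrative simplification under stated assumptions: the keyword and exclusion sets are tiny examples (the real rule system is larger), and the example domain qwe.edu comes from the article's own example.

```python
from urllib.parse import urlparse

# Illustrative keyword sets; the real rule system is much larger.
DEPARTMENT_KEYWORDS = {"department", "engineering", "biology"}
EXCLUDED_KEYWORDS = {"news", "blog", "calendar"}

def department_link(url, university_netloc="qwe.edu"):
    """Return the department-level URL for `url`, or None when the link
    does not look like a department link."""
    parsed = urlparse(url)
    # Rule 4: a subdomain of the university site is itself the department.
    if parsed.netloc != university_netloc and parsed.netloc.endswith("." + university_netloc):
        return f"{parsed.scheme}://{parsed.netloc}"
    parts = [p for p in parsed.path.lower().split("/") if p]
    # Rule 2: drop links whose path contains excluded keywords.
    if any(p in EXCLUDED_KEYWORDS for p in parts):
        return None
    # Rules 1 and 3: find the last path part containing a keyword
    # and cut the URL right after it.
    for i in range(len(parts) - 1, -1, -1):
        if any(k in parts[i] for k in DEPARTMENT_KEYWORDS):
            return f"{parsed.scheme}://{parsed.netloc}/" + "/".join(parts[: i + 1])
    return None
```

Scanning the path from the end means a link like /department/engineering/electrical-engineering/faculty is cut after electrical-engineering, the deepest keyword-bearing segment, matching rule 3.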
Searching for emails on a page is not a complicated task: it is done by scanning markup tags (such as mailto: links) and scanning the page text with the help of regular expressions.
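A minimal sketch of that extraction step, assuming a deliberately simplified email pattern (production crawlers typically use a stricter, RFC-aware regex) and a caller that has already pulled the page text and any mailto: hrefs out of the HTML:

```python
import re

# Simplified pattern; good enough for illustration, not RFC 5322 complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(page_text, mailto_hrefs=()):
    """Collect emails from visible page text plus mailto: link targets."""
    found = set(EMAIL_RE.findall(page_text))
    for href in mailto_hrefs:
        if href.lower().startswith("mailto:"):
            found.update(EMAIL_RE.findall(href[len("mailto:"):]))
    return sorted(found)
```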
After that, we added methods that clean up scraped emails containing URL-encoded characters and convert obfuscated emails into the usual format: for instance, emails encoded with JS String.fromCharCode, or with @ replaced by -at-, (at), _at and the like, which is done to hide emails from scrapers.
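These clean-up steps can be sketched as below. The exact obfuscation variants handled are assumptions based on the examples in the text; real pages use many more.

```python
import re
from urllib.parse import unquote

# '(at)', '[at]', '-at-', '_at_' and similar, with optional surrounding spaces.
AT_VARIANTS = re.compile(r"\s*(?:\(at\)|\[at\]|-at-|_at_)\s*", re.IGNORECASE)

def deobfuscate(raw):
    """Normalise common email obfuscations back to a plain address."""
    # URL-encoded characters, e.g. dean%40qwe.edu -> dean@qwe.edu
    raw = unquote(raw)
    # JS-encoded: String.fromCharCode(100,101,...) -> decoded characters
    m = re.search(r"String\.fromCharCode\(([\d,\s]+)\)", raw)
    if m:
        raw = "".join(chr(int(c)) for c in m.group(1).split(","))
    # textual '@' replacements back to '@'
    raw = AT_VARIANTS.sub("@", raw)
    return raw.strip()
```

The decoded string is then passed back through the regular email extractor, so malformed results are simply discarded.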
The scraper successfully processed about 6K educational websites belonging to US universities, collecting about 3 million emails and 200K departments linked to each other, so the data enrichment task was fulfilled successfully. At the next stage, enhanced with keywords from European languages, the scraper was used to collect data from European universities.