Most non-technical professionals working with web-enabled businesses have come across the concept of “scraping”. However, the term gets thrown around loosely to mean copying any data off any website, even when the data in question is just a handful of data points. This article covers the basics of web scraping, which will hopefully shed light on what makes for a good scraping use case and how a scraping tool works at a technical level.
What data can be scraped online?
Although new technologies are emerging, such as Robotic Process Automation (RPA), that allow for more sophisticated scraping even of non-standardised data (such as emails), we will focus on the most common use case: web pages, for which a junior developer can set up a scraper with relative ease.
The structure of a website is built around HTML code, the blueprint from which a page is rendered when it is opened. You can view it by opening the source code of a website in most browsers. Inside the HTML code you might find calls to other plugins, applications and, essentially, other code that is not all visible to the user. For example, you can open up the source code of your Facebook profile to see the structure of the visible elements of the page, which is the HTML, but you will not be able to access the databases in which Facebook stores all your data.
This HTML code is visible to users, and therefore readable by scraping tools, which essentially break the HTML code down into a more digestible form. A scraper can be programmed to always look for certain tags in the HTML. For example, on your Facebook page, your name can always be found between two title tags in the form <title id="pageTitle">USER_NAME</title>. Knowing that the information you want to scrape is always held between the same tags, you can easily program a scraper to collect that data.
Web scraping can also refer to other formats, for example scraping PDFs available online, however again for our purposes we will stick with standard web pages.
What are the best use cases for scraping?
Collecting names from Facebook pages is not the best example of scraping. Instead, the typical use cases rest on being able to collect hundreds or thousands of standardised data points very quickly. Let’s say you wanted to collect all the names, prices, categories and descriptions from an e-commerce site, where hundreds of products appear on a listing page after a quick search. All the data points you want would be standardised, in the sense that they would be held between the same tags, and could therefore be scraped with ease.
So before asking someone to “scrape” data, think about whether there is a high volume of data that requires scraping and whether the data is presented in a standardised way.
The basics of setting up a scraper
Depending on which programming language you are using, setting up a basic scraper can be quite simple for a junior developer. For this example, we will use Ruby, which is a general-purpose programming language.
Ruby, like all other popular coding languages, has ready-built libraries that can be used. For scraping, you need two: Open-URI and Nokogiri.
Open-URI is a module in Ruby's standard library that allows us to make HTTP requests outside of our browser and collect the HTML code of any page. For example, html = URI.open('http://www.google.com') would return to us the HTML code of Google's home page and make it readable (in older versions of Ruby this was written as plain open(...), but since Ruby 3.0 the URI prefix is required). So the purpose of Open-URI is simply to make the HTML of any website programmatically accessible to us.
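As a minimal sketch, here is what fetching a page with Open-URI looks like in practice (using example.com, a page maintained precisely for demonstrations like this):

```ruby
require 'open-uri'

# URI.open returns an IO-like object; calling .read on it
# gives us the page's HTML as a plain String.
html = URI.open('http://www.example.com').read

# Print the first 100 characters, just to see what we fetched.
puts html[0, 100]
```

The string held in html is exactly what a parser like Nokogiri would then be pointed at.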
Nokogiri is a piece of software (called a “gem”) compatible with Ruby that anyone can install with a single command in the terminal (“gem install nokogiri”). Nokogiri allows you to parse HTML code and collect the data in it. Nokogiri is Japanese for a “fine-toothed saw”, a fitting name, as it allows us to go through all the snippets of code and pull out precisely the information we want through the various methods provided in the gem.
Scraping each website requires the scraper to be customised for that page; however, the basic idea of using Nokogiri is that it allows us to iterate over nested structures within the HTML code and pick out the data held within the specific tags we are looking for.
After the data has been collected, we can simply store it in the appropriate output format, depending on what we intend to use the data for. For example, a spreadsheet or a database are two popular output formats.
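For instance, writing scraped rows out to a spreadsheet-friendly CSV file with Ruby's standard csv library (the product values here are made up for illustration):

```ruby
require 'csv'

# Scraped data, represented as one hash per product (made-up values).
products = [
  { name: 'Mug',   price: 4.99 },
  { name: 'Plate', price: 7.50 }
]

# Write a CSV file with a header row; the result opens in any
# spreadsheet tool such as Excel or Google Sheets.
CSV.open('products.csv', 'w') do |csv|
  csv << ['name', 'price']
  products.each { |p| csv << [p[:name], p[:price]] }
end
```

Swapping this step for inserts into a database would leave the scraping code itself unchanged.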
Words of caution
One thing to consider is how much flexibility the process will require. We already touched upon standardised data; however, if data is constantly being collected from different websites, such as the authors of various blogs, it may be much more effective to do the work manually. This work requires judgement, for example checking various pages for the information required, and building a scraper to handle all the possible cases would probably take more development effort than it is worth.
Finally, scrapers can break if the website being scraped changes its source code. This will require a developer to go in and re-configure the scraper to understand the new layout of the source code. If scraping is a critical part of your operation, make sure you have a solid fallback in place in case your scraper breaks.
If you are looking for options for collecting data, having this discussion with more technical colleagues is worthwhile; hopefully, after reading this article, you can have a more educated discussion from the start.