Scraping websites


I am currently working on the migration of a website. The old one – after ten years – has a lot of content, including more than 3800 documents in different shapes and sizes. To complicate matters, they are hosted on a web server that is not directly accessible, and in a CMS that is not very user-friendly.

Scraping the HTML content was relatively easy. It allowed me to make a sitemap, find all active pages and the resources linked to them, download their content, and record the date they were last modified. Doing the same for the PDFs and DOCs was harder.
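To illustrate the link-finding part, here is a minimal sketch in Python using only the standard library. It is not the tool I used; the class and function names (`LinkCollector`, `extract_links`) and the example.org URLs are made up for the example. It collects every `href` and `src` on a page and resolves it against the page's own URL, which is the raw material for a sitemap.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from one HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all linked resources on a page as absolute URLs, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/docs/report.pdf">Report</a> <img src="logo.png">'
links = extract_links(page, "https://example.org/about/")
```

Running this over every downloaded page gives a list of (page, linked resource) pairs, from which you can work out which resources are linked from where.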

A preview with metadata was available for each document, but its link ended in .pdf or .doc, and the scrapers I used do not follow such links.

The solution was to dig into the JavaScript of webscraper.io and remove the filter that excludes documents. And voilà: the crawler follows the links and captures the content of the preview.
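The idea behind such a filter can be sketched in a few lines of Python. This is not webscraper.io's actual code; the function name `should_follow`, the extension list, and the example.org URL are assumptions for illustration only. A crawler typically checks the extension of each link before following it, so taking .pdf and .doc off the exclusion list is enough to make it visit those links.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical default: extensions a crawler refuses to follow
DEFAULT_EXCLUDED = frozenset({".pdf", ".doc", ".docx", ".zip", ".jpg", ".png"})


def should_follow(url, excluded=DEFAULT_EXCLUDED):
    """Return True if the crawler should follow this link."""
    # Look only at the path, so query strings and fragments are ignored
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in excluded


url = "https://example.org/files/annual-report.PDF?preview=1"

# With the default filter, the document link is skipped
skipped = not should_follow(url)

# With .pdf and .doc removed from the filter, the link is followed
relaxed = DEFAULT_EXCLUDED - {".pdf", ".doc"}
followed = should_follow(url, excluded=relaxed)
```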

It allows me to see where pages link to documents, which documents just sit there, and when they were uploaded or last modified. Whilst you still have to go through some of the records manually, you can decide on 95% of your content by applying simple rules, e.g.:

  1. Delete stubs and content older than 5 years
  2. Delete all vacancy announcements older than 30 days
  3. Keep all annual and financial reports, etc.
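
Rules like these are easy to apply to the scraped records automatically. The sketch below is one possible way to do it in Python, under a few assumptions of my own: each record has already been labelled with a `kind` (the labels "stub", "vacancy", "report" and "page" are hypothetical), the "keep" rule wins over the "delete" rules, five years is approximated as 5 x 365 days, and anything no rule covers is left for manual review.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    url: str
    kind: str            # hypothetical labels: "stub", "vacancy", "report", "page"
    last_modified: date


FIVE_YEARS = timedelta(days=5 * 365)   # approximation, ignores leap days
THIRTY_DAYS = timedelta(days=30)


def decide(record, today):
    """Return "keep", "delete" or "review" for one scraped record."""
    age = today - record.last_modified
    # Rule 3 first: reports are kept regardless of age (assumed precedence)
    if record.kind == "report":
        return "keep"
    # Rule 2: vacancy announcements older than 30 days
    if record.kind == "vacancy" and age > THIRTY_DAYS:
        return "delete"
    # Rule 1: stubs, and any other content older than 5 years
    if record.kind == "stub" or age > FIVE_YEARS:
        return "delete"
    # Everything else still needs a human decision
    return "review"


today = date(2024, 1, 1)
records = [
    Record("https://example.org/report-2010.pdf", "report", date(2010, 6, 1)),
    Record("https://example.org/vacancy-1", "vacancy", date(2023, 11, 1)),
    Record("https://example.org/old-page", "page", date(2018, 1, 1)),
    Record("https://example.org/news", "page", date(2023, 9, 1)),
]
decisions = [decide(r, today) for r in records]
```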

Should you migrate your website, then scraping may be an option. It may also be useful when you want to do a spring clean. But better still: get a decent CMS and make sure that it is managed!