
Archiving a (WordPress) website with wget | D’Arcy Norman dot net

Original source (darcynorman.net)
Tags: history wget web-archiving archive
Clipped on: 2014-08-20

Archiving a (WordPress) website with wget

I needed to archive several WordPress sites as part of the process of gathering the raw data for my thesis research. I found a few recipes online for using wget to grab entire sites, but they all needed some tweaking. So, here’s my recipe for posterity:

I used wget, which is available on just about any Linux-ish system (I ran it on the same Ubuntu server that hosts the sites).

wget --mirror -p --html-extension --convert-links -e robots=off -P . http://url-to-site

That command doesn’t throttle the requests, so it could cause problems if the server is already under heavy load. Here’s what that line does:

  • --mirror: turns on recursion with unlimited depth, plus timestamping so that repeat runs only re-fetch files that have changed. Rather than just downloading the single file at the root of the URL, it’ll now suck down the entire site.
  • -p: download all page requisites (images, stylesheets and other supporting media) rather than just the HTML.
  • --html-extension: adds .html to the name of any downloaded HTML page whose URL doesn’t already end in .html or .htm, to make sure it plays nicely on whatever system you’re going to view the archive on. Newer versions of wget (1.12 and later) call this option --adjust-extension.
  • --convert-links: rewrite the URLs in the downloaded HTML files, to point to the downloaded files rather than to the live site. This makes it nice and portable, with everything living in a self-contained directory.
  • -e robots=off: runs “robots=off” as if it were a line in your .wgetrc file, telling wget to ignore any robots.txt rules that would otherwise keep it out of the site in question. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don’t own the site being archived, you should obey all robots.txt files or you’ll be a Very Bad Person.
  • -P .: sets the directory that the downloaded files are saved under. I left it at the default “.” (which means “here”) but this is where you could pass in a directory path to tell wget where to save the archived site. Handy, if you’re doing this on a regular basis (say, as a cron job or something…)
  • http://url-to-site: this is the full URL of the site to download. You’ll likely want to change this.
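
If the missing throttling is a worry, a gentler version of the same command might look something like the sketch below. The flags are standard wget options, but the specific numbers (a one-second wait, a 200 KB/s cap), the function name polite_mirror and the WGET override used for dry runs are my own assumptions, not part of the original recipe.

```shell
#!/bin/sh
# Sketch only: the original recipe plus three throttling flags.
#   --wait=1           pause one second between requests
#   --random-wait      vary that pause so requests are not perfectly regular
#   --limit-rate=200k  cap download bandwidth at roughly 200 KB/s
# polite_mirror and WGET are hypothetical names; the numbers are placeholders.
polite_mirror() {
    # Set WGET=echo to print the arguments instead of running wget.
    ${WGET:-wget} --mirror -p --html-extension --convert-links \
        -e robots=off -P . \
        --wait=1 --random-wait --limit-rate=200k \
        "$1"
}
```

Calling polite_mirror http://url-to-site would then run the throttled mirror. Tune the wait and rate limit to whatever the server can comfortably handle.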

You may also need to play around with the -D domain-list and/or --exclude-domains options if you want to control how it handles content hosted on more than one domain. Note that -D doesn’t turn on host spanning by itself; that takes -H.
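
As a rough illustration of the multi-domain case, here is a sketch that lets wget follow links onto a second host while keeping it away from everything else. The host names in the comments are made up, and mirror_with_domains and the WGET dry-run override are hypothetical helpers rather than anything from the original recipe.

```shell
#!/bin/sh
# Sketch only: mirror a site whose media lives on a second domain.
#   -H          allow wget to leave the host it started on
#   -D <list>   comma-separated domains wget is allowed to visit
# Without -D, combining -H with --mirror could wander across the whole web.
mirror_with_domains() {
    site_url="$1"
    domains="$2"    # e.g. "example.com,cdn.example.com" (made-up hosts)
    # Set WGET=echo to print the arguments instead of running wget.
    ${WGET:-wget} --mirror -p --html-extension --convert-links \
        -e robots=off -H -D "$domains" \
        -P . "$site_url"
}
```

Usage would be along the lines of mirror_with_domains http://example.com example.com,cdn.example.com. Swapping -D for --exclude-domains takes the opposite approach: list the hosts to skip rather than the hosts to allow.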

It’s worth noting that this isn’t WordPress-specific. This should work fine for archiving any website.
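
For the “regular basis” case mentioned under -P, one way to keep each run separate is to pass -P a dated directory. This is a minimal sketch, assuming a hypothetical ARCHIVE_ROOT variable (defaulting to a site-archives folder in your home directory) and hypothetical function names; adjust to taste.

```shell
#!/bin/sh
# Sketch only: save each archive run in its own dated directory.
# ARCHIVE_ROOT, archive_dir_for_today and archive_site are hypothetical names.
archive_dir_for_today() {
    # Prints something like /home/you/site-archives/2014-08-20
    printf '%s/%s\n' "${ARCHIVE_ROOT:-$HOME/site-archives}" "$(date +%Y-%m-%d)"
}

archive_site() {
    dest="$(archive_dir_for_today)"
    mkdir -p "$dest" || return 1
    # Set WGET=echo to print the arguments instead of running wget.
    ${WGET:-wget} --mirror -p --html-extension --convert-links \
        -e robots=off -P "$dest" "$1"
}
```

Saved as a script that ends with a call such as archive_site http://url-to-site, this could be scheduled from cron, and each run would leave a separate snapshot under its own date.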
