Archiving a (WordPress) website with wget | D’Arcy Norman dot net
I needed to archive several WordPress sites as part of the process of gathering the raw data for my thesis research. I found a few recipes online for using wget to grab entire sites, but they all needed some tweaking. So, here’s my recipe for posterity:
I used wget, which is available on any linux-ish system (I ran it on the same Ubuntu server that hosts the sites).
wget --mirror -p --html-extension --convert-links -e robots=off -P . http://url-to-site
That command doesn’t throttle the requests, so it could cause problems if the server has high load. Here’s what that line does:
- --mirror: turns on recursion (with unlimited depth) and timestamping. Rather than just downloading the single file at the root of the URL, it’ll now suck down the entire site.
- -p: download all page requisites (images, stylesheets, and other supporting media) rather than just the HTML.
- --html-extension: this appends .html to the names of downloaded HTML files that don’t already end in it, to make sure the archive plays nicely on whatever system you’re going to view it on. Newer wget releases call this option --adjust-extension (or -E).
- --convert-links: rewrite the URLs in the downloaded HTML files to point to the downloaded files rather than to the live site. This makes the archive nice and portable, with everything living in a self-contained directory.
- -e robots=off: executes the “robots off” command, telling wget to ignore any robots.txt rules that ask crawlers to stay away from the site in question. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don’t own the site being archived, you should obey all robots.txt files or you’ll be a Very Bad Person.
- -P .: set the download directory. I left it at “.” (which means “here”), but this is where you could pass in a directory path to tell wget where to save the archived site. Handy, if you’re doing this on a regular basis (say, as a cron job or something…)
- http://url-to-site: this is the full URL of the site to download. You’ll likely want to change this.
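Since the command as written doesn’t throttle its requests, here is a minimal sketch of a throttled variant that also saves into a dated directory, which could suit the cron-job case. The function name archive_site, the DRY_RUN switch, and the specific wait and rate values are my own assumptions for illustration, not part of the original recipe; tune them for your server.

```shell
#!/bin/sh
# Sketch: the recipe above plus throttling, saved into a dated directory.
# archive_site, DRY_RUN, and the numeric values are illustrative assumptions.
#
# Extra flags compared with the original command:
#   --wait=1           pause about 1 second between requests
#   --random-wait      vary that pause between 0.5x and 1.5x of --wait
#   --limit-rate=200k  cap download speed at roughly 200 KB/s
archive_site() {
  url=${1:?usage: archive_site URL [DEST_DIR]}
  # Default destination: ./archive-YYYY-MM-DD in the current directory.
  dest=${2:-./archive-$(date +%Y-%m-%d)}

  # Build the wget argument list in the positional parameters.
  set -- --mirror --page-requisites --html-extension --convert-links \
         -e robots=off \
         --wait=1 --random-wait --limit-rate=200k \
         -P "$dest" "$url"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "wget $*"
  else
    wget "$@"
  fi
}
```

Running it with DRY_RUN=1 first (for example, DRY_RUN=1 archive_site http://url-to-site) prints the command without touching the server, which is a cheap way to check the flags before handing the script to cron. Keep -e robots=off only for sites you own.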
You may also need to play around with the -D domain-list and/or --exclude-domains options, if you want to control how it handles content hosted on more than one domain.
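As a rough sketch of the multi-domain case: -D on its own does not make wget leave the starting host, so it is normally paired with -H (--span-hosts). The host names below (www.example.com, cdn.example.com, ads.example.com) and the function name mirror_across_domains are hypothetical, chosen only to illustrate the options.

```shell
#!/bin/sh
# Sketch: mirror a site whose media lives on a second host.
# mirror_across_domains, DRY_RUN, and all host names are illustrative assumptions.
mirror_across_domains() {
  url=${1:?usage: mirror_across_domains URL ALLOWED_DOMAINS [EXCLUDED_DOMAINS]}
  allow=${2:?comma-separated list of domains to follow}
  deny=${3:-}

  # --span-hosts lets recursion leave the starting host;
  # --domains then limits it to the listed domains.
  set -- --mirror -p --html-extension --convert-links -e robots=off \
         --span-hosts --domains="$allow"

  # Optionally carve specific domains back out of the allowed set.
  if [ -n "$deny" ]; then
    set -- "$@" --exclude-domains="$deny"
  fi

  set -- "$@" -P . "$url"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "wget $*"
  else
    wget "$@"
  fi
}
```

For example, mirror_across_domains http://www.example.com www.example.com,cdn.example.com ads.example.com would follow links to the two listed hosts and skip the third. Without a domain list, -H combined with --mirror can wander far beyond the site you meant to archive.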
It’s worth noting that this isn’t WordPress-specific. This should work fine for archiving any website.