After moving my blog from DigitalOcean a month ago, I've had Google Search Console send me a few emails about broken links and missing content. While fixing those was easy enough once they were pointed out, I wanted to know if there was any missing content that GSC had not found yet.

I've used wget before to create offline archives (mirrors) of websites, and I'd even experimented with the spider flag, but I'd never put it to any real use.

For anyone not aware, the spider flag makes wget function as an extremely basic web crawler, similar to Google's search/indexing technology. It can be used to follow every link it finds (including links to assets such as stylesheets) and log the results.

Turns out, it’s a pretty effective broken link finder.

Installing wget with debug mode

Debug mode is required for the command I'm going to run.

On OSX, a package manager like Homebrew used to offer a --with-debug option, but it doesn't appear to be working for me at the moment. Luckily, installing wget from source is still an option.

Thankfully, cURL is installed by default on OSX, so it's possible to use that to download the wget source.

Linux users should be able to use wget with debug mode without any additional work, so feel free to skip this part.
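Before building anything, it may be worth checking whether the wget you already have was built with debug support. The check below is an assumption on my part rather than anything from wget's documentation: builds without debug support print a "not compiled in" style warning when given --debug, and the exact wording may vary between versions.

```shell
# Hypothetical quick check: wget built without debug support warns about
# --debug on stderr, so grep for that warning. Wording may differ by version.
if command -v wget >/dev/null 2>&1; then
    if wget --debug --version 2>&1 | grep -qi 'not compiled'; then
        echo "debug support: missing -- build from source as below"
    else
        echo "debug support: available"
    fi
else
    echo "wget is not installed"
fi
```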

Download the source

cd /tmp
curl -O https://ftp.gnu.org/gnu/wget/wget-1.19.5.tar.gz
tar -zxvf wget-1.19.5.tar.gz
cd wget-1.19.5/

Configure with openSSL

./configure --with-ssl=openssl --with-libssl-prefix=/usr/local/ssl

(Debugging support is enabled by default when building wget from source, so no extra configure flag is needed for it.)

Make and install

make
sudo make install

With the installation complete, now it's time to find all the broken things.

Checking your site

The command to give wget is as follows. It will write its output to a file in your home directory (~/), and it may take a little while depending on the size of your website.

wget --spider --debug -e robots=off -r -p https://example.com 2>&1 \
        | egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)' > ~/wget.log

Let’s break this command down so you can see what wget is being told to do:

  • --spider, this tells wget not to download anything.
  • --debug, gives extra information that we need.
  • -e robots=off, this one tells wget to ignore the robots.txt file.
  • -r, this means recursive, so wget will keep following links deeper into your site until it can find no more.
  • -p, get all page requisites such as images, styles, etc.
  • https://example.com, the website URL; replace this with your own.
  • 2>&1, take stderr and merge it with stdout.
  • |, this is a pipe, it sends the output of one program to another program for further processing.
  • egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)', find instances of the strings "HEAD", "Referer" and "Remote file does not". Print each matching line and the one line after it.
  • > ~/wget.log, output everything to a file in your home directory.
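To make that filter a little more concrete, here is a hypothetical, hand-written snippet in the shape of wget's debug output (the real output is far noisier) piped through the same egrep pattern. Only the request lines, referers, and missing-file notices survive, each with one line of context after it:

```shell
# Hand-written sample in the shape of wget's debug output, filtered with
# the same egrep pattern used in the main command.
printf '%s\n' \
  'HEAD /missing-page/ HTTP/1.1' \
  'User-Agent: Wget/1.19.5' \
  'Accept: */*' \
  'Referer: https://example.com/blog/' \
  'HTTP/1.1 404 Not Found' \
  'Remote file does not exist -- broken link!!!' \
  | egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)'
# Output ("--" is grep's separator between non-adjacent match groups):
#   HEAD /missing-page/ HTTP/1.1
#   User-Agent: Wget/1.19.5
#   --
#   Referer: https://example.com/blog/
#   HTTP/1.1 404 Not Found
#   Remote file does not exist -- broken link!!!
```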

Reading the log

Using grep we can take a look inside the log file, filtering out all the successful links and resources and keeping only the lines around the phrase broken link.

grep -B 5 'broken' ~/wget.log

It will also return the 5 lines above each match, so that you can see the resource concerned (HEAD) and the page where the resource was referenced (Referer).

An example of the output:

HEAD /autorippr-update/ HTTP/1.1
User-Agent: Wget/1.16.3 (darwin18.2.0)
Remote file does not exist -- broken link!!!

HEAD /content/images/2019/01/tpvd27rgco7ssa21.jpg HTTP/1.1
User-Agent: Wget/1.16.3 (darwin18.2.0)
Remote file does not exist -- broken link!!!
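Going one step further, the HEAD lines can be pulled out of the log and de-duplicated into a simple list of broken paths. This is a sketch that assumes the log has the shape shown above; the sample log here is hand-written so the pipeline is self-contained:

```shell
# Build a hand-written sample log in the same shape as ~/wget.log,
# then extract the unique broken paths from it.
log=$(mktemp)
cat > "$log" <<'EOF'
HEAD /autorippr-update/ HTTP/1.1
User-Agent: Wget/1.16.3 (darwin18.2.0)
Remote file does not exist -- broken link!!!

HEAD /content/images/2019/01/tpvd27rgco7ssa21.jpg HTTP/1.1
User-Agent: Wget/1.16.3 (darwin18.2.0)
Remote file does not exist -- broken link!!!

HEAD /autorippr-update/ HTTP/1.1
User-Agent: Wget/1.16.3 (darwin18.2.0)
Remote file does not exist -- broken link!!!
EOF

# Grab the two lines above each "broken link" notice, keep only the HEAD
# lines, print the request path, and de-duplicate.
grep -B 2 'broken link' "$log" | awk '/^HEAD/ {print $2}' | sort -u
# Output:
#   /autorippr-update/
#   /content/images/2019/01/tpvd27rgco7ssa21.jpg
rm -f "$log"
```

Against your real log, the same pipeline is just: grep -B 2 'broken link' ~/wget.log | awk '/^HEAD/ {print $2}' | sort -u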