After moving my blog from DigitalOcean a month ago, I've had Google Search Console send me a few emails about broken links and missing content. While fixing those was easy enough once they were pointed out to me, I wanted to know if there was any missing content that GSC had not found yet.
I've used wget before to create an offline archive (mirror) of websites, and I've even experimented with the spider flag, but never put it to any real use.
For anyone not aware, the spider flag lets wget act as an extremely basic web crawler, similar to Google's search/indexing technology. It can be used to follow every link it finds (including those to assets such as stylesheets) and log the results.
Turns out, it's a pretty effective broken link finder.
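If you just want to see spider mode in action before turning it loose on a whole site, you can point it at a single page first. A minimal sketch, with example.com standing in for any URL you like:

# Check a single page without downloading it or following links
wget --spider https://example.com/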
Installing wget with debug mode
Debug mode is required for the command I'm going to run.
On OSX, using a package manager like Homebrew allows for the --with-debug option, but it doesn't appear to be working for me at the moment; luckily, installing wget from source is still an option.
Thankfully cURL is installed by default on OSX, so it's possible to use that to download and install wget.
Linux users should be able to use wget with debug mode without any additional work, so feel free to skip this part.
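If you're not sure whether the wget you already have was built with debug support, a quick way to tell is to run a single spider request with --debug and see whether detailed request and response output appears. This is only a rough check, and example.com is just a placeholder:

# A debug-enabled build prints verbose DEBUG output, including the raw request headers
wget --spider --debug https://example.com/ 2>&1 | head -n 20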
Download the source
cd /tmp
curl -O https://ftp.gnu.org/gnu/wget/wget-1.19.5.tar.gz
tar -zxvf wget-1.19.5.tar.gz
cd wget-1.19.5/
Configure with OpenSSL
./configure --with-ssl=openssl --with-libssl-prefix=/usr/local/ssl
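The prefix above assumes OpenSSL lives under /usr/local/ssl. If it doesn't on your machine (on a Homebrew setup it usually won't), you can point configure at wherever your copy is installed; this variant assumes a Homebrew-installed OpenSSL:

# Ask Homebrew where its OpenSSL lives and use that as the prefix
./configure --with-ssl=openssl --with-libssl-prefix="$(brew --prefix openssl)"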
Make and install
make
sudo make install
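Before moving on, it's worth checking that the freshly built wget is the one your shell now picks up (the install path below is the usual default for a source build; yours may differ):

# Should point at /usr/local/bin/wget and report the newly built version
which wget
wget --version | head -n 1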
With the installation complete, it's time to find all the broken things.
Checking your site
The command to give wget is as follows. It will write the resulting log file to your home directory (~/) and may take a little while to run, depending on the size of your website.
wget --spider --debug -e robots=off -r -p https://jcode.me 2>&1 \
| egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)' > ~/wget.log
Let's break this command down so you can see what wget is being told to do (a small reusable wrapper built from the same command is sketched after this list):
--spider, this tells wget not to download anything.
--debug, gives the extra information that we need.
-e robots=off, this tells wget to ignore the robots.txt file.
-r, this means recursive, so wget will keep following links deeper into your site until it can find no more.
-p, get all page requisites such as images, styles, etc.
https://jcode.me, the website URL. Replace this with your own.
2>&1, take stderr and merge it with stdout.
|, this is a pipe; it sends the output of one program to another program for further processing.
egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)', find lines containing the strings "HEAD", "Referer" and "Remote file does not", then print each match along with the line after it.
> ~/wget.log, output everything to a file in your home directory.
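If you plan to run this against more than one site, it can be handy to wrap the whole thing in a tiny script. This is only a sketch; the script name, argument handling, and default log location are my own choices rather than anything that comes with wget:

#!/usr/bin/env bash
# check-links.sh - hypothetical wrapper around the wget spider command above
# Usage: ./check-links.sh https://your-site.example [logfile]
set -u

SITE_URL="${1:?usage: $0 <site-url> [logfile]}"  # site to crawl (required)
LOG_FILE="${2:-$HOME/wget.log}"                  # where to write the filtered log

# wget exits non-zero when it finds broken links, so don't treat that as a script failure
wget --spider --debug -e robots=off -r -p "$SITE_URL" 2>&1 \
  | egrep -A 1 '(^HEAD|^Referer:|^Remote file does not)' > "$LOG_FILE" || true

echo "Crawl finished. Check for broken links with:"
echo "  grep -B 5 'broken' $LOG_FILE"

Save it as check-links.sh, make it executable with chmod +x check-links.sh, and run it as ./check-links.sh https://jcode.me.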
Reading the log
Using grep, we can take a look inside the log file, filter out all the successful links and resources, and find only the lines which contain the phrase broken link.
grep -B 5 'broken' ~/wget.log
It will also return the 5 lines above each matching line, so that you can see the resource concerned (HEAD) and the page where the resource was referenced (Referer).
An example of the output:
--
HEAD /autorippr-update/ HTTP/1.1
Referer: https://jcode.me/makemkv-auto-ripper/
User-Agent: Wget/1.16.3 (darwin18.2.0)
--
Remote file does not exist -- broken link!!!
--
HEAD /content/images/2019/01/tpvd27rgco7ssa21.jpg HTTP/1.1
Referer: https://jcode.me/makemkv-auto-ripper/
User-Agent: Wget/1.16.3 (darwin18.2.0)
--
Remote file does not exist -- broken link!!!
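Once the crawl has finished and the log looks like the example above, a couple of rough one-liners can pull out the answers you actually care about. These are just sketches built on the same grep, and they assume your log follows the format shown here:

# Which pages on the site reference broken links, and how many each
grep -B 5 'broken' ~/wget.log | grep '^Referer:' | sort | uniq -c | sort -rn

# Which paths are actually missing
grep -B 5 'broken' ~/wget.log | grep '^HEAD' | awk '{print $2}' | sort -u

From there it's just a matter of fixing the pages listed as referers, or restoring the missing files.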