There are a few simple and lazy ways to check for dead links. The methods to use depend on whether you want to check a list of links in a text file or you want to check for dead links within a website. We all know that checking for dead links manually by clicking each one of them is a very long and boring task. Thankfully I have found – after a very long search – a few incredibly handy URL checking scripts.
This short guide is split into two parts:
Finding dead links is an easy task with help from the tools described here. I’ve provided a brief usage guide to the desktop program Dead Link Checker (DLC) which checks the validity of links stored in a text file (example output files provided).
The next two pages explain a little more about the options available for checking the validity of URLs listed in either a website or a text file. Two free downloads are available directly through JournalXtra: Dead Link Checker and Scriptilitious.
Other link and URL validity checking downloads are linked to over the next few pages.
The support pages for Dead Link Checker are available here
The support page for Scriptilitious is here
Checking for Dead Links Within a Website
There is a lot of dead link checking software available for those who want to test a website to check that all the URLs within it are alive. This type of software usually spiders a specified website’s links and compiles a report of those links that work and those that are dead. It usually runs from a desktop and can be told how many links deep to follow from the first ones it encounters, for example five links deep from every link found on a website’s home page; put another way, it maps all links up to five pages distant from the first page trawled.
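For a rough command-line taste of this spidering approach, wget can do much the same job; a quick sketch, with example.com standing in for the site you want to crawl:

# Crawl up to five links deep without saving any pages; --spider only
# checks that each URL responds, and the log ends with a summary of
# any broken links found
wget --spider -r -l 5 -o spider.log https://example.com

# Pull the broken-link summary (and the dead URLs listed beneath it)
grep -A 20 'broken link' spider.log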
A few of these dead link checking programs are:
There are also many scripts that can be installed onto a server to constantly check websites for dead URLs.
Here’s a link to an online dead link checking service
Link checking programs of this type have two huge limitations:
- They check all links on a specified page, i.e. no URL can be exempted from the check
- They only check links that are set within a web page and are served by a web server
Those two limitations might not seem so bad if all you intend to do is check the validity of the links set within one website and you’re not bothered about how long the check takes or the hits on your server as the links are checked; but what if you already know the exact URLs of the links you wish to check?
Checking Links from a Text File
There is a Perl script called Dead Link Checker which inspects a list of links in a locally stored file (such as on a desktop computer) and compiles a report of the server response code given for each link. The report is printed to an HTML file that can be loaded into a browser, with each link categorized by its server response code.
The advantages of using DLC to validate a list of known URLs are:
- The URLs do not have to point to one domain – they can span as many domains as there are URLs in the list
- The list of URLs does not need to be stored on a web server
- Server hits are reduced to only as many as are required to validate the URLs
The script is simple to use: it requires one text file of URLs, one terminal command to get it going, and the understanding of server response codes that I am about to give you.
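To get a feel for those response codes before running the script, you can collect them by hand; a minimal shell sketch, assuming curl is installed and links.txt holds one bare URL per line:

# Print each URL's HTTP response code in front of it; redirects are
# not followed, so 3xx codes show up as-is
while read -r url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$code $url"
done < links.txt

This mimics only the response-code half of what DLC does; the Perl script also builds the HTML report described below.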
Potential dead links return 3xx or 4xx server codes (for example 301 or 403).
Not all 3xx or 4xx server responses indicate a dead link; sometimes they indicate a page redirect, so their validity must be confirmed manually by inspecting the URL each flagged link redirects to. Dead Link Checker lists URLs that redirect alongside the URL they redirect to, so the manual checking process can be very quick: if there is no redirect URL then it is a dead link; if there is a redirect URL then the redirection page must be checked to see what it is.
DLC might not be a cure-all but it makes the job of deciding whether a link is dead or just redirected a damn sight easier.
Many page redirects are only between the http:// and http://www. URL preferences of a site’s webmaster, so be sure to check the differences between links and their redirect targets before you write them off as being dead.
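A quick way to see where a flagged URL actually lands is to ask curl; a minimal sketch, assuming curl is installed, with example.com standing in for the URL under test:

# Response code of the URL itself (a 3xx code means it redirects)
curl -s -o /dev/null -w '%{http_code}\n' http://example.com

# Follow the redirect chain and print the final code and landing URL
curl -s -L -o /dev/null -w '%{http_code} %{url_effective}\n' http://example.com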
Here’s an example of how to use Dead Link Checker, with example pages. All work is performed from within the same folder:
- Create a text file (links.txt)
- Paste the URLs to be checked into the file (links.txt) – it might look like this
- Format the links so that the URLs have no HTML tags, e.g. if they are formatted like these, change them to look like these. (Use the URL Extractor script in Scriptilitious to extract URLs from large files and strip them of any HTML formatting.)
- Open a terminal in the folder that holds the text file (links.txt)
- Type the command
deadlinkcheck -HTMLoutput -noCache -Verb links.txt > checked.html
to get Dead Link Checker to check the URLs in the file links.txt and print a report to checked.html – the report will look similar to this
- Load DLC’s output file, the HTML report (checked.html), both in a web browser and in a text editor (I use Kate) – the file in the text editor will look similar to this
- Check the links listed in your browser and manually remove the bad ones from the file loaded in the text editor. The response codes are:
- 2xx response codes indicate live links. Keep these
- 3xx response codes indicate URLs that redirect somewhere else. Check these
- 4xx response codes indicate pages that are not found. DLC relies on the server response code, and not all servers give a correct response, so check these links manually
- 5xx response codes indicate pages that will likely never load. Check them manually
- Once the bad links have been removed, use the text editor to reformat the links in the HTML file (checked.html). Here is how I do that with Kate (the Linux text editor):
- convert all characters to lower case (Tools>Lowercase)
- remove indentation (Tools>Clean Indentation)
- open find and replace (Ctrl+R)
- find
- replace it with
|
(or some other character that is not reproduced anywhere within the file)
- Use Alt+A to replace all occurrences.
- find
- replace it with
|-> 
- replace it with nothing (i.e. just remove it)
- find
<b>
- replace it with nothing (i.e. just remove it)
- find
</b>
- replace it with nothing (i.e. just remove it)
- find
</a><br>
- replace it with
</a>
- Manually remove anything that isn’t between anchor tags (<a> and </a>)
- Use a terminal (Konsole or Console) – open it from within the folder that holds the HTML file (checked.html) and enter this command
sed 's/.*|//g' checked.html > links.txt
to remove everything in each line that appears in front of a pipe “|” (inclusive of the pipe) and place all the links into the file links.txt
- The file links.txt will now contain the checked, active URLs from the original list, formatted within <a> tags with anchor text made from the URL, e.g.
<a href="https://journalxtra.com">https://journalxtra.com</a>
It will look similar to this
- If required, alter the anchor text to remove the http:// and .com (or .net etc…) components.
You can use the URL Extractor and URL2Hyperlink scripts provided in Scriptilitious to quickly reformat the links in your checked file.
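If you would rather not use the scripts, sed can make the same change; a rough sketch that assumes the links are formatted exactly as in the example above (add further expressions for .net and other endings):

# Strip "http://" or "https://" from the visible anchor text only (the
# leading ">" keeps the href attribute untouched), then drop a trailing
# ".com" from the anchor text
sed -e 's|>https\?://|>|' -e 's|\.com</a>|</a>|' links.txt > trimmed.txt

Here trimmed.txt is just an illustrative output name; overwrite links.txt instead if you prefer.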
Intelligent use of Dead Link Checker can transform a task that would usually take a week or two to complete into one that takes less than 20 minutes.
DLC is a free Perl script and might or might not work in Windows. Use a Linux Live Disk (Kubuntu or Linux Mint) if it doesn’t.
Downloads
These programs are free to use and download.