Checking Links from a Text File
Dead Link Checker (DLC) is a Perl script that inspects a list of links in a locally stored file (for example, one on a desktop computer) and compiles a report of the server response code returned for each link. The report is printed to an HTML file that can be loaded into a browser, with each link categorized by server response code.
The advantages of using DLC to validate a list of known URLs are:
- the URLs do not have to point to one domain; every URL in the list can be on a different domain
- the list of URLs does not need to be stored on a web server
- server hits are reduced to only as many as are required to validate the URLs
The script is simple to use: it requires one text file of URLs, one terminal command to get it going and the understanding of server response codes that I am about to give you.
Potential dead links return 3xx or 4xx server codes (for example 301 or 403).
Not all 3xx or 4xx server responses indicate a dead link. Some indicate a page redirect, so their validity must be confirmed manually by inspecting the URL each flagged link redirects to. Dead Link Checker lists URLs that redirect alongside the URL they redirect to, so the manual check can be very quick: if there is no redirect URL then it is a dead link; if there is a redirect URL then the redirection page must be checked to see what it is.
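Here is a rough sketch of that decision rule as a shell function. The name triage_link is made up for illustration and is not part of DLC; it assumes you have read the status code and the redirect URL (if any) off the report yourself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dead Link Checker): triage one link from
# its status code and the redirect URL shown in the report, if there is one.
triage_link() {
    code=$1
    redirect=${2:-}
    case "$code" in
        2??)
            echo "live" ;;
        3??|4??)
            # Flagged link: a redirect target means "go and look",
            # no redirect target means the link is treated as dead.
            if [ -n "$redirect" ]; then
                echo "inspect redirect: $redirect"
            else
                echo "dead"
            fi ;;
        *)
            echo "check manually" ;;
    esac
}

triage_link 200
triage_link 301 "http://www.example.com/"
triage_link 403
```

This is only a first pass. A flagged link with a redirect still needs a human to look at the page it lands on.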
DLC might not be a cure-all but it makes the job of deciding whether a link is dead or just redirected a damn sight easier.
Many page redirects only switch between the http:// and http://www. forms of a URL, according to the preference of the site's webmaster. So be sure to compare links with their redirects before you write them off as dead.
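One way to spot those harmless redirects is to compare the two URLs after stripping the parts that do not matter. The sketch below uses two hypothetical helpers, normalise_url and is_same_page, and assumes that the scheme (http or https), a leading www. and a trailing slash can all be ignored. That is a looser rule than the plain http:// versus http://www. case, so tighten it if you need to.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Drop the scheme, a leading "www." and one trailing slash from a URL.
normalise_url() {
    printf '%s\n' "$1" |
        sed -e 's|^https\{0,1\}://||' -e 's|^www\.||' -e 's|/$||'
}

# Succeed when a link and its redirect target differ only in those parts.
is_same_page() {
    [ "$(normalise_url "$1")" = "$(normalise_url "$2")" ]
}

if is_same_page "http://example.com/page" "http://www.example.com/page"; then
    echo "redirect only changes the www prefix, keep the link"
fi
```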
Here’s an example of how to use Dead Link Checker, with example pages. All work is performed from within the same folder:
- create a text file (links.txt)
- paste the URLs to be checked into the file (links.txt) – it might look like this
- format the links so that the URLs have no HTML tags, e.g. if they are formatted like these, change them to look like these. (Use the URL Extractor script in Scriptilitious to extract URLs from large files and strip them of any HTML formatting.)
- open a terminal in the folder that holds the text file (links.txt)
- type the command
deadlinkcheck -HTMLoutput -noCache -Verb links.txt > checked.html
to get Dead Link Checker to check the URLs in the file links.txt and print a report to checked.html – the report will look similar to this
- load DLC’s output file, the html report (checked.html), both in a web browser and in a text editor (I use Kate) – the file in the text editor will look similar to this
- check the links listed in your browser and manually remove the bad ones from the file loaded in the text editor. The response codes are:
- 2xx response codes indicate live links. Keep these
- 3xx response codes indicate URLs that redirect somewhere else. Check these
- 4xx response codes indicate client errors, such as pages that are not found or are forbidden. DLC relies on the server response code and not all servers give a correct response, so check these links manually
- 5xx response codes indicate server errors, so these pages will likely not load. Check them manually
- once the bad links have been removed use the text editor to reformat the links in the HTML file (checked.html). Here is how I do that with Kate (Linux text editor)
- convert all characters to lower case (Tools>Lowercase)
- remove indentation (Tools>Clean Indentation)
- open find and replace (ctrl+r)
- find
- replace it with
|
(or some other character that is not reproduced anywhere within the file)
- Use Alt+A to replace all occurrences.
- find
- replace it with nothing (i.e. just remove it)
- find
- replace it with nothing (i.e. just remove it)
- find
- replace it with nothing (i.e. just remove it)
- find
- replace it with
- Manually remove anything that isn’t between a tags (<a> and </a>)
- use a terminal (Konsole or Console) – open it from within the folder that holds the HTML file (checked.html) and enter this command
sed 's/.*|//g' checked.html > links.txt
to remove anything in a line that is written in front of a pipe “|” (inclusive of the pipe) and place all the links into the file links.txt
- The file links.txt will now contain the checked, active URLs from the original list, formatted within <a> tags with anchor text made from the URL, e.g.
<a href="http://journalxtra.com">http://journalxtra.com</a>
It will look similar to this
- If required, alter the anchor text to remove the http:// and .com (or .net etc…) components.
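To see what the sed step does, and one possible way to trim the anchor text afterwards, here is a small sketch. The sample line is made up (real DLC output may look different), and shorten_anchor is a hypothetical helper that assumes bare-domain URLs such as http://journalxtra.com with nothing after the domain.

```shell
#!/bin/sh
# Same expression as the sed command above: delete everything up to and
# including the last pipe on each line.
strip_before_pipe() {
    sed 's/.*|//g'
}

# Hypothetical helper: remove the scheme and the final ".something" from the
# anchor text only, leaving the href untouched.
shorten_anchor() {
    sed -e 's|>https\{0,1\}://\([^<]*\)</a>|>\1</a>|' \
        -e 's|\.[a-z]*</a>|</a>|'
}

# A made-up line of the kind left after the find-and-replace steps.
line='200 ok|<a href="http://journalxtra.com">http://journalxtra.com</a>'

printf '%s\n' "$line" | strip_before_pipe
printf '%s\n' "$line" | strip_before_pipe | shorten_anchor
```

Because `.*` is greedy, everything up to the last pipe on a line is removed. That is why the placeholder character must not appear anywhere else in the file.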
<code>|-> </code>
<b>
</b>
</a><br>
</a>
You can use the URL Extractor and URL2Hyperlink scripts provided in Scriptilitious to quickly reformat the links in your checked file.
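If you do not have those scripts to hand, the same two jobs can be roughed out with grep and sed. This is not the Scriptilitious code, just a stand-in sketch. The function names extract_urls and urls_to_links are invented here, and the pattern assumes that URLs start with http:// or https:// and end at a double quote, angle bracket or space.

```shell
#!/bin/sh
# Stand-in sketch, not the Scriptilitious scripts.

# Pull bare URLs out of HTML-formatted text, one per line, without duplicates.
extract_urls() {
    grep -Eo 'https?://[^"<> ]+' | sort -u
}

# Wrap each non-empty line (one URL per line) in an <a> tag, using the URL
# itself as the anchor text.
urls_to_links() {
    sed '/./s|.*|<a href="&">&</a>|'
}

printf '%s\n' '<a href="http://journalxtra.com">JournalXtra</a><br>' | extract_urls
printf '%s\n' 'http://journalxtra.com' | urls_to_links
```

The -o option is not in POSIX but is available in both GNU and BSD grep.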
Intelligent use of Dead Link Checker can transform a task that would usually take a week or two to complete into one that takes less than 20 minutes.
DLC is a free Perl script and might or might not work in Windows. If it doesn't, use a Linux Live Disk (Kubuntu or Linux Mint).
Downloads
These programs are free to use and download.
Dead Link Checker's user guide is here
Scriptilitious support page here