Crafty Sitemap Building

The Sitemap is the Holy Grail of a website. It’s the sheet (or sheets) of XML that new webmasters don’t know to use and some experienced webmasters neglect to create. Consider that every website has a front, a back, a mouthpiece, a gang of security guards and a guide. Visitors see the front; the webmaster uses the backend to create the front; the RSS feed tells the world what’s happening at the website; robots.txt and other little bits help protect it; and the sitemap guides search engine spiders around it.

A site in need of a sitemap

Usually, if you use a content management system (CMS) you will be blessed with automatic sitemap generation, either through an inbuilt process or a plugin. In that case, you only need to locate the sitemap, submit it to search engines, link to it from your index page or the footer of every page, and regularly ping it to tell search engines about updates to it. You will usually find your sitemap sitting comfortably close to your robots.txt at the root of your domain e.g. your-domain.com/sitemap.xml

If you are not blessed with automatic sitemap generation and submission then you will need to create your own sitemap. That is what this article is all about, and below are the instructions you should follow to do it.

Most often, a sitemap needs to be manually created when a website is hand crafted in (x)html or when a sitemap is to be remotely hosted on a different domain or server to the website it maps (frequently the case when a sponsor provides a co-brand or white label site but not enough space or facility to host a sitemap).

There are programs and scripts that can be used to generate sitemaps. These scripts can be split into two categories: those that work and those that don’t work. Pedants might point out that a third category exists which includes those that only work when they feel like it or after a lot of flirtatious smooth-talking, as is often the case, but I’m not going to discuss those ones.

Those sitemap generators that do work can be divided into two subcategories:

  • Those that run from a desktop PC
  • Those that run from a web server

And they may be subdivided into paid and free sitemap makers. Guess which ones we’re going to work with?

Most of the free sitemap tools that work from a desktop PC are the same ones used to check for dead links. More often than not the “sitemaps” created by those programs need to be manually edited into the XML sitemap format. For example, the URLs

https://journalxtra.com/downloads/
https://journalxtra.com/tools/

Would become:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
 xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- The site URLs go below here -->

<url>
 <loc>https://journalxtra.com/downloads/</loc>
 <changefreq>weekly</changefreq>
</url>
<url>
 <loc>https://journalxtra.com/tools/</loc>
 <changefreq>weekly</changefreq>
</url>

</urlset>

I’ve created a Bash script which automatically converts links into the xml sitemap format. The script is available for free with Scriptilitious. You can download it at the end of this article.
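The conversion shown above is mechanical, so it is easy to sketch. The example below is a minimal illustration in Python rather than Bash, and it is not the Scriptilitious script itself: the function name `build_sitemap` and the default `weekly` change frequency are assumptions made for this example. For brevity it declares only the sitemap namespace and omits the optional schemaLocation attributes.

```python
from xml.sax.saxutils import escape

# Namespace taken from the sitemap example above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, changefreq="weekly"):
    """Return an XML sitemap string for an iterable of URLs.

    build_sitemap is a hypothetical helper written for this article.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the link checker's output
        lines.append("<url>")
        # escape() converts &, < and > so the XML stays well formed
        lines.append(f" <loc>{escape(url)}</loc>")
        lines.append(f" <changefreq>{changefreq}</changefreq>")
        lines.append("</url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sitemap([
        "https://journalxtra.com/downloads/",
        "https://journalxtra.com/tools/",
    ]))
```

Feeding it the plain list of URLs exported by a link checker, one per line, should give a file that can be saved as sitemap.xml.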

You can map your website with klinkstatus (or similar) and use Scriptilitious to convert the indexed URLs into an XML sitemap. There is a rumor that klinkstatus will soon have a specific template for XML sitemap creation, which is good news for webmasters who use Linux (like me).

Let’s Take a Look at the Free Online Sitemap Generators

There are many scripts that can be uploaded to a web server and configured to automatically rebuild a sitemap and submit it to various search engines. Unfortunately, they are incredibly awkward to set up and configure; plus, for security reasons, many of them will only map a website that is on the domain where the script is being used. That restriction prevents them from being used to create sitemaps for remote sites.

A better option is to use free online sitemap generators.

Online sitemap generators work, are not limited to one website, don’t care whether you own the site being mapped and they can be used frequently. There is one catch: most limit their free maps to either 500, 1000 or 5000 URLs and only map URLs that can be reached from the root (index) page of a website. The ones I use are no exception:

  1. xml-sitemaps.com will generate a well formatted xml sitemap of up to 500 URLs,
  2. sitemaps-builder.com generates a map with up to 1,000 URLs, and
  3. PC Time Limit builds sitemaps of up to 5,000 links.

Those three sitemap generators are more than enough for most sites but what if you have a co-brand, white label or hand-crafted website that updates daily and has hundreds of thousands of pages that must be indexed? How might all those lovely URLs be added to your sitemap?

Think about this:

A list of the most recent URLs is created when you generate a sitemap. When a new web page is created a new URL is created which must be added to that map. If you start out with 1,000 URLs and add 10 new URLs every day then over 20 days another 200 URLs must be mapped. If a sitemap generator maps only the first 1000 URLs it encounters from a website’s index page and there are 1200 URLs to index then 200 URLs will be missed out of the map. An incomplete map is bad news. An incomplete map could result in a site being poorly indexed by search engines.

Is there a way to coax the online generators to create a bigger sitemap?

Fortunately, sitemap generators do not check the size of a current sitemap and cannot determine whether a sitemap is made up from the contents of multiple sitemaps that have been generated by free sitemap generators. This failing can be turned to our advantage: we can use the same free tools to create daily or weekly sitemaps, then combine their results to build one super sitemap. We can force the generator to map different parts of a website by putting links to those parts on the website’s index page. For best results, one of those links should point to an artificial linklist that points to the sections of the site that need to be mapped; but we must be careful not to duplicate data lines!
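Combining two generated sitemaps without duplicating entries can be sketched as follows. This is a minimal Python illustration, not the sed-based approach used by Scriptilitious; the function names `extract_locs` and `merge_locs` are invented for this example, and it assumes both inputs are well-formed sitemaps that use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml):
    """Return the <loc> URLs of a sitemap in document order."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]


def merge_locs(old_xml, new_xml):
    """Combine the URLs of two sitemaps, keeping the first copy of each."""
    seen = set()
    merged = []
    for url in extract_locs(old_xml) + extract_locs(new_xml):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

The result is a plain list of unique URLs, which would then be written back out in the sitemap format. Note that the parser decodes entities, so a URL stored as `&amp;` comes back as `&` and must be escaped again on output.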

The Method

The method is easy and relies on a program called sed.

Sed is a text document manipulation program that is accessed via the Linux command line. Windows and Mac users without the use of sed can install Linux with VirtualBox, run a Linux LiveDisk or can install CygWin (Cygwin or CygwinX).

The instructions provided below assume you have already placed strategic links (or a linklist) on your site’s front page (its index) that point to the deeper parts of your website you wish to have mapped. The strategic links should be as close to the top of the index page as possible because computers read webpages top-to-bottom, left-to-right and the earlier your links are read the more readily they will be mapped.

You can make your life easier by using the automated sitemap-ripper utility that comes with Scriptilitious. Again, Scriptilitious can be downloaded at the bottom of this article. So, here’s how we create a sitemap using online generators, with or without the free sitemap-ripper utility:

    1. Use one of the sitemap generation tools listed above.
    2. Upload the sitemap to your server. Place it in the root directory e.g. your-domain.com/sitemap.xml
    3. Register the sitemap with Google and Bing.
    4. Place a link to the sitemap in the footer of your site’s index page for Yahoo’s benefit.
    5. If possible, reference your sitemap in robots.txt by adding this line to it:
       Sitemap: http://www.example.com/sitemap.xml
    6. Use My Page Rank to ping the major search engines with the details of your sitemap.
    7. To update the sitemap, use one of the sitemap generation tools but, instead of overwriting the old sitemap with the newly created one, combine their contents. You can do this with sitemap-ripper (comes with Scriptilitious).
    8. Repeat step 7 then step 6 every time a new sitemap is generated.

Ensure that your URLs use only one of the http:// or http://www. formats. If your URLs are mixed then the same pages could be indexed two or three times, which could earn a search engine penalty and a lower page rank because different backlinks point to different versions of a page (http:// is different to http://www.).
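Forcing every URL into one host format is easy to automate before the sitemap is written. The sketch below is a Python illustration only; `normalize_host` is a hypothetical helper, and it assumes every URL belongs to a single site whose host appears either with or without the leading www. (it would wrongly prefix other subdomains, and it does not handle URLs containing a username).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host(url, use_www=True):
    """Rewrite a URL so its host consistently has, or lacks, 'www.'."""
    parts = urlsplit(url)
    host = parts.netloc
    # Strip an existing "www." prefix, if any, to get the bare host.
    bare = host[4:] if host.lower().startswith("www.") else host
    host = "www." + bare if use_www else bare
    return urlunsplit(
        (parts.scheme, host, parts.path, parts.query, parts.fragment)
    )
```

Running each URL through `normalize_host` before removing duplicates means that http://example.com/page and http://www.example.com/page collapse into a single entry.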

If the above sounds like too much of a chore, for as little as $19.99 you can get a sitemap generator that will create as many sitemaps as you need as big as you need them. It’s available from xml-sitemaps.com and provides automatic sitemap updates as well as saving you a lot of your valuable time.

Additional Notes

A sitemap must contain no more than 50,000 URLs and be no larger than 50 MB uncompressed (the original protocol limit was 10 MB), and its URLs must have special characters converted to their equivalent escape codes.
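The escaping requirement has two parts: characters that are not valid in a URL are percent-encoded, and characters that are special in XML are replaced by entities. A minimal Python sketch follows; `escape_loc` and `SAFE_CHARS` are names invented for this example, and the choice of which characters to leave untouched is an assumption rather than a rule taken from any particular tool.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters left untouched by percent-encoding: URL delimiters plus "%",
# so that sequences that are already encoded are not encoded twice.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def escape_loc(url):
    """Return a URL that is safe to place between <loc> and </loc>."""
    # Step 1: percent-encode non-ASCII and other unsafe characters (UTF-8).
    encoded = quote(url, safe=SAFE_CHARS)
    # Step 2: replace XML-special characters with entities. escape()
    # handles &, < and > itself; the apostrophe is added explicitly.
    return escape(encoded, {"'": "&apos;"})
```

With this sketch, `http://www.example.com/ümlat.php&q=name` becomes `http://www.example.com/%C3%BCmlat.php&amp;q=name`.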

The sitemap-ripper and sitemap-maker scripts in Scriptilitious automatically convert the URL escape sequences. Sitemap-splitter will help you split any sitemap that is oversized.
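Splitting by URL count is simple slicing. The sketch below is a Python illustration of the idea, not the sitemap-splitter script; `split_urls` is a hypothetical helper and it only enforces the 50,000 URL limit, not the file size limit.

```python
def split_urls(urls, max_urls=50_000):
    """Split a URL list into chunks that each fit in one sitemap."""
    urls = list(urls)
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
```

Each chunk would be written to its own file (sitemap1.xml, sitemap2.xml and so on).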

You can link multiple sitemaps together by using a sitemap index file. The format for a sitemap index is:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap1.xml</loc>
<lastmod>2010-05-28</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap2.xml</loc>
<lastmod>2010-12-28</lastmod>
</sitemap>
</sitemapindex>

The location of the sitemap files is placed between the <loc> and </loc> tags and the last modification time is placed between the <lastmod> and </lastmod> tags. The <lastmod></lastmod> (modification time) component is not essential and may be omitted.

The sitemap index file can be called anything you want provided it has a .xml file extension. When you use an index, it is the only sitemap file you need to submit to search engines.
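An index in the format shown above can be generated from a list of sitemap locations. This is a minimal Python sketch under stated assumptions: `build_sitemap_index` is a name invented for this example, and each entry is a pair of the sitemap URL and either a last modification date string or `None` when the date is omitted.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps):
    """Return a sitemap index string.

    sitemaps is an iterable of (loc, lastmod) pairs; lastmod may be None.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for loc, lastmod in sitemaps:
        lines.append("<sitemap>")
        lines.append(f"<loc>{escape(loc)}</loc>")
        if lastmod:  # <lastmod> is optional, so omit it when not supplied
            lines.append(f"<lastmod>{lastmod}</lastmod>")
        lines.append("</sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"
```

Calling it with `[("http://www.example.com/sitemap1.xml", "2010-05-28"), ("http://www.example.com/sitemap2.xml", None)]` produces an index with two entries, the second without a modification date.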

Scriptilitious

This free interactive utility box comes with two sitemap creation scripts that automate step 7 to produce a sitemap with the name sitemap.xml. Scriptilitious is known to work with Linux. It might work natively with Windows but will most likely require Cygwin or some other Linux terminal emulator.

Instructions

  1. Unzip the downloaded file,
  2. Place the two sitemaps that need to be combined into the WorkBox folder (give them both a different name),
  3. Open a terminal in the Scriptilitious folder and type ./scriptilitious.sh
  4. Further usage instructions are provided as the script runs.
