Clean URLs in Pelican Sitemap using Python

A common problem I have faced with Jamstack static site generators is getting clean URLs.

What is a Clean URL?

A "Clean URL" is a page address that has no file extension (such as .html or .php) and no trailing slash (/) at the end.
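To make the definition concrete, here is a minimal sketch of a check for whether a URL is "clean" in the sense above. The helper name is_clean_url is hypothetical, it only looks at the path part of the URL, and treating the site root as clean is my own assumption.

```python
import posixpath
from urllib.parse import urlparse


def is_clean_url(url: str) -> bool:
    """Return True if the URL path has no file extension and no trailing slash."""
    path = urlparse(url).path
    if path in ("", "/"):
        # Assumption: the bare site root counts as clean.
        return True
    if path.endswith("/"):
        return False
    # splitext only inspects the last path component.
    _, extension = posixpath.splitext(path)
    return extension == ""
```

With this sketch, the URL of this article passes the check, while the same address with .html or a trailing slash appended does not.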

You run into challenges ranging from:

  1. Lack of full support for clean URLs in the Static Site Generator (SSG). Typically, clean URLs are configured by relying on a default behavior of webservers: when you request a folder directly, the server returns the index.html inside that folder.
  2. If you configure clean URLs using the technique above, the URLs will have trailing slashes (/) and "pages" will continue to have .html.
  3. If you use URL rewrite rules on the hosting webserver (e.g. Firebase provides a very easy way to configure clean URLs), your SSG-generated sitemap will be incorrect and still contain .html.
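The first two points can be illustrated with a small sketch of which public URL a plain webserver, without any rewrite rules, exposes for a generated file. The function name default_url_for and the example paths are hypothetical; real servers have more rules than this.

```python
from pathlib import PurePosixPath


def default_url_for(output_file: str) -> str:
    """URL a plain webserver exposes for a generated file, without rewrite rules."""
    path = PurePosixPath(output_file)
    if path.name == "index.html":
        # Folder trick: the server returns index.html for the folder URL,
        # so the public URL ends with a trailing slash.
        parent = path.parent.as_posix()
        return "/" if parent == "." else f"/{parent}/"
    # Any other page keeps its .html extension in the URL.
    return f"/{path.as_posix()}"
```

Under these assumptions, articles/my-post/index.html is served as /articles/my-post/ (trailing slash), while pages/about.html is still served as /pages/about.html.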

So, you basically end up between a rock and a hard place.

Why are Clean URLs important?

There are two related aspects.

  1. Clean URLs look better. Take, for example, the URL of this page: https://uberpython.com/articles/clean-urls-in-sitemap-using-python. You can tell from the URL that this is an article and what its topic is. Adding .html at the end changes nothing functionally; it just looks ugly.
  2. Clean URLs are better for SEO. It is widely held that clean URLs are more readable and make a page more accessible, which in turn can improve the page's SEO score.

How to create Clean URLs?

Clean URLs in Firebase Hosting

The best method I have found is to use URL rewrite rules to generate Clean URLs. For example, if you're using Firebase Hosting, change firebase.json to the following:

{
  "hosting": {
    "public": "output",
    "ignore": [
      "firebase.json",
      "**/.*",
      "**/node_modules/**"
    ],
    "cleanUrls": true,
    "trailingSlash": false
  }
}

That's all it takes. Firebase takes care of all the internal configuration in the webserver.

Solving the Sitemap '.html' Problem

After you have configured clean URLs in your webserver, you still need to correct your Sitemap. If you don't, Google and other search engines may struggle to crawl and index your website correctly, which can have a negative impact on your search engine rankings.

To solve this, we will do the following:

  1. Let the SSG generate the sitemap that includes the .html extension
  2. Run a small Python snippet to strip that extension

Create a file named fix_sitemap.py with the contents below and place it in your Pelican root directory (the folder where you have publishconf.py):

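# fix_sitemap.py
SITEMAP_PATH = "output/sitemap.xml"


def fix_sitemap():
    """Strip every ".html" from the generated sitemap, in place."""
    try:
        with open(SITEMAP_PATH, "r", encoding="utf-8") as file:
            original = file.read()
    except Exception as e:
        print(f"Opening sitemap failed with error: {e}")
        # Nothing was read, so stop here instead of truncating the file below.
        return

    corrected = original.replace(".html", "")

    try:
        with open(SITEMAP_PATH, "w", encoding="utf-8") as file:
            file.write(corrected)
    except Exception as e:
        print(f"Saving sitemap failed with error: {e}")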

A rather simple solution: we simply search for every .html in the sitemap and remove it. No XML parsing, no tree navigation.
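One caveat of a blanket replacement is that it also removes .html from the middle of a URL, for example in a slug. If that matters for your site, a slightly stricter variant could strip the extension only where it sits directly before the closing </loc> tag. This is a sketch, not part of the original script; the name strip_html_from_locs and the example.com URL are hypothetical.

```python
import re

# Match ".html" only when it is immediately followed by the closing </loc> tag
# (allowing optional whitespace in between).
_HTML_BEFORE_LOC_END = re.compile(r"\.html(?=\s*</loc>)")


def strip_html_from_locs(sitemap_xml: str) -> str:
    """Remove the .html extension at the end of each <loc> value."""
    return _HTML_BEFORE_LOC_END.sub("", sitemap_xml)


sample = "<url><loc>https://example.com/articles/intro-to.html-parsing.html</loc></url>"
cleaned = strip_html_from_locs(sample)
```

Here only the final extension is removed and the .html inside the slug is left alone. It is still plain text processing, so it keeps the spirit of the simple approach.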

Now go to your Pelican configuration in publishconf.py and add these lines right at the bottom:

from fix_sitemap import fix_sitemap
fix_sitemap()

This ensures that every time you publish your website, the sitemap is updated correctly.
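Because the moment at which the settings file runs relative to sitemap generation can depend on your setup, it may be worth checking the published sitemap once. The sketch below lists any <loc> entries that still end in .html. The function name locs_with_html is hypothetical, and unlike the fix itself this check does parse the XML, using the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from typing import List


def locs_with_html(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL in the sitemap that still ends with .html."""
    root = ET.fromstring(sitemap_xml)
    leftovers = []
    for element in root.iter():
        # Tags usually carry the sitemap namespace, e.g. "{http://...}loc".
        if element.tag == "loc" or element.tag.endswith("}loc"):
            url = (element.text or "").strip()
            if url.endswith(".html"):
                leftovers.append(url)
    return leftovers
```

You could call it as locs_with_html(open("output/sitemap.xml", encoding="utf-8").read()) after a publish; an empty list means no .html entries were found.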

Need Help? Open a discussion thread on GitHub.