Update your RSS with Python

11-06-2020

In this post I will show you a simple Python script whose goal is to update an existing RSS feed file with the content of a new post.

This solves a problem of mine: I couldn't find a simple, stupid script to do this basic thing. Keep in mind that many websites use platforms like Hugo or Write.as, which update their RSS feeds automatically.

How an RSS file should look

This may sound trivial, but it's not. To get an idea of how RSS files are structured, I looked at the feeds I follow using Miniflux. Here is the funny thing: I could not find two files structured in the same way. Sure, they were all RSS files, but with so many differences!

There are many variants of RSS, of which version 2.0 seems to be the most common. Atom, a separate feed format, is also in use.

Let's look at the feeds I found. One was like this:

<rss version="2.0">
  <channel>
    <feedpress:locale>en</feedpress:locale>
    <atom:link rel="hub" href="http://feedpress.superfeedr.com/"/>
    <title>...</title>
    <description>
    </description>
    <link>https://example.com/</link>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 22 Jul 2018 05:26:43 -0400</pubDate>
    <lastBuildDate>Fri, 22 Jul 2018 05:26:43 -0400</lastBuildDate>
    <item>
      <title>...</title>
      <pubDate>Thu, 01 Jun 2017 00:00:00 -0400</pubDate>
      <description>
	...
      </description>
      <link>http://tracking.feedpress.it/link/.....</link>
      <guid isPermaLink="false">
	https://example.com/post
      </guid>
    </item>
  </channel>
</rss>

This looks like a standard RSS feed; however, it has extra entries related to FeedPress, a paid service for managing RSS feeds.

Okay, I don't need this, I can manage my RSS by myself, thank you. Let's see the next one:

<rss version="2.0">
  <channel>
    <title>...</title>
    <link>https://example.com</link>
    <description>...</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 21 May 2020 10:26:31 -0600</lastBuildDate>
    <atom:link href="https://example.com/index.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>...</title>
      <link>https://example.com/blog/post</link>
      <pubDate>Sat, 15 May 2020 03:02:34 -0700</pubDate>
      <guid>https://example.com/blog/post</guid>
      <description>
	...
      </description>
    </item>
  </channel>
</rss>

This one looks more standard. However, it has been generated with Hugo, as we can see from the extra generator field. Lastly, let's take a look at an Atom feed:

<feed>
  <title>...</title>
  <link href="http://example.com/atom.xml" rel="self"/>
  <link href="http://example.com"/>
  <updated>2010-01-08T02:41:52+00:00</updated>
  <id>example.com</id>
  <author>
    <name>Name Surname</name>
  </author>
  <generator uri="http://gohugo.io/">Hugo</generator>
  <entry>
    <title type="html">Title</title>
    <link href="https://example.com/post"/>
    <updated>2010-01-08T02:41:52+00:00</updated>
    <id>
      https://example.com
    </id>
    <content type="html">
      ...
    </content>
  </entry>
</feed>

The first differences I can see are:

  1. The structure is one big feed container, and every post is an entry;
  2. There are some different fields, like author and id;
  3. The dates are in another format: Atom uses RFC 3339 timestamps, while RSS 2.0 uses RFC 822 dates.

The first two differences were not so problematic, but the last one was, because dealing with dates is always a pain in programming.
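To make the date difference concrete, here is a minimal sketch that uses only the Python standard library to convert between the two formats seen in the feeds above. The helper names rss_to_atom_date and atom_to_rss_date are made up for this example, and the Atom side assumes a timestamp with a numeric UTC offset (like +00:00), which is what the sample feed uses.

```python
from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime


def rss_to_atom_date(rss_date: str) -> str:
    """Convert an RSS pubDate (RFC 822 style) to an Atom-style ISO timestamp."""
    return parsedate_to_datetime(rss_date.strip()).isoformat()


def atom_to_rss_date(atom_date: str) -> str:
    """Convert an Atom timestamp with a numeric UTC offset to an RSS pubDate."""
    return format_datetime(datetime.fromisoformat(atom_date.strip()))


print(rss_to_atom_date("Tue, 02 Jun 2020 17:00:34 +0200"))
# 2020-06-02T17:00:34+02:00
print(atom_to_rss_date("2010-01-08T02:41:52+00:00"))
# Fri, 08 Jan 2010 02:41:52 +0000
```

The strip() calls are there because pretty-printed feeds often leave whitespace around the date text.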

Let's get it clean

So, I looked at the structure of RSS as defined by the W3C. The basic structure should be something like:

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>AlexMV12's blog</title>
    <link>https://alexmv12.xyz/</link>
    <description> New posts on AlexMV12's blog. </description>
    <atom:link href="https://alexmv12.xyz/atom.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>My new website</title>
      <link> https://alexmv12.xyz/blog/newblog </link>
      <pubDate> Tue, 02 Jun 2020 17:00:34 +0200 </pubDate>
      <guid> https://alexmv12.xyz/blog/newblog </guid>
      <description>
	...
      </description>
    </item>
  </channel>
</rss>

This is valid RSS, as confirmed by the W3C Feed Validation Service. Note, however, the presence of an atom:link element, which was suggested by the validator itself.
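As a quick local sanity check, the required parts of this structure can be verified with the standard library before touching the feed. This is only a sketch: the function name missing_channel_fields is made up, the sample is a trimmed copy of the feed above without the atom:link and item elements, and it checks just the three channel elements that RSS 2.0 requires. It does not replace the validator.

```python
import xml.etree.ElementTree as ET

# RSS 2.0 requires these three elements inside <channel>.
REQUIRED_CHANNEL_FIELDS = ("title", "link", "description")


def missing_channel_fields(rss_text: str) -> list[str]:
    """Return the required channel elements that are absent or empty."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        raise ValueError("not an RSS 2.0 document: no <rss><channel> found")
    return [
        name for name in REQUIRED_CHANNEL_FIELDS
        if not (channel.findtext(name) or "").strip()
    ]


sample = """<rss version="2.0">
  <channel>
    <title>AlexMV12's blog</title>
    <link>https://alexmv12.xyz/</link>
    <description> New posts on AlexMV12's blog. </description>
  </channel>
</rss>"""

print(missing_channel_fields(sample))  # []
```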

A simple Python script

import argparse
from bs4 import BeautifulSoup
from email.utils import formatdate


def main(starting_rss_path, html_path, new_link, output_path):
    # First, we open the existing RSS file (assumed to be UTF-8 encoded).
    with open(starting_rss_path, encoding="utf-8") as rss_file:
        soup = BeautifulSoup(rss_file.read(), features="lxml-xml")
    # ... and we get the channel section:
    channel = soup.channel

    # Prepare the new post's fields. For now we only create the tags with
    # new_tag on the soup object; we will put them in the right position later.
    new_post = soup.new_tag('item')
    title = soup.new_tag('title')
    link = soup.new_tag('link')
    pubDate = soup.new_tag('pubDate')
    guid = soup.new_tag('guid')
    description = soup.new_tag('description')

    # Populate the new post by reading the correct data
    # from the HTML page.
    with open(html_path, encoding="utf-8") as html_file:
        parsed_html = BeautifulSoup(html_file.read(), features="lxml")

    # content is the div with id "content", where the actual post is written.
    # You should replace this with the appropriate selector.
    content = parsed_html.body.find('div', attrs={'id': 'content'})

    # we use the first h1 as title for the post
    title.string = content.find('h1').text

    # use the link passed as parameter
    link.string = guid.string = new_link

    pubDate.string = formatdate(localtime=True)

    # this is the real content of the post. Join all the rows.
    description.string = ''.join(map(str, content.contents))

    # put the elements in order.
    new_post.append(title)
    new_post.append(link)
    new_post.append(pubDate)
    new_post.append(guid)
    new_post.append(description)

    channel.append(new_post)

    # pretty print the XML
    prettified_xml = soup.prettify()

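    # Write as UTF-8, the encoding BeautifulSoup declares in the XML header by default.
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(prettified_xml)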


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Simple RSS builder.')
    # All four arguments are needed, so argparse rejects a call that omits one.
    parser.add_argument("-s", "--starting_rss_file", required=True,
                        help="Path of the starting RSS file; it has to be a valid RSS xml file.")
    parser.add_argument("-i", "--input", required=True,
                        help="HTML file to extract content from")
    parser.add_argument("-l", "--link", required=True,
                        help="Link that the post will have once published")
    parser.add_argument("-o", "--output", required=True,
                        help="Path of the new RSS file")
    args = parser.parse_args()
    main(args.starting_rss_file, args.input, args.link, args.output)

This is what I came up with.

We can launch it using:

python rss_creator.py -s "atom.xml" -i "new_post.html" -l "https://example.com/blog/post1" -o "new_rss.xml"

where:

  1. "atom.xml" is the path of an already existing RSS feed; it must contain at least a channel element;
  2. "new_post.html" is the path of the HTML file which contains our new post, and from which we will extract the content;
  3. "https://example.com/blog/post1" is the link that our post will have; I haven't found a way to set it automatically;
  4. "new_rss.xml" is the path of our new RSS file.
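One possible workaround for the link is to derive it from the name of the HTML file. The sketch below assumes that a post saved as post1.html is published at the base URL followed by /post1. Both that convention and the helper name link_for_post are assumptions for illustration, so adapt them to how your site actually maps files to URLs.

```python
from pathlib import Path
from urllib.parse import urljoin


def link_for_post(html_path: str, base_url: str) -> str:
    """Build a post URL from the HTML file name (hypothetical site layout)."""
    # Make sure the base ends with a slash, so urljoin appends to it
    # instead of replacing its last path segment.
    base = base_url if base_url.endswith("/") else base_url + "/"
    # Path.stem drops the directory and the ".html" extension.
    return urljoin(base, Path(html_path).stem)


print(link_for_post("drafts/post1.html", "https://example.com/blog"))
# https://example.com/blog/post1
```

The result could then be passed to the script in place of the -l value.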

The only third-party library used is BeautifulSoup, together with lxml, which it needs for the "lxml" and "lxml-xml" parsers. I hope the comments in the code make it clear enough.
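If installing extra packages is not an option, the append step alone can also be sketched with xml.etree.ElementTree from the standard library. This is a swapped-in technique, not the script above: the function name append_item is made up, and the title and description are passed in directly instead of being extracted from an HTML page.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def append_item(rss_text: str, title: str, link: str, description: str) -> str:
    """Append a new <item> to the <channel> of an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("the starting feed has no <channel> element")

    # Same fields, in the same order, as in the BeautifulSoup script.
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "pubDate").text = formatdate(localtime=True)
    ET.SubElement(item, "guid").text = link
    # ElementTree escapes the HTML markup in the text when serializing.
    ET.SubElement(item, "description").text = description
    return ET.tostring(root, encoding="unicode")


feed = '<rss version="2.0"><channel><title>Blog</title></channel></rss>'
print(append_item(feed, "Post 1", "https://example.com/blog/post1", "<p>Hello</p>"))
```

For a feed that contains atom:link, calling ET.register_namespace("atom", "http://www.w3.org/2005/Atom") before serializing keeps the atom prefix in the output; otherwise ElementTree picks a generic prefix such as ns0.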