Update your RSS with Python
Date: 2020-06-11
In this post I will show you a simple Python script whose goal is to update an existing RSS/Atom file with the content of a new post.
This solves a problem of mine: I couldn’t find a simple, stupid script to do this basic thing. You have to consider that many websites use platforms like Hugo or Write.as, which have their own plugins to auto-update RSS feeds.
How an RSS file should look
This may sound trivial, but it’s not. To get an idea of how RSS files are structured, I just looked at the feeds I follow with Miniflux. Here is the funny thing: I could not find two files structured in the same way. Sure, they were all RSS feeds, but with so many differences!
There are a lot of RSS variants, with version 2.0 looking like the prevailing one. However, Atom also exists.
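If you just want to know which flavour a given feed declares, without reading the XML by hand, a small sketch like the following works; it uses feedparser, a third-party library that is not needed by the script later in this post, and the URL is of course a placeholder:

import feedparser

# Parse a feed from a URL (a local file path works too).
feed = feedparser.parse("https://example.com/index.xml")

# feedparser normalizes the detected format into a short string,
# for example "rss20" for RSS 2.0 or "atom10" for Atom 1.0.
print(feed.version)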
Let’s look at the feeds I found. One was like this:
<rss version="2.0">
  <channel>
    <feedpress:locale>en</feedpress:locale>
    <atom:link rel="hub" href="http://feedpress.superfeedr.com/"/>
    <title>...</title>
    <description>
    </description>
    <link>https://example.com/</link>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 22 Jul 2018 05:26:43 -0400</pubDate>
    <lastBuildDate>Fri, 22 Jul 2018 05:26:43 -0400</lastBuildDate>
    <item>
      <title>...</title>
      <pubDate>Thu, 01 Jun 2017 00:00:00 -0400</pubDate>
      <description>
        ...
      </description>
      <link>http://tracking.feedpress.it/link/.....</link>
      <guid isPermaLink="false">
        https://example.com/post
      </guid>
    </item>
  </channel>
</rss>
This looks like standard RSS; however, there are some extra fields related to Feedpress, a paid service to manage your RSS feeds.
Okay, I don’t need this, I can manage my RSS by myself, thank you. Let’s see the next one:
<rss version="2.0">
  <channel>
    <title>...</title>
    <link>https://example.com</link>
    <description>...</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 21 May 2020 10:26:31 -0600</lastBuildDate>
    <atom:link href="https://example.com/index.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>...</title>
      <link>https://example.com/blog/post</link>
      <pubDate>Sat, 15 May 2020 03:02:34 -0700</pubDate>
      <guid>https://example.com/blog/post</guid>
      <description>
        ...
      </description>
    </item>
  </channel>
</rss>
This one looks more standard. However, it has been generated with Hugo, as we can see from the extra generator field.
Lastly, let’s take a look at an Atom feed:
<feed>
  <title>...</title>
  <link href="http://example.com/atom.xml" rel="self"/>
  <link href="http://example.com"/>
  <updated>2010-01-08T02:41:52+00:00</updated>
  <id>example.com</id>
  <author>
    <name>Name Surname</name>
  </author>
  <generator uri="http://gohugo.io/">Hugo</generator>
  <entry>
    <title type="html">Title</title>
    <link href="https://example.com/post"/>
    <updated>2010-01-08T02:41:52+00:00</updated>
    <id>
      https://example.com
    </id>
    <content type="html">
      ...
    </content>
  </entry>
</feed>
The first differences I can see are:
- The structure is a big feed container, and every post is an entry;
- We have some different fields, like author and id;
- The date is in another format.
The first two differences were not so problematic, but the last one was, because dealing with dates is always a pain in programming.
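To be fair, converting between the two date formats can be done with the standard library alone; here is a minimal sketch (it assumes Python 3.7+ for datetime.fromisoformat), using the Atom timestamp from the example above:

from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime

# Atom uses ISO 8601 / RFC 3339 timestamps.
atom_date = "2010-01-08T02:41:52+00:00"
parsed = datetime.fromisoformat(atom_date)

# RSS 2.0 uses RFC 822-style dates, e.g. "Fri, 08 Jan 2010 02:41:52 +0000".
rss_date = format_datetime(parsed)

# And back again.
assert parsedate_to_datetime(rss_date) == parsed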
Let’s get it clean
So, I looked at the structure of RSS as documented by the W3C Feed Validation Service. The basic structure should be something like:
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>AlexMV12's blog</title>
    <link>https://alexmv12.xyz/</link>
    <description> New posts on AlexMV12's blog. </description>
    <atom:link href="https://alexmv12.xyz/atom.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>My new website</title>
      <link> https://alexmv12.xyz/blog/newblog </link>
      <pubDate> Tue, 02 Jun 2020 17:00:34 +0200 </pubDate>
      <guid> https://alexmv12.xyz/blog/newblog </guid>
      <description>
        ...
      </description>
    </item>
  </channel>
</rss>
This is valid RSS, according to the W3C Feed Validation Service.
You can see, however, the presence of an atom:link field: this was suggested by the validator itself. Note that, for it to validate, the atom namespace has to be declared on the rss element, as in the example above.
A simple Python script
import argparse

from bs4 import BeautifulSoup
from email.utils import formatdate


def main(starting_rss_path, html_path, new_link, output_path):
    # First, we open the existing RSS file.
    handler = open(starting_rss_path).read()
    soup = BeautifulSoup(handler, features="lxml-xml")

    # ... and we get the channel section:
    channel = soup.channel

    # Prepare the new post's fields. For now we just create them with
    # new_tag; we will put them in the right position later.
    new_post = soup.new_tag('item')
    title = soup.new_tag('title')
    link = soup.new_tag('link')
    pubDate = soup.new_tag('pubDate')
    guid = soup.new_tag('guid')
    description = soup.new_tag('description')

    # Populate the new post by reading the correct data
    # from the HTML page.
    handler = open(html_path).read()
    parsed_html = BeautifulSoup(handler, features="lxml")

    # content is the div with id "content", where the actual post is written.
    # You should replace this with the appropriate selector.
    content = parsed_html.body.find('div', attrs={'id': 'content'})

    # We use the first h1 as the title of the post.
    title.string = content.find('h1').text

    # Use the link passed as a parameter.
    link.string = guid.string = new_link
    pubDate.string = formatdate(localtime=True)

    # This is the real content of the post. Join all the rows.
    description.string = ''.join(map(str, content.contents))

    # Put the elements in order.
    new_post.append(title)
    new_post.append(link)
    new_post.append(pubDate)
    new_post.append(guid)
    new_post.append(description)
    channel.append(new_post)

    # Pretty print the XML.
    prettified_xml = soup.prettify()
    with open(output_path, 'w') as file:
        file.write(prettified_xml)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Simple RSS builder.')
    parser.add_argument("-s", "--starting_rss_file",
                        help="Path of the starting RSS file; it has to be a valid RSS XML file.")
    parser.add_argument("-i", "--input",
                        help="HTML file to extract the content from.")
    parser.add_argument("-l", "--link",
                        help="Link that the post will have once published.")
    parser.add_argument("-o", "--output",
                        help="Path of the new RSS file.")
    args = parser.parse_args()

    main(args.starting_rss_file, args.input, args.link, args.output)
This is what I came up with.
We can launch it using:
python rss_creator.py -s "atom.xml" -i "new_post.html" -l "https://example.com/blog/post1" -o "new_rss.xml"
where:
- “atom.xml” is the path of an already existing RSS feed; it should contain at least a channel element;
- “new_post.html” is the path of the HTML file which contains our new post, and from which we will extract the content;
- “https://example.com/blog/post1” is the link that our post will have once published; I haven’t found a way to set it automatically;
- “new_rss.xml” is the path of our new RSS file.
The only external libraries needed are BeautifulSoup and lxml, which is used as the parser. I hope the comments in the code make it clear enough.
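As a quick sanity check (just a sketch, reusing the same library), you can re-open the generated file and make sure the new item is there; the new post should be the last item of the channel, since the script appends it at the end.

from bs4 import BeautifulSoup

# Re-parse the file produced by the script.
with open("new_rss.xml") as f:
    feed = BeautifulSoup(f.read(), features="lxml-xml")

# Print the title of every item in the channel;
# the new post should appear as the last one.
for item in feed.channel.find_all("item"):
    print(item.title.text)

You can also paste the result into the W3C Feed Validation Service to double-check it.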