Extracting links from a webpage with Python

Enlaces en Página de Facebook Some time ago we presented a small program that helped us to publish in Twitter Publishing in Twitter when posting here.

Later I started having a look at the Facebook API and doing some tests. I discovered that Facebook does not allow to publish links with their anchor text. It transforms them in links that you can click on but such that they have the own link as text. I wanted to publish in Facebook the whole text (it will not show easily the whole entry, just a small part and a link to click in order to see more; and so on).

It has always called my attention the netiquette in some mailing lists where they add numbers near to the anchor text of links and they write at the end these numbers and the corresponding links. See, for example this Support page.

I decided to follow this path in order to publish in my Facebook pages. In the following I will try to explain some parts of the program for doing this. The code is available at rssToLinks.py (version in this moment, maybe they will be changes later).

There are several ways to extract links: regular expressions, some HTML parser (in our Blogómetro project we used this approach with the Simple SGML parser). Looking for alternatives I found Beautiful Soup, as a fast way to parse a web page and I decided to give it a try.

In order to use it we need some modules. We will publish in Facebook using the RSS feed, so we will also need to include the ‘feedparser’ module.

import feedparser
from bs4 import BeautifulSoup
from bs4 import NavigableString
from bs4 import Tag

Now we can read the RSS feed:

feed = feedparser.parse(url)

for i in range(len(feed.entries)):

And now the magic of BeautifulSoup can start:

soup = BeautifulSoup(feed.entries[i].summary)
links = soup("a")

That is we parse the RSS entry looking for links (“a” tag). We will have the entry in the ‘summary’ part and we are interested in the entry in position ‘i’. It will return the list of HTML elements with that tag.

In some entries we include images, but we do not want them to appear in the text. For this we use ‘isinstance’ in order to check if inside the text there is another HTML tag. We will check the list with the links together with a counter ‘j’ in order to associate the numbers and the links (in the original HTML, we have not modified it yet).

	j = 0
	linksTxt = ""
	for link in links:
		if not isinstance(link.contents[0], Tag):
			# We want to avoid embdeded tags (mainly <img ... )
			link.append(" ["+str(j)+"]")
			linksTxt = linksTxt + "["+str(j)+"] " + link.contents[0] + "\n"
			linksTxt = linksTxt + "    " + link['href'] + "\n"

The content of the link (now we know that it is not an image nor another HTML tag) will be available at `link.contents[0]` (of course, it could be more content but our links tend to be simple).

        linksTxt = linksTxt + "["+str(j)+"] " + link.contents[0] + "\n"

and the links is at `link[‘href’]`.

                linksTxt = linksTxt + "    " + link['href'] + "\n"

Now we need the text of the HTML.

        print soup.get_text()

Sometimes this text can have breaklines, spaces, … We could suppress them. We usually have very simple links, so we are not going to pay attention to this problem.

Now, we can add at the end the links:

        if linksTxt != "":
                print
                print "Links :"
                print linksTxt