Extracting links from a webpage with Python

Enlaces en Página de Facebook Some time ago we presented a small program that helped us to publish in Twitter Publishing in Twitter when posting here.

Later I started having a look at the Facebook API and doing some tests. I discovered that Facebook does not allow to publish links with their anchor text. It transforms them in links that you can click on but such that they have the own link as text. I wanted to publish in Facebook the whole text (it will not show easily the whole entry, just a small part and a link to click in order to see more; and so on).

It has always called my attention the netiquette in some mailing lists where they add numbers near to the anchor text of links and they write at the end these numbers and the corresponding links. See, for example this Support page.

I decided to follow this path in order to publish in my Facebook pages. In the following I will try to explain some parts of the program for doing this. The code is available at rssToLinks.py (version in this moment, maybe they will be changes later).

There are several ways to extract links: regular expressions, some HTML parser (in our Blogómetro project we used this approach with the Simple SGML parser). Looking for alternatives I found Beautiful Soup, as a fast way to parse a web page and I decided to give it a try.

In order to use it we need some modules. We will publish in Facebook using the RSS feed, so we will also need to include the ‘feedparser’ module.

import feedparser
from bs4 import BeautifulSoup
from bs4 import NavigableString
from bs4 import Tag

Now we can read the RSS feed:

feed = feedparser.parse(url)

for i in range(len(feed.entries)):

And now the magic of BeautifulSoup can start:

soup = BeautifulSoup(feed.entries[i].summary)
links = soup("a")

That is we parse the RSS entry looking for links (“a” tag). We will have the entry in the ‘summary’ part and we are interested in the entry in position ‘i’. It will return the list of HTML elements with that tag.

In some entries we include images, but we do not want them to appear in the text. For this we use ‘isinstance’ in order to check if inside the text there is another HTML tag. We will check the list with the links together with a counter ‘j’ in order to associate the numbers and the links (in the original HTML, we have not modified it yet).

	j = 0
	linksTxt = ""
	for link in links:
		if not isinstance(link.contents[0], Tag):
			# We want to avoid embdeded tags (mainly <img ... )
			link.append(" ["+str(j)+"]")
			linksTxt = linksTxt + "["+str(j)+"] " + link.contents[0] + "\n"
			linksTxt = linksTxt + "    " + link['href'] + "\n"

The content of the link (now we know that it is not an image nor another HTML tag) will be available at `link.contents[0]` (of course, it could be more content but our links tend to be simple).

        linksTxt = linksTxt + "["+str(j)+"] " + link.contents[0] + "\n"

and the links is at `link[‘href’]`.

                linksTxt = linksTxt + "    " + link['href'] + "\n"

Now we need the text of the HTML.

        print soup.get_text()

Sometimes this text can have breaklines, spaces, … We could suppress them. We usually have very simple links, so we are not going to pay attention to this problem.

Now, we can add at the end the links:

        if linksTxt != "":
                print
                print "Links :"
                print linksTxt

Advertisements

Publishing in Twitter when posting here

I don’t think RSS is dead. But we can see how many people is using social networking sites to get their information. For this reason I was publishing the entries of my blogs using services like IFTTT and dlvr.it. They are easy to use and they work pretty well. Nevertheless, one is always wondering if we could prepare our own programs to manage these publications and learn something new on the way.

I started with Facebook publishing but I’m presenting here a program for Twitterpublishing: we only need to publish the title and the link (and, maybe, some introductory text).

I found the project twitter as an starting point. It has implemented an important part of the work. We can install it using pip:

fernand0@here:~$ sudo pip install twitter

It needs `BeautifulSoup` and maybe some other modules. If they are not available in our system we will get the adequate ‘complaints’.

Now we can execute it.
This step is useful in order to do the authentication steps in Twitter and getting the oauth token. Our program will not deal with this part and it will be smaller and more simple.
Not so long ago it was possible to send tweets with just the username and passowrd but Twitter decided to start using a more sofisticated systema based in OAuth.

fernand0@here:~$ twitter

The program launches a browser for authentication and then giving our app the adequate permissions. This generates the tokens and other information needed to interact with Twitter. They will be stored at `~/.twitter_oauth` (in a Unix-like system, I’d be happy to know about other systems) that we will reuse in our own application.

The program is quite simple, it can be downloaded from rssToTwitter.py V.2014-12-07) (link to the commented version, the program has been updated to correct bugs and add features).

We will start reading the configuration:

config = ConfigParser.ConfigParser()

config.read([os.path.expanduser('~/.rssBlogs')])
rssFeed = config.get("Blog1", "rssFeed")
twitterAc = config.get("Blog1", "twitterAc")

This configuration file must contain a section for each blog (this program uses only the configuration for the first one). Each section will contain the RSS feed, the name of the Twitter account and the name of the Facebook account (it can be empty if it won’t be used. For example, for this blog it would be:

[Blog1]
rssFeed:https://makingfernand0.wordpress.com/feed/
twitterAc:MakingFernand0
pageFB:

It also needs the Twitter configuration:

config.read([os.path.expanduser('~/.rssTwitter')])
CONSUMER_KEY = config.get("appKeys", "CONSUMER_KEY")
CONSUMER_SECRET = config.get("appKeys", "CONSUMER_SECRET")
TOKEN_KEY = config.get(twitterAc, "TOKEN_KEY")
TOKEN_SECRET = config.get(twitterAc, "TOKEN_SECRET")

We can use the ones that have been generated before (we can copy them from the app); in my system it is at: `/usr/local/lib/python2.7/dist-packages/twitter/cmdline.py` and the tokens are stored at `~/.twitter_oauth`

The configuration file is as follows:

[appKeys]
CONSUMER_KEY:xxxxxxxxxxxxxxxxxxxxxx
CONSUMER_SECRET:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[makingfernand0]
TOKEN_KEY:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TOKEN_SECRET:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Notice that you can configure as many Twitter accounts as needed. The name of the second section is the same as the one used in the previous configuration file.

Now we can read the RSS feed in order to extract the required data:

feed = feedparser.parse(rssFeed)

i = 0 # It will publish the last added item

soup = BeautifulSoup(feed.entries[i].title)
theTitle = soup.get_text()
theLink = feed.entries[i].link


For this, we will use `feedparser` in order to download the RSS feed and process it.

We are chosing the first entry (position 0), that will be the last one published. For Twitter we just need the title and the link.
We use BeautifulSoup for processing the title, in order to avoid the tags (para evitar las posibles etiquetas que pueda contener (CSS, HTLL entities, ...) 

And finally, we will build the tweet:


statusTxt = "Publicado: "+theTitle+" "+theLink

We can now proceed to the steps of identification, authentication and publishing:

t = Twitter(
	auth=OAuth(TOKEN_KEY, TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET))

t.statuses.update(status=statusTxt)

This entry was originally published (in Spanish) at: Publicar en Twitter las entradas de este sitio.