I’m going to post this tutorial in two parts. Part 1 is an introduction to Beautiful Soup and an example of using the library to extract particular data from an HTML file and save it as a CSV file. Part 2 will cover a more complicated case: creating a CSV file from more generic table data.
What is Beautiful Soup?
“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.” (Opening lines of Beautiful Soup)
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup. It provides ways to extract particular content from a webpage by its location among the HTML tags (think XPath and navigating the DOM), by CSS selector, or by HTML id or class attribute (or some combination thereof).
So say there is a website that contains data that is relevant to your research, such as date or address information, or link information for other sources of data that you also want to scrape. Beautiful Soup offers easy ways to pull that particular content from the webpage, remove it from its HTML wrappings, and put it into a new context where you can carry out whatever operation you want next.
I highly recommend looking at the Beautiful Soup documentation pages to get a sense of the variety of things you can do with simple Beautiful Soup commands, from isolating titles and links, to extracting all of the text out of the HTML tags, to altering the HTML within the document you’re working with.
Installing Beautiful Soup
Installing Beautiful Soup is easiest if you already have pip or another Python package installer in place. If you don’t have pip, start with Fred’s tutorial on installing python modules. Once you have pip installed, run the following command to install Beautiful Soup:
pip install beautifulsoup4
You may need to include “sudo” in your command. Sudo runs the command with administrator privileges, which gives pip permission to write to your system directories, and it requires you to re-enter your password. This is the same logic behind your being prompted to enter your password when you install a new program.
With sudo, the command is:
sudo pip install beautifulsoup4
Using Beautiful Soup in a Python Script
There are two basic steps to using Beautiful Soup in your Python script. The first is to import the library at the beginning of your script:
from bs4 import BeautifulSoup
Second, you have to pass the document or URL to Beautiful Soup to make the “soup.” For this example, we will use a locally saved file and create the soup this way:
soup = BeautifulSoup(open("example.txt"))
This creates a large soup object out of the content of our “example.txt” file and we can then run the Beautiful Soup methods on that object.
Application: Extracting names and URLs from an HTML page
Preview: Where we are going
Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page, where the raw HTML looks like this:

[screenshots: the raw HTML of the search results page, and the clean CSV of names and links we will end with]

The route between the two may look involved, but follow along to understand how Beautiful Soup gets us to that point.
Get a file to scrape
The first step is getting the files for scraping. This can be done in a variety of ways. Usually, I would recommend scraping using wget or cURL (see my slides on an introduction to web scraping). To do this, use wget or cURL in Terminal and point it at the particular webpage or folder that you want to download.
However, the Congressional database is a bit more complicated because the URL for particular search results is hidden. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and save a copy of the results webpage. Selecting “File” and “Save Page As …” from your browser window will accomplish this. For a filename, avoid spaces – I am using “43rd-congress.html”. Move the file into the folder you want to work in and let’s proceed.
One of the first things Beautiful Soup can help us with is getting a sense of how the different HTML tags are nested within each other. This can be very useful when you need to isolate content that is buried within the HTML structure as Beautiful Soup allows you to select content based upon tag within tag within tag (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document).
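To make the tag-within-tag idea concrete, here is a minimal sketch using a toy, hypothetical snippet of markup:

```python
from bs4 import BeautifulSoup

# A toy document (hypothetical markup) to show tag-within-tag navigation
html_doc = "<html><body><p>Some <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Walks to the first <b> inside the first <p> inside <body>
print(soup.body.p.b)  # prints <b>bold</b>
```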
To get a good view of how the tags are nested in the document, we use the method “prettify” on our soup object.
Create a new file called soupexample.py. This file will contain your Python script that we will be developing over the course of the tutorial.
In this file we need to import the Beautiful Soup library, open the file and pass it to Beautiful Soup, and then print the pretty version in the terminal.
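A sketch of the script at this stage; the inline sample here is a trimmed, hypothetical stand-in for the saved results page so the sketch runs on its own (in practice, pass open("43rd-congress.html") to BeautifulSoup instead):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the saved page; in practice use:
# soup = BeautifulSoup(open("43rd-congress.html"))
sample = """<html><body>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")

# Print an indented view of how the tags nest
print(soup.prettify())
```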
Save this file in the folder with your text file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:

python soupexample.py
You should see your terminal window fill up with a nicely indented version of the original html text. This is a clean picture of how the various tags relate to one another.
Using Beautiful Soup to select particular content
So, we are interested in the links and names of the various members of the 43rd Congress. Looking at the ”pretty” version of the file, the first thing to notice is that this is a relatively flat file – our tags are not too deeply embedded within each other.
While this makes some of the identifying more difficult, we are interested in the names and URLs, and all of these are, most fortunately, embedded in “<a>” tags. So, we need to isolate all of the “<a>” tags, which we can do by updating the code in “soupexample.py”.
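Updated to collect the anchor tags, the script might look like this (again using a trimmed, hypothetical stand-in for the saved “43rd-congress.html”):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the saved page; in practice use:
# soup = BeautifulSoup(open("43rd-congress.html"))
sample = """<html><body>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")

# Collect every <a> tag in the document and print each one
links = soup.find_all('a')
for link in links:
    print(link)
```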
Save and run the script again to see all of the anchor tags in the document.
One thing to notice is that there is an additional link in our file – the link for an additional search. We can get rid of this with just a line or two of additional code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.
Because Beautiful Soup allows us to modify the data, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.
To do this, we can use the “decompose” method, though be careful with this method lest you remove more information than you intend. Update the file accordingly and run it again.
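With the decompose step added, the script might look like this (trimmed, hypothetical stand-in for the saved file; soup.p.a reaches the first “<a>” inside the first “<p>”, which is exactly the search link we want to drop):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the saved page; in practice use:
# soup = BeautifulSoup(open("43rd-congress.html"))
sample = """<html><body>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")

# The unwanted search link is the only <a> inside a <p>,
# so soup.p.a reaches it directly; decompose() removes it from the soup
final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    print(link)
```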
And success! We have isolated all of the links we want and none of the links we don’t!
Writing content to CSV file
But displaying these things in the Terminal is less than useful. And, the html tags are still surrounding all of our data. Let’s strip away the tags and save the data into a file.
To save the data to a file rather than print it in the Terminal, we need to add a couple of lines of code. First, while printing each link directly worked in the terminal, we now want to add a newline after each entry, so we need to convert the tag object to a string. To do so, we use “str(link)”.
Because we are inside the loop when we are writing the data to the file, we need to tell the program to append the data to the file rather than just write. If we used “w” instead of “a” we would be rewriting the file on each pass. Not very effective.
Modify the soupexample.py file accordingly and run it again.
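A sketch of the file-writing version (same hypothetical stand-in for the saved file; note the “a” append mode and the str() conversion discussed above):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the saved page; in practice use:
# soup = BeautifulSoup(open("43rd-congress.html"))
sample = """<html><body>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")

# Remove the search link before collecting the anchors
soup.p.a.decompose()

links = soup.find_all('a')
for link in links:
    # "a" appends on every pass; "w" here would overwrite the file
    # each time through the loop, leaving only the last link
    f = open("43rd-congress.txt", "a")
    f.write(str(link) + "\n")  # str() lets us concatenate the newline
    f.close()
```

(Opening the file once before the loop would also work; this version matches the append-inside-the-loop reasoning above.)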
You should now have a file entitled “43rd-congress.txt” in your folder. If anything went wrong and you need to re-run this code, delete this .txt file first. Otherwise, the new data will just append to the end of whatever you created on the first pass. (It is a very obedient script.)
If you open “43rd-congress.txt” you will notice that your html tags are still in place. So, for our last step, we will remove the html tags and create a new .csv file.
In order to clean up the html tags and split the URLs from the names, we need to perform some operation on each line in the list of links. We also need to create a place to store all of the data we retrieve from those lines.
To do this, first create an empty list:
clean_list = []
Next, create a loop to operate on each line in the list of links:
for link in links:
Now from each line in “links” we want both the name of the person and the URL for their bioguide entry. To pull out this data we will create a variable for the name and a variable for the URL. To isolate the URL, we tell Python to select the content that is associated with the key “href”, and to select the name, we tell Python to get the contents of the tag (the information in between “<a>” and “</a>”).
single_link = link['href']
name = link.contents
To save the information outside of the loop, we must add it to the list we created. While there are many ways to do this, I’ve chosen to create a new variable “entry” that is composed of two strings, the first of which is a “name” and the second is a link. Each “entry” is then added into the list. Notice that the separating comma is included in the string we are creating.
entry = "%s, %s" % (name, single_link)
clean_list.append(entry)
Finally, to print from the newly created list, we will use the .join method (think implode in PHP) to create a string from all of the items in the “clean_list”, separating each line with a return (“\n”).
This leaves us with our complete script.
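A sketch of that complete script (with a trimmed, hypothetical stand-in for “43rd-congress.html”; note that .contents returns a list, so this sketch takes its first item as the name):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the saved page; in practice use:
# soup = BeautifulSoup(open("43rd-congress.html"))
sample = """<html><body>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")

# Remove the search link before collecting the anchors
soup.p.a.decompose()

clean_list = []
for link in soup.find_all('a'):
    single_link = link['href']  # the URL stored in the href attribute
    name = link.contents[0]     # .contents is a list; item 0 is the link text
    entry = "%s, %s" % (name, single_link)
    clean_list.append(entry)

# Written once, outside the loop, so "w" mode is safe here
f = open("43rd-congress.csv", "w")
f.write("\n".join(clean_list))
f.close()
```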
When executed, it gives us a clean CSV file that we can then use for other purposes.
And that is an example of how to use Beautiful Soup to save particular content out of an HTML document.