I’m going to post this tutorial in two parts. Part 1 is an introduction to Beautiful Soup, with an example of using the library to extract particular data from an HTML file and save it as a CSV file. Part 2 will cover a slightly more complicated case: creating a CSV file from more generic table data.
What is Beautiful Soup?
Overview
“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.” (Opening lines of Beautiful Soup)
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup. It provides a way to extract particular content from a webpage by its location in the HTML tags (think XPath and navigating the DOM), by CSS selector, or by HTML id or class (or some combination thereof).
Say there is a website that contains data relevant to your research, such as date or address information, or links to other sources of data that you also want to scrape. Beautiful Soup offers easy ways to pull that particular content from the webpage, remove it from its HTML wrappings, and put it into a new context where you can run whatever operation comes next.
I highly recommend looking at the Beautiful Soup documentation pages to get a sense of the variety of things you can do with simple Beautiful Soup commands, from isolating titles and links to extracting all of the text from the HTML tags to altering the HTML within the document you’re working with.
Installing Beautiful Soup
Installing Beautiful Soup is easiest if you already have pip or another Python installer in place. If you don’t have pip, start with Fred’s tutorial on installing Python modules. Once you have pip installed, run the following command to install Beautiful Soup:
pip install beautifulsoup4
You may need to include “sudo” in your command. Sudo runs the command with administrator privileges, which lets it write to system directories, and asks you to re-enter your password. This is the same logic behind being prompted for your password when you install a new program.
With sudo, the command is:
sudo pip install beautifulsoup4
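To confirm that the install worked, one quick check is to import the package and print its version from Python. This is a minimal sketch; it assumes only that version 4 installs under the package name “bs4”, which matches the import used in the rest of this tutorial.

```python
import bs4

# Version 4 of the library installs under the package name "bs4".
installed_version = bs4.__version__
print(installed_version)
```

If the import fails, or the version does not start with 4, the code in this tutorial may need the older version 3 syntax instead.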

Using Beautiful Soup in a Python Script
There are two basic steps to using Beautiful Soup in your Python script. The first is to import the library at the beginning of your script by writing:
from bs4 import BeautifulSoup
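Second, you have to pass the document to Beautiful Soup to make the “soup.” Beautiful Soup takes an open file or a string of markup; it does not download a URL for you, so a page on the web has to be fetched first. For this example we will be using a locally saved file. The second argument names the parser, and “html.parser” is the one that ships with Python. We create the soup this way:
soup = BeautifulSoup(open("example.txt"), "html.parser")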
This creates a large soup object out of the content of our “example.txt” file and we can then run the Beautiful Soup methods on that object.
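As a hedged variation on the line above, the file can be opened inside a “with” block so that it is closed as soon as the soup has been made. The sketch writes its own tiny stand-in for “example.txt” into a temporary folder (the markup is made up) so that it runs anywhere.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a tiny, made-up stand-in for "example.txt" so the sketch runs anywhere.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("<html><head><title>Example page</title></head>"
                 "<body><p>Hello</p></body></html>")

# The with-block closes the file as soon as the soup has been made.
with open(path, encoding="utf-8") as handle:
    soup = BeautifulSoup(handle, "html.parser")

# The soup object can now be queried with Beautiful Soup methods.
page_title = soup.title.string
print(page_title)
```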
Application: Extracting names and URLs from an HTML page
Preview: Where we are going
Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page where the html page looks like this:
to a csv file with names and urls that looks like this:
The finished code is:
but follow along to understand how Beautiful Soup gets us to that point.
Get a file to scrape
The first step is getting the files for scraping. This can be done in a variety of ways. Usually, I would recommend scraping using wget or cURL (see my slides on an introduction to webscraping). To do this, use wget or cURL in Terminal and point it at the particular webpage or folder that you want to download.
However, the Congressional database is a bit more complicated because the URL for particular search results is hidden. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress 43 (the 43rd Congress, 1873-1875), and save a copy of the results page. Selecting “File” and “Save Page As …” from your browser window will accomplish this. For a filename, avoid spaces – I am using “43rd-congress.html”. Move the file into the folder you want to work in and let’s proceed.
Identify content
One of the first things Beautiful Soup can help us with is getting a sense of how the different HTML tags are nested within each other. This can be very useful when you need to isolate content that is buried within the HTML structure as Beautiful Soup allows you to select content based upon tag within tag within tag (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document).
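To make the tag-within-tag idea concrete, here is a small sketch on a made-up snippet of HTML (not the Congressional page). Each dotted step returns the first matching tag inside the previous one.

```python
from bs4 import BeautifulSoup

# Made-up markup with two paragraphs and three bold items.
html = ("<html><body>"
        "<p>First <b>bold</b> and <b>second bold</b></p>"
        "<p><b>later</b></p>"
        "</body></html>")
soup = BeautifulSoup(html, "html.parser")

# body -> first <p> inside it -> first <b> inside that paragraph
first_bold = soup.body.p.b

print(first_bold)         # the tag itself, wrappings included
print(first_bold.string)  # only the text inside the tag
```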
To get a good view of how the tags are nested in the document, we use the method “prettify” on our soup object.
Create a new file called soupexample.py. This file will contain your Python script that we will be developing over the course of the tutorial.
In this file we need to import the Beautiful Soup library, open the file and pass it to Beautiful Soup, and then print the pretty version in the terminal.
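A minimal Python 3 sketch of those three steps follows. So that it runs without the saved page, it parses a short made-up sample; the names, URLs, and layout of SAMPLE_HTML are assumptions for illustration, not the real Bioguide markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved results page; names and URLs are made up.
SAMPLE_HTML = """<html><body>
<table>
<tr><td><a href="http://example.com/bio/1">DOE, Jane</a></td><td>1820-1890</td></tr>
</table>
<p><a href="http://example.com/search">Search Again</a></p>
</body></html>"""

# In soupexample.py, pass open("43rd-congress.html", "rb") instead of SAMPLE_HTML.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# prettify() returns the markup as a string, indented to show the nesting.
pretty = soup.prettify()
print(pretty)
```

Opening the saved page in binary mode (“rb”) leaves the job of working out the character encoding to Beautiful Soup.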
Save this file in the folder with your saved HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:
python soupexample.py
You should see your terminal window fill up with a nicely indented version of the original html text. This is a clean picture of how the various tags relate to one another.
Using BeautifulSoup to select particular content
So, we are interested in the links and names of the various Members of the 43rd Congress. Looking at the “pretty” version of the file, the first thing to notice is that this is a relatively flat file – our tags are not too deeply embedded within each other.
While this makes some of the identifying more difficult, the names and urls we are interested in are, most fortunately, all embedded in “<a>” tags. So, we need to isolate all of the “<a>” tags. We can do this by updating the code in “soupexample.py” to the following:
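A sketch of that update, again on a made-up sample so it can be run as-is (SAMPLE_HTML and its contents are illustrative assumptions). The “find_all” method returns every matching tag in the document as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved results page; names and URLs are made up.
SAMPLE_HTML = """<html><body>
<table>
<tr><td><a href="http://example.com/bio/1">DOE, Jane</a></td><td>1820-1890</td></tr>
<tr><td><a href="http://example.com/bio/2">ROE, Richard</a></td><td>1815-1880</td></tr>
</table>
<p><a href="http://example.com/search">Search Again</a></p>
</body></html>"""

# In soupexample.py, pass open("43rd-congress.html", "rb") instead of SAMPLE_HTML.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect every <a> tag, wherever it sits in the document.
links = soup.find_all("a")
for link in links:
    print(link)
```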
Save and run the script again to see all of the anchor tags in the document.
python soupexample.py
One thing to notice is that there is an additional link in our file – the link for an additional search. We can get rid of this with just a line or two of additional code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.
Because Beautiful Soup allows us to modify the data, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.
To do this, we can use the “decompose” method, which removes a tag and everything inside it from the soup. Be careful with this method lest you remove more information than you intend. Update the file to remove that link and run again.
And success! We have isolated all of the links we want and none of the links we don’t!
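The removal can be sketched as follows, on the same kind of made-up sample (SAMPLE_HTML is an illustrative assumption). It relies on the extra link being the first “<a>” inside the first “<p>” in the document, which is worth checking against the pretty version before running it on a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved results page; names and URLs are made up.
SAMPLE_HTML = """<html><body>
<table>
<tr><td><a href="http://example.com/bio/1">DOE, Jane</a></td><td>1820-1890</td></tr>
<tr><td><a href="http://example.com/bio/2">ROE, Richard</a></td><td>1815-1880</td></tr>
</table>
<p><a href="http://example.com/search">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.p is the first <p> in the document; soup.p.a is the first <a> inside it.
extra_link = soup.p.a

# decompose() removes that tag, and everything inside it, from the soup.
extra_link.decompose()

# Only the links in the table are left to find.
links = soup.find_all("a")
for link in links:
    print(link)
```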
Writing content to CSV file
But displaying these things in the Terminal is less than useful. And, the html tags are still surrounding all of our data. Let’s strip away the tags and save the data into a file.
To save the data to a file rather than print it in the Terminal, we need to add a couple of lines of code. First, while printing each tag directly (each “p” in the loop) worked in the Terminal, we now want to add a return after each entry, so we need to convert the tag object to a string. To do so, we use “str(p)”.
Because we are inside the loop when we write the data to the file, we need to tell the program to append the data to the file rather than just write. If we used “w” instead of “a” we would be overwriting the file on each pass. Not very effective.
Modify the soupexample.py file as follows and run again.
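A hedged sketch of this step is below. It uses a made-up sample page and writes into a temporary folder so that it leaves nothing behind; both are assumptions for illustration, and soupexample.py would simply use “43rd-congress.txt” in the working folder.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved results page; names and URLs are made up.
SAMPLE_HTML = """<html><body>
<table>
<tr><td><a href="http://example.com/bio/1">DOE, Jane</a></td><td>1820-1890</td></tr>
<tr><td><a href="http://example.com/bio/2">ROE, Richard</a></td><td>1815-1880</td></tr>
</table>
<p><a href="http://example.com/search">Search Again</a></p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
soup.p.a.decompose()  # drop the extra search link first
links = soup.find_all("a")

# Temporary location for the sketch; the tutorial writes to the working folder.
output_path = os.path.join(tempfile.mkdtemp(), "43rd-congress.txt")

for link in links:
    # "a" appends; "w" here would wipe the file on every pass of the loop.
    with open(output_path, "a", encoding="utf-8") as output_file:
        # str() turns the tag into text so a return can be added after it.
        output_file.write(str(link) + "\n")

# Read the file back to see what was saved.
with open(output_path, encoding="utf-8") as output_file:
    saved_lines = output_file.read().splitlines()
print(saved_lines)
```

A common alternative is to open the file once, before the loop, in “w” mode. That avoids reopening the file on every pass and starts from an empty file on every run.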
You should now have a file entitled “43rd-congress.txt” in your folder. If anything went wrong and you need to re-run this code, delete this .txt file first. Otherwise, the new data will just append to the end of whatever you created on the first pass. (It is a very obedient script.)
If you open “43rd-congress.txt” you will notice that your html tags are still in place. So, for our last step, we will remove the html tags and create a new .csv file.
In order to clean up the html tags and split the URLs from the names, we need to perform some operation on each line in the list of links. We also need to create a place to store all of the data we retrieve from those lines.
To do this, first create an empty list:
clean_list = []
Next, create a loop to operate on each line in the list of links:
for link in links:
Now from each line in “links” we want both the name of the person and the URL for their bioguide entry. To pull out this data we will create a variable for names and a variable for the URL. To isolate the URL, we tell Python to select the content that is associated with the key “href”, and to select the name, we tell Python to get the contents of the tag (the information in-between “<a>” and “</a>”).
    single_link = link['href']
    name = link.contents[0]
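The two lookups work differently, so here they are separately on a single made-up anchor tag (the name and URL are placeholders). A tag’s attributes are read like dictionary keys, while “.contents” is a list of everything inside the tag, so “[0]” picks the first item in that list.

```python
from bs4 import BeautifulSoup

# One made-up anchor tag, shaped like the links in the results table.
soup = BeautifulSoup('<a href="http://example.com/bio/1">DOE, Jane</a>',
                     "html.parser")
link = soup.a

# Attributes are read like dictionary keys.
single_link = link["href"]

# .contents is a list of the tag's children; here the only item,
# at index 0, is the text of the name.
name = link.contents[0]

print(link.contents)
print(name, single_link)
```

If a link might hold nested tags or lack an “href”, “link.get_text()” and “link.get("href")” are more forgiving: the first gathers all of the text inside the tag, and the second returns None instead of raising an error.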
To save the information outside of the loop, we must add it to the list we created. While there are many ways to do this, I’ve chosen to create a new variable “entry” that is composed of two strings, the first of which is a “name” and the second is a link. Each “entry” is then added into the list. Notice that the separating comma is included in the string we are creating.
    entry = "%s, %s" % (name, single_link)
    clean_list.append(entry)
Finally, to print from the newly created list, we will use the .join method (think implode in PHP) to create a string from all of the items in the “clean_list”, separating each line with a return (“\n”).
This leaves us with a script that looks like this:
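A minimal Python 3 sketch of that script, pieced together from the steps above. The function name “build_csv_text”, the inline sample page, and the output name “43rd-congress.csv” are illustrative assumptions rather than the original code.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved results page; names and URLs are made up.
SAMPLE_HTML = """<html><body>
<table>
<tr><td><a href="http://example.com/bio/1">DOE, Jane</a></td><td>1820-1890</td></tr>
<tr><td><a href="http://example.com/bio/2">ROE, Richard</a></td><td>1815-1880</td></tr>
</table>
<p><a href="http://example.com/search">Search Again</a></p>
</body></html>"""


def build_csv_text(markup):
    """Return one "name, url" line per link left in the page."""
    soup = BeautifulSoup(markup, "html.parser")

    # Remove the extra search link that sits inside the first <p>.
    soup.p.a.decompose()

    links = soup.find_all("a")
    clean_list = []
    for link in links:
        single_link = link["href"]
        name = link.contents[0]
        entry = "%s, %s" % (name, single_link)
        clean_list.append(entry)

    # Join the entries into one string, one entry per line.
    return "\n".join(clean_list)


csv_text = build_csv_text(SAMPLE_HTML)

# Temporary location for the sketch; a real run would write to the working folder.
output_path = os.path.join(tempfile.mkdtemp(), "43rd-congress.csv")
with open(output_path, "w", encoding="utf-8") as output_file:
    output_file.write(csv_text + "\n")

print(csv_text)
```

To run it against the saved page, pass the opened file in place of the sample: build_csv_text(open("43rd-congress.html", "rb")).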
When executed, this script gives us a clean CSV file that we can then use for other purposes.
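One caveat: because each name already contains a comma, joining name and URL with a comma spreads the name over more than one CSV field, and a suffix such as “Jr.” adds yet another. As an alternative to the string formatting used above (a swapped-in technique, not the method this tutorial follows), Python’s built-in csv module quotes such fields so that each name stays in one column. The rows and the header names below are made up for illustration.

```python
import csv
import io

# Made-up (name, url) pairs; the names contain commas on purpose.
rows = [
    ("DOE, Jane", "http://example.com/bio/1"),
    ("ROE, Richard, Jr.", "http://example.com/bio/2"),
]

# An in-memory buffer stands in for a file opened with
# open("43rd-congress.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "url"])  # hypothetical header row
writer.writerows(rows)            # fields containing commas are quoted

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that each name survives as a single field.
parsed = list(csv.reader(io.StringIO(csv_text)))
```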
And that is an example of how to use Beautiful Soup to save particular content out of an HTML document.
Thank you for this tutorial, and especially for the example, which was very useful for my purposes! I was able to replicate it today.
Your instructions for downloading and importing Beautiful Soup are very clean and helpful, given that there are things you mention that are not mentioned in the documentation. (Love the SUDO cartoon, which doubles as a well-needed laugh and an aid for comprehension!). I just have a few suggestions regarding the organization and some places where extra explanation would provide further clarity:
1 – In your section “Using BeautifulSoup to select particular content,” it would be helpful to show a snippet of the “prettified” HTML code to demonstrate how you walked through the DOM to get soup.p.a. This is minor, because I got it! But a snippet would help me avoid bouncing back and forth between the tutorial and scrolling through Terminal.
2 – In this same section, it might be helpful to explain decompose(), or at least provide a link to the documentation. (Again, lazy readers don’t want to have to bounce back and forth between the documentation and the tutorial too much. As well, you’re right: improperly used, this command could destroy everything! So understanding what it does is key.)
3 – For the line “name = link.contents[0]”, I think the explanation gets lost with the explanation of the code snippet above it. We’ve worked with arrays enough to understand this, but I needed your in-person explanation, so it might be helpful to explain it separately.
4 – For the section “Writing content to CSV file,” you actually might want to run through all the code for stripping all the tags as a separate sub-header before you teach us to write to a csv file (I understand why stripping the tags is part of writing the csv file because it helps delineate the fields in the csv file). However, I read the sub-header “Writing content to a CSV file” and expected to jump right into the csv file, not stripping more tags.
5 – I used the 44th Congress (1875-1877), which happens to have a few folks with suffixes (Jr., Sr., etc.) These wrote to a separate field in the csv file. They can be part of the first name, which I suspect will change the strings in the line “entry = . . .”.
6 – As an author and a primary user, I am very familiar with Bioguide. But, you might want to provide a screenshot of the initial search form with the Congress typed in under the heading “Get a file to scrape.” It’s been my experience in explaining Bioguide that people see “Year” for that blank and stop without understanding that typing “43” will get 43rd Congress. (Not your fault, part of 15-year-old Bioguide’s charm.)
7 – Some congressional parlance: Member (as in Member of Congress) is always capitalized. This is something specific to how my office does things, but we include year spans with the first mention of a Congress, as in 43rd Congress (1873-1875) because most people think in years and not Congresses. (A cheat sheet should appear on our new website.) But these are minor corrections.
8 – I am working on a way to programmatically “search” for a Congress (or Congresses) from the initial search form and scrape the result pages. Should I be successful, it will mesh nicely with what you’ve done.
THANKS!
Just realized I never “officially” replied! Thank you for these – I have incorporated most of them in draft 2. I didn’t catch where I used “Member” in a non-capitalized manner though.
Thanks for this helpful introduction! Caveat: I just read through the tutorial to see if I could understand it, and didn’t try to execute the code. But I was thrown for a loop when you changed the “people” variable in your second gist to “links” in your third gist. Only while expecting that change did I realize that in gist #2 you are operating on the html file, while in gist #3 you are operating on the text file created by gist #2.
I think this confused me because I was expecting gist #3 (also presented as the “finish line” snapshot at the beginning) to contain all of the code necessary to pull off the objective. Why not somehow incorporate the operations on the html file directly into the final code, so that gist #3 gets you all the way from start to finish?
Apologies if there is a technical reason why this isn’t possible. I’m very new to Python and also, as I said, read this quickly.
Hi Caleb,
Thank you for the suggestions! I agree that being more consistent would be clearer – I will adjust that in the next draft.
The content of gist #3 is indeed supposed to contain all the code – it appears that I made a tragic typo in naming the initial file in that last gist. It should be .html throughout. Thank you for catching that! (I’ve adjusted the gist already since that’s a rather consequential typo.)
Thank you again!
Nice tutorial! Thanks for making it. I get the following error when I run this code:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup (open("C://congress.html"))
links = soup.find_all('a')
for link in links:
print link
ERROR:
links = soup.find_all('a')
TypeError: 'NoneType' object is not callable
Is it because I have the older version of BeautifulSoup?
Hi Josh,
It may indeed be that an older version had a slightly different syntax. I looked at the version 3 documentation – try this:
links = soup.findAll('a')
Also, I’m guessing this is just a copy paste thing, but make sure you have “print link” indented.
Good luck! Let me know if that doesn’t work – I am learning all of this as well!
Jeri
Jeri,
Thanks a lot! I’m learning this stuff slowly but surely
Will look at the documentation next time. If you know of any other tutorials I’d love to do them!
hey you might wanna fix the “Powered by WordPress Plugins – Get the full version!”
Some of the code is not visible. Thanks
Jeri,
this is a very useful and well-documented tutorial – thanks a bunch. You might consider linking to its second part (I suspected it was there but had to google it).