I’m going to post this tutorial in two parts. Part 1 will be an introduction to Beautiful Soup and an example of using this library to extract particular data from an HTML file and save it as a CSV file. Part 2 will use the same techniques to create a CSV file from all of the data contained in the table.
What is Beautiful Soup?
“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.” (Opening lines of Beautiful Soup)
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup. It provides a way to extract particular content from a webpage by its location in the HTML tags (think XPaths and navigating the DOM), by CSS selector, or by HTML id or class attributes (or some combination thereof).
So say there is a website that contains data that is relevant to your research, such as date or address information, or link information for other sources of data that you also want to scrape. Beautiful Soup offers easy ways to pull that particular content from the webpage, remove it from its html wrappings, and put it into a new context that allows you to do whatever you want with it next.
I highly recommend looking at the Beautiful Soup documentation pages to get a sense of the variety of things you can do with simple Beautiful Soup commands, from isolating titles and links to extracting all of the text from the html tags to altering the html within the document you’re working with.
Installing Beautiful Soup
Installing Beautiful Soup is easiest if you already have pip or another Python package installer in place. If you don’t have pip, start with Fred’s tutorial on installing Python modules. Once pip is installed, run the following command to install Beautiful Soup:
pip install beautifulsoup4
You may need to prefix the command with “sudo.” Sudo runs the command with administrator privileges so that pip can write to your system directories, and it requires you to re-enter your password. This is the same logic behind being prompted for your password when you install a new program.
With sudo, the command is:
sudo pip install beautifulsoup4
Using Beautiful Soup in a Python Script
There are two basic steps to using Beautiful Soup in your python script. First is to import the library at the beginning of your script by writing:
from bs4 import BeautifulSoup
Second, you have to pass the document (or the HTML you have downloaded from a URL) to Beautiful Soup to make the “soup.” For this example we will be using a locally saved file and will create the soup this way:
soup = BeautifulSoup(open("example.txt"), "html.parser")
This creates a large soup object out of the content of our “example.txt” file and we can then run the Beautiful Soup methods on that object.
Application: Extracting names and URLs from an HTML page
Preview: Where we are going
Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page, an HTML table of Congress members’ names and links, to a clean CSV file with one name and URL per row. The code to get there may look dense at first, but follow along to understand how Beautiful Soup gets us to that point.
Get a file to scrape
The first step is getting the files for scraping. This can be done in a variety of ways. Usually, I would recommend scraping using wget or cURL (see my slides on an introduction to webscraping). To do this, use wget or cURL in Terminal and point it at the particular webpage or folder that you want to download.
However, the Congressional database is a bit more complicated because the URL for particular search results is hidden. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and to save a copy of the webpage of results.
Selecting “File” and “Save Page As …” from your browser window will accomplish this. For a filename, avoid spaces – I am using “43rd-congress.html”. Move the file into the folder you want to work in and let’s proceed.
One of the first things Beautiful Soup can help us with is getting a sense of how the different HTML tags are nested within each other. This can be very useful when you need to isolate content that is buried within the HTML structure as Beautiful Soup allows you to select content based upon tag within tag within tag (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document).
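As a quick illustration of this tag-within-tag navigation, here is a minimal sketch using a made-up snippet of HTML (the document and tag names are just examples, not from the Congress page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document to demonstrate navigating nested tags
html = "<html><body><p>Some <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Chained attribute access walks tag within tag: the first <b>
# inside the first <p> inside the <body>
print(soup.body.p.b)  # <b>bold</b>
```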
To get a good view of how the tags are nested in the document, we can use the method “prettify” on our soup object.
Create a new file called soupexample.py. This file will contain your Python script that we will be developing over the course of the tutorial.
In this file we need to import the Beautiful Soup library, open the file and pass it to Beautiful Soup, and then print the pretty version in the terminal.
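A sketch of what soupexample.py might contain at this point; the inline sample string here stands in for the saved page, and in your own script you would pass open("43rd-congress.html") instead:

```python
from bs4 import BeautifulSoup

# In your script, pass the saved page instead of this inline sample:
# soup = BeautifulSoup(open("43rd-congress.html"), "html.parser")
sample = "<html><body><table><tr><td><a href='#'>A link</a></td></tr></table></body></html>"
soup = BeautifulSoup(sample, "html.parser")

# prettify() returns the document with one tag per line, indented
# to show how the tags nest inside one another
print(soup.prettify())
```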
Save this file in the folder with your HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:

python soupexample.py
You should see your terminal window fill up with a nicely indented version of the original html text. This is a clean picture of how the various tags relate to one another.
Using BeautifulSoup to select particular content
So, we are interested in the links and names of the various members of the 43rd Congress. Looking at the ”pretty” version of the file, the first thing to notice is that this is a relatively flat file – our tags are not too deeply embedded within each other.
While this makes some of the identifying more difficult, we are interested in the names and URLs, and all of these are, most fortunately, embedded in “<a>” tags. So, we need to isolate all of the “<a>” tags. We can do this by updating the code in “soupexample.py” as follows:
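Here is a sketch of the updated script. The inline sample below is made up to mimic the structure of the saved results page (member links in a table, plus a “search again” link inside a “<p>” tag); in your own script you would keep passing open("43rd-congress.html"):

```python
from bs4 import BeautifulSoup

# Made-up sample mimicking the saved results page
sample = """
<html><body>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
</body></html>
"""
soup = BeautifulSoup(sample, "html.parser")

# find_all('a') returns a list of every <a> tag in the document
for link in soup.find_all('a'):
    print(link)
```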
Save and run the script again to see all of the anchor tags in the document.
One thing to notice is that there is an additional link in our file – the link for an additional search. We can get rid of this with just a line or two of additional code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.
Because Beautiful Soup allows us to modify the data, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.
To do this, we can use the “decompose” method, which removes a tag and its contents from the soup entirely. Update the file as below and run again.
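Sketching this step with the same made-up sample as before (again, your own script would open "43rd-congress.html" instead of using the inline string):

```python
from bs4 import BeautifulSoup

# Made-up sample: member links in a table, plus a stray search
# link inside a <p> tag
sample = """
<html><body>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
</body></html>
"""
soup = BeautifulSoup(sample, "html.parser")

# soup.p.a is the first <a> inside the first <p>; decompose()
# erases it from the tree before we search for the rest
soup.p.a.decompose()

for link in soup.find_all('a'):
    print(link)
```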
And success! We have isolated out all of the links we want and none of the links we don’t!
Stripping Tags and Writing Content to a CSV file
Displaying these things in the Terminal is less than useful. And, the html tags are still surrounding all of our data. Let’s strip away the tags and save the data into a file.
In order to clean up the html tags and split the URLs from the names, we need to isolate the information from the html tags. To do this, we will use two powerful and commonly used Beautiful Soup tools: the contents attribute and the get method.
Here is the file – I will explain the different pieces below.
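The sketch below follows the steps described in this section, again with a made-up inline sample standing in for open("43rd-congress.html"); the variable names “names” and “fullLinks” match the discussion that follows:

```python
import csv
from bs4 import BeautifulSoup

# Made-up sample standing in for open("43rd-congress.html")
sample = """
<html><body>
<p><a href="http://bioguide.congress.gov/biosearch/biosearch.asp">Search Again</a></p>
<table>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a></td></tr>
<tr><td><a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a></td></tr>
</table>
</body></html>
"""
soup = BeautifulSoup(sample, "html.parser")
soup.p.a.decompose()  # drop the stray search link first

for link in soup.find_all('a'):
    names = link.contents[0]      # the text inside the <a> tag
    fullLinks = link.get('href')  # the value of the href attribute
    # 'a' (append) rather than 'w' (write), because we add one row
    # per pass through the loop
    with open("43rd-congress.csv", "a", newline="") as f:
        csv.writer(f).writerow([names, fullLinks])
```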
The first change we’ve made is to add “import csv” to the beginning of the file, because we are going to use the csv library to write the file. The csv module ships with Python, so there is nothing extra to download.
The second change comes in the for loop. Instead of merely printing all of the content of “link,” we are identifying the pieces of information that we want: to isolate the names, we use “contents,” and for the links, we use “get.”
Contents isolates the text from within html tags. For example, if you started with “<h2>This is my Header text</h2>,” you would be left with a list containing “This is my Header text” after reading contents. In this case, we are taking the first element of that list. (There is only one element in our list at the moment, but the computer is ever literal and needs to be told where to look.)
Get is a method for pulling an attribute value out of a tag. Here we are getting the value of the “href” attribute.
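A tiny demonstration of the difference, using a single made-up link (the URL and text are just examples):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com">Example Site</a>', "html.parser")
link = soup.a

# contents is a list of everything inside the tag; [0] is the text
print(link.contents[0])   # Example Site
# get() looks up an attribute of the tag by name
print(link.get('href'))   # http://example.com
```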
Finally, we are using the csv library to write the file. Because we are executing this within the loop, we need to append (‘a’) rather than write (‘w’) to the file. This syntax tells the computer to include the data from names and the data from fullLinks on each row, separated by a comma.
When executed, this gives us a clean CSV file that we can then use for other purposes.
And that is an example of how to use Beautiful Soup to save particular content out of an HTML document. I will build on this example to save all of the data from the table into a CSV file in part 2.