Beautiful Soup Tutorial (Part 2, Draft 1)

In the first part of this tutorial, we extracted the names and links from the webpage. In this part, we will go one step further and move all of the table data into a CSV file so that we can more easily use it elsewhere.

Reviewing the Challenge

Back to the HTML

Let’s review again the file that we’re attempting to extract data from.

When we were looking for names, all of the data that we wanted was contained within the anchor tags, which allowed us to make a targeted search. Now, all of the data we want is contained in the html table structure. Getting this data out is the puzzle we’re going to solve.

Previewing the Final Product

We know what the html file looks like. The CSV file will look as follows:

And this is the code that will get us there:

Writing the Script

The Problem of Extra Data

This is the code we had from the end of Part I:

In Part I, the extra data was a stray anchor tag, which gave us an extra line in the output.

That problem still exists, and now there is a second one: a table at the top of the file holds styling data rather than records. We can remove it with another decompose line, identifying the table by its color information, since this is the only place in the file where color information appears.

rogue = soup.find(bgcolor="#990000")
rogue.decompose()

These two lines find the first tag that carries the color information bgcolor="#990000" and remove it, along with everything inside it. We know we don’t want any of this information, so we can decompose it.
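
The find-then-decompose step can be tried on its own with a few lines of Python 3. This is only a sketch: the HTML string and the names in it are invented for illustration and are not the real 43rd-congress.html, and passing "html.parser" is a choice made here rather than something the tutorial specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a styling table, then a data table.
html = """
<table bgcolor="#990000"><tr><td><a href="/home">Home</a></td></tr></table>
<table><tr><td><a href="/bio/1">Example, Alice</a></td><td>1873-1875</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag whose bgcolor attribute matches, or None.
rogue = soup.find(bgcolor="#990000")
if rogue is not None:
    rogue.decompose()  # removes the tag and everything inside it

# Only the link from the data table should be left.
remaining_links = [a.get("href") for a in soup.find_all("a")]
print(remaining_links)
```

The check for None is an addition in this sketch. Without it, decompose would raise an error on a page that has no matching tag.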

Identifying the Parts

We know that everything we do want for our CSV file lives within table row (“tr”) tags. We also know that these items appear in the same order within each row. Because the cells of a row come back as a list, we can identify each piece of information by its place in that list. This means that the first cell in a row is identified by [0], the second by [1], etc.
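
To see the positions in action, here is a short Python 3 sketch. The single row below is made up for illustration; only the idea of picking cells by index comes from the tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical single row, invented for this example.
row_html = (
    "<table><tr>"
    "<td>Example, Alice</td><td>1873-1875</td><td>Representative</td>"
    "</tr></table>"
)

soup = BeautifulSoup(row_html, "html.parser")
tr = soup.find("tr")
tds = tr.find_all("td")  # list-like collection of the cells in this row

first_cell = tds[0].get_text()   # position 0 is the first cell
second_cell = tds[1].get_text()  # position 1 is the second cell
print(first_cell, "|", second_cell)
```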

Extracting the Data

We can extract the data in two moves. First, we isolate the link information and then we move on to the information within the various html tags.

For the first, we loop over the results of a search for all of the anchor tags in the row. For each anchor, we “get” the value of its “href” attribute.

for link in tr.find_all('a'):
    fullLink = link.get('href')
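
A quick way to check what get returns is to run the same loop over a tiny row. This Python 3 sketch uses an invented row and example.com URL; the second anchor deliberately has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical row with two anchors; only the first has an href.
html = (
    '<table><tr>'
    '<td><a href="http://example.com/bio/1">Example, Alice</a></td>'
    '<td><a name="top">no href here</a></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
tr = soup.find("tr")

links = []
for link in tr.find_all("a"):
    # get() returns the attribute value, or None if the attribute is absent.
    links.append(link.get("href"))
print(links)
```

Note that the loop in the tutorial assigns to a single variable, so after the loop fullLink holds the href of the last anchor found in the row.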

We then need to run a search for the table data within the table rows.

tds = tr.find_all("td")

Next, we need to extract the data we want. Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of the “try”, “except”. If any lookup for a row fails, the script skips that row and continues with the next one.
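
The skip-on-error logic can be sketched without any HTML at all. In this Python 3 example the rows are plain lists standing in for the cells of each table row, and the values are invented. Catching IndexError specifically is a choice made for the sketch.

```python
# Plain lists stand in for the cells found in each table row.
rows = [
    ["Example, Alice", "1873-1875", "Representative"],
    ["header only"],  # too short: there is no item at position 1
    ["Sample, Bob", "1873-1874", "Senator"],
]

kept = []
for cells in rows:
    try:
        name = cells[0]
        years = cells[1]
        position = cells[2]
    except IndexError:
        print("bad row, skipping:", cells)
        continue  # move on to the next row
    kept.append((name, years, position))

print(len(kept), "rows kept")
```

Naming the exception keeps unrelated mistakes, such as a typo in a variable name, from being silently skipped along with the short rows.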

Within this we are using the following structure:

years = str(tds[1].get_text())

In this, we are applying the “get_text” method to the second element in the row (index 1, because computers count beginning with 0) and then creating a string from the result. We assign this string to a variable, which we will use to create the CSV file. We repeat this for every item in the table that we want to capture in our file.
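
Here is a small Python 3 sketch of that line on an invented row. The extra bold tag and the surrounding spaces are there only to show what get_text does with them.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the second cell has nested markup and stray spaces.
html = "<table><tr><td>Example, Alice</td><td> <b>1873</b>-1875 </td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")

# get_text() drops the tags and keeps the text, including whitespace.
years = str(tds[1].get_text())

# Trimming the whitespace is an optional extra step, not part of the tutorial.
years_clean = years.strip()
print(repr(years), repr(years_clean))
```

In Python 3, get_text already returns a string, so the str call changes nothing there; it is harmless to keep.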

Writing the CSV file

The last step in this script is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables. Again, because we are writing within the loop, use ‘a’ for append rather than ‘w’ for write; ‘w’ would overwrite the file on every pass and leave only the last row.
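
The append-inside-the-loop pattern looks roughly like this in Python 3. The rows, the temporary output path, and the three-column layout are assumptions made for the sketch, not the tutorial's actual data or file name.

```python
import csv
import os
import tempfile

# Hypothetical values standing in for what was pulled from each table row.
rows = [
    ("Example, Alice", "1873-1875", "http://example.com/bio/1"),
    ("Sample, Bob", "1873-1874", "http://example.com/bio/2"),
]

# Write to a throwaway location so the sketch leaves no file behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

for name, years, link in rows:
    # "a" appends a line; "w" here would overwrite the file on every pass.
    with open(out_path, "a", newline="") as handle:
        csv.writer(handle).writerow([name, years, link])

# Read the file back to confirm that every row was kept.
with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

One thing to watch with append mode: running the script a second time against the same file adds the rows again, so delete the old output first.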

As a result, our file will look like:

You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the html page.

This entry was posted in Course Reflections, Digital Praxis: History 698.

2 Responses to Beautiful Soup Tutorial (Part 2, Draft 1)

  1. Brad Tombaugh says:

    Jeri, thanks for sharing the tutorial. It’s a good introduction to a complex topic. I’d make a suggestion, though, which would allow your script to run more efficiently — move the file open statement before the loop, instead of inside the loop:

    from bs4 import BeautifulSoup
    import csv

    #open the html file and create a soup object
    soup = BeautifulSoup(open("43rd-congress.html"))

    #get rid of the final link that is outside the table
    final_link = soup.p.a
    final_link.decompose()

    #get rid of the link that is within the table data but is not part of the data for inclusion in the CSV file
    rogue = soup.find(bgcolor="#990000")
    rogue.decompose()

    trs = soup.find_all("tr") #find all of the table rows

    f = csv.writer(open("43rd_Congress.csv", "w")) # Open the output file for writing before the loop
    f.writerow(["Name", "Years", "Position", "Party", "State", "Congress", "Link"]) # Write column headers as the first line

    for tr in trs: #for each item in the list of rows
        for link in tr.find_all('a'): #this is a bit tricky: you are combining the search for anchor tags and the for loop in one step
            fullLink = link.get('href') #get the value of the href

        tds = tr.find_all("td") #run another search for all of the table data

        try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error.
            names = str(tds[0].get_text()) # This structure isolates the item by its column in the table and converts it into a string.
            years = str(tds[1].get_text())
            positions = str(tds[2].get_text())
            parties = str(tds[3].get_text())
            states = str(tds[4].get_text())
            congress = tds[5].get_text()

        except:
            print("bad tr string")
            continue #This tells the computer to move on to the next item after it encounters an error

        f.writerow([names, years, positions, parties, states, congress, fullLink]) #you can write the fields in whatever order you wish.

    This way the operating system only has to open the file once. It’s good practice to close the file when you’re finished, but Python does that for you when the script completes. I also added another writerow to put headers on the columns.

    • jeri.elizabeth says:

      Hi Brad,

      Thank you very much for that suggestion! That makes a lot of sense and I will adjust the tutorial!

      Best,

      Jeri
