In the first part of this tutorial, we extracted the names and links from the webpage. In this part, we will go one step further and move all of the table data into a CSV file so that we can more easily use it elsewhere.
Reviewing the Challenge
Back to the HTML
Let’s review again the file that we’re attempting to extract data from.
When we were looking for names, all of the data that we wanted was contained within the anchor tags, which allowed us to make a targeted search. Now, all of the data we want is contained in the HTML table structure. Getting this data out is the puzzle we’re going to solve.
Previewing the Final Product
We know what the HTML file looks like. The finished CSV file will hold one row for each row of the table, with one column for each piece of data we extract.
Below, we will build up the code that gets us there, step by step.
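Since the embedded snippet was lost, here is a hedged sketch of what the finished script might look like. The sample HTML, the output file name index.csv, and the column order (link, name, years) are invented for illustration, not details from the original post.

```python
# Hedged sketch of the finished Part II script.
# The sample HTML, file name, and column layout are invented for illustration.
from bs4 import BeautifulSoup

html = """
<table bgcolor="#990000"><tr><td>styling junk</td></tr></table>
<table>
  <tr><td><a href="ada.html">Ada Lovelace</a></td><td>1815-1852</td></tr>
  <tr><td><a href="grace.html">Grace Hopper</a></td><td>1906-1992</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Decompose the styling table, identified by its bgcolor attribute.
rogue = soup.find(bgcolor="#990000")
rogue.decompose()

rows = []
for tr in soup.find_all("tr"):
    fullLink = ""
    for link in tr.find_all("a"):
        fullLink = link.get("href")
    tds = tr.find_all("td")
    try:
        names = str(tds[0].get_text())
        years = str(tds[1].get_text())
        rows.append(fullLink + "," + names + "," + years)
    except IndexError:
        continue  # rows without enough cells are skipped

with open("index.csv", "w") as f:  # assumed output file name
    f.write("\n".join(rows) + "\n")
```

Each piece of this sketch is explained in the sections that follow.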
Writing the Script
The Problem of Extra Data
Recall the code we had at the end of Part I, which extracted the names and links from the anchor tags.
The problem of extra data that we had in Part I was that there was an additional anchor tag, giving us an additional line.
While that problem still exists, we now have a new one: there is an additional table at the top of the file that holds styling data. We can remove it with another decompose step, identifying this table by its color attribute, since this is the only place in the file where color information appears.
rogue = soup.find(bgcolor="#990000")
rogue.decompose()
These two lines find everything within the tag carrying the color attribute bgcolor="#990000". We know we don’t want any of this information, so we can decompose it.
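As a minimal, self-contained sketch (the sample HTML is invented), finding and removing the styling table looks like this:

```python
from bs4 import BeautifulSoup

# Invented sample: a styling table followed by a data table.
html = ('<table bgcolor="#990000"><tr><td>style</td></tr></table>'
        '<table><tr><td>data</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

rogue = soup.find(bgcolor="#990000")  # the only element with color info
rogue.decompose()                     # remove it, and everything inside it

print(len(soup.find_all("table")))    # -> 1: only the data table remains
```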
Identifying the Parts
We know that everything we do want for our CSV file lives within table row ("tr") tags. We also know that these items appear in the same order within the tags. Because we are dealing with lists, we can identify a piece of information by its place in the list. This means that the first item in the row is identified by index [0], the second by index [1], etc.
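Using an invented row for illustration, indexing into the list returned by find_all works like this:

```python
from bs4 import BeautifulSoup

# Invented sample row with two table-data cells.
row = BeautifulSoup("<tr><td>Ada Lovelace</td><td>1815-1852</td></tr>",
                    "html.parser")
tds = row.find_all("td")   # a list of the cells, in document order

print(tds[0].get_text())   # first item  -> Ada Lovelace
print(tds[1].get_text())   # second item -> 1815-1852
```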
Extracting the Data
We can extract the data in two moves. First, we isolate the link information and then we move on to the information within the various html tags.
For the first, we create a loop from a search for all of the anchor tags. Then we move through this loop to “get” the value of the “href” attribute from each tag.
for link in tr.find_all('a'):
    fullLink = link.get('href')
We then need to run a search for the table data within the table rows.
tds = tr.find_all("td")
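Putting the two searches together for a single row (the sample HTML is invented) might look like this:

```python
from bs4 import BeautifulSoup

# Invented sample row: one link cell and one plain data cell.
tr = BeautifulSoup('<tr><td><a href="ada.html">Ada</a></td>'
                   '<td>1815-1852</td></tr>', "html.parser")

fullLink = ""
for link in tr.find_all("a"):
    fullLink = link.get("href")   # the value of the href attribute

tds = tr.find_all("td")           # all table-data cells in this row
print(fullLink, len(tds))         # -> ada.html 2
```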
Next, we need to extract the data we want. Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of try/except: if a particular line fails, the script continues on to the next line.
Within this we are using the following structure:
years = str(tds[1].get_text())
In this, we are applying the “get_text” method to the 2nd element in the row (index 1, because computers count beginning with 0) and then creating a string from the result. We assign this string to a variable, which we will use to create the CSV file. We repeat this for every item in the table that we want to capture in our file.
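The try/except pattern over rows of uneven length can be sketched like this; the sample rows are invented, and catching IndexError specifically is one reasonable way to handle the short rows:

```python
from bs4 import BeautifulSoup

# Invented sample: the first row has too few cells and will be skipped.
html = ('<table>'
        '<tr><td>Header only</td></tr>'
        '<tr><td>Ada Lovelace</td><td>1815-1852</td></tr>'
        '</table>')
soup = BeautifulSoup(html, "html.parser")

extracted = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    try:
        names = str(tds[0].get_text())   # 1st cell (index 0)
        years = str(tds[1].get_text())   # 2nd cell (index 1)
        extracted.append((names, years))
    except IndexError:
        continue  # not enough cells in this row; move on to the next
```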
Writing the CSV file
The last step is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables. Again, because we are writing within the loop, use ‘a’ for append rather than ‘w’ for write; ‘w’ would overwrite the file on every pass through the loop.
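Writing one CSV line per loop iteration in append mode might look like this; the sample data and the file name index.csv are invented for illustration:

```python
# Invented sample data standing in for the values pulled from each row.
rows = [("ada.html", "Ada Lovelace", "1815-1852"),
        ("grace.html", "Grace Hopper", "1906-1992")]

# Assumed output file name; 'a' appends, so earlier lines survive each open.
for fullLink, names, years in rows:
    f = open("index.csv", "a")
    f.write(fullLink + "," + names + "," + years + "\n")
    f.close()
```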
As a result, our file will contain one clean row of comma-separated data for each row of the original table.
You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the HTML page.