Short Tutorial: Cleaning up downloaded files (draft)

So you used wget or Python to pull down a collection of files from the web. Excellent! But in looking through your loot, you notice that a number of the files are oddly small and, on further examination, find that they are functionally empty. How do you clean up your collection of files quickly?

There are a number of very powerful command line tools built into UNIX systems (Mac and Linux) that allow you to manipulate your files quickly and easily. This is a brief tutorial on how to use those tools to locate all of the files that are too small and then remove those files from your collection.

Begin by navigating in Terminal to your collection of folders. For example, my files were located a couple of folders down within my Documents folder.

cd Documents/Github/Clio3/Webscraping/hymn-files

Once in the folder with your downloaded files, you need to find a way to isolate out the files that are too small to be interesting. To do this, use the “find” command. First, to see the various options associated with “find”, enter

man find

Use the up and down arrows to move around the window that appears. Look around at the various options – there are many! However, because we are looking only in a particular folder, which in my case has no subfolders, all that we need to look at for now is size.

To exit this window, type “q”.

If the files we are interested in sorting through are all one file type (in my case they’re .json files), we can tell the computer to find all of the files of a particular type and particular size as follows:

find *.json -size 28c

This would find all of the json files that are 28 bytes in size.

However, we want all the files that are smaller than 28 bytes. Putting a minus sign in front of the number tells find to match anything below that size.

find *.json -size -28c
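Before removing anything, it can help to try the size test on a few sample files. The following is a minimal sketch, assuming a throwaway folder created with mktemp and three made-up files (empty.json, exact.json and full.json are hypothetical names, not part of the hymn collection). It shows that -28c matches only files strictly smaller than 28 bytes, so a file of exactly 28 bytes needs -29c.

```shell
# Work in a throwaway folder (hypothetical sample files) so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

printf '{}' > empty.json                                          # 2 bytes
printf '%028d' 0 > exact.json                                     # exactly 28 bytes
printf '%s' '{"title": "A hymn with real content"}' > full.json   # 37 bytes

# -size -28c matches files strictly smaller than 28 bytes.
under_28=$(find *.json -size -28c)
echo "$under_28"

# -size -29c also picks up the file that is exactly 28 bytes.
up_to_28=$(find *.json -size -29c | sort)
echo "$up_to_28"
```

The first listing should contain only empty.json, and the second should add exact.json. Neither should list full.json.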

Finding the files is great, but now we need to do something with that collection of files.

First we are going to “pipe” the results of our search over to our second, removal, function.

(From here on, the examples show how the command is built. Do not run the command until the very end, when all the pieces are in place.)

find *.json -size -28c |
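To see what a pipe does without deleting anything, one harmless experiment is to send the same list to wc -l, which only counts lines. This is a sketch under stated assumptions: it runs in a throwaway folder made with mktemp, and the three sample files are hypothetical.

```shell
# Throwaway folder with made-up sample files, so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

printf '{}' > a.json                                           # 2 bytes
printf '[]' > b.json                                           # 2 bytes
printf '%s' '{"title": "A hymn with real content"}' > c.json   # 37 bytes

# find prints one filename per line; the pipe hands those lines to wc -l,
# which counts them. tr strips the padding some versions of wc add.
count=$(find *.json -size -28c | wc -l | tr -d ' ')
echo "$count"
```

The number printed is how many files the removal step would act on, which makes it a quick check before anything is deleted.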

Next we are going to use xargs, which helps the computer handle a long list of file names.

Note: The Wikipedia entry on xargs suggests using “-0” (zero) when dealing with file names with spaces in them, as xargs defaults to separating at white space (another reason to avoid spaces in filenames). The -0 option expects the names to arrive separated by null characters, so it only works when find is also given its “-print0” option. If you run this command without -0 and it doesn’t work, try adding both.

find *.json -size -28c | xargs

And the function we want to run on each filename is “rm” or remove. rm has a number of options that you can research, some of which really make data un-recoverable and should be used with care. However, a basic “rm” command will be sufficient for this example.

find *.json -size -28c | xargs rm

Run this command to remove all of the .json files smaller than 28 bytes from your current directory.
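If some of your file names contain spaces, a variation along these lines may be safer. It is a sketch, not the only way to do it: it assumes a version of find and xargs that supports -print0 and -0 (the GNU and BSD/macOS versions do), and it runs in a throwaway folder with two hypothetical sample files. The first pipeline puts echo in front of rm, so it only prints the command that would run.

```shell
# Throwaway folder with made-up sample files, so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

printf '{}' > "empty hymn.json"                                   # small file, space in its name
printf '%s' '{"title": "A hymn with real content"}' > full.json   # 37 bytes, should survive

# Dry run: echo prints the rm command instead of executing it.
find *.json -size -28c -print0 | xargs -0 echo rm

# Real run: names travel separated by null characters, so the space is safe.
find *.json -size -28c -print0 | xargs -0 rm

remaining=$(ls)
echo "$remaining"
```

Checking the dry run output first is a cheap way to confirm that the list contains only the files you mean to delete.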

(I used 28 bytes because the computer was having trouble with 0 and all of the files I wanted to keep were larger than 28 bytes. Not exactly sure why 0 was a problem but this is why experimenting with the find options before moving on to removing the files is a good idea!)

This entry was posted in Course Reflections, Digital Praxis: History 698.

4 Responses to Short Tutorial: Cleaning up downloaded files (draft)

  1. Jeri -
    This is a great and very well written tutorial! The comments I have are so nit-picky I am hesitant to mention them but I will and you can do with them what you will.
    My first comment is about your tiny section on the “find” command. My understanding of this tutorial is that you are showing how to do one very specific task but the way you phrased that section makes it appear that this is a necessary step in that process. As I understand what you are trying to teach in this tutorial that is not necessarily the case. So possibly just changing your phrasing to tell the reader that if they want to know all the cool things you can do then this is the command to use to see them or rephrasing your section clearer to tell the reader why this is important information.
    My second comment is again an attempt at clearer language and coming from a totally personal interpretation.
    You wrote: “First we are going to “pipe” the results of our search over to our second, removal, function.” Coming from a larger knowledge base in PHP when I see the word “function” I think we’re doing something somewhere else, not building onto the same line. Also the word “second” is slightly confusing because I’m not sure what the first one was. Don’t get me wrong, I figured it out real quick but it was a little jarring. Along the same lines, maybe showing the full line first and then breaking it down might alleviate some of this confusion instead of only telling us half-way through the line construction not to run it until the whole line is constructed. Or telling us this first and then building it like you do.
    Anyways, that’s all I have to say. In the end I understood everything perfectly and would be able to follow along with this tutorial with great ease. These suggestions are just my reactions to some minor language choices that could be entirely irrelevant :-)
    Great job!

    • jeri.elizabeth says:

      Sasha,

      Thanks for these! One of the biggest challenges in writing these is keeping the language consistent and clear. I think you’re right – changing the “man find” to an optional “discover more” line sounds good.

      I’ll change the second one to command rather than function. The second is there because that one line is actually doing two different operations. I will try to find a better way of communicating that the pipe takes the result from one command and feeds those into the second command.

      And I like the idea of posting the whole thing and then breaking it down into the parts. I wasn’t quite sure how to handle both explaining and giving the lines to execute – this should help.

  2. Erin Bush says:

    Jeri, from what I can understand of this (because I am not a Mac user), it seems very straightforward, but that does bring me to the obvious q, can those of us on Windows machines clean up our files this way? Do we use the same commands?

    • jeri.elizabeth says:

      That is the question I asked before writing the tutorial. Unfortunately, Windows has different commands for the same functions. You can use UNIX commands; however, you would have to download some sort of UNIX tools for Windows first. :(
