Rusty's Blog

Thoughts and musings of someone who's not sure what 'normal' is…

Tuesday, August 18, 2009

Removing ‘dupes’

If you read my blog from yesterday, you know that I have a rather substantial collection of files sitting on my server. Over the years I’ve written a few stories, taken some pictures, subscribed to a few news feeds, and so on.

But that’s only a small part of why my server has such a large collection of files on it. The real reason is that I’ve backed up systems that I needed to do some work on to my server in the past, and it’s accumulated a few duplicate files. I’ve had at least three laptops that I’ve created a folder on the server for, then dumped the entire content of my home folder from the laptop into the folder on the server. Including all subdirectories.

I have also backed up what I considered to be important folders on the system the same way. In my user account’s home folder there are folders ‘etc’, ‘oldetc’ and the like that are copies of what I had in the folder at the time I needed to do something significant to the system. Oh, it has saved me a few times. But I really don’t need a lot of that sitting around anymore. And it does take time to parse when I do make backups.

Well, just merge the folders together then, and throw away the source folders for any duplicates. Right?

In some cases that might be workable. However, it would cause a few issues as well. Let’s take a couple of examples. Say I had a laptop Able, another Betsy, and a third Chuck. At one point or another I made backups of each to my server. However, when I was using Betsy, some time after I was using Able, I created a file ‘Meeting Notes.txt’ in the home folder. Oddly enough I have a file with the same name in Able, and another in Chuck. How do I ‘merge’ the three folders? In this case I don’t really want to do that. What I want to do is keep the three folders, and their unique files, separate. On the other hand, after I had taken a picture of my Lab when I was setting up Able, I decided that it would be a great background image, so I tossed it in a folder Backgrounds, and copied that from system to system. That is a prime image to remove the duplicates of.

Sounds simple right? Bring up a list of the files in each folder, and delete the duplicates. Right?

I mentioned that this home folder has almost 200 gig of content, right? It turns out that there are a bit over 380,000 files. For the moment I’m presuming that there are at least 100,000 duplicates. Additionally, some of the duplicates may not have the same name.

Say I take a photo and store a copy in my backup folder. Well, the camera gives it a positively useless name like ‘IMG00010.PEF’ and I use a tool called UFW to convert this to img00010.jpg, but the name is still hardly descriptive. The picture is of a friend’s Samoyed named Doug. So I copy the image to doug010.jpg for my friend, and just happen to leave it on the laptop. Ok, she was in the image too and I was using the image as a background for a while.

Now I don’t mind having the jpg dupe of the pef file. They are in different formats, and the jpg takes up significantly less space. However I probably don’t need both of the img00010 and doug010 images hanging around at the same time. Since the doug010 image is really the same as the img00010 image, I can use a feature of many Linux file systems and ‘link’ the two file names to the same file. In fact that is already in use on some folders, as a couple of tools such as web servers have changed what folder they pointed at for the user account. At one point Apache was pointing at WWW, at another it was pointing at public_html, and at another time it was pointing at www. Rather than delete and recreate each time, I created a link to the original folder, and pretty much forgot about it.
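For what it’s worth, here is a minimal sketch of that linking trick for two files. It assumes GNU coreutils (ln, stat, mktemp) and that both names live on the same file system, since hard links can’t cross file systems. The scratch directory and the file contents are made up for the demo; only the two file names come from the story above.

```shell
# Scratch directory so nothing real gets touched (demo only).
work=$(mktemp -d)
cd "$work" || exit 1

# Two separate files with identical content, standing in for the photos.
printf 'pretend this is a jpeg\n' > img00010.jpg
cp img00010.jpg doug010.jpg

# Replace the copy with a hard link to the original: one file, two names.
ln -f img00010.jpg doug010.jpg

# Both names now report the same inode number and a link count of 2.
inode_a=$(stat -c %i img00010.jpg)
inode_b=$(stat -c %i doug010.jpg)
links=$(stat -c %h img00010.jpg)
echo "inodes: $inode_a $inode_b, links: $links"
```

The catch with a hard link is that editing the file through one name changes what the other name shows too, which is fine for a photo I never touch again.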

In any case there are a lot of possibilities for why a duplicate file may exist, and in some cases what appears to be a duplicate file may not be one.

So how to clean this up.

First of all let’s find a way of identifying duplicates that has nothing to do with the file name. There’s a handy tool available for Linux and I believe for MacOS as well, Windows too I think, called sha256sum. This tool is primarily used to generate checksums of files that are going to be distributed on the Internet, so that once you have completed your download you can check to see if the resulting file matches what the distributor says they sent out. It’s really likely to be overkill for what I intend to do with it. I could probably get away with using md5sum, but considering that I have over a quarter of a million files to go through, I might just as well use the ‘best’ tool for the job. What the tool does is go through a file and spit out a hash of the file contents. Actually it spits out a 64 character string that is reasonably unique for each file. For my purposes it is close enough to being unique. I can survive deleting some files.
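To see that the file name really doesn’t matter to the hash, here’s a small sketch. It assumes GNU coreutils (sha256sum, cut, mktemp) and plays in a scratch directory with made-up file contents, so none of it touches the real collection.

```shell
# Scratch directory with three small stand-in files (demo only).
work=$(mktemp -d)
cd "$work" || exit 1
printf 'same picture data\n' > img00010.jpg
cp img00010.jpg doug010.jpg
printf 'different picture data\n' > other.jpg

# cut -c1-64 keeps only the 64-character hash and drops the file name.
hash_img=$(sha256sum img00010.jpg | cut -c1-64)
hash_doug=$(sha256sum doug010.jpg | cut -c1-64)
hash_other=$(sha256sum other.jpg | cut -c1-64)

if [ "$hash_img" = "$hash_doug" ]; then
    echo "img00010.jpg and doug010.jpg have the same content"
fi
if [ "$hash_img" != "$hash_other" ]; then
    echo "other.jpg is a different file"
fi
```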

What I am doing now is using the command

find . -type f ! -path ./sha256sums.txt -exec sha256sum "{}" + > sha256sums.txt

to generate a file that I can then process. Each line of the file contains first the sha256sum, then the filename.
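As far as I can tell, the layout of each line from GNU sha256sum in its default mode is the 64-character hash, two separator characters, and then the name, so the name starts at column 67. A quick sketch of pulling the two apart, assuming an ordinary file name (the file and its content are invented for the demo):

```shell
# Scratch directory and a file whose name contains a space (demo only).
work=$(mktemp -d)
cd "$work" || exit 1
printf 'notes\n' > 'Meeting Notes.txt'

line=$(sha256sum 'Meeting Notes.txt')

# Everything before the first space is the hash.
hash=${line%% *}
# The name starts at column 67: 64 hash characters plus two separators.
name=$(printf '%s\n' "$line" | cut -c67-)

echo "hash: $hash"
echo "name: $name"
```

Cutting by column rather than by space-separated field is what keeps a name like ‘Meeting Notes.txt’ in one piece.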

Next up is to sort the file by the first 64 characters. Actually just sorting will be fine. I’ll probably use the command

sort <sha256sums.txt >sha256sums.srt

which should spit out a file with the same number of lines. I’ll run a quick

wc -l sha256sums.txt sha256sums.srt

to verify that. Now we are going to want to weed out all the lines that represent unique files. Essentially we are going to do the reverse of what the tool uniq was originally intended to do. By default, if you pipe the contents of a sorted file through ‘uniq’ you get one copy of every line: any line that is identical to the line before it is dropped. So it does include the first line of a run of duplicates, but not any of the subsequent ones. Well, that’s close to what we want. Time to take a look at its options. Hmm. -d – print only the duplicated lines, one for each group. Ok, we’re closer. But as I said the format for each line is ‘hash filename’ and in this case that means that, even sorted, each line is going to be unique. Oh, wait, -w – compare no more than the first ‘n’ characters. Bingo.

uniq -d -w64 <sha256sums.srt > sha256sums.dups

should take care of the job. Well, maybe.

Until I take a look at the output I won’t really know if it will list both or all files with that checksum. Ok, let’s presume for the moment that it won’t. What to do?

Well, let’s use the cut command to cut out the first field, then use fgrep (that is, grep -F) to find all the lines that have the resulting hashes in them. Since there are possibly 3 or 4 copies of some files (or more) let’s also run the hashes through sort -u to make sure we have a reasonably clean list.

uniq -d -w64 <sha256sums.srt | cut -f1 -d' ' | sort -u > sha256sums.hashs
grep -F -f sha256sums.hashs sha256sums.txt > sha256sums.dups
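As a possible shortcut, GNU uniq also has -D, which prints every line of a duplicated group instead of one representative. Here’s a hedged sketch comparing the two on a tiny made-up list. The 64-character ‘hashes’ are fake and the file names are invented for the demo, and -D and -w are GNU options that other versions of uniq may not have.

```shell
# Scratch directory and a small, already sorted sample list (demo only).
work=$(mktemp -d)
cd "$work" || exit 1
h1=$(printf '%064d' 1)   # fake 64-character hash
h2=$(printf '%064d' 2)   # another fake hash
printf '%s  %s\n' \
    "$h1" './able/Backgrounds/lab.jpg' \
    "$h1" './betsy/Backgrounds/lab.jpg' \
    "$h1" './chuck/Backgrounds/lab.jpg' \
    "$h2" './betsy/Meeting Notes.txt' > sample.srt

# -d prints one representative line per group of repeated hashes.
one_per_group=$(uniq -d -w64 sample.srt | wc -l)
# -D prints every line that belongs to such a group.
all_copies=$(uniq -D -w64 sample.srt | wc -l)

echo "uniq -d: $one_per_group line(s), uniq -D: $all_copies line(s)"
# → uniq -d: 1 line(s), uniq -D: 3 line(s)
```

If -D is available, a single pass over the sorted file gives the full list of duplicates without the cut and grep round trip.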

OK, that should get the list of duplicate files. What I need to do now is get rid of the hashes and sort the list of files.

# the name starts at column 67: 64 hash characters plus two separator characters
cut -c67- sha256sums.dups | sort > duplicatefiles.txt
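One thing to watch: a list built this way holds every copy of each duplicated file, including the one I’d want to keep. Here’s a rough sketch of a way to list only the extra copies, by walking a sorted hash file and printing every name except the first one in each run of identical hashes. The fake hashes and the file names are invented for the demo, and which copy counts as ‘first’ simply depends on how sort ordered the names, which may not be the copy I’d pick by hand.

```shell
# Scratch directory and a small, already sorted sample list (demo only).
work=$(mktemp -d)
cd "$work" || exit 1
h1=$(printf '%064d' 1)   # fake 64-character hash
h2=$(printf '%064d' 2)   # another fake hash
printf '%s  %s\n' \
    "$h1" './able/lab.jpg' \
    "$h1" './backup/betsy/lab.jpg' \
    "$h1" './backup/chuck/lab.jpg' \
    "$h2" './able/Meeting Notes.txt' > sample.srt

# Keep the first file of each hash group, print the rest as candidates.
awk '{ hash = substr($0, 1, 64)
       if (hash == prev) print substr($0, 67)
       prev = hash }' sample.srt > candidates.txt

cat candidates.txt
```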

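Now at this point let’s get rid of those sha256sums files. We will also need to edit duplicatefiles.txt. Keep in mind that the list holds every copy of each duplicated file, so anything left in it gets deleted, originals included. What I want to do is limit the files that I delete to just those in ‘backup’ folders for devices that I am not worried about having a duplicate file from. While we are in there it would be a good idea to watch for file names with spaces or other non-graphical characters in them. It will make a difference later: the delete loop has to read the list one line at a time and quote each name, or those names get split into pieces.

rm sha256sums.*
gedit duplicatefiles.txt
while IFS= read -r A ; do rm -- "${A}" ; done < duplicatefiles.txt ; rm duplicatefiles.txt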
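Since rm doesn’t ask twice, a dry run seems like cheap insurance. Here’s a sketch, assuming the list holds one plain, unescaped file name per line. The folders and files are made up for the demo and live in a scratch directory.

```shell
# Scratch directory with a few stand-in files (demo only).
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p 'backup/old laptop'
touch 'backup/old laptop/Meeting Notes.txt' 'backup/lab.jpg' 'keep.txt'
printf '%s\n' './backup/old laptop/Meeting Notes.txt' \
              './backup/lab.jpg' > duplicatefiles.txt

# Dry run: show what would be removed, one name per line, spaces intact.
while IFS= read -r file ; do
    echo "would remove: $file"
done < duplicatefiles.txt

# Real run: -- stops rm from treating a name starting with '-' as an option.
while IFS= read -r file ; do
    rm -- "$file"
done < duplicatefiles.txt

# Count what is left, not counting the list itself.
remaining=$(find . -type f ! -name duplicatefiles.txt | wc -l)
echo "files left: $remaining"
```

Reading with IFS= and -r keeps leading spaces and backslashes in the names exactly as they appear in the list.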

And we are ‘done.’

Well, sort of. In reality we ended up deleting about 600 files. Not quite the space savings I was looking for.

But it’s a start. And we have to start some place. I suppose. Now to figure out what else to get rid of….

posted by Rusty at 12:54 am  
