...making Linux just a little more fun!
Perl One-Liner of the Month: The Adventure of the Runaway Files
By Ben Okopnik
- "Well, well - what have we here?"
Woomert Foonly had been working with his collection of rare airplanes, and
was concentrating on the finer details of turbocharger gate flows and jet
fuel cracking pressures. Nevertheless, the slight noise behind him that heralded
an unannounced visitor (Woomert could recognize Frink's step quite well) caused
him to instantly spin around and apply a hold from his Pentjak Silat
repertoire to the unfortunate sneak, causing the latter to resemble a fancy
pretzel (if pretzels could produce choked, squeaking sounds, that is). The
question was asked in calm, measured tones, but there was an obvious undertone
of "this hold could get much more painful very quickly, so don't waste
my time" that changed the helpless squeaking to slightly more useful words.
- "Ow! I'm - ow! - sorry, Mr. Foonly, but I just had to come
see you! I've got this bad problem, and - ow, ow! - I really didn't want anybody
to know, and - ouch! - I didn't want to use the front door, 'cause somebody
might have spotted me! I didn't mean any - ow! - harm, really!"
Woomert sighed and released his grip, then helped the stranger untangle
himself, since he clearly would not be able to, for example, untie his left
shoelace from his right wrist - especially since it was tied behind his back.
He smiled briefly to himself while working; the old skills were still in
shape, and would be there when he really needed them.
- "Next time, I suggest calling or emailing me ahead of time. The
Zigamorph Gang, whom I helped apprehend when I solved the Bank Round-Downs
Mystery, is out of prison and threatening various sorts of mayhem; I can
handle them and their plotting, but it's just not a smart idea to sneak up
on me right now - or at any time. Who are you, anyway?"
The visitor shook himself and made a forlorn attempt at straightening out
his rumpled jacket. Since it now resembled a piece of wrung-out laundry, he
gave up after a few moments and shook his head mournfully.
- "Well... my name is Willard Furrfu. You see, Mr. Foonly, I'm working
as a data entry operator, but I've been trying to learn some programming skills
after work so I can get ahead. I've managed to install a C compiler in my
home directory, and I've been experimenting with loops... and I managed to
really screw things up. I'm hoping you can help me, because if anybody
finds out what happened, I'm toast!"
While Willard was talking, Woomert quickly cleaned up his workbench and
closed the plane's cowling. When he was done, he beckoned his guest out of
the hangar and into the house. Once inside, he started a pot of tea, then
sat down and examined his guest.
- "Tell me exactly what happened."
- "Well... I'm not really certain. I wanted to practice some of the
stuff I've learned by copying an existing file to a random filename one line
at a time; unfortunately, it seems like the function that I wrote looped
over the file creation subroutine as well as the line copy function. It took
me only a few seconds to realize it and kill the process, but there are now
thousands and thousands of files in my home directory where there used to
be only fifty or sixty! Worse yet, given the naming scheme for the valid
files, it's impossible to tell which ones they are - the names look kinda
random in the first place - and I can't even imagine doing this by hand,
it's impossible. I don't mind telling you, Mr. Foonly, that I'm in a panic.
I tried writing some kind of a function that would loop through and compare
each file with every other one in the directory and get rid of the duplicates,
but I realized half-way through that, one, I'm not up to that skill level,
and two, it adds up to a pretty horrendous number of comparisons overall
- I'll never get it done in time. Tomorrow morning, when I'm supposed to
enter more data into these files, I'll be in deep, deep trouble - and I'd
heard of you and how you've helped people with programming problems before.
Please, Mr. Foonly - I don't know what I'll do if you turn me down!"
- "Hmm. Interesting." Woomert sniffed the brewing tea and closed the
lid tightly, then sat down again. "What kind of files are these?"
- "Text files, all of them."
- "Are they very large?"
- "Well, they're all under 100kB, most of them under 50kB. I'd thought
of taking one file of each size, but it turns out a number of them are different
even though the size is the same."
- "Do you care what the actual remaining file names are, as long as
the files are unique?"
- "Why, no, not at all - when there are only the original files, I
can go through them all in just a few minutes and identify them. Mr. Foonly,
do you mean that you see a solution to this problem? Is it possible?"
Woomert shrugged.
- "Let's take a look at it first, shall we? No point in guessing until
we have the solid facts in hand. However, it doesn't look all that difficult.
You're right in saying that comparing the actual files to each other would
be a very long process; tomorrow morning would probably not suffice unless
it was a very powerful computer..." At Willard's hangdog look, Woomert went
on. "I didn't suppose it was, from the way it sounded. Well, let's give it
a shot. How do we get there from here?"
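Willard's worry about the number of comparisons can be put into figures. The sketch below is only an illustration: the file count is a made-up stand-in for "thousands and thousands", and ``pair_count'' is a hypothetical helper, not something from Willard's program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Number of distinct pairs among $n files: n * (n - 1) / 2
sub pair_count {
    my ($n) = @_;
    return $n * ($n - 1) / 2;
}

# 20_000 is an assumed figure, not a count taken from the story
my $files = 20_000;
printf "%d files: %d pairwise comparisons, but only %d digests to compute\n",
    $files, pair_count($files), $files;
```

Comparing every file with every other grows with the square of the file count, while computing one digest per file grows only linearly.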
Willard brightened up.
- "I'd followed a number of your cases in the papers, Mr. Foonly,
and knew that you preferred SSH. In fact, I had just convinced our sysadmin
to switch to it - we'd been using telnet, and after I showed him some of
what you'd said about it (I had to censor it a bit, of course), he became
convinced and talked the management into it as well."
- "Not bad, Willard. You're starting off right - in some ways, anyway.
Whatever language you choose to learn, you need to be careful. You
never know what the negative effects could be, so until you're at least semi-competent,
you need to stay away from live systems. When this is over, I suggest you
talk to your sysadmin about setting up a chroot jail, where you can experiment
safely without endangering your working environment."
- "I'll do that, Mr. Foonly, as soon as I get back to the company.
Do you think that fixing this will take long?"
- "Let's see. Go ahead and use that machine over there to log in,
and we'll see what it tells us. What do you know - ``ls|wc -l''
says ``27212'', which tells us that's how many files you've
got. So far, so good. All right - first of all, what did you call the program
that did this?"
- "Um, ``randfile''. I've still got the source..."
- "That's good, because we're going to delete it. I'd hate to have
you accidentally undo everything after it's fixed! Now, let's see... yep,
these look like all text, no problem. Another notch for you, Willard: accurate
problem reporting is a good skill to have, and you seem to be doing well.
All right then..."
perl -MDigest::MD5=md5 -0we'@a=@ARGV;@h{map{md5($_)}<>}=@a;@b=values%h;print"@b\n"' *
Woomert's fingers flew over the keyboard as he fired off the one-liner.
After about a second, he smiled but kept watching the screen - which, after
another second or two, printed a list of filenames.
- "There you are, Willard - a list of unique names. I'm glad your
system had the module that I needed - it's a common one, but I wasn't
certain. Copy those off to another directory, delete all the others, and
copy them back, and you're all done. You could even automate the process by
writing..." A mischievous grin flashed over Woomert's face as he paused for
a second. "...a program. Well, a one-line shell script, anyway."
- "That... that's it???" Willard stared in hope and disbelief at the
screen where the short list of files beckoned for action. He quickly created
a subdirectory in "/tmp", copied the files by carefully using "cp"
and backticks around Woomert's script, and scanned them by using "less".
When he turned toward Woomert a few seconds later, his face was shining with
joy.
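The copy step could also be scripted rather than done with "cp" and backticks. What follows is a minimal sketch under stated assumptions, not the method used in the story: ``unique_files'' and ``copy_unique'' are hypothetical helper names, the destination directory is assumed to exist already, and the first name seen for each digest is kept (the one-liner keeps the last).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy qw(copy);
use File::Basename qw(basename);

# Return one file name per distinct content, keeping the first name seen.
sub unique_files {
    my @files = @_;
    my (%seen, @keep);
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!\n";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @keep, $file unless $seen{$digest}++;
    }
    return @keep;
}

# Copy the survivors into $dest, which must already exist.
sub copy_unique {
    my ($dest, @files) = @_;
    my @keep = unique_files(@files);
    for my $file (@keep) {
        copy($file, "$dest/" . basename($file))
            or die "Can't copy $file: $!\n";
    }
    return @keep;
}

# Hypothetical usage: perl copy_unique.pl /tmp/keep *
if (@ARGV >= 2) {
    my ($dest, @files) = @ARGV;
    print "$_\n" for copy_unique($dest, grep { -f $_ } @files);
}
```

Nothing is deleted here; the originals stay in place until the copies have been checked by hand.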
- "Mr. Foonly... you've saved me. I promise I'll be far more careful
from now on, and I'll talk to our administrator about setting up a - what
did you call it, a ``chroot jail''? - anyway, I'm really grateful. How can I
ever repay you?"
- "Well, you could bring me large loads of gold and jewels..." Woomert
stopped and laughed at the look of dismay on the young man's face, "just kidding.
I have a suggestion for you, though, that you might put some thought into.
You seem to have some aptitude for programming - I was just looking at your
"randfile.c", and except for the obvious errors, you were doing
pretty well. I'd suggest you take a few programming courses at the local
vocational school as a start - when you're just starting out, it's difficult
to get anywhere, particularly in languages like C and C++ where there are
many, many traps and pitfalls for the unwary. They work well for their specific
purposes, mind you - but you should have some formal training to understand
the background of what you're doing, or you end up with a mess."
- "A vocational school." Willard seemed struck by the idea. "Say,
I never thought of that; I just knew that college was too expensive for me
right now, and I wanted to learn somehow. Great idea, Mr. Foonly; I'll run
down there and find out what it takes as soon as possible! I'll even put practicing
C aside for now, until I do learn some of the background... what about the
stuff that you were using? I'd heard of PERL before."
- "Well, it's not called ``PERL'', since it's not an abbreviation
- although some people have come up with back-formations for what it stands
for [1]. It's ``Perl'' if you're talking about the language,
and "perl" for the executable name. Yes, I think that learning Perl would
be a very good idea, especially if you're going to back it up with a later
study of C; you'll find that it's easy to learn and keep learning, allows
you to become competent quickly, and avoids many of the problems of the older
languages that have you dealing with abstruse issues like memory management
and bad pointers. I'd suggest picking up a good book - be careful, there
are many poorly-written books on Perl, but I can definitely recommend "Learning Perl" by Randal
Schwartz and Tom Phoenix - and studying it. An evening or two of that, and
you'll be able to get in trouble even more efficiently than you did with your
C program." Woomert grinned at the somewhat woebegone-looking Willard, who
finally grinned back.
- "Well, I've actually read up on it a little bit before, but I'd
read all kinds of things on the Net about Perl being hard to read, or hard
to understand, so I was a little hesitant about studying it. Actually,"
Willard looked abashed, "after seeing your code, I know what they mean. Is
it always that complicated?"
- "Not at all. I use these one-liners because I understand Perl well,
and because they're not code that I'm leaving for someone else to use. In
fact, if you're interested, I can explain what I did and show how it would
look in a script."
- "Mr. Foonly, I'd be fascinated. After all, I'm going to be learning this
stuff - what better way to start than by hearing you explain it?"
Smiling, Woomert extracted his cell phone from the quick-release waterproof
stainless steel holder that he'd recently invented.
"Hold on while I get Frink. He'd like to see this too, I'm sure. Hello,
Frink? Got a case here... actually, it's solved already, but you might want
to see the method. Ten minutes? See you then." He returned the phone to its
holster. "We'll just have some of this excellent brew that I've made up until
he gets here. It's a pure, fine-pluck, high-altitude rolled Nepalese tea
that's got a wonderful smoky flavor. A cup for you?..."
A bit later, Frink showed up, looking like he'd torn himself away from some
project or another. He also looked disappointed, but Woomert immediately forestalled
him.
- "Frink, I know that you strongly prefer to participate in my cases;
I do also, since you're now going to be my partner. However, there are times
when a case just sneaks up on you and turns into a knotty problem before
you can blink, and you have to get things tied up before it loops and replicates
itself into some huge number of variables." Both of them glanced over at
Willard who was by now unsuccessfully trying to choke down his laughter.
"Willard, for example, understands precisely what I mean. Anyway, be assured
that I would not have left you out if there was not a time element involved;
as it turned out, I was able to solve the problem quickly, but there was
always the chance that we'd need every available second. Let me tell you
about it and judge for yourself."
A few moments sufficed to explain what had come before, and Frink nodded
and smiled at Woomert.
- "Thanks, Woomert. I was feeling left out, and I appreciate
your explaining that. Good communications between partners are important,
aren't they? That's a lesson all its own." The two of them grinned at each
other before turning to the computer.
- "Go ahead, Frink. Can you break this one out for Willard? I'll be
right here, so if you get stuck, I'll keep it going."
- "All right, then. Let's see." Frink stared at the code on the screen,
forehead furrowed in concentration.
perl -MDigest::MD5=md5 -0we'@a=@ARGV;@h{map{md5($_)}<>}=@a;@b=values%h;print"@b\n"' *
- "All right. ``-MDigest::MD5=md5'' is pretty easy: you're
loading the ``Digest::MD5'' module and importing the ``md5''
function from it, just as we've talked about before. ``-we'', we know
about - enable warnings and execute what follows as a script. ``-0'',
now... ah, I remember - a zero followed by an octal number sets the end-of-line
definition for the files we're reading in, and with no number it's a null. Oh, I
get it! Since text files contain no nulls, you're effectively
disabling the EOL, thus ``slurping'' entire files, one at a time. Right?"
Woomert silently applauded; Frink grinned and turned back to the screen
before him.
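The effect of ``-0'' can be tried without touching any real files. The sketch below is an illustration only: ``records'' is a hypothetical helper that reads a string through an in-memory filehandle, setting ``$/'' by hand the way the switch would.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read every record from a string, using $sep as the input record separator.
sub records {
    my ($text, $sep) = @_;
    local $/ = $sep;                  # -0 with no digits amounts to $/ = "\0"
    open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";
    my @records = <$fh>;
    close $fh;
    return @records;                  # in scalar context: the record count
}

my $text = "first line\nsecond line\nthird line\n";
printf "newline separator: %d records\n", scalar records($text, "\n");
printf "null separator:    %d records\n", scalar records($text, "\0");
```

With a newline separator the text comes back line by line; with a null separator, text that contains no nulls comes back as a single record.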
- "Next. You copy @ARGV right at the start - this saves the
list of file names so you can re-use them, since @ARGV is going to
change as we read in the files. Furthermore, you didn't have to use a BEGIN
block to do this since we're not looping the entire script, as we would
be with a ``-n'' or a ``-p'' switch. Next... uh, next it
gets pretty tricky. I'll admit that you've just lost me, although I can explain
what you did further on: you copied the values in the %h hash to
an array so you could use Perl's "pretty print" mechanism: an array in double-quotes
is printed with spaces between the elements, which was what you wanted. The
``\n'' at the end also deserves a comment: normally, you'd use the
``-l'' switch on the command line which would append the EOL to
every line that was printed, but you'd redefined EOL as a null, so that wouldn't
help - so you had to use the ``\n''. How's that?"
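Both points Frink makes here, that reading through ``<>'' uses up @ARGV and that an array in double quotes is joined with spaces, can be checked in a few lines. This is a sketch with assumptions: ``drain_argv'' is a hypothetical helper, and the two throwaway files are generated by File::Temp rather than taken from the story.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp the named files through <> and report what happened to @ARGV.
# Returns the number of names left in @ARGV and the saved copy as a string.
sub drain_argv {
    local @ARGV = @_;
    my @saved = @ARGV;       # copy first, as the one-liner does with @a
    local $/;                # undefined separator: one record per file
    my @contents = <>;       # each file that <> opens is shifted off @ARGV
    return (scalar @ARGV, "@saved");
}

# Two throwaway files with generated names
my @names;
for my $text ("alpha\n", "beta\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh;
    push @names, $name;
}

my ($left, $list) = drain_argv(@names);
print "names left in \@ARGV: $left\n";
print "saved copy: $list\n";
```

The separator used when an array is interpolated is held in the ``$"'' variable, which defaults to a single space.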
- "Well done, partner. Now, here's the rest of the story - are you
following this, Willard? Speak up if you don't understand something. While
Frink is ``chanting his beads'', so to speak, and learning in the process,
you're our reviewer for this run: if it's not being clearly explained, we'd
like to hear from you."
Willard cleared his throat.
- "Well - actually, I understand it all so far. I'm guessing that
a ``module'' is like a C library, and ``Digest::MD5'' probably has
to do with, well, generating MD5 sums - I've heard of this but am not really
sure of what that means. Other than that, yes, I think I've got it."
Frink spoke up.
- "An MD5 digest, or sum (sometimes also called a hash), is used as
a practically unique ID for strings, most commonly file contents. If you get a file and
its MD5 hash, you can check it using commonly available tools to make sure
that the file hasn't changed in any way by generating a new sum from the
file and comparing it with the one you've received. In fact, here's a useful
little utility that I use to do exactly that, instead of having to visually
compare them:
#!/usr/bin/perl
# "md5check" created by Ben Okopnik on Wed Apr 9 21:27:05 EDT 2003
use warnings;
use strict;
use Digest::MD5;
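# Expect exactly two arguments: the file and the digest it should have
die "Usage: ", $0 =~ /([^\/]+)$/, " <filename> <md5_hex_digest>\n"
    unless @ARGV == 2;
# Open the file named by the first argument; binmode keeps the read byte-exact
open my $fh, '<', shift or die "Can't open: $!\n";
binmode $fh;
my $d = Digest::MD5 -> new -> addfile( $fh ) -> hexdigest;
# The remaining argument is the digest we were given
print "MD5 sums ", ($d eq shift) ? "" : "*DO NOT* ", "match.\n";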
Makes it a little easier, I think. Anyway, back to Woomert's explanation...
I'd like to see how he pulled off this particular trick."
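The property being relied on is that identical content always gives an identical digest, while changed content gives a different one. A minimal sketch using the ``md5_hex'' function from Digest::MD5; the sample strings are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Same bytes in, same digest out; a changed string gives a different digest.
my $original = "hello world";
my $copy     = "hello world";
my $edited   = "hello world!";

print md5_hex($original), "\n";
print "copy matches\n"        if md5_hex($copy)   eq md5_hex($original);
print "edit does not match\n" if md5_hex($edited) ne md5_hex($original);
```

This is adequate for spotting accidental duplicates like Willard's. MD5 collisions can be constructed deliberately, so it should not be relied on against someone trying to fool the check.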
Woomert smiled at his partner.
- "Obviously, you're talking about the ``@h{map{md5($_)}<>}=@a''
bit, right? Yeah, that one is a little complex if you're not used to it.
What I did there is use a hash slice to populate %h - it's
a neat little idiom to keep in mind. If you think about how a hash is structured:
key1 => value1
key2 => value2
key3 => value3
key4 => value4
key5 => value5
...
you'll see that it's an array of keys which point to an array of values.
Consequently, we can treat it as such; as an example, we can create a hash
of the alphabet and letters' numerical positions by saying
@alpha{ 1 .. 26 } = "a" .. "z"; # The range operator, '..', generates the two lists
The ``@'' sigil before the hash name simply indicates the context
of what is going on; what tells us about the type of variable we're using
are the curly braces following the variable name - that indicates a hash.
If we saw square brackets, we'd know we were dealing with an array slice
instead.
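The alphabet example can be run as it stands. The sketch below wraps it in a hypothetical helper, ``letter_positions'', and adds a slice used for reading plus an array slice for comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the position-to-letter hash from the text with a hash slice.
sub letter_positions {
    my %alpha;
    @alpha{ 1 .. 26 } = "a" .. "z";
    return %alpha;
}

my %alpha = letter_positions();

# Slices work for reading, too: pull several values out at once.
my @word = @alpha{ 16, 5, 18, 12 };

# The same syntax with square brackets slices an array instead.
my @letters = ("a" .. "z");
my @first3  = @letters[ 0 .. 2 ];

print "letter 3 is $alpha{3}\n";
print "slice spells @word\n";
print "first three: @first3\n";
```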
Still, that doesn't explain everything - so here's the rest of it. Since
we're reading in the file contents one large slurp at a time, meaning that
we get one entire file's worth when we read the special ``<>''
filehandle, I simply used the map function to do an implicit loop
over it - and run the ``md5()'' routine over each of those chunks
of text. I would have had to do something very different if these weren't
text files - a file that contained a null would have thrown off the count
- but they were. My safety margin was in the fact that the ``-w''
switch would warn me about undefined values at print time if the slice ended up
with more keys than file names - which would happen if
there was a null anywhere in there. So, I created a hash of keys which were
MD5 digests of the file contents, and assigned the array of file names that
I'd created earlier as the values. It's important to note that hashes do
not store the key-value pairs in the order that they're assigned... but it
wasn't a factor here, since we were really dealing with arrays which are
stored in order.
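What a hash slice does when the two lists differ in length is easy to check. A small sketch, with ``slice_assign'' as a hypothetical helper and made-up key and file names: surplus keys are created with undefined values, and surplus values are dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assign @$values to the slice named by @$keys and return the resulting hash.
sub slice_assign {
    my ($keys, $values) = @_;
    my %h;
    @h{ @$keys } = @$values;
    return %h;
}

# Three keys, two values: the third key exists but has no value
my %short = slice_assign([qw(k1 k2 k3)], [qw(file1 file2)]);
print "k3 exists but is ", (defined $short{k3} ? "defined" : "undefined"), "\n";

# One key, two values: the extra value is dropped
my %long = slice_assign([qw(k1)], [qw(file1 file2)]);
print "keys stored: ", scalar(keys %long), "\n";
```

An explicit comparison of the two list lengths before the assignment would be a more direct check than waiting for a warning.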
Now, Frink, I'll leave this one thing to you. Why did this produce a list
of unique file names?"
Frink laughed.
- "Thanks, Woomert. I actually do know this one. Since a hash's keys
are unique - values don't have to be, but keys do - every time that you added
a key/value pair where the key already existed in the hash, the old value
for that key simply got overwritten. Voila - a unique list. In fact,
I can now break all this out in a script... mmm, I'll have to change a few
things, since the way you did it is implicit in that hash slice mechanism:
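#!/usr/bin/perl -w
use Digest::MD5 qw/md5/;

{
    local $/;                    # Temporarily undefine EOL
    @n = @ARGV;                  # Save the file names; <> empties @ARGV
    $count = 0;
    while ( <> ){
        $key   = md5($_);        # Digest of one entire file
        $value = $n[$count++];   # The file name that goes with it
        $uniq{ $key } = $value;  # A repeated key overwrites the old value
    }
}

print "$_ " for values %uniq;
print "\n";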
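The overwriting that Frink describes can be seen with in-memory data instead of files. In this sketch ``survivors'' is a hypothetical helper and the names and contents are invented; when two entries share content, the later name replaces the earlier one under the same digest key.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Takes [ name, content ] pairs; returns one name per distinct content.
sub survivors {
    my @pairs = @_;
    my %by_digest;
    # A repeated digest overwrites the name stored earlier
    $by_digest{ md5_hex($_->[1]) } = $_->[0] for @pairs;
    my @names = sort values %by_digest;   # sorted, since hash order is arbitrary
    return @names;
}

my @kept = survivors(
    [ 'a', "same text\n"  ],
    [ 'b', "other text\n" ],
    [ 'c', "same text\n"  ],   # duplicate of 'a', so 'c' replaces it
);
print "@kept\n";
```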
After a moment or two, Willard suddenly spoke up.
- "Say, I think I understand this stuff. Why, that doesn't look complicated
at all! I'm not sure about the ``$_'' and the ``$/'' variables,
but I'd think I can find out about those - Perl does have good documentation,
right?"
Frink and Woomert both laughed, and Frink fielded the question.
- "The best. In fact, it all comes with Perl - and is augmented with
every module you install. It's all available via the ``perldoc''
program; start by reading ``perldoc perldoc'', and you'll never find
yourself at a loss for information about Perl."
Somewhat later, after the very grateful Willard had headed for home and
(finally) a night of sleep, Frink and Woomert were relaxing with a rare recording
of Burundi Ubuhuba nose-singing that was accompanied by a thumb-piano
and zither. As usual, the food accompanying the music was tasty and highly
appropriate: dinner consisted of curried ingelegde vis (a spicy fish
recipe that Woomert had learned at Cape Malay) and futari (squash
and yams) on the side, with East African samosa bread and spicy piri-piri
sauce for the adventurous. Pickled African peaches wrapped up the menu.
Suddenly, there was a loud jangling noise from the outside, followed by
cursing that would blister cheap paint (Woomert had providentially done the
house and the out-buildings in a top-grade epoxy, so they weren't affected),
and by police sirens shortly thereafter.
- "Ah." Woomert casually leaned back in his chair, nibbling on one
last tasty peach. "That would be the Zigamorphs. Back to prison they go for
violating their probation; they had been explicitly told to stay out of my
neighborhood."
- "What... happened, Woomert? It sounded pretty bad."
- "I knew they'd come calling soon, and had set a trap for them. Just
a very basic numerical complement program which would throw a steel-cage exception
when it detected a null [2]. One of these days, Frink, the
criminals will become intelligent - mark my words, it's a simple matter
of selection pressure. Until then, we can all sleep safe in our beds..."
[1] Larry Wall, the creator of Perl, has suggested "Pathologically Eclectic
Rubbish Lister" for those who simply can't stand to have Perl not be
an acronym. "Practical Extraction and Report Language" has also been
suggested for those who have to sell the idea of using it to management,
which is usually well-known for its complete lack of a sense of
humor.
[2] A zigamorph, according to the Jargon File, is a hex 'FF' character
(11111111). A numerical complement of this would, of course,
be all zeros - a null.
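The arithmetic in footnote [2] can be checked directly. A minimal sketch, with ``complement_byte'' as a hypothetical helper; the mask is needed because Perl's ``~'' operator works on the full integer width, not on a single byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One's complement of a byte, masked back down to 8 bits.
sub complement_byte {
    my ($byte) = @_;
    return ~$byte & 0xFF;
}

printf "complement of 0x%02X is 0x%02X\n", 0xFF, complement_byte(0xFF);
```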