Get the Plumbing Right
With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.
It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users to do things the authors of the software never dreamed of.
Writing a Ruby script that fits into that scheme is actually very
simple. The simplest piece of code that copies everything from
STDIN to STDOUT is just three lines long:
while gets
  print
end
Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.
$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2
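Both invocations behave the same because Kernel#gets reads from ARGF, which serves the files named in ARGV one after another and falls back to STDIN when ARGV is empty. The sketch below makes that explicit. It assumes the option parser has already removed the options and their arguments from ARGV; copy_lines is a hypothetical helper, not part of csvt.

```ruby
require "stringio"

# Hypothetical helper (not part of csvt): copies every line from
# input to output unchanged. In a real script the call would be
# copy_lines(ARGF, $stdout).
def copy_lines(input, output)
  while (line = input.gets)
    output.print line
  end
end

# Demonstration with in-memory streams instead of real files.
source = StringIO.new("a,b,c\nd,e,f\n")
sink = StringIO.new
copy_lines(source, sink)
```

Passing the streams as arguments keeps the loop easy to try out without touching real files or pipes.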
The simple loop shown in Section 6 is not very useful, because it
does not do any processing of input. It does illustrate the general
idea, though. The csvt script will use two such loops, one for
--extract and one for
--remove. Both start with
a test of the appropriate flag:
if extract_f == true
  first_f = true
The first_f flag is used to avoid the "off by
one" error inside the while loop:
  while gets
    data = $_.chop
    data = data.split(",")
    data_n = data.length
Every loop cycle starts with a call to gets, which reads a new line
from STDIN and stores it in
$_. Next the script
removes the end of line character and splits the line into an array of
fields.
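Two details of these calls are worth knowing. String#chop removes the last character whatever it is, so a final line that lacks a newline loses real data, while String#chomp removes only a trailing line terminator. String#split with a single argument discards trailing empty fields, which changes the field count; passing -1 as the limit keeps them. A small sketch follows; split_fields is a hypothetical helper, not part of csvt.

```ruby
# Hypothetical helper (not part of csvt): turns one raw input line
# into an array of fields, keeping trailing empty fields.
def split_fields(raw_line)
  raw_line.chomp.split(",", -1)
end

split_fields("a,b,c\n")      # three fields
split_fields("a,b,,\n")      # four fields, the last two empty
"a,b,,\n".chop.split(",")    # only two fields: trailing empties are dropped
"a,b,c".chop                 # "a,b," (the last character is lost)
```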
    if first_f
      old_data_n = data_n
      first_f = false
    end
The size of the array is stored in
data_n. Then it tests
if the line just read was the first line and sets the number of columns on
the non-existent previous line to the number of columns on the first line
to pass the data integrity check (comparing the number of columns in the
previous and the current line).
    if data_n != old_data_n
      $stderr.print "csvt: the number of fields on the " +
                    "following line does not match the number " +
                    "of fields on the previous line\n"
      $stderr.print $_
      exit(1)
    end
Should the data integrity test fail, the error message followed by the
offending line will be printed to standard error and the execution of
csvt will stop with exit code 1. It is tempting to relax the rules a little
and introduce an option for skipping such errors, but that's a job for a
separate tool; namely, a specialized data integrity checker, which is
usually written with a particular data set in mind and therefore outside
the scope of csvt.
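Because csvt stops at the first mismatch, comparing each line with the previous one amounts to comparing every line with the first. A minimal sketch of the same check as a standalone function, assuming the rows have already been split into arrays of fields; first_mismatch is a hypothetical name, not part of csvt.

```ruby
# Hypothetical helper (not part of csvt): returns the zero-based index
# of the first row whose field count differs from the first row, or
# nil when all rows have the same number of fields.
def first_mismatch(rows)
  expected = nil
  rows.each_with_index do |fields, index|
    expected ||= fields.length
    return index if fields.length != expected
  end
  nil
end

first_mismatch([%w[a b], %w[c d]])        # => nil
first_mismatch([%w[a b], %w[c], %w[d e]]) # => 1
```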
When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:
line = ""
Next we traverse the array of arguments for the --extract
option. As you will notice, there is a test that checks if the column index is
less than the number of fields in the line we just read. If it is not,
csvt will complain, suggest the allowed range of indexes and
exit with code 1.
    extract_args.each do |column|
      if !(column < data_n)
        $stderr.print "csvt: column index out of range, " +
                      "use numbers between 0 and ", data_n - 1, "\n"
        exit(1)
      end
If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.
      line += data[column] + ","
    end
Once all columns listed as arguments of --extract have
been processed, we can print the contents of the line variable, less the
last character, which we replace with the end of line character.
print line[0, line.length-1], "\n"
The last thing is setting the
old_data_n variable to the
number of columns in the currently processed line, so the data integrity
check can spot any errors.
    old_data_n = data_n
  end
end
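The range check and the string-building steps can also be expressed with Array#values_at and Array#join, which avoids appending a comma and trimming it afterwards. This is only a sketch: extract_columns is a hypothetical helper, and unlike the loop above it also rejects negative indexes, which Ruby would otherwise count from the end of the array.

```ruby
# Hypothetical helper (not part of csvt): returns the selected fields,
# in the order requested, joined with commas.
def extract_columns(fields, indexes)
  indexes.each do |index|
    unless (0...fields.length).cover?(index)
      raise IndexError, "column index out of range, " \
                        "use numbers between 0 and #{fields.length - 1}"
    end
  end
  fields.values_at(*indexes).join(",")
end

extract_columns(%w[a b c], [2, 0])  # => "c,a"
```

Raising an exception instead of calling exit(1) leaves the decision about how to report the error to the caller.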
So it goes until the end of the file or data stream. When all data is
processed, our script exits.
The code used to process
STDIN when the user chooses the
--remove option is similar to the --extract
handler, with a small twist after the line variable initialization.
if remove_f == true
  first_f = true
  while gets
    data = $_.chop
    data = data.split(",")
    data_n = data.length
    if first_f
      old_data_n = data_n
      first_f = false
    end
    if data_n != old_data_n
      $stderr.print "csvt: the number of fields on the following " +
                    "line does not match the number of fields on " +
                    "the previous line\n"
      $stderr.print $_
      exit(1)
    end
    line = ""
There is an additional loop that sets the columns whose indexes are
listed as arguments of
--remove to "".
    remove_args.each do |column|
      if !(column < data_n)
        $stderr.print "csvt: field index out of range, " +
                      "use numbers between 0 and ", data_n - 1, "\n"
        exit(1)
      end
      # Blank out the field so the output loop can skip it.
      data[column] = ""
    end
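The rest of the code follows the same pattern as the --extract handler: it builds the output line from the remaining fields, prints it, and updates old_data_n.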
    data.each do |column|
      if column == ""
        next
      else
        line += column + ","
      end
    end
    print line[0, line.length-1], "\n"
    old_data_n = data_n
  end
end
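The loop above skips every empty string, so a field that was already empty in the input is dropped along with the removed columns. A sketch of an alternative that drops fields by position instead is shown below; remove_columns is a hypothetical helper, not part of csvt.

```ruby
# Hypothetical helper (not part of csvt): returns the fields whose
# positions are not listed in indexes, joined with commas.
def remove_columns(fields, indexes)
  kept = []
  fields.each_with_index do |field, position|
    kept << field unless indexes.include?(position)
  end
  kept.join(",")
end

remove_columns(%w[a b c], [1])       # => "a,c"
remove_columns(["a", "", "c"], [2])  # => "a,"
```

In the second call the empty middle field survives, so the output still has the expected number of columns.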
We now have a complete script to help us filter CSV files. It may grow in the future, but for now it does its job: it plays well with other command-line Unix tools and is a well-behaved Unix citizen. The complete script is here.
Your script is working now and you could call it quits, but for greater
convenience in the future, make an extra effort and make
csvt executable, so you can type just this:
$ csvt
instead of this:
$ ruby csvt.rb
If you are using Unix, simply add a #! line that names the Ruby interpreter as the first line of your script.
The actual path to the
ruby interpreter binary might be
different on your system. The easiest way to find out is to use the
locate or which commands:
$ locate ruby
$ which ruby
If either fails, use find:
$ find / -name "ruby"
This might take a while because
find is searching the
whole directory tree. Once you know the access path to the
ruby binary, paste it after
#! and save the
script to disk. Remember that you need to place these instructions on the
very first line of your script or the shell will not be able to recognize
it as a request to use the Ruby interpreter. If you need to list options
for the interpreter, you can list them, but remember that there is no need
to list the name of the script itself.
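If Ruby already runs from your shell, the interpreter can report its own location, which saves the search. RbConfig.ruby, from the standard rbconfig library, returns the full path of the running interpreter. The helper name shebang_line below is hypothetical.

```ruby
require "rbconfig"

# Hypothetical helper: builds the first line of the script from the
# path of the interpreter that is running this code.
def shebang_line
  "#!#{RbConfig.ruby}"
end

puts shebang_line
```

The printed line can be pasted at the top of csvt as it is.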
Save csvt to disk, and make it executable with
chmod u+x csvt. The
u+x argument tells
chmod to mark
csvt as executable only by the owner of the script (that
would be you ...). Other possibilities include g+x, which
marks the script as executable by all members of the group that the script
is assigned to (
ls -l reveals the script's group);
o+x, which would make the script executable by all other
users (not a good idea); finally,
a+x would make it
executable by all users (this should be avoided as well).
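Each of these arguments switches on one or more execute bits in the file's permission mode. The sketch below shows the mapping and applies it with File.chmod on a throwaway file; add_execute_bit and EXECUTE_BITS are hypothetical names, and the example assumes a Unix file system.

```ruby
require "tempfile"

# Execute bits set by each "who" letter of chmod's symbolic mode.
EXECUTE_BITS = { "u" => 0o100, "g" => 0o010, "o" => 0o001, "a" => 0o111 }.freeze

# Hypothetical helper: adds the execute bit for the given class of
# users, leaving the other permission bits as they are.
def add_execute_bit(path, who)
  current = File.stat(path).mode & 0o7777
  File.chmod(current | EXECUTE_BITS.fetch(who), path)
end

# Demonstration on a temporary file that starts as rw-r--r-- (0644).
file = Tempfile.new("csvt-demo")
File.chmod(0o644, file.path)
add_execute_bit(file.path, "u")
mode_after = File.stat(file.path).mode & 0o777  # rwxr--r-- (0744)
file.close!
```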
Note that neither the
#! notation nor the chmod
command can be used in the Microsoft Windows environment unless you
install the Cygwin package, which turns Windows into a pretty good Unix
environment look-and-feel-alike. When installing Cygwin is not an option,
you can still use
csvt, but it must be preceded with the
ruby command, as in
ruby csvt -e file instead of
csvt -e file.
The following places should be on the list of favorite destinations for everyone learning and using Ruby:
- Ruby binaries and sources
- Ruby mailing lists
- the Ruby newsgroup
- the Cygwin Unix environment for Microsoft Windows
- the Fink Unix environment for Mac OS X (the latest Ruby builds for Mac OS X)
If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.
Jacek Artymiak started his adventure with computers in 1986 with Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.
Return to ONLamp.com.