Bash for NLP tutorial, advanced topics · John Hewitt
Bash can be used to do complex things faster than you could whip up a Python script to do the same things. However, because of tricky syntax and not altogether intuitive semantics, it tends to push people away when it tries to show love. In other words, it’s frequently misunderstood. As computer scientists, we surely can empathize with bash, and give it another chance.
This tutorial is composed of topical case studies revolving around either solving a specific problem or becoming proficient with a specific tool. It assumes you’ve read through the basic tutorial hosted on this site, and beyond that, have had some time twiddling around on the command line to get a feel for the ropes.
Disclaimer: bash operates on strings
I can’t stress this enough. File paths are strings. Output is strings. Command names are strings. There are no types; there is nothing but string.
Process Substitution
Credit for this, my favorite bash tidbit, must be shared with Jonathan May.
Many tools used in bash scripts take a variable number of arguments, each of which must be the location of a file. For example, I use paste to take a side-by-side look at two similar files:
paste model_output.ug gold_standard.ug
However, frequently the data we’re trying to analyze may be the output of processes. In this case, we have to redirect stdout to a file for each process, and then paste together the results:
tr ' ' '\n' < model_output.ug | sort | uniq -c | sort -n > model_types.freq
tr ' ' '\n' < gold_standard.ug | sort | uniq -c | sort -n > gold_types.freq
paste model_types.freq gold_types.freq
These commands count the frequency of space-separated words in each file, sort the counts, write them to a file, and then paste the two files side-by-side for human analysis.
Sometimes we want these temporary files (e.g. model_types.freq, gold_types.freq); other times we do not.
Process Substitution allows us to treat the output of each command as a file object without actually writing anything to disk. This has obvious I/O benefits, as well as potentially eliminating unwanted temporary files, and allowing for quicker re-execution of similar code. The syntax is as follows:
command_that_takes_files.sh file1.txt <(foo.sh arg1 arg2 ) file3.txt
Here, the stdout of foo.sh is treated as if we had printed it to a file, and then included that file in the command. Now, let’s re-write the type-frequency command sequence:
paste <(tr ' ' '\n' < model_output.ug | sort | uniq -c | sort -n) \
<(tr ' ' '\n' < gold_standard.ug | sort | uniq -c | sort -n)
More later on how to chain together process substitution commands to make some unnecessarily complex, beautiful bash commands.
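To make the idea concrete, here is a small self-contained sketch. The file names and their contents are made up for illustration (the real model_output.ug and gold_standard.ug are not shown in this tutorial), and the helper function type_freq is just a name I picked for the counting pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical toy data in a scratch directory; not the tutorial's real files.
tmpdir=$(mktemp -d)
printf 'the cat sat\nthe cat\n' > "$tmpdir/model.txt"
printf 'the dog sat\nthe dog\n' > "$tmpdir/gold.txt"

# Count word types in a file: one word per line, then sort, count, sort by count.
type_freq() {
  tr ' ' '\n' < "$1" | sort | uniq -c | sort -n
}

# Each <( ... ) is replaced by a path that paste reads like an ordinary file,
# so no intermediate .freq files are written.
side_by_side=$(paste <(type_freq "$tmpdir/model.txt") <(type_freq "$tmpdir/gold.txt"))
echo "$side_by_side"

rm -r "$tmpdir"
```

Note that process substitution is a bash feature; a script run with plain sh may reject the <( ) syntax.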
for loops
for loops in bash iterate over various types of strings. The easiest and most common use is to iterate over the contents of a directory.
Iterating over the contents of a directory
To iterate over the contents of the current directory, use the following:
for i in $( ls . ); do
echo $i
done
Let’s go through a few subtle aspects of this.
- First, note that in each iteration of the loop, the variable i is assigned the name of some file in the directory at “.”. To access the value of the variable i and not the string i, we use the dollar sign, thus $i.
- Second, note the $() structure. This is command substitution: bash runs the ls command inside the parentheses and substitutes its output into the for line. The spaces just inside the parentheses are optional, so $(ls path), $( ls path ), and $( ls path) all behave the same.
- Third, note that the for is matched with a corresponding do and a final done.
- Fourth, the semicolon before do is necessary (a newline also works). The command could also be inlined, as in
for i in $( ls path ); do echo $i; done
The semicolon after echo $i is needed because that command must be terminated before the loop is terminated.
- Fifth, let’s say I’m iterating over the contents of some absolute path:
for i in $( ls /nlp/users/johnhew/goldstandarddata ); do echo $i; done
Using $i as the path to a file will fail! Why? Because the variable i just stores the path relative to /nlp/users/johnhew/goldstandarddata. Instead, I should run
goldpath=/nlp/users/johnhew/goldstandarddata
for i in $( ls $goldpath ); do echo $goldpath/$i; done
Note that $goldpath/$i merely concatenates the two parts of the filepath together with a / in the middle, since bash only works with strings. This accesses the files where they actually are, rather than pretending they’re in the current working directory.
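The loops above depend on the output of ls being split on whitespace, so a file name containing a space is split into several strings. A minimal sketch of an alternative, assuming a scratch directory with made-up file names: let bash expand a glob and quote the variable.

```shell
#!/usr/bin/env bash
# Hypothetical directory and file names, created only for this illustration.
goldpath=$(mktemp -d)
touch "$goldpath/a.txt" "$goldpath/b c.txt"

count=0
# The glob expands to full paths, so no manual concatenation with $goldpath is needed.
for f in "$goldpath"/*; do
  # Quoting "$f" keeps a name with spaces together as a single string.
  echo "found: $f"
  count=$((count + 1))
done

rm -r "$goldpath"
```

Here the loop body runs twice, once per file. With for i in $( ls $goldpath ) it would run three times, because “b c.txt” would be split into “b” and “c.txt”.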
Iterating over a sequence or otherwise
What if you want to iterate over something like a sequence of numbers, or a pre-specified set of values? It’s not going to be a problem.
To iterate over a sequence of integers, first test the seq command as follows:
seq start_integer end_integer
seq 1 10
Now recall that bash works on strings, and will be willing to iterate over the string that seq produces as follows:
for i in `seq 1 10`; do echo $i; done
We’re introduced to new syntax, the backtick (` `) notation. This means “execute the command within these backticks and consider its output as part of the string that bash operates on”. It gets pretty meta. So, what would have happened had we omitted the backticks? The command
for i in seq 1 10; do echo $i; done
prints out
seq
1
10
which is hilarious, I think, but is also the answer to our other question: “how do I iterate over a sequence of pre-specified values?”
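Both cases can be sketched in a few lines. The split names train, dev, and test are made-up values chosen only for illustration.

```shell
#!/usr/bin/env bash
# Iterate over a literal, pre-specified list of strings (made-up split names).
visited=""
for split in train dev test; do
  visited="$visited$split,"
done
echo "$visited"

# Iterate over integers; $(seq 1 5) behaves like the backtick form of the same command.
total=0
for i in $(seq 1 5); do
  total=$((total + i))
done
echo "$total"
```

The $( ) form and the backtick form are interchangeable here; the $( ) form is easier to nest.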
Tips on for iteration
- You’ll frequently want to just iterate over files, or over directories, or over just all .tsv files. Modify the ls command to do this for you, as in the last two cases:
for i in $( ls -d */ ); do echo "$i is a directory"; done
for i in $( ls *.tsv ); do echo "$i is a .tsv"; done
- You can nest these loops, and life really gets fun then. For example, I use indirected directories when I’m storing over 1 million files. Thus, you could do something that looks like
for i in $( ls $root ); do for j in $( ls $root/$i ); do for k in $( ls $root/$i/$j ); do cat $root/$i/$j/$k; done done done
branching; if conditionals
Conditionals are very easy if you’d like to check something related to a file system. To check for the existence of a file, the syntax is the following:
if [ -f path_to_file ]; then echo "woo!"; fi
I usually use ifs in the middle of iterating over a directory, for example if you’re looping through directories and you want to check some kind of output if and only if the output file exists for that directory. (You know, because each directory has 1 experiment, and not all of the experiments have finished, but you’re really impatient.)
for dirpath in $( ls path_to_dirs ); do if [ -f path_to_dirs/$dirpath/results.txt ]; then cat path_to_dirs/$dirpath/results.txt; fi; done
Note that you have to close the ifs and fors properly, or bash gives you some well-meaning but useless syntax error.
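Besides -f, the test command has other file checks that are handy in the same situation. The sketch below assumes a made-up layout with one directory per experiment (exp1 and exp2 are hypothetical names) and uses -d (is a directory) and -s (exists and is non-empty).

```shell
#!/usr/bin/env bash
# Hypothetical layout: one directory per experiment, only some with results.
root=$(mktemp -d)
mkdir "$root/exp1" "$root/exp2"
echo "accuracy 0.9" > "$root/exp1/results.txt"

finished=""
for dirpath in "$root"/*; do
  # -d: path exists and is a directory; -s: file exists and is non-empty.
  if [ -d "$dirpath" ] && [ -s "$dirpath/results.txt" ]; then
    finished="$finished$(basename "$dirpath") "
  else
    echo "$(basename "$dirpath") has no results yet"
  fi
done
echo "finished: $finished"

rm -r "$root"
```

Using -s instead of -f skips experiments that have created results.txt but not yet written anything to it.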
while loops
While loops can be of great use in bash. It’s best to use them for very simple purposes; while loops over a file, with complicated actions on each line, might best be done in Python.
Let’s start with the simplest case: a while true loop for a lazy cron job.
while true; do
echo 'Are you working or just on xkcd?'
sleep 1000
done
Note that the indentation is not necessary; in fact, the whole thing could be done on one line:
while true; do echo 'Are you working or just on xkcd?'; sleep 1000; done
Consider as well the resource I used when writing this section, at tldp.
When might this be useful? Maybe you have a script to update a status page. If you save the following script into update_daemon.sh
while true; do ./update_status.sh; sleep 1000; done
then on your remote server you can have a simple, lazy status updater simply by running:
nohup ./update_daemon.sh &
disown
The nohup makes the process ignore the SIGHUP (hangup) signal that is sent when your ssh session disconnects, which would otherwise cause the updater to die. The disown command subsequently removes the process from the job table of the bash process through which you ran your command, so that shell no longer signals it on exit. Together, I’ve found these work well for keeping a process going indefinitely even though I’ve logged off.
A more complicated while loop will iterate over the contents of a file. This is perhaps more complicated in bash than is worth bothering with; see for example this discussion. In many cases, iterating through the lines of a file and performing some action for each line is something best done in Python. If your file is simple and without special characters, try this first:
while read line; do
echo $line
done < file.txt
which again can be inlined as:
while read line; do echo $line; done < file.txt
What’s going on here? line is specified as a variable, taking the contents of each line. The whole while loop is considered a process, into which the file file.txt can be redirected via stdin. This implies you could achieve roughly the same with a pipe (while raising some eyebrows from purists):
cat file.txt | while read line; do echo $line; done
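If the file is not so simple, a commonly used, more defensive form of the loop is sketched below. The file contents are made up to exercise the awkward cases: leading whitespace, a backslash, and a last line without a trailing newline.

```shell
#!/usr/bin/env bash
# Made-up file: leading spaces, a backslash, and no newline after the last line.
tmpfile=$(mktemp)
printf '  indented line\nback\\slash\nlast line' > "$tmpfile"

n=0
last=""
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
# The || [ -n "$line" ] part also processes a final line that lacks a newline.
while IFS= read -r line || [ -n "$line" ]; do
  n=$((n + 1))
  last=$line
  printf '%s\n' "$line"
done < "$tmpfile"

echo "read $n lines"
rm "$tmpfile"
```

One practical difference from the cat file.txt | while ... form: in bash, by default, each part of a pipeline runs in a subshell, so variables such as n and last set inside a piped loop are gone once the loop ends. With the redirect form they survive.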
Path and file manipulation: sed, grep
One thing I have to do embarrassingly often is mass-rename a bunch of files. Thankfully, with grep and sed, this is pretty easy and quick!
Let’s say I have a bunch of files, named for example test-small-E1.yaml, test-small-E2.yaml, test-med-E1.yaml, test-med-E2.yaml, test-large-E1.yaml, test-large-E2.yaml. These look like they’re experiment config files (because I used .yaml files for specifying experiment configs!). I’d like to change the experiments I’m running, and I want the filenames to show that. Specifically, I want to change the large experiments to huge instead. I would use:
for file in test-large*.yaml; do mv $file `echo $file | sed 's/large/huge/'`; done
What happened here? First, the for loop, for file in test-large*.yaml, uses a wildcard, the *. Before the command is run, bash will replace the text test-large*.yaml with all filenames that match the pattern given, where the wildcard can represent anything. Thus, the command will be resolved to:
for file in test-large-E1.yaml test-large-E2.yaml; do mv $file `echo $file | sed 's/large/huge/'`; done
This behavior is nice because it lets us quickly specify all files of interest while omitting the others. This quick command makes a strong argument for systematically-named configuration files! Next, the mv command. This is fun; let’s look at it in detail:
mv $file `echo $file | sed 's/large/huge/'`
So we’re moving (renaming) the file from its old place at $file to a new location. What location? Recall that the backticks ` ` mean “run the command between the backticks and replace the text between the backticks with the result of the command”. So we echo the old filepath, and then use sed, the “stream editor”, to modify the path. sed allows replacements through regular expressions, but this is a simple replacement. It finds the first instance of the string large and replaces it with huge.
So, once the backticks have done their work, the move command looks more like
mv test-large-E1.yaml test-huge-E1.yaml
and correspondingly for the other file. What fun! Imagine if you had to rename 100 such files. Using this little script, it takes no longer to do.
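For a plain string replacement like this one, bash can also do the substitution itself through parameter expansion, with no call to sed. The following is a sketch under that assumption, run in a scratch directory with made-up config files mirroring the names above.

```shell
#!/usr/bin/env bash
# Hypothetical config files in a scratch directory, mirroring the names above.
workdir=$(mktemp -d)
touch "$workdir/test-large-E1.yaml" "$workdir/test-large-E2.yaml" "$workdir/test-small-E1.yaml"

for file in "$workdir"/test-large*.yaml; do
  name=$(basename "$file")
  # ${name/large/huge} replaces the first match of "large" in $name (bash-specific).
  new="$workdir/${name/large/huge}"
  echo "mv $file -> $new"   # print the plan before acting on it
  mv "$file" "$new"
done

renamed=$(ls "$workdir")
echo "$renamed"
rm -r "$workdir"
```

Printing each planned rename before running mv is a cheap safety net; commenting out the mv line turns the loop into a dry run.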
In-place file substitution with sed
A useful task related to the mass file renaming above is making the same change to the contents of many similar files. Imagine we had renamed all of our test-large*.yaml files to the test-huge*.yaml naming format, but we also needed to change the contents of the files to reflect that. With sed, it’s a (potentially) simple process.
In particular, consider that you have some key-value pair in the text, like the following:
experiment_size: large
and you want to replace it with experiment_size: huge. A quick change to the script we wrote above, and this is solved:
for file in test-huge*.yaml; do sed -i 's/experiment_size: large/experiment_size: huge/g' $file; done
What is this doing? The sed command takes the -i flag, which means “edit the file in-place”, aka, change the contents of the file without moving it. It finds all instances of the string experiment_size: large (“all” because of the /g ending to the command) and replaces them with experiment_size: huge.
With these simple loops, it’s easy to put all experiment parameters into configuration files which you can then commit to git repositories, thus making it easier for you and others to keep track of what parameters led to what results!
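Since an in-place edit overwrites the file, it can be worth keeping a backup while you check that the substitution did what you meant. The sketch below assumes a made-up config file in a scratch directory and uses the -i.bak form, which saves the original next to the edited file.

```shell
#!/usr/bin/env bash
# Hypothetical config file; the key name follows the example above.
workdir=$(mktemp -d)
printf 'experiment_size: large\nlearning_rate: 0.1\n' > "$workdir/test-huge-E1.yaml"

for file in "$workdir"/test-huge*.yaml; do
  # -i.bak edits in place, but first saves the original as <file>.bak
  sed -i.bak 's/experiment_size: large/experiment_size: huge/g' "$file"
done

new_first_line=$(head -n 1 "$workdir/test-huge-E1.yaml")
old_first_line=$(head -n 1 "$workdir/test-huge-E1.yaml.bak")
echo "$old_first_line -> $new_first_line"
rm -r "$workdir"
```

Attaching the suffix directly to -i also sidesteps a portability difference: GNU sed accepts a bare -i, while the BSD sed shipped with macOS expects a backup suffix argument after it.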
file manipulation: cut, paste, column
file manipulation: sort, uniq
arithmetic: the dark arts
It’s a bad idea to use bash for arithmetic-related things. Teaching about it, I’m reminded of one of history’s not-so-great educators, Horace Slughorn (spoilers). However, if you insist, here we go:
The bc command is a “calculator language” which you can use for (no) fun and profit:
It’s a bad idea:
$ echo `seq 1 10` | sed 's/ / + /g' | bc
55
Don’t use it to increment the version number on a bunch of files, for example, aV2.txt, bV2.txt, as:
for i in *; do
old_index=`echo $i | grep -o "[0-9]"`
new_index=`echo "$old_index + 1" | bc`
mv $i `echo $i | sed "s/$old_index/$new_index/g"`
done
where now the files will be named aV3.txt, bV3.txt. It’s really not worth it! But there you go.
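If you insist a little further: for integers, the shell’s own $(( )) arithmetic expansion covers both examples without bc. This is a sketch, with aV2.txt used purely as a string (no files are touched).

```shell
#!/usr/bin/env bash
# Sum 1..10 with shell arithmetic instead of piping a string to bc.
sum=0
for i in $(seq 1 10); do
  sum=$((sum + i))
done
echo "$sum"

# Bump a version number embedded in a made-up file name.
name="aV2.txt"
old_index=$(echo "$name" | grep -o '[0-9][0-9]*')
new_name="aV$((old_index + 1)).txt"
echo "$new_name"
```

Shell arithmetic only handles integers, so bc (or Python) is still the tool for anything involving decimals.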
case study on xargs: when you have too many files.
case study on efficiency in filesystems
Symbolic links
Symbolic links are great when you want to deal with nice pretty filepaths, but your data is in a shared location / on some mega disk somewhere else. They make it seem like there’s a path, right in your cozy directory of choice, to some arbitrary other path. The general syntax is:
ln -s ugly_target_filepath_to_type_once nice_filepath
Note that, to be very clear, ugly_target_filepath_to_type_once already exists, and you’re creating a “file” at nice_filepath that will act like the ugly path.
Some caveats: symbolic links aren’t quite the same as having the directory right there. Sometimes the behavior is the same. If you try the following:
ls nice_filepath
ls ugly_target_filepath_to_type_once
you get the same thing! However, if you try the following, attempting to calculate the total number of bytes stored under each filepath,
du -sh nice_filepath
du -sh ugly_target_filepath_to_type_once
the ugly filepath will give you the correct answer, but the nice filepath will give you 0. Instead, you should run
du -sh nice_filepath/
(yes, the trailing forward slash makes all the difference) in order to get the correct answer. Intuitively, this trailing slash forces the symbolic link to be treated as the directory it points to, not as the vacuous file that it actually is in your directory.
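Here is a small sketch of the whole round trip, using made-up paths inside a scratch directory: create the link, check that it is a link, see where it points, and list through it.

```shell
#!/usr/bin/env bash
# Hypothetical paths inside a scratch directory.
base=$(mktemp -d)
mkdir -p "$base/mega_disk/shared/data"
echo "some data" > "$base/mega_disk/shared/data/corpus.txt"

# ln -s TARGET LINK_NAME: the existing target comes first, the new link second.
ln -s "$base/mega_disk/shared/data" "$base/data"

# -L is true for a symbolic link; -d follows the link and sees a directory.
if [ -L "$base/data" ] && [ -d "$base/data" ]; then
  echo "data is a symlink to a directory"
fi

link_target=$(readlink "$base/data")   # where the link points
listing=$(ls "$base/data/")            # same listing as the target directory
echo "$link_target"
echo "$listing"

rm -r "$base"
```

The argument order of ln -s is the usual stumbling block; reading it as “link to TARGET, call it LINK_NAME” may help.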
Posted on 07 Mar 2017.