‘xargs’ – Handling Filenames With Spaces or Other Special Characters

xargs is a great little utility to perform batch operations on a large set of files.
Typically, the results of a find operation are piped to the xargs command:

   find . -iname "*.pdf" | xargs -I{} mv {} ~/collections/pdf/

The -I{} tells xargs to substitute ‘{}’ in the statement to be executed with the entries being piped through.
If these entries have spaces or other special characters, though, things will go awry.
For example, filenames with spaces in them passed to xargs will result in xargs barfing with a “xargs: unterminated quote” error on OS X.

The solution is to use null-terminated strings in both the find and xargs invocations:

   find . -iname "*.pdf" -print0 | xargs -0 -I{} mv {} ~/collections/pdf/

Note the -print0 argument to find, and the corresponding -0 argument to xargs: the former tells find to produce null-terminated entries while the latter tells xargs to expect and consume null-terminated entries.
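
For example, given a couple of awkwardly-named files (the names here are hypothetical), the null-terminated pipeline handles them without complaint:

   touch "./reading list.pdf" "./bob's thesis.pdf"
   find . -iname "*.pdf" -print0 | xargs -0 -I{} mv {} ~/collections/pdf/

Both files arrive intact in “~/collections/pdf/”, spaces, apostrophes, and all.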

Useful diff Aliases

Add the following aliases to your ‘~/.bashrc‘ for some diff goodness:

alias diff-side-by-side='diff --side-by-side -W"`tput cols`"'
alias diff-side-by-side-changes='diff --side-by-side --suppress-common-lines -W"`tput cols`"'
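
For example, to see just the changed lines of two (hypothetical) files side by side:

$ diff-side-by-side-changes notes-draft.txt notes-final.txt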


You can, of course, use shorter alias names in good old UNIX tradition, e.g. ‘ssdiff’ and ‘sscdiff’. You might be wondering (a) why I did not do so, and (b) conversely, what the point is of having aliases that are almost as long as the commands that they alias. The answer to the first is ‘memory’, and to the second, ‘autocomplete’.



Shorter aliases resulted in me constantly forgetting what they were mapped to (I rarely work outside a Git repository, and thus rarely use external diff, relying on Git’s diff 99% of the time), and it was easier for me to Google the options than to open up my huge ‘~/.bashrc’ to look up my personal alias. Being forced not only to look up the options but also to type out all those awkward characters again and again meant that I rarely ended up using these neat diff options. Now, with these aliases, I just type ‘diff’, hit ‘TAB’, and let autocompletion show me and finish off the rest of the command for me.

Supplementary Command-History Logging in Bash: Tracking Working Directory, Dates, Times, etc.


Here is a way to create a secondary shell history log (i.e., one that supplements the primary “~/.bash_history”) that tracks a range of other information, such as the working directory, hostname, time and date, etc. Using the “HISTTIMEFORMAT” variable, it is in fact possible to store the time and date with the primary history, but storing the other information is not as readily do-able. Here, I present an approach based on this excellent post on StackOverflow.

The main differences between this approach and the original are:

  • I remove the option to log the extra information to the primary history file: I prefer to keep this history clean.
  • I add history number, host name, time/date stamp etc. to the supplementary history log by default.
  • I add field separators, making it easy to apply ‘awk‘ commands.

The (Supplementary) History Logger Function

First, add or source the following to your “~/.bashrc“:
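
A minimal sketch of such a logger (assuming a log file at “$HOME/.bash_log” and the “[timestamp] ~~~ directory ~~~ command” layout shown in the examples below; the full-featured version, with all the options mentioned below, is in the StackOverflow post referenced above):

## _loghistory ###############################################################
# Append the last command, along with a timestamp and the current working
# directory, to a supplementary log, using ' ~~~ ' as the field separator.
_loghistory() {
    local logfile="$HOME/.bash_log"
    local cmd
    # Blank out HISTTIMEFORMAT for this call so that 'history 1' yields
    # just "<number>  <command>", then strip the leading history number.
    cmd=$(HISTTIMEFORMAT= history 1 | sed -e 's/^ *[0-9][0-9]* *//')
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] ~~~ ${PWD} ~~~ ${cmd}" >> "$logfile"
}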


Activating the Logger

Then set this function to execute after every command by adding it to your “$PROMPT_COMMAND” variable; that is, include the following entry in your “~/.bashrc”:

    export PROMPT_COMMAND='_loghistory'

There are a number of options that the logging function takes, including adding terminal information, adding arbitrary text, or executing a function or functions that generate appropriate text. See the function documentation for more info.

Add Some Useful Aliases

Add the following to your “~/.bashrc“:

# dump regular history log
alias h='history'
# dump enhanced history log
alias hh="cat $HOME/.bash_log"
# dump history of directories visited
alias histdirs="cat $HOME/.bash_log | awk -F ' ~~~ ' '{print \$2}' | uniq"

Check Out the Results!

The ‘histdirs’ command is very useful for quickly listing, selecting (via copy and paste), and jumping back to a directory.

$ h
14095  [2011-11-23 15:36:20] ~~~ jd nuim
14096  [2011-11-23 15:36:21] ~~~ ll
14097  [2011-11-23 15:36:23] ~~~ git status
14098  [2011-11-23 15:36:33] ~~~ jd pytb
14099  [2011-11-23 15:36:36] ~~~ git status
14100  [2011-11-23 15:36:53] ~~~ git rm --cached config/*
14101  [2011-11-23 15:37:00] ~~~ git pull
14102  [2011-11-23 15:37:11] ~~~ e .gitignore
14103  [2011-11-23 15:37:28] ~~~ git status
14104  [2011-11-23 15:37:35] ~~~ e .gitignore
14105  [2011-11-23 15:37:44] ~~~ git status
14106  [2011-11-23 15:38:10] ~~~ git commit -a -m "stuff"
14107  [2011-11-23 15:38:12] ~~~ git pushall
14108  [2011-11-23 15:50:38] ~~~ ll build_c/
14109  [2011-11-23 15:53:16] ~~~ cd
14110  [2011-11-23 15:53:18] ~~~ ls -l
14111  [2011-11-23 16:00:12] ~~~ cd Documents/Projects/Phyloinformatics/DendroPy/dendropy
14112  [2011-11-23 16:00:15] ~~~ ls -l
14113  [2011-11-23 16:00:22] ~~~ cd dendropy/
14114  [2011-11-23 16:00:24] ~~~ vim *.py

$ hh
[2011-11-23 15:36:20] ~~~ /Users/jeet ~~~ jd nuim
[2011-11-23 15:36:21] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ ll
[2011-11-23 15:36:23] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ git status
[2011-11-23 15:36:33] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ jd pytb
[2011-11-23 15:36:36] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:36:53] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git rm --cached config/*
[2011-11-23 15:37:00] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git pull
[2011-11-23 15:37:11] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ e .gitignore
[2011-11-23 15:37:28] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:37:35] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ e .gitignore
[2011-11-23 15:37:44] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:38:10] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git commit -a -m "stuff"
[2011-11-23 15:38:12] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git pushall
[2011-11-23 15:50:38] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ ll build_c/
[2011-11-23 15:53:16] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ cd
[2011-11-23 15:53:18] ~~~ /Users/jeet ~~~ ls -l
[2011-11-23 16:00:12] ~~~ /Users/jeet ~~~ cd Documents/Projects/Phyloinformatics/DendroPy/dendropy
[2011-11-23 16:00:15] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy ~~~ ls -l
[2011-11-23 16:00:22] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy ~~~ cd dendropy/
[2011-11-23 16:00:24] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy ~~~ vim *.py

$ histdirs
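/Users/jeet
/Users/jeet/Documents/Projects/Phyloinformatics/nuim
/Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon
/Users/jeet
/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy
/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy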

Stripping Paths from Files in TAR Archives

There is no way to get tar to ignore directory paths of files that it is archiving. So, for example, if you have a large number of files scattered about in subdirectories, there is no way to tell tar to archive all the files while ignoring their subdirectories, such that when unpacking the archive you extract all the files to the same location. You can, however, tell tar to strip a fixed number of elements from the full (relative) path to the file when extracting using the “--strip-components” option. For example:

tar --strip-components=2 -xvf archive.tar.gz

This will strip the first two elements of the paths of all the archived files, so that a member stored as, say, “project/data/results.txt” gets extracted as “results.txt”. To get an idea of what this will look like before extracting, you can use the “-t” (list) option in conjunction with the “--show-transformed” option:

tar --strip-components=2 -t --show-transformed -f archive.tar.gz

The “--strip-components” approach only works if all the files that you are extracting are at the same relative depth. Files that are “shallower” will not be extracted, while files that are deeper will still be extracted to sub-directories. The only clean solution to this that I can think of is to extract all the files to a temporary location and then move them to a single directory:

# keep the destination outside the search path, so that "find" does not
# revisit files that have already been moved
mkdir /tmp/work /tmp/collected
cd /tmp/work
tar -xvzf /path/to/archive.tar.gz
find . -type f -exec mv {} /tmp/collected/ \;

Piping Output Over a Secure Shell (SSH) Connection

We all know about using scp to transfer files over a secure shell connection.
It works fine, but there are many cases where alternate modalities of usage are required: for example, when you want the output of one program to be stored directly on a remote machine.
Here are some ways of going about doing this.

Let "$PROG" be a program that writes data to the standard output stream.

  • Transferring without compression:
    $PROG | ssh destination.ip.address 'cat > ~/file.txt'
  • Using gzip for compression:
    $PROG | gzip -f | ssh destination.ip.address 'gunzip > ~/file.txt'
  • Better compression can usually be achieved with bzip2:
    $PROG | bzip2  | ssh destination.ip.address 'bunzip2 > ~/Scratch/file.txt'

I find this useful enough to source the following function into all my shells:

## xof #######################################################################
# Pipe from standard input to remote file.
xof() {
    if [[ -z $1 || -z $2 ]]; then
        echo 'usage: <command> | xof <remote-host> <remote-filepath>'
        echo 'Pipe standard input to remote file.'
        return 1
    fi
    bzip2 | ssh $1 'bunzip2 > '$2
}

And when I want to get fancy, I pipe the output directly to my favorite text editor, BBEdit:

## xobb #######################################################################
# Pipe from standard input to bbedit.
xobb() {
    if [[ -z $1 ]]; then
        echo 'usage: <command> | xobb <remote-host>'
        echo 'Pipe standard input to BBEdit.'
        return 1
    fi
    bzip2 | ssh $1 'bunzip2 | bbedit'
}

On a tangential note, if you have a large number of files that you want to transfer, the following is more efficient than separately tar-ing and scp-ing:

tar cf - *.t | ssh destination.ip.address "tar xf - -C /home/jeet/projects/bbi2"

Neat Bash Trick: Open Last Command for Editing in the Default Editor and then Execute on Saving/Exiting

This is pretty slick: enter “fc” in the shell and your last command opens up for editing in your default editor (as given by “$EDITOR”). Works perfectly with vi. The “$EDITOR” variable approach does not seem to work with BBEdit, though, in which case you have to:

$ fc -e '/usr/bin/bbedit --wait'

With vi, “:cq” aborts execution of the command.

`gcd` – A Git-aware `cd` Relative to the Repository Root with Auto-Completion

The following will enable you to have a Git-aware “cd” command with directory path expansion/auto-completion relative to the repository root. You will have to source it into your “~/.bashrc” file, after which invoking “gcd” from the shell will allow you to specify directory paths relative to the root of your Git repository, no matter where you are within the working tree.

gcd() {
    if [[ $(which git 2> /dev/null) ]]; then
        STATUS=$(git status 2>/dev/null)
        if [[ -z $STATUS ]]; then
            # not inside a Git working tree: do nothing
            return
        fi
        TARGET="./$(git rev-parse --show-cdup)$1"
        cd "$TARGET"
    fi
}

# Completion function: offer directory names relative to the repository root.
_gcd() {
    local cur=${COMP_WORDS[COMP_CWORD]}
    if [[ $(which git 2> /dev/null) ]]; then
        STATUS=$(git status 2>/dev/null)
        if [[ -z $STATUS ]]; then
            return
        fi
        TARGET="./$(git rev-parse --show-cdup)"
        if [[ -d $TARGET ]]; then
            dirnames=$(cd "$TARGET"; compgen -d -- "$2")
            opts=$(for i in $dirnames; do if [[ $i != ".git" ]]; then echo $i/; fi; done)
            COMPREPLY=( $(compgen -W "${opts}" -- ${cur}) )
            return 0
        fi
    fi
}
complete -o nospace -F _gcd gcd
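
For example, from anywhere deep inside a (hypothetical) repository with a top-level “src” directory:

$ gcd src/     # cd to <repo-root>/src/
$ gcd          # cd to the repository root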

Filter for Unique Lines Adjacent or Otherwise While Preserving Original Order

There are two BASH utilities that help you filter input for unique lines: ‘uniq’ and ‘sort’.
One gotcha with ‘uniq’ is that it only filters out duplicate adjacent lines. So if your input looks like, say:

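    apple
    apple
    orange
    orange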

Then running ‘uniq‘ on it will yield:

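    apple
    orange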

But if the input has non-adjacent duplicate lines, e.g.:

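    apple
    orange
    apple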

Then ‘uniq’ leaves the results unchanged, with the non-adjacent duplicates intact:

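    apple
    orange
    apple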


The traditional approach is to sort the input beforehand, i.e. ‘sort | uniq’, which results in:

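    apple
    orange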

Of course, if this is the strategy, then there is no reason not to just call ‘sort -u’ instead, which does the sorting and filtering in a single step, and produces the same results as above. Furthermore, ‘sort’ is a considerably more flexible utility, with flags that allow you to specify fields and field delimiters.

All well and good … if you do not care about the original order of the lines, or you actually want the lines sorted.

But what happens if you want to filter input for unique lines, but retain the original order (i.e., not sort the lines)? Surprisingly, to me at least, neither ‘uniq’ nor ‘sort’ offers any way of doing this, as far as I can tell.

I was about to hack out a Python script to achieve this, when a little bit of googling brought me to this Awk gem:

awk '{ if (!h[$0]) { print $0; h[$0]=1 } }'

Simply piping input to the above awk code will result in all duplicate lines — including non-adjacent duplicate lines — being filtered out, while preserving the original order of the lines.

Effective, efficient, and elegant, while at the same time, perhaps a little arcane. In other words: the classical UNIX way.

Incidentally, while one’s first instinct might be to map the above to an alias, that turned out to be impossible, as the invocation has single quotes, and you cannot escape these inside a single-quoted string (the alias definition has to be quoted with single-quotes because we do not want to expand the ‘$0’ argument during the definition). Instead, we have to define it as a function:

uniqx() {
    awk '{ if (!h[$0]) { print $0; h[$0]=1 } }'
}
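
For example, using the (hypothetical) input from above:

    $ printf "apple\norange\napple\n" | uniqx
    apple
    orange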

Filesystem Management with the Full Power of Vim

Just discovered vidir, a way to manipulate filenames inside your favorite text editor (better known as Vim).

Previously, I would use complex and cumbersome BASH constructs using “for;do;done”, “sed”, “awk” etc., coupled with the operation itself:

 $ for f in *.txt;  do mv $f $(echo $f | sed -e 's/foo\([0-9][0-9]*\)_\(.*\)\.txt/bar_\1_blah_\2.txt/'); done

Which almost always involved a “pre-flight” dummy run to make sure the reg-ex’s were correct:

 $ for f in *.txt;  do echo mv $f $(echo $f | sed -e 's/foo\([0-9][0-9]*\)_\(.*\)\.txt/bar_\1_blah_\2.txt/'); done

This approach works well enough but, apart from being a little cumbersome (especially if there is any fiddling to get the reg-ex’s right), it is a little error-prone.

Now, the whole directory gets loaded up into the editor’s buffer (the “editor” being whatever ‘$EDITOR‘ is set to), for you to edit as you wish. When done, saving and exiting results in all the changes being applied to the directory. This allows you to bring the full power of regular expressions, line- and column-based selections etc., to bulk renaming and other operations. Best of all, you get to see the final set of proposed changes in your text editor before committing to the actual operation.
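
For example (the directory name and find pattern here are hypothetical):

 $ vidir                             # edit the current directory's listing
 $ vidir docs/                       # edit the listing of "docs/"
 $ find . -iname "*.txt" | vidir -   # edit an arbitrary list of files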

All of this can be rephrased as: ‘fantastic!’

To get a dose of this filesystem management awesomeness in your world:

  1. Either clone the vidir repo or download a snapshot release from here.
  2. From inside the vidir directory:

     $ perl Makefile.PL
     $ make
     $ sudo make install

Dealing with ‘Argument list too long’ Problems

The solution to the “Argument list too long” error when trying to archive a large number of files is to use the “-T” option of the “tar” command to pass in a list of files generated by a “find” command:

  1. Create a list of the files to be archived using the "find" command:
    $ find . -name "*.tre" > filelist.txt
  2. Use the “-T” option of the “tar” command to pass in this list of filenames:
    $ tar cvjf archive.tbz -T filelist.txt

If you want to delete a long list of files, however, this approach will not work, as “rm” does not support the very convenient “-T“/”--files-from” flag or the equivalent (so convenient, in fact, that I have started adding this to virtually every file-processing script or program that I write).

Luckily, however, “find” does support a “-delete” flag, so to recursively delete all files and directories:

find path/to/dir -delete

You can use a “-type f” argument to limit the operation only to files, and the “-depth 1” argument (GNU find spells this “-maxdepth 1”) to limit the operation only to the current directory, so that:

find path/to/dir -type f -depth 1 -delete

will delete files in the specified directory, without touching subdirectories or the files within them.

Note that using “find” in conjunction with the “-delete” flag is probably faster than any other approach, including using the “-exec rm {} \;” argument to “find” or looping over the files in a shell script. However, if you want to get rid of an entire directory and all sub-directories, then simply issuing an “rm -r” is, of course, a better performer.

Add the Following Lines to Your `~/.bashrc` and You Will Be Very Happy

I added the following to my `~/.bashrc` and I am loving it!

## Up Arrow: search and complete from previous history
bind '"\eOA": history-search-backward'
## alternate, if the above does not work for you:
#bind '"\e[A":history-search-backward'

## Down Arrow: search and complete from next history
bind '"\eOB": history-search-forward'
## alternate, if the above does not work for you:
#bind '"\e[B":history-search-forward'


The first command rebinds the up arrow from “previous-history”, which unconditionally selects the immediately preceding command from your command history, to “history-search-backward”, which selects the most recent command in your history that begins with the characters you have already typed in.

The second command does the same, but with the down arrow key, in the opposite direction.

If you have nothing typed at the prompt, then the up arrow and down arrow keys will work just as before, moving through your history one step back or forward respectively.

However, if you type “fi” and then press the up arrow key, the previous command entered that begins with “fi” (e.g., “find …”) will be filled out at the prompt. Press the up arrow key again to select the command beginning with “fi” preceding that, and so on. The down arrow key iterates in the opposite direction over commands beginning with “fi”.