Estimate Time for Job Completion (With Progress Updates) When Tar’ing Huge Directories

For the sake of future me, I am recording here the coolest shell trick I have learned this year:


tar cf - /folder-with-big-files -P | pv -s $(du -sb /folder-with-big-files | awk '{print $1}') | gzip > big-files.tar.gz

Or, if your du does not support the -b (bytes) flag (e.g., on BSD/OS X), use -k and scale the kilobyte count up to bytes:

tar cf - /folder-with-big-files -P | pv -s $(($(du -sk /folder-with-big-files | awk '{print $1}') * 1024)) | gzip > big-files.tar.gz

with output looking like:

4.69GB 0:04:50 [16.3MB/s] [==========================>        ] 78% ETA 0:01:21

Requires ‘pv’:

Reproduced from this Superuser answer here:

The Traveler’s Restaurant Process — A Better Description of the Dirichlet Process for Partitioning Sets


I. “Have Any of These People Ever Been to a Chinese Restaurant?”

The Dirichlet process is a stochastic process that can be used to partition a set of elements into a set of subsets. In biological modeling, it is commonly used to assign elements into groups, such as molecular sequence sites into distinct rate categories. Very often, an intuitive explanation as to how it works invokes the “Chinese Restaurant Process” analogy. I have always found this analogy very jarring and confusing, as, while it is a good description of how the Dirichlet process works, it is a terrible description of how Chinese restaurants, or, indeed, any other type of restaurant run by (and patronized by) humans[1], work. As Dr. Emily Jane McTavish says, “Have any of these people ever been to a Chinese restaurant?” Indeed. In fact, one may wonder if any of them have been to any restaurant outside of a Kafkaesque performance-art-themed one.

II. The “Traveler’s Restaurant Process” Analogy

I believe a far more intuitive analogy might be given by the “Traveler’s Restaurant [Selection] Process”. A common dictum in traveler lore, guides, advice, and general wisdom is, when picking a restaurant for a meal, to prefer establishments that appear to be more popular or frequented by locals. And so we can imagine instead describing how a group of N travelers distribute themselves among the various restaurants at a food court. As with the original analogy, we have the travelers entering the establishment one by one, with the food court initially empty. (Yes, this is still a weakness in the analogy that lends a surreal contrivance to the whole tale, but maybe this can be alleviated somewhat by imagining that (a) it is 1 am; (b) the travelers have just arrived and are suffering from jet lag; and (c) they wander down to the food court after checking in at their respective hotels/motels/pensions, and variances in processing times lead them to come in straggling individually. As for the food court being open at 1 am — in a number of parts of the world, this is pretty commonplace, I assure you!) The first traveler finds an empty food court, and picks a restaurant at random. The next traveler enters the food court and, looking around, sees just one restaurant occupied. It is possible that she makes a beeline for that restaurant, but maybe she is feeling cranky or does not particularly like the other traveler (or other travelers in general), and heads off to a different restaurant.
And so on with the next traveler, and the next, and the next, until the last one, each of whom makes an individual decision whether to try a new, empty restaurant (if they are feeling particularly adventurous or misanthropic or introverted, or have squabbled with or are otherwise disdainful of everyone else) or else, at the other extreme, head for the most crowded restaurant (if they are sticking to the tried-and-true traveler’s restaurant selection dictum or are feeling particularly sociable), or everything in between. At the end of it all, we take the clusters of travelers as they have distributed themselves into individual restaurants as the partition of the original set of travelers.

Makes (better) sense?

Maybe not!

III. Working Through the Process

Emily also says that just walking through the model expressions and formulas often yields much better intuition than contrived and twisted analogies and metaphors. And maybe she is right!

So, let us take a set of $N$ elements: $t_1, t_2, \ldots, t_N$. We can imagine each element to be a traveler from the “Traveler’s Restaurant Process” analogy, or vice versa, or anything else you prefer. Each element is assigned to a group in turn, creating new groups as needed. With no existing groups, the first element is automatically assigned to a new group. Well, technically, the first element is assigned to a new group with probability given by:

\frac{\alpha}{\alpha + n - 1}, \quad (1)

where $\alpha$ is known as the concentration parameter, and $n$ is the 1-based index of the element (i.e., the first element has $n=1$, the second has $n=2$, the third has $n=3$, and so on). The first element has $n=1$, and the above expression evaluates to 1.0, so it is always assigned a new group. The more general case, applicable to all subsequent elements up to and including the last one, is that the $n^{\text{th}}$ element is:

  • Either assigned to a new group, $g_*$, with probability given by expression (1) above,
  • Or assigned to one of the existing groups, $g_i$, with probability:
    \frac{|g_i|}{\alpha + n - 1}, \quad (2)

where $\alpha$ is the concentration parameter and $n$ is the 1-based index of the element, as previously described, and $|g_i|$ is the number of elements in group $g_i$.
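The two-case rule above can be sketched directly in code. The following is a minimal illustration (the function name and data layout are my own, not from the original post); groups are represented as lists of element indices:

```python
import random

def assign_element(groups, alpha, n):
    """Assign the n-th element (1-based): join an existing group g_i with
    probability |g_i|/(alpha + n - 1), or start a new group with
    probability alpha/(alpha + n - 1)."""
    r = random.uniform(0, alpha + n - 1)
    for g in groups:
        r -= len(g)
        if r < 0:
            g.append(n)
            return
    # the leftover probability mass, alpha/(alpha + n - 1), starts a new group
    groups.append([n])

# assign 10 elements in turn, starting with no groups
groups = []
for n in range(1, 11):
    assign_element(groups, alpha=1.0, n=n)
```

Note that the first element (n = 1) always starts a new group, since the existing-group probability mass, n - 1 = 0, is zero.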

IV. How/Why Does it Work?

How do we know that each element will definitely be assigned to a group? We can reassure ourselves of this by checking that the probabilities of being assigned to either a new group or an existing group sum to one. Let us say that we are dealing with the $n^{\text{th}}$ element out of $N$ elements, where $1 \leq n \leq N$. As described above, the probability that this element will be assigned to a new group is given by expression (1). On the other hand, the probability of being assigned to any one of the existing groups is given by:

\sum_{i=1}^{M}\frac{|g_i|}{\alpha + n - 1}, \quad (3)

where $M$ is the number of existing groups, $g_1, g_2, …, g_M$. Now, while we do not know in this general case exactly how the previous $n-1$ elements are distributed amongst the various $M$ existing groups, we do know that, regardless, there must be a total of $n-1$ elements across all groups. So, then, the above expression reduces to:

\frac{\sum_{i=1}^{M} |g_i|}{\alpha + n - 1} = \frac{n-1}{\alpha + n - 1}. \quad (4)

So the probability of either being assigned to a new group or being assigned to an existing group is given by the sum of expressions (1) and (4), which is:

\begin{aligned}
\frac{\alpha}{\alpha + n - 1} + \frac{n-1}{\alpha + n - 1} &= \frac{\alpha + n - 1}{\alpha + n - 1} \\
&= 1.
\end{aligned}

And thus we are assured that every element will be assigned to a group, whether a new one or an existing one, and we can sleep at night knowing that the distribution of partitions under the Dirichlet process is a proper probability distribution, as it sums to 1.0.
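This sum-to-one bookkeeping is easy to verify numerically. A small sketch (the group sizes and the value of α below are arbitrary choices for illustration):

```python
def assignment_probabilities(group_sizes, alpha):
    """Return [P(new group), P(g_1), ..., P(g_M)] for the incoming element."""
    n = sum(group_sizes) + 1  # 1-based index of the incoming element
    denom = alpha + n - 1
    return [alpha / denom] + [size / denom for size in group_sizes]

# four existing groups holding the first 9 elements; the 10th element arrives
probs = assignment_probabilities([4, 3, 1, 1], alpha=1.5)
print(sum(probs))  # sums to 1 (up to floating-point error)
```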

V. The (Anti-?)Concentration Parameter

We have mentioned and used the concentration parameter, $\alpha$, above, without actually explaining it. Basically, the concentration parameter, as its name implies, determines how “clumpy” the process is. Unfortunately, contrary to what its name implies, at low values the process is more “clumpy” — i.e., yielding partitions where elements tend to be grouped together more frequently, resulting in larger and correspondingly fewer subsets — while at high values the process is less “clumpy” — yielding partitions where elements tend to be more dispersed, resulting in smaller and correspondingly more subsets. Yeah, it really should have been called the “anti-concentration” or “dispersion” parameter. Or perhaps folks should just stick to using the more neutral and less evocative, but non-misleading, standard term: the “scaling” parameter.

(See updates here and here that clear up this issue! Basically, the “concentration” parameter gets its name not due to the way it concentrates elements, as naive and incorrect intuition led me to think, but rather due to how increasing it leads to the distribution of values across the subsets converging on the base distribution! What values? What base distribution? EXACTLY! Those concepts do not really enter into any of the analogies we have discussed so far, or, indeed, even into the way we have explained the process with reference to the equations and model. Only when considering a version of the DP process where our elements are not just anonymous exchangeable atoms, but value-bearing elements that we want to cluster based on values, does the base distribution enter the picture, and only then does the “concentration” parameter do its concentrating the higher it gets!)

VI. In Action

This Gist wraps up all this logic in a script that you can play with to get a feel for how different concentration parameters work.

With low values of the concentration parameter, we pretty much get all the elements packed into a single set:

# python -n100 -v0 -a 0.01
Mean number of subsets per partition: 1.04
  Mean number of elements per subset: 9.8

On the other hand, with high values of the scaling parameter, the trend is toward each element being in its own set, with nearly as many subsets in the partition as there are elements in the full set:

# python -n100 -v0 -a 100
Mean number of subsets per partition: 9.65
  Mean number of elements per subset: 1.03944444444

And, of course, moderate values result in something in between:

# python -n100 -v0 -a 1
Mean number of subsets per partition: 3.03
  Mean number of elements per subset: 3.92166666667

# python -n100 -v0 -a 5
Mean number of subsets per partition: 5.76
  Mean number of elements per subset: 1.88055555556

# python -n100 -v0 -a 10
Mean number of subsets per partition: 7.01
  Mean number of elements per subset: 1.49297619048

VII. The Script

The script to run this is shown below. If you are interested in downloading and using it, please visit:
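In case the embedded script is unavailable, the core logic can be sketched from the description above (the function names, flag handling, and reported statistics here are my own reconstruction, not the original Gist):

```python
import random

def sample_partition(n_elements, alpha):
    """Partition n_elements elements via the Dirichlet (restaurant) process."""
    groups = []
    for n in range(1, n_elements + 1):
        r = random.uniform(0, alpha + n - 1)
        for g in groups:
            r -= len(g)  # P(join g) = |g| / (alpha + n - 1)
            if r < 0:
                g.append(n)
                break
        else:
            groups.append([n])  # P(new group) = alpha / (alpha + n - 1)
    return groups

def report(n_partitions, n_elements, alpha):
    """Average subset counts/sizes over repeated sampled partitions."""
    partitions = [sample_partition(n_elements, alpha)
                  for _ in range(n_partitions)]
    total_subsets = sum(len(p) for p in partitions)
    print("Mean number of subsets per partition:",
          total_subsets / n_partitions)
    print("  Mean number of elements per subset:",
          sum(len(g) for p in partitions for g in p) / total_subsets)

if __name__ == "__main__":
    report(n_partitions=100, n_elements=10, alpha=1.0)
```

Running this with a very low α (e.g., 0.01) should reproduce the near-single-subset behavior shown above, and a very high α the near-singleton behavior.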


  1. Interestingly enough, I can imagine a restaurant that was patronized by cats working pretty much exactly like the traditional analogy, given the way some cats tend to be clumpers and others loners. So, if you don’t like the “Traveler’s Restaurant Process” analogy, please feel free to use the “Cat Restaurant Process” analogy, or, better yet, the “Cat Room/Furniture Occupation Process” analogy.
  2. Update 2017-07-25:
    So it seems that the concentration parameter gets its name not from the fact that it (inversely) controls the concentration or clustering of elements, but because of its relationship to the base distribution of the Dirichlet process. The base distribution is something that we have not talked about, and I will cover it in a future post once I understand it well enough to talk about it in a useful and interesting way! But for the time being, it is sufficient to say that it is the distribution over the elements themselves before they are assigned into groups. The concentration parameter is so called because, as it increases in value, it increasingly “concentrates” the values of elements on the base distribution (while at the same time increasingly dispersing the elements into distinct groups). So the reason for the counter-intuitive name for the parameter is wrong intuition — it is not referring to how “concentrated” the elements are in terms of the subsets of the partition, but how closely the distribution of elements resembles the base distribution. More details on this can be found here, where the concentration parameter is described as a prior belief in the base distribution.
  3. Update 2017-07-26:
    So, here is a GREAT explanation of what is going on. Summarizing: basically, if we understand the Dirichlet process using the Chinese Restaurant Process or the Traveler’s Restaurant Process or the Hair-Clog Process etc., we are ignoring the base distribution, as we do not care about the values of the elements that are clustered, or the distribution of those values within each subset. Only with reference to, e.g., the Polya Urn Model do things make sense: there, each element has a value associated with it and is clustered into a group based on how this value relates to the distribution of values associated with each group, while the parameters of the distribution of each “urn” are drawn from a base distribution. As the concentration parameter increases, the elements spread out across more and more subsets/urns/tables, and as the parameters of each subset/urn/table’s value distribution are sampled from the base distribution, in effect the base distribution itself gets more and more (and better and better) sampled: each subset/urn/table represents an independent sample from the base distribution. Thus the distribution across all subsets/urns/tables converges on the base distribution. Conversely, as the concentration parameter decreases, the elements cluster into fewer groups, and with fewer groups we get more limited sampling of the base distribution.

“Joy Plots” — Great Plot Style for Visualizing Distributions on Discrete/Categorical or Multiple Continuous Variables

R doing what R does really, really, really, really, really, really, *R*eally well: visualization. Folks, this might be THE plot to use to visualize distributions of discrete/categorical variables or simultaneous distributions of multiple continuous variables, replacing, or at least taking up a seat alongside, violin plots as the current best approach IMHO.

(EDIT: This plot style is named after the band Joy Division, due to a similar graphic on one of their album covers. Not being at all familiar with Joy Division or their work, the name that comes to mind when I see this plot is “Misty Mountain Plot”, after the maps in “The Hobbit”.)

Solving the “Could not find all biber source files” Error

Biblatex is a fantastic bibliography/citation manager for LaTeX. It trumps the older bibtex with its much easier customizability and configuration. It does, however, have one bug that can be very perplexing to figure out due to the misleading error message that results: “Could not find all biber source files”. At first glance, the message seemed straightforward enough to send me poking about the project file structure and build system, checking paths and names. When all that seemed intact, I started trying to build the document from different locations. Then I checked out older version-controlled revisions of the project that I was sure I had built successfully, and when these, too, failed, I started to look at my TeX installation. And so on and so on, and before I knew it … poof! there went most of my morning.

This was all a wild goose chase, though, and luckily I came across this discussion before I got too far. (Well, before going much further, at any rate.) Turns out that this is a known bug with the Biblatex engine, “biber”. The fix is to clear the “biber” cache. You can locate the “biber” cache by running:

$ biber --cache

and then you can “rm -rf” it with extreme glee or just do it all in one step with:

$ rm -rf $(biber --cache)

Vim: Insert Mode is Like the Passing Lane

Insert mode is not the mode for editing text.

It is a mode for editing text, because both normal and insert modes are modes for editing text.

Insert mode, however, is the mode for inserting new/raw text directly from the keyboard (as opposed to, e.g., from a register or a file).

Thus, you will only be in insert mode when you are actually typing in (raw) text directly. For almost every other editing operation, normal mode is where you will be. Once you grok this, you will realize that the bulk of most editing sessions is not insert mode; you actually spend most of your time in normal mode, just dipping into insert mode to add text and then dipping out again.

Insert mode is thus like the passing lane on the highway. Just like you should only be in the passing lane when you are actually passing other vehicles, you should only be in insert mode when you are inserting text.

Some snapshots from my own learning experiences here and here.

From Acolyte to Adept: The Next Step After NOP-ing Arrow Keys in Vim

René Descartes's illustration of dualism. Inputs are passed on by the sensory organs to the epiphysis in the brain and from there to the immaterial spirit. Public domain image. Sourced from: Wikimedia Commons

We all know about no-op’ing arrow keys in Vim to get us to break the habit of relying on them for inefficient movement.
But, as this post points out, it is not the location of the arrow keys that makes them inefficient, but the modality of the movement: single steps in insert mode is a horrible way to move around when normal mode provides so much better functionality.

But here is the thing: while normal mode provides for much better and more efficient ways to move around than insert mode, it also provides for ways to move that are just as inefficient as arrow keys. In fact, there is nothing that makes, e.g., “j” significantly better than “<DOWN>”, and so if we replace <DOWN><DOWN><DOWN><DOWN> with jjjj, we are just slapping some fresh paint on a rusty bike and calling it “faster”. We have not even replaced one bad habit with another; we are indulging in the same bad habit, albeit with a different “it”. A bad habit that is not only inefficient, but, perhaps a much worse sin in the Vim world, inelegant.

This customization will help break you of that habit by forcing you to enter a count for each of the basic moves (“h“, “j“, “k“, “l“, “gj“, “gk“).
This itself will make you more efficient for any move of three repeats or more: “3j” is more efficient than the uncounted “jjj” equivalent.
But, in addition, it will also have the side-effect of making you come up with more efficient moves yourself: as your eyes focus on the point where you want to go, instead of counting off the lines (or reading off the line count if you have “:set relativenumber“), you might find it more natural to, e.g. “/<pattern>” .

In fact, you might find that in many cases, you do not even need to actually move as such. For example, instead of moving to a line 8 lines down and deleting it, “8jdd“, you could just “:+8d“. Or instead of moving to a line four lines up, yanking it, moving back to where you were and pasting it, “4kyy4jp“, you can just “:-4y” and “p“. Once you get good enough at it, it will seem like magic the way you can manipulate lines far from your current position without moving! And what you will find is that, beyond the increased efficiency in number of keystrokes, there is an increase in mental efficiency: the microseconds of visually re-orienting yourself after each move is no longer a cost that you have to pay over and over and over again.

Naturally, you are going to find things less efficient and less elegant at first.
But that is just the pain of stressing out mental muscles that have not been exercised enough, like that first leg day after the holidays (or maybe even the first leg day ever, after signing up at the gym 4 years ago).

Eventually, your efficiency will increase.

But more than efficiency, the elegance of your moves will eventually increase as well. Dramatically. As far as editing text goes, at any rate.

So, stick the following into your “~/.vimrc“, and be prepared for some pain and frustration and swearing and clumsiness as you retrain your muscle memory and your mind, before gaining a new level of enlightened “acting-without-doing”.

NOTE: One of the greatest impediments to me naturally working with counted-movements was the fact that counting the number of lines to go in each direction is disruptive: it completely breaks my “flow”, jarringly derailing my train of thought. See below for the solution to this, the implementation of which I consider a mandatory pre-requisite to working this way.


Displaying Relative Numbers vs. Absolute Numbers

I find the need to count line offsets before every move or operation as conducive to my “flow” as having an air horn stuffed down my throat while frozen mayonnaise is blasted into my ears. This was why I resisted count-based ergonomics in Vim for so long.
Vim has a feature, “`:set relativenumber`”, that shows relative numbers, and this makes things tremendously better, in that you can simply read off the line offset to your navigation target … except that you must choose to show either relative numbers or absolute numbers. The fact is, the only time relative numbers are useful is for motions/operations in the current window or split; when you have other splits open, relative numbers are worse than useless, as you need absolute numbers to make sense of what part of the buffer you are seeing in the non-focal splits. Showing both absolute and relative numbers at the same time would be ideal, but Vim does not support that natively (there is a plugin to help with that, but it exploits the “sign” feature, which can be a problem if you use signs to display something else, like marks, as I do). So the dilemma is that in most cases you want absolute numbers, but count-based motions/operations in the current window are annoying and mentally disruptive if you do not have relative numbers showing, since you must break your train of thought to count the lines to the target.

Luckily, a Vim plugin provides the answer: vim-numbers. This plugin automatically sets relative numbers on for the split/window in focus, and restores the previous numbering (absolute in my case) when focus moves to another split or window. It was this that made my move to strict count-required based motion possible.

EDIT: It was pointed out by /u/VanLaser that the following in your “~/.vimrc” is sufficient to achieve the relative-numbering-in-focal-window-and-absolute-everywhere-else dynamics without the need for a plugin:

set number
if has('autocmd')
augroup vimrc_linenumbering
    autocmd WinLeave *
                \ if &number |
                \   set norelativenumber |
                \ endif
    autocmd BufWinEnter *
                \ if &number |
                \   set relativenumber |
                \ endif
    autocmd VimEnter *
                \ if &number |
                \   set relativenumber |
                \ endif
augroup END
endif

Setting up a Python Scientific Environment (NumPy, SciPy, pandas, StatsModels, etc.) in OS X 10.9 Mavericks

It is better than a nightmare from which you cannot wake up …

  1. Install Homebrew:

    $ ruby -e "$(curl -fsSL"
  2. Find problems and fix them (typically resulting from Homebrew becoming very stroppy if it does not have exclusive access to “/usr/local“):

    $ brew doctor

    VERY IMPORTANT NOTE: Make ABSOLUTELY sure that the DYLD_LIBRARY_PATH environment variable is NOT set. Having this set will cause all sorts of problems as conflicts arise between libraries that homebrew installs and ones that some of your other packages need (e.g., "libpng" and MacVim, respectively). Worse, some builds get confused and fail (e.g., “matplotlib“). What happens if you need DYLD_LIBRARY_PATH for other applications? Please direct your questions to the homebrew folks.

  3. Install Homebrew’s Pythons:

    $ brew install python
    $ brew install python3
  4. Either create and source a virtual environment using Homebrew’s Python:

    $ virtualenv -p /usr/local/bin/python3 homebrew-stats-computing
    $ . homebrew-stats-computing/bin/activate

    Or make sure Homebrew’s Python is at the head of the “$PATH“:

    $ export PATH=/usr/local/bin:$PATH
  5. Install The NumPy + SciPy + pandas stack:

    $ brew install gfortran
    $ pip3 install numpy
    $ pip3 install scipy
    $ pip3 install pandas
  6. Install StatsModels:

    $ pip3 install cython
    $ pip3 install patsy
    $ cd /tmp
    $ git clone
    $ cd statsmodels
    $ python3 build
    $ python3 install
  7. Install matplotlib:

    $ brew install libpng freetype pkgconfig
    $ cd /tmp
    $ git clone
    $ cd matplotlib
    $ python3 install

Taking it to a 11: Dramatically Speeding Up Keyboard/Typing Responsiveness in OSX

If you use a Mac/OSX, then enter the following commands in your shell and reboot:

    $ defaults write -g KeyRepeat -int 0
    $ defaults write -g InitialKeyRepeat -int 15

If you live in a text editor or the shell, or otherwise spend most of your typing hammering away at the keyboard like I do,
then this makes an absolutely wonderful difference in the responsiveness of any typing activity. It will make your previous typing feel like you were pecking away in slow motion at the bottom of a pit of cold tar!

Yes, I know you can set the keyboard repeat rate in `System Preferences`.

I, too, did that a long time ago.

But this trick takes the speed up to 11.

Dynamic On-Demand LaTeX Compilation

Most of the existing approaches to integrating LaTeX compilation into a LaTeX writing workflow centered around a text editor (as opposed to a fancy-schmancy IDE) are horrendously bloated creatures, aggressively and voraciously hijacking so many key-mappings and so much normal functionality that it makes your Vim feel like it is diseased and is experiencing a pathological personality disorder of some kind. Yes, LaTeX-Suite, I am looking mainly at you.

I did not want a platoon of obnoxiously cheery elves to insert two hundred environments into the document while a marching band parades around the room when I hit the `$` key. I just wanted a way to quickly and easily compile the current document, and optionally open it for viewing. Thus, I wrote a little plugin to do exactly that.

It has worked fine enough all this time. But today, thanks to a comment in a reddit discussion, I discovered something magical: latexmk

Not all that great if you use TexShop or another LaTeX IDE for latexing, but if you use a Plain Old Text Editor, then the following may change your life:

    $ latexmk -pdf -pvc nature-publication-2014.tex

Two neat things going on here.

First, “latexmk”. This is the smart-single-click-do-all latex compiler (and is almost always available with all TeX distributions). Takes care of all those multiple compilation passes with BibTeX and such. So, no matter how complex your document, a single command Gets It Done, from compilation to viewing:

    $ latexmk -pdf nature-publication-2014.tex

Then there is the “-pvc” flag shown above. This is where things get sexy. This invokes “`latexmk`” on the document, which then compiles it as per usual. But then, instead of exiting, it waits there, monitoring the file for any changes. Upon detecting changes (like `make`), it recompiles and refreshes the document in the viewer!

So, if you have a shell with this command running in the background (either in a separate window or through, e.g., tmux or screen), you can merrily edit away in your POTE, and have constant refreshes as you save.

Setting up the Text Editor in My Computing Ecosystem


Image from WikiMedia Commons

Basic Setup of Shell to Support My Text Editor Preferences

By “text editor”, I mean Vim, of course. There are pseudo-operating systems that include rudimentary text-editing capabilities (e.g. Emacs), and integrated development environments that allow for editing of text, but there really is only one text editor that deserves the title of “text editor“: Vim, that magical mind-reading mustang that carries out textual mogrifications with surgical precision and zen-like elegance.

However, while Vim is the only true text editor, there is more than one flavor of Vim. There is the original, universal one, i.e., console Vim. And then there are various lines of graphical Vims specific to various operating systems. On my primary development box, a Mac, I prefer MacVim over straight-up console Vim, for many, many, many reasons, even though I practically live in the shell otherwise. For working with files on remote machines, I prefer using SSH to log in and edit with console Vim, even though I can also edit these files through my own local MacVim instance. My computing ecosystem is shared between all my accounts, and this includes a collection of scripts as well as resources for various applications (e.g., not only my entire ‘~/.vimrc‘, but the entire accompanying ‘~/.vim' directory).

I have the following line in my ‘~/.bashrc‘:

    export EDITOR=' -f'
    export EDITOR=''
alias e=''

This invokes the following script, ‘‘ which is in the ‘$PATH‘ of all my accounts:

#! /bin/bash

if [[ -f /Applications/ ]]; then
    # bypass mvim for speed
    VIMPATH='/Applications/ -g'
elif [[ -f /usr/local/bin/mvim ]]; then
    # fall back to mvim
    VIMPATH='/usr/local/bin/mvim'
else
    # fall back to original vim
    VIMPATH='vim'
fi

exec $VIMPATH "$@"


Easy Opening of Multiple Project Files

I have FuzzySnake installed on my system. My totally unbiased opinion as the author of FuzzySnake is that it is absolutely awesome, and that you should all be using it as well. In addition to the default behavior of opening the selected targets in my ‘$EDITOR‘ (which is, as shown above, ‘‘), I also use the nifty ‘-L/--list-and-exit‘ facility of FuzzySnake to discover files of particular types and pass them to Vim. I could define a set of aliases like ‘alias epy="e $(fz -L --python)"‘, but this approach makes it impossible to pass options to the editor. Instead, I have a set of convenience functions like:

# open all C/C++ and header files
function ecc() {
    e $(fz -L --cpp --exclude-file-patterns="config.h" "$@")
}
# open all C/C++ and header files, as well as autotools files
function ecca() {
    e $(fz -L --cpp --autotools --exclude-file-patterns="config.h" "$@")
}
# open all CMake files
function ecm() {
    e $(fz -L --cmake "$@")
}
# open all C/C++ and CMake files
function eccm() {
    e $(fz -L --cpp --cmake --exclude-file-patterns="config.h" "$@")
}
# open all Python files
function epy() {
    e $(fz -L --py "$@")
}

Of course, for on-the-fly use-cases, I usually just do something like:

$ e $(fz -L --rst)

Allowing for Opening of Files in an Existing MacVim Instance

In MacVim’s Preferences, I make sure that ‘Open files from applications‘ is set to ‘in the current window‘. Then I define the following alias in my ‘~/.bashrc‘:

    # alias ee="open \"mvim://open?url=file://$1\""
    alias ee="open -a MacVim"

With this, while typing

$ e foo bar

will open a new Vim session (either MacVim or console vim, depending on whether I am editing a file on my local machine or remotely through SSH), with files ‘foo’ and ‘bar’ loaded into buffers, typing

$ ee foo bar

will open ‘foo’ and ‘bar’ in an existing MacVim session.

Opening Remote Files in My Local (Mac)Vim Session, with Auto-Completion of Paths/Filenames

Sometimes I do prefer to edit remote files in my desktop MacVim (one might wonder why I do not always prefer this …). Lack of path completion when invoking the command has always been an irritation for me, until I defined the following function:

function edit-remote() {
    if [[ -z $1 ]]; then
        echo "usage: edit-remote [[user@]host1:]file1 ... [[user@]host2:]file2"
        return 1
    fi
    declare -a targs=()
    for iarg in "$@"; do
        targ="scp://$(echo $iarg | sed -e 's@:/@//@' | sed -e 's@:@/@')"
        targs=("${targs[@]}" $targ)
    done
    $EDITOR "${targs[@]}"
}
complete -F _scp -o nospace edit-remote

Smart (`infercase`) Dictionary Completions in Vim While Preserving Your Preferred `ignorecase` Settings

Dictionary completions in Vim can use an ‘infer case’ mode, where, e.g.,
“Probab” will correctly autocomplete to, e.g., “Probability”, even though the
entry in the dictionary might be in a different case. The problem is that this
mode only works if `ignorecase` is on. And sometimes, we want one (`infercase`)
but not the other (`ignorecase`).

The following function, if added to your “`~/.vimrc`”, sets it up so that `ignorecase` is
forced on when dictionary completions are invoked via the `<C-x><C-k>` keys,
and then restored to its original state when exiting.

[gist id=10015995]

A better approach than binding all the exit keys would be an autocommand on leaving the pop-up completion menu, but I could only find a trigger for entering the popupmenu.

Building MacVim Natively on OS X 10.7 and Higher

You might want to do this if you want to install the latest snapshot and no pre-built release is available.

OR you might want MacVim to use a custom Python installation instead of the default one on the system path.

This latter was my motivation.

Once you have downloaded and unpacked the code base that you want to build, step into the `src/` subdirectory:

$ cd src

Before proceeding, make sure that your Python installations have been built with the `--enable-shared` flag! If this is not the case, no matter how you build MacVim, you will not have Python available. Rebuild your Pythons with this flag and reinstall if necessary before proceeding.
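A quick way to check this for a given interpreter (`python3` here stands in for whichever Python you intend MacVim to link against) is to ask its `sysconfig` module:

```shell
# Prints 1 if this Python was built with --enable-shared, 0 otherwise.
python3 -c "import sysconfig; print(sysconfig.get_config_var('Py_ENABLE_SHARED'))"
```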

Then configure the build with:

$ export LDFLAGS="-L/usr/platform/lib"
$ CC=clang ./configure \
    --enable-perlinterp \
    --enable-pythoninterp \
    --enable-python3interp \
    --enable-rubyinterp \
    --enable-cscope \
    --enable-gui=macvim \
    --with-mac-arch=intel

Then build:

$ make

The “`MacVim.app`” build product will be in the “`MacVim/build/Release`” subdirectory, and can be tested by:

$ open MacVim/build/Release/MacVim.app

Or installed by:

$ cp -R MacVim/build/Release/MacVim.app /Applications

Using Python’s “timeit” Module to Benchmark Functions Directly (Instead of Passing in a String to be Executed)

All the basic examples for Python’s timeit module show strings being executed. This leads to, in my opinion, somewhat convoluted code such as:

#! /usr/bin/env python

import timeit

def f():
    pass  # dummy execution body

if __name__ == "__main__":
    timer = timeit.Timer("__main__.f()", "import __main__")
    result = timer.repeat(repeat=100, number=100000)

For some reason, the fact that you can call a function directly is only (again, in my opinion) obscurely documented. But this makes things so much cleaner:

#! /usr/bin/env python

import timeit

def f():
    pass  # dummy execution body

if __name__ == "__main__":
    timer = timeit.Timer(f)
    result = timer.repeat(repeat=100, number=100000)

Much more elegant, right?

One puzzling issue is a Heisenbug-ish effect (i.e., the act of observation affecting the outcome): the second version consistently and repeatedly yields faster performance timings. I can understand differences in overall benchmark script execution time, due to differences in the way overhead resources are allocated/loaded, but I would hope that the actual performance timings would be invariant to this. Maybe with “real” code, instead of the dummy execution body, things will be more consistent? Or is this a real issue?

‘xargs’ – Handling Filenames With Spaces or Other Special Characters

xargs is a great little utility to perform batch operations on a large set of files.
Typically, the results of a find operation are piped to the xargs command:

   find . -iname "*.pdf" | xargs -I{} mv {} ~/collections/pdf/

The -I{} tells xargs to substitute ‘{}’ in the statement to be executed with the entries being piped through.
If these entries have spaces or other special characters, though, things will go awry.
For example, filenames with spaces in them passed to xargs will result in xargs barfing with a “xargs: unterminated quote” error on OS X.

The solution is to use null-terminated strings in both the find and the xargs invocations:

   find . -iname "*.pdf" -print0 | xargs -0 -I{} mv {} ~/collections/pdf/

Note the -print0 argument to find, and the corresponding -0 argument to xargs: the former tells find to produce null-terminated entries while the latter tells xargs to expect and consume null-terminated entries.
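A self-contained way to convince yourself of this, using a scratch directory and a filename containing a space (the names here are just for the demo):

```shell
# Demonstration: a filename containing a space survives the
# null-terminated find/xargs pipeline intact.
tmp=$(mktemp -d)
mkdir "$tmp/pdf"
touch "$tmp/my report.pdf"
find "$tmp" -maxdepth 1 -iname "*.pdf" -print0 | xargs -0 -I{} mv {} "$tmp/pdf/"
ls "$tmp/pdf"    # -> my report.pdf
```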

YonderGit: Simplified Git Remote Repository Management

One of the great strengths of Git is the multiple and flexible ways of handling remote repositories. Just like Subversion, they can be "served" out of a location, but more generally, if you can reach it from your computer through any number of ways (ssh, etc.), you can git it.

YonderGit wraps up a number of a common operations with remote repositories: creating, initializing, adding to (associating with) the local repository, removing, etc.

You can clone your own copy of the YonderGit code repository using:

git clone git://

Or you can download an archive directly here:

After downloading, enter "sudo python" in the YonderGit directory to install. This will just copy the "" script to your system path. After that, enter " commands?" for a summary of possible commands, or " --help" for help on options.

Quick Summary of Commands

$ setup <repo-url> <name>

Create the directory specified by "REPO-URL", using either the "ssh" or local filesystem transport protocol, initialize it as a repository by running "git init", and add it as a remote called "NAME" of the local git repository. Will fail if the directory already exists.

$ create <repo-url>

Create the directory specified by "REPO-URL", using either the "ssh" or local filesystem transport protocol, and then initialize it as a repository by running "git init". Will fail if the directory already exists.

$ init <repo-url>

Initialize the remote directory "REPO-URL" as a repository by running "git init" in the directory. Will fail if the directory does not already exist.

$ add <repo-url> <name>

Add "REPO-URL" as a new remote called "NAME" of the local git repository.
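My assumption (not having quoted the YonderGit source here) is that this step wraps the corresponding plain-git invocation, which on its own looks like:

```shell
# Plain-git equivalent of the "add" step (sketch): register a remote
# named "origin" pointing at an ssh URL, inside a scratch repository.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git remote add origin ssh://user@host.xz/path/to/repo.git
git remote -v
```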

$ delete <repo-url>

Recursively remove the directory "REPO-URL" and all subdirectories and files.

Valid Repository URL Syntax

Secure Shell Transport Protocol

  • ssh://user@host.xz:port/path/to/repo.git/
  • ssh://user@host.xz/path/to/repo.git/
  • ssh://host.xz:port/path/to/repo.git/
  • ssh://host.xz/path/to/repo.git/
  • ssh://user@host.xz/~user/path/to/repo.git/
  • ssh://host.xz/~user/path/to/repo.git/
  • ssh://user@host.xz/~/path/to/repo.git
  • ssh://host.xz/~/path/to/repo.git
  • user@host.xz:/path/to/repo.git/
  • host.xz:/path/to/repo.git/
  • user@host.xz:~user/path/to/repo.git/
  • host.xz:~user/path/to/repo.git/
  • user@host.xz:path/to/repo.git
  • host.xz:path/to/repo.git
  • rsync://host.xz/path/to/repo.git/

Git Transport Protocol

  • git://host.xz/path/to/repo.git/
  • git://host.xz/~user/path/to/repo.git/

HTTP/S Transport Protocol

  • http://host.xz/path/to/repo.git/
  • https://host.xz/path/to/repo.git/

Local (Filesystem) Transport Protocol

  • /path/to/repo.git/
  • path/to/repo.git/
  • ~/path/to/repo.git
  • file:///path/to/repo.git/
  • file://~/path/to/repo.git/

Some Vim Movement Tips

  • Within-line character-based movement:
    • `h` and `l` move you left and right one character, respectively.
    • `fc` or `Fc` will take you forward to the next or back to the previous occurrence, respectively, of character `c` on the current line (e.g., `fp` will jump you forward to the next occurrence of “p” on the line, while `Fp` will jump you back to the previous occurrence of “p” on the line).
    • `tc` or `Tc` will take you forward to just before the next or back to just after the previous occurrence, respectively, of character `c` on the current line.
    • `;` or `,` repeats, forward or backward, respectively, the last `f`/`F`/`t`/`T` command.
    • `0` jumps you back to the beginning of the line while `$` jumps you to the end of the line.
    • `^` jumps you to the first non-whitespace character on the current line.
    • Typing a number and then typing `|` takes you to that column on the current line.
  • Word-based movements:
    • `w` jumps you forward to the next “beginning-of-word”, while `b` jumps you back to the previous “beginning-of-word” (`W` and `B` for the corresponding WORD forms).
    • `e` jumps you forward to the next “end-of-word”, while `ge` jumps you back to the previous “end-of-word” (`E` and `gE` for corresponding WORD forms).
  • Line-based movements:
    • `j` and `k` move you down and up one line, respectively.
    • `+` and `-` move you to the first non-whitespace character on the next or previous line, respectively.
    • `G` jumps you to the last line of the buffer.
    • Typing a number and then typing `G` takes you to that line number.
    • `gg` jumps you to the first line of the buffer.
    • `+` is thus equivalent to `j^` (first non-blank character of the next line), while `-` is equivalent to `k^` (first non-blank character of the previous line).
    • `H` jumps you to the top (mnemonic=”home”) line of the current window.
    • `M` jumps you to the middle line of the current window.
    • `L` jumps you to the last line of the current window.
  • Page-based movements:
    • `<C-U>` moves you up half a page, while `<C-D>` moves you down half a page.
    • `<C-F>` moves you forward a full page, while `<C-B>` moves you backward a full page.
    • Go to a brace, bracket, or parenthesis, and type `%` to jump to its matching counterpart.
  • Search-based movements:
    • With `:set incsearch` on, type `/` and start typing a search expression. As you type characters of the expression, you will be taken to the first location forward of the cursor position where the matching term appears in the buffer (use `?` to search backwards instead of forwards). Hit `<ENTER>` to start working from this new position, or `<ESC>` to cancel and return to your original location. To find the next match, hit `<ENTER>` and then `n` (or `N` for the previous match) to iterate through all matches in the buffer. If `:set wrapscan` is on, the search will wrap around the buffer. If search highlighting is turned on (`:set hlsearch`), all occurrences of the expression will be highlighted.
    • Position the cursor over any word in an open buffer and (in normal mode, of course) type `*`. This will jump you to the next occurrence of the word under the cursor. If search highlighting is turned on (`:set hlsearch`), all occurrences of the word will be highlighted.
    • Now type `#`. This time, you will be taken back to the previous occurrence of the word under the cursor.
    • Typing `n` or `N` will jump to the next or previous position, respectively, in the buffer that matches the last-entered search expression (i.e., entered via `/`, `?`, `*`, or `#`).
  • History-based movements:
    • `'.` will take you back to the line of the last modification, while its backtick form (backtick followed by `.`) takes you back to the exact position.
    • `'^` will take you back to the line where you last left insert mode, while its backtick form (backtick followed by `^`) takes you back to the exact position.
    • You can use `<C-O>` and `<C-I>` to take you backward and forward through these and other “jump” positions.
  • If you are editing source code, then:
    • `]m` takes you forward to the next “start of a method” (i.e., the beginning of the next method).
    • `[m` takes you back to the previous “start of a method” (i.e., the beginning of the current method if you are in the middle of one, or the beginning of the previous method if you are “in between” methods).
  • Window adjustment:
    • This is not a movement tip per se, but it is relevant in the sense that it changes the spatial relationship of the cursor with respect to the window: `zb`, `zt`, and `zz` scroll the screen to place the current line at the bottom, top, or middle of the window, respectively.

Safe and const-correct std::map Access in C++ STL

The Standard Template Library's std::map subscript operator ([]) will create and return a new entry if passed a key that does not already exist in the map.
This means that you cannot use this operator when you do not want to create a new entry (i.e., you expect the key-value pair to already exist in the map), or in a const context (i.e., in a const method or when using a const object).
Instead, in these situations, you need to first pull a (const) iterator using std::map.find(), and then check to see if its value equals std::map.end(), and only if not proceed with referencing the result.

This means that instead of, for example:

    double v = split_lengths[s];

You need to:

    // assuming, for illustration, that split_lengths is a std::map<std::string, double>
    std::map<std::string, double>::const_iterator it = split_lengths.find(s);
    if (it != split_lengths.end()) {
        double v = it->second;
        // do what you want ... finally!
    } else {
        // raise exception
    }
Yup. This is a pain. Victorian kitchen cooking, indeed.

The new C++0x supplies an `at()` method, which throws a std::out_of_range exception if the key is not found.
If I am going to still be programming in C++ in a couple of decades when this version becomes widespread enough, that would be the way to go. In the meantime, however, the following template makes life a little easier:

// requires <stdexcept> for std::out_of_range
template <typename T>
const typename T::value_type::second_type& map_at(const T& container,
        const typename T::value_type::first_type key) {
    typename T::const_iterator it = container.find(key);
    if (it == container.end()) {
        throw std::out_of_range("Key not found");
    }
    return it->second;
}

Now a safe yet succinct (in C++/STL terms) way of getting items out of a map is:

    double v = map_at(split_lengths, s);

Bringing the Victorian kitchen into the Edwardian era one new-fangled gadget at a time.

Useful diff Aliases

Add the following aliases to your ‘~/.bashrc‘ for some diff goodness:

alias diff-side-by-side='diff --side-by-side -W"`tput cols`"'
alias diff-side-by-side-changes='diff --side-by-side --suppress-common-lines -W"`tput cols`"'
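As a quick sanity check, here is the underlying command the second alias expands to, run on two throwaway files with a fixed width of 80 columns (since `tput cols` needs a terminal); the `|| true` is only there because diff exits non-zero when the files differ:

```shell
# Demo of the diff options behind the aliases, using a fixed -W80 width.
old=$(mktemp) && new=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$old"
printf 'alpha\nBETA\ngamma\n' > "$new"
# Only the changed line is shown, side by side:
diff --side-by-side --suppress-common-lines -W80 "$old" "$new" || true
```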


You can, of course, use shorter alias names in good old UNIX tradition, e.g. ‘ssdiff’ and ‘sscdiff’. You might be wondering (a) why I did not do so, and (b) conversely, what the point is of having aliases that are almost as long as the commands that they alias. The answer to the first is ‘memory’, and to the second, ‘autocomplete’.



Shorter aliases resulted in me constantly forgetting what they were mapped to (I rarely work outside a Git repository, and thus rarely use external diff, relying on Git’s diff 99% of the time), and it was easier for me to Google the options than to open up my huge ‘~/.bashrc’ to look up my personal alias. And being forced not only to look up the options but then to type out all those awkward characters again and again meant that I rarely ended up using these neat diff options. Now, with these aliases, I just type ‘diff’ and then hit ‘TAB’, and let autocompletion show me and finish off the rest of the command for me.