Setting up a Python Scientific Environment (NumPy, SciPy, pandas, StatsModels, etc.) in OS X 10.9 Mavericks

It is better than a nightmare from which you cannot wake up …

  1. Install Homebrew:

    $ ruby -e "$(curl -fsSL"
  2. Find problems and fix them (typically resulting from Homebrew becoming very stroppy if it does not have exclusive access to “/usr/local”):

    $ brew doctor

    VERY IMPORTANT NOTE: Make ABSOLUTELY sure that the DYLD_LIBRARY_PATH environment variable is NOT set. Having it set causes all sorts of problems, as conflicts arise between the libraries that Homebrew installs and the ones that some of your other packages need (e.g., “libpng” and MacVim, respectively). Worse, some builds get confused and fail (e.g., “matplotlib”). What happens if you need DYLD_LIBRARY_PATH for other applications? Please direct your questions to the Homebrew folks.

  3. Install Homebrew’s Pythons:

    $ brew install python
    $ brew install python3
  4. Either create and source a virtual environment using Homebrew’s Python:

    $ virtualenv -p /usr/local/bin/python3 homebrew-stats-computing
    $ . homebrew-stats-computing/bin/activate

    Or make sure Homebrew’s Python is at the head of the “$PATH”:

    $ export PATH=/usr/local/bin:$PATH
  5. Install the NumPy + SciPy + pandas stack:

    $ brew install gfortran
    $ pip3 install numpy
    $ pip3 install scipy
    $ pip3 install pandas
  6. Install StatsModels:

    $ pip3 install cython
    $ pip3 install patsy
    $ cd /tmp
    $ git clone
    $ cd statsmodels
    $ python3 build
    $ python3 install
  7. Install matplotlib:

    $ brew install libpng freetype pkgconfig
    $ cd /tmp
    $ git clone
    $ cd matplotlib
    $ python3 install
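Once the steps above have finished, a quick sanity check is to ask the interpreter which of the packages it can actually see. The following is a minimal sketch and not part of the original recipe: the helper name `installed_versions` is made up for illustration, and it relies only on the standard library's `importlib`.

```python
import importlib


def installed_versions(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    stack = ["numpy", "scipy", "pandas", "statsmodels", "matplotlib"]
    for name, version in installed_versions(stack).items():
        print("{:12s} {}".format(name, version or "NOT INSTALLED"))
```

Run it with the same `python3` that `pip3` installed into; a "NOT INSTALLED" line usually means that a different interpreter is at the head of the path.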

Taking It to 11: Dramatically Speeding Up Keyboard/Typing Responsiveness in OS X

If you use a Mac/OSX, then enter the following commands in your shell and reboot:

    $ defaults write -g KeyRepeat -int 0
    $ defaults write -g InitialKeyRepeat -int 15

If you live in a text editor or the shell, or otherwise spend most of your time hammering away at the keyboard like I do,
then this makes an absolutely wonderful difference in the responsiveness of any typing activity. It will make your previous typing feel like you were pecking away in slow motion at the bottom of a pit of cold tar!

Yes, I know you can set the keyboard repeat rate in `System Preferences`.

I, too, did that a long time ago.

But this trick takes the speed to 11.

Dynamic On-Demand LaTeX Compilation

Most of the existing approaches to integrating LaTeX compilation into a LaTeX writing workflow centered around a text editor (as opposed to a fancy-schmancy IDE) are horrendously bloated creatures, aggressively and voraciously hijacking so many key-mappings and normal functionality that it makes your Vim feel like it is diseased and is experiencing a pathological personality disorder of some kind. Yes, LaTeX-Suite, I am looking mainly at you.

I did not want a platoon of obnoxiously cheery elves to insert two hundred environments into the document while a marching band parades around the room when I hit the `$` key. I just wanted a way to quickly and easily compile the current document, and optionally open it for viewing. Thus, I wrote a little plugin to do exactly that.

It has worked fine enough all this time. But today, thanks to a comment in a reddit discussion, I discovered something magical: latexmk

Not all that great if you use TeXShop or another LaTeX IDE for latexing, but if you use a Plain Old Text Editor, then the following may change your life:

    $ latexmk -pdf -pvc nature-publication-2014.tex

Two neat things going on here.

First, “latexmk”. This is the smart-single-click-do-all latex compiler (and is almost always available with all TeX distributions). Takes care of all those multiple compilation passes with BibTeX and such. So, no matter how complex your document, a single command Gets It Done, from compilation to viewing:

    $ latexmk -pdf nature-publication-2014.tex

Then there is the ‘-pvc’ flag shown above. This is where things get sexy. This invokes “`latexmk`” on the document, which compiles it as usual. But then, instead of exiting, it waits there, monitoring the file for any changes. Upon detecting changes (like `make`), it recompiles and refreshes the document in the viewer!

So, if you have a shell with this command running in the background (either in a separate window or through, e.g., tmux or screen), you can merrily edit away in your POTE, and have constant refreshes as you save.
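The watch-and-rebuild idea can be sketched in a few lines. This is only an illustration of the general technique (polling a file's modification time), not a description of how latexmk is implemented; the names `poll_once`, `watch`, and `on_change` are made up for the example.

```python
import os
import time


def poll_once(path, last_mtime_ns):
    """Return (changed, current_mtime_ns) for `path` relative to `last_mtime_ns`."""
    current = os.stat(path).st_mtime_ns
    return current != last_mtime_ns, current


def watch(path, on_change, polls, interval=0.5):
    """Poll `path` `polls` times and call `on_change(path)` after each detected change.

    Returns the number of times `on_change` was called.
    """
    last = os.stat(path).st_mtime_ns
    calls = 0
    for _ in range(polls):
        time.sleep(interval)
        changed, last = poll_once(path, last)
        if changed:
            on_change(path)
            calls += 1
    return calls
```

A callback that runs `latexmk -pdf` on the file through the `subprocess` module would turn this into a crude stand-in for ‘-pvc’, minus the viewer refresh and the dependency tracking that latexmk does for you.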

Setting up the Text Editor in My Computing Ecosystem



Basic Setup of Shell to Support My Text Editor Preferences

By “text editor”, I mean Vim, of course. There are pseudo-operating systems that include rudimentary text-editing capabilities (e.g. Emacs), and integrated development environments that allow for editing of text, but there really is only one text editor that deserves the title of “text editor“: Vim, that magical mind-reading mustang that carries out textual mogrifications with surgical precision and zen-like elegance.

However, while Vim is the only true text editor, there is more than one flavor of Vim. There is the original, universal one, i.e., console Vim. And then there are various lines of graphical Vims specific to various operating systems. On my primary development box, a Mac, I prefer MacVim over straight-up console Vim, for many, many, many reasons, even though I practically live in the shell otherwise. For working with files on remote machines, I prefer using SSH to log in and editing with console Vim, even though I can also edit these files through my own local MacVim instance. My computing ecosystem is shared between all my accounts, and this includes a collection of scripts as well as resources for various applications (e.g., not only my entire ‘~/.vimrc’, but the entire accompanying ‘~/.vim’ directory).

I have the following lines in my ‘~/.bashrc’:

    export EDITOR=' -f'
    export EDITOR=''
alias e=''

This invokes the following script, ‘‘ which is in the ‘$PATH‘ of all my accounts:

#! /bin/bash

if [[ -f /Applications/ ]]
    # bypass mvim for speed
    VIMPATH='/Applications/ -g'
elif [[ -f /usr/local/bin/mvim ]]
    # fall back to mvim
    # fall back to original vim


Easy Opening of Multiple Project Files

I have FuzzySnake installed on my system. My totally unbiased opinion as the author of FuzzySnake is that it is absolutely awesome, and that you should all be using it as well. In addition to the default behavior of opening the selected targets in my ‘$EDITOR’ (which is, as shown above, ‘‘), I also use the nifty ‘-L/--list-and-exit’ facility of FuzzySnake to discover files of particular types and pass them to Vim. I could define a set of aliases like ‘alias epy="e $(fz -L --python)"’, but this approach makes it impossible to pass options to the editor. Instead, I have a set of convenience functions like:

# open all C/C++ and header files
function ecc() {
    e $(fz -L --cpp --exclude-file-patterns="config.h" "$@")
}
# open all C/C++ and header files, as well as autotools files
function ecca() {
    e $(fz -L --cpp --autotools --exclude-file-patterns="config.h" "$@")
}
# open CMake files
function ecm() {
    e $(fz --cmake "$@")
}
# open all C/C++ and header files, as well as CMake files
function eccm() {
    e $(fz -L --cpp --cmake --exclude-file-patterns="config.h" "$@")
}
# open all Python files
function epy() {
    e $(fz -L --py "$@")
}

Of course, for on-the-fly use-cases, I usually just do something like:

$ e $(fz -L --rst)

Allowing for Opening of Files in an Existing MacVim Instance

In MacVim’s Preferences, I make sure that ‘Open files from applications‘ is set to ‘in the current window‘. Then I define the following alias in my ‘~/.bashrc‘:

    # alias ee="open \"mvim://open?url=file://$1\""
    alias ee="open -a MacVim"

With this, while typing

$ e foo bar

will open a new Vim session (either MacVim or console vim, depending on whether I am editing a file on my local machine or remotely through SSH), with files ‘foo’ and ‘bar’ loaded into buffers, typing

$ ee foo bar

will open ‘foo’ and ‘bar’ in an existing MacVim session.

Opening Remote Files in My Local (Mac)Vim Session, with Auto-Completion of Paths/Filenames

Sometimes I do prefer to edit remote files in my desktop MacVim (one might wonder why I do not always prefer this …). Lack of path completion when invoking the command has always been an irritation for me, until I defined the following function:

function edit-remote() {
    if [[ -z $1 ]]
    then
        echo "usage: edit-remote [[user@]host1:]file1 ... [[user@]host2:]file2"
        return 1
    fi
    declare -a targs=()
    for iarg in "$@"
    do
        # rewrite scp-style "host:path" as an "scp://host/path" URL
        targ="scp://$(echo "$iarg" | sed -e 's@:/@//@' | sed -e 's@:@/@')"
        targs=("${targs[@]}" "$targ")
    done
    $EDITOR "${targs[@]}"
}
complete -F _scp -o nospace edit-remote
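The heart of the function is the pair of `sed` substitutions that turn an scp-style target into a URL. As a hedged illustration of what they do, here is the same rewrite in Python; the name `to_scp_url` is invented for this sketch. My understanding is that Vim's netrw reads a double slash after the host as an absolute path and a single slash as a path relative to the remote home directory, which is why the two cases are rewritten differently.

```python
def to_scp_url(target):
    """Convert an scp-style "[user@]host:path" target into an "scp://" URL."""
    # Mirror the two sed substitutions; each one touches the first match only.
    converted = target.replace(":/", "//", 1)   # absolute path: host:/p -> host//p
    converted = converted.replace(":", "/", 1)  # relative path: host:p  -> host/p
    return "scp://" + converted


if __name__ == "__main__":
    for example in ("user@host:/etc/hosts", "host:notes/todo.txt"):
        print(example, "->", to_scp_url(example))
```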

Smart (`infercase`) Dictionary Completions in Vim While Preserving Your Preferred `ignorecase` Settings

Dictionary completions in Vim can use an ‘infer case’ mode, where, e.g.,
“Probab” will correctly autocomplete to, e.g., “Probability”, even though the
entry in the dictionary might be in a different case. The problem is that this
mode only works if `ignorecase` is on. And sometimes, we want one (`infercase`)
but not the other (`ignorecase`).

The following function, if added to your “`~/.vimrc`”, sets it up so that `ignorecase` is
forced on when dictionary completions are invoked via the `` keys,
and then restored to its original state when exiting.

[gist id=10015995]

A better approach than binding all the exit keys would be an autocommand on leaving the pop-up completion menu, but I could only find a trigger for entering the popupmenu.

Building MacVim Natively on OS X 10.7 and Higher

You might want to do this if you want to install the latest snapshot and no pre-built release is available.

OR you might want MacVim to use a custom Python installation instead of the default one on the system path.

This latter was my motivation.

Once you have downloaded and unpacked the code base that you want to build, step into the `src/` subdirectory:

$ cd src

Before proceeding, make sure that your Python installations have been built with the “--enable-shared” flag! If this is not the case, no matter how you build MacVim, you will not have Python available. Rebuild your Pythons with this flag and reinstall if necessary before proceeding.

Then configure the build with:

$ export LDFLAGS=/usr/platform/lib
$ CC=clang ./configure \
    --enable-perlinterp \
    --enable-pythoninterp \
    --enable-python3interp \
    --enable-rubyinterp \
    --enable-cscope \
    --enable-gui=macvim \
    --with-mac-arch=intel \

Then build:

$ make

The “``” build product will be in the “`MacVim/build/Release`” subdirectory, and can be tested by:

$ open MacVim/build/Release/

Or installed by:

$ cp MacVim/build/Release/ /Applications

Using Python’s “timeit” Module to Benchmark Functions Directly (Instead of Passing in a String to be Executed)

All the basic examples for Python’s timeit module show strings being executed. This leads to, in my opinion, somewhat convoluted code such as:

#! /usr/bin/env python

import timeit

def f():
    pass

if __name__ == "__main__":
    timer = timeit.Timer("__main__.f()", "import __main__")
    result = timer.repeat(repeat=100, number=100000)

For some reason, the fact that you can call a function directly is only (again, in my opinion) obscurely documented. But this makes things so much cleaner:

#! /usr/bin/env python

import timeit

def f():
    pass

if __name__ == "__main__":
    timer = timeit.Timer(f)
    result = timer.repeat(repeat=100, number=100000)

Much more elegant, right?

One puzzling issue is a Heisenbug-ish issue (i.e., the observation affecting the outcome): the second version consistently and repeatedly results in faster performance timings. I can see differences in overall benchmark script execution time, due to differences in the way overhead resources are allocated/loaded, but I would hope that actual performance timing would be invariant to this. Maybe with “real” code, instead of the dummy execution body, things will be more consistent? Or is this a real issue?
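To compare the two styles side by side, something like the following can be used. It is a sketch under a few assumptions: it targets Python 3.5 or later (it uses the `globals` argument of `timeit.Timer`), the helper name `best_time` is made up, and it reports the minimum over the repeats, which is the figure the timeit documentation suggests looking at. It makes no claim about which form will win on your machine.

```python
import timeit


def f():
    pass


def best_time(stmt, number=100000, repeat=5, **kwargs):
    """Return the fastest total time (in seconds) over `repeat` runs of `number` calls."""
    timer = timeit.Timer(stmt, **kwargs)
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == "__main__":
    as_string = best_time("f()", globals={"f": f})
    as_callable = best_time(f)
    print("string:   {:.6f} s".format(as_string))
    print("callable: {:.6f} s".format(as_callable))
```

One plausible contributor to the gap described above, offered as a guess rather than a measurement: the string "__main__.f()" has to look up the module and then its attribute on every iteration, while the callable is already bound when the timer is created.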

‘xargs’ – Handling Filenames With Spaces or Other Special Characters

xargs is a great little utility to perform batch operations on a large set of files.
Typically, the results of a find operation are piped to the xargs command:

   find . -iname "*.pdf" | xargs -I{} mv {} ~/collections/pdf/

The -I{} tells xargs to substitute ‘{}’ in the statement to be executed with the entries being piped through.
If these entries have spaces or other special characters, though, things will go awry.
For example, filenames with spaces in them passed to xargs will result in xargs barfing with a “xargs: unterminated quote” error on OS X.

The solution is to use null-terminated strings in both the find and xargs invocations:

   find . -iname "*.pdf" -print0 | xargs -0 -I{} mv {} ~/collections/pdf/

Note the -print0 argument to find, and the corresponding -0 argument to xargs: the former tells find to produce null-terminated entries while the latter tells xargs to expect and consume null-terminated entries.
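Why the null byte helps is easy to see in miniature. The sketch below is an illustration in Python, not a model of xargs itself (the function name `split_on_null` and the sample filenames are made up): splitting a stream of names on whitespace tears apart any name that contains a space, whereas splitting on the null byte cannot, because a null byte can never appear inside a filename.

```python
def split_on_null(stream):
    """Split a NUL-terminated stream (as produced by `find -print0`) into entries."""
    return [entry for entry in stream.split("\0") if entry]


names = ["annual report.pdf", "it's final.pdf", "plain.pdf"]

# Whitespace-delimited: three files turn into five fragments.
whitespace_stream = " ".join(names)
print(whitespace_stream.split())

# NUL-terminated: every entry survives intact, spaces and quotes included.
null_stream = "".join(name + "\0" for name in names)
print(split_on_null(null_stream))
```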

YonderGit: Simplified Git Remote Repository Management

One of the great strengths of Git is the multiple and flexible ways of handling remote repositories. Just like Subversion, they can be "served" out of a location, but more generally, if you can reach it from your computer through any number of ways (ssh, etc.), you can git it.

YonderGit wraps up a number of common operations with remote repositories: creating, initializing, adding to (associating with) the local repository, removing, etc.

You can clone your own copy of the YonderGit code repository using:

git clone git://

Or you can download an archive directly here:

After downloading, enter "sudo python" in the YonderGit directory to install. This will just copy the "" script to your system path. After that, enter " commands?" for a summary of possible commands, or " --help" for help on options.

Quick Summary of Commands

$ setup  

Create directory specified by "REPO-URL", using either the "ssh" or local filesystem transport protocol, initialize it as repository by running "git init", and add it as a remote called "NAME" of the local git repository. Will fail if directory already exists.

$ create 

Create directory specified by "REPO-URL", using either the "ssh" or local filesystem transport protocol, and then initialize it as repository by running "git init". Will fail if directory already exists.

$ init 

Initialize remote directory "REPO-URL" as a repository by running "git init" in the directory. Will fail if directory does not already exist.

$ add  

Add "REPO-URL" as a new remote called "NAME" of the local git repository.

$ delete 

Recursively remove the directory "REPO-URL" and all subdirectories and files.

Valid Repository URL Syntax

Secure Shell Transport Protocol

  • ssh://user@host.xz:port/path/to/repo.git/
  • ssh://user@host.xz/path/to/repo.git/
  • ssh://host.xz:port/path/to/repo.git/
  • ssh://host.xz/path/to/repo.git/
  • ssh://user@host.xz/~user/path/to/repo.git/
  • ssh://host.xz/~user/path/to/repo.git/
  • ssh://user@host.xz/~/path/to/repo.git
  • ssh://host.xz/~/path/to/repo.git
  • user@host.xz:/path/to/repo.git/
  • host.xz:/path/to/repo.git/
  • user@host.xz:~user/path/to/repo.git/
  • host.xz:~user/path/to/repo.git/
  • user@host.xz:path/to/repo.git
  • host.xz:path/to/repo.git
  • rsync://host.xz/path/to/repo.git/

Git Transport Protocol

  • git://host.xz/path/to/repo.git/
  • git://host.xz/~user/path/to/repo.git/

HTTP/S Transport Protocol

  • http://host.xz/path/to/repo.git/
  • https://host.xz/path/to/repo.git/

Local (Filesystem) Transport Protocol

  • /path/to/repo.git/
  • path/to/repo.git/
  • ~/path/to/repo.git
  • file:///path/to/repo.git/
  • file://~/path/to/repo.git/
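To make the groupings above concrete, here is a rough sketch of how a URL could be sorted into one of these transports. It is not YonderGit's actual parser; the function name `classify_repo_url` and the returned labels are invented for illustration, and it reports `rsync://` under its own label even though the list above files it with the SSH forms.

```python
import re

# Transport label for each explicit "scheme://" prefix.
_SCHEMES = {
    "ssh": "ssh",
    "rsync": "rsync",
    "git": "git",
    "http": "http",
    "https": "http",
    "file": "local",
}

# scp-like "[user@]host:path" form, which has no "scheme://" prefix.
_SCP_LIKE = re.compile(r"^(?:[^@/:]+@)?[^/:]+:")


def classify_repo_url(url):
    """Return "ssh", "rsync", "git", "http", or "local" for a repository URL."""
    scheme, separator, _ = url.partition("://")
    if separator:
        if scheme not in _SCHEMES:
            raise ValueError("unsupported scheme: {}".format(scheme))
        return _SCHEMES[scheme]
    if _SCP_LIKE.match(url):
        return "ssh"
    # Anything else is treated as a path on the local filesystem.
    return "local"
```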

Some Vim Movement Tips

  • Within-line character-based movement:
    • `h` and `l` move you left and right one character, respectively.
    • `fc` or `Fc` will take you forward to the next or back to the previous, respectively, occurrence of character `c` on the current line (e.g., `fp` will jump you forward to the next occurrence of “p” on the line, while `Fp` will jump you back to the previous occurrence of “p” on the line).
    • `tc` or `Tc` will take you forward to just before the next or back to just after the previous, respectively, occurrence of character `c` on the current line.
    • `;` or `,` repeats, forward or backward, respectively, the last `f`/`F`/`t`/`T` command.
    • `0` jumps you back to the beginning of the line while `$` jumps you to the end of the line.
    • `^` jumps you to the beginning of the first non-whitespace character on the current line.
    • Typing in a number and then typing `|` takes you to a column of that number on the current line.
  • Word-based movements:
    • `w` jumps you forward to the next “beginning-of-word”, while `b` jumps you back to the previous “beginning-of-word” (`W` and `B` for the corresponding WORD forms).
    • `e` jumps you forward to the next “end-of-word”, while `ge` jumps you back to the previous “end-of-word” (`E` and `gE` for corresponding WORD forms).
  • Line-based movements:
    • `j` and `k` move you down and up one line, respectively.
    • `+` and `-` move you to the first non-whitespace character on the next or previous line, respectively.
    • `G` jumps you to the last line of the buffer.
    • Typing in a number and then typing `G` takes you to line of that number.
    • `gg` jumps you to the first line of the buffer.
    • `+` has the same effect as `j^`, while `-` has the same effect as `k^`.
    • `H` jumps you to the top (mnemonic=”home”) line of the current window.
    • `M` jumps you to the middle line of the current window.
    • `L` jumps you to the last line of the current window.
  • Page-based movements:
    • `<C-U>` moves you up half a page, while `<C-D>` moves you down by a half a page.
    • `<C-F>` moves you forward a full page, while `<C-B>` moves you backward a full page.
    • Position the cursor on a brace, bracket, parenthesis, etc., and type `%` to jump to its matching partner (quotes are not matched by default).
  • Search-based movements:
    • With `:set incsearch` on, type `/` and start typing in a search expression. As you type characters of the expression, you will be taken to the first location forward of the cursor position where that matching term appears in the buffer (use `?` to search backwards instead of forwards). Hit `<ENTER>` to start working from this new position, or `<ESC>` to cancel and return to your original location. To find the next matching expression, hit `<ENTER>` and then `n` to iterate through all matches in the buffer. If `:set wrapscan` is on, then the search will wrap around the buffer. If search highlighting is turned on (`:set hlsearch`), all occurrences of the expression will be highlighted.
    • Position the cursor over any word in an open buffer, and (in normal mode, of course), type `*`. This will jump you to the next occurrence of the word under the cursor. If search highlighting is turned on (`:set hlsearch`), all occurrences of the word will be highlighted.
    • Now type `#`. This time, you will be taken back to the previous occurrence of the word under the cursor.
    • Typing `n` or `N` will jump to the next position in the buffer that matches the last-entered search expression (i.e., either through `/`, `?`, `*`, or `#`).
  • History-based movements:
    • `` `. `` or `'.` will take you back to the last position or line, respectively, where you modified the file.
    • `` `^ `` or `'^` will take you back to the last position or line, respectively, where you left insert mode.
    • You can use `<C-O>` and `<C-I>` to take you backward and forward through these and other “jump” positions.
  • If you are editing source code, then:
    • `]m` takes you forward to the next “start of a method” (i.e., the beginning of the next method).
    • `[m` takes you back to the previous “start of a method” (i.e., the beginning of the current method if you are in the middle of one, or the beginning of the previous method if you are “in between” methods).
  • Window adjustment:
    • This is not a movement tip per se, but it is relevant in the sense that it changes the spatial relationship of the cursor with respect to the window: `zt`, `zb`, and `zz` scroll the screen to put the current line at the top, bottom, or middle, respectively.

Safe and const-correct std::map Access in C++ STL

The Standard Template Library std::map operator[] will create and return a new entry if passed a key that does not already exist in the map.
This means that you cannot use this operator when you do not want to create a new entry (i.e., you expect the key-value pair to already exist in the map), or in a const context (i.e., in a const method or when using a const object).
Instead, in these situations, you need to first pull a (const) iterator using std::map::find(), then check whether it equals std::map::end(), and only if it does not, proceed with dereferencing the result.

This means that instead of, for example:

    double v = split_lengths[s];

You need to:

    std::map::const_iterator it = split_lengths.find(s);
    if (it != split_lengths.end()) {
        // do what you want ... finally!
    } else {
        // raise exception
    }

Yup. This is a pain. Victorian kitchen cooking, indeed.

The new C++0x standard supplies an at() method, which throws a std::out_of_range exception if the key is not found.
If I am going to still be programming in C++ in a couple of decades when this version becomes widespread enough, that would be the way to go. In the meantime, however, the following template makes life a little easier:

#include <stdexcept>

// Look up `key` in any map-like container without inserting a new entry.
template <typename T>
const typename T::value_type::second_type& map_at(const T& container,
        const typename T::value_type::first_type key) {
    typename T::const_iterator it = container.find(key);
    if (it == container.end()) {
        throw std::out_of_range("Key not found");
    }
    return it->second;
}

Now a safe yet succinct (in C++/STL terms) way of getting items out of a map is:

    double v = map_at(split_lengths, s);

Bringing the Victorian kitchen into the Edwardian era one new-fangled gadget at a time.

Useful diff Aliases

Add the following aliases to your ‘~/.bashrc‘ for some diff goodness:

alias diff-side-by-side='diff --side-by-side -W"`tput cols`"'
alias diff-side-by-side-changes='diff --side-by-side --suppress-common-lines -W"`tput cols`"'


You can, of course, use shorter alias names in good old UNIX tradition, e.g. ‘ssdiff’ and ‘sscdiff’. You might be wondering why (a) I did not do so, and (b) what is the point, conversely, of having aliases that are almost as long as the commands that they are aliasing. The answer to the first is ‘memory’, and the second is ‘autocomplete’.



Shorter aliases resulted in me constantly forgetting what they were mapped to (I rarely work outside a Git repository, and thus rarely use external diff, relying on Git’s diff 99% of the time), and it was easier for me to Google the options than to open up my huge ‘~/.bashrc’ to look up my personal alias. And being forced not only to look up the options but then to type out all those awkward characters again and again meant that I rarely ended up using these neat diff options. However, now, with these aliases, I just type ‘diff’ and then hit ‘TAB’, and let autocompletion show me the options and finish off the rest of the command for me.

Setting Up Git to Use Your Diff Viewer or Editor of Choice

Git offers two ways of viewing differences between commits, or between commits and your working tree: diff and difftool.
The first of these, by default, dumps the results to the standard output.
This mode of presentation is great for quick summaries of small sets of changes, but is a little cumbersome if there are a large number of changes between the two commits being compared and/or you want to closely examine the changes, browsing back-and-forth between different files/lines, search for specific text, fold away or hide non-changed lines etc.
In these cases, you would like to use an external or third-party diff program/viewer to review the differences, and
Git offers two ways to allow for this.

The Less-Than-Ideal Approach

You can set a configuration variable to send the results to a third-party diff program by adding the following line to your “~/.gitconfig“:

[diff]
    external = 

where “” is a script that invokes your diff program/viewer of choice.
The reason you need to use a wrapper script rather than the external program directly is because Git calls the program specified by passing it seven arguments in the following order: “path“, “old-file“, “old-hex“, “old-mode“, “new-file“, “new-hex“, “new-mode“.
So, depending on your diff program, you would need to filter out unneeded/unused arguments, or add switches/flags as appropriate.
For example, if you want to use Vim, the wrapper script may look something like:

#! /bin/sh
vimdiff "$2" "$5"
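The positional bookkeeping is the only subtle part of such a wrapper, so here is the same idea spelled out in Python with the arguments named. This is a sketch, not a drop-in replacement: the names `DiffArgs`, `parse_external_diff_args`, and `viewer_command` are invented, and it assumes the seven-argument call described above (it simply rejects anything else).

```python
from collections import namedtuple

# The seven arguments, in the order Git passes them to the external diff program.
DiffArgs = namedtuple(
    "DiffArgs",
    "path old_file old_hex old_mode new_file new_hex new_mode",
)


def parse_external_diff_args(argv):
    """Name the seven positional arguments; argv[0] is the wrapper script itself."""
    if len(argv) != 8:
        raise ValueError("expected 7 arguments, got {}".format(len(argv) - 1))
    return DiffArgs(*argv[1:])


def viewer_command(argv, viewer=("vimdiff",)):
    """Build the command line for the viewer: just the old and the new file."""
    args = parse_external_diff_args(argv)
    return list(viewer) + [args.old_file, args.new_file]
```

A real wrapper would pass `sys.argv` to `viewer_command` and hand the result to `subprocess.call`; the shell version above does the same thing by picking out "$2" and "$5".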

This approach is less than ideal, however, at least for me, because once you have configured your Git this way, then all invocations of “git diff” will launch the external applications.
And the fact is that there are many times (the majority, in my case) where this is simply overkill and the short summary in standard output does just fine.
You can, of course, still get the native Git standard output diff dump even with the external diff program configured as above by passing in an appropriate flag, but, trust me, this is a bit of a pain.

The Ideal Approach

Git, fortunately, offers a second approach: difftool.
This is essentially a wrapper around diff, taking the same arguments and options, but instead calls the external diff program/viewer by default.
This approach thus allows you to retain “git diff” for standard output reviews of changes, while unleashing the power of a more sophisticated diff program/viewer for more extended/flexible/complex reviews by invoking “git difftool“.

Git offers a range of difftools pre-configured “out-of-the-box” (kdiff3, kompare, tkdiff, meld, xxdiff, emerge, vimdiff, gvimdiff, ecmerge, diffuse, opendiff, p4merge and araxis), and also allows you to specify your own.
To use one of the pre-configured difftools (for example, “vimdiff”), you add the following lines to your “~/.gitconfig“:

[diff]
    tool = vimdiff

Specifying your own difftool, on the other hand, takes a little bit more work, but is still pretty simple … IF you know how to do it.

I did not. And, unfortunately, the documentation did not help me very much.
It took quite a bit of Googling and experimentation before I figured it out.

You basically need to add the following lines to your “~/.gitconfig“:

[diff]
    tool = default-difftool

[difftool "default-difftool"]
    cmd = $LOCAL $REMOTE

You can, of course, replace “default-difftool” with anything you care to name your preferred difftool, and “” with whatever you end up calling your wrapper script.

My difftool of choice is Vim, and, while “vimdiff” is a pre-configured option, I did not want to use it because I wanted the flexibility to invoke MacVim when I am using my laptop but console Vim when I am working remotely on a Linux box (my Git configuration, and for that matter, my entire work environment from the shell to Vim to what-have-you, all 37MB, is shared across multiple machines … and all managed/synced using Git, of course).
So my wrapper script looks like the following:

#! /bin/bash

if [[ -f /Applications/ ]]
    # bypass mvim for speed
    VIMPATH='/Applications/ -g -dO -f'
elif [[ -f /usr/local/bin/mvim ]]
    # fall back to mvim
    VIMPATH='mvim -d -f'
    # fall back to original vim


And that’s all there is to it!

One Last Tweak

I find it very annoying to have to hit “” before moving on to the next file. The following lines added to your “~/.gitconfig” put a stop to that:

[difftool]
    prompt = false

Using DendroPy Interoperability Modules to Download, Align, and Estimate a Tree from GenBank Sequences

The following example shows how easy it can be to use the three interoperability modules provided by the DendroPy Phylogenetic Computing Library to download nucleotide sequences from GenBank, align them using MUSCLE, and estimate a maximum-likelihood tree using RAxML. The automatic label composition option of the DendroPy genbank module creates practical taxon labels out of the original data. We also pass in additional arguments to RAxML to request that the tree search be carried out 250 times (['-N', '250']).

(The tree below is shown just as an example of the output; some errant taxa placement is evident).

                                           /--- AF098348_Trictena_atripalpis
|                                          \--- AF098349_Trictena_sp.
|                                          /--- AF098335_Aoraia_lenis
|                                      /---+
|      /-------------------------------+   \--- AF098336_Aoraia_rufivena
|      |                               |
|      |                               \------- AF098334_Aoraia_enysii
|      |
|  /---+  /------------------------------------ AF098347_Oxycanus_sphragidias
|  |   |  |
|  |   |  |                            /------- AF098346_Oxycanus_antipoda
|  |   |  |                         /--+
|  |   |  |                         |  |   /--- AF098345_Oxycanus_dirempta
|  |   \--+   /---------------------+  \---+
|  |      |   |                     |      \--- AF098344_Oxycanus_australis
|  |      |   |                     |
|  |      |   |                     \---------- AF098343_Jeana_robiginosa
|  |      |   |
|  |      |   |                     /---------- AF098337_Cladoxycanus_minos
|  |      \---+                     |
|  |          |   /-----------------+      /--- AF098338_Dioxycanus_fuscus
|  |          |   |                 |  /---+
|  |          |   |                 \--+   \--- AF098339_Dioxycanus_oreas
|  |          |   |                    |
|  |          |   |                    \------- AF098342_Heloxycanus_patricki
+  |          \---+
|  |              |  /------------------------- AF098340_Dumbletonius_characterifer
|  |              |  |
|  |              |  |   /--------------------- AF098351_Wiseana_cervinata
|--+              |  |   |
|  |              \--+   |             /------- AF098355_Wiseana_jocosa
|  |                 |   |          /--+
|  |                 |   |          |  |   /--- AF098356_Wiseana_mimica
|  |                 |   |          |  \---+
|  |                 \---+      /---+      \--- AF098354_Wiseana_fuliginea
|  |                     |      |   |
|  |                     |      |   |  /------- AF098359_Wiseana_umbraculata
|  |                     |      |   \--+
|  |                     |  /---+      |   /--- AF098357_Wiseana_signata
|  |                     |  |   |      \---+
|  |                     |  |   |          \--- AF098358_Wiseana_signata
|  |                     \--+   |
|  |                        |   \-------------- AF098350_Wiseana_cervinata
|  |                        |
|  |                        |          /------- AF098341_Dumbletonius_unimaculatus
|  |                        \----------+
|  |                                   |   /--- AF098353_Wiseana_copularis
|  |                                   \---+
|  |                                       \--- AF098352_Wiseana_copularis
|  |
|  \------------------------------------------- AF098333_Aenetus_virescens
\---------------------------------------------- AF098332_Fraus_simulans

Vim Regular Expression Special Characters: To Escape or Not To Escape

Vim’s regular expression dialect is distinct from many of the other more popular ones out there today (and actually predates them).
One of the dialect differences that always leaves me fumbling has to do with which special characters need to be escaped.
Vim does have a special “very magic” mode (that is activated by “\v” in the regular expression) that makes things very clean and simple in this regard: only letters, numbers and underscores are treated as literals without escaping.
But I have never got used to the habit of preceding my expressions with “\v”, though maybe I should.

In the meantime, however, I thought I would put up a quick reference that lists all the special regular expression characters in the default “magic” mode Vim dialect, divided into those that do not need to be escaped vs. those that do.

Regular Expression Special Characters Not Requiring Escaping

The following special characters are interpreted as regular expression operators without escaping (escaping will result in them being interpreted as literals):

\ Escape next character (use “\\” for literal backslash).
^ Start-of-line (at start of pattern).
$ End-of-line.
. Matches any character.
* Matches 0 or more occurrences of the previous atom.
~ Matches last given substitute string.
[] Matches any of the characters given within the brackets.
[^] Matches any character not given within the brackets.
& In replacement pattern: insert the whole matched search pattern.

Regular Expression Special Characters Requiring Escaping

The following special characters are interpreted as regular expression operators only when escaped (otherwise they will be interpreted as literals):

\< Matches beginning of a word (left word break/boundary).
\> Matches end of a word (right word break/boundary).
\(...\) Grouping into an atom.
\| Separating alternatives.
\_. Matches any single character or end-of-line.
\+ 1 or more of the previous atom (greedy).
\= 0 or one of the previous atom (greedy).
\? 0 or one of the previous atom (greedy).
\{ Multi-item count match specification (greedy).

\{n,m} n to m occurrences of the preceding atom (as many as possible).
\{n} Exactly n occurrences of the preceding atom.
\{n,} At least n occurrences of the preceding atom (as many as possible).
\{,m} 0 to m occurrences of the preceding atom (as many as possible).
\{} 0 or more occurrences of the preceding atom (as many as possible).
\{- Multi-item count match specification (non-greedy).

\{-n,m} n to m occurrences of the preceding atom (as few as possible).
\{-n} Exactly n occurrences of the preceding atom.
\{-n,} At least n occurrences of the preceding atom (as few as possible).
\{-,m} 0 to m occurrences of the preceding atom (as few as possible).
\{-} 0 or more occurrences of the preceding atom (as few as possible).
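
To make the two lists concrete, here is a small sketch that runs the same two substitutions twice from the shell, once in the default “magic” dialect and once with “\v”. It assumes a “vim” executable is on the PATH (the block does nothing otherwise) and drives it in silent Ex mode; the sample text and the variable names “magic” and “verymagic” are made up for the illustration.

```shell
# Sketch only: assumes `vim` is installed; skipped if it is not.
if command -v vim >/dev/null 2>&1; then
    magic=$(mktemp)
    verymagic=$(mktemp)
    printf 'cat concat cat\nab abab ababab\n' > "$magic"
    cp "$magic" "$verymagic"

    # Default "magic" mode: \< \> \( \) and \{ all need the backslash.
    vim -es -u NONE \
        -c '1s/\<cat\>/dog/g' \
        -c '2s/\(ab\)\{2,}/X/g' \
        -c 'wq' "$magic" </dev/null

    # "Very magic" mode: the same operators, unescaped, after \v.
    vim -es -u NONE \
        -c '1s/\v<cat>/dog/g' \
        -c '2s/\v(ab){2,}/X/g' \
        -c 'wq' "$verymagic" </dev/null

    # Both files should now hold the same text.
    cat "$magic"
    cmp "$magic" "$verymagic" && echo "magic and very magic agree"
fi
```

The first substitution replaces only the whole word “cat” (leaving “concat” alone), and the second replaces runs of two or more “ab”.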

Unconditionally Accepting All Merging-In Changes During a Git Merge

Merge conflicts suck. It is not uncommon, however, to know up front that you simply want to accept all the changes from the branch that you are merging in, which makes things a lot simpler conceptually. The Git documentation suggests that this can be procedurally simple as well, as it mentions the “-s theirs” merge strategy, which does just that, i.e., unconditionally accepts everything from the branch that you are merging in:

$ git merge -s theirs 

Unfortunately, however, running the above command results in an error message along the lines of “theirs” is not a known strategy. This is because, as discussed here (original reference here), this option has been removed from Git. I am sure the reasons for this are sound. But it is too bad that the documentation (as of my current installation, 1.7.7) has not been updated to reflect these changes. Really too bad. Really, really, really, really too bad. Because it takes a potentially dangerous, always stressful, and sometimes frustrating experience and makes it all the worse due to outright false documentation. Still, as an open source developer myself, I recognize that the time, energy, and effort demands of maintaining open source software have to be fitted in between the demands of the other aspects of life, and that there is consequently sometimes considerable lag in updating what usually receives the lowest priority: the documentation.

In any case, luckily the page referred to previously provides some nice solutions for achieving the same effect as “-s theirs”, which I am summarizing here for my own reference.

Method #1

git merge -s ours ref-to-be-merged
git diff --binary ref-to-be-merged | git apply -R --index
git commit -F .git/COMMIT_EDITMSG --amend

Method #2

git checkout MINE
git merge --no-commit -s ours HERS
git rm -rf .
git checkout HERS -- .
git checkout MINE -- debian # or whatever, as appropriate

Method #3

# get the contents of another branch
git read-tree -u --reset 
# selectively merge subdirectories
# e.g., supersede upstream source with that from another branch
git merge -s ours --no-commit other_upstream
git read-tree --reset -u other_upstream     # or use --prefix=foo/
git checkout HEAD -- debian/
git checkout HEAD -- .gitignore
git commit -m 'superseed upstream source' -a

Of these, I tried the first method, and it worked like a charm. YMMV.
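
For anyone who wants to see Method #1 in action before trusting it on real work, here is a minimal sketch that plays it out in a throwaway repository. Everything in it is an assumption made for the demonstration: the branch name “theirs-demo”, the file “file.txt”, and the dummy identity. It also swaps in “git commit --amend --no-edit” for the “-F .git/COMMIT_EDITMSG” form, simply to reuse the merge commit's own message.

```shell
# Throwaway-repo walk-through of Method #1 (all names are hypothetical).
set -e

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
mine=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

echo "base" > file.txt
git add file.txt
git commit -q -m "base"

# "Their" side of the history.
git checkout -q -b theirs-demo
echo "their version" > file.txt
git commit -q -a -m "their change"

# "My" side, which conflicts with theirs.
git checkout -q "$mine"
echo "my version" > file.txt
git commit -q -a -m "my change"

# Method #1: record the merge keeping our tree, then overwrite it with theirs.
git merge -q -s ours -m "merge theirs-demo, taking their content" theirs-demo
git diff --binary theirs-demo | git apply -R --index
git commit -q --amend --no-edit

result=$(cat file.txt)
echo "$result"
```

After the last command the merge commit still has both parents, but its content matches the merged-in branch.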

The Power and Precision of Vim’s Text Objects: Efficient, Elegant, Awesome.

Vim’s text objects are not only a powerful, flexible and precise way to specify a region of text, but also intuitive and efficient.
They can be used with any command that can be combined with a motion (e.g., “d”, “y”) as well as in visual mode (“v”), but in this post I will be using the “c” command (“change”) to illustrate them.

Imagine you were on a line that looked like this, with the cursor on the letter “r” of the word “dry”:

print "Enter run mode ('test', 'dry', or 'full')"

Then, after typing “c” to start the “change” command, you can type “i” or “a” followed by another character to define a text object to which to apply the “change”.

Using “i” gives you the “inner” text object (i.e., a less-inclusive or non-greedy version that excludes the wrapping characters or delimiters) while “a” gives the full object (i.e., a more-inclusive or greedy version, that includes the wrapping characters or delimiters).

The third and final character of the command sequence gives the criteria by which the text object is defined.

This is typically the initial letter of the name of the object (e.g., “w” for “word”, “s” for “sentence”) or a character that delimits/wraps a region of text (e.g. a parenthesis for text in parentheses, a quote for quoted text, a curly brace for text in curly-braces).

For example:

  • Typing “ci'” will delete the word “dry”, not including the quotes, and enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print "Enter run mode ('test', '', or 'full')"

  • Typing “ca'” will delete “'dry'”, including the quotes (Vim also takes the adjacent space with a quoted-string object), and enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print "Enter run mode ('test',, or 'full')"

  • Typing “ci(” or “ci)” will delete everything inside the parentheses, but not the parentheses themselves, and enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print "Enter run mode ()"

  • Typing “ca(” or “ca)” will delete everything inside the parentheses as well as the parentheses themselves, and then enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print "Enter run mode "

  • Typing “ci"” will delete everything inside the double-quotes, but not the double-quotes themselves, and enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print ""

  • Typing “ca"” will delete everything inside the double-quotes, as well as the double-quotes themselves (and the space before them), and enter insert mode:

    print "Enter run mode ('test', 'dry', or 'full')"

    print

The idiom is naturally extended to many types of different delimiting characters in the most intuitive way possible (i.e., simply by specifying the delimiting character or first character of the name of the text object):

  • “i{” or “i}” selects curly-brace surrounded text, with the greedier versions being “a{” and “a}”, respectively.
  • “i[” or “i]” selects square-bracket surrounded text, with the greedier versions being “a[” and “a]”, respectively.
  • “i'” selects single-quote surrounded text, with the greedier version being “a'”.
  • “it” selects the text within the surrounding HTML/XML tag or container, with the greedier version being “at”.
  • “iw” selects the surrounding word, with the greedier version being “aw”.
  • “is” selects the surrounding sentence, with the greedier version being “as”.
  • “ip” selects the surrounding paragraph, with the greedier version being “ap”.
  • etc.

Once you start using Vim's text objects, there really is no going back.
The power and precision they provide to specify regions of text to pass onto other Vim commands makes for an extremely efficient and elegant modus operandi, for which there simply is no equivalent in any other text editor.

For more information, please refer to the Vim documentation, either online, or by typing “:help text-objects” inside Vim.

Supplementary Command-History Logging in Bash: Tracking Working Directory, Dates, Times, etc.


Here is a way to create a secondary shell history log (i.e., one that supplements the primary “~/.bash_history”) that tracks a range of other information, such as the working directory, hostname, time and date, etc. Using the “HISTTIMEFORMAT” variable, it is in fact possible to store the time and date with the primary history, but storing the other information is not as readily doable. Here, I present an approach based on this excellent post on StackOverflow.

The main differences between this approach and the original are:

  • I remove the option to log the extra information to the primary history file: I prefer to keep this history clean.
  • I add history number, host name, time/date stamp etc. to the supplementary history log by default.
  • I add field separators, making it easy to apply “awk” commands.

The (Supplementary) History Logger Function

First, add or source the following to your “~/.bashrc“:


Activating the Logger

Then you need to set this function to execute on every command by adding it to your “$PROMPT_COMMAND” variable, so you need the following entry in your “~/.bashrc“:

    export PROMPT_COMMAND='_loghistory'

The logging function takes a number of options, including adding terminal information, adding arbitrary text, or executing one or more functions that generate appropriate text. See the function documentation for more info.
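
For orientation, here is a much-reduced sketch of what a logger of this kind might look like. It is not the function referred to above and takes none of its options. The names “_loghistory_sketch”, “_format_log_entry”, “BASH_LOG_SKETCH” and the two “_SKETCH_…” variables are hypothetical, and the line format is inferred from the sample “hh” output shown further down.

```shell
# Hypothetical sketch, NOT the logger function this post refers to.

# Format one log line as: [timestamp] ~~~ directory ~~~ command
_format_log_entry() {
    printf '[%s] ~~~ %s ~~~ %s\n' "$1" "$2" "$3"
}

_loghistory_sketch() {
    local logfile="${BASH_LOG_SKETCH:-$HOME/.bash_log}"
    local entry num cmd
    # Most recent history entry, without any HISTTIMEFORMAT timestamp.
    entry=$(HISTTIMEFORMAT='' builtin history 1)
    num=$(printf '%s\n' "$entry" | sed 's/^ *\([0-9][0-9]*\).*/\1/')
    cmd=$(printf '%s\n' "$entry" | sed 's/^ *[0-9][0-9]* *//')
    # Log nothing if no new command was entered since the last prompt.
    if [ -n "$cmd" ] && [ "$num" != "${_SKETCH_LAST_NUM:-}" ]; then
        # Use the directory seen at the previous prompt, i.e. the one
        # the command was typed in, falling back to the current one.
        _format_log_entry "$(date '+%Y-%m-%d %H:%M:%S')" \
            "${_SKETCH_LAST_PWD:-$PWD}" "$cmd" >> "$logfile"
    fi
    _SKETCH_LAST_NUM=$num
    _SKETCH_LAST_PWD=$PWD
}
```

To try the sketch, one would point “PROMPT_COMMAND” at “_loghistory_sketch” instead of “_loghistory”.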

Add Some Useful Aliases

Add the following to your “~/.bashrc“:

# dump regular history log
alias h='history'
# dump enhanced history log
alias hh="cat $HOME/.bash_log"
# dump history of directories visited
# (\$2 is escaped so that awk, not the shell, expands it)
alias histdirs="cat $HOME/.bash_log | awk -F ' ~~~ ' '{print \$2}' | uniq"

Check Out the Results!

The “histdirs” command is very useful for quickly listing, selecting (via copy and paste) and jumping back to a directory.

$ h
14095  [2011-11-23 15:36:20] ~~~ jd nuim
14096  [2011-11-23 15:36:21] ~~~ ll
14097  [2011-11-23 15:36:23] ~~~ git status
14098  [2011-11-23 15:36:33] ~~~ jd pytb
14099  [2011-11-23 15:36:36] ~~~ git status
14100  [2011-11-23 15:36:53] ~~~ git rm --cached config/*
14101  [2011-11-23 15:37:00] ~~~ git pull
14102  [2011-11-23 15:37:11] ~~~ e .gitignore
14103  [2011-11-23 15:37:28] ~~~ git status
14104  [2011-11-23 15:37:35] ~~~ e .gitignore
14105  [2011-11-23 15:37:44] ~~~ git status
14106  [2011-11-23 15:38:10] ~~~ git commit -a -m "stuff"
14107  [2011-11-23 15:38:12] ~~~ git pushall
14108  [2011-11-23 15:50:38] ~~~ ll build_c/
14109  [2011-11-23 15:53:16] ~~~ cd
14110  [2011-11-23 15:53:18] ~~~ ls -l
14111  [2011-11-23 16:00:12] ~~~ cd Documents/Projects/Phyloinformatics/DendroPy/dendropy
14112  [2011-11-23 16:00:15] ~~~ ls -l
14113  [2011-11-23 16:00:22] ~~~ cd dendropy/
14114  [2011-11-23 16:00:24] ~~~ vim *.py

$ hh
[2011-11-23 15:36:20] ~~~ /Users/jeet ~~~ jd nuim
[2011-11-23 15:36:21] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ ll
[2011-11-23 15:36:23] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ git status
[2011-11-23 15:36:33] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ jd pytb
[2011-11-23 15:36:36] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:36:53] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git rm --cached config/*
[2011-11-23 15:37:00] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git pull
[2011-11-23 15:37:11] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ e .gitignore
[2011-11-23 15:37:28] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:37:35] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ e .gitignore
[2011-11-23 15:37:44] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git status
[2011-11-23 15:38:10] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git commit -a -m "stuff"
[2011-11-23 15:38:12] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ git pushall
[2011-11-23 15:50:38] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ ll build_c/
[2011-11-23 15:53:16] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/pytbeaglehon ~~~ cd
[2011-11-23 15:53:18] ~~~ /Users/jeet ~~~ ls -l
[2011-11-23 16:00:12] ~~~ /Users/jeet ~~~ cd Documents/Projects/Phyloinformatics/DendroPy/dendropy
[2011-11-23 16:00:15] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy ~~~ ls -l
[2011-11-23 16:00:22] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy ~~~ cd dendropy/
[2011-11-23 16:00:24] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy ~~~ vim *.py

$ histdirs
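
The “histdirs” output is just the second “ ~~~ ”-separated field of each log line, with consecutive repeats collapsed. Here is a small sketch of the same awk idiom, run against three lines copied from the “hh” output above into a temporary file (the variable names are made up for the example):

```shell
# Sample log lines in the format of the supplementary history log.
log=$(mktemp)
cat > "$log" <<'EOF'
[2011-11-23 15:36:20] ~~~ /Users/jeet ~~~ jd nuim
[2011-11-23 15:36:21] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ ll
[2011-11-23 15:36:23] ~~~ /Users/jeet/Documents/Projects/Phyloinformatics/nuim ~~~ git status
EOF

# Field 2 is the working directory; uniq collapses consecutive repeats.
dirs=$(awk -F ' ~~~ ' '{print $2}' "$log" | uniq)
printf '%s\n' "$dirs"

# Field 3 is the command itself, e.g. count the git invocations.
git_count=$(awk -F ' ~~~ ' '$3 ~ /^git / {n++} END {print n+0}' "$log")
echo "git commands: $git_count"
```

Because the separator is unlikely to occur in a path or a command, the same pattern extends to filtering by date (field 1) or by directory (field 2).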

Stripping Paths from Files in TAR Archives

There is no way to get tar to ignore directory paths of files that it is archiving. So, for example, if you have a large number of files scattered about in subdirectories, there is no way to tell tar to archive all the files while ignoring their subdirectories, such that when unpacking the archive you extract all the files to the same location. You can, however, tell tar to strip a fixed number of elements from the full (relative) path to the file when extracting using the “--strip-components” option. For example:

tar --strip-components=2 -xvf archive.tar.gz

This will strip the first two elements of the paths of all the archived files. To get an idea of what this will look like before extracting, you can use the “-t” (“tabulate”, or list) option in conjunction with the “--show-transformed” option:

tar --strip-components=2 -t --show-transformed -f archive.tar.gz

The “--strip-components” approach only works if all the files that you are extracting are at the same relative depth. Files that are “shallower” will not be extracted, while files that are deeper will still be extracted to sub-directories. The only clean solution to this that I can think of would be to extract all the files to a temporary location and then move all the files to a single directory:

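mkdir /tmp/work
cd /tmp/work
tar -xvzf /path/to/archive.tar.gz
mkdir collected
# move every extracted file into collected/, skipping collected/ itself
# (files that share a name will overwrite one another)
find . -path ./collected -prune -o -type f -exec mv {} collected/ \;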
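
To see the depth behaviour of “--strip-components” for yourself, here is a small sketch that builds a scratch archive with files at three different depths and extracts it with two path elements stripped. The directory and file names are invented for the demonstration.

```shell
# Build a scratch archive with files at three different depths.
work=$(mktemp -d)
cd "$work"
mkdir -p src/a/b/d out
echo top  > src/a/top.txt        # 2 path elements: a/top.txt
echo one  > src/a/b/one.txt      # 3 path elements: a/b/one.txt
echo deep > src/a/b/d/deep.txt   # 4 path elements: a/b/d/deep.txt
tar -czf archive.tar.gz -C src a

# Strip the first two path elements while extracting into out/.
tar --strip-components=2 -xzf archive.tar.gz -C out

# List what actually landed in out/.
find out -type f | sort
```

Only “one.txt” ends up directly in “out/”: the shallower “top.txt” is skipped entirely, and the deeper “deep.txt” still lands in a sub-directory (“out/d/”).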