Filter for Unique Lines, Adjacent or Otherwise, While Preserving Original Order

There are two standard UNIX command-line utilities that help you filter input for unique lines: ‘uniq’ and ‘sort’. One gotcha with ‘uniq’ is that it only filters out adjacent duplicate lines. So if your input looks like:

apple
apple
apple
chicory
chicory
chicory
banana
banana

Then running ‘uniq’ on it will yield:

apple
chicory
banana

But if the input has non-adjacent duplicate lines:

apple
banana
banana
chicory
apple
banana
chicory
banana
banana
apple
apple
apple
banana
chicory

Then running ‘uniq’ leaves the non-adjacent repeats in place:

apple
banana
chicory
apple
banana
chicory
banana
apple
banana
chicory
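To reproduce this behaviour quickly, here is a minimal sketch that feeds a shortened version of the input above through ‘uniq’. The variable name `input` and the sample data are just for illustration. The standard `-c` flag prefixes each output line with the length of the run it collapsed, which makes it easier to see why repeated values survive.

```shell
# Hypothetical sample: a shortened version of the input above.
input=$(printf '%s\n' apple banana banana chicory apple banana)

# uniq only collapses runs of identical *adjacent* lines,
# so the later "apple" and "banana" are printed again.
printf '%s\n' "$input" | uniq

# -c prefixes each output line with the size of the run it replaced.
printf '%s\n' "$input" | uniq -c
```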

The traditional approach is to sort the input beforehand, i.e. ‘sort | uniq’, which results in:

apple
banana
chicory

Of course, if this is the strategy, then there is no reason not to call just ‘sort -u’ instead, which does the sorting and filtering in a single step and gives the same result as above. Furthermore, ‘sort’ is a considerably more flexible utility, with flags that allow you to specify fields and field delimiters.

All well and good ... if you do not care about the original order of the lines, or you actually want the lines sorted. But what happens if you want to filter input for unique lines, but retain the original order (i.e., not sort the lines)? Surprisingly, to me at least, neither ‘uniq’ nor ‘sort’ offers any way of doing this, as far as I can tell. I was about to hack out a Python script to achieve this, when a little bit of googling brought me to this Awk gem:

awk '{ if (!h[$0]) { print $0; h[$0]=1 } }'
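The one-liner works by keeping an associative array, `h`, keyed by the whole line (`$0`): a line is printed only the first time it is seen, and is then marked in the array. As a rough illustration, here it is run on a shortened sample (the sample data is made up for the demonstration), together with a terser form of the same idea that is commonly seen, where the array name `seen` is arbitrary.

```shell
# The filter from above, applied to an illustrative sample.
printf '%s\n' apple banana banana chicory apple banana chicory |
  awk '{ if (!h[$0]) { print $0; h[$0]=1 } }'
# prints:
# apple
# banana
# chicory

# Terser equivalent: the pattern is true only on the first sighting of
# a line, and awk's default action for a true pattern is to print it.
printf '%s\n' apple banana banana chicory apple banana chicory |
  awk '!seen[$0]++'
```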

Simply piping input to the above awk code will result in all duplicate lines, including non-adjacent ones, being filtered out, while preserving the original order of the lines. Effective, efficient, and elegant, while at the same time perhaps a little arcane. In other words: the classical UNIX way.

Incidentally, while one’s first instinct might be to map the above to an alias, that turned out to be awkward: the invocation contains single quotes, and you cannot escape these inside a single-quoted string (the alias definition has to be single-quoted because we do not want ‘$0’ expanded during the definition). It is simpler to define it as a function:

uniqx() { awk '{ if (!h[$0]) { print $0; h[$0]=1 } }'; }
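A usage sketch follows. The function is redefined here in multi-line form so that the snippet is self-contained; note that in the one-line form the shell needs a `;` (or a newline) before the closing brace. The `"$@"` is an addition of mine, not part of the original, so that file names can be passed through to awk as well as piped input.

```shell
# Multi-line form of the function. "$@" is an assumption added here
# so that file arguments are forwarded to awk; with no arguments,
# awk simply reads standard input.
uniqx() {
    awk '{ if (!h[$0]) { print $0; h[$0]=1 } }' "$@"
}

# Piped input: first occurrences are kept, in their original order.
printf '%s\n' chicory apple chicory banana apple | uniqx
# prints:
# chicory
# apple
# banana

# Sanity check: sorting the result gives the same lines as 'sort -u'.
printf '%s\n' chicory apple chicory banana apple | uniqx | sort
```

Because the whole awk program sits inside the function body, the quoting problem that gets in the way of an alias does not arise.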
