Linux commandline tips and tricks

This is my small but growing set of tips on Unix commandline tools.

Memory efficient "sort | uniq -c"

When you run a pipeline like

zcat -f * | cut -d' ' -f17 | sort | uniq -c | sort -nr

on a dataset of several hundred GB, your system will run out of memory.

The reason is that the initial 'sort' has to read the entire list into memory before it can start producing output for 'uniq -c' to count. The sort is only there because of the way 'uniq -c' does the counting: it counts runs of adjacent identical lines, so equal values must sit next to each other.
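
To see why the sort is needed, here is a small sketch on a made-up three-line input (the sample values and the variable names are mine, purely for illustration). Without the sort, 'uniq -c' reports the two "a" lines separately:

```shell
# uniq -c only merges runs of adjacent identical lines,
# so the two "a" lines are counted separately here.
unsorted=$(printf 'a\nb\na\n' | uniq -c)
echo "$unsorted"

# After sorting, equal lines sit next to each other and are counted together.
sorted=$(printf 'a\nb\na\n' | sort | uniq -c)
echo "$sorted"
```

The first command prints three lines with a count of 1 each; the second prints two lines, with a count of 2 for "a".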

What I wanted was a program that does this using an amount of RAM proportional to the size of the output (the number of distinct values) instead of the size of the input.

A friend of mine (Martijn Lievaart) suggested this little thingy that does exactly that:

perl -na -e '$seen{$F[0]}++;END{ print "$seen{$_}: $_\n" for keys %seen }'

I put this in a little script called sort_uniq_count for easier reuse (note that I'm NOT a Perl expert):

#!/usr/bin/perl -na
# -n loops over the input lines, -a splits each line into @F; count the first field.
$seen{$F[0]}++;
END { print "$seen{$_} $_\n" for keys %seen }

Example usage:

zcat -f * | cut -d' ' -f17 | sort_uniq_count | sort -nr