A fast(er) Python interleaver, and a simple parallel sorting script.

Take a look at this short python script:

#!/usr/bin/python

from sys import stdin, argv

outputs = []
numargs = 0
for arg in argv[1:]:
  outputs.append(open(arg, 'a'))
  numargs += 1

s = "str"
counter = 0
p = 0
while len(s) > 0:
  s = stdin.read(32 * 1024)
  p = s.rfind('\n') + 1
  if p > 0:
    outputs[counter % numargs].write(s[:p])
    counter += 1
    outputs[counter % numargs].write(s[p:])
  else:
    outputs[counter % numargs].write(s)
    counter += 1

It does the file interleaving task that I’ve talked about previously, and does it reasonably fast. Its a reasonable replacement for the C++ version, and can rip through a ~600MB file in about 0.75s.

I’ve been playing with a simple wrapper script for executing parallel sorts, like this:

#!/bin/bash
rm     a b c d e f g h i j
mkfifo a b c d e f g h i j

cat $1 | interleave.py a b c d &

sort -S 25% a -o e &
sort -S 25% b -o f &
sort -S 25% c -o g &
sort -S 25% d -o h &

sort -m e f -o i &
sort -m g h -o j &
sort -m i j

Which does a 4-way parallel sort, and then a 3-way parallel merge of the resultant sorted data. (Yes, doing a parallel merge actually makes things a little faster). It uses unix FIFOs instead of keeping the temporary files on disk, which saves a lot of I/O of writing and reading the temporary files.

This code can sort a ~600MB file in 25.1s, whereas a vanilla unix ’sort’ command on the same data takes 34.7s, so I’ve got a speedup of about 35%. That’s pretty good, when you think that the maximum possible speedup is 50%.

3 Responses to “A fast(er) Python interleaver, and a simple parallel sorting script.”

  1. Paddy3118 Says:

    Neat,
    Can’t wait to get delivery of a five year old works quad chip server so I can try things like this.

    Question: why interleave rather than first 25%, second 25% … into a,b,c,d?

    - Paddy.

  2. slacy Says:

    There’s actually a very interesting reason why you need to interleave and not just take the first 25%, second 25%, etc.

    I’m going to elaborate on this in a subsequent post, but the underlying reason is that you don’t get any parallelism due to the unbuffered nature of unix FIFOs. (In fact, it may even deadlock, but I’m not positive)

  3. Paddy3118 Says:

    Oh.

    By the way, I was thinking of using split as it is already available, and its option:

    -C, –line-bytes=SIZE put at most SIZE bytes of lines per output file

    - Paddy.

Leave a Reply