Here’s a programming challenge / interview question that I like to think about, and gives me that tingly feeling of “I think there’s a really clever, efficient algorithm for this” but I haven’t been able to come up with a really clever answer yet. Here’s the problem:

Given a file containing N random positive integers less than N, write a program that runs on O(n) time and produces a collection of sorted files containing the input data, but where each file is itself in strictly sorted order.

Give it a try and send me the code and we can compare algorithms. I’ve been working with N=10000 and have a solution that produces about 200 unique sorted files, and can get as good as about 160 unique sorted files if I allow a fixed constant sized space usage (i.e. a small internal buffer).

I like to think of this operation as “semi-sorting”. The output is a collection of sorted files, which can be merged together by a traditional merge operation.

# prepare the output dir
for path in os.listdir(output_dir):
os.remove(os.path.join(output_dir, path))

# “sort”
for x in input_list:
with open(os.path.join(output_dir, “%i.txt” % x), “a”) as f:
f.write(“%i\n” % x)

The old:

int x[N];
while read i from file {
x[i]++
}
for i = 0; i 0) {
print i
x[i]–;
}
}

uses too much memory or something?

Not the best formatting ever. Sigh. Maybe pre works.

while read i from file {
x[i]++;
}

for i from 0 to N {
while (x[i]) {
print i;
x[i]–;
}
}

Bucket sort is linear when your range is small, [0, N] in this case, but as soon as you’re constrained by memory it becomes O(N * ln(max of N)) where the base depends on your available RAM. ln(max of N) is basically ln(N)

When N is large we can make (m) passes over the input and consider only (1/m) of the range [0, N]. You would think that as long as “double[N/m]” fits in RAM you end up with your sorted files in linear time (and have the benefit of not needing to do a merge!) but your new O(N*m) running is no better because m=N/ln(max of N) and [I think] that’s still logarithmic. Maybe someone can correct me?

I did some interesting work in college (solutions looking for a problem) where the O(n*n) algorithm would be better vs. the traditional best algorithm. For example, in computer graphics you sort triangles by their distance from a camera, and during animation the order doesn’t change much. Insertion sort works better than quicksort once you have an *almost* sorted list. Fun stuff.

I’m going to be pedantic here and give you a solution that technically works but is pretty useless Python is fun, I miss it sometimes.

#/usr/bin/python

import os

import random

output_dir = “/tmp/semisort”

size = 10000

input_list = map(lambda x: random.randrange(size), range(size))

# prepare the output dir

for path in os.listdir(output_dir):

os.remove(os.path.join(output_dir, path))

# “sort”

for x in input_list:

with open(os.path.join(output_dir, “%i.txt” % x), “a”) as f:

f.write(“%i\n” % x)

The old:

int x[N];

while read i from file {

x[i]++

}

for i = 0; i 0) {

print i

x[i]–;

}

}

uses too much memory or something?

Not the best formatting ever. Sigh. Maybe pre works.

while read i from file {

x[i]++;

}

for i from 0 to N {

while (x[i]) {

print i;

x[i]–;

}

}

Bucket sort is linear when your range is small, [0, N] in this case, but as soon as you’re constrained by memory it becomes O(N * ln(max of N)) where the base depends on your available RAM. ln(max of N) is basically ln(N)

When N is large we can make (m) passes over the input and consider only (1/m) of the range [0, N]. You would think that as long as “double[N/m]” fits in RAM you end up with your sorted files in linear time (and have the benefit of not needing to do a merge!) but your new O(N*m) running is no better because m=N/ln(max of N) and [I think] that’s still logarithmic. Maybe someone can correct me?

I did some interesting work in college (solutions looking for a problem) where the O(n*n) algorithm would be better vs. the traditional best algorithm. For example, in computer graphics you sort triangles by their distance from a camera, and during animation the order doesn’t change much. Insertion sort works better than quicksort once you have an *almost* sorted list. Fun stuff.