Here’s a programming challenge / interview question that I like to think about, and gives me that tingly feeling of “I think there’s a really clever, efficient algorithm for this” but I haven’t been able to come up with a really clever answer yet. Here’s the problem:
Given a file containing N random positive integers less than N, write a program that runs on O(n) time and produces a collection of sorted files containing the input data, but where each file is itself in strictly sorted order.
Give it a try and send me the code and we can compare algorithms. I’ve been working with N=10000 and have a solution that produces about 200 unique sorted files, and can get as good as about 160 unique sorted files if I allow a fixed constant sized space usage (i.e. a small internal buffer).
I like to think of this operation as “semi-sorting”. The output is a collection of sorted files, which can be merged together by a traditional merge operation.
I’m going to be pedantic here and give you a solution that technically works but is pretty useless
Python is fun, I miss it sometimes.
#/usr/bin/python
import os
import random
output_dir = “/tmp/semisort”
size = 10000
input_list = map(lambda x: random.randrange(size), range(size))
# prepare the output dir
for path in os.listdir(output_dir):
os.remove(os.path.join(output_dir, path))
# “sort”
for x in input_list:
with open(os.path.join(output_dir, “%i.txt” % x), “a”) as f:
f.write(“%i\n” % x)
The old:
int x[N];
while read i from file {
x[i]++
}
for i = 0; i 0) {
print i
x[i]–;
}
}
uses too much memory or something?
Not the best formatting ever. Sigh. Maybe pre works.
while read i from file {
x[i]++;
}
for i from 0 to N {
while (x[i]) {
print i;
x[i]–;
}
}
Bucket sort is linear when your range is small, [0, N] in this case, but as soon as you’re constrained by memory it becomes O(N * ln(max of N)) where the base depends on your available RAM. ln(max of N) is basically ln(N)
When N is large we can make (m) passes over the input and consider only (1/m) of the range [0, N]. You would think that as long as “double[N/m]” fits in RAM you end up with your sorted files in linear time (and have the benefit of not needing to do a merge!) but your new O(N*m) running is no better because m=N/ln(max of N) and [I think] that’s still logarithmic. Maybe someone can correct me?
I did some interesting work in college (solutions looking for a problem) where the O(n*n) algorithm would be better vs. the traditional best algorithm. For example, in computer graphics you sort triangles by their distance from a camera, and during animation the order doesn’t change much. Insertion sort works better than quicksort once you have an *almost* sorted list. Fun stuff.