Programming challenge: Semi-sort a list of random numbers.

Here’s a programming challenge / interview question that I like to think about. It gives me that tingly feeling of “I think there’s a really clever, efficient algorithm for this,” but I haven’t been able to come up with a really clever answer yet.  Here’s the problem:

Given a file containing N random positive integers less than N, write a program that runs in O(N) time and produces a collection of files containing the input data, where each file is itself in strictly sorted order.

Give it a try and send me the code and we can compare algorithms.  I’ve been working with N=10000 and have a solution that produces about 200 unique sorted files, and can get as good as about 160 unique sorted files if I allow a fixed, constant-sized amount of extra space (i.e. a small internal buffer).

I like to think of this operation as “semi-sorting”.  The output is a collection of sorted files, which can be merged together by a traditional merge operation.
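The post doesn’t give its algorithm, but one greedy sketch in this spirit is a patience-sorting-style pass (all names below are mine, and runs are kept in memory rather than written to files): place each value on the run whose tail is the largest one still below it. On random input this yields roughly 2·√N strictly increasing runs, in the neighborhood of the ~200 files mentioned above. Note it runs in O(N·log R) time for R runs, not the O(N) the challenge asks for.

```python
import bisect
import random

def semisort(values):
    """Split values into strictly increasing runs: append each value to
    the run whose tail is the largest one still below it, or open a new
    run when every tail is >= the value (patience-sorting style)."""
    runs = []   # runs[k] is a strictly increasing list of values
    tails = []  # tails[k] == runs[k][-1]; maintained in sorted order
    for x in values:
        i = bisect.bisect_left(tails, x)  # tails[:i] are all < x
        if i == 0:
            runs.insert(0, [x])    # no run can take x; open a new one
            tails.insert(0, x)
        else:
            runs[i - 1].append(x)  # run with the largest tail below x
            tails[i - 1] = x       # stays sorted: old tail < x <= tails[i]
    return runs

data = [random.randrange(10000) for _ in range(10000)]
runs = semisort(data)
print(len(runs))  # typically a little over 200 for N=10000
```

Each output run can then be written to its own file and merged back together later, exactly as the post describes.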

4 thoughts on “Programming challenge: Semi-sort a list of random numbers.”

  1. I’m going to be pedantic here and give you a solution that technically works but is pretty useless :) Python is fun, I miss it sometimes.


    import os
    import random

    output_dir = "/tmp/semisort"
    size = 10000
    input_list = [random.randrange(size) for _ in range(size)]

    # prepare the output dir
    os.makedirs(output_dir, exist_ok=True)
    for path in os.listdir(output_dir):
        os.remove(os.path.join(output_dir, path))

    # "sort": append each value to a file named after it
    for x in input_list:
        with open(os.path.join(output_dir, "%i.txt" % x), "a") as f:
            f.write("%i\n" % x)

  2. The old:

    int x[N];
    while read i from file {
        x[i]++;
    }
    for (i = 0; i < N; i++) {
        while (x[i]-- > 0) {
            print i;
        }
    }
    uses too much memory or something?

  3. Not the best formatting ever. Sigh. Maybe pre works.

    while read i from file {
        x[i]++;
    }
    for i from 0 to N {
        while (x[i]--) {
            print i;
        }
    }
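For reference, the pseudocode above is counting sort; a runnable Python rendering (function name mine) that returns one fully sorted list rather than printing:

```python
import random

def counting_sort(values, n):
    """Sort values drawn from range(n) with one tally pass and one emit pass."""
    counts = [0] * n          # counts[v] = number of occurrences of v
    for v in values:
        counts[v] += 1
    out = []
    for i in range(n):        # emit each value counts[i] times, in order
        out.extend([i] * counts[i])
    return out

data = [random.randrange(100) for _ in range(1000)]
assert counting_sort(data, 100) == sorted(data)
```

Note the output is a single fully sorted sequence with duplicates, so this answers the sorting but not the strictly-sorted-files part of the challenge; the memory question raised in the previous comment is the real catch.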

  4. Bucket sort is linear when your range is small ([0, N] in this case), but as soon as you’re constrained by memory it becomes O(N * log(max of N)), where the log base depends on your available RAM. log(max of N) is basically log(N).

    When N is large we can make (m) passes over the input and consider only (1/m) of the range [0, N]. You would think that as long as “double[N/m]” fits in RAM you end up with your sorted files in linear time (and have the benefit of not needing to do a merge!), but your new O(N*m) running time is no better, because m=N/log(max of N) and [I think] that’s still logarithmic. Maybe someone can correct me?

    I did some interesting work in college (solutions looking for a problem) where the O(n*n) algorithm beats the traditionally best algorithm. For example, in computer graphics you sort triangles by their distance from a camera, and during animation the order doesn’t change much. Insertion sort works better than quicksort once you have an *almost* sorted list. Fun stuff.
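The m-pass idea in the last comment can be sketched as follows. This is a toy in-memory version (names mine); a real implementation would stream the input file on each pass and write each range slice out as its counts complete:

```python
import random

def multipass_counting(values, n, m):
    """Counting-sort the range [0, n) in m passes, each pass keeping
    only the counts for a 1/m slice of the range in memory."""
    chunk = (n + m - 1) // m      # slice width, rounded up
    out = []
    for p in range(m):
        lo, hi = p * chunk, min((p + 1) * chunk, n)
        counts = [0] * (hi - lo)  # counts for this slice only
        for v in values:          # one full pass over the input
            if lo <= v < hi:
                counts[v - lo] += 1
        for i, c in enumerate(counts):
            out.extend([lo + i] * c)
    return out

data = [random.randrange(1000) for _ in range(5000)]
assert multipass_counting(data, 1000, 4) == sorted(data)
```

As the comment says, the trade is m full scans of the input for a factor-of-m reduction in the working memory, i.e. O(N*m) total work.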
