Holy crap locales!

Here’s something fun to try.  Create a text file that looks like this (Note: UTF-8 encoded!):

A
ß
C
ßa
ßz
a
B
b
c
S
s
SS
ss
SA
sa
SZ
sz

Just to be really clear, here are the exact bytes I’m talking about:

$ hexdump -C  /tmp/foo.txt 
00000000  41 0a c3 9f 0a 43 0a c3  9f 61 0a c3 9f 7a 0a 61  |A....C...a...z.a|
00000010  0a 42 0a 62 0a 63 0a 53  0a 73 0a 53 53 0a 73 73  |.B.b.c.S.s.SS.ss|
00000020  0a 53 41 0a 73 61 0a 53  5a 0a 73 7a 0a           |.SA.sa.SZ.sz.|
0000002d
$ md5sum /tmp/foo.txt
ac2be5e453dd79c070da74d0e67aa6b2  /tmp/foo.txt

Now, compare the output of the following commands:

$ sort /tmp/foo.txt
$ LC_ALL='en_US' sort /tmp/foo.txt
$ LC_ALL='en_US.utf8' sort /tmp/foo.txt
$ LC_ALL='en_US.iso88591' sort /tmp/foo.txt
$ LC_ALL='C' sort /tmp/foo.txt
$ LC_ALL='de_DE.utf8' sort /tmp/foo.txt

How’s that for rocking your world? So, the next time your friend says “hey, can you return those results sorted for me?” you’ll have something really fun to think about when you can’t sleep at night.
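
If you’d rather poke at this from code, here’s a minimal Python sketch of the same comparison (assuming the locales are installed on your system; locale.setlocale raises locale.Error for any that aren’t):

import locale

# The same strings as /tmp/foo.txt.
words = ["A", "ß", "C", "ßa", "ßz", "a", "B", "b", "c",
         "S", "s", "SS", "ss", "SA", "sa", "SZ", "sz"]

# Compare raw byte-order collation ("C") against an English UTF-8 locale.
for loc in ("C", "en_US.utf8"):
    locale.setlocale(locale.LC_COLLATE, loc)
    print(loc, sorted(words, key=locale.strxfrm))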

And just when you thought “Oh, well great, at least all the UTF-8 versions sort the same,” along comes this little gem:

$ LC_ALL="ja_JP.utf8" sort /tmp/foo.txt

Oh, and just when you thought “Well, I guess I’ll be OK with en_US.utf8, and at least English will sort the way I want worldwide!” along come your friends to the North with this awesome zinger:

$ LC_ALL="en_CA.utf8" sort /tmp/foo.txt

Programming challenge: Semi-sort a list of random numbers.

Here’s a programming challenge / interview question that I like to think about. It gives me that tingly feeling of “I think there’s a really clever, efficient algorithm for this,” but I haven’t been able to come up with a really clever answer yet.  Here’s the problem:

Given a file containing N random positive integers less than N, write a program that runs in O(N) time and produces a collection of files that together contain the input data, where each file is itself in strictly sorted order.

Give it a try and send me the code and we can compare algorithms.  I’ve been working with N=10000 and have a solution that produces about 200 unique sorted files, and can get as good as about 160 unique sorted files if I allow a fixed, constant-sized space overhead (i.e. a small internal buffer).

I like to think of this operation as “semi-sorting”.  The output is a collection of sorted files, which can be merged together by a traditional merge operation.
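
For a baseline, here’s my own greedy sketch (not the ~200-file solution above, and using in-memory lists where the problem says files): a patience-sorting-style pass that appends each value to the run whose tail is the largest value still strictly below it, and opens a new run when none qualifies. It’s O(N log K) for K output runs, so it doesn’t meet the O(N) requirement, but it’s a decent starting point to beat:

import heapq
import random
from bisect import bisect_left

def semi_sort(numbers):
    """Greedily partition numbers into strictly increasing runs."""
    runs = []    # runs[i] is a strictly increasing list of values
    tails = []   # tails[i] == runs[i][-1], kept in ascending order
    for x in numbers:
        i = bisect_left(tails, x) - 1   # largest tail strictly less than x
        if i >= 0:
            runs[i].append(x)
            tails[i] = x                # order preserved: old tails[i] < x <= next tail (if any)
        else:
            runs.insert(0, [x])         # no tail is below x; open a new run
            tails.insert(0, x)
    return runs

N = 10000
data = [random.randrange(N) for _ in range(N)]
runs = semi_sort(data)
print(len(runs), "strictly sorted runs")

# The "traditional merge operation": recombine the runs into one sorted whole.
assert list(heapq.merge(*runs)) == sorted(data)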

Real-world Python deployment using pip & virtualenv. (Outline/notes)

Introduction

You’ve got a great development setup and now you want to “do the right thing” in production.

  • You’re using virtualenv (good!)
  • You’re using “pip install…” for all your dependencies (good!)
  • You’re probably not keeping a requirements.txt up to date (that’s OK!)
  • You’re using “django-admin.py runserver” or similar (not gonna cut it!)
  • You’ve got all your source code in a git repo (self-hosted, github, or other; good!)
  • You’re ready to write your first fabric script! (good!)

Now let’s get that code out to production!

Goals

  1. Deploy a git repository to production.
  2. Let’s not use eggs for our own source.
    1. This is a debatable point, but for specific use cases like deploying a Django application, building and maintaining eggs is harder than it should be (mainly because of static resources)
    2. Deploying from a source directory is actually more straightforward.
    3. We’ll still use “python ./setup.py install” for our own code.
  3. Use virtualenv for environment management.
  4. Use pip to install dependencies.
  5. Reproducible deployment.
  6. No external network dependencies.
  7. Fast-ish deployment & dependency installation.
  8. Match development & production environments as closely as possible.

Pitfalls

  1. Think about security for just a couple seconds.
    1. ssh keys in production?
    2. Could an attacker gain access to your git repository?
  2. “pip install” is a heavyweight process.
    1. Goes out to pypi.python.org and fetches metadata.
    2. Fetches each package from its own hosting provider.
    3. These hosts go down.  Do you want YOUR deployment to depend on their servers being up?
  3. Cut your external network access and see what happens.
    1. Imagine if outbound network access from production was disallowed.  Could you still deploy?
  4. How do I roll back?
  5. How do I manage my system configuration?
    1. apache configs
    2. nginx configs
    3. gunicorn
    4. crontabs
    5. init scripts (start/stop, etc)
    6. supervisord configs

Solutions

  1. Security & key management
    1. Never ssh from production to anywhere.  Only ssh into production.
    2. Production machines should never have private keypairs.  (authorized_keys is OK)
  2. git access in production
    1. Use the “push-pull” strategy.
    2. Your development machine does “git push” into a bare git repo in production.
    3. The production machine then turns around and does “git pull” from its own local repository. (See the fabfile sketch after this outline.)
    4. Or, build some eggs. (This has its own issues that I won’t cover here)
    5. You can modify code in production, and commit it, but it won’t make it back to your repository unless you “git pull” from that repo.  This is a good thing.  Manage production customization in a reproducible way.
  3. users
    1. don’t deploy as root
  4. virtualenv & pip
    1. virtualenv is great!
    2. don’t rely on system packages!
    3. Make a new virtualenv every time you deploy!
    4. never run “sudo virtualenv…”
    5. never run “sudo pip…”
    6. PIP_DOWNLOAD_CACHE is NOT your friend.
      1. why with examples
    7. Solution: Separate “install this package” from “download this package”
      1. Step 1: pip install --no-install --use-mirrors -I --download=$CACHE_DIR …
      2. Step 2: pip install --no-index --index-url=file:///dev/null --find-links=file://$CACHE_DIR …
    8. Use some helper scripts to make this easier. (github link TBD)
    9. You’re still screwed sometimes.
      1. outline when TBD
    10. Automatic dependency downloads can still bite you
    11. Periodically re-download everything
      1. This makes sure you notice if dependencies change upstream.
    12. Managing package upgrades.
      1. When you want to upgrade, re-download the package and you’ll pick up the latest version.
      2. modify requirements.txt
  5. Managing your system configurations
    1. In principle: Make a “mock /etc” in your repo
    2. Copy “mock /etc” on top of the “system /etc” to install. (using fabric)
    3. A couple of other commands to enable system services (a2ensite and friends, etc.)
    4. supervisord, but it’s outside the scope of this talk
      1. system supervisord or a self-installed supervisord?
      2. How to start up supervisord?
      3. web interface to production or not?
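
To make the push-pull deploy concrete, here’s a minimal fabfile sketch in the fabric 1.x style. The host, the paths, and the Django example are all my own hypothetical names; this is a sketch of the flow above, not a production-ready script:

# fabfile.py: a minimal sketch of the push-pull deploy (all names hypothetical).
from fabric.api import cd, env, local, run, sudo

env.hosts = ["deploy@www.example.com"]   # a normal user; never deploy as root

BARE_REPO = "/srv/myapp.git"   # bare repo on the production box that dev pushes into
CHECKOUT = "/srv/myapp"        # working checkout that production actually runs
SDIST_DIR = "/srv/sdists"      # pre-downloaded sdist cache, kept under version control

def deploy():
    # 1. The dev machine pushes INTO production; production never sshes out.
    local("git push ssh://deploy@www.example.com%s master" % BARE_REPO)
    with cd(CHECKOUT):
        # 2. Production pulls from its own local bare repo.
        run("git pull %s master" % BARE_REPO)
        # 3. Fresh virtualenv on every deploy; no sudo pip, no system packages.
        run("virtualenv --clear ./env")
        # 4. Install dependencies from the local sdist cache, never the network.
        run("./env/bin/pip install -I --no-index --find-links=file://%s Django"
            % SDIST_DIR)
        # 5. Install our own code from the source checkout.
        run("./env/bin/python setup.py install")
    # 6. Copy the repo's "mock /etc" over the system /etc.
    sudo("cp -r %s/etc/* /etc/" % CHECKOUT)

Step 4 would really iterate over requirements.txt one package at a time, for the reasons covered in the pip notes further down.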

Apple vs. Samsung: The cost of Android fragmentation.

I’ve commented in the past about how Android fragmentation isn’t really as huge an issue as some developers have made it out to be.  But there is one elephant in the room that no one is talking about:

Android fragmentation and lack of updates increased damages in the Apple vs. Samsung lawsuit.

How?  Why?  Well, the sad truth of it is that Android is evolving over time to be “less infringing” on Apple patents.  One great example of this is the “bounce-back scrolling feature” patent.  This feature did not exist in Android 1.x, was implemented (poorly, I might add) in Android 2.x, and was then removed in Android 3.x and 4.x and replaced with a non-infringing color-overlay scrolling feedback mechanism.

So, if Samsung (or any other vendor) had been able to keep devices up to date more quickly, they would have been less liable to Apple for damages.  Similarly, if they had been quicker to adopt new Android versions (say, 4.0) then they would not be liable for any damages against this patent.

There are many other examples of how Google is evolving the Android user interface to NOT infringe on Apple patents.  This is the first case that I can think of where fragmentation (more specifically: a lack of keeping software versions up to date) has cost anyone real money.  By the books, it’s Samsung who’s paying, and maybe this means they’ll be more aggressive about keeping devices up to date.

I also hope Google sees this lesson, and helps the hardware manufacturers with hardware drivers and other issues that hold back software on many devices.

Amazed at how many different Mars gallery interfaces there are.

Here, have this pile of links:

I’m going to keep updating this list as I find more, so check back.  I’ve already added 3 new links since I first wrote this.

Supercharge your bash prompt with git status goodness.

Here’s a thought:

Wouldn’t it be awesome if your bash prompt could show you:

  • Your current working directory.
  • Which git repository you’re currently in.
  • Which git branch you’re currently on (if not master).
  • How many outstanding files you have (files that need to be added or committed).
  • How many changes ahead (or behind) origin/HEAD you currently are.
  • Your current virtualenv (for Python development, but doesn’t hurt other languages)

Well, all this is possible (and more, probably!).  I worked a bit on getting all these features working this afternoon.  The source code is pretty rough, but I think this could be useful enough for others that I should start to share it.  I’ll likely put this in its own github repository eventually.  But, for now, here’s a simple gist with my ~/.bash_prompt source.

To use this, just copy it to your home directory, and add the following to the bottom of your ~/.bashrc:

source ~/.bash_prompt
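
If you just want the flavor of how this works without reading the whole gist, here’s a tiny sketch of the idea (mine, not the gist’s actual contents) covering the virtualenv name, the git branch, and the outstanding-file count:

# A tiny sketch of the idea, not the actual ~/.bash_prompt from the gist.
__git_bits() {
    local branch dirty
    # "--short" needs git 1.7.10 or newer; fails quietly outside a repo
    # (or on a detached HEAD).
    branch=$(git symbolic-ref --short HEAD 2>/dev/null) || return
    dirty=$(git status --porcelain 2>/dev/null | wc -l)
    printf ' (%s: %s outstanding)' "$branch" "$dirty"
}
# Shows: (virtualenv) working-dir (branch: N outstanding) $
PS1='${VIRTUAL_ENV:+(${VIRTUAL_ENV##*/}) }\w$(__git_bits)\$ '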

Adding custom launchers to Gnome3’s Favorites.

This is totally non-obvious, so here goes.

At a shell prompt, run:

$ gnome-desktop-item-edit ~/.local/share/applications/mylauncher.desktop --create-new

Go through the dialog to create the launcher and make sure you give it an easy-to-remember name.  When you’re done, that application should show up under “Applications” in that search thing.
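
For reference, all that dialog does is write a small .desktop file; the result looks something like this (the Name and Exec values here are made up):

[Desktop Entry]
Type=Application
Name=My Launcher
Comment=An example custom launcher
Exec=/home/you/bin/mytool
Icon=utilities-terminal
Terminal=false

You can also create or edit these files by hand; Gnome should pick up changes to ~/.local/share/applications automatically.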

Installing pip dependencies without touching the ‘net.

@jacobian just tweeted:

A little shitty-wifi-inspired hack to make pip install not have to touch the ‘net at all: http://bit.ly/IRrRcn

Yeah, been there, done that (using PIP_DOWNLOAD_CACHE).  It’s a good idea, but pip itself has better support for doing this.  I learned this technique from the pip development team, specifically @carljm, over IRC and in some bug threads.

“pip install --no-install” first

Use an “sdist cache” and not PIP_DOWNLOAD_CACHE.  An “sdist cache” caches the actual distribution files (the sdists), not the “pip-ified” files from pypi.  Pick a directory to store these sdist files in.  From here on out, I’m going to assume you’re putting them in $SDIST_DIR, wherever you decide that should be.

If you’re adding a new dependency, and you want that dependency to be installable later without touching the ‘net, you need to download it first, and then install it from that download.  For example, if I wanted to include Django, I’d do this:

pip install --no-install --no-input --use-mirrors -I --download=$SDIST_DIR django

This will put a file named something like Django-1.4.tar.gz (note the nice filename!) into $SDIST_DIR.  You can then put $SDIST_DIR under version control.

“pip install --find-links” second

Then, you can install django (or any other dependency that you’ve previously downloaded) without touching the ‘net by executing:

pip install -I --find-links=file://$SDIST_DIR --no-index --index-url=file:///dev/null django

Use requirements.txt, but not like they taught you

Unfortunately, this technique breaks “pip install -r requirements.txt”.  (I don’t remember the exact details, but I do remember it’s broken.)  But the format of requirements.txt is simple enough that you can basically say:

for dependency in $(cat requirements.txt); do
    pip install -I --find-links=file://$SDIST_DIR --no-index --index-url=file:///dev/null $dependency
done

Just put this into a shell script to make your life easier, which leads us to…

Wrap it all up into a collection of shell scripts

Now that you know the general technique, you’ll need to wrap these two up into a couple different shell scripts.  Here’s what I do (without source — but I’ll share soon).

./add_dependency.sh: Download a new single dependency, per the pip line above, and then immediately install it.  This leaves a file in $SDIST_DIR, but that’s good, because it reminds me (via source control) that I’m out of sync with what everyone else thinks the dependencies are.

./download_all_dependencies.sh: Run “pip freeze” and download every package currently installed into the current virtualenv.  This is good because oftentimes “pip install foo” will download several dependencies, and the ./add_dependency.sh script above doesn’t properly handle those cases.  I think this is a bug in pip.

./install.sh: Take “requirements.txt” and process it line-by-line running the “install but don’t download” commandline from above.


Make your tests 7x faster in Django 1.4

In Django 1.4, the default password hasher has been switched to the extremely secure PBKDF2 algorithm.

But each PBKDF2 hash can take a pretty long time (on my system, about 150 ms per hash, and hashing happens once when a password is set and once when it’s checked, so twice per unit test that I’m writing). For your test cases (which probably create users and log them in and out), this extra security is pointless, and runtime is paramount.

So, create a custom settings.py for your test cases, and set the PASSWORD_HASHERS setting to exclude PBKDF2. You can also use this technique to set up an in-memory sqlite3 database as your backend, which also speeds things up quite a bit. Here’s a snippet from my settings_test.py:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': ':memory:',
        'USER': '',                      # Not used with sqlite3.
        'PASSWORD': '',                  # Not used with sqlite3.
        'HOST': '',                      # Not used with sqlite3.
        'PORT': '',                      # Not used with sqlite3.
    }
}

PASSWORD_HASHERS = (
    # 'django.contrib.auth.hashers.PBKDF2PasswordHasher',
    # 'django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher',
    # 'django.contrib.auth.hashers.BCryptPasswordHasher',
    'django.contrib.auth.hashers.SHA1PasswordHasher',
    'django.contrib.auth.hashers.MD5PasswordHasher',
    # 'django.contrib.auth.hashers.CryptPasswordHasher',
)

Note that you have to keep the SHA1 hasher enabled for the django.contrib.auth tests to pass. You can select the test settings with something like:

DJANGO_SETTINGS_MODULE='yourapp.settings_test' django-admin.py test

(probably putting this in a Makefile or shell alias), or I’ve also seen people put this at the bottom of their default settings, so the test values override the defaults:

import sys

if 'test' in sys.argv:
    from settings_test import *

These 2 changes took my test run time from 15.5 seconds down to 2.0 seconds, an improvement of about 7x! Woot!