What Larry Page really needs to do to return Google to its startup roots

I worked at Google from 2005-2010, and saw the company go through many changes, and a huge increase in staff.  Most importantly, I saw the company go from a place where engineers were seen as violent disruptors and innovators, to a place where doing things “The Google Way” was king, and where thinking outside the box was discouraged and even chastised.  So, here’s a quick list of things I think Larry could do to bring the startup feel back to Google:

Let engineers do what they do best, and forget the rest.

This is probably the most important single point.  Engineers at Google spend way too much time fussing about with everything other than engineering and product design.  Focusing on shipping great, innovative products needs to be put before all else.  Here’s a quick rundown of engineering frustrations at Google when I left:

  • Compiling & fixing other people’s code. This is a huge problem for the C++ developers at Google.  They spend massive amounts of time compiling (and bug fixing) “the world” to make their project work.  This needs to end.  Put an end to source-code distributions for cross-team dependencies.  Make teams (bigtable, GFS, Stubby, Chubby, etc.) deliver binaries & headers in some reasonable format.
  • Machine Resource Requests for products in the “less than a petabyte” class. Just hand out the resources pro-bono, track usage, and if they exceed some very high limit, then start charging.  Why is this a struggle?
  • LCE & SRE “blockers”.  Having support for Launch Coordination & Site Reliability is great, but when these people say “you can’t launch unless…” then you know they’re being a hindrance, and not a help.
  • Meetings.  Seriously, people are drenched in “status update” and “team” meetings. If your company has to have “No meetings Thursday” then you’re doing it wrong. How about “No meetings except for Thursday”.  That would make for a productive engineering team, not the other way around.
  • Weekly Snippets, perf, etc. I was continually amazed by the amount of “extra cruft work” that goes on.  I know it sounds important, but engineers should be coding & designing.
  • Perf, Interviews & lengthy interview feedback. The old fashioned model of getting together in a room to discuss a candidate is way more efficient.  Make sure that every single engineer in the building is participating in the interview process to spread the load more evenly.  Don’t let the internal recruiters pick engineers for interviews, as they have favoritism and are improperly motivated.   Limit to 1 interview per week, maximum.   Make a simple system for “I can’t make this interview” and “I think this resume looks shitty and don’t want to talk to this candidate.”
  • Discouragement of open source software. There is so much innovation going on in the open source world: Hadoop, MongoDB, Redis, Cassandra, memcached, Ruby on Rails, Django, Tornado (web framework), and many, many other products put Google infrastructure to shame when it comes to ease-of-use and product focus.  Engineers are discouraged from using these systems, to the point where they’re chastised for even thinking of using anything other than Bigtable/Spanner and GFS/Colossus for their products.

Get rid of the proprietary cluster management system.

Yes, seriously.  What they have is a glorified batch-scheduling system that makes modern datacenters feel like antiquated mainframes.  Dedicated machines and resources are what startups have, so give them to your best engineers, and they’ll do great things. You should have learned this from the teragoogle team.  Start building a better, Virtual Machine based system where engineers can own & manage machine images themselves, all the way down to the operating system, dependencies, etc.  If more structure is needed, use existing open source packages or develop new systems in house, and open source those.  Build new “non-standard” data centers that don’t use the old system, and that every engineer can use.

The cluster management system’s fatal flaw is that it requires too large an ecosystem, and pigeonholes running jobs into a far too restrictive container.  It doesn’t allow persistent local disk storage, since jobs can be terminated and relocated at any time.  Services running there are then cajoled into using Bigtable and/or Colossus for their persistent storage, which rules out virtually all other external database systems (MySQL, etc.).   This is an antiquated and overly constrained model for job allocation.

Switch to team-based distributed source control.

Teams or large related teams should manage their own source code.  Provide git-based hosting, and nothing else.  Cross-team deliverables should be done at the binary release level, not at the source code level.  Hard Makefile-type dependencies between teams need to be abolished.

Be the Bazaar, not the Cathedral.

Rethink the “lots of redundant, unreliable hardware” mantra.

Having to launch a simple service in multiple datacenters around the world, and having to deal with near-weekly datacenter maintenance shutdowns, is unacceptable for an agile startup.  Startups need to focus on product, not process and infrastructure.  One persistent Amazon EC2 instance is much more valuable than 100 batch-scheduled jobs in a cluster that goes down for maintenance every week. Stop doing that.

Eliminate NIH-syndrome

Google has a very, very strong NIH (Not Invented Here) syndrome.  Alternate solutions (Hadoop, MongoDB, Redis, Cassandra, MySQL, RabbitMQ, etc.) are all seen as technically inferior and poorly engineered systems.  Google needs to get off its high horse, and look at what’s happening outside of its organization.  Hugely scalable services like Twitter are built on an almost entirely open source stack, and they’re doing it really efficiently.  Open source solutions have a product-focus that’s missing from much of Google’s infrastructure-for-infrastructure’s-sake engineering endeavors.  Focusing on the product first, and using any available solution, is the agile way to experiment in new spaces.

Additionally, to eliminate the NIH syndrome, Google needs to allow these open systems into its production environment.  Amazon and RackSpace have nailed this with reliable, virtual hosting solutions, and this is allowing services built on those platforms to be portable, efficient, and agile.

Remember that small, special-purpose is more agile than big, general-purpose.

Google is famously good at building huge pieces of infrastructure that solve big, important problems. GFS & Colossus for file storage, Bigtable, Blobstore and Spanner for structured data storage, Caffeine for document storage and indexing.

But, when faced with a new problem or new requirements, projects are expected to pigeonhole their needs into one of these systems, or be chastised for “doing it wrong”. Additionally, when your application needs inevitably don’t fit or grow out of existing infrastructure capabilities, requests for improvement or enhancement are lost in the noise. This means small teams are crippled by the lack of agility of these monstrous systems.

Google’s engineers need to think & act like startup founders. Only develop what’s absolutely necessary to get your job done. Simplicity counts. Complex systems are hard to learn, debug, and maintain. Keep it small and focused.

Implement an in-house incubator.

Do this right now.  When a current employee comes to you and submits their resignation letter, and says they’re joining a startup, you should immediately respond with “Oh! Well, let me tell you about our in-house startup incubator…”

Put smart people together in a room, let them think freely about products and infrastructure, and good things will come of it.  In fact, I might argue that every Staff level engineer or higher should “go on sabbatical” to the in-house incubator for a minimum of 6 months.   Rotate people in & out, and let them bring their incubator learnings back on to the main campus.  Have one incubator per geography, at a minimum, possibly more.  Let people choose their best friends/coworkers, and go off and do something great for 6 months.  No managers, no meetings, no supervision.

Make it very clear that good, small ideas matter.

This is so important.  One of the things I heard over and over was “If your product isn’t a billion-dollar idea, then it’s not worth Google’s time.”  This message sucks.  What you’re saying is “your great idea that might make millions per year is less important than a small tweak to ads or search”.  Even if it’s true, you need to foster innovation of much lesser initial impact.

Google acquisitions of companies in the $5-50mm range means that at some level, small businesses are valued.  Make this very clear.  It sucks to have someone say “your $5mm idea isn’t big enough” on the one hand, and then watch Google buy up companies for $5mm each. This is bad precedent.

Eliminate internal language and framework cronyism.

By this, I mean: “Stop forcing people to do things The Google Way”.   Several times I saw “unGoogly” system designs get shot down because they didn’t use Bigtable, GFS, Colossus, Spanner, MegaStore, BlobStore, or any of the other internal systems.

For example, languages like Python are frowned upon because they’re “too slow for web frontends”.  Let teams use whatever tools and languages they want, and are most efficient in. Don’t pass judgement on infrastructure, pass judgement on Products.  If someone launches a great system based on Oracle and a bunch of Perl CGI scripts running on Sun Sparc 5’s, then you should praise them. If they’re crushed under load, then praise them even more for their success.

Engineers at Google spend huge amounts of their time being forced to prematurely optimize their backend and frontend infrastructure.  Most of the time, this benefits no one, as small products never get big enough to need such heavyweight systems, and are bogged down with the cost of multiple redundancy, and by using poorly behaved internal APIs that don’t meet direct product needs.

Make a general purpose cloud for internal use.

Amazon EC2 is a better ecosystem for fast iteration and innovation than Google’s internal cluster management system.  EC2 gives me reliability, and an easy way to start and stop entire services, not just individual jobs.  Long-running processes and open source code are embraced by EC2; running a replicated sharded MongoDB instance on EC2 is almost a breeze.  Google should focus on making a system that works within the entire Open Source ecosystem.

Acknowledge that 20% time is a lie.

Virtually no one I knew in my entire career there had an effective use of 20% time.  There are stories about how some products are launched exclusively via 20% time, and I’ve seen people use their 20% time to effectively search for a new internal position, but for the vast majority of engineers, 20% time is a myth.

I think it’s a great idea, and it needs to be made effective.  1 day per week isn’t reasonable (you can’t get enough done in just one day and it’s hard to carry momentum).  1 week per month would be great, but doesn’t do justice to your “main” project.  Something needs to budge here, and engineers need to be encouraged to take large amounts of time exploring new ideas and new directions.  Really fostering internal tools and collaboration might be the right answer.  I’m not sure, maybe they should just give up on it and give everyone a 20% raise.  Oh wait, they did that already.

Repeat your mistakes.

Engineers learn by doing, and learn by making mistakes.  Having rules about system design puts unnecessary constraints on thinking and products.  Having internal lore around things like “Google will never let another thing like Orkut ever happen again” is blatantly wrong.  Orkut was (and still is) a huge success, period.  None of the infrastructure stuff matters.  Even recent mistakes (Wave, etc.) should be praised and engineers should be encouraged to repeat those mistakes.

“Google Scale” is a myth.

Yes, I said it.

Google Search (the product) requires vast resources.  Almost nothing else does, and yet everything else is constrained and forced to run “at Google scale” when it’s completely unnecessary.

Giving engineers the freedom to think & design out of the box with respect to infrastructure and systems means you’ll be more efficient in the long run.  Providing reliable platforms and data centers means you’ll have less redundancy, and be more efficient.

Given that a single machine can easily have 64GB of RAM, 10TB of disk, and 8 CPUs, it’s amazing that any product launch needs more than just a couple of that class of machine.  Let engineers push the boundaries, make mistakes, and run on the edge.

A small system that falls down under load is a huge success.

A large system that’s wasting resources and has only a few users is a huge failure.

129 thoughts on “What Larry Page really needs to do to return Google to its startup roots”

  1. Interesting stuff, I think Python may be used more now than it was while you still worked at Google?

    Just wanted to point out a typo:
    ‘“unGoogly” system desgins’ -> ‘“unGoogly” system designs’

    Cheers!

  2. Thanks for this interesting and comprehensive post. After working in a startup that did everything you asked for, I feel that the Google Way makes a lot of problems disappear that bogged us down. Especially scaling small MySQL installations, coping with stubborn engineers writing excruciatingly slow web frontends in Python etc.

    I do not agree that only search needs a huge infrastructure. What about maps, video, mail, books? In fact, every important Google product benefits from its infrastructure and interoperability. Everything else should be done in startups.

    Furthermore, I am thankful that Google does not run an incubator. Instead, they provide a huge service to the world by releasing trained engineers into the wild.

  3. Richard,

    Really? “It would be easy to implement a ‘search the Internet’ feature on Facebook“?

    Google wasn’t exactly the first company to build a search engine. Why is it that the world’s using Google today and not one of the plethora of search engines that were around before Google? How did Google come out on top?

    I can’t tell you what made up other people’s minds, but I can tell you what got me to switch: useful and relevant results. Web searching back in the day was a chore—you could either get a few results on a narrow range of topics from one site, or you could get a huge flood of mostly irrelevant and useless garbage from another. I’d sort of assumed things had to be this way, that search engines could index lots of pages or they could return quality results from a few pages.

    Google broke that assumption. Suddenly, I could search a huge sample of the Web and do less digging to find what I wanted. They’ve gotten better at it since then, too.

    If it were as easy as you seem to think it is to get this right, other search engines would’ve quickly countered Google’s moves early on and this newcomer would’ve never stood a chance.

    That “search the Internet” feature, from what I can gather, is actually kinda hard to get right.

  4. «Meetings. Seriously, people are drenched in “status update” and “team” meetings. If your company has to have “No meetings Thursday” then you’re doing it wrong. How about “No meetings except for Thursday”. That would make for a productive engineering team, not the other way around»

    Seriously, you’re my hero.

    Very true. If social tools to share status and ideas are there for us to use, why would you need to meet with people?

  5. Hi,

    Some of the things you said might be true, but in a very constrained environment. The way I see it, the people who make a choice to join a startup are really passionate about their work. They generally are partial entrepreneurs in themselves, who know how to utilize the freedom provided to them. All the things you said would be true if Google were a startup, because then it would have employees like that.

    Now, most people join Google because it’s ‘Google’. It gives you a fat salary, it is a market leader and you get a lot of other perks. These are good engineers, just that not all are passionate enough to do all the things you just said. Had Google been a startup today, it would take risks to let engineers go build a great product which is not reliable. But today, if Google launches something, 10 million people jump in within a couple of weeks (Google+). You can’t afford to make mistakes at such a huge scale, or else it will affect the name Google is today.

    Also, each of your bullet points is a change in policy that needs at least 3-5 years to be implemented in a company of ‘Google’s scale’. And if it does not work out, you don’t have a way to come back.

    I feel the way Google works today is that it gives engineers enough freedom to come up with ideas and work freely on whatever projects they work on. And that is still appreciable, because not many companies do that. But, at the same time, it keeps control over employees so that if someone is making a wrong decision, it does not affect too much.

    As far as running search on a couple of machines with huge resources is concerned, answer this – would you use Google search if it goes down just because someone spilled coffee over a transformer placed in Mt. View?

  6. I believe the biggest problem of Google is not the datacenters that look funny, from developer’s point of view, compared to, say, rackspace – but the googlers themselves. All this trouble with NIH, territorialism, SREs preventing a launch unless Larry or Marissa tells them to stop, all that greed and envy – they are just incurable. Just move on. No company was ever cured by nice people trying to do the right stuff.

  7. There are things that may be true, but in an organisation with over 30k people worldwide there should be some constraints to allow it to work properly.

    Brainstorming is important, but too much of it is counterproductive, hence I agree with materialising all the needs in one go.

    I do not agree with freedom in the wild. Giving people the chance to use whatever product suits them best is not the answer for making a better product. What if your code works and then needs to be re-written in a different language to make it usable and deployable in the company?
    And what about the maintenance point of view? The engineers that take care of the infrastructure would need to be aware of so many situations that sooner or later this will go wild.
    You can expect some agility of your own, like virtual machines, but not all the rest you said.

    Certainly, considering third-party products already on the market and making them better or simply using them, avoiding the NIH syndrome, is something they can think about, but I’m not a googler so I can’t tell you more.

  8. Well, great diatribe, but it all pre-supposes two very important things: one, that Google did anything earth-shattering, and two, that the so-called innovations were really innovations and not just re-worked decades-old ideas. Really, you think Google invented technology or something? Creating “big table” is not exactly earth-shattering, try maybe inventing the wheel like Codd did. But nice self-aggrandized rant anyway.
