Not the best week at Google. On Wednesday they had an outage that lasted one hour, during which document lists, documents, drawings and Apps Scripts were inaccessible for the majority of users. Google's own staff use Google Docs every day, so they feel your pain and are very sorry.

So what happened? The outage was caused by a change designed to improve real-time collaboration within the document list. Unfortunately, this change exposed a memory management bug that was evident only under heavy usage.

Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Because of the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, so they eventually ran out of memory and restarted. While they restarted, their load was picked up by the remaining lookup machines, which made those machines run out of memory even faster. Eventually the servers couldn't properly process a large fraction of the requests to access document lists, documents, drawings, and scripts, which led to the outage you saw on Wednesday.
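To make the feedback loop concrete, here is a toy simulation of that kind of cascade. It is only a sketch under invented assumptions, not Google's lookup service: the function name `simulate`, the machine count, the memory limit, the restart time and the staggered starting memory are all hypothetical values chosen for illustration.

```python
def simulate(n_machines=5, total_load=100.0, leak_per_lookup=1.0,
             memory_limit=400.0, restart_ticks=3, ticks=30):
    """Return the number of live lookup machines at the start of each tick.

    All numbers are made up for illustration; only the shape of the
    behaviour (leak -> restart -> more load on the rest) mirrors the text.
    """
    # Machines start with different amounts of leaked memory (hypothetical
    # stagger), as if they had been running for different lengths of time.
    leaked = [i * 40.0 for i in range(n_machines)]
    down_for = [0] * n_machines  # ticks left until a restart completes
    live_history = []

    for _ in range(ticks):
        live = [i for i in range(n_machines) if down_for[i] == 0]
        live_history.append(len(live))

        # Machines that are mid-restart get one tick closer to coming back.
        for i in range(n_machines):
            if down_for[i] > 0:
                down_for[i] -= 1

        if not live:
            continue  # nobody can serve lookups during this tick

        # The full load is split across whichever machines are still up,
        # so each survivor does more lookups when others are restarting.
        share = total_load / len(live)
        for i in live:
            leaked[i] += share * leak_per_lookup
            if leaked[i] >= memory_limit:
                leaked[i] = 0.0            # restart frees the leaked memory
                down_for[i] = restart_ticks

    return live_history


if __name__ == "__main__":
    print("no leak:  ", simulate(leak_per_lookup=0.0))
    print("with leak:", simulate())
```

With the leak switched off, every machine stays up for the whole run. With the leak on, each restart raises the per-machine share for the survivors, so they reach the memory limit sooner than they otherwise would, which is the snowballing effect described above.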

The automated monitoring noticed that attempts to access documents were failing at an increased rate, and alerted Google 60 seconds after the failure rate increased sharply. The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, Google doubled the capacity of the lookup service to mitigate the impact of the memory management bug. The rollback completed 24 minutes later, and 5 minutes after that the outage was effectively over as the additional capacity restored normal function.
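Adding up the intervals quoted above gives the time from the first alert to recovery. The snippet below is just that arithmetic; the variable names are our own, and the comparison with the one-hour outage figure is a rough reading rather than anything Google stated.

```python
from datetime import timedelta

# Intervals as quoted in the timeline (variable names are illustrative).
alert_to_rollback_start = timedelta(minutes=23)
rollback_duration = timedelta(minutes=24)
rollback_done_to_recovery = timedelta(minutes=5)

alert_to_recovery = (alert_to_rollback_start
                     + rollback_duration
                     + rollback_done_to_recovery)

print(alert_to_recovery)  # 0:52:00
```

That is 52 minutes from the first alert to the end of the outage, which fits the roughly one-hour total once the time before the alert fired is included.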

Since resolution, they have been scrutinizing the timeline of this event and have assembled a list of steps that will reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope that any single problem can affect. They intend to take all these steps; some are not easy, but they're committed to keeping Google's services exceptionally reliable. In the meantime, rest assured that they take every outage very seriously, and as always they'll post a full incident report of what happened to the Apps Dashboard once the investigation is complete. Again, Google apologizes for the inconvenience and frustration that the outage has caused.