Oh, The Huge Manatee

A blog about technology, open source, and the web... from someone who works with all three.

Drupalgeddon: Best Practices Aren't Good Enough Anymore

Last week’s Public Service Announcement from the Drupal security team caused a lot of attention. And rightfully so – it told us that the vast majority of Drupal 7 sites around the world are considered compromised. A mere 7 hours after critical security patch SA-CORE-2014-005 was released, robots were spotted in the wild, bulk-hacking Drupal 7 sites with this vulnerability. This is something that’s never happened to the Drupal community before, and it is extremely serious. In some way it’s our own version of Heartbleed and other highly-publicized critical vulnerabilities in open source software.

This issue should not reflect badly on the Drupal community, or the Drupal product at all. Vulnerabilities happen to every software project – particularly the large and complex ones like Drupal! In this case it was the result of a choice in the database abstraction layer to use emulated prepared statements. There’s a great dissection of the whole vulnerability at ircmaxell, but the point here is that it was an intentional decision. We were aware of a theoretical security risk, just as we are in making lots of decisions. But theoretical risks don’t mean much compared with real, measurable losses from the available alternatives. As I said before, this can happen to any software project, and Drupal is a relatively responsible, well written one. What’s interesting now, is the response.

First of all, I am amazed to read responses from many Drupal users who are panicked at having to run a diff of their sites, because they don’t have appropriate tools in place. If you are developing without a VCS and automated backups, you are doing more harm than good. Just stop. Take a week to learn the basic requirements of a development environment, and start employing them. End of story.

I’m not concerned for those people – their sites were disasters waiting for an excuse anyway. What’s frightening about this particular situation is that even if you are working with backups and a VCS, even if you patch critical security vulnerabilities on an aggressive schedule, it still wasn’t good enough.

All of my Drupal 7 sites are affected by this PSA. I work for a large, well-respected agency, with access to leading-edge workflows and tools. We follow best practices. All of my sites are secured well beyond PCI requirements. But PCI requirements say that critical security patches have to be applied within 30 days of release. Our best practices include patch review from the tech lead, and validating patches on test environments before pushing them live. With only 7 hours between patch release and exploits in the wild, there isn’t time for any of that.

I’ve heard people complain that it’s too difficult to update Drupal. “drush up —security-only” seems pretty simple to me, or at least simple enough that further simplification won’t address the real problem. That’s because the real problem isn’t that it’s difficult to apply updates – it’s that a human be ing has to initiate them. I live in the Central European timezone, GMT+6. The patch was released at 10PM for me, and bots were exploiting it by 5AM the following morning. I went to work that day and initiated the patching process, so that my patches could be “responsibly” deployed to live with 24-48 hours of client validation time on my dev and staging environments. Despite being relatively on top of patches and responding relatively quickly, the fact that I’m human, and my clients are human, meant we never stood a chance of patching this issue fast enough. Even if we skipped validation, and even if the update process was just one button (rather than two commands), we would still have failed to update in time. I find myself reminded of the Battlestar Galactica pilot, where the Cylon robots are chasing the humans. After each hyperspace jump, the humans have 33 minutes to complete the calculations for another jump before the machines catch up with them. After 130 hours and 237 jumps, it becomes apparent that the humans’ need for sleep is a critical vulnerability.

The only solution is automated patching. It’s hard to figure out a workflow that allows it; indeed you’re forced into post-hoc testing, which means engineering an easy rollback solution. The truth is that 99% of security patches will not affect any of the functionality you’ve customized or upon which you rely, so hopefully this will be an edge case. But it’s a problem that actually has to be addressed. Here’s how I’m adapting my own projects over the coming weeks:

Every wednesday, every hour, my Jenkins instance runs a script which checks modules and core for each project for security updates. When an update is available, it automatically creates a branch off of Stable (my staging branch), applies the updates, and pushes the result up to the server. My git scripts already create a new subdirectory environment for every pushed branch. Once the environment is ready, Jenkins runs all available behat tests against the new branch. If all tests pass, the branch is automatically merged back into Master, Stage, and Live, and pushed. This push operation triggers a normal Jenkins deployment, which takes a backup anyway. An email is generated to the project administrator advising them which security updates were automatically applied, and linking to the relevant changefiles.

I’m excited about implementing this new layer of automation, because it builds on the best practice workflows I already like (test driven development, git flow VCS organization, automated deployment and backups…) to produce a tangible time savings and security improvement for my sites. At the same time, I can’t say that this is something I recommend for EVERYONE, precisely because it requires such a high level of environment maintenance. When you’re a one-person development shop, it’s hard to afford the time to set up the perfect development environment. It’s hard to convince those bottom-of-the-food-chain clients to pay for things like automated testing and deployment. And certainly once you have those things set up, you don’t get paid for maintaining them!

I’m going to be keeping my eyes open for better solutions that can be applied by the “developer on the street.” Something relatively easy, but which allows the same kind of automated, fast response time for security patches. I’m interested in any ideas you want to post in the comments!