Updating kernels for Amazon (AWS) Instances

{% raw %}
Lately I've been getting hit by an odd kernel bug on one of the servers I maintain.  It's discussed in some depth on Launchpad, but the gist is that when trying to find an idle CPU for a new thread, the scheduler divides by cpu_power.  On EC2 instances, it's possible for cpu_power to be 0, which causes a divide-by-zero kernel crash.  Why cpu_power is 0 is still unknown; this should never be the case.  There is, however, a patch committed to ec2 kernel 2.6.32-313.26 which adds a check before running this division.  If cpu_power is 0, it waits a bit and tries again.  I believe it also logs a kernel warning, to help with tracking down the underlying problem.  In the meantime, this bug was crashing one of my instances.

So it's time to update the kernel.  This should be easy, right?  apt-get dist-upgrade seems to do it... but if you reboot, uname -r will tell you you're still on the old kernel.  That's when the swearing begins.
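A quick way to see the mismatch is to compare the running kernel with the newest image sitting in /boot.  This is only a sketch: the function names newest_kernel_in and check_running_kernel are made up for illustration, and it assumes kernel images are named vmlinuz-VERSION and that GNU sort (with -V) is available.

```shell
#!/bin/sh
# Print the highest kernel version that has a vmlinuz-* image in a directory.
# newest_kernel_in is a hypothetical helper, not part of any package.
newest_kernel_in() {
    dir="${1:-/boot}"
    for img in "$dir"/vmlinuz-*; do
        [ -e "$img" ] || continue          # the glob matched nothing
        basename "$img" | sed 's/^vmlinuz-//'
    done | sort -V | tail -n 1             # -V is a GNU version sort
}

# Report whether the running kernel matches the newest installed image.
check_running_kernel() {
    running="$(uname -r)"
    newest="$(newest_kernel_in "${1:-/boot}")"
    if [ "$running" = "$newest" ]; then
        echo "running newest kernel: $running"
    else
        echo "running $running, but newest in /boot is ${newest:-unknown}"
    fi
}
```

Run check_running_kernel after the reboot; if the two versions differ, the new kernel was installed but never booted.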

You see, cloud-based hosting is a very seductive thing.  Those servers LOOK like real servers.  They FEEL like real servers.  And in 99% of all ways they ACT like real servers.  But they are not real servers.  And that 1% of difference is where all of the pain comes in.  As an aside, succumbing to this illusion is setting yourself up for disaster with any cloud-based server.

One of the differences is that you don't have a real boot sector, or a real BIOS.  Everything you know about the first steps of booting a system has to go out the window.  Amazon uses Xen virtualization, which means that all that boot stuff you know should be replaced with PVGrub.  Actually, don't bother learning too much about it; we're just interested in upgrading our kernel.

Amazon documents this in a PDF, which is a bit painful to work through.  I'll walk you through my process here.

The first thing to know is that you need kernel support for PVGrub (or PV-Grub, because they're inconsistent with the naming).  Don't sweat this too much; PVGrub is supported in most big distros.  Amazon claims to have tested with these:

• Fedora 8-9 Xen kernels
• Fedora 13 (-147 and higher)
• SLES/openSUSE 10.x, 11.0, 11.1 Xen
• SLES/openSUSE 11.x EC2 Variant
• Ubuntu EC2 Variant kernels
• RHEL 5.x kernels
• RHEL 6.x kernels
• CentOS 5.x kernels
• Gentoo

Ubuntu, at least, is smart enough to install the EC2 variant kernels in that apt-get dist-upgrade you ran earlier, so that much is done for you already.

The next thing you need to know is that PVGrub has a strong historical connection to PXE booters.  If you haven't worked with network booting, that's what PXE does for you.  The main difference is that you have to go back to having separate initrd and vmlinuz images in your /boot directory.  To be honest, I never trusted booting without vmlinuz, so this made me a little happy.  It made me even happier to see that apt-get had dropped both images in there for me already.
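Before going further, it's worth confirming that both halves of the pair really are there for the version you plan to boot.  A minimal sketch, assuming the vmlinuz-VERSION and initrd.img-VERSION naming that Ubuntu uses; has_boot_pair is a hypothetical helper name:

```shell
#!/bin/sh
# Succeed only if both the kernel image and its initrd exist for a version.
# has_boot_pair is a hypothetical helper; the file naming is an assumption
# based on Ubuntu's /boot layout.
has_boot_pair() {
    version="$1"
    dir="${2:-/boot}"
    [ -f "$dir/vmlinuz-$version" ] && [ -f "$dir/initrd.img-$version" ]
}
```

For example, has_boot_pair 2.6.32-314-ec2 && echo "both images present".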

Now you have to create a grub boot list.  On a real server, this is the file which tells Grub which kernels are options for booting.  Create /boot/grub/menu.lst with the following content:

default 0
timeout 3
title EC2
root (hd0)
kernel /boot/vmlinuz-2.6.32-314-ec2 root=/dev/sda1
initrd /boot/initrd.img-2.6.32-314-ec2

Those are the vmlinuz and initrd filenames that I got from my update; since you're following this guide somewhere in the future, those version numbers will be different.  Have a quick look to see what you've got in /boot, and replace those filenames in your own menu.lst.
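If you'd rather not hand-edit the version numbers, the file can be generated.  This is a sketch under assumptions: render_menu_lst is a hypothetical helper, the root device defaults to /dev/sda1 as in my setup, and each GRUB directive goes on its own line, which is what legacy GRUB expects.

```shell
#!/bin/sh
# Print a menu.lst for one kernel version, one directive per line.
# render_menu_lst is a hypothetical helper; /dev/sda1 is only a default.
render_menu_lst() {
    version="$1"
    rootdev="${2:-/dev/sda1}"
    cat <<EOF
default 0
timeout 3
title EC2
root (hd0)
kernel /boot/vmlinuz-$version root=$rootdev
initrd /boot/initrd.img-$version
EOF
}
```

Run render_menu_lst 2.6.32-314-ec2 and read the output before redirecting it to /boot/grub/menu.lst.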

That's it - but in order to make this active, you have to bundle it as an AMI and start a new instance.  This actually enforces a very good, conservative workflow.  You get to run your updates, and test the kernel update on a non-live environment before moving your live IP over.

Because you're using a non-standard kernel, you do have to manually specify a few things in your AMI build process.  Here's the run-down:
# ec2-bundle-vol -r [ARCH] -d [DESTINATION] -p [AMI-NAME] -u [AWS-USERID] -k [KEYFILE] -c [CERTFILE] -s 10240 -e [EXCEPTIONS] --kernel [KERNEL-ID]
This is your standard ec2-bundle-vol command.  If you don't use this too often, here are the variables:

  • [ARCH] is the CPU architecture for your instance.  It's either i386 or x86_64.
  • [DESTINATION] is the temporary filespace to use when building the AMI.  Do not do this on the root partition.  I use /mnt, which is the 100 GB EBS volume provided with the instance.
  • [AMI-NAME] is the name for the AMI.  It can be anything; I used production-kernel-2.6.32-314-ec2.
  • [AWS-USERID] is your AWS user ID number.  You can find it in the top right corner of the AWS account page.
  • [KEYFILE] and [CERTFILE] are your AWS keyfile and certificate files, which are used to authenticate with AWS.
  • [EXCEPTIONS] is a comma-delimited list of directories to leave out of the image file.  In my case this is /mnt,/tmp,/ebs.  Note that if you leave /tmp out, the tempdir created automatically will have incorrect permissions.
[KERNEL-ID] doesn't fit into a bullet point.  This is an ID (the AKI) that identifies the architecture of the instance for Amazon's bootloader, and it's important.  Choose the right AKI depending on your CPU, instance type, and region:
32-bit Instance | 64-bit Instance | 32-bit EBS | 64-bit EBS

One important note: in Amazon's PDF documentation, they use an odd character set.  This means that the hyphen in the middle of the AKI is not a plain ASCII hyphen.  I spent more than an hour trying to figure out why my AMI was complaining of an invalid AKI over this!  Either copy and paste the ID from this blog post, or type it in by hand.
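A cheap sanity check would have saved me that hour.  The sketch below rejects anything that isn't plain ASCII in the expected shape; is_plain_aki is a hypothetical helper, the "aki-" plus eight hex digits pattern is my assumption about what the IDs look like, and aki-0123abcd is a dummy value, not a real kernel ID.

```shell
#!/bin/sh
# Succeed only if the argument is "aki-" followed by 8 lowercase hex digits,
# using an ASCII hyphen.  is_plain_aki is a hypothetical helper and the
# pattern is an assumption about the ID format.
is_plain_aki() {
    # LC_ALL=C makes grep compare bytes, so a multi-byte hyphen cannot match "-".
    printf '%s' "$1" | LC_ALL=C grep -Eq '^aki-[0-9a-f]{8}$'
}

# Dummy ID for illustration only.
if is_plain_aki "aki-0123abcd"; then
    echo "looks like a plain-ASCII AKI"
fi
```

Paste the ID you copied from the PDF in place of the dummy value; if the check fails, retype the hyphen by hand.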

From here on in, it's a standard AMI upload and register process:

# ec2-upload-bundle -b [BUCKET-LOCATION] -m [MANIFEST] -a [KEY] -s [SECRET KEY]

  • [BUCKET-LOCATION] is the name of an S3 bucket where you want to save this AMI.  If the bucket doesn't exist, ec2-upload-bundle will create it.
  • [MANIFEST] is the manifest file created with ec2-bundle-vol above.  It'll be in the same directory, with the same name, just with the extension .manifest.xml .  For example, /mnt/production-kernel-2.6.32-314-ec2.manifest.xml
  • [KEY] and [SECRET KEY] are your AWS key and secret key, which you can get from your AWS Account area.

# ec2-register --name [AMI NAME] [S3 MANIFEST]

  • [AMI NAME] is the name of your AMI (duh)
  • [S3 MANIFEST] is the path of the manifest file you just uploaded, including the bucket path, on S3.  For example prod-upgrade/production-kernel-2.6.32-314-ec2.manifest.xml
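Both manifest arguments follow the same pattern: a prefix (the bundle directory or the bucket), the AMI name, and the .manifest.xml extension.  A small sketch to keep them consistent; manifest_path is a hypothetical helper, and the values are the ones from my examples above.

```shell
#!/bin/sh
# Build PREFIX/AMI-NAME.manifest.xml, dropping one trailing slash from PREFIX.
# manifest_path is a hypothetical helper name.
manifest_path() {
    printf '%s/%s.manifest.xml\n' "${1%/}" "$2"
}

name="production-kernel-2.6.32-314-ec2"
manifest_path /mnt "$name"           # local path for ec2-upload-bundle -m
manifest_path prod-upgrade "$name"   # S3 path for ec2-register
```

The first line printed is the [MANIFEST] for ec2-upload-bundle, the second is the [S3 MANIFEST] for ec2-register.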
Now start your new AMI, and test like hell before replacing that production instance! 
{% endraw %}