The second half of this Hudson-adoption case study sees the team working through some challenges and setbacks. Do they meet their goals? Find out how this virtualization journey ends.
In part 1 of this article (Virtual Hudson Continuous Build Environments: Out with the Old) I described the trials and tribulations of our Hudson build environment at my workplace. This environment started out as a simple system that could build and test our code in a few minutes. Over the years, the build time increased until we had to wait far too long for feedback from the system, and I wanted to solve this problem by trying a pool of virtualized build servers.
We have been using server virtualization around the office for about three years now. We’ve even had some virtualized servers in our production environment. This technology is great and works as advertised.
We decided to buy a single eight-core machine and split it into eight virtual build slaves. On paper, this seemed like a perfect solution to our problem, so it was surprising that we just couldn't get the money approved for it. Eight-core servers (two CPUs with four cores each) are standard and not that expensive right now (about $3,000), especially considering the cost of having highly paid engineers wait for a build. Still, the upgrade kept getting pushed to the back burner, until the problem resurfaced.
Here We Go Again
At that point, our main compile build was generating 738 MB of data. This build ran in isolation on the master server, as moving that much data across the wire back to the master from a slave would have added to the build time, which was already at fifteen minutes.
On August 2, the master started to crash. Lisa Crispin, our tester, sent an email to the team at 8 p.m. that said, “Hudson just start freaking out.” Our main Linux guy responded, “The server is seriously ill,” and included the following log information:
Aug 2 19:57:19 <syslog.err> hudson syslogd: /var/log/messages: Read-only file system
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid: aborting-216341995 cmd=2a <c=2 t=0 l=0>
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid abort: 216341995:19[255:128], fw owner
Aug 2 19:57:21 <kern.warn> hudson kernel: megaraid mbox: critical hardware error!
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset
Aug 2 19:57:21 <kern.err> hudson kernel: sd 0:2:0:0: timing out command, waited 360s
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error
Aug 2 19:57:24 <kern.err> hudson kernel: sd 0:2:0:0: rejecting I/O to offline device
Aug 2 19:57:24 <kern.crit> hudson kernel: EXT3-fs error (device dm-0): ext3_find_entry: reading directory #15958056 offset 0
I read the emails and knew we had just lost the disks. The array was hardware RAID 5, but that didn't save us. In the morning, our Unix guru tried to restart the box, but it did not work: the controller (a Dell PERC 4) just started to reinitialize the drives. We had officially lost our entire configuration.
We had an old Dell PE 850 powered off in the rack, and I decided to rebuild on that while the rest of the team sharpened their pitchforks. It took about a day just to get the compile build working again. This was a slower machine, so the build time went up to seventeen minutes, but at least the team put the pitchforks away.
Time to Implement Something New
Rebuilding everything took a long time, and we had some major software architecture changes in flight at the same time, which made it hard to tell whether a build was failing because of a new Hudson configuration issue or because of our code changes.
The good news was that this failure prompted management to approve not only our original request but also a new Hudson master to replace the failed box. After some debate and a lot of planning, we decided to make everything virtual, even the master, in order to guard against the next hardware failure, which we knew would come eventually. If the system crashed again, any virtual machines (VMs) on the crashed box could migrate to the working box. If we did this correctly, we would no longer have any downtime due to hardware failures.
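To make that idea concrete, here is a minimal sketch of what bringing a dead host's VMs back up on the survivor could look like. It assumes a libvirt-managed hypervisor (such as Xen or KVM) with the python-libvirt bindings, and that each VM's disks and domain XML live on shared storage reachable from both hosts; the host URI, VM names, and paths below are made up for illustration, not our actual configuration.

# Cold-failover sketch (assumption: libvirt-managed hosts, python-libvirt
# bindings, VM disks and domain XML on shared storage both hosts can reach).
import libvirt

BACKUP_HOST = "qemu+ssh://hudson-vhost2/system"  # hypothetical surviving host
SHARED_XML_DIR = "/mnt/san/domain-xml"           # hypothetical shared SAN path

def restart_on_backup(vm_name):
    """Define and boot a VM on the surviving host from its saved XML."""
    conn = libvirt.open(BACKUP_HOST)
    with open("%s/%s.xml" % (SHARED_XML_DIR, vm_name)) as f:
        xml = f.read()
    dom = conn.defineXML(xml)  # register the domain on the backup host
    dom.create()               # boot it; its disks are already on the SAN
    conn.close()

if __name__ == "__main__":
    # Hypothetical names for the master and slaves that lived on the dead box.
    for vm in ("hudson-master", "build-slave-1", "build-slave-2"):
        restart_on_backup(vm)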
The Dawn of a New Generation
Before I could commit 100 percent to the virtualization path, I needed performance data to back up the decision. Recall from part one that our precrash Hudson server could do the compile in fifteen minutes; the old, postcrash server could do it in seventeen. But I also needed to know how much overhead virtualization added. The following table shows the results of my performance testing:
Server | Time
Hudson precrash | 15 minutes
Hudson postcrash | 17 minutes
New eight-core server (non-virtualized) | 10 minutes
New eight-core server (virtualized) | 12 minutes
New eight-core server (virtualized, with iSCSI SAN for VM storage) | 13 minutes
The holy grail of virtualization, at least in my mind, is being able to move a VM from server to server without stopping it. To do this, you need some sort of shared storage between the virtualized hosts. The last entry in the table above is a virtualized host running its VMs on an iSCSI SAN. Considering what we gain from that, thirteen minutes is an awesome feat, and the overhead of virtualization is well worth it. We will be able to decrease our build time further by parallelizing the builds even more, and adding capacity is as simple as adding more virtual hosts.
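As a rough illustration of that no-downtime goal, the sketch below live-migrates a running build slave between two virtual hosts. Again, this assumes libvirt-managed hosts and the python-libvirt bindings, with the slave's disks already on the shared iSCSI SAN so that only memory and device state cross the wire; the connection URIs and VM name are hypothetical.

# Live-migration sketch (assumption: libvirt-managed hosts, python-libvirt
# bindings, and VM disks already on the shared iSCSI SAN).
import libvirt

SOURCE_URI = "qemu+ssh://hudson-vhost1/system"  # hypothetical source host
DEST_URI = "qemu+ssh://hudson-vhost2/system"    # hypothetical destination host

def live_migrate(vm_name):
    """Move a running VM to another host without stopping it."""
    src = libvirt.open(SOURCE_URI)
    dst = libvirt.open(DEST_URI)
    dom = src.lookupByName(vm_name)
    flags = (libvirt.VIR_MIGRATE_LIVE              # keep the guest running
             | libvirt.VIR_MIGRATE_PERSIST_DEST    # define it on the target
             | libvirt.VIR_MIGRATE_UNDEFINE_SOURCE)
    dom.migrate(dst, flags, None, None, 0)         # default name/URI, no bandwidth cap
    dst.close()
    src.close()

if __name__ == "__main__":
    live_migrate("build-slave-1")  # hypothetical slave name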
Conclusion
We didn't make our seven-minute build time goal, and I'm not sure we will ever see a time that short again. We probably could if we hadn't virtualized any of the build servers, but that is a price we are willing to pay for a more reliable build system. Overall, though, builds should come back faster, because the queue should no longer get as deep.
This solution is very effective at getting every single ounce of capacity out of a server (the bosses will like that). Even though we didn’t spend a lot of money on this system and it doesn’t have the fastest servers on the block, it is what we have for now and it works well.