Theoretical physicist Richard Feynman spoke of self-deception as a trap to be carefully avoided or—failing that—diligently escaped. He sought phenomena that would falsify his assertions about how the world works. We need this same spirit when we approach our software.
Testing never tells us the codebase is bug-free, but it can show where our bugs are not. Marry this fact with Feynman’s attitude and we can not only trust our software, but demonstrate that it is as good as we believe it to be.
This is true in any context, but when rewriting an existing system, the risk is particularly pernicious. We presume to understand the issues better because we can see the as-is system working and suppose the to-be system should do likewise.
Though I’ll use examples of legacy software rewrites, these principles are just as important when making systemic changes to fairly recent code. For example, consider a start-up’s snazzy web application that works fine but uses algorithms or components that don’t scale with the increased volume of business. The as-is system takes us where we want to go, but we need the to-be system to do the same thing more efficiently.
I recently encountered a codebase nearing its twentieth birthday. It isn’t bad code; it was built with the best tooling available at the time. But as time marched on, standards of user experience, adaptability, testing, and maintainability dramatically raised the bar, and the old tooling is incapable of meeting these new demands. Moreover, the as-is system was written without any unit tests. This is understandable, because test-driven development was not yet a thing back then, but today it’s a flaw.
Now the stakeholders have agreed to rewrite the software using contemporary tooling and current best practices. You may have anticipated the requirements: “Make it do everything the old system did, but make it new and shiny.”
Years before, I’d had occasion to write three different generations of the same basic financial calculation in three different languages. My office at the time looked like an episode of Hoarders, and I had retained the project’s years-old paperwork, so I dug through it to see what we’d been working with.
During that first rewrite I happened upon a decade-old file containing the original requirements and some calculations done by hand. This paperwork was priceless. I turned it into unit tests that I used for that rewrite and for the one after it. I got lucky.
Getting lucky is much likelier when we’re looking at a relatively recent system. The implementors of the as-is system may still be available to talk to, their memories are fresher, and if the paperwork survived, it isn’t covered with years of dust.
But suppose we’re looking at a system rewrite where the stakeholders have none of the engineering artifacts just described. What can we do?
I started by inquiring after any surviving acceptance tests of the old system. No joy. Then I asked if we had the original specifications. No, nor did we have any engineering change orders. How about an issue-tracking system describing bug fixes and feature requests? Nope.
What about the source code repository? If we could see which modules got checked in a lot, we could infer hotspots in the code where the original implementors had problems, and this is a good place to look for trouble in the new system. Some functionality is inherently tricky, and the first guy probably wasn’t an idiot, so we’re as likely to fall into the same pitfalls as the old system’s implementors. But my bad luck persisted when I learned that although the as-is system’s code had been migrated to the current source code repository, its modification history had not been retained.
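Had that history survived, a quick pass over the log would have surfaced those hotspots. Here’s a minimal sketch of the idea, assuming a Git repository and treating per-file commit counts as a rough proxy for churn:

```python
# Count how often each file has been modified; heavy churn hints at
# functionality the original implementors struggled with.
# Assumes a Git repository; the threshold of "top 20" is arbitrary.
import subprocess
from collections import Counter

def hotspots(repo_dir: str, top_n: int = 20):
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in log.splitlines() if line.strip())
    return counts.most_common(top_n)

if __name__ == "__main__":
    for path, n in hotspots("."):
        print(f"{n:5d}  {path}")
```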
This unhappy situation isn’t blameworthy or even surprising. Such artifacts have scarce value after a system goes into production. Documentation becomes obsolete—or even misleading—as the system changes and the docs aren’t updated to match.
If we can’t be lucky, we must build our own luck.
I got hold of a virtual machine running the twenty-year-old development environment. Once I was able to rebuild the as-is system, I started looking at how I could use it to generate baseline data. The as-is system never exposed a command line, but we can bolt a scripting interface onto it without otherwise touching the code and use that to automate data entry and button clicks for the scenarios stakeholders care about most.
With this, we can run the as-is system in a repeatable, automated fashion. Add some scripting to initialize a test environment before invoking the as-is system script, then save its outputs, and we’ll have a significant body of baseline data to feed the new system’s acceptance tests.
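To make this concrete, here’s a rough sketch of such a harness. The program name, script format, and directory layout are placeholders for whatever the bolted-on scripting interface actually provides:

```python
# Reset the test environment, drive the as-is system through a scripted
# scenario, and archive its outputs as baseline data for the new system's
# acceptance tests. All paths and command names are illustrative.
import shutil
import subprocess
from pathlib import Path

BASELINE_DIR = Path("baselines")

def capture_baseline(scenario: str) -> Path:
    workdir = Path("work") / scenario
    if workdir.exists():
        shutil.rmtree(workdir)                      # start from a known-clean state
    shutil.copytree(Path("fixtures") / scenario, workdir)

    # Hypothetical entry point for the scripting interface bolted onto
    # the legacy application.
    subprocess.run(["legacy_app", "--run-script", f"{scenario}.script"],
                   cwd=workdir, check=True)

    dest = BASELINE_DIR / scenario
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(workdir / "output", dest)       # keep the outputs as baseline
    return dest

if __name__ == "__main__":
    for scenario in ["month_end_close", "rate_change"]:
        print("captured", capture_baseline(scenario))
```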
This sounds more straightforward than it really is, because the new system isn’t about to pave the cow path. Though both old and new go from equivalent starting points to equivalent end points, they take different routes. And these equivalent points are not exactly the same.
The different systems may handle round-off error differently, we may find the outputs occur in a different sequence, or equivalent data may be represented differently. We shouldn’t fail a test because the old system says water freezes at 32 degrees Fahrenheit and the new system says it’s 0 degrees Celsius. Consequently, we can’t just diff two files; we have to consider the specifics of how they differ.
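In other words, the comparison has to understand equivalence, not equality. A small sketch using the freezing-point example; the units and tolerance here are only illustrations:

```python
# Compare equivalent values rather than identical bytes: convert both
# readings to a canonical unit before checking them.
def to_celsius(value: float, unit: str) -> float:
    return (value - 32.0) * 5.0 / 9.0 if unit == "F" else value

def temperatures_match(old, new, tolerance=0.01) -> bool:
    return abs(to_celsius(*old) - to_celsius(*new)) <= tolerance

# The old system reports Fahrenheit, the new one Celsius; still equivalent.
assert temperatures_match((32.0, "F"), (0.0, "C"))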
I once made the mistake of building a hundred unit tests that each ran the desired function, serialized a key data structure to disk, and compared it to a baseline file generated from manually verified good data. These tests would always fail on January 1 of the new year, or whenever an unrelated third-party library moved from version N to version N+1. I had forgotten that, along with the relevant portions of those data structures, I was also comparing a hundred irrelevancies. This broad-brush approach was easy to implement and hard to maintain.
A better way would have been to target only the data directly relevant to the concern of each unit test. Targeted asserts provide insight into why the test is failing. We knew what was significant when we wrote the test, but with time, memories fade.
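Here’s the kind of targeted assert I wish I’d written, with hypothetical field names standing in for whatever data the test actually cares about:

```python
# Compare only the fields this test is concerned with, instead of diffing
# a whole serialized structure. Field names are hypothetical.
RELEVANT_FIELDS = ("accrued_interest", "principal_balance")

def assert_matches_baseline(result: dict, baseline: dict) -> None:
    for field in RELEVANT_FIELDS:
        # Timestamps, library version strings, and other incidentals in
        # the result are deliberately ignored.
        assert result[field] == baseline[field], field

# The extra field neither matters nor breaks the test.
assert_matches_baseline(
    {"accrued_interest": 12.34, "principal_balance": 1000.0,
     "generated_at": "2024-01-01T00:00:00"},
    {"accrued_interest": 12.34, "principal_balance": 1000.0},
)
```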
In our rewrite, the as-is and to-be systems may work according to different paradigms, taking different approaches to do their work. Although both start from equivalent (not identical) points and proceed to equivalent destinations, we have to bridge the impedance mismatch between them. This means normalizing inputs, presenting them to the system under test in the form it expects, and normalizing the outputs we accept from it. It also means restricting our focus to only the concern of the current unit test.
Finally, it means comparing expected and actual values in a way that glosses over irrelevancies. We may not care whether one system or the other introduces a trailing space, changes letters from upper- to lowercase, or differs at the tenth place to the right of the decimal point.
The blueprints for a machined part specify the tolerances the part must satisfy. Likewise, when deciding whether we care about differences in certain data, we must verify our tolerances with the stakeholders. Too tight and we risk paving the cow path; too loose and we risk compromising fit and finish. We need to write our unit tests so they can be tuned to any desired tolerance.
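One way to do that is to pull the normalization and the tolerance out into one place where they can be adjusted as the stakeholders direct. A minimal sketch; the tolerance value and normalization rules are only placeholders:

```python
# Normalize away the irrelevancies (case, trailing whitespace, round-off)
# and make the numeric tolerance an explicit, stakeholder-approved knob.
APPROVED_TOLERANCE = 1e-6   # placeholder; agree on the real value with stakeholders

def normalize_text(s: str) -> str:
    return s.strip().lower()

def amounts_match(expected: float, actual: float,
                  tolerance: float = APPROVED_TOLERANCE) -> bool:
    return abs(expected - actual) <= tolerance

assert normalize_text("Frozen ") == normalize_text("frozen")
assert amounts_match(1234.5678901, 1234.5678902)
```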
In any system, there will be a multitude of data we might examine. Some will be clearly irrelevant, and others will be clearly vital. Between these poles we have to be smart about allocating limited resources to only those data that will provide the greatest insight into the new system’s fitness.
If we’re lucky, we can leverage legacy engineering documentation and expertise. Otherwise, we have to dive into the existing system’s outputs to seek patterns in the data that can guide a Pareto analysis. There will be a significant minority of those data that bear the most on the system’s correctness, and that’s where we will want to invest our resources.
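One way to find that significant minority is to look at how much each output field actually varies across the baseline scenarios: fields that never change tell us little, while the handful that vary a lot carry most of the information. A rough sketch, assuming one JSON record per baseline scenario:

```python
# Rank output fields by how many distinct values they take across the
# baseline scenarios; the few fields doing most of the work are the best
# candidates for targeted asserts. The file layout is an assumption.
import json
from collections import defaultdict
from pathlib import Path

def rank_fields(baseline_dir: str = "baselines"):
    values = defaultdict(set)
    for path in Path(baseline_dir).glob("*/output.json"):
        for field, value in json.loads(path.read_text()).items():
            values[field].add(repr(value))
    # More distinct values across scenarios -> more information per assert.
    return sorted(values.items(), key=lambda kv: len(kv[1]), reverse=True)

for field, seen in rank_fields():
    print(f"{len(seen):4d} distinct values  {field}")
```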
This sort of thinking should guide our testing efforts.
Richard Feynman once said, “The first principle is that you must not fool yourself—and you are the easiest person to fool.” We must identify how the new system will fool us into thinking it’s working as well as the old system. Then we must write the tests most likely to show how we’ve fooled ourselves.