Automation That Learns: Making Your Computer Work for You

Summary:

It's been suggested that because automation can only do checking, automation cannot learn. But if you're talking about the acquisition of knowledge through experience and study, Jeremy Carey-Dressler believes automation can, in fact, learn—with a tester adding some additional code to capture and analyze more available data.

I have heard it suggested that because automation can only do checking, automation cannot learn. By “learning,” I mean the dictionary definition: “the acquisition of knowledge or skills through experience, study, or by being taught.” If I claimed that the act of programming was teaching a machine, and that this was how automation learns, it would not be a particularly meaningful claim. Instead, what I am talking about is the acquisition of knowledge through experience and study. I believe automation can, in fact, learn, with a tester adding some additional code around the automation.

Recording Data

One high-level example of a computer learning is simply using a database. A computer can clearly “learn” through the experiences it has and record that information in a database. For example, the automation can log how long each test runs by capturing the run times. Each test execution becomes a new row, and over time the computer will “know” roughly how long a test should take. The automation can then detect whether a test is broken by comparing its execution time to the average. For that matter, the automation could also notice a test trending longer and longer and perhaps notify the team that the system is getting slower. Obviously these are not evaluations that replace human intelligence, but they are heuristics a human might care about.
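As a rough sketch of this idea in Python, the snippet below logs each run's duration to a local SQLite file and flags any run that takes more than twice the historical average. The database name, table layout, and two-times threshold are assumptions made up for the example, not a prescription.

import sqlite3
import time

conn = sqlite3.connect("test_history.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS test_runs (test_name TEXT, duration_sec REAL, run_at TEXT)"
)

def record_run(test_name, duration_sec):
    # Each execution becomes a new row, so history accumulates over time.
    conn.execute(
        "INSERT INTO test_runs VALUES (?, ?, datetime('now'))",
        (test_name, duration_sec),
    )
    conn.commit()

def is_suspiciously_slow(test_name, duration_sec, factor=2.0):
    # Flag a run that takes more than `factor` times the historical average.
    avg, count = conn.execute(
        "SELECT AVG(duration_sec), COUNT(*) FROM test_runs WHERE test_name = ?",
        (test_name,),
    ).fetchone()
    if count < 5:
        return False  # not enough history yet to judge
    return duration_sec > factor * avg

# Usage: time a test, warn if it is unusually slow, then record the run.
start = time.time()
# ... run the test here ...
elapsed = time.time() - start
if is_suspiciously_slow("checkout_smoke_test", elapsed):
    print("checkout_smoke_test is much slower than its historical average")
record_run("checkout_smoke_test", elapsed)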

In a real-world example, at my current company we have a load test that runs on a nightly basis. We run light load tests to see if the system has slowed down in any significant way. Currently a human must review the results, but we plan on adding a formula that decides whether the change in performance is drastic enough to warrant human review. The formula would take into account previous runs and decide whether the trend is statistically significant. Further into the future we want to tie this to builds so that we know which code changes are likely the cause of any slowdowns.
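As a sketch of what such a formula might look like, the Python below compares the latest run to the mean and standard deviation of earlier runs and flags anything more than three standard deviations slower. The three-sigma cutoff and the sample numbers are illustrative assumptions, not a recommendation.

from statistics import mean, stdev

def needs_human_review(previous_times, latest_time, sigmas=3.0):
    # With too little history, always hand the result to a person.
    if len(previous_times) < 5:
        return True
    avg = mean(previous_times)
    spread = stdev(previous_times)
    if spread == 0:
        return latest_time > avg
    # Flag the run when the slowdown is well outside the usual variation.
    return (latest_time - avg) / spread > sigmas

# Usage with average response times (in milliseconds) from earlier nightly runs.
history = [212.0, 220.5, 208.9, 215.3, 218.1, 211.7]
print(needs_human_review(history, 260.0))  # True: far outside the normal spread
print(needs_human_review(history, 216.0))  # False: within normal variation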

Complex search, such as what Google does, is an area that is difficult to test because of the lack of an oracle. One way to handle this problem is to take search results between runs and see how different the results are—this way you can judge if a new version of search is an improvement or not.

One way to do the validation is by looking at the frequency of the words you searched for. So, say you search for "Dog" and "Dog" appears twenty-five times in the top ten items that came back from your search. You can store this data, then on the next version of search try "Dog" again. With this, you can create a trend line to demonstrate whether the search is improving. Maybe the next time you search for "Dog" you find the word twenty-four times. While this test might look like a failure, it doesn't mean the search engine is worse overall. Rather than thinking of each search term as an individual test that passes or fails, you can consider the entire suite of tests as one large test. That is to say, given enough searches, you can demonstrate a trend showing whether the search feature, on average, provided better results to the user.
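A minimal sketch of that frequency check might look like the Python below. Here run_search is a hypothetical stand-in for whatever client calls the search service, and the results are assumed to come back as plain text strings.

import re
from collections import defaultdict

def term_frequency_in_top_results(term, results, top_n=10):
    # Count whole-word, case-insensitive occurrences of the term
    # in the text of the top N results.
    text = " ".join(results[:top_n]).lower()
    return len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", text))

# history[version] -> total hits across all search terms for that build
history = defaultdict(int)

def score_version(version, terms, run_search):
    for term in terms:
        results = run_search(term)  # hypothetical: returns a list of result strings
        history[version] += term_frequency_in_top_results(term, results)
    return history[version]

# Usage: run the same terms against two builds and compare the totals,
# treating the whole suite as one large test rather than pass/fail per term.
# score_version("v1", ["dog", "cat", "fish"], run_search)
# score_version("v2", ["dog", "cat", "fish"], run_search)
# improved = history["v2"] >= history["v1"]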

In a previous job, we had a system that would take a screenshot of any failed test. We would manually look at these for patterns before doing any serious triage because it often was the fastest way. As the number of tests increased, we noticed that many failures had similar screenshots, but the screenshots often were not exactly the same. There would be five tests that hit HTTP server errors, yet the automation could find nothing in common between them because the tests were executed in different browsers.

One of my brilliant coworkers came up with the idea of using the average luminosity (brightness) of the screenshot and sorting based on that. Luminosity worked because browsers generally render brightness the same way, while fonts often render differently. Sorting by the average luminosity, which is just a number, grouped all the similar images together, and it became apparent that those tests had failed for the same reason, regardless of browser. While we didn’t do this, the next step would have been to compare the resulting luminosity value to the luminosity of images from previous test runs. With some additional safety checks, the automation could then mark the test as “failed like previous run” and save a tester from having to reexamine the results.
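Here is a small sketch of the luminosity trick, assuming the Pillow imaging library is available and the failure screenshots sit in one folder as PNG files; the folder name is made up for the example.

from pathlib import Path
from PIL import Image, ImageStat

def average_luminosity(path):
    # Mean brightness (0-255) after converting the screenshot to grayscale.
    with Image.open(path) as img:
        return ImageStat.Stat(img.convert("L")).mean[0]

# Sorting by brightness groups near-identical error pages together,
# even when different browsers rendered the fonts differently.
screenshots = sorted(Path("failed_screenshots").glob("*.png"), key=average_luminosity)

for shot in screenshots:
    print(f"{average_luminosity(shot):7.2f}  {shot.name}")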

Learn at Runtime

There are other ways to leverage knowledge around testing. I cowrote a system that automatically did first-level triage on all the automated tests. We noticed there were patterns in our triage that occurred over and over again. We had an e-commerce website akin to Amazon that had a variety of interesting code paths. You might use a credit card, or you might use a gift card. You might order in English, or you might order in Portuguese. You might use the search page, the product details page, and the cart, but not check out. When we hand-triaged these tests, we would notice a pattern, like “All these tests are failing at search . . . better go test that.” It would take thirty minutes to notice the pattern and find the bug.

These were things the automation basically knew about each test, and it also knew what the ultimate test result was. So we developed a system where the automation would consider all the inputs and all the outputs and determine the most likely cause. Inputs included things like the name of the test and any parameters in our parameterized tests, such as payment method. Outputs included the pages visited, the exception type and message, and the page the test failed on. The system would then look at all the failures, decide what they had in common, and assign that as the most likely cause. This did not mean a tester didn’t have to manually review the results; it just took a lot less time.
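A much-simplified sketch of that triage idea in Python: gather the inputs and outputs the automation already knows about each failed test, then report the attribute value shared by the most failures as the likely common cause. The field names and sample data here are hypothetical.

from collections import Counter

# Hypothetical inputs and outputs the automation already knows per failed test.
failed_tests = [
    {"test": "buy_with_gift_card", "language": "en",
     "failed_on_page": "search", "exception": "ElementNotFound"},
    {"test": "buy_with_credit_card", "language": "pt",
     "failed_on_page": "search", "exception": "ElementNotFound"},
    {"test": "browse_product_details", "language": "en",
     "failed_on_page": "search", "exception": "Timeout"},
]

def likely_common_cause(failures, fields=("failed_on_page", "exception", "language")):
    # Tally every (field, value) pair across the failures and report the one
    # shared by the most tests as the probable common cause.
    counts = Counter()
    for failure in failures:
        for field in fields:
            counts[(field, failure[field])] += 1
    (field, value), hits = counts.most_common(1)[0]
    return f"{hits} of {len(failures)} failures share {field}={value}"

print(likely_common_cause(failed_tests))
# -> 3 of 3 failures share failed_on_page=search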

Summing It All Up

There is a lot of data that automation throws away or ignores, preventing it from learning. If you capture this sort of data when it makes sense, you can analyze it, and perhaps even have the computer check for things that are obviously wrong. Be careful, though—“smart” automation cannot completely replace the need for human intelligence. There are many other ways you can have your tools “learn” everything from production traffic patterns to which tests are flaky. How much of that you should automate depends on your environment and needs.

Finally, when you find yourself not getting enough value from automation, or you’re spending too much time on analysis work a computer could do, it might be best to see if you can get your computer to learn on its own.
