In most traditional automation approaches, the tools and the scripts built on them require data in order to interact with the system under test and to check conditions on it. This data typically falls into one of two categories: configuration data or oracle data.
Configuration data is just that: data that is used to configure the tool, the scripts, and the system under test (SUT), such as street addresses, credit card information, and user IDs. In the setup of more complex SUTs, the configuration data is typically also more complex and more voluminous. For example, a hospital management system could require many different patient types with unique names, identification numbers, diagnosis codes, lengths of stay, and other patient demographic information.
In testing, an oracle can be broadly defined as a way to determine whether a test has passed or failed. Oracles play this key role in automation as well, where the data in the oracle indicates whether a particular assertion, or a script as a whole, is considered to have passed or failed. Though there are many types of oracles, the type that is of interest here is an oracle where we already have the expected result for specific inputs; based on Douglas Hoffman’s oracle descriptions in his article “Heuristic Test Oracles,” we’ll call this a sampling oracle.
As an example, let’s say we need to test the behavior of a sum function. The function takes two integers as input and returns the sum of those integers as a result. Now, consider the following table:
| Case | Input 1 | Input 2 | Expected Result |
|------|---------|---------|-----------------|
| 1    | 0       | 0       | 0               |
| 2    | 1       | 0       | 1               |
| 3    | 0       | 1       | 1               |
| 4    | 1       | 1       | 2               |
In the table, each pair of inputs has an expected result. If we are automating the checking of the previously described sum function, the tool or script could iterate through the input values, apply the sum function to the pair of inputs, and compare the actual result of the function call with the corresponding expected result. If any of the cases produces a result that does not match the expected result, that case is reported as a failure. In that context, the table above is being used as a sampling oracle.
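To make the mechanics concrete, here is a minimal sketch of that iteration in Python. The `sum_function` is a hypothetical stand-in for the real code under test; the data comes straight from the table above.

```python
# A minimal sketch of driving checks from a sampling oracle.
# `sum_function` is a hypothetical stand-in for the code under test.
def sum_function(a: int, b: int) -> int:
    return a + b

# Each row mirrors the table: (case, input 1, input 2, expected result).
oracle_table = [
    (1, 0, 0, 0),
    (2, 1, 0, 1),
    (3, 0, 1, 1),
    (4, 1, 1, 2),
]

for case, a, b, expected in oracle_table:
    actual = sum_function(a, b)
    status = "pass" if actual == expected else "FAIL"
    print(f"Case {case}: sum({a}, {b}) = {actual}, expected {expected}: {status}")
```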
Clearly, this is a simplistic and incomplete example. Most of the products and features we test would require far larger and more complex sampling oracles. As with configuration data, more complex applications tend to have more complex sampling oracles.
Reusing Data: Save Time, Cost, and Trouble
Many automation tools have a mechanism for storing data that will be used in their test scripts; typically, the specifics of this mechanism differ across tools, making each a proprietary mechanism. These proprietary mechanisms are convenient ways to store data that needs to be used in test scripts. In addition, these mechanisms are frequently used to store both configuration data and data to be used as oracles. The challenge, however, is that it can be difficult to use this data outside the tool itself. Even if there is a mechanism to import and export this data, converting that data into a format that’s appropriate for use in another tool is the user’s responsibility. The effort for this conversion can be significant.
These considerations lead us to the point of this article: the value of making your data sources reusable across tools, particularly if the stored data is large or complex.
How can we do that? We can use well-known, external data sources such as CSV, Excel, or databases, to name a few. These types of data sources are generally accessible from most tools and in most programming languages.
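As an illustration, suppose the sampling oracle above were stored in a CSV file. The file name and column headers below are assumptions for the sketch, but the reading code uses only Python’s standard csv module, and equivalent libraries exist in most languages and tools.

```python
# A minimal sketch of loading oracle data from an external CSV file.
# Assumes a file "sum_oracle.csv" with the header row:
#   case,input_1,input_2,expected_result
import csv

def load_oracle_rows(path):
    """Read each oracle row from the CSV, converting fields to integers."""
    with open(path, newline="") as f:
        return [
            {key: int(value) for key, value in row.items()}
            for row in csv.DictReader(f)
        ]

for row in load_oracle_rows("sum_oracle.csv"):
    print(row)
```

Because the data now lives outside any one tool, the same file can feed a different framework, or even a manual review, without conversion.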
Why? If data can’t be reused, it may have to be duplicated, completely or in part, across all tools in which it will be used.
Would I ever want to use data from one tool in another tool? Yes!
I once worked in an organization where we needed to expand the number of browsers we were testing with our automation. The tool we were using at the time did not support some of the browsers we needed, specifically Chrome. We decided to supplement our original tool with Selenium WebDriver. Fortunately, we had layered the automation framework such that none of the underlying tool’s specifics were exposed in the test scripts. This allowed us to supplement our existing automation capability without requiring changes to the test scripts. Additionally, we didn’t rely on a tool-proprietary technology for storing our automation-related data, so we didn’t have to undertake a duplication of that data.
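The layering idea looked conceptually like the following sketch. The class and method names are illustrative rather than our actual framework, and the Selenium call uses the Selenium 4 style of locating elements.

```python
# A minimal sketch of layering so test scripts never see tool specifics.
# All names here (BrowserDriver, SeleniumBrowserDriver) are illustrative.
from abc import ABC, abstractmethod

class BrowserDriver(ABC):
    """Tool-agnostic interface that test scripts code against."""

    @abstractmethod
    def open(self, url: str) -> None: ...

    @abstractmethod
    def click(self, css_locator: str) -> None: ...

class SeleniumBrowserDriver(BrowserDriver):
    """Adapter for Selenium WebDriver; only this layer knows the tool."""

    def __init__(self, webdriver):
        self._webdriver = webdriver

    def open(self, url: str) -> None:
        self._webdriver.get(url)

    def click(self, css_locator: str) -> None:
        self._webdriver.find_element("css selector", css_locator).click()
```

Supporting a second tool then means writing one more adapter, not rewriting the test scripts.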
If you have spent enough time working with automation, you have probably found that tool migrations are not uncommon; I’ve been through at least four of these events that I can recall. For one of these events, we were using some of our previously written scripts to help with release-night checking: during a release, the product is taken offline and upgraded, and automation is used to check for basic sanity. For certain business and technology reasons, we migrated those scripts from the tool we were using at the time to a different tool. Again, we had not used a tool-proprietary data source, so we did not have to migrate that data from one storage mechanism to another, thereby avoiding the cost of that activity.
I share these two examples to show two similar but distinct real-world situations where using an external, reusable data source allowed an organization to avoid the cost of migrating or duplicating existing data; duplication is the less desirable of the two because duplicated data can easily fall out of sync.
There are some ancillary benefits as well. By using external data sources, we can leverage the expertise of our team members who have extensive knowledge of our products but not of our tools. Instead of teaching them the data entry mechanism of one or more tools, these team members can use data sources they are already familiar with, such as Excel, CSV, and databases.
Deciding Whether to Use an External Data Source
Unfortunately, it’s not all benefits. By not using the tool-provided data sources, we can lose the previously mentioned conveniences provided by the tool. We may incur additional upfront work to provide programmatic access to the external data sources, as well as having to build the data sources themselves. External data sources also add moving parts to our automation ecosystem; if we fail to keep these additional parts in sync, we will have unreliable results from our automation runs.
As with any automation-related decision, deciding to use an external data source is a business decision informed by technical considerations. If there is little chance that we’ll need to share data across tools, or if our tool can export from its proprietary, tool-specific data store into an easily manipulated format, perhaps an external data source is not a good value. We should keep in mind, however, that we have a greater number of tooling options than ever, making it likely that we’ll use more than one tool in our automation ecosystem. It may be worth future-proofing our frameworks with access to external data sources.
User Comments
Nice article, Paul! Just one additional thought. I tend to stay away from "local" files or packages, like .CSV or Excel. When running tests in parallel, Excel is not very resource friendly. You get a new instance of Excel for each test. And it can be a challenge to deal with collisions when opening shared files. In addition, when suites scale and you start wanting to run on multiple machines (VMs?), you have to deploy those files as well, AND you may not have licenses for Excel on those other machines, either. When data is being shared between tests and frameworks, the most central place would likely be a database architected specifically for that data.
Thanks Bill!
I agree with your points on Excel, but I differ on CSV. I've seen teams do manageable things with CSV files and avoid the need for a database. Typically they handle the local vs. non-local aspect by storing the CSV files in the repository with the test scripts so they are version controlled and deployable.
Great article, Paul. In reusing data, have you ever thought of it as a canary in the coal mine that lets you understand the expertise of the tester? Think of three scenarios. Someone who uses data that touches the boundaries of the logic or the equivalence partitions is likely more technical and understands what exactly is being decisioned in the code. A tester who uses data that is very "real world" likely understands the end user and has a solid customer perspective. Most importantly, testers who are just using varying non-realistic numbers signal they are looking at a black box and do not have expertise on what the customer needs and expects from the technology.
Thanks, Mark!
I never consciously thought about assessing a tester's expertise based on how they approach test data. It is an interesting thought, though. I've always kind of taken a holistic view of a tester's (or automator's) skill.