Testing an application's performance in the cloud begins with understanding the infrastructure of your public cloud environment. Learn which elements to watch for and how to optimize your cloud-based performance testing environment.
As more IT organizations become comfortable with the public cloud and look to leverage it for critical applications, performance testing engineers must be prepared to test effectively in the public cloud environment. As a performance testing engineer, you must understand that performance testing in the public cloud requires a detailed understanding of the cloud infrastructure and the effects that infrastructure can have on performance testing results. You should be aware of important infrastructure considerations that can significantly affect your load testing efforts while setting up your cloud-based load testing environment and interpreting the results of your efforts.
The more you know about the variables in a cloud-based infrastructure environment, the more confidence you can place in the performance metrics that your load testing tool produces, and the more information you will have when interpreting its results. At the core of a typical public cloud instance sit one or more physical servers, whose physical characteristics (server class, CPU, memory speed, disk speed, etc.) may or may not be known to you. Some public cloud vendors are not forthcoming about the hardware configuration that your instance runs on. Do not be afraid to contact the cloud vendor for this information.
Typically, you or the administrator will know the instance size or type. For example, on the cloud-based Amazon Web Services (AWS) platform, you have the option of several sizes. Some of the hardware-specific details for each size are published in the documentation, providing transparency to the physical characteristics of the machine that your instance sits upon. Just as with a non-cloud-based load testing environment, you need to be fully aware of the instance type and related characteristics that are targeted for your production environment. You must run the load tests with the appropriately matched instance type that is targeted for production. Not doing so is akin to testing one class of server in a typical non-cloud load environment while knowing that your production environment will have a completely different class of server.
The next critical component in the public cloud architecture stack is the virtualization software, which sits on top of the physical machine. You likely will not be aware of which virtualization product is in use, but the specific product is not important. What is important is that the cloud vendor uses a single product across its entire offering. Check with your vendor before making that assumption.
Virtualization software introduces overhead that you would not have in a typical on-premises load testing environment. Effectively, this reduces the processing power and throughput that the server can devote to your users (and therefore the number of users that it can handle simultaneously) by a modest amount. Some studies suggest the average reduction is about 10-15 percent, and my personal testing experience has supported that number. A key variable working in your favor is that the virtualization software is supported by the underlying hardware, and the two work together to make the virtualization experience as efficient as possible.
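To make that estimate concrete, the following minimal sketch applies an assumed overhead fraction to a capacity figure measured without virtualization. The function name and the 1,000-user starting point are illustrative assumptions; only the 10-15 percent range comes from the discussion above, and your own measurements should replace it.

```python
def effective_capacity(bare_metal_users: int, overhead: float) -> int:
    """Estimate how many concurrent users a virtualized server can handle.

    bare_metal_users: capacity measured without virtualization (assumed input).
    overhead: fraction of capacity lost to virtualization, e.g. 0.10 for 10 percent.
    """
    if not 0.0 <= overhead < 1.0:
        raise ValueError("overhead must be at least 0.0 and less than 1.0")
    return round(bare_metal_users * (1.0 - overhead))


# Hypothetical server that sustained 1,000 concurrent users on bare metal.
for overhead in (0.10, 0.15):
    users = effective_capacity(1000, overhead)
    print(f"{overhead:.0%} overhead: roughly {users} concurrent users")
```

Treat the result as a planning figure only. The actual overhead depends on the workload and the vendor's platform, which is why the load test itself remains the authority.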
The next important aspect you need to consider is that you may not know how many other users share your server. The whole purpose of the previously described virtualization layer is to allow the cloud vendor to “rent” space on a physical server to several customers. Obviously, this is a critical variable that you need to be aware of. Your challenge is to work out which physical infrastructure resources others may be sharing with you. The answer will be highly dependent on your particular vendor’s internal infrastructure, so you are going to need to do some research to get this information. For example, within the AWS cloud, shared resources include the network and the disk subsystem. This means that you will want to pay particular attention throughout your testing to the performance metrics that those resources affect, because other customers who run an instance on your server can reduce the availability of those resources and thereby degrade your users’ performance experience.
There are some protective mechanisms that you can put in place to limit the risk of excessive resource utilization by other customers’ instances. For example, AWS offers different levels of I/O, including a very high I/O option (10 Gigabit Ethernet) with its cluster computing instance option. While you may still share that I/O controller with other users, the increased bandwidth offers significantly more protection from typical spikes in activity. Additionally, the vendor may offer options that restrict the number of users sharing your server or cluster of servers, possibly including an option in which you are the exclusive user.
The key thing to be wary of is that the cloud is a highly shared platform, so your application’s performance will vary with the activity of other users sharing the underlying physical infrastructure. This demand will be largely unpredictable in your public cloud production environment and is an ongoing risk that must be mitigated.
Having an extensive amount of performance data in the form of a baseline from testing in the environment will allow you to protect against some of this risk. One of your most powerful tools is to run multiple longevity tests to establish this baseline. A longevity test is a series of tests performed at a steady and controlled load rate over an extended period of time (eight hours or more) that will allow you to closely monitor the behavior and performance data of your application. Although you are monitoring your application behavior and performance metrics during these tests, you are also indirectly monitoring the behavior of the other users who are sharing server resources with you. For example, if you have lengthy baseline data of a certain performance metric that suddenly spikes higher for a given period of time and then levels back off, you may have just indirectly measured the effect of another user taking up a significant amount of network bandwidth or briefly utilizing a large portion of the disk subsystem. As long as your application performance metrics have not exceeded their performance requirements during the longevity test, you can reasonably assume that, over time, this activity will average out and continue to result in acceptable performance of your application.
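The spike-then-recover pattern described above can be checked mechanically once a longevity run has been exported from your load testing tool. The sketch below is one possible approach, not a feature of any particular tool: it uses the median of the run as the baseline and flags samples above a multiple of it. The function names, the factor of 2, and the sample values are all assumptions chosen for illustration.

```python
from statistics import median


def find_spikes(samples, factor=2.0):
    """Return the indices of samples that exceed factor times the run's median.

    samples: metric values (for example, response times in ms) recorded at
    regular intervals during one longevity test.
    """
    baseline = median(samples)
    return [i for i, value in enumerate(samples) if value > factor * baseline]


def within_requirement(samples, limit):
    """True if no sample in the run exceeded the performance requirement."""
    return max(samples) <= limit


# Hypothetical response times (ms) with a brief spike in the middle of the run.
response_ms = [210, 205, 215, 208, 480, 470, 212, 209]

print("spike positions:", find_spikes(response_ms))
print("requirement of 500 ms met:", within_requirement(response_ms, limit=500))
```

A flagged interval does not prove that another tenant caused the spike. It only marks the periods worth comparing against your network and disk metrics for the same time window.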
Of course, you will not be able to definitively determine that any brief periods of degradation of your application under load are caused by another user’s spike in activity, but that is why you need to run many longevity tests and build up tens of hours of raw performance data over an extended period of time. With all of this data, you should be able to conclude that any net effect of other users’ activity on your shared cloud resources will be of minimal consequence to your application’s performance. The key here is having the collective visibility and analysis over an extended period of time from which to draw these conclusions.
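One simple way to turn many runs into a single conclusion is to pool the samples and measure how much of the total test time was spent above the requirement. The following sketch assumes that each run is a list of evenly spaced samples of the same metric; the function name and the numbers are hypothetical.

```python
def degraded_fraction(runs, limit):
    """Fraction of all samples, pooled across runs, that exceed the limit.

    runs: a list of longevity tests, each a list of metric samples taken at
    the same fixed interval, so that every sample represents equal time.
    """
    pooled = [value for run in runs for value in run]
    if not pooled:
        raise ValueError("no samples to analyze")
    return sum(value > limit for value in pooled) / len(pooled)


# Two hypothetical runs of response times (ms); the requirement is 500 ms.
runs = [
    [200, 210, 650, 205],
    [198, 202, 207, 211],
]
share = degraded_fraction(runs, limit=500)
print(f"{share:.1%} of sampled intervals exceeded the requirement")
```

The more hours of data you pool, the more weight this fraction can bear. A handful of runs is not enough to claim that the effect of other tenants is minimal.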
Every vendor will have different virtual instance types from which to choose, so research how your vendor defines its types. The instance types on AWS are further categorized into instance families, so there are two dimensions to consider. Grouping types into families allows for fine-grained selection of computing power and bandwidth that maximizes value. Since you only pay for what you use, it is in your best interest to choose an appropriate type for your needs rather than a larger, more costly size that will be underutilized. Performance differs from one family to the next, so the same load test can produce drastically different results depending on the family in which it runs.
As previously mentioned, there are also several instance types that provide varying levels of computing power. The types are available across each family, so you need to be certain that your load testing environment uses the same family and type combination that your production environment will use.
Of course, one benefit of a cloud-computing environment is that it is easy to change the configuration of the environment on demand. You can use this to your advantage by running your load tests across different instance families and types. This is a convenient, low-cost way of determining the optimal combination on which to host your production environment. In fact, your recommendation could end up saving your company a significant amount of money over time as the costs of combinations can differ drastically. Your test results can be used to optimize both performance and cost, making both your users and your management happy and further supporting the value and importance of your performance testing efforts.
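Once you have results for several combinations, choosing among them is a small filtering exercise: discard the combinations that miss the performance requirement, then take the cheapest of those that remain. The sketch below illustrates that idea. The family names, sizes, hourly costs, and response times are invented placeholders, not actual AWS instance names or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestResult:
    family: str             # hypothetical family label
    size: str               # hypothetical instance size
    hourly_cost: float      # placeholder cost per instance hour
    p95_response_ms: float  # 95th percentile response time under target load


def cheapest_passing(results, max_p95_ms):
    """Return the lowest-cost result that meets the requirement, or None."""
    passing = [r for r in results if r.p95_response_ms <= max_p95_ms]
    return min(passing, key=lambda r: r.hourly_cost) if passing else None


# Invented results from running the same load test on four combinations.
results = [
    LoadTestResult("general", "medium", 0.10, 420.0),
    LoadTestResult("general", "large", 0.20, 280.0),
    LoadTestResult("compute", "large", 0.17, 240.0),
    LoadTestResult("compute", "xlarge", 0.34, 150.0),
]

choice = cheapest_passing(results, max_p95_ms=300.0)
print("recommended combination:", choice)
```

A single percentile is a deliberately narrow criterion. In practice you would likely apply every requirement you test against, and base each figure on longevity runs rather than on one short test.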
I have only briefly touched upon a number of important infrastructure and architecture elements that can affect your ability to accurately measure the performance of your application in a public cloud. Depending on the unique architectural elements of your vendor, there will likely be other considerations that will affect your approach to performance and load testing. Be sure to review all of the vendor’s technical documentation and ask whether the vendor provides any documentation to help you develop a robust performance and load testing strategy. Some vendors may even offer access to solution engineers who can help you identify other critical infrastructure components to consider with your testing. Remember that the cloud vendor works for you and your company, so do not be afraid to demand as much information as is necessary for you to develop your load testing strategy. If things go wrong from a performance perspective in the public cloud with your application, ultimately the blame will be on your shoulders, so do your due diligence.