Saturday, May 28, 2016

Theory about Test Environments

Often my career has faced dealing with an arbitrary environment to test in. This environment preceded my arrival, and often was still there at my departure with many developers became fatalistic towards this arbitrary environment.  This is not good.

 

The Rhetorical Goal Recomposed

“We use our test environment to verify that our code changes will work as expected”

While this assures upper management, it lacks specifics to evaluate if the test environment is appropriate or complete. A more objective measurement would be:

  • The code changes perform as specified at the six-sigma level of certainty.

This then logically cascades into sub-measurements:

  • A1: The code changes perform as specified at the highest projected peak load for the next N year (typically 1-2) at the six-sigma level of certainty.
  • A2: The code changes perform as specified on a fresh created (perfect) environment  at the six-sigma level of certainty.
  • A3: The code changes perform as specified on a copy of production environment with random data at the six-sigma level of certainty.

The last one is actually the most critical because too often there is bad data from bad prior released code (which may have be rolled back – but the corrupted data remained!) . There is a corollary:

  • C1: The code changes do not need to perform as specified when the environment have had its data corrupted by arbitrary code and data changes that have not made it to production. In other words, ignore a corrupted test environment

 

Once thru is not enough!

Today’s systems are often multi-layers with timeouts, blockage under load and other things making the outcome not a certainty but a random event. Above, I cited six sigma – this is a classic level sought in quality assurance of mechanical processes.

 

“A six sigma process is one in which 99.99966% of all opportunities to produce some feature of a part are statistically expected to be free of defects (3.4 defective features per million opportunities).”

 

To translate this into a single test context – the test must run 1,000,000 times and fail less than4 times. Alternatively, 250,000 times with no failures.

 

Load testing to reach six-sigma

Load testing will often result in 250,000 calls being made. In some cases, it may mean that the load test may need to run for 24 hours instead of 1 hour. There are some common problem with many load tests:

  • The load test does not run on a full copy of the production environment – violates A3:
  • The same data is used time and again for the tests – thus A3: the use of random data fails.
    • If you have a system that has been running for 5 years, then the data should be selected based on user created data with 1/5 from each year
    • If the system has had N releases, then the data should be selected on user created data with 1/n from each release period

Proposal for a Conforming Pattern

Preliminary development (PD) is done on a virgin system each day. By virgin I mean that databases and other data stores are created from scripts and populated with perfect data. There may be super user data but no common user data.  This should be done by an automated process. I have seen this done in some firms and it has some real benefits:

  • Integration tests must create (instead of borrow) users
    • Integration tests are done immediately after build – the environment is confirmed before any developers arrive at work.
    • Images of this environment could be saved to allow faster restores.
  • Performance is good because the data store is small
  • A test environment is much smaller and can be easily (and cheaply) created on one or more cloud services or even VMs
  • Residue from bad code do not persist (often reducing triage time greatly) – when a developer realized they have accidentally jacked the data then they just blow away the environment and recreate it

After the virgin system is built, the developer’s “release folder scripts” are executed – for example, adding new tables, altering stored procedures, adding new data to system tables. Then the integration tests are executed again. Some tests may fail. A simple solution that I have seen is for these tests to call into the data store to get the version number and add an extension to NUnit that indicate that this test applies to before of after this version number. Tests can then be excluded that are expected to fail (and also identified for a new version to be written).

 

Integration development(ID) applies to the situation where there may be multiple teams working on stuff that will go out in a single release. Often it is more efficient to keep the teams in complete isolation for preliminary development – if there are complexities and side-effects than only one team suffers. A new environment is created then each teams’ “release folder scripts” are executed and tests are executed.

i.e. PD+PD+….+PD = ID

This keeps the number of moving code fragments controlled.

 

Scope of Testing in PD and ID

A2 level is as far as we can do in this environment. We cannot do A1 or A3.

 

SmokeTest development (STD) means that an image of the production data base is made available to the integration team and they can test the code changes using real data. Ideally, they should regress with users  created during each release period so artifact issues can be identified. This may be significant testing, but is not load testing because we do not push up to peak volumes.

Tests either creates a new user (in the case of PD and ID) or searches for a random user that was created in release cycle 456 in the case of STD. Of course, code like SELECT TOP 1 *… should not be used, rather all users retrieved and one randomly selected.

 

This gets us close to A3: if we do enough iterations.

 

Designing Unit Tests for multiple Test Environment

Designing a UserFactory with a signature such as

UserFactory.GetUser(UserAttributes[] requiredAttributes)

can simplify the development of unit tests that can be used across multiple environments. This UserFactory reads a configuration file which may have  properties such as

  • CreateNewUser=”true”
  • PickExistingUser=”ByCreateDate”
  • PickExistingUser=”ByReleaseDate”
  • PickExistingUser=”ByCreateDateMostInactive”

In the first case, a user is created with the desired attributes.  In other cases, the attributes are used to filter the production data to get a list of candidates to randomly pick from.

 

In stressing scenarios when we want to test for side-effects due to concurrent operation by the same user, then we could use the current second to select the same user for all tests starting in the current second.

 

Developers Hiding Significant Errors – Unintentional

At one firm, we successfully established the following guidance:

  • Fatal: When the unexpected happen – for example, the error that was thrown was not mapped to a known error response (i.e. Unexpected Server Error should not be returned)
  • Error: When an error happens that should not happen, i.e. try catch worked to recover the situation…. but…
  • Warning: When the error was caused by customer input. The input must be recorded into the log (less passwords). This typically indicates a defect in UI, training or child applications
  • Info: everything else, i.e. counts
  • Debug: what ever

We also implemented the ability to change the log4net settings on the fly – so we could, in production, get every message for a short period of time (massive logs)

Load Stress with Concurrency

Correct load testing is very challenging and requires significant design and statistics to do and validate the results.

 

One of the simplest implementation is to have a week old copy of the database, capture all of the web request traffic in the last week and do a play back in a reduced time period. With new functionality extending existing APIs then we are reasonably good – except we need to make sure that we reach six-sigma level – i.e.  was there at least 250,000 calls???  This can be further complicated if the existing system has a 0.1% error rate. A 0.1% error rate means 250 errors are expected on average, unfortunately this means that detecting a 1 error in 250,000 calls difference is impossible from a single run (or even a dozen runs). Often the first stage is to drive error rates down to near zero on the existing code base. I have personally (over several months) a 50K/day exception logging rate to less than 10. It can be done – just a lot of systematic slow work (and fighting to get these not business significant bug fixes into production). IMHO, they are business significant: they reduce triage time, false leads, bug reports, and thus customer experience with the application.

 

One of the issues is whether the 250,000 calls applies to the system as a whole – or just the method being added or modified? For true six-sigma, it needs to be the method modified – sorry! And if there are 250,000 different users (or other objects) to be tested, then random selection of test data is required.

 

I advocate the use of PNUnit (Parallel Nunit) on multiple machines with a slight twist. In the above UserFactory.Get() described above, we randomly select the user, but  for stress testing, we could use the seconds (long) and modular it with the number of candidate users and then execute the tests. This approach intentionally creates a situation where concurrent activity will generated, potentially creating blocks, deadlocks and inconsistencies.

 

There is a nasty problem with using integration tests mirroring the production distribution of calls. Marking tests appropriately may help, the test runner can them select the tests to simulate the actual production call distribution and rates. Of course, this means that there is data on the call rates and error rates from the production system.

 

Make sure that you are giving statistically correct reports!

 

The easy question to answer is “Does the new code make the error rate statistically worst?” Taking our example above of 0.1% error we had 250 errors being expected. If we want to have 95% confidence then we would need to see 325 errors to deem it to be worst. You must stop and think about this, because of the our stated goal was less than 1 error in 250,000 – and we ignore 75 more errors as not being significant!!! This is a very weak criteria. It also makes clear that driving down the back ground error rate is essential. You cannot get strong results with a high background error rate, you may only be able to demonstrate 1 sigma defect rate.

 

In short, you can rarely have a better sigma rate than your current rate unless you fix the current code base to have a lower sigma rate.

3 comments: