Metric: Quality Confidence

The Metrics and Reporting View looks at the various feedback cycles, metrics and reports in Holistic Software Development.

Quality Confidence is a lead indicator for the quality of a software release based on the stability of test pass rates and requirements coverage. Quality Confidence can be implemented at any level of the requirements stack mapped to a definition of done.

Is the output of good enough quality?

Quality confidence combines a number of standard questions together in a single simple measure of the current quality of the release (at any level of integration) before it’s been released. This metric answers the questions:

  • How much test coverage have we got?
  • What’s the current pass rate?
  • How stable are the test results?

Quality Confidence is 100% if all of the in scope requirements have got test coverage and all of those tests have passed for the last few test runs. Alternatively, Quality Confidence will be low if either tests are repeatedly failing or there isn’t good test coverage of in scope requirements.Quality Confidence can be represented as a single value or a trend over time.

Since in Holistic Software Development the Requirements Stack maps explicitly to Definitions of Done, with development at each level brought together via Integration Streams, Quality Confidence can be implemented independently at each level and even used as a quality gate prior to acceptance into an Integration Stream.

A Word of Warning

Quality Confidence is only an indicator of the confidence in quality of the product and should not be considered a stable solid measure of quality. Any method of measuring quality based on test cases and test pass/fails had two flawed assumptions included in it:

  1. The set of test cases fully exercises the software
    • Our experience shows that code coverage, flow coverage or simple assertions that the “tests cover the code” does not mean that all bugs have been caught, especially in Fringe Cases. We might think that we’ve got reasonable coverage of functionality (and non-functionals) with some test cases but due to complex emergent behaviors in non-trivial systems we cannot be 100% sure.
  2. The test cases are accurately defined and will not be interpreted differently by different people
    • Just as with requirements, tests can be understood in different ways by different people. There are numerous examples of individuals interpreting test cases in a diverse number of ways to the extent that the same set of test cases run against a piece of software by different people can result in radically different test results.

Metrics such as Quality Confidence, must be interpreted within the context of these flawed assumptions. As such they are simply useful indicators, if they disagree with the perceptions of the team then the team’s views should have precedence and any differences can be investigated to uncover quality problems. We strongly recommend a Go See first, measure second mentality.

How to calculate quality confidence

To give an indicator of the confidence in the quality of the current release we first need to ensure that the measure is only based on the current in-scope requirements. We then track the tests related to each of these requirements, flagging the requirements that we consider to have “enough” testing as well as their results over time. The reason we include whether a requirement has enough tests is that we might have a requirement in scope that is difficult to test, or has historically been a source of many Fringe Cases,  and so although it is in scope we might not have confidence that its testing is adequate. Obviously this a situation to resolve sooner than later.

Once we understand the requirements in scope for the current release we can start to think about the quality confidence of each.

A confidence of 100% for a single requirement that is in scope for the current release is achieved when all the tests for that requirement have been run and passed (not just this time but have also passed for the last few runs) and that the requirement has enough coverage. For multiple requirements we simply average (or maybe weighted average) the results across the in-scope requirements set.

We look at not just the current pass scores but previous test runs to see how stable the test is. If a test has failed its last 5 runs but passed this time we don’t want to assert that quality is assured. Instead we use a weighted moving average so that more recent test runs have more influence on a single score than the oldest but 100% is only achieved when the last x number of test results have passed. The specific number can be tuned based on the frequency of testing and level of  risk.

If we don’t run all the tests during each test run then we can interpolate the quality for each requirement but we suggest decreasing the confidence for a requirement (by a cumulative 0.8) for each missing run. Just because a test passed previously doesn’t mean it’s going to still pass now.

To help calibrate these elements (aging, confidence interpolation, and coverage) Quality Confidence can be correlated with the lag measure of escaped bugs. However, in real world implementations fine tuning of these parameters (other than coverage) have been shown to have little impact on actual Quality Confidence scores.


Despite being less than simple to measure Quality Confidence is quite intuitive to interpret as it is based on the principle of Behavior Driven Measurement. In our experience it tends to be congruent with team members ‘gut feel’ for the state of a product, especially when shown over time. Quality Confidence is a useful indicator but is no substitute for direct honest communication between people.

We encourage telling teams how to “game” metrics to make them look better. In the case of Quality Confidence, the measure can be gamed by adding more tests to a requirement, running tests more often and ensuring they pass regularly. All excellent behaviors for a development team.

Quality Confidence provides a lead indicator for the quality of Releases since we can calculate it before a release goes live. For continuous flow teams we can simply track the Quality Confidence of each individual requirement, change or other work item. Simple averages across calendar cycles give trend information.

Quality Confidence can be aggregated across teams (by averaging) and can also be applied at successive integration levels of the Definition of Done stack for Team-of-Teams work.

A simpler version

If you don’t have tests linked to requirements, then you may want to consider whether you’re testing the right thing or not. Quality Confidence can be simplified to be simply based on tests and test results (ignoring the coverage question above) if the team assert a level of coverage.In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Quality Confidence:

  • Teams increase test coverage against requirements
  • Teams test more often

If we balance this metric with the Throughput metrics that promote speed over quality, then the measurement driven behavior is positive for the teams.