Workflow Metrics

The Metrics and Reporting View looks at the various feedback cycles, metrics and reports in Holistic Software Development.

Workflow Metrics are indicators of flow and health in an organization. Used to complement direct engagement through the Go See practice, they can help provide evidence for decisions and indicate areas to investigate.

Workflow Metrics are designed to answer Governance and Planning questions.

When designing metrics we need to be very careful that the metrics are evidence based and actually measure the output we’re trying to achieve. All metrics risk causing Measurement Driven Behavior – when people’s attempts to meet the metrics cause unintended, negative behaviors. We recommend Behavior Driven Measurement as an approach to defining metrics. We recommend using the same simple metrics at both team and team-of-teams level so that they can resonate and be easily understood by everyone in the organization.

At executive levels the details of workflow metrics may be too much, however information about the number of exceptions or cross-organization trends might be useful as health indicators on the Executive Dashboard.

When discussing metrics, we often refer to “work items”. A Work Item is anything tracked that needs to be done. This broad definition covers requirements, bugs, changes, support tickets and anything else that requires effort. We do not include wish list items, risks (although we do include their mitigation plans), or other non-work items that teams may choose to track.

Metric: Story Points and Velocity

Story Points are an abstract, arbitrary number associated with User Stories that indicates the relative effort (not complexity) required to complete development of the story.

Often expressed using a Fibonacci series (1, 2, 3, 5, 8, 13, 21, 34, …) or a simple integer value, Story Points indicate the relative effort, not complexity, of low level requirements. Story Points are an estimation method based on picking a well-understood “normal” story, setting its value, and then estimating other stories relative to that “normal” story.

Points are an abstract, team-based indicator and are not comparable across teams. They do not equate to “person-days” or complexity and so are fundamentally unsuitable for use in contractual arrangements.

Points cannot be aggregated meaningfully across teams or up to programme level due to their arbitrary team-based value. For this reason, we strongly recommend against using points at either Programme Backlog or Product Backlog level.

Story Point estimating may be useful within a team to help size Stories for inclusion in sprints/iterations. This is an in-team private metric that does not make sense outside of the team. Over time, story point sizes tend toward lower values as large stories are broken up and more is understood about the work (risks reduce in line with the Cone of Uncertainty). As a result, points are not numerically consistent even within the context of a single team over time. We’ve seen many planning and reporting dysfunctions based on poor understanding, and the implied false accuracy, of Story Points – such as Project Managers setting a target velocity!

Metric: Velocity

How quickly are we working? When will we be done?

Velocity is how many points are completed over time and is often used, especially in Scrum teams, as a measure of progress towards the total number of points.

Over time teams will gain an understanding of roughly how many points they can deliver in a period of time (or an iteration/sprint). This is called their “Velocity” and can be used to extrapolate remaining effort to completion (ETC). This is the original intention behind story points, and is why they do not equate to complexity: some very complex things don’t take long to implement, while some very simple (but large in volume) requirements can take a long time. Velocity will normally vary a little over the lifecycle and will typically be unstable during the first couple and last few time periods/iterations/sprints.

This means that teams need to throw away the first few iterations of velocity, then establish at least 3 (preferably 5) iterations to extrapolate something that’s even close to statistically meaningful. That means velocity can only be meaningfully calculated at somewhere approaching the 6th or 7th iteration/sprint.
Based on mining the work item data of 500 projects of diverse types we have found that team velocities are generally stable past the first 3 sprints and until the last 2 or 3; the biggest factor affecting velocity is changing team members. Despite this long-term stability in velocity, 90% of the projects we mined had over-planned sprints. Often when work wasn’t completed in a previous iteration it was simply added to the next, cumulatively overloading the team. Team velocity never increases simply because more work is planned into a timebox.
Numerically, there seems to be no significant difference in our dataset between simple extrapolation of the number of done items on a backlog (per Release) and story point extrapolation, indicating that Story Points are pointless. Simply tracking the % of items done per Release is simpler and easier to communicate.
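
This simpler extrapolation can be sketched in a few lines. The function name, warm-up length and sample numbers below are illustrative assumptions, not part of HSD:

```python
import math

def forecast_remaining_sprints(done_per_sprint, total_items, warmup=3):
    """Estimate sprints remaining by extrapolating the average number
    of items completed per sprint, ignoring the first `warmup` sprints
    where velocity is typically unstable."""
    stable = done_per_sprint[warmup:]
    if not stable:
        raise ValueError("not enough sprints to extrapolate from")
    rate = sum(stable) / len(stable)        # items closed per sprint
    remaining = total_items - sum(done_per_sprint)
    if remaining <= 0:
        return 0
    return math.ceil(remaining / rate)      # round up: a partial sprint still costs a sprint

# Hypothetical 40-item release with 6 sprints of history.
history = [2, 5, 4, 6, 5, 7]
print(forecast_remaining_sprints(history, total_items=40))  # 2
```

Note that this counts done items, not points, so the forecast can be aggregated and communicated without any team-local calibration.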

As with any estimate, we recommend presenting it with an Uncertainty Indicator. Extrapolating progress from actual activity so far helps teams to answer “how long until it’s done?”. For more on tracking Story Points over time see Workflow Metrics.

Importantly, in our examination of project data (including projects that did, and didn’t use Story Points) we found no statistically significant difference between Velocity and simply the % Complete over time. We recommend against using Story Points and Velocity as a metric because:
  • Abstract numbers are difficult for people and leaders to understand
  • Their meaning changes over time as the Cone of Uncertainty is reduced in teams, undermining extrapolation
  • They imply false accuracy by being small, discrete numbers
  • They offer no additional benefit over simply counting items complete
  • They cannot be aggregated because they are team-defined and not normalized against any standard

We strongly recommend resisting schemes to normalize story points across organizations, or to repeat the same mistakes with other purely abstract measures (such as Business Value Points!).

Metric: Team Throughput

How quickly are we doing things? 

Throughput, or average time to close, is a useful indicator of how quickly the team is getting things done.

If the trend starts to go up, then we have an indicator that work is taking longer. This may be due to fewer resources, more complexity in the current work, or a change in ways of working that has introduced waste and slowed things down. Alternatively, if the trend starts to go down we can deduce that work is being done more quickly. This may be due to the work being simpler and smaller at the moment, more resources, or a change in ways of working that has improved efficiency and productivity. To accurately understand when a work item has been closed, the Definition of Done must be explicitly understood.

Different work items will be of various sizes in terms of both effort and complexity; however, over both small and very large project and organization datasets we have found that item sizes tend to a normal distribution, which means that tracking the average is a useful indicator. Extrapolating the average time to close over the length of an iteration, sprint or release gives an indicator of how many items can be done in that timebox, and therefore whether plans are realistic or need refining.
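
As a sketch of that extrapolation, using hypothetical dates, sprint length and parallel work streams:

```python
from datetime import date

# Hypothetical (created, closed) dates for recently finished items.
closed = [
    (date(2024, 1, 2), date(2024, 1, 9)),
    (date(2024, 1, 3), date(2024, 1, 8)),
    (date(2024, 1, 5), date(2024, 1, 19)),
]

days_to_close = [(done - made).days for made, done in closed]
avg = sum(days_to_close) / len(days_to_close)   # average time to close

# Rough capacity check for a 10-working-day sprint, assuming 4 work
# streams progress items in parallel (an assumption for illustration).
sprint_days = 10
parallel_streams = 4
capacity = int(sprint_days / avg * parallel_streams)
print(round(avg, 1), capacity)  # 8.7 4
```

If the plan for the timebox contains many more items than this rough capacity, it is probably over-planned and needs refining.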

This essentially gives a Velocity of “things done” rather than “abstract points”.

Lean and/or Continuous Flow teams will typically track Team Throughput as Lead and Cycle Time metrics.

Cycle Time is the time it takes to actually do a piece of work. Lead Time is the time from the request being raised to the work being completed. Lead Time and Cycle Time are measures common in Lean implementations. Lead Time = Cycle Time + Queue Time.
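
The relationship can be illustrated with hypothetical timestamps for a single work item:

```python
from datetime import datetime

raised   = datetime(2024, 3, 1, 9, 0)   # request created
started  = datetime(2024, 3, 8, 9, 0)   # work actually began
finished = datetime(2024, 3, 10, 9, 0)  # work completed

queue_time = started - raised      # waiting before anyone picked it up
cycle_time = finished - started    # time spent actually doing the work
lead_time  = finished - raised     # what the requester experiences

assert lead_time == cycle_time + queue_time
print(lead_time.days, cycle_time.days, queue_time.days)  # 9 2 7
```

In this example the item took 2 days of work but the requester waited 9; most of the lead time is queue time, which is exactly the waste these metrics expose.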

We have found that tracking these metrics can identify teams who are working at a steady state, improving or struggling. In terms of Behavior Driven Measurement we have seen the following behaviors emerge:

  • Work is broken down into smaller chunks
  • Those small chunks are delivered as quickly as possible
  • Reducing the amount of work in progress (WIP) at any one time

If we balance these behaviors with one that promotes quality over speed then the measurement driven behavior is positive for the teams.

There is a risk that focusing on speed reduces quality and so this metric, and resulting behaviors, need to be balanced with one that promotes quality over speed.

Metric: Release Burnup or Cumulative Burnup

Are we on track or not? When will we be done?

A related metric is a Release Burnup (for a product delivery team) or a Cumulative Burnup (for a steady state maintenance team).

A Burnup is an established metric in agile processes that indicates what the current scope target is (the number of stories or story points in the current release) and then tracks completion towards this target over time. The gradient of this line is driven by the Team Throughput and is usually called the team’s Velocity. Velocity is useful to extrapolate towards the intersection of work getting done and the required scope to indicate a likely point of timebox, release or product completion.

A traditional Burndown is simply an upside down version of this graph. We prefer a burnup because it allows for real life intervening in plans and the scope changing either up or down. In terms of perception the team can feel more positive about achieving something as the line goes up as progress is made. In a Burndown, as work is completed the line goes down as remaining work is reduced.

When there is no release or useful timebox (e.g. in a maintenance or purist continuous team) then a simple calendar-based timebox (such as a month or quarter) can be used. Instead of a target scope we simply track the cumulative number of created items. This results in a Cumulative Burnup graph.

An ideal Cumulative Burnup graph will show the number of completed items keeping roughly in line with the number of created items. An unhealthy graph will show the number of items created outrunning the number of items completed, indicating over-demand on teams.

Cumulative Burnups that count items rather than abstract points can be useful across teams and even across entire development organizations to indicate how well the organization is keeping up with demand, how stable that demand is and how stable the development supply is. Note that the only real difference between a Release Burnup and a Cumulative Burnup is the amount of up-front planning and therefore creation of items. In our experience the number of items in a Release always fluctuates and should never be considered as a fixed value.
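
As a minimal sketch, a Cumulative Burnup only needs the created and closed dates of items; the dates below are invented example data:

```python
from datetime import date

# (created, closed-or-None) per work item; example data only.
items = [
    (date(2024, 1, 3),  date(2024, 1, 20)),
    (date(2024, 1, 10), date(2024, 2, 5)),
    (date(2024, 2, 2),  None),              # still open
    (date(2024, 2, 15), date(2024, 3, 1)),
]

# Sample at the start of each month following the data.
months = [date(2024, m, 1) for m in (2, 3, 4)]

created_cum = [sum(1 for c, _ in items if c < m) for m in months]
done_cum    = [sum(1 for _, d in items if d and d < m) for m in months]

print(created_cum)  # [2, 4, 4]
print(done_cum)     # [1, 2, 3]
# Healthy: done_cum keeps roughly in line with created_cum.
```

Because this counts items rather than points, the same two series can be summed across teams to give an organization-wide view of supply and demand.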

Release Burnups can’t be easily aggregated because Release timeboxes will be different for different products. However, simplifying to Cumulative Burnups allows for easy aggregation.

In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Burnups:

  • A desire to create fewer items
  • A desire to close items more quickly

Since this metric emphasizes speed over quality we balance it with the Quality Confidence Metric. To avoid very low priority but slow items never getting done we might also keep an eye on the age distribution of currently open items.

Metric: Lead and Cycle Time

How long does it take to do work? Why does it take so long?

As mentioned in Lean Principles, the most effective way of understanding a value stream is to directly examine it, to literally go and see the activities that take place throughout the organization, the handovers between teams and the products produced. Similarly, the most effective way to identify waste is to simply ask team members in the organization what they think; since they work in these processes day to day, they are well qualified to identify which activities are useful and which are wasteful.

Sometimes a process may appear wasteful from a local perspective but may be meaningful and valuable from a systemic perspective. In this case the value of the wider process requires better communication so that local teams can understand its purpose, which in turn will improve motivation and productivity. We recommend that an organization is open to challenge from individuals and actively encourages people to identify areas of potential improvement. Whether their proposed changes end up being successfully implemented or not is less important than encouraging individuals to continuously improve the organization.

Measurement and Mapping

Lead and Cycle time are useful measures for understanding a value stream and mapping it.

Cycle Time = the time it takes to actually do something

Lead Time = the time taken from the initial request to it finally being done

Lead time can only ever be equal to the cycle time or longer since a request can’t be done quicker than the amount of time taken to do the task. Lead time includes cycle time. Most tasks in a complex organization (e.g. provisioning a new Continuous Integration Server) typically take far longer to achieve (the Lead Time) than the time to actually implement the request (set up a new virtual server, install software, integrate into network etc.). Measuring the amount of time involved in “doing” vs. “waiting” is the purpose behind Lead and Cycle Time.

Measuring activities in this way often leads to an understanding of the amount of time a request is in a queue, waiting to be triaged, or on a person’s task list before work starts on it. This waiting time is all potential waste and is often inherent in any handover of an activity between people, teams, departments or businesses. As a result, flow and efficiency can often be improved simply by reducing the number of handovers.

There are many reasons for large “queue time” including over-demand and under-resourcing. However, there are frequently less justifiable reasons for delays including measurement driven behavior that results in teams protecting their Service Level Agreements (SLAs) and/or Key Performance Indicators (KPIs) by “buffering” work to introduce some slack for when things go wrong or to deal with variance. In traditional Project Management this deliberate introduction of waste is called “contingency planning” and is considered a positive practice. Unfortunately, it is especially prevalent when there is a contractual boundary between teams that is focused on SLAs.

These delays, regardless of their root causes can have significant effects when looking at the system as a whole. If we examine a chain of activities as part of a value stream, measuring each in terms of lead and cycle time we can draw a picture of a sequential value stream:

In the example above we have 26 days of actual work spread over an elapsed 60 working days (12 weeks), which is roughly 57% waste. This work could have been done in less than half the time it actually took!
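
The arithmetic can be reproduced directly; the activity names and per-activity lead/cycle values below are invented to match the 26-of-60 figures:

```python
# Hypothetical lead/cycle times (in days) for a chain of sequential
# activities; the numbers are invented to match the example above.
activities = {
    "triage": (10, 2),
    "design": (15, 6),
    "build":  (20, 12),
    "deploy": (15, 6),
}

elapsed = sum(lead for lead, _ in activities.values())    # end-to-end days
work    = sum(cycle for _, cycle in activities.values())  # days actually doing
waste   = 1 - work / elapsed

print(elapsed, work, round(waste * 100))  # 60 26 57
```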

Often planners will look to put some activities in parallel, to reduce contingency by overlapping activities (accepting some risk). Even with these optimizations, if waste inherent in each activity is not addressed there will still be a lot of waste in the system.

By visualizing a value stream in this fashion we can immediately identify waste in the system; in the example above, despite some broad optimizations, we can still see it’s mostly red. In many cases planners aren’t willing to accept the risks inherent in overlapping activities as shown here, or aren’t even aware that they can be overlapped, leading to the more sequential path shown previously. The result is a minimum time for a request to get done, based on the impedance of the current value chain: in this example, 38 days before we even start thinking about actually doing any work. This is a large amount of waste.

What is the baseline impedance in your organization?


It is possible to average lead and cycle values over time on a team, product or arbitrary basis. Such aggregation provides useful information on the general health of departments, or even the entire portfolio.

We recommend applying pressure to drive them downwards and measuring the effect of process improvement against predicted changes. A change that results in lead and cycle times going up (assuming quality is stable) is unlikely to be an improvement.

Tracking more than just the creation time, work start time and close time against activities, requirements or work items means that more than the Lead and Cycle time can be calculated. A Cumulative Flow diagram showing transitions between each state in a workflow can be created, identifying bottlenecks and helping refine work in progress limits.

Metric: Work In Progress and Cumulative Flow

Teams using Kanban style continuous flow will typically work to strictly enforced Work-In-Progress (WIP) limits to expose bottlenecks and inefficiencies in their workflow. This is intended to help smooth the workflow, reduce work in progress and increase throughput. Enforcing WIP limits can be difficult, however, since work items are never really the same size, and efforts to make them the same size risk fragmentation of work or unnecessary batching.

If bottlenecks occur team members can “swarm” to help their colleagues make progress.

Counting the number of items in each state of a workflow on a regular or continuous basis can be plotted to create a Cumulative Flow graph. These charts show the fluctuations in the workflow and average amount of time in each stage of the workflow – this does not require tracking of individual items through the workflow, only the number of items at each point on the sampling heartbeat.
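
A sketch of that sampling approach, with hypothetical states and counts:

```python
# Daily per-state counts are enough to draw a Cumulative Flow graph;
# states, days and numbers here are invented examples.
states = ["todo", "doing", "review", "done"]

samples = [                                   # one dict per sampling day
    {"todo": 8, "doing": 3, "review": 1, "done": 0},
    {"todo": 7, "doing": 4, "review": 2, "done": 1},
    {"todo": 7, "doing": 4, "review": 1, "done": 4},
]

# Each band in the graph is the count of items at-or-beyond a state;
# a band that keeps widening suggests a bottleneck at that state.
for i, state in enumerate(states):
    line = [sum(day[s] for s in states[i:]) for day in samples]
    print(state, line)
```

Note that no per-item identity is needed: the heartbeat of state counts alone produces the graph.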

A Cumulative Flow graph is similar to the Cumulative Burnup except that it has more lines since rather than just tracking created and done, it tracks each state change. An organization wide Cumulative Flow graph can be created if the organization adopts a standard set of Definitions of Done as recommended in Holistic Software Development.

Metric: Work Item Age

How long has work stayed open? Are we forgetting things?

Lead and Cycle time are useful measures for work that has been completed, but averages of those values ignore the items that are open and have stayed open: the work items created and never done.

Examination of a medium-sized development organization of around 400 people recently showed that although it proudly boasted an average 30-day cycle time, it had several thousand items over a year old still outstanding!

Tracking the ages of outstanding open items helps balance the gaming and measurement driven behavior that a focus on Lead and Cycle times can cause. A behavior can emerge where teams only pick the easy incoming tasks, leaving the hard ones to fall into the “permanently open trap”. Measuring the age of open items helps to counter this behavior.

Of course, poor housekeeping and simply tracking many wish list items can lead to high ages on open items. For this reason, we recommend not tracking wish list items as part of work item metrics. The work item debt caused by poor housekeeping is harder to solve when it’s in the thousands of items.
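
A minimal sketch of an age-distribution check, with hypothetical dates and bucket boundaries:

```python
from datetime import date

today = date(2024, 6, 1)
# Creation dates of currently open items; example data only.
open_items = [date(2024, 5, 20), date(2024, 4, 1), date(2023, 3, 15)]

ages = sorted((today - created).days for created in open_items)
buckets = {"<30d": 0, "30-365d": 0, ">1y": 0}
for age in ages:
    if age < 30:
        buckets["<30d"] += 1
    elif age <= 365:
        buckets["30-365d"] += 1
    else:
        buckets[">1y"] += 1           # the "permanently open trap"

print(buckets)  # {'<30d': 1, '30-365d': 1, '>1y': 1}
```

Anything landing in the oldest bucket is a prompt to go and see: is it still wanted, was it forgotten, or is it simply housekeeping debt?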

Metric: Work Item Type Distribution

Another interesting metric related to work items is type distribution. Simply counting the number of items of each type over time, or as a snapshot can indicate workflow patterns and areas for investigation or simply provide context for other workflow metrics.
If your organization uses many different customized work item types then we recommend theming into broad categories such as “requirements”, “bugs” and “changes” before analysis.

Our project data shows that there are often higher numbers of bugs and changes than requirements; this isn’t necessarily a bad thing. Bugs and Changes are normally smaller in scope and effort than requirements types and so there’ll typically be many bugs per requirement. There’s a difference between Development Bugs and “Escaped Bugs” which we describe in Bug Frequency.
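
A sketch of the theming step, assuming a hypothetical mapping from raw tracker types to broad categories:

```python
from collections import Counter

# Example mapping from customized tracker types to broad themes;
# the type names here are invented for illustration.
THEMES = {
    "user story": "requirements", "epic": "requirements",
    "defect": "bugs", "incident": "bugs",
    "change request": "changes",
}

raw_items = ["user story", "defect", "defect", "change request",
             "incident", "epic", "defect"]

distribution = Counter(THEMES.get(t, "other") for t in raw_items)
print(dict(distribution))  # {'requirements': 2, 'bugs': 4, 'changes': 1}
```

Snapshotting this distribution per week or per release makes the requirements-then-bugs lag pattern described below easy to spot.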

Requirements-heavy distributions indicate projects at an early stage of their workflow. Excessively bug-heavy distributions indicate poor quality practices. Excessively change-heavy distributions indicate unstable requirements, and therefore too much up-front requirements work.

How these distributions change over time can be particularly informative. For many teams we can identify their release cycles from the changes in their work item distributions when we see a pattern of:
  1. Creation of a mass of detailed requirements types (i.e. Stories)
  2. Followed by a lag of correlating bugs/changes a few weeks later
This is a standard, and healthy, iterative pattern.

Metric: Bug Frequency

How many bugs are there? What’s the quality like?

Quality is more involved than simply counting bugs. However, the number of bugs found over time is a reasonable indicator of quality.

“Development Bugs” are bugs that are found during development – the more of these the better as they’re found before the product gets to a customer. In contrast “Escaped Bugs” are found once the product is live in user environments which is bad. If your data can differentiate internal vs. escaped bugs you can get a sense for how effective your quality practices are at bug hunting.

We track the number of internal and escaped bugs created over time. Generally speaking we observe a standard pattern of low escaped bugs and high development bugs. If the number of escaped bugs is trending up then there may be a quality problem, even if this correlates with an increase in user numbers. If the number of escaped bugs is trending down, then quality practices are being effective.

Development bugs will fluctuate over time, but excessive numbers indicate that more quality work may be required to prevent bugs rather than catch them afterwards. We’ve observed that the number of bugs (development and escaped) reduces when development and test effort is merged, and increases when there are separate development and test teams.

Metric: Quality Confidence

Is the output of good enough quality?

Quality Confidence is a lead indicator for the quality of a software release based on the stability of test pass rates and requirements coverage. It can be implemented at any level of the requirements stack mapped to a definition of done.

Quality confidence combines a number of standard questions together in a single simple measure of the current quality of the release (at any level of integration) before it’s been released. This metric answers the questions:

  • How much test coverage have we got?
  • What’s the current pass rate?
  • How stable are the test results?

Quality Confidence is 100% if all of the in-scope requirements have test coverage and all of those tests have passed for the last few test runs. Alternatively, Quality Confidence will be low if tests are repeatedly failing or there isn’t good test coverage of in-scope requirements. Quality Confidence can be represented as a single value or a trend over time.

Since in Holistic Software Development the Requirements Stack maps explicitly to Definitions of Done, with development at each level brought together via Integration Streams, Quality Confidence can be implemented independently at each level and even used as a quality gate prior to acceptance into an Integration Stream.

A Word of Warning

Quality Confidence is only an indicator of confidence in the quality of the product and should not be considered a solid, stable measure of quality. Any method of measuring quality based on test cases and test pass/fails has two flawed assumptions built into it:

  1. The set of test cases fully exercises the software
    • Our experience shows that code coverage, flow coverage or simple assertions that the “tests cover the code” does not mean that all bugs have been caught, especially in Fringe Cases. We might think that we’ve got reasonable coverage of functionality (and non-functionals) with some test cases but due to complex emergent behaviors in non-trivial systems we cannot be 100% sure.
  2. The test cases are accurately defined and will not be interpreted differently by different people
    • Just as with requirements, tests can be understood in different ways by different people. There are numerous examples of individuals interpreting test cases in a diverse number of ways to the extent that the same set of test cases run against a piece of software by different people can result in radically different test results.

Metrics such as Quality Confidence must be interpreted within the context of these flawed assumptions. As such they are simply useful indicators; if they disagree with the perceptions of the team then the team’s views should take precedence, and any differences can be investigated to uncover quality problems. We strongly recommend a Go See first, measure second mentality.

How to calculate quality confidence

To give an indicator of the confidence in the quality of the current release we first need to ensure that the measure is only based on the current in-scope requirements. We then track the tests related to each of these requirements, flagging the requirements that we consider to have “enough” testing, as well as their results over time. The reason we include whether a requirement has enough tests is that we might have a requirement in scope that is difficult to test, or has historically been a source of many Fringe Cases, and so although it is in scope we might not have confidence that its testing is adequate. Obviously this is a situation to resolve sooner rather than later.

Once we understand the requirements in scope for the current release we can start to think about the quality confidence of each.

A confidence of 100% for a single requirement that is in scope for the current release is achieved when all the tests for that requirement have been run and passed (not just this time but also for the last few runs) and the requirement has enough coverage. For multiple requirements we simply average (or perhaps weighted-average) the results across the in-scope requirements set.

We look at not just the current pass results but previous test runs to see how stable each test is. If a test has failed its last 5 runs but passed this time we don’t want to assert that quality is assured. Instead we use a weighted moving average so that more recent test runs have more influence on the score than older ones, but 100% is only achieved when the last x test results have passed. The specific number can be tuned based on the frequency of testing and level of risk.

If we don’t run all the tests during each test run then we can interpolate the quality for each requirement, but we suggest decreasing the confidence for a requirement (multiplying by a cumulative factor of 0.8) for each missing run. Just because a test passed previously doesn’t mean it’s going to pass now.
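
Pulling these pieces together, here is a minimal sketch of the calculation. The linear weighting scheme is one possible choice; the 0.8 decay and window size are the tunable parameters mentioned in the text, and the data is hypothetical:

```python
def requirement_confidence(results, has_enough_tests, decay=0.8):
    """results: most-recent-last list of True (pass), False (fail),
    or None (test not run that cycle)."""
    if not has_enough_tests:
        return 0.0                           # in scope but under-tested
    weights = range(1, len(results) + 1)     # recent runs weigh more
    score, total = 0.0, 0.0
    penalty = 1.0
    for w, r in zip(weights, results):
        if r is None:
            penalty *= decay                 # interpolated run, trust less
            continue
        score += w * (1.0 if r else 0.0)
        total += w
    return (score / total) * penalty if total else 0.0

def quality_confidence(requirements):
    """Average per-requirement confidences across the in-scope set."""
    scores = [requirement_confidence(r, c) for r, c in requirements]
    return sum(scores) / len(scores)

in_scope = [
    ([True, True, True, True], True),    # stable passes -> 1.0
    ([False, False, True], True),        # only just started passing
    ([True, None, True], True),          # one missing run, decayed
    ([True, True], False),               # not enough coverage -> 0.0
]
print(quality_confidence(in_scope))
```

Gaming this measure means adding tests, running them more often, and keeping them passing, which is the behavior the metric is designed to drive.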

To help calibrate these elements (aging, confidence interpolation, and coverage) Quality Confidence can be correlated with the lag measure of escaped bugs. However, in real world implementations fine tuning of these parameters (other than coverage) has been shown to have little impact on actual Quality Confidence scores.


Despite being less than simple to measure, Quality Confidence is quite intuitive to interpret as it is based on the principle of Behavior Driven Measurement. In our experience it tends to be congruent with team members’ ‘gut feel’ for the state of a product, especially when shown over time. Quality Confidence is a useful indicator but is no substitute for direct, honest communication between people.

We encourage telling teams how to “game” metrics to make them look better. In the case of Quality Confidence, the measure can be gamed by adding more tests to a requirement, running tests more often and ensuring they pass regularly. All excellent behaviors for a development team.

Quality Confidence provides a lead indicator for the quality of Releases since we can calculate it before a release goes live. For continuous flow teams we can simply track the Quality Confidence of each individual requirement, change or other work item. Simple averages across calendar cycles give trend information.

Quality Confidence can be aggregated across teams (by averaging) and can also be applied at successive integration levels of the Definition of Done stack for Team-of-Teams work.

A simpler version

If you don’t have tests linked to requirements, then you may want to consider whether you’re testing the right things. Quality Confidence can be simplified to be based purely on tests and test results (ignoring the coverage question above) if the team asserts a level of coverage.

In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Quality Confidence:

  • Teams increase test coverage against requirements
  • Teams test more often

If we balance this metric with the Throughput metrics that promote speed over quality, then the measurement driven behavior is positive for the teams.

Metric: Open and Honest Communication

The most useful measurement process is regular open and honest communication, and the most effective way to understand what is going on in a workflow is to “go and see”. Physically walking a workflow, following a work item through the people and teams that interact with it in some way, is an excellent way of understanding a business value stream in detail.

Counting things and drawing graphs is no substitute for direct open and honest communication.