Workflow Metrics
The Metrics and Reporting View looks at the various feedback cycles, metrics and reports in Holistic Software Development.
Workflow Metrics are indicators of flow and health in an organization. Used to complement direct engagement through the Go See practice, they can provide evidence for decisions and indicate areas worth investigating.
Workflow Metrics are designed to answer Governance and Planning questions.
When designing metrics we need to be very careful that they are evidence based and actually measure the output we’re trying to achieve. All metrics risk causing Measurement Driven Behavior – when people’s attempts to meet the metrics cause unintended, negative behaviors. We recommend Behavior Driven Measurement as an approach to defining metrics. We recommend using the same simple metrics at both team and team-of-teams level so that they resonate and are easily understood by everyone in the organization.
At executive levels the details of workflow metrics may be too much; however, information about the number of exceptions or cross-organization trends might be useful as health indicators on the Executive Dashboard.
When discussing metrics, we often refer to “work items”. Work Items are any tracked thing that needs to be done. This broad definition covers requirements, bugs, changes, support tickets and anything else that requires effort. We do not include wish list items, risks (although we do include their mitigation plans), or other non-work items that teams may choose to track.
Metric: Story Points and Velocity
Story Points are an abstract, arbitrary number associated with User Stories that indicates the relative effort (not complexity) required to complete the story development.
Often expressed as a Fibonacci series (1, 2, 3, 5, 8, 13, 21, 34, …) or a simple integer value, Story Points are intended to indicate relative effort, not complexity, of low level requirements. Story Points are an estimation method based on picking a well-understood “normal” story, setting its value and then estimating other stories relative to that “normal” story.
Points are an abstract, team-based indicator and are not comparable across teams. They do not equate to “person-days” or complexity and so are fundamentally unsuitable for use in contractual arrangements.
Points cannot be aggregated meaningfully across teams or up to programme level due to their arbitrary, team-based value. For this reason, we strongly recommend against using points at both Programme Backlog and Product Backlog levels.
Story Point estimating may be useful within a team to help size Stories for inclusion (or not) in sprints/iterations. This is an in-team, private metric that does not make sense outside of the team. Over time story point sizes tend towards lower values as large stories are broken up and more is understood about the work (risks reduce in line with the Cone of Uncertainty). As a result, points are not numerically consistent even within the context of a single team over time. We’ve seen many planning and reporting dysfunctions based on poor understanding and the implied false accuracy of Story Points, such as Project Managers setting a target velocity!
Metric: Velocity
How quickly are we working? When will we be done?
Velocity is how many points are completed over time and is often used, especially in Scrum teams, as a measure of progress towards the total number of points.
Over time teams will gain an understanding of roughly how many points they can deliver in a period of time (or an iteration/sprint). This is called their “Velocity” and can be used to extrapolate remaining effort to completion (ETC). This is the original intention behind story points and is why they do not equate to complexity: some very complex things don’t take long to implement, while some very simple (but large in volume) requirements can take a long time. When using velocity it will normally vary a little over the lifecycle and will typically be somewhat unstable during the first couple and last few time periods/iterations/sprints.
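As a minimal sketch of this kind of extrapolation (in Python, with invented velocities and backlog size rather than real project data), remaining points divided by recent velocity gives a rough, range-based estimate to complete:

```python
# A minimal sketch with hypothetical numbers: extrapolating an estimate-to-complete
# (ETC) from recent velocity, presented as a range rather than a single figure.
recent_velocities = [21, 18, 24, 20]   # points completed in the last few sprints
remaining_points = 130                 # points left in the release backlog

avg_velocity = sum(recent_velocities) / len(recent_velocities)
best, worst = max(recent_velocities), min(recent_velocities)

print(f"ETC: {remaining_points / best:.1f} to {remaining_points / worst:.1f} sprints "
      f"(~{remaining_points / avg_velocity:.1f} at average velocity)")
```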
As with any estimate, we recommend presenting it with an Uncertainty Indicator. Extrapolating progress from actual activity so far helps teams to answer “how long until it’s done?”. For more on tracking Story Points over time see Workflow Metrics. However, Story Points have significant weaknesses as a metric:
- Abstract numbers are difficult for people and leaders to understand
- Their meaning changes over time as the Cone of Uncertainty reduces within teams, undermining extrapolation
- They imply false accuracy by being small, discrete numbers
- They offer no additional benefit over simply counting items complete
- They cannot be aggregated because they are team-defined and not normalized against any standard
We strongly recommend resisting schemes to normalize story points across an organization, or schemes that make the same mistakes with other purely abstract measures (such as Business Value Points!).
Metric: Team Throughput
How quickly are we doing things?
Throughput, or average time to close, is a useful indicator of how quickly the team is getting things done.
If the trend starts to go up, then we have an indicator that work is taking longer. This may be due to fewer resources, more complexity in the current work, or a change in ways of working that has introduced waste and slowed things down. Alternatively, if the trend starts to go down we can deduce that work is being done more quickly. This may be due to the work being simpler and smaller at the moment, more resources, or a change in ways of working that has improved efficiency and productivity. To accurately understand when a work item has been closed, the Definition of Done must be explicitly understood.
Different work items will be of various sizes in terms of both effort and complexity; however, over both small and very large project and organization datasets, we have found that item sizes tend to a normal distribution, which means that tracking the average is a useful indicator. Extrapolating the average time to close over the length of an iteration, sprint or release gives an indicator of how many items can be done in that timebox, and therefore whether plans are realistic or need refining.
This essentially gives a Velocity of “things done” rather than “abstract points”.
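A minimal sketch of that extrapolation follows (the dates are invented, and the timebox length and degree of parallel working are assumptions, not prescriptions):

```python
# Hypothetical data: average time to close, extrapolated to a timebox.
from datetime import date

closed_items = [                       # (created, closed) dates for completed work items
    (date(2024, 3, 1), date(2024, 3, 4)),
    (date(2024, 3, 2), date(2024, 3, 9)),
    (date(2024, 3, 5), date(2024, 3, 8)),
]

avg_days_to_close = sum((done - created).days
                        for created, done in closed_items) / len(closed_items)

timebox_days = 10        # assumed two-week sprint (working days)
parallel_streams = 4     # assumed number of items worked on at once (WIP)

items_per_timebox = (timebox_days / avg_days_to_close) * parallel_streams
print(f"Average time to close: {avg_days_to_close:.1f} days")
print(f"Indicative capacity: ~{items_per_timebox:.0f} items per timebox")
```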
Lean and/or Continuous Flow teams will typically track Team Throughput as Lead and Cycle Time metrics.
Cycle Time is the time it takes to actually do a piece of work; Lead Time is the time from the request being raised to the work being completed, so Lead Time = Cycle Time + Queue Time. Both measures are common in Lean implementations and are described in more detail under the Lead and Cycle Time metric below.
We have found that tracking these metrics can identify teams who are working at a steady state, improving or struggling. In terms of Behavior Driven Measurement we have seen the following behaviors emerge:
- Work is broken down into smaller chunks
- Those small chunks are delivered as quickly as possible
- The amount of work in progress (WIP) at any one time is reduced
There is a risk that focusing on speed reduces quality, so this metric, and the behaviors it drives, need to be balanced with one that promotes quality over speed. When balanced in this way, the measurement driven behavior is positive for the teams.
Metric: Release Burnup or Cumulative Burnup
Are we on track or not? When will we be done?
A Burnup is an established metric in agile processes that indicates what the current scope target is (the number of stories or story points in the current release) and then tracks completion towards this target over time. The gradient of this line is driven by the Team Throughput and is usually called the team’s Velocity. Velocity is useful to extrapolate towards the intersection of work getting done and the required scope to indicate a likely point of timebox, release or product completion.
A traditional Burndown is simply an upside-down version of this graph: as work is completed the line goes down as remaining work reduces. We prefer a Burnup because it allows for real life intervening in plans and the scope changing either up or down. In terms of perception, the team can also feel more positive about achieving something as the line goes up when progress is made.
When there is no release or useful timebox (e.g. in a maintenance or purist continuous flow team) then a simple calendar-based timebox (such as a month or quarter) can be used. Instead of a target scope we simply track the cumulative number of created items. This results in a Cumulative Burnup graph.
Cumulative Burnups that count items rather than abstract points can be useful across teams and even across entire development organizations to indicate how well the organization is keeping up with demand, how stable that demand is and how stable the development supply is. Note that the only real difference between a Release Burnup and a Cumulative Burnup is the amount of up-front planning and therefore creation of items. In our experience the number of items in a Release always fluctuates and should never be considered as a fixed value.
Release Burnups can’t be easily aggregated because Release timeboxes will be different for different products. However, simplifying to Cumulative Burnups allows for easy aggregation.
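The data behind a Cumulative Burnup is straightforward to produce: at each sample point, count how many items have been created and how many are done. A minimal sketch with invented dates and a weekly sampling heartbeat:

```python
# Hypothetical items: (created, closed or None) for each work item.
import datetime as dt

items = [
    (dt.date(2024, 1, 3),  dt.date(2024, 1, 17)),
    (dt.date(2024, 1, 5),  dt.date(2024, 1, 24)),
    (dt.date(2024, 1, 10), None),                  # still open
    (dt.date(2024, 1, 16), dt.date(2024, 1, 30)),
]

start = dt.date(2024, 1, 1)
for week in range(5):                              # weekly sampling heartbeat
    sample = start + dt.timedelta(weeks=week)
    created = sum(1 for c, _ in items if c <= sample)
    done = sum(1 for _, d in items if d and d <= sample)
    print(f"{sample}: created={created}, done={done}")
# Plotting 'created' and 'done' against time gives the two lines of the Cumulative Burnup.
```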
In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Burnups:
- A desire to create fewer items
- A desire to close items more quickly
Since this metric emphasizes speed over quality we balance it with the Quality Confidence Metric. To avoid very low priority but slow items never getting done we might also keep an eye on the age distribution of currently open items.
Metric: Lead and Cycle Time
Cycle Time is the time it takes to actually do a piece of work. Lead Time is the time from the request being raised to the work being completed. Lead Time and Cycle Time are measures common in Lean implementations. Lead Time = Cycle Time + Queue Time.
How long does it take to do work? Why does it take so long?
As mentioned in Lean Principles, the most effective way of understanding a value stream is to directly examine it: to literally go and see the activities that take place throughout the organization, the handovers between teams and the products produced. Similarly, the most effective way to identify waste is to simply ask team members in the organization what they think; since they work in these processes day to day, they are well qualified to identify which activities are useful and which are wasteful.
Sometimes a process may appear wasteful from a local perspective but may be meaningful and valuable from a systemic perspective. In this case the value of the wider process requires better communication so that local teams can understand its purpose, which in turn will improve motivation and productivity. We recommend that an organization is open to challenge from individuals and actively encourages people to identify areas of potential improvement. Whether their proposed changes end up being successfully implemented or not is less important than encouraging individuals to continuously improve the organization.
Measurement and Mapping
Lead and Cycle time are useful measures for understanding a value stream and mapping it.
Cycle Time = the time it takes to actually do something
Lead Time = the time taken from the initial request to it finally being done
Lead time can only ever be equal to the cycle time or longer, since a request can’t be fulfilled quicker than the time taken to do the task; lead time includes cycle time. Most tasks in a complex organization (e.g. provisioning a new Continuous Integration Server) typically take far longer to achieve (the Lead Time) than the time to actually implement the request (set up a new virtual server, install software, integrate into the network etc.). Measuring the amount of time spent “doing” vs. “waiting” is the purpose behind Lead and Cycle Time.
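A minimal sketch of that split for a single request, using invented timestamps:

```python
# Hypothetical timestamps: splitting Lead Time into Queue Time and Cycle Time.
from datetime import datetime

requested = datetime(2024, 5, 1, 9, 0)    # request raised
started   = datetime(2024, 5, 20, 9, 0)   # work actually begins
completed = datetime(2024, 5, 23, 17, 0)  # Definition of Done met

lead_time  = completed - requested
cycle_time = completed - started
queue_time = lead_time - cycle_time       # Lead Time = Cycle Time + Queue Time

print(f"Lead time:  {lead_time.days} days")
print(f"Cycle time: {cycle_time.days} days")
print(f"Queue time: {queue_time.days} days of waiting - potential waste")
```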
Measuring activities in this way often reveals the amount of time a request spends in a queue: waiting to be triaged, or sitting on a person’s task list before work starts on it. This waiting time is all potential waste and is often inherent in any handover of an activity between people, teams, departments or businesses. As a result, flow and efficiency can often be improved simply by reducing the number of handovers.
There are many reasons for large “queue time” including over-demand and under-resourcing. However, there are frequently less justifiable reasons for delays including measurement driven behavior that results in teams protecting their Service Level Agreements (SLAs) and/or Key Performance Indicators (KPIs) by “buffering” work to introduce some slack for when things go wrong or to deal with variance. In traditional Project Management this deliberate introduction of waste is called “contingency planning” and is considered a positive practice. Unfortunately, it is especially prevalent when there is a contractual boundary between teams that is focused on SLAs.
These delays, regardless of their root causes, can have significant effects when looking at the system as a whole. If we examine a chain of activities as part of a value stream, measuring each in terms of lead and cycle time, we can draw a picture of a sequential value stream:
In the example above we have 26 days of actual work spread over 60 elapsed working days (12 weeks): 34 of those days, nearly 60% of the elapsed time, are waste. This work could have been done in less than half the time it actually took!
Often planners will look to put some activities in parallel, to reduce contingency by overlapping activities (accepting some risk). Even with these optimizations, if waste inherent in each activity is not addressed there will still be a lot of waste in the system.
By visualizing a value stream in this fashion we can immediately identify waste in the system: in the example above, despite some broad optimizations, we can still see it’s mostly red. In many cases planners aren’t willing to accept the risks inherent in overlapping activities as shown here, or aren’t even aware that they can be overlapped, leading to the more sequential path shown previously. The result is a minimum time for a request to get done, based on the impedance of the current value chain: in this example, 38 days before we even start thinking about actually doing any work. This is a large amount of waste.
What is the baseline impedance in your organization?
Aggregation
Metric: Work In Progress and Cumulative Flow
Teams using Kanban style continuous flow will typically work to strictly enforced Work-In-Progress (WIP) limits to expose bottlenecks and inefficiencies in their workflow. This is intended to help smooth the workflow, reduce work in progress and increase throughput. Enforcing WIP limits can be difficult, however, since work items are never really the same size, and efforts to make them the same size risk fragmentation of work or unnecessary batching.
If bottlenecks occur team members can “swarm” to help their colleagues make progress.
Counting the number of items in each state of a workflow on a regular or continuous basis can be plotted to create a Cumulative Flow graph. These charts show the fluctuations in the workflow and the average amount of time spent in each stage of the workflow – this does not require tracking of individual items through the workflow, only the number of items in each state on the sampling heartbeat.
A Cumulative Flow graph is similar to the Cumulative Burnup except that it has more lines since rather than just tracking created and done, it tracks each state change. An organization wide Cumulative Flow graph can be created if the organization adopts a standard set of Definitions of Done as recommended in Holistic Software Development.
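A minimal sketch of a single sample (the workflow states and item counts are invented); repeating this on the sampling heartbeat and stacking the counts over time produces the Cumulative Flow graph:

```python
# Hypothetical snapshot: count how many items are in each workflow state right now.
from collections import Counter

snapshot = ["Backlog", "Backlog", "Backlog", "In Progress", "In Progress",
            "Test", "Done", "Done", "Done", "Done"]

workflow_states = ["Backlog", "In Progress", "Test", "Done"]   # assumed workflow
counts = Counter(snapshot)
print({state: counts.get(state, 0) for state in workflow_states})
# -> {'Backlog': 3, 'In Progress': 2, 'Test': 1, 'Done': 4}
```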
Metric: Work Item Age
How long have currently open items been waiting? Tracking the age distribution of open work items highlights low priority but slow items that risk never getting done, as noted under Burnups above.
Metric: Work Item Type Distribution
Our project data shows that there are often higher numbers of bugs and changes than requirements; this isn’t necessarily a bad thing. Bugs and Changes are normally smaller in scope and effort than requirements types and so there’ll typically be many bugs per requirement. There’s a difference between Development Bugs and “Escaped Bugs”, which we describe in Bug Frequency. Plotting the distribution of work item types over time typically shows:
- Creation of a mass of detailed requirements types (i.e. Stories)
- Followed by a lag of correlating bugs/changes a few weeks later
Metric: Bug Frequency
How many bugs are there? What’s the quality like?
Quality is more involved than simply counting bugs. However, the number of bugs found over time is a reasonable indicator of quality.
“Development Bugs” are bugs that are found during development – the more of these the better as they’re found before the product gets to a customer. In contrast “Escaped Bugs” are found once the product is live in user environments which is bad. If your data can differentiate internal vs. escaped bugs you can get a sense for how effective your quality practices are at bug hunting.
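A minimal sketch of that split, counting invented bug records per month by where they were found:

```python
# Hypothetical bug records: Development Bugs vs Escaped Bugs per month.
from collections import defaultdict

bugs = [  # (month found, found_in) where found_in is "development" or "live"
    ("2024-01", "development"), ("2024-01", "development"), ("2024-01", "live"),
    ("2024-02", "development"), ("2024-02", "live"), ("2024-02", "live"),
]

frequency = defaultdict(lambda: {"development": 0, "live": 0})
for month, found_in in bugs:
    frequency[month][found_in] += 1

for month in sorted(frequency):
    dev, escaped = frequency[month]["development"], frequency[month]["live"]
    print(f"{month}: development bugs={dev}, escaped bugs={escaped}")
```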
Metric: Quality Confidence
Quality Confidence is a lead indicator for the quality of a software release based on the stability of test pass rates and requirements coverage. Quality Confidence can be implemented at any level of the requirements stack mapped to a definition of done.
Is the output of good enough quality?
Quality confidence combines a number of standard questions together in a single simple measure of the current quality of the release (at any level of integration) before it’s been released. This metric answers the questions:
- How much test coverage have we got?
- What’s the current pass rate?
- How stable are the test results?
Quality Confidence is 100% if all of the in-scope requirements have test coverage and all of those tests have passed for the last few test runs. Alternatively, Quality Confidence will be low if tests are repeatedly failing or there isn’t good test coverage of in-scope requirements. Quality Confidence can be represented as a single value or a trend over time.
Since in Holistic Software Development the Requirements Stack maps explicitly to Definitions of Done, with development at each level brought together via Integration Streams, Quality Confidence can be implemented independently at each level and even used as a quality gate prior to acceptance into an Integration Stream.
A Word of Warning
Quality Confidence is only an indicator of the confidence in the quality of the product and should not be considered a solid, stable measure of quality. Any method of measuring quality based on test cases and test pass/fail results has two flawed assumptions built into it:
- The set of test cases fully exercises the software
  - Our experience shows that code coverage, flow coverage or simple assertions that the “tests cover the code” do not mean that all bugs have been caught, especially in Fringe Cases. We might think that we’ve got reasonable coverage of functionality (and non-functionals) with some test cases, but due to complex emergent behaviors in non-trivial systems we cannot be 100% sure.
- The test cases are accurately defined and will not be interpreted differently by different people
  - Just as with requirements, tests can be understood in different ways by different people. There are numerous examples of individuals interpreting test cases in a diverse number of ways, to the extent that the same set of test cases run against a piece of software by different people can result in radically different test results.
Metrics such as Quality Confidence must be interpreted within the context of these flawed assumptions. As such they are simply useful indicators; if they disagree with the perceptions of the team then the team’s views should take precedence and any differences can be investigated to uncover quality problems. We strongly recommend a Go See first, measure second mentality.
How to calculate quality confidence
To give an indicator of the confidence in the quality of the current release we first need to ensure that the measure is only based on the current in-scope requirements. We then track the tests related to each of these requirements, flagging the requirements that we consider to have “enough” testing, as well as tracking their results over time. The reason we include whether a requirement has enough tests is that we might have a requirement in scope that is difficult to test, or that has historically been a source of many Fringe Cases, and so although it is in scope we might not have confidence that its testing is adequate. Obviously this is a situation to resolve sooner rather than later.
Once we understand the requirements in scope for the current release we can start to think about the quality confidence of each.
A confidence of 100% for a single requirement that is in scope for the current release is achieved when all the tests for that requirement have been run and passed (not just this time but have also passed for the last few runs) and that the requirement has enough coverage. For multiple requirements we simply average (or maybe weighted average) the results across the in-scope requirements set.
We look at not just the current pass results but also previous test runs to see how stable each test is. If a test has failed its last 5 runs but passed this time we don’t want to assert that quality is assured. Instead we use a weighted moving average so that more recent test runs have more influence on the score than older ones, but 100% is only achieved when the last x test runs have passed. The specific number can be tuned based on the frequency of testing and the level of risk.
If we don’t run all the tests during each test run then we can interpolate the quality for each requirement, but we suggest decreasing the confidence for a requirement (multiplying by a cumulative factor of 0.8) for each missing run. Just because a test passed previously doesn’t mean it’s going to pass now.
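A minimal sketch of one possible implementation of this calculation. The window size, weights and 0.8 decay factor are the tunable parameters discussed above; the function names and data are invented for illustration:

```python
# Hypothetical implementation sketch of Quality Confidence for one release.
WINDOW = 5                     # number of recent test runs considered (tunable)
WEIGHTS = [1, 2, 3, 4, 5]      # newer runs have more influence than older ones

def requirement_confidence(recent_runs, has_enough_coverage):
    """recent_runs: oldest-to-newest list of 'pass', 'fail' or None (test not run)."""
    if not has_enough_coverage:
        return 0.0             # no confidence without adequate coverage
    score, weight_total, penalty = 0.0, 0.0, 1.0
    for result, weight in zip(recent_runs[-WINDOW:], WEIGHTS):
        if result is None:
            penalty *= 0.8     # decrease confidence for each missing run
            continue
        weight_total += weight
        score += weight if result == "pass" else 0.0
    return (score / weight_total) * penalty if weight_total else 0.0

def quality_confidence(in_scope_requirements):
    """Simple average of per-requirement confidence across the in-scope set."""
    return sum(requirement_confidence(runs, cov)
               for runs, cov in in_scope_requirements) / len(in_scope_requirements)

in_scope = [
    (["pass", "pass", "pass", "pass", "pass"], True),   # stable and covered -> 1.0
    (["fail", "pass", None, "pass", "pass"], True),     # unstable, one skipped run
]
print(f"Quality Confidence: {quality_confidence(in_scope):.0%}")   # ~87%
```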
To help calibrate these elements (aging, confidence interpolation, and coverage) Quality Confidence can be correlated with the lag measure of escaped bugs. However, in real world implementations fine tuning of these parameters (other than coverage) has been shown to have little impact on actual Quality Confidence scores.
Interpretation
Despite being less than simple to measure, Quality Confidence is quite intuitive to interpret as it is based on the principle of Behavior Driven Measurement. In our experience it tends to be congruent with team members’ ‘gut feel’ for the state of a product, especially when shown over time. Quality Confidence is a useful indicator but is no substitute for direct, honest communication between people.
We encourage telling teams how to “game” metrics to make them look better. In the case of Quality Confidence, the measure can be gamed by adding more tests to a requirement, running tests more often and ensuring they pass regularly. All excellent behaviors for a development team.
Quality Confidence provides a lead indicator for the quality of Releases since we can calculate it before a release goes live. For continuous flow teams we can simply track the Quality Confidence of each individual requirement, change or other work item. Simple averages across calendar cycles give trend information.
Quality Confidence can be aggregated across teams (by averaging) and can also be applied at successive integration levels of the Definition of Done stack for Team-of-Teams work.
A simpler version
If you don’t have tests linked to requirements, then you may want to consider whether you’re testing the right things. Quality Confidence can be simplified to be based purely on tests and test results (ignoring the coverage question above) if the team asserts a level of coverage. In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Quality Confidence:
- Teams increase test coverage against requirements
- Teams test more often
If we balance this metric with the Throughput metrics that promote speed over quality, then the measurement driven behavior is positive for the teams.
Metric: Open and Honest Communication
The most useful measurement process is regular open and honest communication, and the most effective way to understand what is going on in a workflow is to “go and see”. Physically walking a workflow, following a work item through the people and teams that interact with it in some way, is an excellent way of understanding a business value stream in detail.
Counting things and drawing graphs is no substitute for direct open and honest communication.