Metrics and Reporting

The Metrics and Reporting View looks at the various feedback cycles, metrics and reports in Holistic Software Development.

The purpose of measurement is to put evidence into decision making, providing lead indicators of problems and assuring investors of the health of the business. Unfortunately, poorly designed or badly implemented measures can actually have a detrimental effect on the health of the organization.

Metrics vs. Measures

The terms “metric” and “measure” overlap a little.
Measures are concrete objective attributes, such as “Lines of Code (LOC)” – a pure fact but a terrible indicator of software development progress. Metrics are more abstract and a bit more subjective, such as “quality” indicators. We tend to have a sense of what Quality means and yet it’s hard to define hard measures for it.
In HSD we don’t worry too much about the distinction and collectively refer to measures, qualities and attributes as metrics because we wish to use them at a high level rather than objectively measure the internal workings of software development.
Oddly, we’ve found that people trust measures more than metrics even when both the source and the interpretation are questionable. Worse, we’ve found that people tend to trust reports as evidence and dismiss talking to people as anecdotal.
The numbers in tools and reports come from people, and yet they are often trusted more than the opinions of those same people. We’ve seen cases where an aggregate progress metric, combining the % complete figures of several backlogs, was trusted more than the team lead’s opinion that progress wasn’t good enough. That’s entirely the wrong way around.
We’ve mined the data from over 500 diverse projects going back over a decade, which has allowed us to compare the data with project history and results. We’ve learned that some metrics work well as indicators of areas to investigate and that, looking across a significant scale, systemic issues can be found in organizations – but metrics never tell the whole story. An over-reliance on metrics, and an unwillingness to act on evidence of problems, both undermine Metrics and Reporting.


The purpose of reporting is to communicate useful and timely information out from a team (to other teams, Customers and stakeholders, as well as Business Leaders). A common problem in large organizations is the over-burdening of teams with reporting requirements from various sets of stakeholders, creating a governance overload that reduces efficiency as reporting is fragmented.

Since Holistic Software Development combines Strategic Direction, Governance, Portfolio and Programme Management with Product Delivery practices in a single framework, we are able to suggest a limited set of congruent measures that reinforce each other, using the same data and reporting mechanisms wherever possible to reduce this fragmentation and waste.

For any report or metric the following questions should be easily satisfied, otherwise the reporting should be stopped:

  • Who is reading the report or looking at the metric?
  • Are decisions based on the report/metric?
  • Do the readers consider it valuable?
  • Are positive behaviors driven by use of the metric?
  • Is the metric cost effective to capture?

Good measurement and reporting can act as a driver for continuous improvement. HSD incorporates a number of feedback loops so that teams can reflect on their performance and working practices and successively improve.

Clear Communication

The most effective form of reporting metrics is to have regular, open and honest direct communication between the team and its stakeholders. To that end Holistic Software Development puts a focus on Feedback Loops and problem/issue/conflict escalation and resolution before measurement and reporting. However, to make effective decisions we often need hard evidence, so this is balanced by a few specific metrics that are transparently communicated and radiated by teams. The peer-pressure and gaming effects of transparently reporting positive team workflow metrics can create a competitive culture for high performing teams. Although this can be a positive thing, it will become a negative if teams compete with each other at the expense of working together to generate business value.

We believe there is no substitute for direct communication and so we recommend that Business Leaders, including managers and executives, talk directly with teams to understand their progress and issues rather than being unapproachable and only communicated with via reports or, worse, a series of successively remote levels of reporting.

Go See

The Go See practice involves literal physical examination of working practices, communication paths, interactions and products.

There is no better way to understand something than to actually give it a go. Reading reports about what’s going on or looking at diagrams, charts and statistics is no substitute for direct observation and direct engagement. We recommend being wary of reliance on metrics, measures, analytics, reports and other well-intentioned but (at best) secondary forms of evidence. To get the best possible understanding of a software business go and look at it, talk to the people, even try a bit of it.

The series of activities that fit together to ultimately create Business Value in an organization (called a value stream in Lean) often span multiple teams and organization structures meaning that it can be difficult to see how they all fit together and ensure that every activity is actually worthwhile. The best way to understand a value stream, and whether it can be improved, is to go and see it. We recommend that people physically walk the value stream:

  • Identify a new piece of work and follow it through the value stream
  • Attend every meeting, track the activities and note how long it’s static (or in a queue)
  • Don’t attempt to fix it during this period of observation, don’t criticize, just watch and learn.

By physically observing value streams we have found the Go See practice to be extremely powerful in its simplicity. Examination will uncover problems that are sometimes obvious to many but not effectively reported; the practice exposes problems that are sometimes hard for individuals to see due to the fragmented nature of value streams in structured organizations. Through direct observation the pressing issues become obvious, and how to remove waste and blockages often becomes common sense. The complex dynamics of individuals working together often bear little relation to “sensible” plans and diagrams; observing how things really are is a fundamental source of truth.

We recommend that Business Leaders visit team working areas, attend retrospectives, stand ups and planning meetings. We have seen many instances where senior management is unaware of the realities of the organization simply because they rely solely on their metrics and reports. This is often exacerbated in organizations where the reporting of exceptions is seen as a failure.

“Go See” is also an excellent way to understand the quality of products and create rapport between Business Leaders and the people who work with them, encouraging collaboration, improving motivation and driving the Bubble Up practice.

We’ve seen many organizations where the teams don’t know who the senior Business Leaders are, or find them unapproachable. The Go See practice improves leadership visibility, and is free to try. It requires no preparation or effort. If the leaders of an organization won’t spend a few minutes talking to the people who work in the organization, they’re probably the wrong leaders.

Behavior Driven Measurement

Behavior Driven Measurement involves defining target behaviors and then deriving measures that will measure those behaviors. “Gaming” such measures should only cause positive effects in the organization.

Poorly designed or implemented measures can have a detrimental effect on the health of the organization. The reason for this is Measurement Driven Behavior, where perverse behaviors emerge as individuals and teams attempt to make the measures look good. In HSD, we reverse this idea to focus on Behavior Driven Measurement.

A Dysfunctional Example of Measurement Driven Behavior

Measurement driven behavior is hard to avoid because, simply put, what you measure is often what people will do. An example of this causing problems can be found in the use of Service Level Agreements (SLAs) with support teams.

A support team was given an SLA target that all calls must be resolved within 5 minutes. The intention of this was to drive quick problem resolution and identify when the support engineers needed more training (based on the SLAs not being met).

Support engineers felt pressure to perform to this SLA, to the extent that when a support call came in and the customer had more than one question, the customer was asked to call back (and queue again) rather than ask two (or more) questions of the same support engineer – so that the clock could be reset and the SLA wouldn’t be broken.

Clearly this is dysfunctional in terms of providing rapid effective support to customers and indeed it caused a certain amount of reputational damage.

Turning it around: Behavior Driven Measurement

In HSD we look at the desired behaviors first and think about measures that will indicate the desired behavior, be useful and that when “gamed” will result in good behaviors. “Gaming” a measure means behaving specifically to change the measurement, regardless of the aims of the true business function. Unfortunately, humans are prone to gaming, so instead of pretending it won’t happen we embrace it and use it to the advantage of everyone.

In the case of support SLAs we want to know whether the customer receives good support and whether the support engineers have enough training and knowledge to do that effectively. Neither is well measured by setting a time limit. Instead we can focus on value and knowledge. The best source of customer satisfaction data is the customer body: we can either ask customers directly for feedback as part of their support engagement or randomly sample the customer base to get an understanding of their satisfaction.

In terms of knowing whether our support engineers have enough training and knowledge, a time limit is a very poor measure, as support calls are not likely to be about the same subject; typically they cover a diverse range of issues. The best source for this information is in fact the support engineers themselves: simply asking them whether they have enough information and training is a good way of understanding their needs. If all of our support engineers say they are under-skilled on a particular system, then that’s good evidence that training may be necessary.

If talking to people isn’t enough and we actually want to record some measures then we can use Lead and Cycle Time or other Workflow Metrics to track the lifetime of a problem: being raised, then worked on, and finally reaching resolution. These measures are useful for establishing the average amount of time that requests are in a queue vs. being worked on. We often find that the best way to resolve problems more quickly is to reduce the queue times (which tends to raise customer satisfaction), which can be done in many ways, such as providing automated solutions for common issues and adding more support engineers.

Cycle Time is the time it takes to actually do a piece of work. Lead Time is the time from the request being raised to the work being completed. Lead Time and Cycle Time are measures common in Lean implementations. Lead Time = Cycle Time + Queue Time.

Workflow Metrics are indicators of flow and health in an organization. Used to complement direct engagement through the Go See practice, they can help provide evidence for decisions and provide indicators of areas to investigate.

Targets and Incentives

Targets reinforce measurement driven behaviors and can often be counter-productive, as they reduce a focus on Business Value, moving the focus to meeting the target instead.
Setting targets leads to goal-seeking behavior where any behavior that helps hit the target is considered acceptable. At one of our clients an agile mentoring team was required to run a number of maturity assessments on development teams. This requirement became a target of 1 per month. This target caused two dysfunctional behaviors:
  1. A pressure to get maturity assessments done caused the mentoring team to hassle development teams, damaging their reputation as positive enablers
  2. One month there were no assessments done because the various teams couldn’t fit one in. The mentoring team’s management rounded this 0 up to 2 on their monthly report.
We discuss the validity of “maturity assessments” later, in Process Metrics in the Metrics and Reporting View.
Goal seeking involves working backwards from the target to derive actions that will meet the target.  A better approach is to simply tell people what behaviors you’d like them to exhibit.
Incentives have the same effect in terms of goal seeking. They can have a very negative effect on people doing knowledge work (see Motivation) and supplier relationships (see Contracts).
At another client we were asked to examine a contract model, proposed by a supplier, with incentives built into it. The intent of the incentives was to reward the supplier for delivering the scope within initial time estimates. When we compared the various scenarios the supplier proposed, through their own models, side by side, a perverse effect emerged: the supplier would actually earn more money by delivering late, and to a lower level of quality. Although this wasn’t intended by the supplier, or obvious from the model at face value, once exposed the contract model was rightly shut down in favor of a simpler model.
The best incentive for teams is continued motivation; any incentives that are counter to that motivation (in terms of autonomy, mastery and purpose) will be damaging.
We do not recommend the use of targets or incentives.

Feedback Loops


HSD is based on Feedback Loops. The idea of these Feedback Loops is to provide a regular way to examine output against needs and, if necessary, take corrective action. Feedback is the heart of iterative development and the evolution of high quality products.

HSD is a balance between pragmatic lean governance and lightweight iterative and continuous processes with proven organizational patterns, and is indicative not prescriptive. HSD is evidence based, using feedback wherever possible to improve quality. For this reason, quality assurance and verification is expected to be built into every activity. We strongly recommend phased, iterative, agile or continuous delivery models, as these approaches build corrective feedback activity into delivery methods.

Simple iterative feedback loops

HSD is not a simple linear development process and the H-Model is iterative at a number of different levels. At the lowest level, team based development processes are iterative, providing the smallest (and fastest) feedback cycle. These feedback cycles are good for correcting product development but cannot directly change strategy by themselves; they must be connected to higher level feedback cycles.

Metrics can be useful to provide evidence at each feedback loop in terms of progress and health. We can measure the big questions of Governance and Planning, balanced by narratives from the people involved and the Go See practice.

We recommend teams, projects, programmes and the business use regular retrospectives to identify opportunities for improvement and any impediments to success. Impediments or risks can then be addressed directly by the team.

Retrospectives are periodic team based reviews focusing on what went well and what went wrong so that lessons can be learned and processes improved.

Issues that cannot be dealt with by the team can be escalated using the Bubble Up mechanism; the same mechanism can also be used to communicate positive highlights.

The Bubble Up practice is a method for capturing key feedback messages at retrospectives throughout the organization and guaranteeing their escalation to, as well as a response from, Business Leadership.

HSD has nested feedback loops at the team, project and programme integration streams, as well as incorporating feedback as part of Continuous Governance.

Workflow Metrics

Workflow Metrics are indicators of flow and health in an organization. Used to complement direct engagement through the Go See practice, they can help provide evidence for decisions and provide indicators of areas to investigate.

Workflow Metrics are designed to answer Governance and Planning questions.

When designing metrics we need to be very careful that the metrics are evidence based and actually measure the output we’re trying to achieve. All metrics risk causing Measurement Driven Behavior – when people’s attempts to meet the metrics cause unintended, negative behaviors. We recommend Behavior Driven Measurement as an approach to defining metrics. We recommend using the same simple metrics at both team and team-of-teams level so that they can resonate and be easily understood by everyone in the organization.

At executive levels the details of workflow metrics may be too much, however information about the number of exceptions or cross-organization trends might be useful as health indicators on the Executive Dashboard.

When discussing metrics, we often refer to “work items”. Work Items are anything tracked that needs to be done. This broad definition covers requirements, bugs, changes, support tickets and anything else that requires effort. We do not include wish list items, risks (although we do include their mitigation plans), or other non-work items that teams may choose to track.

Metric: Story Points and Velocity

Story Points are an abstract, arbitrary number associated with User Stories that indicates the relative effort (not complexity) required to complete the story development.

Often used in the form of a Fibonacci series (1, 2, 3, 5, 8, 13, 21, 34, …) or a simple integer value, Story Points are intended to indicate the relative effort, not complexity, of low level requirements. Story Points are an estimation method based on picking a well-understood “normal” story, setting its value and then estimating other stories relative to that “normal” story.

Points are an abstract team based indicator and are not comparable across teams. They do not equate to “person-days” or complexity and so are fundamentally unsuitable for use in contractual arrangements.

Points cannot be aggregated meaningfully across teams or up to programme level due to their arbitrary team-based value. For this reason, we strongly recommend against using points at both Programme Backlog and Product Backlog level.

Story Point estimating may be useful within a team to help size Stories for inclusion (or not) in sprints/iterations. This is an in-team private metric that does not make sense outside of the team. Over time story point sizes tend towards lower values as large stories are broken up and more is understood about the work (risks reduce in line with the Cone of Uncertainty). As a result, points are not numerically consistent even within the context of a single team over time. We’ve seen many planning and reporting dysfunctions based on poor understanding and the implied false accuracy of Story Points, such as Project Managers setting a target velocity!

Metric: Velocity

How quickly are we working? When will we be done?

Velocity is how many points are completed over time and is often used, especially in Scrum teams, as a measure of progress towards the total number of points.

Over time teams will gain an understanding of roughly how many points they can deliver in a period of time (or an iteration/sprint). This is called their “Velocity” and can be used to extrapolate remaining effort to completion (ETC). This is the original intention behind story points and is why they do not equate to complexity: some very complex things don’t take long to implement, while some very simple (but large in volume) requirements can take a long time. Velocity will normally vary a little over the lifecycle and will typically be somewhat unstable during the first couple and last few time periods/iterations/sprints.

This means that teams need to throw away the first few iterations of velocity, then establish at least 3 (preferably 5) further iterations to extrapolate something that’s even close to statistically meaningful. That means velocity can only be meaningfully calculated somewhere approaching the 6th or 7th iteration/sprint.
Based on mining the work item data of 500 projects of diverse types, we have found that team velocities are generally pretty stable past the first 3 sprints and until the last 2 or 3. The biggest factor that affects velocity is changing team members; otherwise velocities are pretty stable. Despite this long term stability in velocity, 90% of the projects we mined had over-planned sprints. Often when work wasn’t completed in a previous iteration it was simply added to the next, cumulatively overloading the team. Team velocity never increases just because more work is planned into a timebox.
Numerically, there seems to be no significant difference in our dataset between simple extrapolation of the number of done items on a backlog (per Release) and story point extrapolation, indicating that Story Points are pointless. Simply tracking the % of items done per Release is simpler and easier to communicate.
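Tracking done-item counts per sprint gives the same extrapolation without abstract points. A minimal sketch of the idea, with entirely hypothetical sprint history and backlog size (and the throw-away-early-sprints rule described above):

```python
# Sketch (not from the HSD text): extrapolating sprints-to-complete from
# counts of done items per sprint; all numbers are hypothetical.

def sprints_remaining(done_per_sprint, backlog_size, window=5, discard=3):
    """Estimate sprints left by averaging recent throughput.

    Throws away the first `discard` (unstable) sprints, then averages
    the last `window` sprints of completed-item counts.
    """
    usable = done_per_sprint[discard:]
    if not usable:
        raise ValueError("not enough sprint history to extrapolate")
    recent = usable[-window:]
    velocity = sum(recent) / len(recent)      # items per sprint, not points
    remaining = backlog_size - sum(done_per_sprint)
    return remaining / velocity

# 10 sprints of history against a 120-item backlog
history = [2, 5, 4, 8, 7, 9, 8, 7, 8, 8]
print(sprints_remaining(history, backlog_size=120))  # 6.75 sprints to go
```

Counting items rather than points keeps the numbers meaningful to anyone reading the plan.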

As with any estimate, we recommend presenting it with an Uncertainty Indicator. Extrapolation of progress as effort indicators based on actual activity so far helps teams to answer “how long until it’s done?”. For more on tracking Story Points over time see Workflow Metrics.

Importantly, in our examination of project data (including projects that did, and didn’t use Story Points) we found no statistically significant difference between Velocity and simply the % Complete over time. We recommend against using Story Points and Velocity as a metric because:
  • Abstract numbers are difficult for people and leaders to understand
  • Their meaning changes over time as the Cone of Uncertainty is reduced in teams undermining extrapolation
  • They imply false accuracy by being often small, discrete numbers
  • They offer no additional benefit over simply counting items complete
  • They cannot be aggregated because they are team-defined and not normalized against any standard

We recommend strongly resisting schemes to normalize story points in organizations, or make the same mistakes with other purely abstract measures (such as Business Value Points!)

Metric: Team Throughput

How quickly are we doing things? 

Throughput, or average time to close, is a useful indicator of how quickly the team is getting things done.

If the trend starts to go up, then we have an indicator that work is taking longer. This may be due to fewer resources, more complexity at the moment, or a change in ways of working that has introduced waste and slowed things down. Alternatively, if the trend starts to reduce we can deduce that work is being done more quickly. This may be due to the work being simpler and smaller at the moment, more resources, or a change in ways of working that has improved efficiency and productivity. To accurately understand when a work item has been closed, the Definition of Done must be explicitly understood.

Different work items will be of various sizes in terms of both effort and complexity; however, over both small and very large project and organization datasets we have found that both tend to a normal distribution, which means that tracking the average is a useful indicator. Extrapolating the average time to close to the length of an iteration, sprint or release gives an indicator of how many items can be done in that timebox, and therefore whether plans are realistic or need refining.

This essentially gives a Velocity of “things done” rather than “abstract points”.
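The extrapolation described above can be sketched in a few lines. This is a hypothetical illustration, not an HSD-prescribed formula: the durations and the `parallel_streams` assumption (how many items the team works on concurrently) are invented for the example.

```python
# Sketch: average time-to-close extrapolated to a timebox, giving a
# velocity of "things done". All numbers are hypothetical.
from statistics import mean

def items_per_timebox(close_durations_days, timebox_days, parallel_streams=1):
    """Rough capacity for one timebox from recent time-to-close data.

    Assumes close times tend to a normal distribution (so the mean is
    a fair indicator) and that `parallel_streams` items can be worked
    on concurrently.
    """
    avg_time_to_close = mean(close_durations_days)
    return (timebox_days * parallel_streams) / avg_time_to_close

durations = [1.5, 2.0, 3.0, 2.5, 2.0, 4.0, 3.0]  # recent closed items, in days
print(items_per_timebox(durations, timebox_days=10, parallel_streams=4))
```

If the plan for the next timebox contains significantly more items than this estimate, it is probably over-planned.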

Lean and/or Continuous Flow teams will typically track Team Throughput as Lead and Cycle Time metrics.

Cycle Time is the time it takes to actually do a piece of work. Lead Time is the time from the request being raised to the work being completed. Lead Time and Cycle Time are measures common in Lean implementations. Lead Time = Cycle Time + Queue Time.

We have found that tracking these metrics can identify teams who are working at a steady state, improving or struggling. In terms of Behavior Driven Measurement we have seen the following behaviors emerge:

  • Work is broken down into smaller chunks
  • Those small chunks are delivered as quickly as possible
  • Reducing the amount of work in progress (WIP) at any one time

There is a risk that focusing on speed reduces quality, so this metric, and the resulting behaviors, need to be balanced with one that promotes quality over speed. With that balance in place, the measurement driven behavior is positive for the teams.

Metric: Release Burnup or Cumulative Burnup

Are we on track or not? When will we be done?

A related metric is a Release Burnup (for a product delivery team) or a Cumulative Burnup (for a steady state maintenance team).

A Burnup is an established metric in agile processes that indicates what the current scope target is (the number of stories or story points in the current release) and then tracks completion towards this target over time. The gradient of this line is driven by the Team Throughput and is usually called the team’s Velocity. Velocity is useful to extrapolate towards the intersection of work getting done and the required scope to indicate a likely point of timebox, release or product completion.

A traditional Burndown is simply an upside down version of this graph. We prefer a burnup because it allows for real life intervening in plans and the scope changing either up or down. In terms of perception the team can feel more positive about achieving something as the line goes up as progress is made. In a Burndown, as work is completed the line goes down as remaining work is reduced.

When there is no release or useful timebox (e.g. in a maintenance or purist continuous team) then a simple calendar-based timebox (such as a month or quarter) can be used. Instead of a target scope we simply track the cumulative number of created items. This results in a Cumulative Burnup graph.

An ideal Cumulative Burnup graph will show the number of completed items keeping roughly in line with the number of created items. An unhealthy graph will show the number of items created outrunning the number of items completed. This indicates over-demand on teams.
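The underlying data is just two running totals. A small sketch with hypothetical weekly counts (the gap threshold of 5 is an arbitrary illustration, not an HSD rule):

```python
# Sketch: a Cumulative Burnup as two running totals, created vs completed
# items per week. The weekly counts are hypothetical.
from itertools import accumulate

created_per_week   = [10, 6, 8, 5, 12, 6]
completed_per_week = [ 7, 7, 8, 6,  7, 6]

created_line   = list(accumulate(created_per_week))    # demand
completed_line = list(accumulate(completed_per_week))  # supply

for week, (c, d) in enumerate(zip(created_line, completed_line), start=1):
    gap = c - d
    health = "over-demand?" if gap > 5 else "ok"       # arbitrary threshold
    print(f"week {week}: created {c:2d}, done {d:2d}, gap {gap:2d} ({health})")
```

In this hypothetical data the gap starts widening in week 5, which is exactly the over-demand signal the graph is meant to expose.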


Cumulative Burnups that count items rather than abstract points can be useful across teams and even across entire development organizations to indicate how well the organization is keeping up with demand, how stable that demand is and how stable the development supply is. Note that the only real difference between a Release Burnup and a Cumulative Burnup is the amount of up-front planning and therefore creation of items. In our experience the number of items in a Release always fluctuates and should never be considered as a fixed value.

Release Burnups can’t be easily aggregated because Release timeboxes will be different for different products. However, simplifying to Cumulative Burnups allows for easy aggregation.

In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Burnups:

  • A desire to create fewer items
  • A desire to close items more quickly

Since this metric emphasizes speed over quality we balance it with the Quality Confidence Metric. To avoid very low priority but slow items never getting done we might also keep an eye on the age distribution of currently open items.

Metric: Lead and Cycle Time

Cycle Time is the time it takes to actually do a piece of work. Lead Time is the time from the request being raised to the work being completed. Lead Time and Cycle Time are measures common in Lean implementations. Lead Time = Cycle Time + Queue Time.

How long does it take to do work? Why does it take so long?

As mentioned in Lean Principles, the most effective way of understanding a value stream is to directly examine it, to literally go and see the activities that take place throughout the organization, the hand overs between teams and the products produced. Similarly, the most effective way to identify waste is to simply ask team members in the organization what they think, since they work in these processes day to day, they are well qualified to identify which are useful and which are wasteful.

Sometimes a process may appear wasteful from a local perspective but may be meaningful and valuable from a systemic perspective. In this case the value of the wider process requires better communication so that local teams can understand its purpose, which in turn will improve motivation and productivity. We recommend that an organization is open to challenge from individuals and actively encourages people to identify areas of potential improvement. Whether their proposed changes end up being successfully implemented or not is less important than encouraging individuals to continuously improve the organization.

Measurement and Mapping

Lead and Cycle time are useful measures for understanding a value stream and mapping it.

Cycle Time = the time it takes to actually do something

Lead Time = the time taken from the initial request to it finally being done

Lead time can only ever be equal to the cycle time or longer since a request can’t be done quicker than the amount of time taken to do the task. Lead time includes cycle time. Most tasks in a complex organization (e.g. provisioning a new Continuous Integration Server) typically take far longer to achieve (the Lead Time) than the time to actually implement the request (set up a new virtual server, install software, integrate into network etc.). Measuring the amount of time involved in “doing” vs. “waiting” is the purpose behind Lead and Cycle Time.
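Given three timestamps per work item, the split between “doing” and “waiting” falls out directly. A minimal sketch (the CI-server request and its dates are hypothetical):

```python
# Sketch: deriving Lead, Cycle and Queue Time from three timestamps per
# work item -- raised, started and done. The dates are hypothetical.
from datetime import date

def lead_cycle_queue(raised, started, done):
    lead = (done - raised).days     # request raised -> work completed
    cycle = (done - started).days   # time actually doing the work
    queue = lead - cycle            # Lead Time = Cycle Time + Queue Time
    return lead, cycle, queue

# A server request: raised on the 1st, picked up on the 18th, done on the 21st
lead, cycle, queue = lead_cycle_queue(
    date(2024, 3, 1), date(2024, 3, 18), date(2024, 3, 21))
print(f"lead {lead}d = cycle {cycle}d + queue {queue}d")  # mostly waiting
```

Here 17 of the 20 elapsed days are queue time: the request was mostly waiting, not being worked on.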

Measuring activities in this way often leads to an understanding of the amount of time a request is in a queue: waiting to be triaged, or sitting on a person’s task list before work starts on it. This waiting time is all potential waste and is often inherent in any hand over of an activity between people, teams, departments or businesses. As a result, flow and efficiency can often be improved simply by reducing the number of handovers.

There are many reasons for large “queue time” including over-demand and under-resourcing. However, there are frequently less justifiable reasons for delays including measurement driven behavior that results in teams protecting their Service Level Agreements (SLAs) and/or Key Performance Indicators (KPIs) by “buffering” work to introduce some slack for when things go wrong or to deal with variance. In traditional Project Management this deliberate introduction of waste is called “contingency planning” and is considered a positive practice. Unfortunately, it is especially prevalent when there is a contractual boundary between teams that is focused on SLAs.

These delays, regardless of their root causes can have significant effects when looking at the system as a whole. If we examine a chain of activities as part of a value stream, measuring each in terms of lead and cycle time we can draw a picture of a sequential value stream:

In the example above we have 26 days of actual work spread over an elapsed 60 days (12 weeks), which equals about 60% waste. This work could have been done in less than half the time it actually took!
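The arithmetic behind such a picture can be sketched by summing per-activity lead and cycle times. The individual activity figures below are hypothetical, chosen only to reproduce the 26-days-in-60 example:

```python
# Sketch: waste in a sequential value stream as the share of elapsed time
# not spent working. The (lead, cycle) day-pairs per activity are
# hypothetical, chosen to match the 26-days-of-work-in-60 example.
activities = [(12, 5), (15, 6), (10, 4), (13, 6), (10, 5)]  # (lead, cycle)

elapsed = sum(lead for lead, _ in activities)  # sequential: lead times add up
work    = sum(cycle for _, cycle in activities)
waste   = 1 - work / elapsed

print(f"{work} days of work over {elapsed} elapsed days -> {waste:.0%} waste")
```

Mapping each activity this way makes the red (waiting) portion of the value stream visible and quantifiable.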

Often planners will look to put some activities in parallel, to reduce contingency by overlapping activities (accepting some risk). Even with these optimizations, if waste inherent in each activity is not addressed there will still be a lot of waste in the system.

By visualizing a value stream in this fashion we can immediately identify waste in the system: in the example above, despite some broad optimizations, the picture is still mostly red. In many cases planners aren't willing to accept the risks inherent in overlapping activities as shown here, or aren't even aware that they can be overlapped, leading to the more sequential path shown previously. The result is a minimum time for a request to get done, based on the impedance of the current value chain: in this example, 38 days before any actual work begins. This is a large amount of waste.

What is the baseline impedance in your organization?


It is possible to average lead and cycle values over time, on a team, product or arbitrary basis. Such aggregation provides quite useful information on the general health of departments, or even the entire portfolio.
We recommend applying pressure to drive them downwards and measuring the effect of process improvement against predicted changes. A change that results in lead and cycle times going up (assuming quality is stable) isn't necessarily an improvement.
Tracking more than just the creation time, work start time and close time against activities, requirements or work items means that more than the Lead and Cycle time can be calculated. A Cumulative Flow diagram showing transitions between each state in a workflow can be created, identifying bottlenecks and helping refine work-in-progress limits.

Metric: Work In Progress and Cumulative Flow

Teams using Kanban-style continuous flow will typically work to strictly enforced Work-In-Progress (WIP) limits to expose bottlenecks and inefficiencies in their workflow. This is intended to help smooth the workflow, reduce work in progress and increase throughput. Enforcing WIP limits can be difficult, however, since work items are never really the same size, and efforts to make them the same size risk fragmentation of work, or unnecessary batching.

If bottlenecks occur team members can “swarm” to help their colleagues make progress.

Counting the number of items in each state of a workflow on a regular or continuous basis can be plotted to create a Cumulative Flow graph. These charts show the fluctuations in the workflow and average amount of time in each stage of the workflow – this does not require tracking of individual items through the workflow, only the number of items at each point on the sampling heartbeat.
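A minimal sketch of that counting approach, assuming a simple four-state workflow; the state names and sample counts are illustrative:

```python
# Sketch: building cumulative-flow data from simple per-state counts.
# Only the number of items in each state is sampled on each heartbeat;
# individual items are not tracked. States and counts are illustrative.

STATES = ["To Do", "In Progress", "In Test", "Done"]

def cumulative_flow(snapshots):
    """Turn per-state counts into the stacked bands of a CFD.

    snapshots: list of dicts mapping state -> item count, one per sample.
    Returns one list per state; each value is the count of items in that
    state or any later state (i.e. the band boundaries of the chart).
    """
    bands = {state: [] for state in STATES}
    for counts in snapshots:
        running = 0
        for state in reversed(STATES):      # accumulate from Done upwards
            running += counts.get(state, 0)
            bands[state].append(running)
    return bands

daily = [
    {"To Do": 8, "In Progress": 3, "In Test": 1, "Done": 0},
    {"To Do": 7, "In Progress": 4, "In Test": 2, "Done": 1},
    {"To Do": 9, "In Progress": 4, "In Test": 1, "Done": 4},
]
bands = cumulative_flow(daily)
print(bands["Done"])  # the bottom band: items completed at each sample
```

Plotting each band as a stacked area over the sampling dates gives the familiar Cumulative Flow graph; widening gaps between adjacent bands point at bottlenecks.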

A Cumulative Flow graph is similar to the Cumulative Burnup except that it has more lines since rather than just tracking created and done, it tracks each state change. An organization wide Cumulative Flow graph can be created if the organization adopts a standard set of Definitions of Done as recommended in Holistic Software Development.

Metric: Work Item Age

How long has work stayed open? Are we forgetting things?
Lead and Cycle time are useful for work that's been completed, but the averages of those values ignore the items that are open and have stayed open: the work items created and never done.
Examination of a medium-sized development organization of around 400 people recently showed that although it proudly boasted an average 30-day cycle time, it had several thousand items that were over a year old still outstanding!
Tracking the ages of outstanding open items helps balance the gaming and measurement-driven behavior that a focus on Lead and Cycle times can cause. A behavior can emerge where teams only pick the easy incoming tasks, leaving the hard ones to fall into the “permanently open trap”. Measuring the age of open items helps to counter this behavior.
Of course, poor housekeeping and simply tracking many wish list items can lead to high ages on open items. For this reason, we recommend not tracking wish list items as part of work item metrics. The work item debt caused by poor housekeeping is harder to solve when it’s in the thousands of items.
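A sketch of the age calculation, excluding wish-list items as recommended above; the field names and the one-year threshold are assumptions for illustration:

```python
# Sketch: ages of still-open work items, with wish-list items excluded.
# Field names ("created", "state", "type") are assumptions.
from datetime import date

def open_item_ages(items, today):
    """Return the age in days of each open, non-wish-list item."""
    return [
        (today - item["created"]).days
        for item in items
        if item["state"] == "open" and item.get("type") != "wish"
    ]

backlog = [
    {"created": date(2015, 1, 10), "state": "open", "type": "bug"},
    {"created": date(2016, 3, 1), "state": "closed", "type": "bug"},
    {"created": date(2014, 6, 1), "state": "open", "type": "wish"},
    {"created": date(2016, 2, 20), "state": "open", "type": "story"},
]
ages = open_item_ages(backlog, today=date(2016, 4, 1))
stale = [age for age in ages if age > 365]
print(len(stale), "open item(s) older than a year")
```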
Metric: Work Item Type Distribution
Another interesting metric related to work items is type distribution. Simply counting the number of items of each type over time, or as a snapshot can indicate workflow patterns and areas for investigation or simply provide context for other workflow metrics.
If your organization uses many different customized work item types then we recommend theming into broad categories such as “requirements”, “bugs” and “changes” before analysis.
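That theming step can be sketched as a simple mapping applied before counting; the type-to-theme table here is a hypothetical example, not a definitive taxonomy:

```python
# Sketch: theming customized work item types into broad categories
# before counting the distribution. The mapping is an assumption --
# substitute your organization's own work item types.
from collections import Counter

THEMES = {
    "story": "requirements", "epic": "requirements", "use case": "requirements",
    "defect": "bugs", "incident": "bugs",
    "change request": "changes", "enhancement": "changes",
}

def type_distribution(items):
    """Count work items per broad theme; unmapped types fall into 'other'."""
    return Counter(THEMES.get(item["type"], "other") for item in items)

items = [{"type": t} for t in
         ["story", "defect", "defect", "incident", "change request", "epic", "defect"]]
print(type_distribution(items))
```

Taking this count as a periodic snapshot, rather than once, gives the distribution-over-time view discussed below.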

Our project data shows that there are often higher numbers of bugs and changes than requirements; this isn't necessarily a bad thing. Bugs and Changes are normally smaller in scope and effort than requirements types and so there'll typically be many bugs per requirement. There's a difference between Development Bugs and “Escaped Bugs”, which we describe in Bug Frequency.

Requirements-heavy distributions indicate projects at an early stage of their workflow. Excessively bug-heavy distributions indicate poor quality practices. Excessively change-heavy distributions indicate unstable requirements and therefore too much up-front requirements work.
How these distributions change over time can be particularly informative. For many teams we can identify their release cycles by the changes in their work item distributions when we see a pattern of:
  1. Creation of a mass of detailed requirements types (i.e. Stories)
  2. Followed by a lag of correlating bugs/changes a few weeks later
This is a standard, and healthy, iterative pattern.

Metric: Bug Frequency

How many bugs are there? What’s the quality like?

Quality is more involved than simply counting bugs. However, the number of bugs found over time is a reasonable indicator of quality.

“Development Bugs” are bugs that are found during development – the more of these the better as they’re found before the product gets to a customer. In contrast “Escaped Bugs” are found once the product is live in user environments which is bad. If your data can differentiate internal vs. escaped bugs you can get a sense for how effective your quality practices are at bug hunting.

We track the number of internal and escaped bugs created over time. Generally speaking we can observe a standard pattern of low escaped bugs and high development bugs. If the number of escaped bugs is trending up then there may be a quality problem, even if this correlates with an increase in user numbers. If the number of escaped bugs is trending down, then quality practices are being effective.
Development bugs will fluctuate over time, but excessive numbers indicate that more quality work may be required to prevent bugs rather than catch them afterwards. We’ve observed that the number of bugs (development and escaped) reduces when development and test effort is merged and increases when there are separate development and test teams.

Metric: Quality Confidence

Quality Confidence is a lead indicator for the quality of a software release based on the stability of test pass rates and requirements coverage. Quality Confidence can be implemented at any level of the requirements stack mapped to a definition of done.

Is the output of good enough quality?

Quality confidence combines a number of standard questions together in a single simple measure of the current quality of the release (at any level of integration) before it’s been released. This metric answers the questions:

  • How much test coverage have we got?
  • What’s the current pass rate?
  • How stable are the test results?

Quality Confidence is 100% if all of the in-scope requirements have test coverage and all of those tests have passed for the last few test runs. Alternatively, Quality Confidence will be low if either tests are repeatedly failing or there isn't good test coverage of in-scope requirements. Quality Confidence can be represented as a single value or a trend over time.

Since in Holistic Software Development the Requirements Stack maps explicitly to Definitions of Done, with development at each level brought together via Integration Streams, Quality Confidence can be implemented independently at each level and even used as a quality gate prior to acceptance into an Integration Stream.

A Word of Warning

Quality Confidence is only an indicator of the confidence in the quality of the product and should not be considered a stable, solid measure of quality. Any method of measuring quality based on test cases and test pass/fails has two flawed assumptions built into it:

  1. The set of test cases fully exercises the software
    • Our experience shows that code coverage, flow coverage or simple assertions that the “tests cover the code” does not mean that all bugs have been caught, especially in Fringe Cases. We might think that we’ve got reasonable coverage of functionality (and non-functionals) with some test cases but due to complex emergent behaviors in non-trivial systems we cannot be 100% sure.
  2. The test cases are accurately defined and will not be interpreted differently by different people
    • Just as with requirements, tests can be understood in different ways by different people. There are numerous examples of individuals interpreting test cases in a diverse number of ways to the extent that the same set of test cases run against a piece of software by different people can result in radically different test results.

Metrics such as Quality Confidence must be interpreted within the context of these flawed assumptions. As such they are simply useful indicators; if they disagree with the perceptions of the team then the team's views should take precedence and any differences can be investigated to uncover quality problems. We strongly recommend a Go See first, measure second mentality.

How to calculate quality confidence

To give an indicator of the confidence in the quality of the current release we first need to ensure that the measure is based only on the current in-scope requirements. We then track the tests related to each of these requirements, flagging the requirements that we consider to have “enough” testing, as well as their results over time. The reason we include whether a requirement has enough tests is that we might have a requirement in scope that is difficult to test, or has historically been a source of many Fringe Cases, and so although it is in scope we might not have confidence that its testing is adequate. Obviously this is a situation to resolve sooner rather than later.

Once we understand the requirements in scope for the current release we can start to think about the quality confidence of each.

A confidence of 100% for a single requirement that is in scope for the current release is achieved when all the tests for that requirement have been run and passed (not just this time but have also passed for the last few runs) and that the requirement has enough coverage. For multiple requirements we simply average (or maybe weighted average) the results across the in-scope requirements set.

We look at not just the current pass scores but previous test runs to see how stable the test is. If a test has failed its last 5 runs but passed this time we don't want to assert that quality is assured. Instead we use a weighted moving average so that more recent test runs have more influence on the score than older ones, but 100% is only achieved when the last x test results have passed. The specific number can be tuned based on the frequency of testing and level of risk.

If we don’t run all the tests during each test run then we can interpolate the quality for each requirement but we suggest decreasing the confidence for a requirement (by a cumulative 0.8) for each missing run. Just because a test passed previously doesn’t mean it’s going to still pass now.

To help calibrate these elements (aging, confidence interpolation, and coverage) Quality Confidence can be correlated with the lag measure of escaped bugs. However, in real-world implementations fine tuning of these parameters (other than coverage) has been shown to have little impact on actual Quality Confidence scores.


Despite being less than simple to measure, Quality Confidence is quite intuitive to interpret as it is based on the principle of Behavior Driven Measurement. In our experience it tends to be congruent with team members' gut feel for the state of a product, especially when shown over time. Quality Confidence is a useful indicator but is no substitute for direct, honest communication between people.

We encourage telling teams how to “game” metrics to make them look better. In the case of Quality Confidence, the measure can be gamed by adding more tests to a requirement, running tests more often and ensuring they pass regularly. All excellent behaviors for a development team.

Quality Confidence provides a lead indicator for the quality of Releases since we can calculate it before a release goes live. For continuous flow teams we can simply track the Quality Confidence of each individual requirement, change or other work item. Simple averages across calendar cycles give trend information.

Quality Confidence can be aggregated across teams (by averaging) and can also be applied at successive integration levels of the Definition of Done stack for Team-of-Teams work.

A simpler version

If you don't have tests linked to requirements, then you may want to consider whether you're testing the right thing or not. Quality Confidence can be simplified to be based purely on tests and test results (ignoring the coverage question above) if the team asserts a level of coverage.

In terms of Behavior Driven Measurement we have seen the following behaviors driven by tracking Quality Confidence:

  • Teams increase test coverage against requirements
  • Teams test more often

If we balance this metric with the Throughput metrics that promote speed over quality, then the measurement-driven behavior is positive for the teams.

Metric: Open and Honest Communication

The most useful measurement process is regular open and honest communication, and the most effective way to understand what is going on in a workflow is to “go and see”. Physically walking a workflow, following a work item through the people and teams that interact with it in some way, is an excellent way of understanding a business value stream in detail.

Counting things and drawing graphs is no substitute for direct open and honest communication.

Process Metrics

We often see organizations wanting to measure their processes or process adoption. To that end organizations engage in agile maturity models and assessments. In HSD, we consider these to be third order metrics.

The most important thing is Business Value.

  • First order: The most meaningful measure of progress is working software.
  • Second order: Measurement of intermediary artifacts such as plans and designs.
  • Third order: Measurement of the processes used to create the intermediary artifacts.

The motivation behind measuring process often comes from the desire to see a return on investment in process improvement. However, process rollouts that involve forcing everyone to standardize their working practices and behaviors are a bad idea. Knowing that people are following a process does nothing to tell us whether their output and collaboration is any good.

Return on Investment for processes is tied to the value proposition of the process. For HSD, in our Introduction we said that HSD makes your organization healthier, faster, cheaper and happier. So for HSD adoption, we measure those things – however those are organizational metrics, not process metrics. Adopting HSD helps you get a handle on your organization and gives you the tools to improve it.

We strongly discourage measuring third order process issues such as numbers of teams doing standups, number of Product Forum meetings, spikes done etc. Teams need the freedom to customize their process because one size does not fit all.

We’ve seen a number of generations of process maturity assessments, and been able to compare them against our mined project data set. We’ll explore one of them here from a large client of 1000s of team members:

Agile Maturity Model #1

Teams were asked a set of questions aimed at comparing them against a number of “habits of successful teams”. The questions roughly lined up with asking people about their practices and rating their execution on a numerical scale ranging from 0 (not doing this) to 4 (good enough). A number of variations of this assessment were run over the years, from mentors making the assessment alone, to mentors and team leads co-assessing, to mentor-facilitated group assessments.

There were a number of problems with this model:

  • Iteration was assumed by the questions to the extent that continuous flow teams would score badly.
  • A score of 0 was ambiguous: was the practice not being done by choice (a mature behavior) or not being done due to a lack of knowledge or ability (a low-maturity behavior)?
  • Is “good enough” really the best a team could be? What about striving for perfection or continuous improvement?
  • A number of the questions looked for the existence of intermediary artifacts.

Analysis of the results of this assessment model, in all of its evolutions, against the project data showed that the worst performing teams (in terms of Lead/Cycle time, throughput and quality metrics) got the highest process scores. In contrast, teams with repeatedly low process scores were high performing in terms of workflow metrics.

We investigated further by taking a sample of the projects and investigating them in detail. We found that the following held true:

  • High Agile Maturity score, poor workflow metrics: high scores were caused by teams being unaware of where they could improve; they were unconsciously incompetent. These were unsuccessful projects.
  • Low Agile Maturity score, good workflow metrics: low scores were caused by teams being mature enough to recognize where they could improve. These were successful projects.


So the Agile Maturity Model negatively correlated with real team maturity and project success! In every case we would have picked teams from the second category if we were building a new product. Needless to say, as a result of our investigation the measurement model was stopped.

Some elements of process can be usefully measured but we recommend taking an extremely light touch to such things. Ideally development communities should be encouraged to examine the must-haves themselves. Here’s our starting set:

  • Release Frequency – a frequency of over 3 months sets off alarm bells, continuous or 1 month is best.
  • Build automation – don’t have an automated build process? Get one.
  • Version control – are your code assets stored and resilient? If not, why not.

Organizational Metrics

Organizational Metrics look at the organization as a whole rather than teams or pieces of work. They try to measure the health, shape and happiness of the organization. Top-level Business Value indicators and financial information can also be tracked organizationally.

Metric: Organizational Health

Organizational Health is difficult to measure, as it is dependent on the definition of health for each particular organization. We recommend tracking the trends of top-level Bubble-Up issues, both positive and negative, to provide an indicator of what's important across the organization. Changes in the trends of positive to negative items can indicate the current mood of the organization.
We also recommend asking the organization at large what the top 10 issues are periodically. These top 10 community sourced issues should be reported, transparently, as part of the Executive Dashboard.
Other indicators for organizational health include:
  • Progress against Strategic Goals
  • Recruitment vs. Attrition trends
    • Are the numbers of leavers going up? What are the reasons for people leaving?
  • Independent Holistic Assessment
    • An independent analysis of organizational processes, metrics and leadership can give useful insights that may not be clear to those living in the organization day to day
  • Top level aggregations of workflow metrics, especially Lead and Cycle Time, Work Item Type Distribution and Quality Confidence.

Metric: Organizational Happiness

Happy people are productive people. We think Organizational Happiness is the most important organizational metric.

Based on the work of Professor Martin Seligman, active in the scientific community as a promoter of the field of “positive psychology”, we measure organizational happiness based on the PERMA model. Each of the five parts of the PERMA model is a core element of psychological well-being and happiness.

Positive Emotions: Perhaps the most obvious, but positive emotions are more than just smiling. By focusing on optimism and a positive view of the future, we can measure whether people are feeling good about the present and the future.

Engagement: Indicates how involved we are with our work, colleagues and the organization. We look at how connected people feel to these elements, and whether they consider their input as being valued.

Relationships: Humans are social creatures and so meaningful positive relationships are critical to our happiness. For collaborative work, authentic relationships are the foundation of honesty and communication.

Meaning: Having a purpose, and agreeing with that purpose is a critical component of Motivation. Understanding the reason for the work we need to do, and how what we do contributes to Strategic Goals makes us happier.

Accomplishments: Achieving goals makes people happier. Setting achievable, and meaningful, goals creates a sense of satisfaction.

We recommend running a periodic organization survey asking questions aligned to the PERMA elements. We typically ask 5 questions per element, and then create a Happiness Index for teams, departments and the entire organization.
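A minimal sketch of such a Happiness Index, assuming five questions per PERMA element each scored 1 to 5; the normalization to a 0-100 index is an assumption, not a prescribed scale:

```python
# Sketch: a Happiness Index from a PERMA-aligned survey. Five questions
# per element scored 1-5 (an assumption), averaged per element and then
# overall, normalized to a 0-100 index.
PERMA = ["positive_emotions", "engagement", "relationships",
         "meaning", "accomplishments"]

def happiness_index(responses):
    """responses: list of dicts mapping element -> list of scores (1-5).

    Returns a 0-100 index averaged across all respondents.
    """
    totals = []
    for response in responses:
        element_means = [sum(response[e]) / len(response[e]) for e in PERMA]
        totals.append(sum(element_means) / len(element_means))
    mean = sum(totals) / len(totals)        # mean score on the 1-5 scale
    return round((mean - 1) / 4 * 100, 1)   # normalize to 0-100

survey = [
    {e: [4, 4, 5, 3, 4] for e in PERMA},
    {e: [3, 3, 4, 4, 3] for e in PERMA},
]
print(happiness_index(survey))
```

Grouping responses by team or department before averaging gives the per-team and per-department indices mentioned above.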

Tracking how happiness changes over time, and in response to changes helps improve the overall happiness.

Metric: Workforce Shape

Workforce Shape is a top-level measurement that shows the target and actual counts of managers vs. producers in an organization on one axis and the number of contractors vs. full-time staff on the other.

There should be more producers than managers in an organization; keeping an eye on the ratio between them and ensuring it's healthy can stop resource creep in middle management. Similarly, large organizations tend to have a significant contractor workforce which is flexed over time depending on current needs. Making sure that the balance is right between staff and contractors is part of workforce management.

The Workforce Shape on the right is only for explanation of the measure – it does not indicate a recommended workforce shape! The graph shows a target workforce shape (green) with a roughly equal contractor/staff split and a roughly 1:3 ratio between managers and producers. The ratio of managers to producers is normally around 1:10 or less; the split between contractors and permanent staff is dependent on the business in question.

The actual workforce shape is indicated by the blue overlay – this shows that there is a roughly equal split between managers and producers (likely to be highly dysfunctional) which is below the target ratio. The actual workforce also has slightly too many contractors and not enough staff – although this can often be caused by resource pipeline difficulties, in this example, the ratio is currently below target.

Ideal resource levels are described in the Governance view.

Workforce shape can be measured at team, project, programme and portfolio level.


We generally recommend against regular reporting. Measure the number of readers of each report and you will often find that readership gradually reduces over time.
Reports tell you what happened, and so they are a lag indicator. We prefer building corrective feedback elements into all areas of process to prevent issues rather than report them after the fact. Where reports are useful they should be transparently communicated to the entire organization.
Where possible we try to fit reports onto a single page or dashboard showing the main indicators. People can then drill down from top-level indicators where useful or more importantly find out who to talk with to understand the real story behind a metric (Go See).
We recommend reporting and publishing:
We recommend that Business Leaders commission one-off, deep-dive reports into unusual indicators, major successes or areas of concern. Reporting on successes, to celebrate success and share lessons learned, is just as important as learning from failure.

Executive Dashboard

The Executive Dashboard is the top-level report that combines reporting on progress against strategic goals, development organization health and workforce shape. By providing evidence for Business Leaders to make decisions, the Executive Dashboard helps to generate “pull” throughout the organization.

We recommend creating an Executive Dashboard as a simple single page, no bigger than A3.

The Executive Dashboard should provide a balance across indicators that will generate behaviors across speed, quality, morale, cost and progress aligned with Strategic Direction.

We recommend the following are included on the Executive Dashboard on a monthly cycle:

We recommend that the Executive Dashboard is produced monthly. Where possible reporting should be automated, although we've found that putting a little manual care into report production helps consumption; either way, a monthly snapshot is worth taking.

Remember to periodically check the Executive Dashboard against the checks we mentioned in the Metrics and Reporting View:

  • Who is reading the report or looking at the metric?
  • Are decisions based on the report/metric?
  • Do the readers consider it valuable?
  • Are positive behaviors driven by use of the metric?
  • Is the metric cost effective to capture?

Problems with Metrics and Reporting

Reports and metrics can be useful but they are only ever indicators of reality. We discuss what makes for good evidence in detail in the People chapter under Evidence Based Decision Making, but an important element is that we are confident of its accuracy. One way to improve confidence is to gather corroborating evidence from alternative sources. In simple terms that means we should always back up a metric or report with context and narrative from the people who actually know what’s going on. We strongly recommend that leaders Go See to understand their teams, products and business rather than sit behind a wall of reports and layers of management.
Analyzing a large number of reports against project history we’ve found a number of systemic issues with metrics and reporting:
False Accuracy
This one bites twice!
False accuracy is implied by numerical values, graphs and well produced reporting. The most common general case is presenting data in an over-precise way (e.g. 3.275% happiness improvement). The .275% implies a very precise measurement to be able to differentiate such subtle changes in happiness, which simply doesn’t exist.
Another issue is that numerical representation of non-exact states implies accuracy that doesn’t exist. An example here is agile maturity. As we described earlier, this is a difficult thing to measure anyway. Putting an exact figure on it, such as 3.5 out of 4 implies an accurate form of measurement which doesn’t exist. Worse, averaging such scores across teams to provide an organizational maturity further spreads the misrepresentation of a subtle and subjective assessment as a numerical fact.
As we mentioned in Estimates, uncertainty should always be presented along with numbers, estimates and metrics.
Sampling Frames
Data is only representative when it covers a meaningful amount of the population. We've seen many cases where people extrapolate from a sampling frame of 1 to the entire organization. This kind of projection is a normal human behavior but one that we must be wary of when talking about reporting and metrics.
Just because one team experiences a problem doesn’t mean it’s a systemic issue in the organization. Even if 10 teams experience an issue it might not be systemic in an organization of 1000s.
Comparison against hypotheticals
This one is easy to spot but surprisingly common, especially in Business Cases. We’ve seen a number of examples where people make arguments, or present evidence in terms of a change against a hypothetical situation. This is simply making up evidence to justify a belief. Obvious examples include “… as compared to a normal development team”.
Greenshifting
We've observed a tendency for people to only want to report good news, especially as information is filtered up through an organization through layers of management. Each layer often adds a subtle positive spin to messages until an escalated problem can end up reported as “Green” and healthy at senior management level.
This can be prevented by reducing organizational layers, making all reports transparent and clarifying a cultural direction that Greenshifting is not helpful – it's lying.
Correlation != Causation
Correlation does not equal, or even imply, causation. Just because two figures seem to have changed together doesn’t mean that one change caused the other. For example, if an organization notices a trend that average lead time is reducing across teams at the same time as more office coffee machines have failed than normal that does not mean that lack of coffee decreases lead time. Others might note that during the same time period more superhero movies have been released than previously. Again that doesn’t indicate that superhero movies reduce software lead time.
Silly examples are easy to spot, but often when metrics are presented next to each other such as happiness and lead time they imply a causation relationship. These implications should be challenged as to what evidence there is linking the two metrics.
In this case, there is academic evidence for example: “Happiness and Productivity” by Andrew J Oswald, Eugenio Proto and Daniel Sgroi – University of Warwick.
No matter how good the evidence is, and how well presented, different people are liable to interpret numbers, graphs and written reports in different ways. Learning how to interpret data, statistics and reports is a necessary skill for Business Leaders, but we should never assume everyone will interpret the same metric the same way.
One example we saw in a client organization was a Cumulative Burnup that showed a cumulative 13% increase in created items over the number of completed items:
  1. One leader interpreted this as a nice healthy demand on development services.
  2. Another interpreted this as a runaway over-demand problem in the organization.
  3. Another interpreted it as evidence of significant failure demand in the organization.

Final word on Metrics and Reporting

Measure less, talk more