Why Most Training is Measured at the Wrong Level
Most organisations measure training, but they measure the wrong thing. The Kirkpatrick model (see our guide to the Kirkpatrick model) describes four levels of evaluation: Level 1 Reaction, Level 2 Learning, Level 3 Behaviour, Level 4 Results. Almost all routine measurement stops at Levels 1 and 2.
Level 1 is the satisfaction survey handed out at the end of a session: did you find it useful, would you recommend it, how would you rate the facilitator. Level 2 is the quiz or knowledge check that confirms the content went in. Both are easy to administer, both produce a tidy number, and both are collected while the room is still warm. Neither tells you whether anyone behaved differently once they were back at their desk.
The Kirkpatricks themselves are clear that Level 3, observed behaviour change in real work, is the level that answers the question training is meant to address. Donald Kirkpatrick first set out the four levels in 1959, and Kirkpatrick and Kirkpatrick (2016) reaffirmed that behaviour at work, not reaction to the course, is the measure that matters for any objective beyond pure knowledge transfer. Yet CIPD surveys consistently find that only a small minority of UK organisations measure training at Level 3 at all.
This gap is not an accident, and it is the core of what Sidestream exists to address. Some things cannot be taught; they have to be felt. The same is true of measurement: a satisfaction score tells you how the room felt, not how people act. The evidence is clear: passive measurement and passive learning hardly change anything. A participant can rate a course five out of five and behave exactly as before. A participant can find a session uncomfortable and change how they work for years. Measuring the feeling instead of the behaviour gives you a number that looks like proof but proves nothing.
There is a deeper problem underneath the convenience. Level 1 and Level 2 data are systematically inflated by the Dunning-Kruger pattern: exposure to a topic produces confidence that runs ahead of capability. In Sidestream's own academic behaviour-change work, drawing on research from UCL, Cambridge and Bocconi, participants reported high confidence in newly taught communication skills while their actual performance told a different story. The study only saw the truth because it replaced self-reports with behavioural measurement. That is the whole point. If you ask people how well they can do something, you measure their confidence. If you watch what they do, you measure their behaviour.
What "Measuring Behaviour Change" Actually Means
Measuring behaviour change means answering one question with evidence: are people doing something observably different in their real work as a result of the training? Three words in that sentence carry the weight.
Observably. The thing you measure must be something another person could see and record. "More confident" is not observable. "Raised a concern in the team meeting" is. "Better at coaching" is not observable. "Asked an open question before offering advice" is. The discipline of forcing every target into an observable form is what separates real measurement from wishful self-report.
Different. Difference can only be shown against a reference point. A single measurement taken after training is a snapshot, not a change. To claim change you need a before and an after, which is why a baseline is non-negotiable.
As a result of. Behaviour shifts for many reasons over any given month: a new manager, a reorganisation, a high-profile incident, the season. Attributing a shift to the training, rather than to everything else happening at the same time, is the hardest and most often skipped part of the job.
A Step-by-Step Method for Measuring Behaviour Change
The following five steps turn that principle into a method you can write into a brief or a procurement specification. They map directly onto Kirkpatrick Level 3 and are the structure Sidestream builds into every engagement.
Define the Observable Behaviour Before You Train
Start at the end. Name the single behaviour the training is meant to change, and define it so precisely that a colleague sitting in the same room could count it without asking you what it means. Not "improve psychological safety" but "the number of times a junior team member raises a dissenting view in a clinical handover." Not "stronger leadership" but "the proportion of decisions recorded with a documented rationale."
This is the step the 2016 New World Kirkpatrick update insists on: design backwards from the behaviour you want, rather than running the course and hoping to find something to measure afterwards. If you cannot state the target behaviour as something observable and countable before the training, you will not be able to measure it after.
Capture a Baseline
Measure the target behaviour before the training event. This is the reference point against which any change is read. The baseline can be a frequency count over a fixed period, a set of structured observation ratings, or a sample of work artefacts scored against a rubric. What matters is that it uses exactly the same method you will use afterwards, so the before and after are comparable.
A baseline also protects you from the most common false claim in L&D: presenting a healthy-looking post-training number as evidence of improvement when the number was already that high before anyone was trained. Without a baseline you have no change to report, only a snapshot.
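As a minimal sketch of reading a follow-up count against a baseline, the snippet below turns two frequency counts, taken by the same method, into comparable rates. The function names and the counts are hypothetical illustrations, not Sidestream data.

```python
def rate_per_meeting(events: int, meetings_observed: int) -> float:
    """Frequency of the target behaviour per observed meeting."""
    if meetings_observed <= 0:
        raise ValueError("meetings_observed must be positive")
    return events / meetings_observed


def change_from_baseline(baseline_rate: float, follow_up_rate: float) -> float:
    """Absolute change in rate; positive means the behaviour became more frequent."""
    return follow_up_rate - baseline_rate


# Hypothetical counts: dissenting views raised in observed clinical handovers.
baseline = rate_per_meeting(events=6, meetings_observed=12)
follow_up = rate_per_meeting(events=15, meetings_observed=12)
change = change_from_baseline(baseline, follow_up)
```

Dividing by the number of meetings observed keeps the two figures comparable even if the observation periods differ slightly in length. The follow-up rate on its own would say nothing about change.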
Choose Observable Behavioural Indicators
Translate the target behaviour into indicators that an observer can count or rate in the actual workplace. There are three reliable types: frequency counts (how often the behaviour occurs), structured observation ratings (a trained observer scoring the behaviour against defined criteria), and work artefacts (documents, recordings or records that carry a trace of the behaviour).
Treat self-report as a supplement, never the spine. Asking people whether they now speak up more measures their belief about their behaviour, which the Dunning-Kruger pattern shows is unreliable. Where you do use a validated self-report scale, such as the Edmondson seven-item psychological safety scale, pair it with at least one observed indicator so the two can be read against each other.
Follow Up After the Behaviour Should Have Embedded
Re-measure the same indicators three to six weeks after the event. This window matters. Measure too early and you capture post-course enthusiasm, the temporary lift that fades within days. Measure too late and the signal is buried under unrelated change. Three to six weeks is long enough for a new behaviour to settle into routine and short enough that the training is still the most plausible cause.
Follow-up is where most measurement efforts quietly die, because it requires going back into the workplace after everyone has moved on to the next priority. Building the follow-up date into the engagement at the outset, rather than leaving it to be arranged later, is what turns intention into data.
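One way to fix the follow-up date at the outset is to calculate the window when the training date is agreed. The sketch below is an illustration only: the function name, its defaults and the example date are assumptions, with the three-to-six-week window taken from the text above.

```python
from datetime import date, timedelta


def follow_up_window(
    event_date: date, earliest_weeks: int = 3, latest_weeks: int = 6
) -> tuple[date, date]:
    """Return the first and last dates on which re-measurement should happen."""
    if earliest_weeks > latest_weeks:
        raise ValueError("earliest_weeks must not exceed latest_weeks")
    opens = event_date + timedelta(weeks=earliest_weeks)
    closes = event_date + timedelta(weeks=latest_weeks)
    return opens, closes


# Hypothetical training event on 2 March 2026.
opens, closes = follow_up_window(date(2026, 3, 2))
```

Both dates can then go into the engagement plan alongside the training date itself, so the follow-up is scheduled rather than remembered.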
Isolate the Training Effect
Finally, separate the training's contribution from everything else that could have moved the behaviour. Three practical designs do this without a research laboratory. A comparison group: a similar team that has not yet been trained, measured over the same period. A phased rollout: later cohorts act as a temporary control for earlier ones, which is often the easiest design to justify operationally. A disciplined pre-and-post design: a single group measured before and after, with a deliberate record of any other significant changes over the window so they can be weighed in interpretation.
None of these is perfect, and you do not need perfection. You need enough rigour to answer a sceptical finance director who asks how you know it was the training and not the new team leader who started the same month.
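A common way to read a comparison-group or phased-rollout design is a difference-in-differences calculation: the change in the trained group minus the change in the comparison group over the same period. The method above does not name this technique; it is offered here as a minimal sketch, and the function name and the rates are hypothetical. It assumes the two groups would have moved in parallel without the training, which a small workplace sample can only approximate.

```python
def difference_in_differences(
    trained_before: float,
    trained_after: float,
    comparison_before: float,
    comparison_after: float,
) -> float:
    """Change in the trained group beyond the change seen in the comparison group."""
    trained_change = trained_after - trained_before
    comparison_change = comparison_after - comparison_before
    return trained_change - comparison_change


# Hypothetical speak-up rates per observed meeting.
effect = difference_in_differences(
    trained_before=0.5,
    trained_after=1.25,
    comparison_before=0.5,
    comparison_after=0.75,
)
```

In this illustration the comparison team also improved, perhaps because of the new team leader, so only part of the trained team's shift is attributed to the training. That is the figure to give the sceptical finance director.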
Examples of Level 3 Measures That Work
The abstract method becomes concrete in the specific behavioural indicators chosen for each brief. These are measures Sidestream calibrates to the target of a given engagement during the diagnostic phase. Each one is observable, countable, and tied to a behaviour that matters.
- Speak-up frequency. The number of concerns, challenges or dissenting views raised through formal channels or in observed meetings, post-engagement against a pre-engagement baseline. The clearest behavioural signal of a developing speak-up culture (see our guide to building a speak-up culture).
- Structured peer challenge. The rate at which leaders challenge one another's reasoning in decision meetings, scored by a structured observer. A direct read on whether psychological safety has translated into behaviour.
- Disclosure-response quality. For harassment-prevention work, the quality of the first response a manager gives when a concern is disclosed, rated against a defined rubric. This matters more than completion certificates under the all-reasonable-steps duty.
- Coaching-question frequency. In one-to-one meetings, how often a manager asks an open question before offering a solution, for coaching-skills programmes.
- Decision-documentation quality. The proportion of significant decisions recorded with a documented rationale, for decision-making and governance work.
What unites them is that each is something a colleague or observer can see in real work. None depends on asking the participant how they think they did. That distinction is the whole discipline of behaviour-change measurement.
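To show how one of these indicators becomes a number, the sketch below scores decision-documentation quality from a sample of work artefacts. The `Decision` record and its fields are hypothetical; a real decision log would need its own definition of what counts as a documented rationale.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    title: str
    rationale: str  # empty if no rationale was recorded


def documented_proportion(decisions: list[Decision]) -> float:
    """Share of sampled decisions that carry a non-blank rationale."""
    if not decisions:
        raise ValueError("at least one decision is required")
    documented = sum(1 for d in decisions if d.rationale.strip())
    return documented / len(decisions)


# Hypothetical sample drawn from a decision log.
sample = [
    Decision("Change supplier", "Lower cost at the same service level"),
    Decision("Delay launch", "Testing incomplete"),
    Decision("Hire contractor", ""),
    Decision("Close regional office", "Lease expiry and low footfall"),
]
proportion = documented_proportion(sample)
```

The same function applied to a pre-engagement sample gives the baseline, so the before and after are scored by exactly the same rule.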
Common Pitfalls to Avoid
Pitfall 1: measuring satisfaction and calling it effectiveness. The smile-sheet score is a measure of the experience, not the outcome. Reporting Level 1 data as evidence of training effectiveness is the single most common error in L&D, and it is the one this guide exists to correct.
Pitfall 2: relying on self-report alone. Asking people whether they have changed measures their confidence about changing, which runs systematically ahead of their actual behaviour. Always anchor at least one indicator in observation.
Pitfall 3: no baseline. Without a before, an after cannot show change. A strong post-training number means nothing unless you can demonstrate the behaviour was weaker beforehand.
Pitfall 4: measuring too soon. Data collected on the day or the day after captures enthusiasm, not embedded behaviour. The post-course high is real but temporary, and mistaking it for change flatters the result.
Pitfall 5: skipping isolation. Claiming a behaviour shift was caused by the training without any comparison or controlled design invites the obvious challenge: how do you know it was not the reorganisation, the new leader or the recent incident? A design that isolates the effect is what makes the claim defensible.
Pitfall 6: choosing what is easy to measure over what matters. If the only thing you can count is attendance, you will report attendance. The method runs in the right order for a reason: decide the behaviour that matters first, then build the measurement to fit it, never the other way round.
How This Connects to the Way Training is Designed
Measurement and design cannot be separated. Level 3 is rarely measured for the same reason that much training is built to produce good satisfaction scores rather than behaviour change: the format and the measure reinforce each other. A passive format that teaches by talking at people is comfortable to sit through and easy to rate highly, and it is measured with the survey it was built to please.
This is where the choice of method shows up in the numbers. In Sidestream's own behaviour-change research, immersive role-play was around twenty per cent more effective than passive modalities such as slide-shows and video e-learning at building communication skills, and that gap was only visible because the study measured behaviour rather than self-reported confidence. Passive measurement and passive learning hide the difference; behavioural measurement reveals it. Our guide to immersive training versus e-learning maps the two methods against the Kirkpatrick levels in detail.
Building the measurement infrastructure into the engagement from the start (the baseline, the observable indicators, the follow-up window and the isolation design) is what makes Level 3 evidence possible rather than an afterthought bolted on once the course is over. For the wider UK picture on how rarely this is done, and what is changing, see our analysis of the state of behaviour change in the UK in 2026.
Real behaviour change happens through lived experience, and so does real proof of it. Get in touch today. We are Sidestream.
Related Sidestream Guides
- What is the Kirkpatrick Model? The four levels of training evaluation explained
- Immersive Training vs E-Learning: the two methods mapped against the Kirkpatrick levels
- The State of Behaviour Change in the UK: 2026 Data
- What is the Dunning-Kruger Effect? Why self-report data is unreliable
- Behaviour Change Training: The Complete UK Guide
- Glossary: 100 Behaviour Change Terms
Frequently Asked Questions
How do you measure behaviour change from training?
Define an observable behaviour, capture a baseline before training, re-measure the same indicators three to six weeks afterwards, and isolate the training effect with a comparison group or phased rollout. This is Kirkpatrick Level 3 measurement: observed behaviour in real work, not satisfaction scores.
What is the difference between measuring at Level 1 and Level 3?
Level 1 measures how participants felt about the session through an end-of-course survey. Level 3 measures whether they behaved differently at work weeks later. The two are often uncorrelated, so a high satisfaction score is not evidence that behaviour changed.
What are examples of Level 3 behaviour-change measures?
Speak-up frequency before and after, the rate of structured peer challenge in meetings, disclosure-response quality in harassment cases, coaching-question frequency, and decision-documentation quality. Each is a behaviour an observer can count or rate in real work rather than something participants report about themselves.
Why is behaviour change so rarely measured?
Because it is harder than a survey. Level 3 needs a baseline, workplace observation, follow-up weeks later, and a method to separate the training effect from other influences. Most providers do not offer this and most procurement specifications do not ask for it.
How do you isolate the effect of training on behaviour?
Use a comparison group that is not trained, a phased rollout where later cohorts act as a temporary control, or a disciplined pre-and-post design that records other changes over the same period. The aim is to show the behaviour shift tracks the training, not an unrelated event.