Reproducibility: What is it and How to Calculate it

Key Takeaways: Reproducibility

Definition: The variation in measurements of the same measurand under *changed* conditions (e.g., different operators, different days, or different equipment).
The Goal: To determine the long-term stability and consistency of the measurement process.
Calculation: Determined by calculating the standard deviation of the means or results from different sets of conditions.
Impact: It is a major component of Type A uncertainty and often accounts for human and environmental variability.
GUM Compliance: Helps satisfy the JCGM 100:2008 requirement for identifying all significant components of uncertainty.

Introduction

Reproducibility is an important contributor to measurement uncertainty.

It is a Type A uncertainty that should be included in every uncertainty budget. However, many people neglect to evaluate it. If you evaluate measurement repeatability, then you should evaluate reproducibility too.

Most people are familiar with repeatability and less familiar with reproducibility. When labs find out (during an assessment) that their uncertainty budgets should include reproducibility, I get a lot of questions.

So, I decided to create a complete guide to reproducibility.

In this guide, I am going to cover everything you need to know about reproducibility, including:

What is Reproducibility
Why Reproducibility is Important
Reproducibility Testing Scheme
Reproducibility Conditions of Measurement
How to Calculate Reproducibility

If you need to evaluate reproducibility for your measurement uncertainty analysis, then keep reading. This guide is going to help you.

If you only want to know how to evaluate the results, then jump ahead to “How to Calculate Reproducibility.” I have included step-by-step instructions to make the process easy for you.

What is Reproducibility

According to the Vocabulary in Metrology, reproducibility is measurement precision under reproducibility conditions of measurement.

In the image below, you will see the definition (2.25) from the Vocabulary in Metrology.

To better understand the definition of reproducibility, focus on the key phrase “reproducibility conditions of measurement.” I will tell you more about it later in this document.

Why Reproducibility is Important

Reproducibility (in my opinion) is a better Type A evaluation of performance than repeatability. Repeatability evaluates short-term performance variability, while reproducibility evaluates long-term performance variability under the various conditions encountered by the laboratory over time.

Over time, a laboratory will perform its testing or calibration activities under various conditions of measurement, such as different days, operators, methods, and equipment.

By evaluating these conditions, you can get a better estimate of measurement uncertainty for the laboratory’s activities.

Therefore, it is very important to perform reproducibility testing.

Reproducibility Conditions of Measurement

In the definition of reproducibility, you will notice the phrase “reproducibility conditions of measurement.”

This is important because it helps you understand the difference between repeatability and reproducibility testing.

According to definition 2.24 of the Vocabulary in Metrology (VIM), a reproducibility test can involve the following conditions:

Different Procedures,
Different Operators,
Different Measuring Systems,
Different Operating Conditions,
Different Locations, and
Replicate Measurements on the Same or Similar Objects

In the image below, you will see the definition of reproducibility conditions of measurement from the VIM.

Unlike repeatability, where all conditions of measurement are the same, reproducibility requires the conditions of measurement to be different. Therefore, you will need to change factors that significantly contribute to measurement uncertainty.

However, I only recommend evaluating one condition at a time (per ISO 5725-3) to avoid confounding results.

What Reproducibility Conditions Should You Evaluate?

In the table below, you will find the most commonly evaluated reproducibility conditions. Also, I included valuable information to help you pick the condition that is best for you.

In the sections below, I have provided more details about each reproducibility condition.

1. Different Operators/Technicians

This is the most recommended condition to change for reproducibility testing. Some of the largest uncertainties occur from the inconsistencies between operators.

This option is best (for most labs) when there is more than one qualified technician. Therefore, it is recommended to pick two or more qualified technicians and have them independently perform the test or measurement. Then, their results can be evaluated to determine operator-to-operator reproducibility.

2. Different Days

This reproducibility condition is best for labs with only one qualified operator and one measurement system. Typically, it is the method recommended for single-operator labs.

When testing this condition, you will want to perform the test or measurement on two or more different days. For example, a technician will perform the same test or measurement on Monday, Tuesday, and Wednesday. Then, the results can be evaluated to determine day-to-day reproducibility.

3. Different Methods/Procedures

This condition is best for labs that regularly use more than one method for their testing or calibration activities. This will allow you to evaluate the intermediate precision of selecting different methods.

Evaluating the reproducibility between methods can be helpful. However, this condition is typically overlooked even though it is more common than most people think.

For example, here are three common scenarios:

Example 1: A calibration laboratory that has two different procedures (using the same method of comparison) to calibrate a pressure gauge.
Example 2: A chemical testing laboratory that prepares solutions using either the gravimetric or volumetric method.
Example 3: A microbiology laboratory that inoculates plates with different culture media.

4. Different Equipment

This condition is best for labs with multiple (similar) measurement systems or workstations. In this scenario, you are evaluating the uncertainty associated with the random selection of a measurement system or workstation.

This option is great for laboratories with two or more similar measurement systems. However, you may want to consider evaluating operator-to-operator reproducibility.

In my experience, the uncertainty associated with different operators is likely to be larger than the uncertainty associated with different measurement systems.

5. Different Environments

This condition is best for labs that perform testing and/or calibration activities both in the laboratory and in the field (i.e., at the customer site).

It can help you evaluate the uncertainty between controlled and uncontrolled environments.

However, most labs create two sets of uncertainty budgets: one set that evaluates uncertainty for measurements performed in the laboratory, and another set for measurements performed in the field.

This way, they can showcase their (typically) better measurement capabilities in the laboratory versus their field activities.

Other Conditions

In the table below, you will see an excerpt from ISO 5725-3 that lists conditions of measurement (i.e., factors) that you can evaluate for reproducibility testing.

Many of the conditions I previously listed are included in this table. However, there are a few more I did not cover that you may be interested in.

Below, you will see another table from ISO 5725-3 that provides rationale for specific conditions of measurement.

Reproducibility Testing Scheme

One-factor balanced experiment design

When you need to carry out a reproducibility test, you should use an experiment design. It will help you control your testing scheme and ensure consistent results that can be easily evaluated.

Additionally, an experiment design will help you replicate your reproducibility testing when you need to repeat it in the future.

Most accredited labs will only test one factor at a time. Therefore, I recommend using a one-factor balanced fully nested experiment design.

It is a simple experiment design where you will need to specify the following:

Level 1: Measurement function and value (to evaluate),
Level 2: Reproducibility Conditions (to evaluate), and
Level 3: Number of repeated measurements (under each condition)

In the image below, you will see a visual representation of this experiment design that shows the levels and parameters of the scheme.

The gray boxes represent the ability to expand the experiment to add additional conditions or additional samples under each condition.

Example of one-factor balanced experiment design

Hopefully, the previous image is helpful. However, just in case you need more, here is an example of a common setup for a repeatability and reproducibility testing scheme.

Level 1: Measurement function and value – 1 in Gage Block with a Caliper
Level 2: Reproducibility Condition – Operators, and
Level 3: Number of repeated measurements – 10 each
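
If you record your data electronically, the three levels above map naturally onto a simple data layout. Here is a minimal Python sketch, assuming two hypothetical operators and made-up readings (none of these names or values come from a real test). The `is_balanced` helper is my own illustration for checking that every condition has the same number of repeated measurements.

```python
# Hypothetical one-factor balanced layout (names and readings are illustrative only).
scheme = {
    "measurement": "1 in gage block with a caliper",  # Level 1
    "condition": "operator",                          # Level 2
    "repeats": 10,                                    # Level 3
}

# One list of repeated readings (in inches) per operator.
readings = {
    "Operator A": [1.0001, 1.0002, 1.0000, 1.0001, 1.0003,
                   1.0002, 1.0001, 1.0000, 1.0002, 1.0001],
    "Operator B": [1.0004, 1.0005, 1.0003, 1.0004, 1.0006,
                   1.0004, 1.0005, 1.0003, 1.0004, 1.0005],
}


def is_balanced(data, repeats):
    """Return True when every condition has exactly `repeats` readings."""
    return all(len(values) == repeats for values in data.values())


balanced = is_balanced(readings, scheme["repeats"])
```

Keeping the design balanced (the same number of readings under every condition) is what makes the later calculations straightforward.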

In the image below, you will see an example of how I typically set up a repeatability and reproducibility testing scheme. For your benefit, I marked up the image with details to help you see the parameters of each level in the scheme.

Hopefully, you find this example helpful.

If you set up your repeatability and reproducibility testing schemes this way, it will make analyzing the results much easier. In the next section, I will show you three different methods you can use to calculate reproducibility.

How to Calculate Reproducibility

There are several methods to calculate reproducibility.

If you were to ask several experts how to evaluate reproducibility, you would likely get a variety of responses based on their experience and expertise.

Personally, I prefer to evaluate reproducibility as a standard deviation. This is based on the definitions of reproducibility in the Vocabulary in Metrology and ISO 5725-1. Both documents refer to reproducibility as a standard deviation.

In the image below, you will see the definition of reproducibility from ISO 5725. As you can see, the standard refers to reproducibility as a “standard deviation.”

The viewpoint of reproducibility as a standard deviation is further supported by the GUM (JCGM 100:2008).

Section 4.2 – covers the evaluation of Type A uncertainties and refers to estimating the experimental standard deviation.
Annex B.2.16 – gives another definition for reproducibility where Note 3 states it can be expressed in terms of the dispersion characteristics of the results.
Annex H.5 – provides an example of the analysis of variance that shows how to perform repeatability and reproducibility evaluations.
Examples – Many of the examples in the GUM express Type A uncertainties as variances (the standard deviation is the square root of the variance).

However, to be fair, there are a lot of statistical treatments that can be applied to experimental data. Several definitions of reproducibility include notes stating “dispersion characteristics of the results,” or similar.

Therefore, if you research “measures of statistical dispersion,” you can find many types of evaluations, such as:

Range,
Standard deviation,
Variance,
Coefficient of Variation (CV), and
many more.
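
To make these measures concrete, here is a minimal Python sketch that computes each one for a single data set. The readings are hypothetical values I made up for illustration, and the variable names are my own. It uses the sample (n − 1) forms of the standard deviation and variance from Python's built-in `statistics` module.

```python
import statistics

# Hypothetical readings (illustration only, not real measurement data).
values = [10.1, 10.3, 9.9, 10.2, 10.0]

value_range = max(values) - min(values)                # range
std_dev = statistics.stdev(values)                     # sample standard deviation (n - 1)
variance = statistics.variance(values)                 # sample variance (n - 1)
cv_percent = 100 * std_dev / statistics.mean(values)   # coefficient of variation, in %

print(f"range={value_range:.3f}, s={std_dev:.4f}, "
      f"s^2={variance:.4f}, CV={cv_percent:.2f}%")
```

All four numbers describe the same spread in the data. They differ only in units and scaling, which is why it matters to note which one you report.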

In the sections below, I am going to show you how to calculate reproducibility as both a standard deviation and a range. Each of the methods given is supported by ISO standards, which should serve as objective evidence if your evaluations ever come into question (this is common).

Whichever method you decide to use, make sure you include a note in your uncertainty budgets that specifies the method you used and the reference document it came from. This helps assessors and other observers confirm that you used appropriate methods to evaluate measurement uncertainty.

Reproducibility per ISO 5725-3

The most common method used to calculate reproducibility can be found in ISO 5725-3.

The document specifies how to calculate the reproducibility standard deviation by evaluating intermediate precision.

This is done by evaluating one reproducibility condition of measurement at a time. Most of the time, reproducibility between operators (e.g. technicians) is evaluated. However, other conditions may be evaluated based on a laboratory’s operations and available resources.

Now, this evaluation of reproducibility is considered easy for most people. On a 5-point difficulty scale (where 1 is easy and 5 is hard), I rate this evaluation as a 2 (for most people).

Even though the evaluation is considered easy, I have broken down the process into simpler steps to help you perform the calculations.

Read the sections below to evaluate reproducibility.

Calculate Reproducibility (ISO 5725-3) Step-by-Step

Follow the instructions below to calculate reproducibility per ISO 5725-3:

Select the test or measurement function to evaluate,
Determine the requirements to conduct the test or measurement,
Determine the reproducibility condition to evaluate,
Perform the test or measurement under:

condition A,
condition B,
if applicable, additional conditions, and

Evaluate the results.

Select the Test or Measurement Function

Pick a test or calibration measurement function to evaluate the reproducibility of results. Typically, you will want to pick a parameter from the laboratory’s scope of accreditation because this process will be used to evaluate Type A uncertainties for an uncertainty analysis for ISO/IEC 17025 accreditation.

Requirements to Conduct Test or Measurement

Determine the requirements to perform the test or measurement. This can include, but is not limited to, the following:

Personnel,
Equipment,
Reference Standards,
Method,
Environmental Conditions,
Item Under Test,
etc.

Reproducibility Conditions

Determine the reproducibility condition to evaluate. This can include, but is not limited to, the following conditions:

Operators,
Days,
Equipment or Standards,
Methods,
Environmental Conditions

The most commonly recommended condition to evaluate is the reproducibility between operators. However, choose the condition that is most appropriate for your laboratory.

According to ISO 5725-3, it is best to evaluate one condition at a time. Otherwise, you may end up with results that are confounded (i.e., mixed up so that the effects of each contributor are too difficult to tell apart).

If you want to evaluate more than one condition (at a time), you may want to consider using a Full Factorial experiment design. It will allow you to efficiently evaluate more than one condition while allowing you to evaluate their independent effects and interactions with other factors.
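
As a rough illustration of what a full factorial run list looks like, here is a minimal Python sketch. The factor names, levels, and number of repeats are hypothetical assumptions of mine, not requirements from any standard. It simply enumerates every combination of the chosen conditions so that each one gets the same number of repeated measurements.

```python
from itertools import product

# Hypothetical factors and levels (illustration only).
operators = ["Operator A", "Operator B"]
days = ["Day 1", "Day 2", "Day 3"]
repeats = 3  # repeated measurements per operator/day combination

# Every operator is paired with every day, and each pair is repeated.
runs = [
    (operator, day, repeat)
    for operator, day in product(operators, days)
    for repeat in range(1, repeats + 1)
]

for operator, day, repeat in runs:
    print(operator, day, f"repeat {repeat}")
```

With 2 operators, 3 days, and 3 repeats, this gives 18 runs. The list grows quickly as you add factors, which is one reason many labs stick to one condition at a time.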

Performing the Test or Measurement

Independently perform the test or measurement under each condition. Make sure that the test or measurement is independently done, from start to finish, for each condition.

Some professionals call this a ‘true replicate.’ While there is no official definition for this term, you can find it used in many papers and presentations.

Evaluate the Results

Evaluate the results of the reproducibility test by calculating the standard deviation of the results under different conditions.

The image below shows the standard deviation formula from ISO 5725-3, section 6.2.1, which is the simplest approach to evaluating intermediate precision (i.e., reproducibility standard deviation) within one laboratory.

You can perform this evaluation in Microsoft Excel or Google Sheets using the following formula:

FORMULA

=STDEV.S(cell₁:cell_n)
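
For reference, STDEV.S computes the sample standard deviation, which divides by n − 1 rather than n. Written out, using my own notation (the symbols may not match those used in the standard):

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

Here, y_i are the n results obtained under the different conditions and ȳ is their mean.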

Tip for Evaluating Reproducibility Results

Many times, repeatability and reproducibility are evaluated at the same time. If you are evaluating both (at the same time), then calculate reproducibility by calculating the:

Mean or average of each data set, and
Standard deviation of the means (averages) from the previous step.

Reproducibility Example

For example, imagine two technicians independently perform the same measurement 10 times each. To determine reproducibility, first calculate the mean or average of each technician’s results. This should give you two results: an average result for each technician.

Next, calculate the standard deviation of the two calculated average values. The result will be the reproducibility between operators.
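
Here is a minimal Python sketch of that example, assuming two hypothetical technicians with made-up readings (the values and variable names are mine, for illustration only). It takes the mean of each technician's results and then the sample standard deviation of those two means.

```python
import statistics

# Hypothetical readings in inches, 10 per technician (illustration only).
tech_a = [1.0001, 1.0002, 1.0000, 1.0001, 1.0003,
          1.0002, 1.0001, 1.0000, 1.0002, 1.0001]
tech_b = [1.0004, 1.0005, 1.0003, 1.0004, 1.0006,
          1.0004, 1.0005, 1.0003, 1.0004, 1.0005]

# Step 1: mean of each technician's results.
mean_a = statistics.mean(tech_a)
mean_b = statistics.mean(tech_b)

# Step 2: sample standard deviation of the two means (same as STDEV.S).
reproducibility = statistics.stdev([mean_a, mean_b])

print(f"mean A = {mean_a:.5f}, mean B = {mean_b:.5f}")
print(f"operator-to-operator reproducibility = {reproducibility:.6f}")
```

With only two means, the result equals the absolute difference between them divided by the square root of 2, which is a quick way to sanity-check your spreadsheet.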

Reproducibility per ISO 5725-2

ISO 5725-2 gives a technique for determining the repeatability and reproducibility standard deviations of a measurement method. ISO 21748 also uses this same technique to evaluate reproducibility.

When a laboratory needs to evaluate the measurement uncertainty of a measurement method, this technique is used to:

Determine repeatability and reproducibility, or
Determine the homogeneity of materials.

Now, this evaluation of reproducibility is not easy for most people. On a 5-point difficulty scale (where 1 is easy and 5 is hard), I rate this evaluation as a 4 (for most people).

Therefore, I have broken down the process into simpler steps to help you perform the calculations.

Read the sections below to evaluate reproducibility.

Calculate Reproducibility (ISO 5725-2) Step-by-Step

Follow the instructions below to calculate reproducibility per ISO 5725-2:

Select the test or measurement function to evaluate,
Determine the requirements to conduct the test or measurement,
Determine the reproducibility condition to evaluate,
Each participant shall independently perform the measurement(s),
Evaluate the results:

Calculate the Mean Square within groups,
Calculate the Mean Square between groups, and
Calculate Reproducibility.
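
To show how the three evaluation steps fit together, here is a minimal Python sketch of a one-way, balanced analysis of variance. It assumes every participant made the same number of measurements. The function name and the test data are my own, for illustration only. It uses the common relationship where the between-group variance is (MS between − MS within) / n, set to zero if negative, and the reproducibility variance is the sum of the between-group and within-group variances.

```python
import statistics


def reproducibility_anova(groups):
    """One-way balanced ANOVA; returns (s_r, s_L, s_R) as standard deviations."""
    p = len(groups)        # number of groups (e.g., participants)
    n = len(groups[0])     # measurements per group
    if p < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 balanced groups with at least 2 readings each")

    group_means = [statistics.mean(g) for g in groups]
    grand_mean = statistics.mean(group_means)

    # Mean square within groups (repeatability variance).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_within = ss_within / (p * (n - 1))

    # Mean square between groups.
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    ms_between = ss_between / (p - 1)

    # Between-group variance, clamped at zero if the estimate is negative.
    var_between = max((ms_between - ms_within) / n, 0.0)

    s_r = ms_within ** 0.5                    # repeatability standard deviation
    s_L = var_between ** 0.5                  # between-group standard deviation
    s_R = (var_between + ms_within) ** 0.5    # reproducibility standard deviation
    return s_r, s_L, s_R


# Hypothetical data: two participants, three measurements each.
s_r, s_L, s_R = reproducibility_anova([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f"s_r = {s_r:.4f}, s_L = {s_L:.4f}, s_R = {s_R:.4f}")
```

Check the formulas against your copy of the standard before using this in an uncertainty budget, since this sketch only covers the balanced, single-level case.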