Reliability concerns how consistently an assessment does its job. If we assess a group of people today and get one set of results, then assess them next month and get a totally different set, this suggests a problem with the reliability of our assessment method.

If we know an assessment of “diagrammatic reasoning” is reliable, we know that we should get the same type of results reasonably consistently. The reliability of a test is a quality assurance that the designers of the test have taken steps to reduce the amount of error arising from ambiguous items, poor administration and so on.

It is important to remember that a reliable test may or may not be valid.

Types of Reliability

There are three main methods of establishing reliability:

»     Test-Retest: This involves repeating the assessment after some time has elapsed

»     Alternate Form: This was used in the early days of psychometrics and involved comparing results from two versions of the same assessment

»     Internal Consistency: This is the most commonly reported type of reliability today. It relies on statistical software to analyse how including or excluding different items affects the reliability of the test.
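The item-by-item analysis described under Internal Consistency can be sketched in a few lines. The article does not name a particular statistic, so this sketch assumes Cronbach's alpha, one widely used internal consistency coefficient. The function names and the score matrix are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: one row per candidate, each row a list of item scores.
    Assumes at least two items and at least two candidates.
    """
    k = len(scores[0])
    # Variance of each item across candidates
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the candidates' total scores
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item left out in turn."""
    k = len(scores[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in scores])
        for i in range(k)
    ]


# Hypothetical data: 5 candidates, 4 items scored 1 to 5
data = [
    [4, 5, 4, 2],
    [3, 3, 3, 5],
    [5, 5, 4, 1],
    [2, 2, 3, 4],
    [4, 4, 5, 3],
]
print(cronbach_alpha(data))
print(alpha_if_item_deleted(data))
```

If alpha rises noticeably when an item is left out, that item is likely pulling against the rest of the scale and may deserve a second look.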

How Reliable Should an Assessment Be?

The practical importance of reliability is that it has implications for the range of error we make in measurement. However we measure psychological characteristics, there will always be error. As assessors, we want to minimise the sources of error and the range of error we might expect to occur in our assessments. The lower the reliability of a test, the higher the error is likely to be. The size of error is represented by the standard error of measurement (SEm).


Interviews are famed for having low reliabilities. We therefore expect psychometric assessments which are standardised to be more accurate. As a rule of thumb we look for a reliability of 0.7, but this may depend on how wide a domain is sampled by the test.

For example, if we are working with stens, the SD of our scale is 2; with a reliability of 0.7, our band of error (one SEm) will be approximately one sten. This is generally acceptable.
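The arithmetic behind this example can be checked with the standard formula SEm = SD × √(1 − reliability). The article does not spell the formula out, so treat the sketch below as an assumption about how the figure was reached. The function name is illustrative.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEm = SD * sqrt(1 - reliability), the classical test theory formula."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Sten scale (SD = 2) with a reliability of 0.7
sem = standard_error_of_measurement(sd=2.0, reliability=0.7)
print(round(sem, 2))  # 1.1, i.e. roughly one sten
```

Raising the reliability to 0.9 on the same scale would shrink the SEm to about 0.63 stens, which shows how quickly the error band narrows as reliability improves.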

The SEm indicates the band of error we must allow to be confident that our assessments will prove to be consistent. To be more specific we can have:

»     68% confidence that a candidate’s true score lies within 1 SEm above or below the obtained score

»     95% confidence that a candidate’s true score lies within 2 SEm above or below the obtained score

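This is useful when comparing scores. As a rule of thumb, if two scores differ by more than 2 SEm we can treat the difference as genuine. Strictly, comparing two scores uses the standard error of the difference, which is SEm × √2 when both scores have the same SEm, so 95% confidence requires a gap of about 2.8 SEm.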
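As a minimal sketch, the 68% and 95% bands around an obtained score can be computed as follows. The function name, the candidate's score and the SEm value of 1.1 are hypothetical, chosen to match a sten scale with a reliability of about 0.7.

```python
def score_band(obtained, sem, n_sem=1):
    """Return the (low, high) band of n_sem SEm around an obtained score."""
    margin = n_sem * sem
    return (obtained - margin, obtained + margin)


# Hypothetical candidate with an obtained sten of 6 and an SEm of 1.1
low_68, high_68 = score_band(6, 1.1, n_sem=1)
low_95, high_95 = score_band(6, 1.1, n_sem=2)

print(f"68% band: {low_68:.1f} to {high_68:.1f}")
print(f"95% band: {low_95:.1f} to {high_95:.1f}")
```

Reporting a band rather than a single number is a reminder that the obtained score is only an estimate of the true score.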

How Do We Create a Reliable Test?

In designing tests we need to reduce four main sources of error. These are:

»     Candidate-related sources: variations in mood, motivation, health; test sophistication

»     Test-related sources: ambiguous items, item domain coverage; item construction

»     Administration: test conditions, test administration

»     Scoring and Interpretation: limiting bias and subjectivity

The greater the number of test items, the more certain we can be of assessing someone’s true score, because we are reducing sampling error and increasing the test reliability. However, candidates tire as questions are added, so the extra reliability gained must be balanced against the time it takes to collect the information (depth versus breadth).
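The trade-off between test length and reliability can be illustrated with the Spearman-Brown prediction formula. The article does not mention this formula, so it is brought in here as an assumption. It also assumes that any added items are of comparable quality to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


# Starting from a reliability of 0.7, lengthen the test step by step
for factor in (1, 2, 3, 4):
    predicted = spearman_brown(0.7, factor)
    print(f"{factor}x length: predicted reliability {predicted:.3f}")
```

Under these assumptions, doubling the test takes the predicted reliability from 0.7 to about 0.82, while going from three to four times the length adds less than 0.03. Each extra block of questions buys less reliability than the one before it.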

