Guidelines for Writing Effective Tests:
A Practical "Short Course" for Test Authors
Trainers, and others in organizations who author tests, often
have a high degree of technical and subject matter expertise,
but many haven't been exposed to the basic principles and
criteria for writing tests. This Insight white paper
offers practical guidelines on how to write effective tests.
While following the guidelines will result in better tests,
we suggest that it may be worthwhile to seek professional
counsel when the test is particularly important, or when the
test must meet legal guidelines because it is used in some
fashion to make personnel decisions.
By "effective tests" we mean that they will accomplish
the objectives for which they are intended. This brings up
the first guideline: the test author must be clear about
the test's objectives. Some tests are designed to certify
that people have acquired, and can apply, a certain set of
skills or a body of knowledge. Other tests are designed mainly
to help people further develop by measuring
skills and knowledge, and then providing specific feedback.
Excluding selection tests, those are the two main objectives,
and many tests are designed to do both.
Tests can be used for purposes other than certification and
development (for example, evaluating the effectiveness of
training programs, measuring ROI, promoting a culture of continuous
learning in the organization). Regardless of the purpose,
there is an increasing awareness that even small investments
in testing can significantly leverage the enormous investments
companies make in developmental programs (formal training,
self-study and on-the-job development, managerial coaching,
etc.). The purpose of this document is to help test authors
capitalize on that opportunity.
Scope of this Document
The subtitle above contains the phrase "Short Course,"
and the treatment of the topic here is very short, indeed.
The science of writing tests is complex, and Industrial-Organizational
Psychologists spend years earning their doctoral degrees
to understand that complexity. This document is not intended
to be a treatise on the topic, but rather a way to convey
the basic principles to test authors with less formal training.
The term "test" is intentionally used in this document,
instead of the more generic term "assessment." There
are many different kinds of assessments, ranging from simple
multiple-choice tests to day-long assessment centers. The
guidelines offered here pertain to what we usually think of
as "knowledge tests": tests that can be objectively
scored (versus essays, job simulations, etc.). However, as
we discuss below, this does not mean that such tests can only
measure simple information recall. Well-constructed objective
tests can measure complex abilities to apply knowledge and
skills in real-world situations.
The Test Blueprint
As stated previously, a good test starts with the test author
determining the objectives for the test and what will be done
with the results. Based on that determination, the next step
is to prepare a test blueprint: a game
plan for the content to be included in the test, how scores
will be calculated, and other related matters. A builder cannot
construct a good home without a blueprint, nor can a test
author construct a good test.
Conceptually, developing a test blueprint is pretty simple
and involves two basic steps:
- Map out the domain of knowledge and skills the test takers
should have: what they should know or be able to do.
- Determine the test items (types of items, how many items,
difficulty level, etc.) needed to accurately measure each
of the major knowledge and skill areas.
In practice, however, developing a test blueprint can seem
like such a daunting challenge that many test authors don't
even bother; they just start writing test items. That's
a mistake because the end result is unlikely to be a tool
that is helpful to either the company or the test takers.
As a side comment, we note that one of the important benefits
of linking solid testing programs to developmental initiatives
is that it helps ensure those initiatives have specifically
defined objectives. In this sense, good testing leads to better
training and development.
Test Blueprint Matrix
(Columns A-E are the topics/sections; cells show the number of test items.)
| Levels of Knowledge/Skill   |  A |  B |  C |  D |  E | Total |
| Knowledge of terms          |  2 |  2 |  0 |  2 |  1 |     7 |
| Comprehension of principles |  3 |  2 |  3 |  4 |  2 |    14 |
| Application of principles   |  2 |  4 |  3 |  5 |  3 |    17 |
| Analysis of situations      |  3 |  3 |  4 |  4 |  2 |    16 |
| Evaluation of solutions     |  0 |  2 |  2 |  4 |  3 |    11 |
| Total                       | 10 | 13 | 12 | 19 | 11 |    65 |
Okay, we've made the point that it's very important to develop
a test blueprint, but it's not all that easy to do. So what's
the solution? Creating a test matrix as shown above is a good
starting point.
In this example, the domain of knowledge and skills has been
divided into five topics, or sections, represented by the
columns. The rows in the matrix represent different levels,
ranging from understanding terms to evaluating solutions.
The numbers in the cells indicate the number of test items
for each topic and level. The matrix helps the test author
ensure that the content of the test reflects what was covered
in the training program, or what is important to the test
taker's successful performance on the job. For example, in
this case, topic D is most important, and merely knowing the
terms is less important than the other levels.
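For authors who keep their blueprint in a script or spreadsheet export, the row and column totals can be checked automatically. The following is only a minimal sketch: the names `TOPICS`, `BLUEPRINT`, `level_totals` and `topic_totals` are hypothetical, and the counts are simply copied from the example matrix.

```python
# Hypothetical representation of the example blueprint matrix:
# each row is a level, each list holds item counts for topics A-E.
TOPICS = ["A", "B", "C", "D", "E"]
BLUEPRINT = {
    "Knowledge of terms":          [2, 2, 0, 2, 1],
    "Comprehension of principles": [3, 2, 3, 4, 2],
    "Application of principles":   [2, 4, 3, 5, 3],
    "Analysis of situations":      [3, 3, 4, 4, 2],
    "Evaluation of solutions":     [0, 2, 2, 4, 3],
}


def level_totals(blueprint):
    """Number of items planned for each level (row totals)."""
    return {level: sum(counts) for level, counts in blueprint.items()}


def topic_totals(blueprint, topics):
    """Number of items planned for each topic (column totals)."""
    columns = zip(*blueprint.values())
    return {topic: sum(column) for topic, column in zip(topics, columns)}


print(level_totals(BLUEPRINT))
print(topic_totals(BLUEPRINT, TOPICS))
print(sum(level_totals(BLUEPRINT).values()))  # 65 items in total
```

Comparing the totals against the relative importance of each topic is a quick way to spot a blueprint that over- or under-samples an area.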
We'll briefly mention three other points related to the matrix
before moving on. First, the different levels shown in the
matrix are fairly generic and will work for many tests, but
not all. You should choose whatever levels make sense for
your particular situation. For example, a product knowledge
test for customer service reps who interact with a bank's
customers might have these three levels: knowledge of terms,
product features and benefits, and ability to link products
to customers' needs.
Second, how many items should be included in a test? That's
not an easy question to answer, and it depends on the total
domain being measured, what scores will be generated, and
what will be done with the scores. Here are some guidelines:
- Obviously, a summary test that measures many content
areas will need more items than a test that measures only
what people have learned in a two-hour training module.
- If there is a need to measure each section accurately,
in addition to a total score, then more items will be needed.
A rule of thumb is that each section should have at least
8 items.
- If there are serious positive or negative consequences
associated with the scores (e.g., promotion, having to take
the test again, having to repeat training), then more items
are needed.
Third, for simplicity's sake, the remainder of this document
will discuss test development in terms of measuring knowledge
and skills associated with training programs. However, the
same principles can be applied to tests associated with any
measurement need.
Types of Test Items
At this point, you've specified the domain of knowledge and
skills your test will measure in terms of topics and levels,
and the number of items (data points) in each. The next step
is to decide what types of items the test will contain. It
is beyond the scope of this document to go into too much detail
on item types, but a few of the key points are summarized
in this section.
Most types include a stem where the question or problem is
stated, and response options where the test taker chooses
one or more answers. Exceptions to this are noted in the following
discussion. Here are the primary types of test items (objective
tests only):
Multiple choice: Stem followed by several response
options (usually 4 or 5), only one of which is correct. The
other options are called distracters. This is the most commonly
used type. The advantage is that, with careful construction,
this type can be used to measure knowledge at most levels.
The disadvantage is that it's hard to write good distracters
for levels beyond factual recall.
What is the capital of Florida?
A. Miami
B. Tallahassee
C. Orlando
D. Jacksonville
True-false: Statement which is either true or
false. The advantage is that it's the most efficient way to
measure a lot of content in a short period of test time. The
disadvantages are that it's hard to measure higher-level knowledge
areas, and that guessing is a factor (50% chance of being right).
The border between the U.S. and Canada
is longer than the border between the U.S. and Mexico.
A. True
B. False
Multiple select: Stem followed by several response
options (4-10), more than one of which may be correct. The
advantage is that it is an efficient way of measuring a set
of facts or concepts that cluster together. The disadvantage
is that this is suitable only for certain knowledge areas.
What colors are in the American flag?
Mark all that are correct.
__ Red
__ Green
__ White
__ Blue
__ Black
Matching: List of premises (definitions, explanations)
on the left, and response options on the right. The test taker
must select the best response option for each of the premises.
The advantage is that it allows the comparison of related
ideas or concepts. The disadvantages are that it's not suitable
for measuring isolated facts and information, and scoring
can be complex.
For each concept on the left, select
the word from the list on the right that best matches it.
| __ Test predicts future performance  | A. Face validity |
| __ Test appears a reasonable measure | B. Reliability   |
| __ Re-test scores are very similar   | C. Accuracy      |
| __ Low standard error                | D. Validity      |
|                                      | E. Consistency   |
Ranking: List of response options the test taker
must put in the proper order (by sequence, importance or some
other factor). The advantage is that this is perfect when
knowing the correct order is important. The disadvantages
are that it's not suitable for anything else, and scoring
can be complex.
Put the following steps, which a test author should take in
writing a new test, in the correct order.
__ Prepare test blueprint
__ Determine test objectives
__ Draft test items
__ Evaluate items against criteria
__ Perform item analysis
__ Check with subject matter experts
__ Select item types to be used
__ Pilot the test and modify as needed
Fill in the blank: Stem consists of a statement
or question, and the test taker must supply the answer. The
advantages are that it minimizes guessing, and this type is
easy to write. The disadvantages are that scoring can be difficult
(and sometimes subjective), and this type has little advantage
over well-written multiple-choice items.
The first President of the United
States was ___________________.
Choosing Item Types
There are many different types of test items. Which one(s)
should you choose? Again, there are no hard and fast rules,
but here are some guidelines that might be helpful.
- Start with multiple-choice items since they are the most
versatile, and then add other types as necessary.
- Be careful not to have too many item types in the same
test. This makes it confusing to test takers. There's nothing
wrong with having all multiple-choice items if doing so
meets your measurement objectives.
- Related to the two previous points, keep things simple.
(By "simple," we don't mean low difficulty level.)
While training program delivery should be varied to sustain
interest, this usually isn't an issue for test takers.
- Think about the specific knowledge or skill you're trying
to measure, and then consider which type of test items will
be the most accurate and efficient.
- The stems for multiple-choice items don't necessarily
have to be short questions or problems (but write as concisely
as possible!). You can create scenarios or situations that
require test takers to engage in analysis and critical thinking.
You can also link more than one item to a scenario.
- Add graphics, diagrams, pictures and other information
to the items as appropriate. With computer-based testing,
you can also include sound and/or video files.
- Multiple choice, T-F, and fill in the blank items are
usually scored as either right or wrong. The scoring for
other item types (multiple select, matching and ranking)
can get complicated. Make sure you know how you plan to
score them before administering the test.
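To illustrate why scoring can get complicated, here is one possible way to score a multiple-select item, shown next to simple all-or-nothing scoring. This is a sketch under stated assumptions, not a recommended standard: the function names and the partial-credit rule (credit for every option handled correctly, whether marked or left blank) are illustrative choices, and the data is the flag example from earlier.

```python
def score_all_or_nothing(selected, correct):
    """1.0 only if the marked options exactly match the correct set."""
    return 1.0 if set(selected) == set(correct) else 0.0


def score_partial_credit(selected, correct, options):
    """Fraction of options handled correctly (assumed rule):
    an option counts if it is marked and correct, or unmarked and incorrect."""
    selected, correct = set(selected), set(correct)
    handled = sum(1 for opt in options if (opt in selected) == (opt in correct))
    return handled / len(options)


OPTIONS = ["Red", "Green", "White", "Blue", "Black"]
CORRECT = {"Red", "White", "Blue"}
MARKED = {"Red", "Green", "White"}  # one wrong mark, one correct option missed

print(score_all_or_nothing(MARKED, CORRECT))           # 0.0
print(score_partial_credit(MARKED, CORRECT, OPTIONS))  # 0.6
```

The same response earns no credit under one rule and 60% under the other, which is why the scoring rule should be settled before the test is administered.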
Having extolled the advantages of multiple-choice items,
and having cautioned test authors to keep things simple, we
need to backtrack just a bit. Test authors should be creative
in writing tests that will best accomplish their objectives.
Be creative, and don't get stuck in a rut!
Outline: Steps in Developing a Test
1. Determine test objectives
2. Prepare test blueprint
3. Select item types to be used
4. Draft test items
5. Evaluate items against criteria
6. Check with subject matter experts
7. Pilot the test and modify as needed
8. Perform item analysis
Steps in Developing a Test
Before getting into the specifics of writing good test items,
let's first take a higher-level view of the overall steps
in developing a new test. Previously in this white paper,
the "ranking" example listed the steps. (Give yourself
bonus points if you tried to order them.) The list in the
outline above shows the steps in the correct order.
We've already discussed Steps 1-3; that was done first
so you would have the proper context for understanding the sequence
of events. Steps 4 and 5 will be covered next, and then we'll
briefly touch on the remaining steps. Developing a good test
is clearly more than just sitting down and writing items.
General Points on Writing Effective Tests
This section and the next provide suggestions on how to move
from theory to application in
writing effective tests. There have been volumes written on
this topic, and some of the technical and measurement issues
are rather complex, so we'll only be able to hit the high
points in this document.
As you author test items, keep the most important principle
in mind: Each item must accurately measure the knowledge
or skill area on which the item is focused. This is
an extremely important point, and one that is often overlooked
by people who provide advice to test authors. You can get
so caught up in the different criteria and psychometric nuances
that you lose sight of the main point: "Is this
an item a competent test taker should get right, and is it
one that someone less competent might get wrong?" If
the answer is "no," then it's not a good item.
If the test is associated with a training program (or some
other developmental activity), the content of the program
should be the focus of the test. In this case, test development
should be fairly straightforward. What knowledge and skills
were being developed, and what test items will help measure
the degree to which that occurred? In other words, in authoring
a test associated with training, you should start with the
content that the test takers should have acquired (i.e., the
learning objectives).
Criteria for Effective Items
There are a great many rules/criteria for writing effective
test items. The checklist at the end of this white paper summarizes
the most important ones. You should be thoroughly familiar
with the criteria as you draft items (Step 4), and you can
use the checklist as you carefully evaluate your work afterwards
(Step 5).
Take a look at the checklist now. Many of the criteria are
applicable to all tests, but some apply only to certain item
types. Most criteria are self-explanatory, but the comments
below will help you understand the rationale for some of the
criteria. (The numbers-letters correspond to those on the
checklist.)
1-a: As a test author, you may or may not have
responsibility for determining the objectives of the test.
If you don't, ask someone who does.
1-e: Tests that are not associated with a particular
developmental activity must still be associated with something.
There's still a need to be clear on what you're trying
to measure and why.
2-b: With unsophisticated test takers, an example
of a completed item, given as part of the instructions,
might be appropriate for more complex item types.
3-a: Trying to measure more than one thing with
a single item is a mistake commonly made by new test authors.
3-b: Importance is sometimes confused with item
difficulty. Something could be extremely important, but
if 100% of the test takers always get the item right,
it's probably trivial and should be eliminated.
3-d: The easiest way to tell if the stem is a
complete thought is to cover up the response options and
see if you know what you're supposed to do.
3-f: Write each item as concisely as you can,
but not to the point of sacrificing clarity.
3-g: Use simple words, but don't shy away from
technical terms if they are important for test takers
to know.
3-h: As you ensure that all response options
are grammatically correct with respect to the stem, try
to avoid the "a(n)," "is/are" solution.
Rewrite the item so you can measure the knowledge or skill
without getting hung up on the grammar.
3-i: Writing plausible distracters is both an
art and a science, and it's very hard work. An approach
some authors find helpful is to first write the stem and
correct response, and then think of good distracters.
3-j: An example of response options that aren't
independent is: A. <10, B. <20, C. >40, D. >50.
3-m: In addition to one item not leading to the
correct answer on another item, make sure the test takers
aren't penalized on one item because they got another
item wrong.
3-p: Occasionally, statements with definitive
words are actually true, and are good items. For example,
the statement "all aptitude tests that are valid
are also reliable" happens to be true.
4-e: The problem with "A and B" as
a response option is that it becomes more of a logic test
than a knowledge or skills test. However, it is occasionally
permissible as long as there aren't too many such items
in the same test.
7-c: The reason for having the shorter text on
the right is that most test takers will be reading that
list repeatedly to find the correct matches.
7-e: The reason for having an unequal number
of premises and response options is that it makes the
item more challenging because the test taker cannot determine
the last matches via the process of elimination. However,
there's nothing wrong with having an equal number.
Refining the Test
The last three steps in developing a test deal with refining
the test to make it even better. Step 6 is to check with subject
matter experts. It's always a good practice to have someone
else review the test before it's administered, both from a
psychometric standpoint (the criteria for good items), and
a technical, subject matter standpoint. The review process
is also a good time to clean up typos and grammatical mistakes.
Step 7 is to pilot the test with a small sample of people
who are similar to those in the actual test taker population.
A focus group works well for this. Administer the test and
note the completion time for each participant. This will help
you judge how much time should be allotted for real test takers.
Debrief the participants afterwards to solicit input on their
overall reactions, as well as their reactions to any items
they felt were troublesome (unclear, no right answer, etc.).
Finally, Step 8 is to perform an item analysis once you have
a large enough sample of test taker results. This is the true
"test of your test," and is an excellent way to
further refine it. Again, it's beyond our scope in this document
to go into much detail, but here are a few key points:
- Be careful not to draw conclusions based on small sample
sizes (N < 30).
- Examine items where the percent correct is low (e.g.,
< 65%), and look at the distribution of responses across
the distracters. If the correct answer is B, but almost
everyone who missed it picked C, the item could be confusing
or ambiguous for some reason. On the other hand, it could
just be a difficult item, which is fine.
- Examine items where the percent correct is very high
(e.g., > 95%). It could be that the item is too easy
or even trivial. At the extreme, an item with 100% correct
is probably not worth retaining unless you want to keep
it in for reinforcement purposes.
- Identify those distracters that were picked by an extremely
low percentage of test takers (e.g., 0-3%). See if you can
rewrite them to be a little more plausible.
- Examine the correlations between item scores and total
test score. If high-scoring people are more
likely to miss a particular item, then the item is probably
defective.
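The checks in the list above can be sketched in a few lines of code. This is an illustration only: the function names and the sample data are invented, the eight responses are far below the sample size the first point calls for, and the correlation shown is a plain Pearson correlation between right/wrong item scores and total scores.

```python
from collections import Counter
from math import sqrt


def percent_correct(responses, key):
    """Percentage of test takers who chose the keyed (correct) option."""
    return 100.0 * sum(r == key for r in responses) / len(responses)


def response_distribution(responses):
    """How many test takers picked each option, including distracters."""
    return Counter(responses)


def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores.
    Returns None when either set of scores has no variation."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    if var_i == 0 or var_t == 0:
        return None
    return cov / sqrt(var_i * var_t)


# Invented responses to one item whose correct answer is B,
# plus each test taker's total score on the whole test.
KEY = "B"
RESPONSES = ["B", "C", "B", "C", "B", "A", "B", "C"]
TOTAL_SCORES = [58, 40, 61, 45, 55, 38, 60, 47]
ITEM_SCORES = [1 if r == KEY else 0 for r in RESPONSES]

print(percent_correct(RESPONSES, KEY))    # 50.0
print(response_distribution(RESPONSES))   # most misses chose C
print(item_total_correlation(ITEM_SCORES, TOTAL_SCORES))
```

A positive correlation means high scorers tend to get the item right. A negative value is the warning sign described in the last point of the list.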
Besides refining your test, item analysis is extremely valuable
in improving training programs and their delivery. You'll
have quantitative data on what people don't know or understand,
a solid basis for making changes to training materials, and
good feedback for trainers.
Difficulty Level and Scoring
In previous sections and in the checklist, we briefly touch
on the concept of item difficulty level. You should have a
range of items within sections: some easier items and
some harder items. The remaining question is, how difficult
should the whole test be? In general, a good test should be
challenging enough that few or even none of the test takers
will get 100%. Why? Because you can't tell how much more
the test takers actually know when they get 100%; you
didn't measure that "high." Another reason is that
feedback given to test takers is much more useful when they
discover areas where they can further develop.
There are exceptions to the guideline that good tests should
be challenging to even the most competent test takers. For
example, some training has very strictly defined and limited
learning objectives. The purpose of the test may only be to
certify that people have mastered some baseline knowledge
or skill, and you're not trying to do anything beyond that.
Then we come to the issue of setting a passing score:
the total percent correct required for a test taker to pass
the test. Before going any further, make sure you need a passing
score in the first place. When tests are used for developmental
purposes, it may be preferable to simply report a total score
and section scores without reporting whether people have passed
or failed. However, in many situations there are compelling
reasons to set a passing score, and you can't duck the issue.
The passing score is obviously dependent on the overall test
difficulty level, and there are no rules of thumb such as
"the passing score should never be lower than 85%."
There are scientific, statistical methods available for setting
passing scores, but here's a simple approach that is adequate
for most tests:
- Meet with a group of subject matter experts, perhaps
including managers and trainers. Review the content of the
test and pose the question, "What percentage of these
items do you think people should get correct to be considered
fully competent to perform their jobs?" Reach a consensus
as a group.
- During that same meeting (but after you've
reached a consensus), examine the distribution of total
scores of actual test takers and note the percentage of
people who did not pass. Ask the group if the results seem
logical in terms of performance on the job and business
needs for competencies. Reach a consensus on an adjusted
passing score, but be careful not to unduly lower your standards
of excellence.
You obviously can't do the second step until after you have results
from actual test takers, but at some point in the future you
should look at the distribution of scores and consider adjusting
the passing score as appropriate.
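When reviewing the distribution of scores with the group, it can help to show how many people would fail at each candidate passing score. The sketch below assumes total scores are expressed as percent correct. The function name and the ten sample scores are made up for illustration.

```python
def percent_not_passing(total_scores, passing_score):
    """Percentage of test takers scoring below the passing score."""
    failed = sum(1 for score in total_scores if score < passing_score)
    return 100.0 * failed / len(total_scores)


# Invented total scores (percent correct) for ten test takers.
SCORES = [92, 85, 78, 88, 70, 95, 81, 66, 90, 84]

for candidate in (80, 85):
    print(candidate, percent_not_passing(SCORES, candidate))
# 80 -> 30.0 percent would not pass
# 85 -> 50.0 percent would not pass
```

Seeing the failure rate next to each candidate score gives the group a concrete basis for judging whether the results seem logical.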
There are many other scoring complexities you can get into
(but we won't here), including weighting items and sections,
requiring that some minimum level be achieved on each section
before one can pass the overall test, different passing scores
for the first attempt versus a test re-take, and whether guessing
should be penalized, among others. Our only advice is to keep
things as simple as possible.
Feedback to Test Takers
The real value in testing is to further the learning and
development process. How well that is accomplished depends
on what kind of feedback test takers receive, and when they
get it. To simply tell a test taker their score and whether
they passed or failed misses a whole world of development
opportunities. Good feedback consists of these elements:
- Total score (and sometimes normative information on how
the test taker compares to others)
- Section scores, so the relative strengths and weaknesses
can be understood
- Items that were missed so the test taker understands
the specific knowledge or skill deficiencies
- Development suggestions for recommended actions to take
based on test results (self study, training programs or
materials, on-the-job actions, etc.)
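The first three feedback elements can be produced directly from scored results. The following is a rough sketch, assuming each result is recorded as an item identifier, the section it belongs to, and whether it was answered correctly. The function name, record layout and sample data are hypothetical.

```python
def build_feedback(results):
    """Summarize results given as (item_id, section, is_correct) tuples."""
    sections = {}  # section -> (number correct, number of items)
    missed = []
    for item_id, section, is_correct in results:
        right, count = sections.get(section, (0, 0))
        sections[section] = (right + int(is_correct), count + 1)
        if not is_correct:
            missed.append(item_id)
    total_right = sum(right for right, _ in sections.values())
    return {
        "total_percent": 100.0 * total_right / len(results),
        "section_percent": {
            name: 100.0 * right / count
            for name, (right, count) in sections.items()
        },
        "missed_items": missed,
    }


# Invented results for one test taker on a four-item test.
RESULTS = [
    ("Q1", "A", True),
    ("Q2", "A", False),
    ("Q3", "B", True),
    ("Q4", "B", True),
]

print(build_feedback(RESULTS))
```

Development suggestions still have to be written by someone who knows the content, but they can be keyed to the sections or items reported here.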
Final Tips for Test Authors
Here are a few final thoughts that might be helpful:
- Make sure your tests reflect the application of knowledge
and skills, not just information recall.
- Don't be tricky or picky with test takers. They won't
appreciate it, and it won't help the business.
- Keep track of all the tests you author. Over time, you
can build a valuable pool of test items.
- Look for opportunities to leverage technology in test
administration, scoring and reporting, item analysis, and
other areas. Once you've written an effective test, you
want the testing process to be fast, painless and cost-effective.
Using technology is extremely beneficial. For information
on selecting a technology partner, read the Insight
white paper titled Choosing
an Internet Assessment Vendor: A Practical Guide for Ensuring
a Value-Added Partner.
Criteria Checklist for Test Authors
© 2007 Censeo Corporation