Exam Development and Test Validation Process

Examinations are developed with input from technical experts, also referred to as "Subject Matter Experts" or SMEs. Developing any national test requires a minimum of five and a maximum of nine technical experts from at least three states. Broader participation from additional states is desired and actively sought.

Developing the Content

The developmental stage of an examination generally consists of a "job and task analysis" of the work done within the field of occupation. This analysis identifies elements of the occupation that can be classified as individual jobs. It then identifies the tasks that need to be completed within each job element. There may be many jobs within an occupation, and consequently many tasks that need to be done before a job can be completed. This is the first step in developing "Content Validity".

Depending on the scope of development, SMEs may be involved in identifying other elements of the occupation, such as tools, equipment, work environment, and conditions of work that relate to ADA (Americans with Disabilities Act) requirements. Sometimes this information is already sufficiently documented and doesn’t require additional study.

Developing the Structure

After the job and task analysis has been completed, the same or another group of SMEs is retained to develop the structure of the test. The structure of a test is sometimes referred to as the "table of test specifications" and is used to determine the content and emphasis of the test. This is the part of the test building process that establishes "Construct Validity".

It is generally understood that a group of questions (items) may constitute a test of knowledge in a given field of occupation. Developing the table of test specifications maintains a degree of control over the content of the test. This process also satisfies the condition of "Semantic Validity", whereby the labels relate to the occupation being evaluated. This control of content is the second step in establishing "Content Validity" and helps ensure that this aspect of validity is maintained.

The structure, or table of test specifications, identifies the number of questions/items to be applied to a given section/category of the test. Doing so maintains the content relationship of the test, while individual questions relating to a section/category may be changed or randomly selected from a test item bank.
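
As a minimal sketch of how a table of test specifications can control content while items vary, the example below draws a fixed number of items per category from an item bank. The category names, item counts, item IDs, and the function name assemble_form are all hypothetical and are used only for illustration.

```python
import random

# Hypothetical table of test specifications: category -> number of items.
TEST_SPEC = {"Safety": 3, "Tools and Equipment": 2, "Procedures": 5}


def assemble_form(item_bank, spec, seed=None):
    """Randomly draw the specified number of items for each category."""
    rng = random.Random(seed)
    form = []
    for category, count in spec.items():
        pool = [item for item in item_bank if item["category"] == category]
        if len(pool) < count:
            raise ValueError(
                f"Item bank has {len(pool)} items for {category!r}; {count} required"
            )
        # sample() selects without replacement, so no item repeats on a form.
        form.extend(rng.sample(pool, count))
    return form


# Hypothetical item bank: eight placeholder items per category.
ITEM_BANK = [
    {"id": f"{category[:3].upper()}-{n:02d}", "category": category}
    for category in TEST_SPEC
    for n in range(1, 9)
]

form_a = assemble_form(ITEM_BANK, TEST_SPEC, seed=1)
form_b = assemble_form(ITEM_BANK, TEST_SPEC, seed=2)
print([item["id"] for item in form_a])
print([item["id"] for item in form_b])
```

The two forms may contain different individual items, but each has the same number of items per category, which is the content relationship the table is meant to preserve.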

Developing the Items

Questions on a test are typically referred to as "test items." Each item takes the form of a multiple-choice question. The item is made up of three parts: 1) the question, called the "stem"; 2) a single correct answer; and 3) a set of plausible incorrect answers called "distracters." The number of distracters sometimes helps to elevate the difficulty or level of the test. Typically there are four choices: one correct answer and three plausible distracters.

There can be up to seven choices for a given item. The same or a similar occupational group of SMEs is involved in the development of test items. Items are generated in many different ways, guided by the "table of test specifications." More items are generated than the test requires, because items are often abandoned for any number of reasons and additional items are then required to maintain the size or length of the test.
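
The three-part structure of an item can be sketched as a small data structure. This is an illustrative assumption about how an item bank might store items, not a description of any particular system; the class name ExamItem, its field names, and the sample item are hypothetical.

```python
from dataclasses import dataclass

MAX_CHOICES = 7  # the text allows up to seven choices per item


@dataclass(frozen=True)
class ExamItem:
    stem: str             # the question
    correct_answer: str   # the single correct answer
    distracters: tuple    # plausible incorrect answers
    category: str = ""    # section of the table of test specifications

    def __post_init__(self):
        choices = self.choices()
        if not self.distracters:
            raise ValueError("An item needs at least one distracter")
        if len(choices) > MAX_CHOICES:
            raise ValueError(f"An item may have at most {MAX_CHOICES} choices")
        if len(set(choices)) != len(choices):
            raise ValueError("Choices must be distinct")

    def choices(self):
        """All answer choices, correct answer first (shuffle before display)."""
        return (self.correct_answer, *self.distracters)


# Hypothetical four-choice item: one correct answer, three distracters.
item = ExamItem(
    stem="Which unit is used to measure electrical resistance?",
    correct_answer="Ohm",
    distracters=("Volt", "Ampere", "Watt"),
    category="Fundamentals",
)
print(item.choices())
```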

Items are reviewed for "Bias" toward protected groups by persons other than the SMEs. The test developer selects individuals who have a high degree of sensitivity to biased language to review the developed test items, in an attempt to eliminate all or most language that may offend a protected group of people or bias an item against it.

Pilot Testing

Pilot testing is an important step in the development of a test. Pilot testing consists of identifying individuals within an occupation who are at approximately the target level of the test. For example, a technician-level test may require selecting individuals who have some level of experience within the occupation to pilot the test. Pilot participants are selected by knowledgeable people within the occupation, who are asked to identify participants they feel perform at a level similar to that being targeted by the test.

Through this selection process, some aspect of "Criterion-Related Validity" is generated. Typically, selected participants have already been evaluated to some degree by the person who selects them. Therefore, there is often some correlation between the level of the participant and that of the test. Other criterion-related test results might be used to validate this process for a pilot group. The size of the pilot test group is selected to generate sufficient data. The number of people within the occupation, the number within a location, and the number of individuals who can be found to volunteer all help determine the size of the pilot test group. "Face Validity" is generated at this point in the test development process. Face Validity refers to the recognition of the test title, test categories, and test items as being part of the field of occupation. Individuals who pilot the test are asked to respond to or comment on each of these parts of the test.

The pilot test also asks participants to mark items, words or phrases that might have an impact on protected groups in a second attempt to eliminate bias.

Item Analysis

Item analysis is the process of technically reviewing the structure, response, and fit of a given item and the relationship of that item to the rest of the test. Through item analysis, some level of "Reliability" validation can be obtained. Item analysis may consist of any or all of the following statistical analyses:

Stability or Test/Retest - This method involves giving the same assessment twice, to the same group of individuals, at separate intervals (e.g., days, weeks, or months apart). Reliability will be the correlation between the scores at Time 1 and Time 2.
Alternate Form - This method involves the creation of two forms of the same exam (slightly varying the items). The correlation between the scores of Test 1 and Test 2 will provide the reliability.
Discrimination Index - A discrimination index will be used to measure each examination item (i.e., question) so that distinctions can be made among the performances of examinees. The discrimination index is the difference in the percentage of high achieving students who got an item correct and the percentage of low achieving students who got the item correct.
Item Cross Correlation Matrix - The item cross correlation matrix will provide a measure of how responses to individual items relate to overall performance on the examination as a whole.

The overall reliability of a test is evaluated using some or all of the previously discussed methods and, possibly, one or more of the following:
- Kuder-Richardson Formulas (KR-20 and KR-21)
- Cronbach Coefficient Alpha
- Split-half Reliability Coefficient
- Level of Difficulty
- Easiness Scale
- Coefficient of equivalence
- Spearman-Brown
- Standard error of measurement
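
Several of the statistics named above can be sketched in a few lines. The example below is a minimal illustration, assuming dichotomously scored items (1 = correct, 0 = incorrect) and a small, invented response matrix; the data, function names, and the 27% grouping fraction for the discrimination index are assumptions made for the example, and conventions such as population versus sample variance differ between textbooks and software packages.

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical pilot data: one row per examinee, one column per item
# (1 = correct, 0 = incorrect). Not taken from any real test.
RESPONSES = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def item_difficulty(responses, item):
    """Level of difficulty: proportion of examinees answering correctly."""
    return mean(row[item] for row in responses)


def discrimination_index(responses, item, fraction=0.27):
    """Proportion correct among high scorers minus that among low scorers.

    Groups are the top and bottom `fraction` of examinees by total score
    (27% is a common convention, assumed here).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    high, low = ranked[:group_size], ranked[-group_size:]
    return mean(row[item] for row in high) - mean(row[item] for row in low)


def kr20(responses):
    """Kuder-Richardson Formula 20, using population variance of totals."""
    k = len(responses[0])
    total_variance = pvariance([sum(row) for row in responses])
    sum_pq = 0.0
    for item in range(k):
        p = item_difficulty(responses, item)
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def spearman_brown(half_test_correlation):
    """Step a split-half correlation up to a full-length reliability."""
    r = half_test_correlation
    return 2 * r / (1 + r)


def standard_error_of_measurement(responses, reliability):
    """SEM = standard deviation of total scores * sqrt(1 - reliability)."""
    sd = sqrt(pvariance([sum(row) for row in responses]))
    return sd * sqrt(1 - reliability)


reliability = kr20(RESPONSES)
sem = standard_error_of_measurement(RESPONSES, reliability)
for item in range(len(RESPONSES[0])):
    p = item_difficulty(RESPONSES, item)
    d = discrimination_index(RESPONSES, item)
    print(f"item {item}: difficulty={p:.2f} discrimination={d:.2f}")
print(f"KR-20={reliability:.3f} SEM={sem:.3f}")
```

With so few examinees the numbers are only illustrative; a real pilot group would be sized to generate sufficient data, as described under Pilot Testing.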

Final Formatting

The item analysis will reveal items that are not working as expected. Those items are either eliminated or repaired. The test is then formatted to the correct total number of questions, and each section/category is reviewed for the correct number of questions according to the table of test specifications. Any remaining spelling and formatting errors are corrected.
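
The check of each section/category against the table of test specifications can be sketched as follows. This is an illustrative assumption about how such a review might be automated; the function name category_shortfalls, the specification, the form, and the flagged item ID are all hypothetical.

```python
from collections import Counter


def category_shortfalls(form, spec, flagged_ids=()):
    """Return {category: items still needed} after dropping flagged items."""
    flagged = set(flagged_ids)
    kept = [item for item in form if item["id"] not in flagged]
    counts = Counter(item["category"] for item in kept)
    return {
        category: required - counts[category]
        for category, required in spec.items()
        if counts[category] < required
    }


# Hypothetical specification and draft form.
spec = {"Safety": 2, "Procedures": 3}
draft_form = [
    {"id": "S-1", "category": "Safety"},
    {"id": "S-2", "category": "Safety"},
    {"id": "P-1", "category": "Procedures"},
    {"id": "P-2", "category": "Procedures"},
    {"id": "P-3", "category": "Procedures"},
]

# Suppose item analysis flagged P-2 as not working as expected.
shortfalls = category_shortfalls(draft_form, spec, flagged_ids=["P-2"])
print(shortfalls)
```

A non-empty result indicates categories that need replacement items from the item bank before the form matches the table of test specifications.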

Test Delivery

Delivery of the test follows the procedures prescribed for all national tests and requires a level of security. The delivery process consists of identifying proctors who have a high degree of personal conviction and who agree to the requirements for handling and proctoring a national test.

Continuing Analysis

Test results are monitored on a continuing, periodic basis. When score anomalies occur, whole tests or individual items are scrutinized for problems. When a test shows a significant level of problems, it is slated for review ahead of its scheduled review period. All tests are reviewed every three years.