Issues Related To Scoring Of Essay Type Test Items In Statistics


Constructing tests

Designing tests is an important part of assessing students' understanding of course content and their level of competency in applying what they are learning. Whether you use low-stakes, frequent evaluations (quizzes) or high-stakes, infrequent evaluations (midterms and finals), careful design will help provide more calibrated results.


Here are a few general guidelines to help you get started:

  • Consider your reasons for testing.
    • Will this quiz monitor the students’ progress so that you can adjust the pace of the course?
    • Will ongoing quizzes serve to motivate students?
    • Will this final provide data for a grade at the end of the quarter?
    • Will this mid-term challenge students to apply concepts learned so far?

The reason(s) for giving a test will help you determine features such as length, format, level of detail required in answers, and the time frame for returning results to the students.

  • Maintain consistency between goals for the course, methods of teaching, and the tests used to measure achievement of goals. If, for example, class time emphasizes review and recall of information, then so can the test; if class time emphasizes analysis and synthesis, then the test can also be designed to demonstrate how well students have learned these things.
  • Use testing methods that are appropriate to learning goals. A multiple choice test might be useful for demonstrating memory and recall, for example, but an essay or open-ended problem may be needed for students to demonstrate more independent analysis or synthesis.
  • Help students prepare. Most students will assume that the test is designed to measure what is most important for them to learn in the course. You can help students prepare for the test by clarifying course goals as well as reviewing material. This will allow the test to reinforce what you most want students to learn and retain.
  • Use consistent language (in stating goals, in talking in class, and in writing test questions) to describe expected outcomes. If you want to use words like explain or discuss, be sure that you use them consistently and that students know what you mean when you use them.
  • Design test items that allow students to show a range of learning. That is, students who have not fully mastered everything in the course should still be able to demonstrate how much they have learned.

Multiple choice exams

Multiple choice questions can be difficult to write, especially if you want students to go beyond recall of information, but the exams are easier to grade than essay or short-answer exams. On the other hand, multiple choice exams provide less opportunity than essay or short-answer exams for you to determine how well the students can think about the course content or use the language of the discipline in responding to questions.

If you decide you want to test mostly recall of information or facts and you need to do so in the most efficient way, then you should consider using multiple choice tests.

The following ideas may be helpful as you begin to plan for a multiple choice exam:

  • Since question wording can be misleading or open to misinterpretation, try to have a colleague answer your test questions before the students do.
  • Be sure that the question is clear within the stem so that students do not have to read the various options to know what the question is asking.
  • Avoid writing items that lead students to choose the right answer for the wrong reasons. For instance, avoid making the correct alternative the longest or most qualified one, or the only one that is grammatically appropriate to the stem.
  • Try to design items that tap students’ overall understanding of the subject. Although you may want to include some items that only require recognition, avoid the temptation to write items that are difficult because they are taken from obscure passages (footnotes, for instance).
  • Consider a formal assessment of your multiple-choice questions with what is known as an “item analysis” of the test.
    For example:
    • Which questions proved to be the most difficult?
    • Were there questions which most of the students with high grades missed?

This information can help you identify areas in which students need further work, and can also help you assess the test itself: Were the questions worded clearly? Was the level of difficulty appropriate? If scores are uniformly high, for example, you may be doing everything right, or have an unusually good class. On the other hand, your test may not have measured what you intended it to.


Essay questions

 

“Essay tests let students display their overall understanding of a topic and demonstrate their ability to think critically, organize their thoughts, and be creative and original. While essay and short-answer questions are easier to design than multiple-choice tests, they are more difficult and time-consuming to score. Moreover, essay tests can suffer from unreliable grading; that is, grades on the same response may vary from reader to reader or from time to time by the same reader. For this reason, some faculty prefer short-answer items to essay tests. On the other hand, essay tests are the best measure of students’ skills in higher-order thinking and written expression.”
(Barbara Gross Davis, Tools for Teaching, 1993, 272)

When are essay exams appropriate?

  • When you are measuring students’ ability to analyze, synthesize, or evaluate
  • When you have been teaching at these levels (e.g., writing-intensive courses, upper-division undergraduate seminars, graduate courses) or the content lends itself to critical analysis rather than recall of information

How do you design essay exams?

  • Be specific
  • Use words and phrases that alert students to the kind of thinking you expect; for example, identify, compare, or critique
  • Indicate with points (or time limits) the approximate amount of time students should spend on each question and the level of detail expected in their responses
  • Be aware of time; practice taking the exam yourself or ask a colleague to look at the questions

How do you grade essay exams?

  • Develop criteria for appropriate responses to each essay question
  • Develop a scoring guide that tells what you are looking for in each response and how much credit you intend to give for each part of the response
  • Read all of the responses to question 1, then all of the responses to question 2, and on through the exam. This will provide a more holistic view of how the class answered the individual questions

How do you help students succeed on essay exams?

  • Use study questions that ask for the same kind of thinking you expect on exams
  • During lecture or discussion emphasize examples of thinking that would be appropriate on essay exams
  • Provide practice exams or sample test questions
  • Show examples of successful exam answers

Assessing your test

Regardless of the kind of exams you use, you can assess their effectiveness by asking yourself some basic questions:

  • Did I test for what I thought I was testing for?
    If you wanted to know whether students could apply a concept to a new situation, but mostly asked questions determining whether they could label parts or define terms, then you tested for recall rather than application.
  • Did I test what I taught?
    For example, your questions may have tested the students' understanding of surface features or procedures, while you had been lecturing on causation or relation: not so much what the names of the bones of the foot are, but how they work together when we walk.
  • Did I test for what I emphasized in class?
    Make sure that you have asked most of the questions about the material you feel is the most important, especially if you have emphasized it in class. Avoid questions on obscure material that are weighted the same as questions on crucial material.
  • Is the material I tested for really what I wanted students to learn?
    For example, if you wanted students to use analytical skills such as the ability to recognize patterns or draw inferences, but only used true-false questions requiring non-inferential recall, you might try writing more complex true-false or multiple-choice questions.

Understanding Item Analyses

Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors’ skills in test construction, and identifying specific areas of course content which need greater emphasis or clarity. Separate item analyses can be requested for each raw score [1] created during a given ScorePak® run.


A basic assumption made by ScorePak® is that the test under analysis is composed of items measuring a single subject area or underlying ability. The quality of the test as a whole is assessed by estimating its “internal consistency.” The quality of individual items is assessed by comparing students’ item responses to their total test scores.

Following is a description of the various statistics provided on a ScorePak® item analysis report. This report has two parts. The first part assesses the items which made up the exam. The second part shows statistics summarizing the performance of the test as a whole.

Item Statistics

Item statistics are used to assess the performance of individual test items on the assumption that the overall quality of a test derives from the quality of its items. The ScorePak® item analysis report provides the following item information:

Item Number

This is the question number taken from the student answer sheet, and the ScorePak® Key Sheet. Up to 150 items can be scored on the Standard Answer Sheet.

Mean and Standard Deviation

The mean is the “average” student response to an item. It is computed by adding up the number of points earned by all students on the item, and dividing that total by the number of students.

The standard deviation, or S.D., is a measure of the dispersion of student scores on that item. That is, it indicates how “spread out” the responses were. The item standard deviation is most meaningful when comparing items which have more than one correct alternative and when scale scoring is used. For this reason it is not typically used to evaluate classroom tests.
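
As a concrete illustration, the item mean and standard deviation can be computed directly from a students-by-items score matrix. The sketch below uses a hypothetical 0/1 matrix and NumPy; it is not ScorePak® code.

```python
# A minimal sketch (not ScorePak itself) of the item mean and standard
# deviation, assuming a hypothetical 0/1 score matrix with one row per
# student and one column per item.
import numpy as np

scores = np.array([            # hypothetical data: 6 students x 4 items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
])

item_mean = scores.mean(axis=0)   # points earned on the item / number of students
item_sd = scores.std(axis=0)      # how "spread out" responses to the item are

print("item means:", item_mean)
print("item SDs:  ", item_sd)
```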

Item Difficulty

For items with one correct alternative worth a single point, the item difficulty is simply the percentage of students who answer an item correctly. In this case, it is also equal to the item mean. The item difficulty index ranges from 0 to 100; the higher the value, the easier the question. When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative. Item difficulty is relevant for determining whether students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. The item will have low discrimination if it is so difficult that almost everyone gets it wrong or guesses, or so easy that almost everyone gets it right.
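
The two difficulty calculations described above are straightforward to reproduce. The sketch below uses hypothetical response arrays and an illustrative maximum weight, not ScorePak® data structures.

```python
# A minimal sketch of the two item-difficulty calculations described above,
# using hypothetical arrays.
import numpy as np

# One-point, single-correct-answer item: difficulty = percentage correct,
# which equals the item mean expressed as a percentage.
correct = np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1])   # 1 = answered correctly
difficulty_simple = 100 * correct.mean()

# Multi-point or multiple-correct item: average score on the item divided by
# the highest number of points for any one alternative.
item_scores = np.array([2, 1, 0, 2, 2, 1, 0, 2])     # points earned per student
max_weight = 2                                        # highest weight of any alternative
difficulty_weighted = 100 * item_scores.mean() / max_weight

print(difficulty_simple, difficulty_weighted)
```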

To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item. (The chance score for five-option questions, for example, is 20 because one-fifth of the students responding to the question could be expected to choose the correct option by guessing.) Ideal difficulty levels for multiple-choice items in terms of discrimination potential are:

Format                                       Ideal Difficulty
Five-response multiple-choice                70
Four-response multiple-choice                74
Three-response multiple-choice               77
True-false (two-response multiple-choice)    85

(From Lord, F.M. “The Relationship of the Reliability of Multiple-Choice Tests to the Distribution of Item Difficulties,” Psychometrika, 1952, 18, 181-194.)
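
As a quick check of the “slightly higher than midway between chance and perfect” rule, the short sketch below computes the chance score and midway point for each format and compares them with Lord's recommended values from the table.

```python
# Compare the midway point between chance and a perfect score with Lord's
# recommended ideal difficulties quoted in the table above.
lords_ideal = {5: 70, 4: 74, 3: 77, 2: 85}   # number of options -> ideal difficulty

for k, ideal in lords_ideal.items():
    chance = 100 / k                  # expected percent correct from blind guessing
    midway = (chance + 100) / 2       # halfway between chance and a perfect score
    print(f"{k}-option item: chance {chance:.0f}, midway {midway:.1f}, Lord {ideal}")
```

For every format, Lord's value sits a little above the computed midpoint, which is exactly what the rule predicts.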

ScorePak® arbitrarily classifies item difficulty as “easy” if the index is 85% or above; “moderate” if it is between 51 and 84%; and “hard” if it is 50% or below.

Item Discrimination

Item discrimination refers to the ability of an item to differentiate among students on the basis of how well they know the material being tested. Various hand calculation procedures have traditionally been used to compare item responses to total test scores using high and low scoring groups of students. Computerized analyses provide more accurate assessment of the discrimination power of items because they take into account responses of all students rather than just high and low scoring groups.

The item discrimination index provided by ScorePak® is a Pearson product-moment correlation [2] between student responses to a particular item and total scores on all other items on the test. This index is the equivalent of a point-biserial coefficient in this application. It provides an estimate of the degree to which an individual item is measuring the same thing as the rest of the items.
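
A corrected item-total correlation of this kind can be computed as sketched below, assuming a hypothetical 0/1 score matrix; this is the standard calculation, not ScorePak®'s own implementation.

```python
# A minimal sketch of the discrimination index described above: the Pearson
# correlation between responses to one item and the total score on all other
# items, for a hypothetical students-by-items 0/1 score matrix.
import numpy as np

def item_discrimination(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation for each column of a score matrix."""
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    indices = []
    for j in range(n_items):
        rest = total - scores[:, j]                 # total score minus this item
        r = np.corrcoef(scores[:, j], rest)[0, 1]   # Pearson r (point-biserial here)
        indices.append(r)
    return np.array(indices)

scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])
print(item_discrimination(scores))
```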

Because the discrimination index reflects the degree to which an item and the test as a whole are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests. Item discrimination indices must always be interpreted in the context of the type of test which is being analyzed. Items with low discrimination indices are often ambiguously worded and should be examined. Items with negative indices should be examined to determine why a negative value was obtained. For example, a negative value may indicate that the item was mis-keyed, so that students who knew the material tended to choose an unkeyed, but correct, response option.

Tests with high internal consistency consist of items with mostly positive relationships with total test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and total score distributions. ScorePak® classifies item discrimination as “good” if the index is above .30; “fair” if it is between .10 and .30; and “poor” if it is below .10.

Alternate Weight

This column shows the number of points given for each response alternative. For most tests, there will be one correct answer which will be given one point, but ScorePak® allows multiple correct alternatives, each of which may be assigned a different weight.

Means

The mean total test score (minus that item) is shown for students who selected each of the possible response alternatives. This information should be looked at in conjunction with the discrimination index; higher total test scores should be obtained by students choosing the correct, or most highly weighted alternative. Incorrect alternatives with relatively high means should be examined to determine why “better” students chose that particular alternative.

Frequencies and Distribution

The number and percentage of students who choose each alternative are reported. The bar graph on the right shows the percentage choosing each response; each “#” represents approximately 2.5%. Frequently chosen wrong alternatives may indicate common misconceptions among the students.
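
The frequency display can be mimicked with a few lines of code. The sketch below tallies hypothetical responses to a single item and prints one “#” per approximately 2.5%, as in the report's bar graph.

```python
# A minimal sketch of the frequency display described above, using a
# hypothetical set of answers to one item.
from collections import Counter

responses = list("ABBACABBBAACBBAABBBC")       # hypothetical answers to one item
counts = Counter(responses)
n = len(responses)

for alt in sorted(counts):
    pct = 100 * counts[alt] / n
    bar = "#" * round(pct / 2.5)               # each "#" represents about 2.5%
    print(f"{alt}: {counts[alt]:2d} ({pct:4.1f}%) {bar}")
```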

Difficulty and Discrimination Distributions

At the end of the Item Analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test, and can be used to identify items which are not performing well and which can perhaps be improved or discarded.

Test Statistics

Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient

The reliability of a test refers to the extent to which the test is likely to produce consistent scores. The particular reliability coefficient computed by ScorePak® reflects three characteristics of the test:

  • Intercorrelations among the items — the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test’s reliability coefficient are related in this regard.
  • Test length — a test with more items will have a higher reliability, all other things being equal.
  • Test content — generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for about 95% of the classroom tests scored by ScorePak®. High reliability means that the questions of a test tended to “pull together.” Students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect peculiarities of the items or the testing situation more than students’ knowledge of the subject matter.

As with many statistics, it is dangerous to interpret the magnitude of a reliability coefficient out of context. High reliability should be demanded in situations in which a single test score is used to make major decisions, such as professional licensure examinations. Because classroom examinations are typically combined with other scores to determine grades, the standards for a single test need not be as stringent. The following general guidelines can be used to interpret reliability coefficients for classroom exams:

Reliability       Interpretation
.90 and above     Excellent reliability; at the level of the best standardized tests
.80 – .90         Very good for a classroom test
.70 – .80         Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70         Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60         Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below      Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

The measure of reliability used by ScorePak® is Cronbach’s Alpha. This is the general form of the more commonly reported KR-20 and can be applied to tests composed of items with different numbers of points given for different response alternatives. When coefficient alpha is applied to tests in which each item has only one correct answer and all correct answers are worth the same number of points, the resulting coefficient is identical to KR-20.

(Further discussion of test reliability can be found in J. C. Nunnally, Psychometric Theory. New York: McGraw-Hill, 1967, pp. 172-235; see especially formula 6-26, p. 196.)
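
For readers who want to compute coefficient alpha themselves, the sketch below applies the standard formula to a hypothetical students-by-items score matrix; for 0/1 items worth one point each, the result is the same as KR-20. This follows the textbook formula and is not ScorePak®'s own code.

```python
# A minimal sketch of Cronbach's alpha for a hypothetical students-by-items
# score matrix: alpha = k/(k-1) * (1 - sum of item variances / variance of
# total test score).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
])
print(round(cronbach_alpha(scores), 2))
```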

Standard Error of Measurement

The standard error of measurement is directly related to the reliability of the test. It is an index of the amount of variability in an individual student’s performance due to random measurement error. If it were possible to administer an infinite number of parallel tests, a student’s score would be expected to change from one administration to the next due to a number of factors. For each student, the scores would form a “normal” (bell-shaped) distribution. The mean of the distribution is assumed to be the student’s “true score,” and reflects what he or she “really” knows about the subject. The standard deviation of the distribution is called the standard error of measurement and reflects the amount of change in the student’s score which could be expected from one test administration to another.

Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of measurement is expressed in the same scale as the test scores. For example, multiplying all test scores by a constant will multiply the standard error of measurement by that same constant, but will leave the reliability coefficient unchanged.

A general rule of thumb to predict the amount of change which can be expected in individual test scores is to multiply the standard error of measurement by 1.5. Only rarely would one expect a student’s score to increase or decrease by more than that amount between two such similar tests. The smaller the standard error of measurement, the more accurate the measurement provided by the test.

(Further discussion of the standard error of measurement can be found in J. C. Nunnally, Psychometric Theory. New York: McGraw-Hill, 1967, pp. 172-235; see especially formula 6-34, p. 201.)
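
The report does not print the formula, but the standard error of measurement follows from the test's standard deviation and reliability as SEM = SD × √(1 − reliability). The sketch below uses hypothetical values and applies the 1.5 × SEM rule of thumb mentioned above.

```python
# A minimal sketch of the standard error of measurement using the standard
# formula SEM = SD * sqrt(1 - reliability). The test SD and reliability are
# hypothetical numbers, not ScorePak output.
import math

test_sd = 8.0          # standard deviation of total test scores (hypothetical)
reliability = 0.80     # reliability coefficient, e.g., coefficient alpha (hypothetical)

sem = test_sd * math.sqrt(1 - reliability)
band = 1.5 * sem       # rule of thumb: individual scores rarely change by more than this

print(f"SEM = {sem:.2f} points; scores rarely change by more than {band:.2f} points")
```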

A Caution in Interpreting Item Analysis Results

Each of the various item statistics provided by ScorePak® provides information which can be used to improve individual test items and to increase the quality of the test as a whole. Such statistics must always be interpreted in the context of the type of test given and the individuals being tested. W. A. Mehrens and I. J. Lehmann provide the following set of cautions in using item analysis results (Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart and Winston, 1973, 333-334):

  • Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect internal consistency of items rather than validity.
  • The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power: (a) extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives; (b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures “knowledge of facts,” then an item assessing “ability to apply principles” may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
  • Item analysis data are tentative. Such data are influenced by the type and number of students being tested, instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.

[1] Raw scores are those scores which are computed by scoring answer sheets against a ScorePak® Key Sheet. Raw score names are EXAM1 through EXAM9, QUIZ1 through QUIZ9, MIDTRM1 through MIDTRM3, and FINAL. ScorePak® cannot analyze scores taken from the bonus section of student answer sheets or computed from other scores, because such scores are not derived from individual items which can be accessed by ScorePak®. Furthermore, separate analyses must be requested for different versions of the same exam.

[2] A correlation is a statistic which indexes the degree of linear relationship between two variables. If the value of one variable is related to the value of another, they are said to be “correlated.” In positive relationships, the value of one variable tends to be high when the value of the other is high, and low when the other is low. In negative relationships, the value of one variable tends to be high when the other is low, and vice versa. The possible values of correlation coefficients range from -1.00 to 1.00. The strength of the relationship is shown by the absolute value of the coefficient (that is, how large the number is whether it is positive or negative). The sign indicates the direction of the relationship (whether positive or negative).
