
Inside the Black Box of Machine Scoring

by Sarah Browning-Larson on Jul 15, 2020


One of the advantages of today's digitally delivered assessments is their ability to report results immediately through automatic machine scoring. Great, right? But how is that done? How does a test taker’s response, indicated through a selection or entry in a digital assessment, get evaluated and scored by a machine?

It's easy to understand that computer code can be programmed to score a multiple-choice item. The test taker's selection is compared to the correct answer stored in the scoring program, and if the two match, the response is recorded as correct. But what happens with more sophisticated technology-enhanced items or items that require written responses like essays?
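As a minimal sketch of that comparison, assuming each option is identified by a letter and the item has a single keyed answer (the function and parameter names here are hypothetical, not taken from any particular scoring program):

```python
def score_multiple_choice(selected: str, answer_key: str) -> int:
    """Return 1 if the selected option matches the keyed answer, otherwise 0."""
    # Normalize case and stray whitespace so "b" and "B " are treated alike.
    return 1 if selected.strip().upper() == answer_key.strip().upper() else 0


print(score_multiple_choice("b", "B"))  # 1
print(score_multiple_choice("C", "B"))  # 0
```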

Technology-Enhanced Items

One important part of technology-enhanced items is the scoring rules, which designate correct and incorrect answers. First, the author needs to designate the correct responses. Let’s use a table-matching item as an example. For this item type, test takers need to select a cell or cells that represent pairings of information. To illustrate, let's consider the table-matching item below.

Choose the correct classifications for each number in the table. You may make more than one choice per row if needed.

Each checkbox in the table needs to be designated as a possible answer in a thoughtful way that can be translated into computer code. One simple way to do this is to letter the rows and number the columns (Table 1). Then, the cells are designated by first row and then column.

Each cell designator refers to a checkbox within the response area of the item. For this item, the correct answers correspond with A2, A3, B1, B3, C2, and D1. The incorrect answers are A1, B2, C1, C3, D2, and D3 (Table 2).
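The lettering and numbering scheme can be sketched in a few lines. This is only an illustration: the variable names are assumptions, and the row and column counts are inferred from the cell designators listed above (rows A through D, columns 1 through 3).

```python
ROWS = "ABCD"        # one letter per table row
COLUMNS = (1, 2, 3)  # one number per table column

# Every checkbox gets a designator made of its row letter, then its column number.
ALL_CELLS = {f"{row}{col}" for row in ROWS for col in COLUMNS}

# The correct answers named in the text.
CORRECT_CELLS = {"A2", "A3", "B1", "B3", "C2", "D1"}

# Everything else in the table is an incorrect answer.
INCORRECT_CELLS = ALL_CELLS - CORRECT_CELLS

print(sorted(INCORRECT_CELLS))  # ['A1', 'B2', 'C1', 'C3', 'D2', 'D3']
```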

The second important part of scoring rules is determining the point value of each answer or answer combination. Let’s assume, for the sake of this example, this item is worth a total of two points.

  • To earn two points, all six correct checkboxes must be selected and no other checkboxes
  • To earn one point, the test taker must select five correct checkboxes and no other checkboxes
  • Any other answer will earn zero points
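These three rules can be written as a short function. The sketch below treats a response as a set of selected cell designators; the function name is hypothetical, and a real scoring program may represent responses differently.

```python
CORRECT_CELLS = frozenset({"A2", "A3", "B1", "B3", "C2", "D1"})


def score_table_item(selected) -> int:
    """Apply the two-point rubric to a collection of selected cell designators."""
    selected = set(selected)
    # Any incorrect checkbox earns zero points, whatever else is selected.
    if not selected <= CORRECT_CELLS:
        return 0
    if len(selected) == 6:  # all six correct checkboxes and no others
        return 2
    if len(selected) == 5:  # five correct checkboxes and no others
        return 1
    return 0


print(score_table_item({"A2", "A3", "B1", "B3", "C2", "D1"}))  # 2
print(score_table_item({"A2", "A3", "B1", "B3", "C2"}))        # 1
print(score_table_item({"A2", "A3", "B1", "B3", "C2", "A1"}))  # 0
```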

This brings us to the third important part of scoring rules: standardizing the representation of the answer selection(s) so a machine can read them and turn them into code. The six sets of answers that will earn a test taker one point are shown below, formatted as standardized strings.

A2, A3, B1, B3, C2
A2, A3, B1, B3, D1
A2, A3, B1, C2, D1
A2, A3, B3, C2, D1
A2, B1, B3, C2, D1
A3, B1, B3, C2, D1

Now, computer code can be written that will allow a scoring program to compare the test taker’s selections to these strings; if the response matches one of these strings, a score of one will be assigned.
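Here is one way that string comparison might look. This is a sketch under assumptions: the helper names are invented for illustration, and the standardized format (designators sorted alphabetically and joined with a comma and a space) is inferred from the six one-point strings in this example.

```python
from itertools import combinations

CORRECT_CELLS = ["A2", "A3", "B1", "B3", "C2", "D1"]


def to_standard_string(selected) -> str:
    """Sort the selected designators and join them into one comparable string."""
    # Plain string sorting works here because each designator is one letter
    # followed by one digit.
    return ", ".join(sorted(selected))


TWO_POINT_STRING = to_standard_string(CORRECT_CELLS)

# Every way of choosing five of the six correct cells earns one point.
ONE_POINT_STRINGS = {
    to_standard_string(combo) for combo in combinations(CORRECT_CELLS, 5)
}


def score_by_string(selected) -> int:
    """Compare the standardized response string to the scoring strings."""
    response = to_standard_string(selected)
    if response == TWO_POINT_STRING:
        return 2
    if response in ONE_POINT_STRINGS:
        return 1
    return 0


# The order in which the test taker clicked the checkboxes does not matter.
print(score_by_string(["D1", "A2", "C2", "B1", "A3"]))  # 1
```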

Other types of items require different scoring rules. For example, one popular item type allows a test taker to construct a response to a math item using a keyboard or an onscreen keypad. To score these items, every acceptable form of the correct response must be identified. For example, if the correct answer is 1/2, are responses such as 0.5, 2/4, 3/6, and 5/10 also correct? This will depend on the item and needs to be part of the scoring rules and programming that go into the machine scoring code.
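The choice between accepting only the exact form and accepting any equivalent value can be made an explicit setting in the scoring rule. The sketch below uses Python's standard fractions module; the function names and the accept_equivalents flag are hypothetical and stand in for whatever the item's scoring rules specify.

```python
from fractions import Fraction


def parse_response(text: str):
    """Turn a typed response such as '0.5' or '2/4' into an exact value, or None."""
    try:
        return Fraction(text.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None  # not a number, or a fraction with a zero denominator


def score_math_response(text: str, correct: str = "1/2",
                        accept_equivalents: bool = True) -> int:
    """Score a constructed math response against the keyed answer."""
    cleaned = text.replace(" ", "")
    if not accept_equivalents:
        # Only the exact keyed form earns credit.
        return 1 if cleaned == correct else 0
    value = parse_response(cleaned)
    return 1 if value is not None and value == Fraction(correct) else 0


print(score_math_response("0.5"))                            # 1
print(score_math_response("5/10"))                           # 1
print(score_math_response("2/4", accept_equivalents=False))  # 0
```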

Other item types may also require careful thinking about responses. For a drag and drop item, both the word or image that is being dragged and the drop zone (the space where the word or image is dropped) need carefully defined specifications. For example, can a word or image be dragged more than once, or can a drop zone accept more than one word or image? These drag and drop zone specifications help define correct responses.
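Those two questions map naturally onto two settings in an item specification. The following is a simplified sketch with invented names and identifiers (DragDropSpec, word_1, zone_1, and so on); it assumes every drop zone accepts the same maximum number of items, which a real item might not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragDropSpec:
    tokens: frozenset   # IDs of the draggable words or images
    zones: frozenset    # IDs of the drop zones
    reuse_tokens: bool  # can a word or image be dragged more than once?
    max_per_zone: int   # how many words or images one drop zone accepts


def is_valid_response(spec: DragDropSpec, placements: dict) -> bool:
    """placements maps a zone ID to the list of token IDs dropped there."""
    used = []
    for zone, dropped in placements.items():
        if zone not in spec.zones or len(dropped) > spec.max_per_zone:
            return False
        if any(token not in spec.tokens for token in dropped):
            return False
        used.extend(dropped)
    # If reuse is not allowed, no token may appear twice across all zones.
    return spec.reuse_tokens or len(used) == len(set(used))


def score_drag_drop(spec: DragDropSpec, placements: dict, key: dict) -> int:
    """Return 1 if the response is valid and matches the keyed placements."""
    if not is_valid_response(spec, placements):
        return 0
    normalized = {z: sorted(t) for z, t in placements.items() if t}
    return 1 if normalized == key else 0


spec = DragDropSpec(
    tokens=frozenset({"word_1", "word_2"}),
    zones=frozenset({"zone_1", "zone_2"}),
    reuse_tokens=False,
    max_per_zone=1,
)
key = {"zone_1": ["word_2"], "zone_2": ["word_1"]}
print(score_drag_drop(spec, {"zone_1": ["word_2"], "zone_2": ["word_1"]}, key))  # 1
print(score_drag_drop(spec, {"zone_1": ["word_1"], "zone_2": ["word_1"]}, key))  # 0
```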

Once scoring rules have been programmed, the item is ready for testing. Another critical step in scoring development happens after test takers have responded to an item. The responses and the score points assigned to them need to be carefully reviewed before a test score is finalized. Test takers may respond with correct answers that had not been previously identified. A review of actual responses will allow changes in the scoring programming to ensure that scoring is fair and accurate.
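One simple aid for that review is a tally of the zero-point responses that test takers gave most often, so a person can check whether any of them deserves credit. The sketch below is an assumption about how such a tally might be built; the function name, the min_count threshold, and the sample responses are all made up for illustration.

```python
from collections import Counter


def flag_for_review(scored_responses, min_count: int = 2):
    """List zero-point responses seen at least min_count times, most common first."""
    zero_point = Counter(
        response for response, score in scored_responses if score == 0
    )
    return [(resp, n) for resp, n in zero_point.most_common() if n >= min_count]


# Made-up (response, score) pairs for an item whose keyed answer is 1/2.
sample = [("1/2", 1), ("0.5", 1), (".5", 0), (".5", 0), ("2", 0), (".5", 0)]
print(flag_for_review(sample))  # [('.5', 3)]
```

In this invented sample, ".5" was scored as incorrect three times, which is the kind of pattern a reviewer would want to see before scores are finalized.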

Scoring Written Essays

How can a machine understand the characteristics of a written essay well enough to provide a fair score? Certainly, things like punctuation, capitalization, and spelling that follow specific rules can be scored by a machine. But what about determining whether a writer has provided sufficient evidence or elaboration, or has just copied and rearranged words from a reading passage?

Although complex, automated essay scoring has been used to accurately grade essay writing. The process starts with samples of essays that have already been hand scored and checked by humans. Hand scoring uses rubrics (sets of scoring guidelines) that are familiar to many teachers.

These scored essays are then used to "train" the automated essay scoring engine, which uses advanced statistical techniques to evaluate specific writing characteristics that reflect fluency, grammar, sentence variety and complexity, and organization in addition to the specific words and phrases used. These characteristics allow the scoring engine to predict the score a human rater would give the essay.
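To make "writing characteristics" a little more concrete, the sketch below computes a few very simple ones from an essay's text. These particular features and names are assumptions chosen for illustration, not the features any specific scoring engine uses; real engines rely on far richer measures and then fit a statistical model that maps such features to the human-assigned scores.

```python
import re


def extract_features(essay: str) -> dict:
    """Compute a few simple, illustrative writing characteristics."""
    # Treat runs of ., !, or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[a-z']+", essay.lower())
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        # Longer average sentences are a crude proxy for sentence complexity.
        "avg_sentence_length": word_count / sentence_count if sentence_count else 0.0,
        # Share of distinct words, a crude proxy for vocabulary variety.
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
    }


features = extract_features("The cat sat. The cat ran fast!")
print(features["word_count"], features["sentence_count"])  # 7 2
```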

Automated scores have proven to be close to those given by human scorers. In practice, a percentage of machine-produced scores is compared to scores assigned by trained human raters to verify the accuracy of the scoring engine. For periodic and classroom testing, the scoring platform may allow teachers to change automated scores if their reading of an essay produces a different score.
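Two common ways to summarize such a comparison are exact agreement (the machine and the human gave the same score) and adjacent agreement (the two scores differ by no more than one point). The sketch below computes both; the function name and the sample scores are made up for illustration and are not results from any real scoring engine.

```python
def agreement_rates(human_scores, machine_scores):
    """Return (exact agreement rate, adjacent agreement rate) for paired scores."""
    if not human_scores or len(human_scores) != len(machine_scores):
        raise ValueError("Score lists must be non-empty and the same length.")
    pairs = list(zip(human_scores, machine_scores))
    exact = sum(h == m for h, m in pairs) / len(pairs)
    adjacent = sum(abs(h - m) <= 1 for h, m in pairs) / len(pairs)
    return exact, adjacent


# Made-up scores for five essays on a 1 to 4 rubric.
human = [4, 3, 2, 4, 1]
machine = [4, 3, 3, 2, 1]
print(agreement_rates(human, machine))  # (0.6, 0.8)
```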

ClearSight's machine scoring of technology-enhanced items and automated essay scoring provide many benefits to districts, schools, and teachers. They reduce teacher grading time and provide quick feedback to teachers and test takers. Millions of essays have been scored using automated essay scoring during the last few years, and numerous studies have shown that automated writing scores are valid, reliable, and fair.



Sarah Browning-Larson is Director of Assessment for Voyager Sopris Learning.
