Download the latest version of this code here: https://github.com/emmaklugman/OpenTests

In this tutorial, we show how school, district, and state assessment coordinators can create, administer, and score a test composed of publicly available questions.

Please read our accompanying article, “How Can Released State Test Items Support Interim Assessment Purposes in an Educational Crisis?” in Educational Measurement: Issues and Practice, for additional details: https://doi.org/10.1111/emip.12390 (An open-access working paper version is available here: https://doi.org/10.26300/yr8t-0v59 )

Corresponding author: Emma M. Klugman


Our article discusses many important cautions and caveats for use in the COVID-19 era, three of which we list here:

  1. Among assessments, the tests we describe here should be lower in priority than assessments of physical, mental, and social-emotional health, and lower in priority than classroom and district tests that may already be in place;

  2. We advise the use of these tests for aggregate (group-level) monitoring with no direct stakes on schools, educators, or students; and

  3. Score interpretations should be tempered by departures from typical administration conditions, where tests follow aligned instructional sequences and are administered in school settings.

In this illustration, we use released questions (items) from the 2013 administration of NAEP 4th Grade Math. Users can gather corresponding ingredients from state websites or request them from state testing programs. This recipe assumes expertise at the level of a first-year survey course in educational measurement.


Contents:

Ingredients List:

  • Describes the necessary ingredients for this recipe
  • Presents table showing public availability of these ingredients for selected testing programs

Recipe: Estimated Student Scale Scores from Sum Scores

  • Import 3PL IRT item parameters
  • Import theta to scale score equation or table
  • Import (or simulate) student response data
  • Estimate student thetas using sum scores
  • Convert estimated thetas to scale scores
  • Produce table converting sum scores to scale scores, using Test Characteristic Curve information
  • Use 3PL model to generate response probabilities for each item, for item mapping
  • Create table of item descriptions connected to scale scores for a chosen response probability
  • Export student ability data, including estimated thetas, scale scores, achievement levels, and corresponding exemplar items

Final Product: Table Showing Sum Scores with Corresponding Theta Estimates, Scale Score Estimates and Achievement Levels, with Exemplar Items Mapped at each Sum Score.

Appendices:

Appendix 1: Imprecision in Scale Score Estimates

  • Plot Test Information Function
  • Plot Conditional Standard Errors of Measurement

Appendix 2: Scale Score Estimation Using Full-Pattern Scoring

  • Estimate student ability using full-pattern scoring (including standard errors)
  • Export student ability data, including estimated thetas, standard errors, scale scores, and achievement levels

Appendix 3: Replicating NAEP’s theta to scale score equations

  • Import table of theta to scale score equations (NAEP reports these separately by subscale)
  • Provide function to match NAEP’s theta to scale score conversion, weighting transformation parameters appropriately

Ingredients List:

We list the five essential ingredients for score reports here:

  1. Test items

  2. Item parameter estimates

  3. A list or key enabling association of items and their corresponding estimates

  4. Linking functions from underlying theta scales to scale scores

  5. Achievement level cut scores

Table 1. Online Public Availability of Items and Parameter Estimates for the Construction of Open Tests: selected large-scale national and international testing programs, and programs from the 15 largest states, as of August 2020.

| Testing Program | (1) Are Operational (or Field-Tested) Items Available? | (2) Are Item Parameter Estimates Available? | (3) Is a Key Enabling a Merge of (1) and (2) Available? |
|---|---|---|---|
| NAEP | Yes | Yes | Yes |
| PISA | Yes | Yes | Yes |
| TIMSS | Yes | Yes | Yes |
| Smarter Balanced | Yes | No | No |
| New Meridian (PARCC) | Yes | No | No |
| California | Yes | No | No |
| Texas | Yes | No | No |
| Florida | Yes | No | No |
| New York | Yes (3-8 & Regents) | Yes (3-8 & Regents) | No (3-8); Yes (Regents) |
| Pennsylvania | Yes | Yes | No |
| Illinois | Yes | No | No |
| Ohio | Yes | Yes | Haphazardly |
| Georgia | Yes | No | No |
| North Carolina | Yes | No | No |
| Michigan | Yes | No | No |
| New Jersey | Yes | No | No |
| Virginia | Yes | No | No |
| Washington | Yes | No | No |
| Arizona | Yes | Yes | No |
| Massachusetts | Yes | Yes | No |

Recipe: Estimated Student Scale Scores from Sum Scores

Importing item parameters and scale scores

# Load packages used throughout this recipe
library(dplyr)      # inner_join, select, and pipes
library(fuzzyjoin)  # difference_left_join, for the item-map matching below
library(knitr)      # kable, for displaying the final table

# Import selected items
# NAEP example from: https://nces.ed.gov/NationsReportCard/nqt/Search
selected_items <- read.csv("data/Selected_Items.csv", skip = 1)

# Import item parameter data
# NAEP example from: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt_math.aspx 
item_parameters_raw <- read.csv("data/IRT_Parameters.csv")                            
# Join the two tables, using "NAEP.ID" as the key
merged_data_raw <- inner_join(x = selected_items, y = item_parameters_raw, by = "NAEP.ID")

# Drop some unnecessary columns
merged_data <- dplyr::select(merged_data_raw, -c(Question.ID, Block.ID, Type, Seq., Online, dj1, dj2, dj3, dj4))

# Takes theta(s), returns scale score(s) (for most tests, this looks like "scale_score = scale_multiplier*theta + scale_constant")
theta_to_scale_score <- function(theta, M = 31.90, K = 240.98) {
  # NAEP uses a weighted formula; we replicate that in Appendix 3
  # Here, our defaults for M and K are from the 2013 Grade 4 NAEP Math parameters for Number Properties and Operations:
  # https://nces.ed.gov/nationsreportcard/tdw/analysis/2013/trans_constants_math2013.aspx
  return(M*theta + K)
}

# Takes scale score(s), returns achievement levels
scale_score_to_achievement_level <- function(scale_scores, cutoffs = c(214, 249, 282)) {
  # Note that these cut scores usually vary by grade, subject, and year
  # Our default cutoffs are taken from here for 2013 grade 4 Math: 
  # https://nces.ed.gov/nationsreportcard/mathematics/achieve.asp#grade4
  achievement_levels <- rep("Below Basic", length(scale_scores)) 
  achievement_levels[scale_scores >= cutoffs[1]] <- "Basic"
  achievement_levels[scale_scores >= cutoffs[2]] <- "Proficient"
  achievement_levels[scale_scores >= cutoffs[3]] <- "Advanced"
  return(achievement_levels)
}
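
As a quick check of these two helper functions, here is a minimal sketch; the thetas are hypothetical, chosen only for illustration:

# Hypothetical example: convert two thetas to scale scores and achievement levels
example_thetas <- c(-0.5, 1.2)
theta_to_scale_score(example_thetas)                                    # 225.03 279.26
scale_score_to_achievement_level(theta_to_scale_score(example_thetas))  # "Basic" "Proficient"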

Scale Score Estimation Using Sum Scoring

“Invert” the TCC (test characteristic curve) to get a map from sum scores to thetas:

# Create a range of possible thetas
max_theta <- 10
thetas <- seq(from = -max_theta, to = max_theta, length.out = 200001)

# NAEP Technical Documentation shows that a normalizing constant D of 1.7 is used in their 3PL equations: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_3pl.aspx
D <- 1.7      # Change to D = 1 if no D appears in your technical manual

TCC <- function(theta, data){
  expected_scores <- rep(NA, length(theta))
  for(i in 1:length(theta)){
    expected_scores[i] <- sum(data$cj + (1 - data$cj)/(1 + exp(-D*data$aj*(theta[i] - data$bj))))
  }
  return(expected_scores)
}

all_sum_scores <- TCC(theta = thetas, data = merged_data)
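
Before inverting, a quick sanity check (a sketch): under the 3PL, the expected raw score approaches the sum of the guessing parameters as theta decreases, and approaches the number of items as theta increases:

# The TCC should span roughly sum(cj) (pure guessing) up to the number of items
range(all_sum_scores)
c(sum(merged_data$cj), nrow(merged_data))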

# Plot Test Characteristic Curve (which underlies all of what follows)
plot(x = thetas, y = all_sum_scores, 
     type = "l", xlim = c(-4, 4), ylim = c(0, 31), 
     main = "Test Characteristic Curve", xlab = "Theta", ylab = "Expected Raw Score")

# List every possible whole-number sum score (0 through the number of items)
all_whole_sum_scores <- 0:nrow(merged_data)

# For each possible whole-number sum score, find the theta whose expected score is the closest match
matching_theta_scores <- rep(NA, times = length(all_whole_sum_scores))
for(i in 1:length(all_whole_sum_scores)){
    matching_theta_scores[i] <- thetas[which.min(abs(all_sum_scores - all_whole_sum_scores[i]))]
}

# The TCC inversion above can only recover thetas between -max_theta and max_theta. For sum scores whose best-matching theta hits those bounds, we extrapolate, to provide a complete table for any possible sum score (with a floor for the lowest scores, aligning with low-stakes uses)
need_extrapolating_high <- which(matching_theta_scores == max_theta)
need_extrapolating_low <- which(matching_theta_scores == -max_theta)
dont_need_extrapolating <- matching_theta_scores[-c(need_extrapolating_high, need_extrapolating_low)]

# Simple linear bounded HOSS/LOSS extrapolation (HOSS = highest obtainable scale score, LOSS = lowest obtainable scale score)
low_step_size <- dont_need_extrapolating[2] - dont_need_extrapolating[1]
LOSS_floor <- dont_need_extrapolating[1] - low_step_size
matching_theta_scores[need_extrapolating_low] <- LOSS_floor
high_step_size <- dont_need_extrapolating[length(dont_need_extrapolating)] - 
  dont_need_extrapolating[length(dont_need_extrapolating) - 1]
for(i in need_extrapolating_high){matching_theta_scores[i] <- matching_theta_scores[i-1] + high_step_size}

# Produce a table with sum scores, thetas, scale scores, and achievement levels
matching_theta_scores_df <- as.data.frame(cbind(all_whole_sum_scores, matching_theta_scores))
colnames(matching_theta_scores_df) <- c("SumScore", "Theta")

# Convert the thetas to scale scores and then achievement levels, using our custom functions above
matching_theta_scores_df$ScaleScore <- theta_to_scale_score(matching_theta_scores_df$Theta)
matching_theta_scores_df$AchievementLevel <- scale_score_to_achievement_level(matching_theta_scores_df$ScaleScore)

# Re-order and round the Scale Scores before reporting, drop the row numbers
matching_theta_scores_df <- matching_theta_scores_df[order(matching_theta_scores_df$SumScore, decreasing = T),]
matching_theta_scores_df$ScaleScore <- round(matching_theta_scores_df$ScaleScore)
row.names(matching_theta_scores_df) <- NULL

# View table
matching_theta_scores_df
# Export table
write.csv(matching_theta_scores_df, file = "results/Estimated_Scale_Scores_From_Sum_Scores.csv", 
          row.names = F)
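
To illustrate how the exported table would be used, a minimal sketch: a (hypothetical) student who answers 23 of the 31 items correctly can be looked up directly:

# Hypothetical example: look up the row for a sum score of 23
matching_theta_scores_df[matching_theta_scores_df$SumScore == 23, ]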

Create an item map, using released item parameters

An item map takes test questions and pegs them to specific scale scores, so that users can visualize which items a student with a given scale score is likely to have already mastered, and which items they are less likely to answer correctly yet.

Example item map: https://www.nationsreportcard.gov/itemmaps/?subj=MAT&grade=4&year=2013

In this section, we provide simple code to create item maps for any set of items for which item difficulties are known, using the items above as our example.

# This function takes item parameter(s) and a response probability and returns theta(s)
ip_and_RP_to_theta <- function(data, RP = 0.74){
  # RP = response probability ("a student with scale score x should have a RP% chance of answering this item correctly")
  # RP defaults to 74% here, as NAEP uses RP = 74 for MC items with four choices
  theta = data$bj - (1/(D*data$aj)) * log((1 - RP)/(RP - data$cj))
  return(theta)
}
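
The expression above comes from setting the 3PL equal to RP and solving for theta. As a quick check (a sketch), plugging the mapped thetas back into the 3PL should recover RP for every item:

# Verify the inversion: the 3PL evaluated at each mapped theta should return RP
theta_check <- ip_and_RP_to_theta(data = merged_data, RP = 0.74)
probs <- merged_data$cj + (1 - merged_data$cj) /
  (1 + exp(-D * merged_data$aj * (theta_check - merged_data$bj)))
all.equal(probs, rep(0.74, nrow(merged_data)))   # TRUE, up to floating point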

# Add a column to merged_data pegging each item to a certain Theta with response probability RP 
merged_data$Theta <- ip_and_RP_to_theta(data = merged_data)

# Convert to scale score, round to 0dp, drop rownames
merged_data$ScaleScore <- round(theta_to_scale_score(theta = merged_data$Theta), 0)
row.names(merged_data) <- NULL

# View Item Map
merged_data
# Sort by Theta and export
write.csv(merged_data[order(merged_data$Theta, decreasing = TRUE), ], file = "results/Test_Item_Map.csv", row.names = F)

Combining item map with sum score table to produce a combined table as our final product

In this section, we show how to combine an item map with the sum score table to produce our final product.

We present two options for which mapped items to use: only the items from the test constructed above, or a larger item pool, for example, from multiple years’ items (to allow for closer matches). Our “final product” from this tutorial uses the second option, which also produced Table 2 in our published article.

Option 1: use the item map from only the items used in the constructed test above

# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score from this set of items, within some max_dist tolerance
option1_raw <- difference_left_join(x = matching_theta_scores_df,
                             y = merged_data,
                             by = "ScaleScore",
                             max_dist = 5,
                             distance_col = "Distance")

# Replace Distance = NA (for those with no item map question within max-dist) with 0
option1_raw$Distance[is.na(option1_raw$Distance)] <- 0

# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance); if still multiples, pick randomly
option_1 <- option1_raw %>% 
    group_by(SumScore) %>% 
    filter(Distance == min(Distance)) %>%
    group_by(SumScore) %>% 
    sample_n(size = 1)

# Drop the distance column
option_1 <- subset(option_1, select = -c(Distance))
# Rename remaining columns for clarity, drop row names
colnames(option_1) <- c("SumScore", "Theta", "ScaleScore", "AchievementLevel", "ClosestItemDescription", "ClosestItemSubscale", "ClosestItemScaleScore")
rownames(option_1) <- NULL

# Export table
write.csv(option_1, file = "results/Master_Table_One_Year.csv", row.names = F)

Option 2: use a broader pool of items for the item map (read additional items from a file). This is the same code that produces “Table 2” in our published article, and that we consider the “final product” of this recipe.

# This time we work with ALL grade 4 math items:
# These come directly from the NAEP item maps, which already include scale scores
all_g4_items <- read.csv("data/Combined_Item_Maps.csv")
colnames(all_g4_items) <- c("ContentClassification", "ScaleScore", "Question", "Year")

# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score, within some max_dist tolerance
option_2_raw <- difference_left_join(x = matching_theta_scores_df,
                             y = all_g4_items,
                             by = "ScaleScore",
                             max_dist = 5,
                             distance_col = "Distance")

# Replace Distance = NA (for those with no item map question within max_dist) with 0
option_2_raw$Distance[is.na(option_2_raw$Distance)] <- 0

# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance); if several are equally close, pick one at random
option_2 <- option_2_raw %>% 
    group_by(SumScore) %>% 
    filter(Distance == min(Distance)) %>%
    sample_n(size = 1)
    
# Drop the distance column
option_2 <- subset(option_2, select = -c(Distance))
# Rename remaining columns for clarity, drop row names, re-order columns, re-order rows
colnames(option_2)[which(names(option_2) == "ScaleScore.x")] <- "ScaleScore"
colnames(option_2)[which(names(option_2) == "ScaleScore.y")] <- "ClosestItemScaleScore"
rownames(option_2) <- NULL
option_2 <- option_2 %>% relocate(Question, .after = AchievementLevel)
option_2 <- option_2 %>% relocate(Theta, .after = Question)
option_2 <- option_2[order(option_2$SumScore, decreasing = T),]

Final Product:

This table shows sum scores with corresponding theta estimates, scale score estimates and achievement levels, with exemplar items mapped to each sum score.

# View and export table
kable(option_2)
| SumScore | ScaleScore | AchievementLevel | Question | Theta | ContentClassification | ClosestItemScaleScore | Year |
|---|---|---|---|---|---|---|---|
| 31 | 355 | Advanced | NA | 3.5727 | NA | NA | NA |
| 30 | 335 | Advanced | Divide a square into various shapes-Extended (CR) | 2.9463 | Geometry | 335 | 2011 |
| 29 | 315 | Advanced | Compose numbers using place value to determine winners of a game-Extended (CR) | 2.3199 | Number Properties and Operations | 315 | 2011 |
| 28 | 303 | Advanced | Solve a story problem involving comparison of unit costs-Extended Response (CR) | 1.9334 | Number Properties and Operations | 302 | 2005 |
| 27 | 293 | Advanced | Solve a story problem involving comparison of unit costs-Satisfactory Response (CR) | 1.6405 | Number Properties and Operations | 293 | 2005 |
| 26 | 286 | Advanced | Determine and apply a rule based on an input-output table (calculator available)-Satisfactory (CR) | 1.3980 | Algebra | 286 | 2017 |
| 25 | 279 | Proficient | Identify the fraction closest to the given value (MC) | 1.1863 | Number Properties and Operations | 279 | 2009 |
| 24 | 273 | Proficient | Identify given measurements on a ruler-Correct Response (CR) | 0.9950 | Measurement | 273 | 2005 |
| 23 | 267 | Proficient | Reason about odd and even numbers-Correct (CR) | 0.8173 | Number Properties and Operations | 267 | 2009 |
| 22 | 262 | Proficient | Identify multiple correct solution methods to an addition problem-Partial (SR) | 0.6488 | Number Properties and Operations | 262 | 2017 |
| 21 | 256 | Proficient | Solve a story problem involving time (calculator available)-Partial (CR) | 0.4863 | Measurement | 256 | 2011 |
| 20 | 251 | Proficient | Use an interactive tool to create a parallel line segment-Correct (CR) | 0.3275 | Geometry | 251 | 2019 |
| 19 | 246 | Basic | Mark locations on a grid-Correct (CR) | 0.1698 | Algebra | 246 | 2013 |
| 18 | 241 | Basic | Identify given measurements on a ruler-Partial Response (CR) | 0.0113 | Measurement | 241 | 2005 |
| 17 | 236 | Basic | Solve a one-variable linear equation (calculator available)-Correct (CR) | -0.1502 | Algebra | 236 | 2019 |
| 16 | 231 | Basic | Represent the probabilities of three outcomes using a picture-Correct (SR) | -0.3173 | Data Analysis, Statistics, and Probability | 230 | 2019 |
| 15 | 225 | Basic | Identify number sentence that models a balanced scale (calculator available) (MC) | -0.4926 | Algebra | 225 | 2007 |
| 14 | 219 | Basic | Compose numbers using place value to determine winners of a game-Minimal (CR) | -0.6793 | Number Properties and Operations | 219 | 2011 |
| 13 | 213 | Below Basic | Identify which of four objects is heaviest (MC) | -0.8814 | Measurement | 213 | 2003 |
| 12 | 206 | Below Basic | Find value of an unknown in a number sentence (MC) | -1.1042 | Algebra | 207 | 2017 |
| 11 | 198 | Below Basic | Identify a reasonable amount of time to walk 2 miles (calculator available) (MC) | -1.3557 | Measurement | 198 | 2003 |
| 10 | 188 | Below Basic | Determine numerical value of an unknown quantity in a whole number sentence (MC) | -1.6489 | Algebra | 189 | 2011 |
| 9 | 177 | Below Basic | Divide a square into various shapes-Minimal (CR) | -2.0079 | Geometry | 173 | 2011 |
| 8 | 162 | Below Basic | Identify a figure that is not symmetric (calculator available) (MC) | -2.4804 | Geometry | 165 | 2011 |
| 7 | 140 | Below Basic | NA | -3.1762 | NA | NA | NA |
| 6 | 98 | Below Basic | NA | -4.4845 | NA | NA | NA |
| 5 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 4 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 3 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 2 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 1 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 0 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
write.csv(option_2, file = "results/Master_Table_All_Years.csv", row.names = F)

Appendix 1: Imprecision in Scale Score Estimates

How do we convey a sense of the uncertainty in these score estimates?

Because students with “fixed” thetas would score slightly differently each time they retook the test, scale score estimates carry measurement error. Below, we plot the test information function and invert it to show the conditional standard errors of measurement (CSEMs), which convey how precise the scale score estimates are at each level of theta.

# Create function to calculate test information for a given theta
test_information <- function(theta, data = merged_data){
  # Rescale the discriminations (modifies a local copy of data only)
  data$aj = data$aj/D
  # 3PL item information, a^2(1-c) / [(c + e^(a(theta-b))) * (1 + e^(-a(theta-b)))^2], summed over items
  num <- ((data$aj ^ 2) * (1 - data$cj))
  denom_a <- (data$cj + exp(data$aj * (theta - data$bj)))
  denom_b <- ((1 + exp(-1 * data$aj * (theta - data$bj)))^2)
  return(sum(num/(denom_a * denom_b)))
}

# Plot Test Information Function
thetas <- seq(from = -4, to = 4, by = 0.01)
informations <- sapply(thetas, test_information)
plot(x = thetas, y = informations, 
     type = "l", xlim = c(-4, 4),
     main = "Test Information Function", xlab = "Theta", ylab = "Information")

# Invert test information to plot Conditional Standard Errors of Measurement
csems <- 1/sqrt(informations)
plot(x = thetas, y = csems, 
     type = "l", xlim = c(-4, 4),
     main = "Conditional Standard Errors of Measurement", xlab = "Theta", ylab = "CSEM")


Appendix 2: Scale Score Estimation Using Full-Pattern Scoring

Full-pattern scoring (using a full table recording how each student performed on each question) can improve ability estimation.

#install.packages("irtoys")      # May require some dependencies
library(irtoys)

# Turn the item parameters into the matrix form that irtoys expects
ip <- as.matrix(dplyr::select(merged_data, aj, bj, cj))               
# The R package irtoys uses a normalizing constant of 1, so we "undo" the normalizing constant first
ip[,1] <- ip[,1]/D

# Use parameter estimates to simulate answers for 1,000 students
# (Import student data here if using real answers)
set.seed(88)
sim_thetas <- rnorm(1000)
sim_responses <- sim(ip, sim_thetas)

# Put the parameter estimates and standard errors into the list structure that irtoys functions expect
# Note: ability estimation function does not need standard errors or variance-covariance matrix to run
parameter_list <- list(est = ip, se = NA, vcm = NA)

# Estimate student thetas, based on full-pattern scoring, using MLE (several other methods exist for this "ability" function)
mod_MLE <- ability(resp = sim_responses, ip = parameter_list, method = "MLE")

# Look at the first five rows
mod_MLE[1:5, ]
##              est       sem  n
## [1,] -0.33230723 0.9117495 31
## [2,]  0.47690365 0.8671007 31
## [3,]  2.94742907 1.0381432 31
## [4,] -2.09337926 1.2070893 31
## [5,]  0.04383402 0.8842579 31
# Convert these estimated thetas to scale scores and achievement levels
ability_df <- as.data.frame(mod_MLE)
colnames(ability_df) <- c("Theta", "se(Theta)", "QuestionsAnswered")
ability_df$ScaleScore <- theta_to_scale_score(ability_df$Theta)
ability_df$AchievementLevel <- scale_score_to_achievement_level(ability_df$ScaleScore)

# Round the scale scores to 0 dp before reporting
ability_df$ScaleScore <- round(ability_df$ScaleScore, 0)

# View first five rows of table:
ability_df[1:5, ]
# Export the data
write.csv(ability_df, file = "results/Estimated_Scores_From_Full_Pattern_Scores.csv", row.names = F)
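
As an optional check (a sketch, reusing the sum score table from the main recipe), the full-pattern theta estimates can be compared with the sum-score-based estimates; the two should agree closely, with full-pattern scoring additionally differentiating among students who share a sum score:

# Compare full-pattern MLE thetas to the sum-score lookup table
sum_scores <- rowSums(sim_responses)
table_thetas <- matching_theta_scores_df$Theta[match(sum_scores, matching_theta_scores_df$SumScore)]
plot(x = table_thetas, y = ability_df$Theta,
     xlab = "Theta from sum-score table", ylab = "Theta from full-pattern MLE")
abline(a = 0, b = 1)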

Appendix 3: Replicating NAEP’s theta to scale score equations

NAEP’s theta to scale score transformation equations are estimated separately for each subscale; the transformation constants are published here: https://nces.ed.gov/nationsreportcard/tdw/analysis/2013/trans_constants_math2013.aspx

We provide an alternative theta_to_scale_score function below, which can weight these subscale parameters either by default proportions or by their proportions in your selected test items, to most accurately mimic NAEP scoring.

# Import table of transformation constants
scale_scores_raw <- read.csv("data/Score_Equations.csv")  
  
# Takes theta(s), returns scale score(s)
theta_to_scale_score <- function(theta, weights = c(0.4, 0.2, 0.15, 0.1, 0.15)) {
  # Complex version: weighted average of each of the five subscale equations
  # Default weights taken from here for 2013 grade 4 Math: 
  # https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_determination_composite.aspx
  # Alternatively, user can provide weights customized to the actual distribution of questions used
  A = weighted.mean(x = scale_scores_raw$A, w = weights)
  B = weighted.mean(x = scale_scores_raw$B, w = weights)
  return(A*theta + B)
}
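
As a usage sketch, custom weights might reflect the subscale mix of the items actually used; the item counts below are hypothetical, and must be ordered to match the subscale rows in data/Score_Equations.csv:

# Hypothetical example: weight by the subscale mix of a 31-item form
# (counts ordered to match the subscale rows in data/Score_Equations.csv)
custom_weights <- c(12, 6, 5, 3, 5) / 31
theta_to_scale_score(c(-0.5, 1.2), weights = custom_weights)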