Download the latest version of this code here: https://github.com/emmaklugman/OpenTests

In this tutorial, we show how school, district, and state assessment coordinators can create, administer, and score a test composed of publicly available questions.

Please read our accompanying article, “How Can Released State Test Items Support Interim Assessment Purposes in an Educational Crisis?” in Educational Measurement: Issues and Practice, for additional details: https://doi.org/10.1111/emip.12390 (An open-access working paper version is available here: https://doi.org/10.26300/yr8t-0v59 )

Corresponding author: Emma M. Klugman


Our article discusses many important cautions and caveats for use in the COVID-19 era, three of which we list here:

  1. Among assessments, the tests we describe here should be lower in priority than assessments of physical, mental, and social-emotional health, and lower in priority than classroom and district tests that may already be in place;

  2. We advise the use of these tests for aggregate (group-level) monitoring with no direct stakes on schools, educators, or students; and

  3. Score interpretations should be tempered by departures from typical administration conditions, where tests follow aligned instructional sequences and are administered in school settings.

In this illustration, we use released questions (items) from the 2013 administration of NAEP 4th Grade Math. Users can gather corresponding ingredients from state websites or request them from state testing programs. This recipe assumes expertise at the level of a first-year survey course in educational measurement.


Contents:

Ingredients List:

  • Describes the necessary ingredients for this recipe
  • Presents table showing public availability of these ingredients for selected testing programs

Recipe: Estimated Student Scale Scores from Sum Scores

  • Import 3PL IRT item parameters
  • Import theta to scale score equation or table
  • Import (or simulate) student response data
  • Estimate student thetas using sum scores
  • Convert estimated thetas to scale scores
  • Produce table converting sum scores to scale scores, using Test Characteristic Curve information
  • Use 3PL model to generate response probabilities for each item, for item mapping
  • Create table of item descriptions connected to scale scores for a chosen response probability
  • Export student ability data, including estimated thetas, scale scores, achievement levels, and corresponding exemplar items

Final Product: Table Showing Sum Scores with Corresponding Theta Estimates, Scale Score Estimates and Achievement Levels, with Exemplar Items Mapped at each Sum Score.

Appendices:

Appendix 1: Imprecision in Scale Score Estimates

  • Plot Test Information Function
  • Plot Conditional Standard Errors of Measurement

Appendix 2: Scale Score Estimation Using Full-Pattern Scoring

  • Estimate student ability using full-pattern scoring (including standard errors)
  • Export student ability data, including estimated thetas, standard errors, scale scores, and achievement levels

Appendix 3: Replicating NAEP’s theta to scale score equations

  • Import table of theta to scale score equations (NAEP reports these separately by subscale)
  • Provide function to match NAEP’s theta to scale score conversion, weighting transformation parameters appropriately

Ingredients List:

We list the five essential ingredients for score reports here:

  1. Test items

  2. Item parameter estimates

  3. A list or key enabling association of items and their corresponding estimates

  4. Linking functions from underlying theta scales to scale scores

  5. Achievement level cut scores

Table 1. Online Public Availability of Items and Parameter Estimates for the Construction of Open Tests: selected large-scale national and international testing programs, and programs from the 15 largest states, as of August 2020.

| Testing Program | (1) Are Operational (or Field-Tested) Items Available? | (2) Are Item Parameter Estimates Available? | (3) Is a Key Enabling a Merge of (1) and (2) Available? |
|---|---|---|---|
| NAEP | Yes | Yes | Yes |
| PISA | Yes | Yes | Yes |
| TIMSS | Yes | Yes | Yes |
| Smarter Balanced | Yes | No | No |
| New Meridian (PARCC) | Yes | No | No |
| California | Yes | No | No |
| Texas | Yes | No | No |
| Florida | Yes | No | No |
| New York | Yes (3-8 & Regents) | Yes (3-8 & Regents) | No (3-8); Yes (Regents) |
| Pennsylvania | Yes | Yes | No |
| Illinois | Yes | No | No |
| Ohio | Yes | Yes | Haphazardly |
| Georgia | Yes | No | No |
| North Carolina | Yes | No | No |
| Michigan | Yes | No | No |
| New Jersey | Yes | No | No |
| Virginia | Yes | No | No |
| Washington | Yes | No | No |
| Arizona | Yes | Yes | No |
| Massachusetts | Yes | Yes | No |

Recipe: Estimated Student Scale Scores from Sum Scores

Importing item parameters and scale scores

# Load packages used throughout this recipe
library(dplyr)      # inner_join, select, and pipes
library(fuzzyjoin)  # difference_left_join, for the item-map matching below
library(knitr)      # kable, for displaying the final table

# Import selected items
# NAEP example from: https://nces.ed.gov/NationsReportCard/nqt/Search
selected_items <- read.csv("data/Selected_Items.csv", skip = 1)

# Import item parameter data
# NAEP example from: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt_math.aspx 
item_parameters_raw <- read.csv("data/IRT_Parameters.csv")                            
# Join the two tables, using "NAEP.ID" as the key
merged_data_raw <- inner_join(x = selected_items, y = item_parameters_raw, by = "NAEP.ID")

# Drop some unnecessary columns
merged_data <- dplyr::select(merged_data_raw, -c(Question.ID, Block.ID, Type, Seq., Online, dj1, dj2, dj3, dj4))

# Takes theta(s), returns scale score(s) (for most tests, this looks like "scale_score = scale_multiplier*theta + scale_constant")
theta_to_scale_score <- function(theta, M = 31.90, K = 240.98) {
  # NAEP uses a weighted formula; we replicate that in Appendix 3
  # Here, our defaults for M and K are from the 2013 Grade 4 NAEP Math parameters for Number Properties and Operations:
  # https://nces.ed.gov/nationsreportcard/tdw/analysis/2013/trans_constants_math2013.aspx
  return(M*theta + K)
}

# Takes scale score(s), returns achievement levels
scale_score_to_achievement_level <- function(scale_scores, cutoffs = c(214, 249, 282)) {
  # Note that these cut scores usually vary by grade, subject, and year
  # Our default cutoffs are taken from here for 2013 grade 4 Math: 
  # https://nces.ed.gov/nationsreportcard/mathematics/achieve.asp#grade4
  achievement_levels <- rep("Below Basic", length(scale_scores)) 
  achievement_levels[scale_scores >= cutoffs[1]] <- "Basic"
  achievement_levels[scale_scores >= cutoffs[2]] <- "Proficient"
  achievement_levels[scale_scores >= cutoffs[3]] <- "Advanced"
  return(achievement_levels)
}
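
As a quick check of these two helper functions, here is a minimal sketch; the thetas are hypothetical, chosen only for illustration:

# Hypothetical example: convert two thetas to scale scores and achievement levels
example_thetas <- c(-0.5, 1.2)
theta_to_scale_score(example_thetas)                                    # 225.03 279.26
scale_score_to_achievement_level(theta_to_scale_score(example_thetas))  # "Basic" "Proficient"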

Scale Score Estimation Using Sum Scoring

“Invert” the TCC (test characteristic curve) to get a map from sum scores to thetas:

# Create a range of possible thetas
max_theta <- 10
thetas <- seq(from = -max_theta, to = max_theta, length.out = 200001)

# NAEP Technical Documentation shows that a normalizing constant D of 1.7 is used in their 3PL equations: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_3pl.aspx
D <- 1.7      # Change to D = 1 if no D appears in your technical manual

TCC <- function(theta, data){
  expected_scores <- rep(NA, length(theta))
  for(i in 1:length(theta)){
    expected_scores[i] <- sum(data$cj + (1 - data$cj)/(1 + exp(-D*data$aj*(theta[i] - data$bj))))
  }
  return(expected_scores)
}

all_sum_scores <- TCC(theta = thetas, data = merged_data)
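
Before inverting, a quick sanity check (a sketch): under the 3PL, the expected raw score approaches the sum of the guessing parameters as theta decreases, and approaches the number of items as theta increases:

# The TCC should span roughly sum(cj) (pure guessing) up to the number of items
range(all_sum_scores)
c(sum(merged_data$cj), nrow(merged_data))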

# Plot Test Characteristic Curve (which underlies all of what follows)
plot(x = thetas, y = all_sum_scores, 
     type = "l", xlim = c(-4, 4), ylim = c(0, 31), 
     main = "Test Characteristic Curve", xlab = "Theta", ylab = "Expected Raw Score")

# List every possible whole-number sum score (0 through the number of items)
all_whole_sum_scores <- 0:nrow(merged_data)

# For each possible whole-number sum score, find the theta whose expected score is the closest match
matching_theta_scores <- rep(NA, times = length(all_whole_sum_scores))
for(i in 1:length(all_whole_sum_scores)){
    matching_theta_scores[i] <- thetas[which.min(abs(all_sum_scores - all_whole_sum_scores[i]))]
}

# The TCC inversion above can only recover thetas between -max_theta and max_theta. For sum scores whose best-matching theta hits those bounds, we extrapolate, to provide a complete table for any possible sum score (with a floor for the lowest scores, aligning with low-stakes uses)
need_extrapolating_high <- which(matching_theta_scores == max_theta)
need_extrapolating_low <- which(matching_theta_scores == -max_theta)
dont_need_extrapolating <- matching_theta_scores[-c(need_extrapolating_high, need_extrapolating_low)]

# Simple linear bounded HOSS/LOSS extrapolation (HOSS = highest obtainable scale score, LOSS = lowest obtainable scale score)
low_step_size <- dont_need_extrapolating[2] - dont_need_extrapolating[1]
LOSS_floor <- dont_need_extrapolating[1] - low_step_size
matching_theta_scores[need_extrapolating_low] <- LOSS_floor
high_step_size <- dont_need_extrapolating[length(dont_need_extrapolating)] - 
  dont_need_extrapolating[length(dont_need_extrapolating) - 1]
for(i in need_extrapolating_high){matching_theta_scores[i] <- matching_theta_scores[i-1] + high_step_size}

# Produce a table with sum scores, thetas, scale scores, and achievement levels
matching_theta_scores_df <- as.data.frame(cbind(all_whole_sum_scores, matching_theta_scores))
colnames(matching_theta_scores_df) <- c("SumScore", "Theta")

# Convert the thetas to scale scores and then achievement levels, using our custom functions above
matching_theta_scores_df$ScaleScore <- theta_to_scale_score(matching_theta_scores_df$Theta)
matching_theta_scores_df$AchievementLevel <- scale_score_to_achievement_level(matching_theta_scores_df$ScaleScore)

# Re-order and round the Scale Scores before reporting, drop the row numbers
matching_theta_scores_df <- matching_theta_scores_df[order(matching_theta_scores_df$SumScore, decreasing = T),]
matching_theta_scores_df$ScaleScore <- round(matching_theta_scores_df$ScaleScore)
row.names(matching_theta_scores_df) <- NULL

# View table
matching_theta_scores_df
# Export table
write.csv(matching_theta_scores_df, file = "results/Estimated_Scale_Scores_From_Sum_Scores.csv", 
          row.names = F)
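
To illustrate how the exported table would be used, a minimal sketch: a (hypothetical) student who answers 23 of the 31 items correctly can be looked up directly:

# Hypothetical example: look up the row for a sum score of 23
matching_theta_scores_df[matching_theta_scores_df$SumScore == 23, ]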

Create an item map, using released item parameters

An item map takes test questions and pegs them to specific scale scores, so that users can visualize which items a student with a given scale score is likely to have already mastered, and which items they are less likely to answer correctly yet.

Example item map: https://www.nationsreportcard.gov/itemmaps/?subj=MAT&grade=4&year=2013

In this section, we provide simple code to create item maps for any set of items for which item difficulties are known, using the items above as our example.

# This function takes item parameter(s) and a response probability and returns theta(s)
ip_and_RP_to_theta <- function(data, RP = 0.74){
  # RP = response probability ("a student with scale score x should have a RP% chance of answering this item correctly")
  # RP defaults to 74% here, as NAEP uses RP = 74 for MC items with four choices
  theta = data$bj - (1/(D*data$aj)) * log((1 - RP)/(RP - data$cj))
  return(theta)
}
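
The expression above comes from setting the 3PL equal to RP and solving for theta. As a quick check (a sketch), plugging the mapped thetas back into the 3PL should recover RP for every item:

# Verify the inversion: the 3PL evaluated at each mapped theta should return RP
theta_check <- ip_and_RP_to_theta(data = merged_data, RP = 0.74)
probs <- merged_data$cj + (1 - merged_data$cj) /
  (1 + exp(-D * merged_data$aj * (theta_check - merged_data$bj)))
all.equal(probs, rep(0.74, nrow(merged_data)))   # TRUE, up to floating point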

# Add a column to merged_data pegging each item to a certain Theta with response probability RP 
merged_data$Theta <- ip_and_RP_to_theta(data = merged_data)

# Convert to scale score, round to 0dp, drop rownames
merged_data$ScaleScore <- round(theta_to_scale_score(theta = merged_data$Theta), 0)
row.names(merged_data) <- NULL

# View Item Map
merged_data
# Sort by Theta and export
write.csv(merged_data[order(merged_data$Theta, decreasing = TRUE), ], file = "results/Test_Item_Map.csv", row.names = F)

Combining item map with sum score table to produce a combined table as our final product

In this section, we show how to combine an item map with the sum score table to produce our final product.

We present two options for which mapped items to use: only the items from the test constructed above, or a larger item pool, for example, from multiple years’ items (to allow for closer matches). Our “final product” from this tutorial uses the second option, which also produced Table 2 in our published article.

Option 1: use the item map from only the items used in the constructed test above

# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score from this set of items, within some max_dist tolerance
option1_raw <- difference_left_join(x = matching_theta_scores_df,
                             y = merged_data,
                             by = "ScaleScore",
                             max_dist = 5,
                             distance_col = "Distance")

# Replace Distance = NA (for those with no item map question within max-dist) with 0
option1_raw$Distance[is.na(option1_raw$Distance)] <- 0

# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance); if still multiples, pick randomly
option_1 <- option1_raw %>% 
    group_by(SumScore) %>% 
    filter(Distance == min(Distance)) %>%
    group_by(SumScore) %>% 
    sample_n(size = 1)

# Drop the distance column
option_1 <- subset(option_1, select = -c(Distance))
# Rename remaining columns for clarity, drop row names
colnames(option_1) <- c("SumScore", "Theta", "ScaleScore", "AchievementLevel", "ClosestItemDescription", "ClosestItemSubscale", "ClosestItemScaleScore")
rownames(option_1) <- NULL

# Export table
write.csv(option_1, file = "results/Master_Table_One_Year.csv", row.names = F)

Option 2: use a broader pool of items for the item map (read additional items from a file). This is the same code that produces “Table 2” in our published article, and that we consider the “final product” of this recipe.

# This time we work with ALL grade 4 math items:
# These come directly from the NAEP item maps, which already include scale scores
all_g4_items <- read.csv("data/Combined_Item_Maps.csv")
colnames(all_g4_items) <- c("ContentClassification", "ScaleScore", "Question", "Year")

# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score, within some max_dist tolerance
option_2_raw <- difference_left_join(x = matching_theta_scores_df,
                             y = all_g4_items,
                             by = "ScaleScore",
                             max_dist = 5,
                             distance_col = "Distance")

# Replace Distance = NA (for those with no item map question within max_dist) with 0
option_2_raw$Distance[is.na(option_2_raw$Distance)] <- 0

# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance); if several are equally close, pick one at random
option_2 <- option_2_raw %>% 
    group_by(SumScore) %>% 
    filter(Distance == min(Distance)) %>%
    sample_n(size = 1)
    
# Drop the distance column
option_2 <- subset(option_2, select = -c(Distance))
# Rename remaining columns for clarity, drop row names, re-order columns, re-order rows
colnames(option_2)[which(names(option_2) == "ScaleScore.x")] <- "ScaleScore"
colnames(option_2)[which(names(option_2) == "ScaleScore.y")] <- "ClosestItemScaleScore"
rownames(option_2) <- NULL
option_2 <- option_2 %>% relocate(Question, .after = AchievementLevel)
option_2 <- option_2 %>% relocate(Theta, .after = Question)
option_2 <- option_2[order(option_2$SumScore, decreasing = T),]

Final Product:

This table shows sum scores with corresponding theta estimates, scale score estimates and achievement levels, with exemplar items mapped to each sum score.

# View and export table
kable(option_2)
| SumScore | ScaleScore | AchievementLevel | Question | Theta | ContentClassification | ClosestItemScaleScore | Year |
|---|---|---|---|---|---|---|---|
| 31 | 355 | Advanced | NA | 3.5727 | NA | NA | NA |
| 30 | 335 | Advanced | Divide a square into various shapes-Extended (CR) | 2.9463 | Geometry | 335 | 2011 |
| 29 | 315 | Advanced | Compose numbers using place value to determine winners of a game-Extended (CR) | 2.3199 | Number Properties and Operations | 315 | 2011 |
| 28 | 303 | Advanced | Solve a story problem involving comparison of unit costs-Extended Response (CR) | 1.9334 | Number Properties and Operations | 302 | 2005 |
| 27 | 293 | Advanced | Solve a story problem involving comparison of unit costs-Satisfactory Response (CR) | 1.6405 | Number Properties and Operations | 293 | 2005 |
| 26 | 286 | Advanced | Determine and apply a rule based on an input-output table (calculator available)-Satisfactory (CR) | 1.3980 | Algebra | 286 | 2017 |
| 25 | 279 | Proficient | Identify the fraction closest to the given value (MC) | 1.1863 | Number Properties and Operations | 279 | 2009 |
| 24 | 273 | Proficient | Identify given measurements on a ruler-Correct Response (CR) | 0.9950 | Measurement | 273 | 2005 |
| 23 | 267 | Proficient | Reason about odd and even numbers-Correct (CR) | 0.8173 | Number Properties and Operations | 267 | 2009 |
| 22 | 262 | Proficient | Identify multiple correct solution methods to an addition problem-Partial (SR) | 0.6488 | Number Properties and Operations | 262 | 2017 |
| 21 | 256 | Proficient | Solve a story problem involving time (calculator available)-Partial (CR) | 0.4863 | Measurement | 256 | 2011 |
| 20 | 251 | Proficient | Use an interactive tool to create a parallel line segment-Correct (CR) | 0.3275 | Geometry | 251 | 2019 |
| 19 | 246 | Basic | Mark locations on a grid-Correct (CR) | 0.1698 | Algebra | 246 | 2013 |
| 18 | 241 | Basic | Identify given measurements on a ruler-Partial Response (CR) | 0.0113 | Measurement | 241 | 2005 |
| 17 | 236 | Basic | Solve a one-variable linear equation (calculator available)-Correct (CR) | -0.1502 | Algebra | 236 | 2019 |
| 16 | 231 | Basic | Represent the probabilities of three outcomes using a picture-Correct (SR) | -0.3173 | Data Analysis, Statistics, and Probability | 230 | 2019 |
| 15 | 225 | Basic | Identify number sentence that models a balanced scale (calculator available) (MC) | -0.4926 | Algebra | 225 | 2007 |
| 14 | 219 | Basic | Compose numbers using place value to determine winners of a game-Minimal (CR) | -0.6793 | Number Properties and Operations | 219 | 2011 |
| 13 | 213 | Below Basic | Identify which of four objects is heaviest (MC) | -0.8814 | Measurement | 213 | 2003 |
| 12 | 206 | Below Basic | Find value of an unknown in a number sentence (MC) | -1.1042 | Algebra | 207 | 2017 |
| 11 | 198 | Below Basic | Identify a reasonable amount of time to walk 2 miles (calculator available) (MC) | -1.3557 | Measurement | 198 | 2003 |
| 10 | 188 | Below Basic | Determine numerical value of an unknown quantity in a whole number sentence (MC) | -1.6489 | Algebra | 189 | 2011 |
| 9 | 177 | Below Basic | Divide a square into various shapes-Minimal (CR) | -2.0079 | Geometry | 173 | 2011 |
| 8 | 162 | Below Basic | Identify a figure that is not symmetric (calculator available) (MC) | -2.4804 | Geometry | 165 | 2011 |
| 7 | 140 | Below Basic | NA | -3.1762 | NA | NA | NA |
| 6 | 98 | Below Basic | NA | -4.4845 | NA | NA | NA |
| 5 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 4 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 3 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 2 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 1 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
| 0 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
write.csv(option_2, file = "results/Master_Table_All_Years.csv", row.names = F)

Appendix 1: Imprecision in Scale Score Estimates

How do we convey a sense of the uncertainty in these score estimates?

Because students with “fixed” thetas would score slightly differently each time they retook the test, scale score estimates carry measurement error. Below, we plot the test information function and invert it to show the conditional standard errors of measurement (CSEMs), which convey how precise the scale score estimates are at each level of theta.

# Create function to calculate test information for a given theta
test_information <- function(theta, data = merged_data){
  # Rescale the discriminations (modifies a local copy of data only)
  data$aj = data$aj/D
  # 3PL item information, a^2(1-c) / [(c + e^(a(theta-b))) * (1 + e^(-a(theta-b)))^2], summed over items
  num <- ((data$aj ^ 2) * (1 - data$cj))
  denom_a <- (data$cj + exp(data$aj * (theta - data$bj)))
  denom_b <- ((1 + exp(-1 * data$aj * (theta - data$bj)))^2)
  return(sum(num/(denom_a * denom_b)))
}

# Plot Test Information Function
thetas <- seq(from = -4, to = 4, by = 0.01)
informations <- sapply(thetas, test_information)
plot(x = thetas, y = informations, 
     type = "l", xlim = c(-4, 4),
     main = "Test Information Function", xlab = "Theta", ylab = "Information")

# Invert test information to plot Conditional Standard Errors of Measurement
csems <- 1/sqrt(informations)
plot(x = thetas, y = csems, 
     type = "l", xlim = c(-4, 4),
     main = "Conditional Standard Errors of Measurement", xlab = "Theta", ylab = "CSEM")


Appendix 2: Scale Score Estimation Using Full-Pattern Scoring

Full-pattern scoring (using a full table recording how each student performed on each question) can improve ability estimation.

#install.packages("irtoys")      # May require some dependencies
library(irtoys)

# Turn the item parameters into the matrix form that irtoys expects
ip <- as.matrix(dplyr::select(merged_data, aj, bj, cj))               
# The R package irtoys uses a normalizing constant of 1, so we "undo" the normalizing constant first
ip[,1] <- ip[,1]/D

# Use parameter estimates to simulate answers for 1,000 students
# (Import student data here if using real answers)
set.seed(88)
sim_thetas <- rnorm(1000)
sim_responses <- sim(ip, sim_thetas)

# Put the parameter estimates and standard errors into the list structure that irtoys functions expect
# Note: ability estimation function does not need standard errors or variance-covariance matrix to run
parameter_list <- list(est = ip, se = NA, vcm = NA)

# Estimate student thetas, based on full-pattern scoring, using MLE (several other methods exist for this "ability" function)
mod_MLE <- ability(resp = sim_responses, ip = parameter_list, method = "MLE")

# Look at the first five rows
mod_MLE[1:5, ]
##              est       sem  n
## [1,] -0.33230723 0.9117495 31
## [2,]  0.47690365 0.8671007 31
## [3,]  2.94742907 1.0381432 31
## [4,] -2.09337926 1.2070893 31
## [5,]  0.04383402 0.8842579 31
# Convert these estimated thetas to scale scores and achievement levels
ability_df <- as.data.frame(mod_MLE)
colnames(ability_df) <- c("Theta", "se(Theta)", "QuestionsAnswered")
ability_df$ScaleScore <- theta_to_scale_score(ability_df$Theta)
ability_df$AchievementLevel <- scale_score_to_achievement_level(ability_df$ScaleScore)

# Round the scale scores to 0 dp before reporting
ability_df$ScaleScore <- round(ability_df$ScaleScore, 0)

# View first five rows of table:
ability_df[1:5, ]
# Export the data
write.csv(ability_df, file = "results/Estimated_Scores_From_Full_Pattern_Scores.csv", row.names = F)
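
As an optional check (a sketch, reusing the sum score table from the main recipe), the full-pattern theta estimates can be compared with the sum-score-based estimates; the two should agree closely, with full-pattern scoring additionally differentiating among students who share a sum score:

# Compare full-pattern MLE thetas to the sum-score lookup table
sum_scores <- rowSums(sim_responses)
table_thetas <- matching_theta_scores_df$Theta[match(sum_scores, matching_theta_scores_df$SumScore)]
plot(x = table_thetas, y = ability_df$Theta,
     xlab = "Theta from sum-score table", ylab = "Theta from full-pattern MLE")
abline(a = 0, b = 1)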

Appendix 3: Replicating NAEP’s theta to scale score equations

NAEP’s theta to scale score transformation equations are estimated separately for each subscale; the transformation constants are published here: https://nces.ed.gov/nationsreportcard/tdw/analysis/2013/trans_constants_math2013.aspx

We provide an alternative theta_to_scale_score function below, which can weight these subscale parameters either by default proportions or by their proportions in your selected test items, to most accurately mimic NAEP scoring.

# Import table of transformation constants
scale_scores_raw <- read.csv("data/Score_Equations.csv")  
  
# Takes theta(s), returns scale score(s)
theta_to_scale_score <- function(theta, weights = c(0.4, 0.2, 0.15, 0.1, 0.15)) {
  # Complex version: weighted average of each of the five subscale equations
  # Default weights taken from here for 2013 grade 4 Math: 
  # https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_determination_composite.aspx
  # Alternatively, user can provide weights customized to the actual distribution of questions used
  A = weighted.mean(x = scale_scores_raw$A, w = weights)
  B = weighted.mean(x = scale_scores_raw$B, w = weights)
  return(A*theta + B)
}
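
As a usage sketch, custom weights might reflect the subscale mix of the items actually used; the item counts below are hypothetical, and must be ordered to match the subscale rows in data/Score_Equations.csv:

# Hypothetical example: weight by the subscale mix of a 31-item form
# (counts ordered to match the subscale rows in data/Score_Equations.csv)
custom_weights <- c(12, 6, 5, 3, 5) / 31
theta_to_scale_score(c(-0.5, 1.2), weights = custom_weights)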