Download the latest version of this code here: https://github.com/emmaklugman/OpenTests
In this tutorial we show how school, district, and state assessment coordinators can create, administer, and score a test comprised of publicly available questions.
Please read our accompanying article How Can Released State Test Items Support Interim Assessment Purposes in an Educational Crisis? in Educational Measurement: Issues and Practice for additional details: https://doi.org/10.1111/emip.12390 (An open-access working paper version is available here: https://doi.org/10.26300/yr8t-0v59 )
Corresponding author: Emma M. Klugman, emma.m.klugman@gmail.com
Our article discusses many important cautions and caveats for use in the COVID-19 era, three of which we list here:
Among tests, the tests we describe here should be lower in priority than tests of physical, mental, and social-emotional health, and lower in priority than classroom and district tests that may already be in place;
We advise the use of these tests for aggregate (group-level) monitoring with no direct stakes on schools, educators, or students; and
Score interpretations should be tempered by departures from typical administration conditions, where tests follow aligned instructional sequences and are administered in school settings.
In this illustration, we use released questions (items) from the 2013 administration of NAEP 4th Grade Math. Users can gather corresponding ingredients from state websites or request them from state testing programs. This recipe assumes expertise at the level of a first-year survey course in educational measurement.
We list the 5 essential ingredients for score reports here:
Test items
Item parameter estimates
A list or key enabling association of items and their corresponding estimates
Linking functions from underlying theta scales to scale scores
Achievement level cut scores
Table 1. Online Public Availability of Items and Parameter Estimates for the Construction of Open Tests Selected large‐scale national and international testing programs and programs from the 15 largest states as of August, 2020
Testing Program | (1) Are Operational (or Field Tested) Items Available? | (2) Are Item Parameter Estimates Available? | (3) Is a Key Enabling a Merge of (1) and (2) Available? |
---|---|---|---|
NAEP | Yes | Yes | Yes |
PISA | Yes | Yes | Yes |
TIMSS | Yes | Yes | Yes |
Smarter balanced | Yes | No | No |
New Meridian (PARCC) | Yes | No | No |
California | Yes | No | No |
Texas | Yes | No | No |
Florida | Yes | No | No |
New York | Yes 3-8 & Regents | Yes 3-8 & Regents | No 3-8; Yes Regents |
Pennsylvania | Yes | Yes | No |
Illinois | Yes | No | No |
Ohio | Yes | Yes | Haphazardly |
Georgia | Yes | No | No |
North Carolina | Yes | No | No |
Michigan | Yes | No | No |
New Jersey | Yes | No | No |
Virginia | Yes | No | No |
Washington | Yes | No | No |
Arizona | Yes | Yes | No |
Massachusetts | Yes | Yes | No |
# Import selected items
# NAEP example from: https://nces.ed.gov/NationsReportCard/nqt/Search
selected_items <- read.csv("data/Selected_Items.csv", skip = 1)
# Import item parameter data
# NAEP example from: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt_math.aspx
item_parameters_raw <- read.csv("data/IRT_Parameters.csv")
# Join the two tables, using "NAEP.ID" as the key
merged_data_raw <- inner_join(x = selected_items, y = item_parameters_raw, by = "NAEP.ID")
# Drop some unnecessary rows
merged_data <- dplyr::select(merged_data_raw, -c(Question.ID, Block.ID, Type, Seq., Online, dj1, dj2, dj3, dj4))
# Takes theta(s), returns scale score(s) (for most tests, this looks like "scale_score = scale_multiplier*theta + scale_contant")
theta_to_scale_score <- function(theta, M = 31.90, K = 240.98) {
# NAEP uses a weighted formula; we replicate that in Appendix 3
# Here, our defaults for M and K are from the 2013 Grade 4 NAEP Math parameters for Number Properties and Operations:
# https://nces.ed.gov/nationsreportcard/tdw/analysis/2013/trans_constants_math2013.aspx
return(scale_score = M*theta + K)
}
# Takes theta(s), returns achievement levels
scale_score_to_achievement_level <- function(scale_scores, cutoffs = c(214, 249, 282)) {
# Note that these cutscores usually vary by grade, subject, and year
# Our default cutoffs taken from here for 2013 grade 4 Math:
# https://nces.ed.gov/nationsreportcard/mathematics/achieve.asp#grade4
achievement_levels <- rep("Below Basic", length(scale_scores))
achievement_levels[scale_scores >= cutoffs[1]] <- "Basic"
achievement_levels[scale_scores >= cutoffs[2]] <- "Proficient"
achievement_levels[scale_scores >= cutoffs[3]] <- "Advanced"
return(achievement_levels)
}
“Invert” the TCC to get a map from sum scores to thetas:
# Create a range of possible thetas
max_theta = 10
thetas = seq(from = -max_theta, to = max_theta, length.out = 200001)
# NAEP Technical Documentation shows that a normalizing constant D of 1.7 is used in their 3PL equations: https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_3pl.aspx
D <- 1.7 # Change to D = 1 if no D appears in your technical manual
TCC <- function(theta, data){
expected_scores <- rep(NA, length(theta))
for(i in 1:length(theta)){
expected_scores[i] <- sum(data$cj + (1 - data$cj)/(1 + exp(-D*data$aj*(theta[i] - data$bj))))
}
return(expected_scores)
}
all_sum_scores <- TCC(theta = thetas, data = merged_data)
# Plot Test Characteristic Curve (which underlies all of what follows)
plot(x = thetas, y = all_sum_scores,
type = "l", xlim = c(-4, 4), ylim = c(0, 31),
main = "Test Characteristic Curve", xlab = "Theta", ylab = "Expected Raw Score")
# Simplifies the above possible sum scores (decimals) into possible whole number scores
possible_sum_scores <- seq(from = ceiling(all_sum_scores[1]), to = floor(tail(all_sum_scores, 1)), by = 1)
all_whole_sum_scores <- 0:nrow(merged_data)
# For each possible whole number score, find the associated theta that's the best match
matching_theta_scores <- rep(NA, times = length(possible_sum_scores))
for(i in 1:length(all_whole_sum_scores)){
matching_theta_scores[i] <- thetas[which.min(abs(all_sum_scores - all_whole_sum_scores[i]))]
}
# The TCC inversion above tries to find can find sum scores for any theta between -mac_theat to max_theta, if they're defined. We also extrapolate for sum scores beyond these, to provide a complete table for any possible sum score (with a floor for lowest scores, aligning with low-stakes uses)
need_extrapolating_high <- which(matching_theta_scores == max_theta)
need_extrapolating_low <- which(matching_theta_scores == -max_theta)
dont_need_extrapolating <- matching_theta_scores[-c(need_extrapolating_high, need_extrapolating_low)]
# Simple linear bounded HOSS/LOSS extrapolation (HOSS = highest obtainable scale score, LOSS = lowest obtainable scale score)
low_step_size <- dont_need_extrapolating[2] - dont_need_extrapolating[1]
LOSS_floor <- dont_need_extrapolating[1] - low_step_size
matching_theta_scores[need_extrapolating_low] <- LOSS_floor
high_step_size <- dont_need_extrapolating[length(dont_need_extrapolating)] -
dont_need_extrapolating[length(dont_need_extrapolating) - 1]
for(i in need_extrapolating_high){matching_theta_scores[i] <- matching_theta_scores[i-1] + high_step_size}
# Produce a table with sum scores, thetas, scale scores, and achievement levels
matching_theta_scores_df <- as.data.frame(cbind(all_whole_sum_scores, matching_theta_scores))
colnames(matching_theta_scores_df) <- c("SumScore", "Theta")
# Convert the thetas to scale scores and then achievement levels, using our custom functions above
matching_theta_scores_df$ScaleScore <- theta_to_scale_score(matching_theta_scores_df$Theta)
matching_theta_scores_df$AchievementLevel <- scale_score_to_achievement_level(matching_theta_scores_df$ScaleScore)
# Re-order and round the Scale Scores before reporting, drop the row numbers
matching_theta_scores_df <- matching_theta_scores_df[order(matching_theta_scores_df$SumScore, decreasing = T),]
matching_theta_scores_df$ScaleScore <- round(matching_theta_scores_df$ScaleScore)
row.names(matching_theta_scores_df) <- NULL
# View table
matching_theta_scores_df
# Export table
write.csv(matching_theta_scores_df, file = "results/Estimated_Scale_Scores_From_Sum_Scores.csv",
row.names = F)
An item map takes test questions and pegs them to specific scale scores, such that users can visualize which items a student of a given scale score is likely have already mastered, and which items they are less likely to answer correctly yet.
Example item map: https://www.nationsreportcard.gov/itemmaps/?subj=MAT&grade=4&year=2013
In this section, we provide simple code to create item maps for any set of items for which item difficulties are known, using the items above as our example.
# This function takes item parameter(s) and a response probability and returns theta(s)
ip_and_RP_to_theta <- function(data, RP = 0.74){
# RP = response probability ("a student with scale score x should have a RP% chance of answering this item correctly")
# RP defaults to 74% here, as NAEP uses RP = 74 for MC items with four choices
theta = data$bj - (1/{D*data$aj}) * log((1 - RP)/(RP - data$cj)) #307
return(theta)
}
# Add a column to item_parameters pegging each item to a certain Theta with response probability RP
merged_data$Theta <- ip_and_RP_to_theta(data = merged_data)
# Convert to scale score, round to 0dp, drop rownames
merged_data$ScaleScore <- round(theta_to_scale_score(theta = merged_data$Theta), 0)
row.names(merged_data) <- NULL
# View Item Map
merged_data
# Sort by Theta and export
write.csv(merged_data[order(merged_data$Theta, decreasing = TRUE), ], file = "results/Test_Item_Map.csv", row.names = F)
In this section, we show how to combine an item maps with a sum score table to produce our final product.
We present two options for which mapped items to use: only the items from the test constructed above, or a larger item pool, for example, from multiple years’ items (to allow for closer matches). Our “final product” from this tutorial (which also produced )
Option 1: use the item map from only the items used in the constructed test above
# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score from this set of items, within some max_dist tolerance
option1_raw <- difference_left_join(x = matching_theta_scores_df,
y = merged_data,
by = "ScaleScore",
max_dist = 5,
distance_col = "Distance")
# Replace Distance = NA (for those with no item map question within max-dist) with 0
option1_raw$Distance[is.na(option1_raw$Distance)] <- 0
# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance); if still multiples, pick randomly
option_1 <- option1_raw %>%
group_by(SumScore) %>%
filter(Distance == min(Distance)) %>%
group_by(SumScore) %>%
sample_n(size = 1)
# Drop the distance column
option_1 <- subset(option_1, select = -c(Distance))
# Rename remaining columns for clarity, drop row names
colnames(option_1) <- c("SumScore", "Theta", "ScaleScore", "AchievementLevel", "ClosestItemDescription", "ClosestItemSubscale", "ClosestItemScaleScore")
rownames(option_1) <- NULL
# Export table
write.csv(option_1, file = "results/Master_Table_One_Year.csv", row.names = F)
Option 2: use a broader pool of items for the item map (read additional items from a file). This is the same code that produces “Table 2” in our published article, and that we consider the “final product” of this recipe.
# This time we work with ALL grade 4 math items:
# Ours is directly from NAEP maps, which already has scale scores
all_g4_items <- read.csv("data/Combined_Item_Maps.csv")
colnames(all_g4_items) <- c("ContentClassification", "ScaleScore", "Question", "Year")
# "Fuzzy join" the matching_theta_scores_df ScaleScore with the nearest item map scale score, within some max_dist tolerance
option_2_raw <- difference_left_join(x = matching_theta_scores_df,
y = all_g4_items,
by = "ScaleScore",
max_dist = 5,
distance_col = "Distance")
# Replace Distance = NA (for those with no item map question within max-dist) with 0
option_2_raw$Distance[is.na(option_2_raw$Distance)] <- 0
# For those Scale Scores with multiple item map matches within max_dist, pick the closest (smallest distance), if still multiples of equal distance, pick randomly
option_2 <- option_2_raw %>%
group_by(SumScore) %>%
filter(Distance == min(Distance)) %>%
group_by(SumScore) %>%
sample_n(size = 1)
# Drop the distance column
option_2 <- subset(option_2, select = -c(Distance))
# Rename remaining columns for clarity, drop row names, re-order columns, re-order rows
colnames(option_2)[which(names(option_2) == "ScaleScore.x")] <- "ScaleScore"
colnames(option_2)[which(names(option_2) == "ScaleScore.y")] <- "ClosestItemScaleScore"
rownames(option_2) <- NULL
option_2 <- option_2 %>% relocate(Question, .after = AchievementLevel)
option_2 <- option_2 %>% relocate(Theta, .after = Question)
option_2 <- option_2[order(option_2$SumScore, decreasing = T),]
This table shows sum scores with corresponding theta estimates, scale score estimates and achievement levels, with exemplar items mapped to each sum score.
# View and export table
kable(option_2)
SumScore | ScaleScore | AchievementLevel | Question | Theta | ContentClassification | ClosestItemScaleScore | Year |
---|---|---|---|---|---|---|---|
31 | 355 | Advanced | NA | 3.5727 | NA | NA | NA |
30 | 335 | Advanced | Divide a square into various shapes-Extended (CR) | 2.9463 | Geometry | 335 | 2011 |
29 | 315 | Advanced | Compose numbers using place value to determine winners of a game-Extended (CR) | 2.3199 | Number Properties and Operations | 315 | 2011 |
28 | 303 | Advanced | Solve a story problem involving comparison of unit costs-Extended Response (CR) | 1.9334 | Number Properties and Operations | 302 | 2005 |
27 | 293 | Advanced | Solve a story problem involving comparison of unit costs-Satisfactory Response (CR) | 1.6405 | Number Properties and Operations | 293 | 2005 |
26 | 286 | Advanced | Determine and apply a rule based on an input-output table (calculator available)-Satisfactory (CR) | 1.3980 | Algebra | 286 | 2017 |
25 | 279 | Proficient | Identify the fraction closest to the given value (MC) | 1.1863 | Number Properties and Operations | 279 | 2009 |
24 | 273 | Proficient | Identify given measurements on a ruler-Correct Response (CR) | 0.9950 | Measurement | 273 | 2005 |
23 | 267 | Proficient | Reason about odd and even numbers-Correct (CR) | 0.8173 | Number Properties and Operations | 267 | 2009 |
22 | 262 | Proficient | Identify multiple correct solution methods to an addition problem-Partial (SR) | 0.6488 | Number Properties and Operations | 262 | 2017 |
21 | 256 | Proficient | Solve a story problem involving time (calculator available)-Partial (CR) | 0.4863 | Measurement | 256 | 2011 |
20 | 251 | Proficient | Use an interactive tool to create a parallel line segment-Correct (CR) | 0.3275 | Geometry | 251 | 2019 |
19 | 246 | Basic | Mark locations on a grid-Correct (CR) | 0.1698 | Algebra | 246 | 2013 |
18 | 241 | Basic | Identify given measurements on a ruler-Partial Response (CR) | 0.0113 | Measurement | 241 | 2005 |
17 | 236 | Basic | Solve a one-variable linear equation (calculator available)-Correct (CR) | -0.1502 | Algebra | 236 | 2019 |
16 | 231 | Basic | Represent the probabilities of three outcomes using a picture-Correct (SR) | -0.3173 | Data Analysis, Statistics, and Probability | 230 | 2019 |
15 | 225 | Basic | Identify number sentence that models a balanced scale (calculator available) (MC) | -0.4926 | Algebra | 225 | 2007 |
14 | 219 | Basic | Compose numbers using place value to determine winners of a game-Minimal (CR) | -0.6793 | Number Properties and Operations | 219 | 2011 |
13 | 213 | Below Basic | Identify which of four objects is heaviest (MC) | -0.8814 | Measurement | 213 | 2003 |
12 | 206 | Below Basic | Find value of an unknown in a number sentence (MC) | -1.1042 | Algebra | 207 | 2017 |
11 | 198 | Below Basic | Identify a reasonable amount of time to walk 2 miles (calculator available) (MC) | -1.3557 | Measurement | 198 | 2003 |
10 | 188 | Below Basic | Determine numerical value of an unknown quantity in a whole number sentence (MC) | -1.6489 | Algebra | 189 | 2011 |
9 | 177 | Below Basic | Divide a square into various shapes-Minimal (CR) | -2.0079 | Geometry | 173 | 2011 |
8 | 162 | Below Basic | Identify a figure that is not symmetric (calculator available) (MC) | -2.4804 | Geometry | 165 | 2011 |
7 | 140 | Below Basic | NA | -3.1762 | NA | NA | NA |
6 | 98 | Below Basic | NA | -4.4845 | NA | NA | NA |
5 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
4 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
3 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
2 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
1 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
0 | 56 | Below Basic | NA | -5.7928 | NA | NA | NA |
write.csv(option_2, file = "results/Master_Table_All_Years.csv", row.names = F)
How do we convey a sense of uncertainty with these sorts of score estimates?
Because students with “fixed” thetas can retake the test and score slightly differently each time, we simulate this below to create an empirical range of possible scale scores for each student sum score.
# Create function to calculate test information for a given theta
test_information <- function(theta, data = merged_data){
data$aj = data$aj/D
num <- ((data$aj ^ 2) * (1 - data$cj))
denom_a <- (data$cj + exp(data$aj * (theta - data$bj)))
denom_b <- ((1 + exp(-1 * data$aj * (theta - data$bj)))^2)
return(sum(num/(denom_a * denom_b)))
}
# Plot Test Information Function
thetas <- seq(from = -4, to = 4, by = 0.01)
informations <- sapply(thetas, test_information)
plot(x = thetas, y = informations,
type = "l", xlim = c(-4, 4),
main = "Test Information Function", xlab = "Theta", ylab = "Information")
# Invert test information to plot Conditional Standard Errors of Measurement
csems <- 1/sqrt(informations)
plot(x = thetas, y = csems,
type = "l", xlim = c(-4, 4),
main = "Conditional Standard Errors of Measurement", xlab = "Theta", ylab = "CSEM")
Full-pattern scoring (using a full table recording how each student performed on each question) can improve ability estimation.
#install.packages("irtoys") # May require some dependencies
library(irtoys)
# Turn the item parameters into the matrix form that irtoys expects
ip <- as.matrix(dplyr::select(merged_data, aj, bj, cj))
# The R package irtoys uses a normalizing constant of 1, so we "undo" the normalizing constant first
ip[,1] <- ip[,1]/D
# Use parameter estimates to simulate answers for 100 students
# (Import student data here if using real answers)
set.seed(88)
sim_thetas <- rnorm(1000)
sim_responses <- sim(ip, sim_thetas)
# Put the parameter estimates and standard errors into the list structure that irtoys functions expect
# Note: ability estimation function does not need standard errors or variance-covariance matrix to run
parameter_list <- list(est = ip, se = NA, vcm = NA)
# Estimate student thetas, based on full-pattern scoring, using MLE (several other methods exist for this "ability" function)
mod_MLE<- ability(resp = sim_responses, ip = parameter_list, method = "MLE")
# Look at the first five rows
mod_MLE[1:5, ]
## est sem n
## [1,] -0.33230723 0.9117495 31
## [2,] 0.47690365 0.8671007 31
## [3,] 2.94742907 1.0381432 31
## [4,] -2.09337926 1.2070893 31
## [5,] 0.04383402 0.8842579 31
# Convert these estimated thetas to scale scores and achievement levels
ability_df <- as.data.frame(mod_MLE)
colnames(ability_df) <- c("Theta", "se(Theta)", "QuestionsAnswered")
ability_df$ScaleScore <- theta_to_scale_score(ability_df$Theta)
ability_df$AchievementLevel <- scale_score_to_achievement_level(ability_df$ScaleScore)
# Round the scale scores to 0 dp before reporting
ability_df$ScaleScore <- round(ability_df$ScaleScore, 0)
# View first five rows of table:
ability_df[1:5, ]
# Export the data
write.csv(ability_df, file = "results/Estimated_Scores_From_Full_Pattern_Scores.csv", row.names = F)
NAEP’s theta to scale score transforming equations are separated by sub-scale, like this:
We provide an alternative theta_to_scale_score function below, which can weight these subscale parameters either by default proportions or by their proportions in your selected test items, to most accurately mimic NAEP scoring.
# Import table of transformation constants
scale_scores_raw <- read.csv("data/Score_Equations.csv")
# Takes theta(s), returns scale score(s)
theta_to_scale_score <- function(theta, weights = c(0.4, 0.2, 0.15, 0.1, 0.15))
{
# Complex version: weighted average of each of the five subscale equations
# Default weights taken from here for 2013 grade 4 Math:
# https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_determination_composite.aspx
# Alternatively, user can provide weights customized to the actual distribution of questions used
A = weighted.mean(x = scale_scores_raw$A, w = weights)
B = weighted.mean(x = scale_scores_raw$B, w = weights)
return(scale_score = A*theta + B)
}