
When are differences different? Musings on IRT analyses of the HCAP cognitive battery

Written by: Alden Gross

Published on: Nov 12, 2021


My name is Alden Gross. I am a psychometrician and epidemiologist who does cognitive aging research. Specifically, I am involved in harmonizing cognitive performance data among several HCAP countries to share in the Gateway to Global Aging.

There are several steps towards creating harmonized cognitive scores. First, there is what we call pre-statistical harmonization, which is essentially an accounting exercise wherein we gather all the information we can about the battery of tests administered in each study - this includes test versions, administration instructions, scoring instructions, any deviations from standard administration, and so forth. Although the HCAP neuropsychological test battery was designed for administration in multiple countries, adaptations of varying magnitudes were made for assorted language or cultural reasons. For example, comparing the HRS-HCAP in the US to LASI-DAD in India, a question that asks people to read and follow a written command ("Close your eyes") was modified for illiterate participants in India by asking respondents just to mimic the interviewer. Whether such a modification led to a truly comparable question is dubious, since the neurological reason for asking the question (processing written instructions) is lost.

What about other, more esoteric differences in test administration? For example, the HCAP includes a Letter Cancellation task in which respondents are to strike out the letters P and W amidst a bunch of distractors on a page. In the US, this test is given on a standard 8.5x11in sheet, whereas in England (for ELSA HCAP), the standard paper size is A4. Does paper size, which could affect font or number of presented letters, matter?

Sometimes the administration of a test is identical but the coding procedures differ. A great example of this is the Brave Man story, a story recall task. Recall on this episodic memory test can be scored via exact scoring (i.e., requiring respondents to recall a story element word for word) or gist scoring (where points are given for recalling the general idea of a story element). Now, most of the HCAPs do provide separate variables so investigators can calculate exact and gist scores. Except ELSA’s HCAP. They only provided gist scoring, with no way to back-calculate exact scores. The solution, which often happens in harmonization work, is to take the lowest common denominator, which in this case means converting all scores to gist scoring. But then, how do we know that gist scores are cross-nationally comparable?

To summarize, how can we be sure that we are systematically reviewing every cognitive test item to be certain they match across cohorts? How serious are the various adaptations to the tests? It turns out that there is a systematic method for evaluating differences across countries in how tests were administered. Our framework for generating statistically comparable scores is based in Item Response Theory (IRT), a general modeling approach within the broader family of latent variable methods. IRT takes the focus away from overall summaries of test performance and instead focuses on the qualities and characteristics of the items that comprise a test battery such as HCAP. Most of the HCAPs have 40-50 cognitive test items in their batteries, across about 20 instruments. Here is a good description of the LASI-DAD battery (India’s HCAP); here is a paper describing the HRS-HCAP test battery. We are interested in how each test item relates to the others in coming up with an overall test score.

Differential item functioning (DIF) is an aspect of IRT that allows us to disentangle "true" or "real" differences in a latent trait - such as cognitive functioning - from measurement related differences or nonequivalent tests across groups. Among other things, DIF detection is what helps the Educational Testing Service screen out SAT and GRE test items that are more laden with racial or sex or regional differences rather than verbal or mathematical abilities.

A trick to DIF testing is that, outside of educational testing, the field is kind of a wild west characterized by a large variety of approaches. The most traditional approach to DIF testing is via a multiple indicator, multiple causes (MIMIC) model. One fits a latent variable model for cognitive functioning in which the latent trait is presumed to lead to (in a non-causal sense on the individual level; we must be careful in deference to Dr. Borsboom) responses on multiple individual test items; that is, the model is multivariate and includes multiple indicators. Factor loadings and thresholds for each test item describe, respectively, how well the item correlates with the latent trait and where along the latent trait the item provides optimal measurement (for instance, an easy test item that most people answer correctly provides the most information at the easier end of the latent cognitive spectrum). In a MIMIC model, we regress the latent trait and each item on a grouping variable (thus the Multiple Causes part of MIMIC), such as indicators for cohorts or countries, as a way to statistically test whether these factor loadings and thresholds differ across groups. Here is my favorite exemplar of a paper that used MIMIC.
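The loading/threshold intuition can be sketched numerically. In IRT terms, a loading corresponds roughly to an item's discrimination and a threshold to its difficulty; under a two-parameter logistic (2PL) model, an item's Fisher information peaks at its difficulty, so an easy item measures best at the low end of the trait. A minimal numpy sketch (the parameter values are made up for illustration, not taken from any HCAP battery):

```python
import numpy as np

theta = np.linspace(-4, 4, 801)  # grid over the latent cognition trait

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct | theta),
    with discrimination a (~loading) and difficulty b (~threshold)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p)."""
    p = icc(theta, a, b)
    return a**2 * p * (1 - p)

easy_info = item_info(theta, a=1.5, b=-1.5)  # easy item (hypothetical parameters)
hard_info = item_info(theta, a=1.5, b=1.0)   # harder item

# each item is most informative near its own difficulty:
# the easy item peaks around theta = -1.5, the hard one around theta = 1.0
```

This is exactly why a battery needs items spread across the difficulty range: no single item is informative everywhere along the latent spectrum.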

In addition to MIMIC models, there is alignment analysis, an Mplus software capability introduced around 2014 that is also based on a latent variable model. But the procedure is a little different and kind of like a black box: the alignment algorithm first fits a model across the different groups in which all items are presumed to have the same parameters across groups (an invariant model). Next, the algorithm systematically tests each loading and threshold for every item to identify which one differs the most by the grouping variable; baked in is a simplicity function that rewards solutions with a few larger non-invariant parameters over solutions with many smaller non-invariant parameters. At the end of the procedure, we get a list of items whose factor loadings and thresholds may differ across groups. Here's a decent example of a paper that used alignment analysis on LASI Pilot data to identify potential DIF by rurality and education in cognitive testing (as it turns out, people in more rural areas can name more animals than urbanites!).
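That simplicity preference can be made concrete. As I understand the Asparouhov and Muthén alignment loss, the total loss sums a component loss of roughly f(x) = (x² + ε)^(1/4) over group differences in parameters (ε a small constant; my sketch, with illustrative numbers). Because f behaves like the square root of |x|, it is concave, so one large difference costs less than many small differences adding up to the same total:

```python
import numpy as np

def component_loss(x, eps=0.01):
    """Approximately sqrt(|x|): concave, so it penalizes spreading
    non-invariance across many parameters more than concentrating it."""
    return np.sqrt(np.sqrt(x**2 + eps))

# a total group difference of 1.0 in loadings, distributed two ways:
few_large = component_loss(np.array([1.0])).sum()        # one item absorbs it all
many_small = component_loss(np.array([0.1] * 10)).sum()  # spread over ten items

print(few_large < many_small)  # True: the few-large solution has lower loss
```

In other words, the algorithm will prefer to blame one or two clearly non-invariant items rather than smear small amounts of non-invariance across the whole battery.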

In addition to MIMIC models and alignment analysis, there is DIF testing via ordinal logistic regression. Here is my favorite paper that taught me this method. We first regress each test item (using ordinal logistic regression for categorical items or linear regression for continuous items) on a factor score for the latent cognition variable of interest (estimated from a model that initially presumes no DIF), and save out the Beta parameter describing the association of the latent trait with the item (B_model1). In a second model, we additionally adjust for the grouping variable, and save out the Beta parameter for the association of the latent trait with the item (B_model2). We take the difference (B_model1-B_model2) and divide by B_model1 to calculate the percent change in the relationship of the item with the latent variable, before vs. after adjustment for the grouping variable (which, again, is cohort). Cool! Now, if the percentage change in coefficients is "big", then there is DIF by cohort, and the latent trait should be recalculated without that indicator as a presumed anchor item. One then moves on to the next item. What counts as a "big" change? We rely on Maldonado and Greenland's 1993 simulations of how much a relationship has to change for confounding to be considered empirically meaningful; their simulations advised 10%, and as an epidemiologist I will do whatever Sander Greenland tells me to do.
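For a single item, that two-model comparison looks like the sketch below (simulated data; a continuous item and OLS stand in for ordinal logistic regression to keep it self-contained; the variable names and effect sizes are mine, and the 10% flag follows the Maldonado and Greenland rule described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
group = rng.integers(0, 2, size=n)        # cohort indicator
theta = rng.normal(size=n) + 1.0 * group  # factor score; cohorts differ on the trait
# hypothetical continuous item with a cohort-specific shift (i.e., uniform DIF)
item = 0.8 * theta + 1.0 * group + rng.normal(scale=0.5, size=n)

def slope_on_theta(y, *covariates):
    """OLS slope of y on theta, optionally adjusting for covariates."""
    X = np.column_stack([np.ones(n), theta, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

b_model1 = slope_on_theta(item)         # Model 1: item ~ theta
b_model2 = slope_on_theta(item, group)  # Model 2: item ~ theta + group
pct_change = 100 * abs(b_model1 - b_model2) / abs(b_model1)
has_dif = pct_change > 10  # Maldonado & Greenland's 10% change-in-estimate rule
```

In practice one would loop this over every item, drop flagged items from the anchor set, re-estimate the factor score, and repeat until the anchor set is stable.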

We have covered MIMIC, alignment, and ordinal logistic regression approaches to DIF detection. But wait, there is more! There's David Thissen's IRTLRDIF, a stand-alone program that elegantly calculates DIF in the simple case where one has only binary indicators and a unidimensional construct. There are a lot of other DIF detection procedures that I am not as familiar with.

So, what are we doing for the HCAPs? A little of everything: the sandbox is so big and there are so many candidate items to evaluate for DIF that we are not putting all our cards into one approach right now; we are currently evaluating multiple approaches. Something that is becoming apparent is that each of these approaches may identify DIF in a set of items, but the exact item set is not replicated across approaches. That is, whether we consider item u1 as having DIF vs. item u2 as having DIF will depend on the method. We have learned an important lesson here: it is probably less important to identify the specific items that have DIF than it is to identify whether there is impactful DIF in an item set (regardless of which items we blame for it). I should clarify something about that last sentence before I get jumped in the alleyway behind the Psychometrics Convention: our context is the detection of DIF in already collected data, and we aren’t going to go back in time and change the items. If, however, your context is that you want recommendations for comparable test items for a new study, then we probably do care about which particular items have DIF; my best advice here is to use your existing data to determine why individual items might be showing DIF. Construct an interpretive argument and bring that to your next study meeting.

Regardless of which items in one’s battery show DIF, it is critical to evaluate impactful or salient DIF. What is that?! Well, the approaches above tell us how to detect DIF in individual items; we can report % changes from the logistic regression approach or regressions on group from the MIMIC model to describe the magnitude of DIF in individual items, but these statistics do not tell us how much our final overall scores are affected by the cumulative sum of all the DIF. What is the overall impact of DIF on a score (as opposed to its magnitude)? As it turns out, we have ways to test for that, too! We calculate non-DIF-adjusted cognitive factor scores, then DIF-adjusted cognitive factor scores, and take the difference. We then calculate the proportion of observations whose difference falls outside +-0.3 standard deviations of 0. Why 0.3 SDs? In computer adaptive testing, which is based on IRT, most algorithms stop testing respondents once the standard error of measurement falls below 0.3; the technical term for this is "tolerable measurement slop" (thank you, Dr. Paul Crane).
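A sketch of that impact check, with made-up score vectors standing in for the naive and DIF-adjusted factor scores (both assumed to be on a standardized, mean-0/SD-1 metric; the size of the simulated adjustment is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
naive = rng.normal(size=n)                               # scores ignoring DIF
adjusted = naive + 0.05 + rng.normal(scale=0.1, size=n)  # scores after DIF adjustment

diff = adjusted - naive
salient = np.mean(np.abs(diff) > 0.3)  # share beyond the 0.3-SD "tolerable slop" band

print(f"{100 * salient:.1f}% of respondents shift by more than 0.3 SD")
```

If that proportion is near zero, as in this simulated example, the detected DIF is real but not salient: accounting for it barely moves anyone's score.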

You might ask how DIF can be detected in some items but not produce salient or impactful differences on the estimated latent trait. Great question. There are at least three reasons. Maybe the magnitude of DIF that was detected is pretty small. Or, maybe there is positive DIF on one item that is counteracted by negative DIF on an equally important item; the items balance out overall. Or, maybe the items with DIF do not actually contribute so much to the estimation of the latent trait (because they do not correlate very well with the rest of the items in the battery).
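The second reason, offsetting DIF, is easy to demonstrate: give two equally weighted items equal and opposite group effects and a composite of the two is untouched. A simulated sketch (a simple mean composite stands in for a model-based factor score here, and the effect size d is invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
theta = rng.normal(size=n)             # latent trait, same distribution in both groups
group = rng.integers(0, 2, size=n)

d = 0.4  # hypothetical DIF effect
item1 = theta + d * group + rng.normal(scale=0.5, size=n)  # DIF favoring group 1
item2 = theta - d * group + rng.normal(scale=0.5, size=n)  # equal DIF against group 1
composite = (item1 + item2) / 2

gap = lambda y: y[group == 1].mean() - y[group == 0].mean()
# each item shows a group gap of about +/-0.4, but the gaps cancel in the composite
```

Each item would be flagged in item-level DIF testing, yet the overall score shows essentially no group difference beyond the latent trait itself.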

To summarize, at the end of this long meandering road, the point of this blog post is to illustrate on a conceptual level that there are differences in HCAP batteries across countries. We have a framework for identifying those differences. More importantly, we have a framework for identifying whether those differences lead to impactful differences on individual cognitive scores. These scores are going to eventually be used to do cross-national research and so we need to be sure we get them as close to “correct” as technically possible. Will we produce the perfect score? In the immortal words of Captain Picard, "It is possible to commit no mistakes and still lose. That is not a weakness. That is life."
