Version 2.0.0 of the R package lsasim comes with new, interesting features for researchers working with background questionnaires. Of particular interest for this blog post is the ability to calculate the theoretical linear regression coefficients for the expectation of the latent trait \(\theta\) on the answers to the \(Q\) background questionnaire items. Mathematically, given the equation \[ E(\theta | Q) = \beta_0 + \sum_{i = 1}^{Q} \beta_i Q_i, \] we’re interested in calculating \(\beta = \{\beta_1, \ldots, \beta_Q\}\). This is done directly by the function beta_gen or, indirectly, by calling questionnaire_gen with the arguments theta = TRUE, full_output = TRUE, and family = "gaussian".
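As a minimal usage sketch of both routes (the argument names follow my reading of the lsasim 2.0.0 documentation and may need adjusting for your version; the numbers are purely illustrative, not the ones used in this post):

```r
library(lsasim)

set.seed(1234)

# Background questionnaire data with a latent trait and two items;
# full_output = TRUE keeps the generating parameters (covariance matrix,
# means, etc.) in the returned object.
bg <- questionnaire_gen(n_obs = 1000, n_vars = 3,
                        theta = TRUE, family = "gaussian",
                        full_output = TRUE)

# Theoretical regression coefficients implied by those generating parameters
beta_gen(bg)
```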

Since questionnaire_gen generates sample data from theoretical parameters such as the covariance matrix between \(\theta\) and \(Q\), estimates for \(\beta\) can be calculated by applying linear regression (e.g. using the lm function available in base R) to the generated data. As a matter of fact, beta_gen uses this same covariance matrix to calculate the theoretical \(\beta\). If cognitive responses are also available for the same students who answered the background questionnaire, they can be fed to an IRT package such as TAM or mirt to estimate \(\beta\) as well.
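For instance, assuming the generated data set is stored in the bg element of the full output and the items are named q1 and q2 (both assumptions worth checking against your lsasim version):

```r
# OLS estimates of the same coefficients, computed from the generated sample
dat <- bg$bg                          # generated data: theta, q1, q2 (assumed names)
summary(lm(theta ~ q1 + q2, data = dat))
```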

An interesting exercise would be to check whether those estimates for the regression coefficients match the parameters calculated by beta_gen. In order to reduce sampling variability and get estimates as close as reasonably possible to the true values, one sample might not be enough, and a more elaborate simulation study must be set up. Then again, simply performing one experiment over and over is no guarantee that we will reach the expected results. In this post, we will show you how one apparently innocuous design decision can greatly affect the results.

Simulation setup 1

Consider a population of infinite size, in statistical terms, from which an unbiased sample of students is drawn. In the figure below, the population is contained within the black circle on the left, and two independent student samples are represented by the blue and red circles.

The sampled students are then subjected to a background questionnaire and a cognitive questionnaire; their responses are graded and Item Response Theory—namely the TAM package—is used to estimate \(\theta\) and \(\beta\) from the cognitive responses.
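One way to carry out this estimation step is through tam.mml(), which accepts a matrix of person covariates in its Y argument and fits a latent regression of \(\theta\) on them; whether this exact route was used here is an assumption, and cog_resp (the scored cognitive responses) and bg_items (the background answers) are placeholders for the data described above.

```r
library(TAM)

# IRT model for the cognitive responses with a latent regression of theta
# on the background items (cog_resp and bg_items are placeholders)
mod <- tam.mml(resp = cog_resp, Y = bg_items)

mod$beta   # estimated latent regression coefficients, i.e. beta-hat
```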

Next, we administer another cognitive questionnaire to this same group of students. This seems to make sense at first, since the cognitive questionnaire is the only one being used by TAM to calculate the regression coefficients, and the student sample is an unbiased sample anyway. So a new cognitive questionnaire is administered, the answers are once again scored, and \(\hat\beta\) is once again estimated. This procedure of administering a new cognitive questionnaire is repeated a total of 1 000 times; those 1 000 estimates of \(\beta\) are then compared with the true value. For starters, summary statistics for the variables involved (\(\theta\) and the background items) in this setup are given in the tables below, which differ by sample size (the table at the top results from a sample of 1 000 students; the one at the bottom from a sample of 10 000).

N      Variable   Mean      SD
1000   theta       0.0058   0.9827
1000   q1         -0.0132   1.0215
1000   q2         -0.0016   0.9858

N      Variable   Mean      SD
10000  theta      -0.0025   1.0005
10000  q1          0.0038   1.0115
10000  q2          0.0021   0.9950
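In code, setup 1 boils down to a loop along the following lines. The helper generate_cognitive_responses() is hypothetical and stands in for whatever item and response generation was actually used; the point is only that the background sample is drawn once, outside the loop.

```r
# Setup 1 (schematic): one fixed student sample, 1 000 cognitive instruments
set.seed(42)
bg_sample <- questionnaire_gen(n_obs = 1000, n_vars = 3,
                               theta = TRUE, family = "gaussian",
                               full_output = TRUE)

beta_hat_1 <- replicate(1000, {
  # new cognitive questionnaire for the *same* students (hypothetical helper)
  cog_resp <- generate_cognitive_responses(bg_sample)
  mod <- TAM::tam.mml(resp = cog_resp,
                      Y = bg_sample$bg[c("q1", "q2")])  # element/column names assumed
  as.numeric(mod$beta)   # beta-hat for this replication
})
```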

Individual estimates from those 1 000 replications, both for the sample of 1 000 students and for the one with 10 000, can be seen in the figures below. The central red line corresponds to the true \(\beta\) provided by beta_gen.

Something doesn’t seem right. If we are dealing with unbiased samples, \(\hat\beta\) should generally hover around \(\beta\), but that does not seem to be the case, at least not for all the coefficients. What could be wrong? Perhaps the sample sizes are not large enough? Well, even samples of a few hundred observations are often large enough to show convergence and evidence of unbiasedness, so the idea that the offset observed with a sample of 10 000 (the panel for q1 with N = 10 000 shows it most clearly) is mere sampling noise rather than something systematic feels rather far-fetched.

So as not to rely on visual inspection alone, let us examine the tables below, which compare \(\beta\) with the estimates \(\hat\beta\), the standard deviations of \(\hat\beta\), the bias, the absolute bias (ASB) and the root mean squared error (RMSE). The values may seem low enough to convince some analysts that there is actually no bias in our estimates, but the offsets seen in the figures above just look too systematic to be ignored. So what could be wrong with setup 1?

N      Variable   True beta   Est beta   SD        Bias       ASB       RMSE
1000   theta       0.00000     0.00544   0.00079    0.00544   0.00544   0.00550
1000   q1          0.56935     0.58386   0.01682    0.01451   0.01837   0.02220
1000   q2         -0.78526    -0.81754   0.01435   -0.03228   0.03242   0.03532

N      Variable   True beta   Est beta   SD        Bias       ASB       RMSE
10000  theta       0.00000    -0.00032   0.00024   -0.00032   0.00034   0.00040
10000  q1          0.56935     0.56358   0.00521   -0.00577   0.00643   0.00777
10000  q2         -0.78526    -0.78773   0.00424   -0.00248   0.00403   0.00491
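For reference, the measures in these tables can be computed from the replication estimates with a few lines of base R. The sketch assumes beta_hat is a replications-by-coefficients matrix and beta_true is the vector of true values returned by beta_gen; neither name comes from the original study.

```r
# Summary measures across replications (beta_hat: replications x coefficients)
est_beta <- colMeans(beta_hat)              # "Est beta"
sd_beta  <- apply(beta_hat, 2, sd)          # "SD"
errors   <- sweep(beta_hat, 2, beta_true)   # per-replication estimation error
bias     <- colMeans(errors)                # "Bias"
asb      <- colMeans(abs(errors))           # "ASB" (mean absolute bias)
rmse     <- sqrt(colMeans(errors ^ 2))      # "RMSE"
```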

The issue with simulation setup 1

This first simulation setup relies on the convenience of sampling the population only once for the background questionnaire. Even though that sample is unbiased, once it is drawn it effectively becomes a new, finite population with its own parameters, which differ from those of the original population. For example, even if the population has \(\theta \sim N(0, 1)\), a finite sample will have a sample mean \(\bar\theta \neq 0\) and a sample standard deviation \(s_\theta \neq 1\). When we apply cognitive questionnaires to that sample, the responses reflect the parameters of the sample, not of the population, and that is where the bias comes from. This bias may be small if we draw a large enough sample, but it will still contain a systematic component, which can be eliminated by a better simulation setup. If we are to obtain proper \(\hat\beta\), we must sample directly from the population on every replication. This is what our second setup does.
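A quick base-R illustration of the point:

```r
# Even an unbiased sample from N(0, 1) carries its own parameters
set.seed(123)
theta_sample <- rnorm(10000)   # draw from the "population" N(0, 1)
mean(theta_sample)             # close to, but not exactly, 0
sd(theta_sample)               # close to, but not exactly, 1
```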

Simulation setup 2

In this scenario, our infinite population is sampled once, and that sample is administered both one background questionnaire and one cognitive questionnaire. Then, \(\hat\beta\) is calculated for this sample. Next, another sample is drawn from the population and one questionnaire of each type is administered to it; this is repeated for every replication. The figure below shows a visual representation of this setup.
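Schematically, the only change with respect to the sketch of setup 1 is that the sampling step moves inside the replication loop (generate_cognitive_responses() is the same hypothetical helper as before):

```r
# Setup 2 (schematic): a fresh sample from the population in every replication
beta_hat_2 <- replicate(1000, {
  smp <- questionnaire_gen(n_obs = 1000, n_vars = 3,
                           theta = TRUE, family = "gaussian",
                           full_output = TRUE)
  cog_resp <- generate_cognitive_responses(smp)
  mod <- TAM::tam.mml(resp = cog_resp,
                      Y = smp$bg[c("q1", "q2")])  # element/column names assumed
  as.numeric(mod$beta)   # beta-hat for this replication
})
```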

The difference between the two setups seems small, but the results are definitely affected. First, we look at some summary statistics for the generated variables. They are quite different from the ones from the first setup, which is already an indicator that the two setups yield different results.

N      Variable   Mean      SD
1000   theta       0.0395   1.0047
1000   q1          0.0402   1.0019
1000   q2          0.0458   1.0059

N      Variable   Mean      SD
10000  theta      -0.0080   0.9913
10000  q1         -0.0092   0.9919
10000  q2         -0.0070   0.9925

Then we plot the estimated coefficients for each replication, and the differences are much clearer, with a definitive improvement of our results. The systematic bias that was seen before is gone, and \(\hat\beta\) converges to \(\beta\) even for the smaller sample size.

Finally, the tables below complement the figures above with some summary statistics, showing numbers similar to those from the first setup. Once again, the tables suggest that everything is in order, which means that the improvements were most noticeable in the figures.

N      Variable   True beta   Est beta   SD        Bias       ASB       RMSE
1000   theta       0.00000    -0.00021   0.03287   -0.00021   0.02646   0.03285
1000   q1          1.00602     1.01007   0.07322    0.00404   0.05805   0.07330
1000   q2         -0.02098    -0.02298   0.07159   -0.00200   0.05674   0.07158

N      Variable   True beta   Est beta   SD        Bias       ASB       RMSE
10000  theta       0.00000    -0.00022   0.01075   -0.00022   0.00849   0.01075
10000  q1          1.00602     1.01214   0.02351    0.00612   0.01976   0.02428
10000  q2         -0.02098    -0.02141   0.02336   -0.00043   0.01878   0.02335

Conclusion

We live in an era where computing power allows us to run multiple simulation studies in a matter of minutes, something unthinkable just a couple of decades ago. This does not mean, however, that any estimation problem can be overcome with computational brute force. Careful design is necessary to make sure the data is generated from an adequate source; we have shown here one example of how one small decision can impact the final results in a relevant but hardly noticeable manner.

A secondary but also very important conclusion we can draw here is the importance of incorporating both numeric and visual inspection protocols into a scientist’s routine. The final tables in the two simulation setups presented similar values, and one would be excused for assuming, just by looking at them, that neither setup contains bias. However, the figures show that one of those setups introduces an important, albeit small, systematic bias into our estimates.