
Simulation bias: Why analyzing simulation trials is imperative

The use of computer simulations to investigate questions about cognition is nowadays common practice. For about a year we have been working on a project in which we use computer simulations based on the Rational Speech Act model (Frank & Goodman, 2012). As with any simulation study, we took extra care not to bias our simulation trials, and so we adopted randomly generated trials. As it turns out, randomly generating trials may actually introduce bias! Want to learn from our mistakes? Then read on.


The Rational Speech Act model characterizes the pragmatic reasoning speakers and listeners use when choosing or interpreting a referential signal. The input to the RSA model is a mapping M (also called a language function L) that captures which signals s the speaker or listener believes can be used to refer to a particular referent r. For example, when a student in a large class tries to refer to a classmate whose name she doesn’t know, she may say ‘the gamer’, ‘the teacher’s favourite’ or ‘the tall one’. Any of these signals may refer to several of the classmates. The example mapping may look like this:


A toy example mapping for the Rational Speech Act model. Each white square in a row indicates that the corresponding signal can be used for the referent. E.g. ‘red hair’ can be used to refer to classmate 1 but also to classmate 2.

Using recursive pragmatic reasoning, the Rational Speech Act model characterizes how speakers select the most probable signal for a referent by ‘reasoning from the listener’s perspective’. Vice versa, RSA characterizes how listeners infer the most probable referent by ‘reasoning from the communicator’s perspective’. The exact operations of the model are not relevant for the points I want to make here. What is relevant is that more realistic mappings will, of course, be much larger. In naturalistic contexts communicators may be able to refer to hundreds of referents, and they may have thousands of potentially relevant signals at their disposal. In our study we wanted to scale the simulations to mappings of more realistic size. This, however, meant that we were no longer able to hand-pick simulation trials, as the number of possible mappings grows exponentially with their size. At the time I thought this was a good thing: now we were forced to be unbiased!
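To make the recursion concrete, here is a minimal sketch of the standard RSA speaker in NumPy, applied to the toy mapping above. This is the textbook Frank & Goodman (2012) formulation with a uniform prior, not the exact code from our study; the function names are mine.

```python
import numpy as np

def literal_listener(M):
    """P(referent | signal): row-normalize the mapping, assuming a uniform prior."""
    M = np.asarray(M, dtype=float)
    return M / M.sum(axis=1, keepdims=True)

def pragmatic_speaker(M, alpha=1.0):
    """P(signal | referent): proportional to the literal listener raised to alpha."""
    util = literal_listener(M) ** alpha
    return util / util.sum(axis=0, keepdims=True)

# Toy mapping: rows are signals, columns are referents; 1 = 'can refer to'.
M = [[1, 1, 0],   # 'red hair'     -> classmates 1 and 2
     [0, 1, 1],   # 'the gamer'    -> classmates 2 and 3
     [0, 0, 1]]   # 'the tall one' -> classmate 3 only
S1 = pragmatic_speaker(M)
```

Note how the speaker ends up preferring ‘the tall one’ for classmate 3: the literal listener interprets that signal unambiguously, so it carries the most information.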


In our research, we are interested in understanding how the ambiguity of the two interlocutors’ mappings and the asymmetry between those mappings influence their communicative success. A trial in our simulation consists of a pair of mappings (M1, M2), where both mappings are generated at random for a particular level of ambiguity. For example, in a mapping with ambiguity 1/3 × the number of referents, each signal refers to a third of the referents; which referents each signal refers to is randomized. We can then compute the asymmetry between M1 and M2 as their normalized difference and use the values of the three parameters to group our trials.
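A sketch of how such a trial could be generated, assuming (as an illustration, not as our study’s exact procedure) that ambiguity is the number of referents per signal and that asymmetry is the fraction of cells on which the two mappings disagree:

```python
import numpy as np

def random_mapping(n_signals, n_referents, k, rng):
    """A random mapping in which every signal refers to exactly k referents."""
    M = np.zeros((n_signals, n_referents), dtype=int)
    for s in range(n_signals):
        M[s, rng.choice(n_referents, size=k, replace=False)] = 1
    return M

def asymmetry(M1, M2):
    """Normalized difference: the fraction of cells on which the mappings disagree."""
    return np.abs(M1 - M2).mean()

rng = np.random.default_rng(0)
# Ambiguity 1/3 x number of referents: with 6 referents, each signal covers k = 2.
M1 = random_mapping(10, 6, 2, rng)
M2 = random_mapping(10, 6, 2, rng)
```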


Interlocutors may have different mappings. For example, one may believe that classmate 2 is the teacher’s favourite, the other may not. At the same time, each mapping also has a particular level of ambiguity. The mapping on the left is more ambiguous (each signal refers to more referents) than the mapping on the right.
Designed by Freepik

At first glance, this procedure may seem unbiased; it is randomized, after all. However, while trying to understand our simulation results we noticed something odd. Our trials were not spread out across the full range of the parameters, and furthermore, there seemed to be structure to their distribution in parameter space! Take a look at the following graph, which displays, in the parameter space, a set of randomly generated pairs of mappings with 6 referents and 10 signals. (Please bear with me; larger mappings follow the same patterns, but it is impossible to depict them clearly.)

This graph displays the distribution of trials within the parameter space ambiguity(Speaker) × ambiguity(Listener) × asymmetry(Speaker, Listener). The ambiguity of the listener’s mapping is on the horizontal axis of each plot, and each of the six plots corresponds to a level of the speaker’s mapping ambiguity. Within those conditions, the violin plots display the distribution of randomly generated trials according to the asymmetry between the two mappings. Wider violin parts mean more trials around that level of asymmetry.


Clearly, there is some relationship between the three parameters, and large parts of the parameter space are not even part of our simulation. Had we not investigated the distribution of our trials, we would never have become aware of this bias! We had expected (and assumed) the random trials to be evenly distributed across the parameter space. This plot encouraged us to investigate why the trials were distributed like this. We were able to analytically determine the mean of the randomized distributions, and also the minimum and maximum of the parameter space. We can use these to plot the actual shape of the parameter space.
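This kind of audit is cheap to run before any other analysis. As an illustration, under the hypothetical generation scheme sketched earlier (each signal covers exactly k of n referents, asymmetry as cell-wise disagreement) every cell is 1 with probability p = k/n, so by linearity of expectation the mean asymmetry of a random pair is 2p(1 − p), and sampled trials cluster tightly around it:

```python
import numpy as np

n_signals, n_referents, k = 10, 6, 2
rng = np.random.default_rng(1)

def random_mapping(rng):
    """Random mapping: every signal refers to exactly k referents."""
    M = np.zeros((n_signals, n_referents), dtype=int)
    for s in range(n_signals):
        M[s, rng.choice(n_referents, size=k, replace=False)] = 1
    return M

# Sample many random trials and record the asymmetry of each pair.
asyms = np.array([
    np.abs(random_mapping(rng) - random_mapping(rng)).mean()
    for _ in range(2000)
])

# Each cell is 1 with probability p = k / n_referents, so the expected
# asymmetry is 2 * p * (1 - p); here 4/9, i.e. ~0.44.
p = k / n_referents
print(asyms.mean(), 2 * p * (1 - p))
```

Comparing the sampled histogram against such an analytic mean (and against the theoretical minimum and maximum) is exactly what revealed the bias in our case.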

The green bars display the minimum, maximum and mean (central black bars) asymmetry. The minimum and maximum are theoretical lower and upper limits; no trials can exist outside the green bars. The mean is based on the randomized generation procedure described above. The red violin plots are again the randomly generated trials.

Based on the mathematical analysis we now know the inherent relationship between speaker ambiguity, listener ambiguity and their asymmetry. We can also see that the randomized trial generation did capture this relationship, but it did not generate all possible trials. Even worse, the random trial generation does not even generate trials that we believe are quite common in human-human communication, namely those where both mappings have relatively low ambiguity but also relatively low asymmetry. The numbers are not in our favor if we want to stick with random trial generation: the total set of possible pairs of mappings is extremely large, and hence the probability of randomly generating trials far from the mean is very low. For example, the probability of two non-ambiguous mappings being maximally asymmetric (in the 6-by-10 example) is:

[Equation image: a vanishingly small probability]

There are some important lessons to be learned here. First, randomly generating trials does not by itself ensure that the trials will be uniformly (that is, unbiasedly) distributed over the parameter space. This means that it is very important to analyze trial distributions in any simulation study, before doing any other analysis. It is definitely something I will incorporate in my own research practice. Second, while the first point may be considered less relevant when modeling an empirical study where all trials are known, I would beg to differ. In those cases it is still worthwhile to understand the theoretical limits of the parameter space, especially since that space may be shaped by relationships between the parameters that are often unknown. These relationships are not just theoretical confabulations: if we believe the simulation model holds true, then those relationships will exist in reality too. Finally, given the first two points, the way forward seems to be a more informed trial generation procedure. Such a procedure may involve the following steps:

  1. Define the parameters.

  2. Analyze the parameter space, its theoretical upper and lower bounds and possible parameter dependencies.

  3. Provide argumentation for those parts of the parameter space you will study and for those parts you will not study. This argumentation can be based on empirical observations and/or on theoretical considerations.

  4. Develop a trial generation procedure that generates trials evenly across the parts of the parameter space specified in Step 3.
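For step 4, purely random sampling will rarely reach the sparse corners of the parameter space, so targeted construction is needed. Here is one possible sketch, under the same illustrative assumptions as before (ambiguity = referents per signal, asymmetry = fraction of disagreeing cells); the names and the construction are mine, not the study’s actual procedure.

```python
import numpy as np

n_signals, n_referents, k = 10, 6, 2
rng = np.random.default_rng(2)

def random_mapping(rng):
    """Random mapping: every signal refers to exactly k referents."""
    M = np.zeros((n_signals, n_referents), dtype=int)
    for s in range(n_signals):
        M[s, rng.choice(n_referents, size=k, replace=False)] = 1
    return M

def asymmetry(M1, M2):
    """Fraction of cells on which the two mappings disagree."""
    return np.abs(M1 - M2).mean()

def targeted_pair(target, rng):
    """Construct (M1, M2) with asymmetry of at least `target`: copy M1 and,
    row by row, move a signal's referents to referents disjoint from the
    originals. Each rewritten row adds 2k disagreeing cells, and the
    ambiguity of both mappings is preserved throughout."""
    M1 = random_mapping(rng)
    M2 = M1.copy()
    for s in rng.permutation(n_signals):
        if asymmetry(M1, M2) >= target:
            break
        unused = np.flatnonzero(M1[s] == 0)
        M2[s] = 0
        M2[s, rng.choice(unused, size=k, replace=False)] = 1
    return M1, M2

# A trial at high asymmetry, which plain random generation almost never yields.
M1, M2 = targeted_pair(0.5, rng)
```

Generating a fixed number of such pairs per asymmetry bin (within the theoretical bounds from step 2) would give trials spread evenly across the chosen region, rather than clustered around the random-generation mean.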

We are currently working on steps 3 and 4, having struggled through steps 1 and 2. Analyzing the parameter space is hard, but it is necessary for doing well-founded and interpretable computational simulation research. Personally, I was quite surprised to conclude that in order to uniformly generate simulation trials we may need to supplement ‘unbiased’ random generation with ‘biased’ targeted generation.


Acknowledgements

Thanks go to Max Hinne for invaluable brainstorms about this topic. This blog post is based on research done in collaboration with Iris van Rooij, Mark Dingemanse, George Kachergis and Ivan Toni. The opinions in this blog post (and possible consequences thereof) are the responsibility of this post’s author (Mark Blokpoel) alone.

References

Frank, M. C., & Goodman, N. D. (2012). Predicting pragmatic reasoning in language games. Science, 336(6084), 998. http://dx.doi.org/10.1126/science.1218633
