An Information-theoretic Analysis of Regressions in Naturalistic Reading

Published in Cognition, 2024

Find paper here

Regressions, or backward saccades, are common during reading, accounting for between 5% and 20% of all saccades. Yet relatively little is known about what causes them. We provide an information-theoretic operationalization for two previous qualitative hypotheses about regressions, which we dub reactivation and reanalysis. We argue that these hypotheses make different predictions about the pointwise mutual information, or PMI, between a regression's source and target. Intuitively, the PMI between two words measures how much more (or less) likely one word is to be present given the other. On the one hand, the reactivation hypothesis predicts that regressions occur between words that are associated, implying high, positive values of PMI. On the other hand, the reanalysis hypothesis predicts that regressions should occur between words that are dissociated from each other, implying low, negative values of PMI. As a second theoretical contribution, we expand on previous theories by considering not only PMI but also expected values of PMI, E[PMI], where the expectation is taken over all possible realizations of the regression's target. The rationale for this is that language processing involves making inferences under uncertainty, and readers may be uncertain about what they have read, especially if a previous word was skipped. To test both theories, we use contemporary language models to estimate PMI-based statistics over word pairs in three corpora of eye tracking data in English, as well as in six languages across three language families (Indo-European, Uralic, and Turkic). Our results are consistent across languages and models tested: Positive values of PMI and E[PMI] consistently help to predict the patterns of regressions during reading, whereas negative values of PMI and E[PMI] do not.
Our information-theoretic interpretation increases the predictive scope of both theories and our studies present the first systematic crosslinguistic analysis of regressions in the literature. Our results support the reactivation hypothesis and, more broadly, they expand the number of language processing behaviors that can be linked to information-theoretic principles.
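
As a rough illustration of the two quantities named in the abstract, the sketch below computes PMI from toy probabilities and an expected PMI as a belief-weighted average over candidate targets. It is not the paper's estimation pipeline, which uses language models. The function names, the toy probabilities, the candidate words, and the choice of base-2 logarithms are all assumptions made for this example.

```python
import math


def pmi(p_joint: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information in bits: log2( p(x, y) / (p(x) * p(y)) )."""
    return math.log2(p_joint / (p_x * p_y))


def expected_pmi(belief: dict, pmi_by_target: dict) -> float:
    """Average PMI over candidate targets, weighted by a belief distribution.

    `belief` maps each candidate target word to the (hypothetical) probability
    that it is the word the reader actually saw; the weights should sum to 1.
    """
    return sum(belief[t] * pmi_by_target[t] for t in belief)


# Toy probabilities (assumed, not taken from any corpus).
associated = pmi(p_joint=0.125, p_x=0.25, p_y=0.25)      # co-occur more than chance
independent = pmi(p_joint=0.0625, p_x=0.25, p_y=0.25)    # co-occur exactly at chance
dissociated = pmi(p_joint=0.03125, p_x=0.25, p_y=0.25)   # co-occur less than chance

# A reader unsure which word was at the target position (hypothetical candidates).
belief = {"bank": 0.5, "river": 0.25, "the": 0.25}
pmi_with_source = {"bank": 1.0, "river": 2.0, "the": -1.0}
e_pmi = expected_pmi(belief, pmi_with_source)

print(associated, independent, dissociated, e_pmi)
```

Under the reactivation hypothesis, regressions would be expected between pairs like the `associated` case (positive PMI); under reanalysis, between pairs like the `dissociated` case (negative PMI).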

@article{wilcox2024informationtheoretic,
    author = {
        Ethan G. Wilcox and
        Tiago Pimentel and
        Clara Meister and
        Ryan Cotterell
    },
    journal = {Cognition},
    title = {An Information-theoretic Analysis of Regressions in Naturalistic Reading},
    year = {2024},
    url = {https://doi.org/10.31234/osf.io/3qf9a},
    pages = {},
}