How (Non-)Optimal is the Lexicon?

Tiago Pimentel*, Irene Nikkarinen*, Kyle Mahowald, Ryan Cotterell, Damián Blasi

Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) · 2021

The mapping of messages to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent messages (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world's languages. In spite of their relevance, the relative importance of these multiple factors is not well understood. Using a code-theoretic model of the lexicon and a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology, graphotactics and Zipf's law of abbreviation can sufficiently account for most of the complexity of natural code lengths.

@inproceedings{pimentel-etal-2021-nonoptimal,
    author = {
        Tiago Pimentel and
        Irene Nikkarinen and
        Kyle Mahowald and
        Ryan Cotterell and
        Damián Blasi
    },
    booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
    title = {How (Non-)Optimal is the Lexicon?},
    year = {2021},
    doi = {10.18653/v1/2021.naacl-main.350},
    url = {https://aclanthology.org/2021.naacl-main.350/},
    pages = {4426--4438},
}