How (Non-)Optimal is the Lexicon?

Published in the Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021

The mapping of messages to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent messages (Zipf’s law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world’s languages. In spite of their relevance, the relative importance of these multiple factors is not well understood. Using a code-theoretic model of the lexicon and a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon’s optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology, graphotactics and Zipf’s law of abbreviation can sufficiently account for most of the complexity of natural code lengths.

@inproceedings{pimentel-etal-2021-nonoptimal,
    title = "How (Non-)Optimal is the Lexicon?",
    author = "Pimentel, Tiago  and
      Nikkarinen, Irene  and
      Mahowald, Kyle  and
      Cotterell, Ryan  and
      Blasi, Dami\'{a}n",
    booktitle = "Proceedings of the 2021 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Virtual",
    publisher = "Association for Computational Linguistics",
}