Phonotactic Complexity and Its Trade-offs

Published in Transactions of the Association for Computational Linguistics (TACL), 2020

We present methods for calculating a measure of phonotactic complexity, bits per phoneme, that permits a straightforward cross-linguistic comparison. Given a word, represented as a sequence of phonemic segments such as symbols in the International Phonetic Alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare entropy across languages, giving insight into how complex a language’s phonotactics are. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.
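As a rough illustration of the measure described above, the sketch below estimates bits per phoneme as the average negative base-2 log-probability per symbol on held-out words. It makes several assumptions that do not come from the abstract: the model is an add-one smoothed phoneme bigram model (chosen only for brevity, not necessarily the model used in the paper), the end-of-word symbol is counted as a predicted symbol, and the function names and toy words are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical boundary markers


def train_bigram(words):
    """Count phoneme bigrams over a list of word types.

    Each word is a tuple of phoneme symbols, e.g. ("k", "a", "t").
    """
    bigrams, contexts = Counter(), Counter()
    for word in words:
        seq = [BOS, *word, EOS]
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def bits_per_phoneme(words, bigrams, contexts, vocab_size):
    """Average negative log2-probability per predicted symbol.

    vocab_size is the number of symbols the model can emit
    (phoneme inventory plus the end-of-word marker).
    Probabilities use add-one smoothing, an assumption of this sketch.
    """
    total_bits, total_symbols = 0.0, 0
    for word in words:
        seq = [BOS, *word, EOS]
        for prev, cur in zip(seq, seq[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
            total_bits -= math.log2(p)
            total_symbols += 1
    return total_bits / total_symbols


# Made-up toy data, for illustration only.
train_words = [("k", "a", "t"), ("t", "a", "k"), ("a", "t")]
heldout_words = [("k", "a")]
inventory = {"k", "a", "t"}

bigrams, contexts = train_bigram(train_words)
score = bits_per_phoneme(heldout_words, bigrams, contexts, len(inventory) + 1)
print(f"bits per phoneme: {score:.3f}")
```

Evaluating on words not seen in training matters here: scoring the training words themselves would understate the entropy, since the model has already fit them.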

@article{pimentel-etal-2020-phonotactic,
    title = "Phonotactic Complexity and Its Trade-offs",
    author = "Pimentel, Tiago  and
      Roark, Brian  and
      Cotterell, Ryan",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "8",
    year = "2020",
    url = "https://www.aclweb.org/anthology/2020.tacl-1.1",
    doi = "10.1162/tacl_a_00296",
    pages = "1--18",
}