Tokenisation is NP-Complete
Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
In this work, we prove the NP-completeness of two variants of tokenisation, defined here as the problem of compressing a dataset to at most 𝛿 symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
@inproceedings{whittington-etal-2025-tokenisation,
    author = {
        Philip Whittington and
        Gregor Bachmann and
        Tiago Pimentel 
    },
    booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    title = {Tokenisation is NP-Complete},
    year = {2025},
    url = {https://aclanthology.org/2025.acl-long.1365/},
    pages = {},
}
