Tokenisation is NP-Complete
Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
In this work, we prove the NP-completeness of two variants of tokenisation, defined here as the problem of compressing a dataset to at most 𝛿 symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
@inproceedings{whittington-etal-2025-tokenisation,
author = {
Philip Whittington and
Gregor Bachmann and
Tiago Pimentel
},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
title = {Tokenisation is NP-Complete},
year = {2025},
url = {https://aclanthology.org/2025.acl-long.1365/},
pages = {},
}
