// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'" :toc: :nofooter: :!webfonts: :source-highlighter: rouge :rouge-style: molokai :sectlinks: = Wordlist == Description The wordlist for xpin begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]) But this list is not sufficient. It contains profane, negative, or words otherwise unfit for this algorithm. Because the wordlist required for xpin is relatively small (8194), we can reduce this 53,000 word list substantially. == Processing Processing steps (source code is available in the `wordlist/` directory): * `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case + [source] ---- WORD,FREQUENCY THE,18399669358 OF,12042045526 BE,9032373066 AND,8588851162 ---- * `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order + [source] ---- WORD,LEMMATIZED_WORD,LEMMATIZER ARE,BE,SPACY ITS,IT,SPACY NEED,NEE,SPACY THOUGHT,THINK,SPACY SOMETIMES,SOMETIME,SPACY ---- * `02-custom-lemmatizations` - List of custom lemmatizations, used for any words of homonyms that the automatic lemmatization failed to capture + [source] ---- WORD1,WORD2 ADD,ADDS ADS,ADDS AFFECTED,EFFECT AFFECT,EFFECT AFFECTIONS,AFFECTION ---- * `03-exclude` - Words to include. If any word in a lemmatization group is present, the entire group is excluded from the result. Words can be excluded for any reason. + [source] ---- WORD A AARON ABA ABANDON ABANDONING ---- * `04-deduplicated-words` - The final list of words and the associated numeric value + [source] ---- WORD,NUMBER THE,1 OF,2 BEE,3 ARE,3 BE,3 ---- == Usage If you really do want to tinker with the wordlist, you just need python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spacy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] accurate `en` module). [source,sh,title='sh'] ---- $ cd ./wordlist/ # Create the environment $ python3 -m virtualenv -p "$(which python3)" venv $ . venv/bin/activate # Install the ~4GiB of data required (venv) $ pip install -r requirements.txt # Lemmatize all the words from 00-frequency-list.csv.gz (venv) $ ./01-lemmatized-words.py # Read your lemmatzied words (venv) $ zcat 01-lemmatized-words.csv.gz | less # Generate the final wordlist from all previous files (venv) $ ./04-deduplicated-words.py ---- ++++ ++++