// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:
= Wordlist

== Description

The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]).

However, this list is not sufficient on its own. It contains words that are profane, negative, or otherwise unfit for this algorithm. Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000-word list substantially.

== Processing

Processing steps (source code is available in the `wordlist/` directory):

* `00-frequency-list` - A base list of most candidate words (not necessarily including words from step 02), sorted by how desirable each word is to include, which is frequency in this case
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----

* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----

* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----

* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result. Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----

* `04-deduplicated-words` - The final list of words and the associated numeric value
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----

== Usage

If you really do want to tinker with the wordlist, you just need Python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spacy^] dependency (more specifically, link:https://spacy.io/models[`en_core_web_trf`^], the accuracy-focused `en` model).

[source,sh,title='sh']
----
$ cd ./wordlist/

# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate

# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt

# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py

# Read your lemmatized words
(venv) $ zcat 01-lemmatized-words.csv.gz | less

# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----

++++
++++
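
The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `wordlist/`: the function names `find` and `build_wordlist` are hypothetical, the grouping uses a plain union-find, and it assumes that every word in a group shares one number and that numbers are handed out in frequency order, as the `04-deduplicated-words` sample suggests.

```python
from collections import defaultdict


def find(parent, word):
    """Return the representative of word's group (hypothetical helper)."""
    parent.setdefault(word, word)
    root = word
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every word on the path directly at the root.
    while parent[word] != root:
        next_word = parent[word]
        parent[word] = root
        word = next_word
    return root


def build_wordlist(frequency_words, equivalent_pairs, excluded_words):
    """Map each kept word to its number; illustrative sketch only.

    frequency_words:  words in order of preference (step 00)
    equivalent_pairs: (word, other_word) pairs from steps 01 and 02
    excluded_words:   words from step 03
    """
    parent = {}
    for first, second in equivalent_pairs:
        parent[find(parent, first)] = find(parent, second)

    # Collect the members of every lemmatization group.
    members = defaultdict(set)
    for word in list(parent):
        members[find(parent, word)].add(word)

    # One excluded word removes its entire group.
    excluded_roots = {find(parent, word) for word in excluded_words}

    numbers = {}
    next_number = 1
    for word in frequency_words:
        root = find(parent, word)
        if word in numbers or root in excluded_roots:
            continue
        for member in members[root] | {word}:
            numbers[member] = next_number
        next_number += 1
    return numbers


if __name__ == "__main__":
    # Illustrative data only; the BEE/BE pairing is an assumption.
    result = build_wordlist(
        ["THE", "OF", "BE", "AND", "ARE", "ABANDONING"],
        [("ARE", "BE"), ("BEE", "BE"), ("ABANDONING", "ABANDON")],
        {"ABANDON"},
    )
    for word, number in sorted(result.items(), key=lambda item: item[1]):
        print(f"{word},{number}")
```

In this sketch, `ABANDONING` is dropped because it shares a group with the excluded `ABANDON`, while `BE`, `ARE`, and `BEE` all receive the same number. The order of words within a group is not specified here and may differ from the real output file.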