From d3fdd17d3cede97faaf12e05e497de4c24b49ea0 Mon Sep 17 00:00:00 2001
From: Austen Adler
Date: Thu, 2 Mar 2023 01:04:59 -0500
Subject: [PATCH] Update wordlist docs

---
 docs/WORDLIST.adoc | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/docs/WORDLIST.adoc b/docs/WORDLIST.adoc
index 7d51c0f..fef0a2b 100644
--- a/docs/WORDLIST.adoc
+++ b/docs/WORDLIST.adoc
@@ -9,12 +9,16 @@
 
 = Wordlist
 
+== Description
+
 The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])
 
 But this list is not sufficient. It contains profane, negative, or words otherwise unfit for this algorithm.
 
 Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000 word list substantially.
 
+== Processing
+
 Processing steps (source code is available in the `wordlist/` directory):
 
 * `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case
@@ -72,3 +76,27 @@ BEE,3
 ARE,3
 BE,3
 ----
+
+== Usage
+
+If you really do want to tinker with the wordlist, you just need Python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spaCy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] accurate `en` model).
+
+[source,sh,title='sh']
+----
+$ cd ./wordlist/
+
+# Create the environment
+$ python3 -m virtualenv -p "$(which python3)" venv
+$ . venv/bin/activate
+
+# Install the ~4GiB of data required
+(venv) $ pip install -r requirements.txt
+
+# Lemmatize all the words from 00-frequency-list.csv.gz
+(venv) $ ./01-lemmatized-words.py
+# Read your lemmatized words
+(venv) $ zcat 01-lemmatized-words.csv.gz | less
+
+# Generate the final wordlist from all previous files
+(venv) $ ./04-deduplicated-words.py
+----
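
Reviewer note (not part of the patch): on systems without `zcat`, the "read your lemmatized words" step can probably be done from Python instead. The following is a minimal sketch, not code from the `wordlist/` directory. The helper name `read_rows` is hypothetical, and because the patch does not show the column layout of `01-lemmatized-words.csv.gz`, the sketch returns raw CSV rows without interpreting them.

```python
import csv
import gzip
from itertools import islice


def read_rows(path, limit=10):
    """Return up to `limit` rows from a gzip-compressed CSV file.

    Hypothetical helper for illustration; the columns are not interpreted
    because their meaning is not documented in the patch.
    """
    # "rt" decompresses and decodes to text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # islice stops after `limit` rows, so a large file is not fully read.
        return list(islice(csv.reader(handle), limit))
```

Assuming the virtualenv from the usage steps is active and the file has been generated, `read_rows("01-lemmatized-words.csv.gz")` would return its first ten rows as lists of strings.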