Update wordlist docs

2023-03-02 01:04:59 -05:00 · 2023-03-02 01:04:59 -05:00 · d3fdd17d3c
commit d3fdd17d3c
parent e9fa9be7a2
1 changed files with 28 additions and 0 deletions
--- a/docs/WORDLIST.adoc
+++ b/docs/WORDLIST.adoc
@@ -9,12 +9,16 @@

 = Wordlist

+== Description
+
 The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])

 But this list is not sufficient.
 It contains words that are profane, negative, or otherwise unfit for this algorithm.
 Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000-word list substantially.

+== Processing
+
 Processing steps (source code is available in the `wordlist/` directory):

 * `00-frequency-list` - A base list covering most candidate words (not necessarily including words from step 02), sorted by priority for inclusion, which in this case is frequency
@@ -72,3 +76,27 @@ BEE,3
 ARE,3
 BE,3
 ----
+
+== Usage
+
+If you really do want to tinker with the wordlist, you need only Python and about 4 GiB of storage, almost all of it for the link:https://spacy.io/[spaCy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] model, spaCy's accuracy-oriented English pipeline).
+
+[source,sh,title='sh']
+----
+$ cd ./wordlist/
+
+# Create the environment
+$ python3 -m virtualenv -p "$(which python3)" venv
+$ . venv/bin/activate
+
+# Install the ~4GiB of data required
+(venv) $ pip install -r requirements.txt
+
+# Lemmatize all the words from 00-frequency-list.csv.gz
+(venv) $ ./01-lemmatized-words.py
+# Read your lemmatized words
+(venv) $ zcat 01-lemmatized-words.csv.gz | less
+
+# Generate the final wordlist from all previous files
+(venv) $ ./04-deduplicated-words.py
+----
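
For readers who would rather inspect an intermediate file from Python than through `zcat | less`, the following is a minimal sketch and not part of the commit. It assumes only that the `.csv.gz` files are gzip-compressed, comma-separated text, as the `BEE,3` sample rows suggest; the function name `head_gzipped_csv` is hypothetical and does not exist in the `wordlist/` scripts.

```python
import csv
import gzip
from itertools import islice


def head_gzipped_csv(path, limit=10):
    """Return the first `limit` rows of a gzip-compressed CSV as lists of strings.

    Hypothetical helper for illustration; the column layout is an assumption.
    """
    # "rt" decompresses to text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # islice stops reading after `limit` rows, so large files stay cheap.
        return list(islice(csv.reader(handle), limit))
```

Calling `head_gzipped_csv("01-lemmatized-words.csv.gz", 5)` from the `wordlist/` directory would return the first five rows, assuming that file has already been generated by the steps above.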