Update wordlist docs

Austen Adler 2023-03-02 01:04:59 -05:00
parent e9fa9be7a2
commit d3fdd17d3c


@ -9,12 +9,16 @@
= Wordlist
== Description
The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]).
But this list is not sufficient.
It contains words that are profane, negative, or otherwise unfit for this algorithm.
Because the wordlist required for this_algorithm is relatively small (8194 words), we can reduce this 53,000-word list substantially.
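
As a rough illustration of that reduction (not the project's actual pipeline, which lives in the `wordlist/` directory), a frequency-ordered list can be trimmed by skipping unfit words and stopping once enough words have been kept.
The function name `select_words`, the `blocked` set, and the upper-casing of words are all assumptions made for this sketch.

```python
def select_words(frequency_ordered, blocked, target_size):
    """Keep the first `target_size` acceptable words, most frequent first.

    `frequency_ordered` is an iterable of words sorted by descending frequency.
    `blocked` is a set of upper-case words to exclude (hypothetical input).
    """
    selected = []
    seen = set()
    for word in frequency_ordered:
        # Stop as soon as the list is large enough.
        if len(selected) >= target_size:
            break
        key = word.upper()
        # Skip unfit words and words that were already kept.
        if key in blocked or key in seen:
            continue
        seen.add(key)
        selected.append(key)
    return selected
```

Because the source list is much larger than the target size, a filter like this can afford to be strict and still fill every slot.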
== Processing
Processing steps (source code is available in the `wordlist/` directory):
* `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case
@ -72,3 +76,27 @@ BEE,3
ARE,3
BE,3
----
== Usage
If you really do want to tinker with the wordlist, you just need Python and about 4GiB of storage.
The storage is needed almost exclusively for the link:https://spacy.io/[spaCy^] dependency (more specifically, its accuracy-oriented English model, link:https://spacy.io/models[`en_core_web_trf`^]).
[source,sh,title='sh']
----
$ cd ./wordlist/
# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate
# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt
# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py
# Read your lemmatized words
(venv) $ zcat 01-lemmatized-words.csv.gz | less
# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----
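
If `zcat` is not available, the intermediate files can also be inspected from Python using only the standard library.
This is a minimal sketch: the helper name `read_rows` is hypothetical, and it assumes nothing about the columns beyond the file being a gzip-compressed CSV.

```python
import csv
import gzip
import itertools
from pathlib import Path


def read_rows(path, limit=20):
    """Return the first `limit` rows of a gzip-compressed CSV file."""
    # "rt" decompresses to text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(itertools.islice(csv.reader(handle), limit))


if __name__ == "__main__":
    # Run from inside ./wordlist/ after the lemmatization step has finished.
    path = Path("01-lemmatized-words.csv.gz")
    if path.exists():
        for row in read_rows(path):
            print(row)
```

Reading only the first few rows keeps the check fast even when the file is large.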