Update wordlist docs
parent e9fa9be7a2
commit d3fdd17d3c
@@ -9,12 +9,16 @@
 
 = Wordlist
 
+== Description
+
 The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])
 
 But this list is not sufficient.
 It contains words that are profane, negative, or otherwise unfit for this algorithm.
 Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000-word list substantially.
 
+== Processing
+
 Processing steps (source code is available in the `wordlist/` directory):
 
 * `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case
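The reduction described above (start from a frequency-ordered list, drop unfit words, keep only as many as the algorithm needs) can be sketched as follows. This is a minimal illustration and not the code in the `wordlist/` directory: the function name `trim_wordlist`, the `blocklist` parameter, and the upper-casing step are all assumptions made for the example.

```python
from typing import Iterable


def trim_wordlist(
    ranked_words: Iterable[str],
    blocklist: Iterable[str],
    target_size: int,
) -> list[str]:
    """Return the first `target_size` acceptable words, keeping input order.

    `ranked_words` is assumed to be sorted by desire to include
    (most frequent first). Names and normalization are hypothetical.
    """
    if target_size <= 0:
        return []

    # Normalize the blocklist once so lookups are case-insensitive.
    blocked = {word.strip().upper() for word in blocklist}

    kept: list[str] = []
    seen: set[str] = set()
    for word in ranked_words:
        normalized = word.strip().upper()
        # Skip empty entries, repeats, and anything on the blocklist.
        if not normalized or normalized in seen or normalized in blocked:
            continue
        seen.add(normalized)
        kept.append(normalized)
        if len(kept) == target_size:
            break
    return kept


if __name__ == "__main__":
    ranked = ["the", "be", "damn", "are", "The", "bee"]
    print(trim_wordlist(ranked, {"damn"}, 3))
```

Because the input is already ordered by preference, stopping at the first `target_size` survivors keeps the most desirable words without a second sorting pass.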
@@ -72,3 +76,27 @@ BEE,3
 ARE,3
 BE,3
 ----
+
+== Usage
+
+If you really do want to tinker with the wordlist, you just need Python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spaCy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] accurate `en` model).
+
+[source,sh,title='sh']
+----
+$ cd ./wordlist/
+
+# Create the environment
+$ python3 -m virtualenv -p "$(which python3)" venv
+$ . venv/bin/activate
+
+# Install the ~4GiB of data required
+(venv) $ pip install -r requirements.txt
+
+# Lemmatize all the words from 00-frequency-list.csv.gz
+(venv) $ ./01-lemmatized-words.py
+# Read your lemmatized words
+(venv) $ zcat 01-lemmatized-words.csv.gz | less
+
+# Generate the final wordlist from all previous files
+(venv) $ ./04-deduplicated-words.py
+----
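If you would rather inspect an intermediate file from Python than through `zcat`, the standard library is enough. The sketch below assumes only that the `.csv.gz` files are gzip-compressed, UTF-8 encoded CSV. The column layout is not documented here, so rows are returned as plain lists of strings, and the helper names `read_csv_gz` and `head` are hypothetical.

```python
import csv
import gzip
import sys
from itertools import islice
from typing import Iterator


def read_csv_gz(path: str) -> Iterator[list[str]]:
    """Yield each row of a gzip-compressed CSV file as a list of strings."""
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle)


def head(path: str, count: int = 10) -> list[list[str]]:
    """Return at most the first `count` rows without reading the whole file."""
    return list(islice(read_csv_gz(path), count))


if __name__ == "__main__":
    # Example: python3 peek.py 01-lemmatized-words.csv.gz
    if len(sys.argv) > 1:
        for row in head(sys.argv[1]):
            print(row)
```

Reading lazily matters little for files of this size, but it keeps the helper usable on larger inputs as well.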