this_algorithm/docs/WORDLIST.adoc

// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"

:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:

= Wordlist

== Description

The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])

But this list is not sufficient.
It contains profane, negative, or words otherwise unfit for this algorithm.
Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000 word list substantially.

== Processing

Processing steps (source code is available in the `wordlist/` directory):

* `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words of homonyms that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to include. If any word in a lemmatization group is present, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and the associated numeric value
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----

== Usage

If you really do want to tinker with the wordlist, you just need python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spacy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] accurate `en` module).

[source,sh,title='sh']
----
$ cd ./wordlist/

# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate

# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt

# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py
# Read your lemmatzied words
(venv) $ zcat 01-lemmatized-words.csv.gz | less

# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----
Add wordlist document 2023-03-02 00:58:31 -05:00			`// echo WORDLIST.adoc \| entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"`

			`:toc:`
			`:nofooter:`
			`:!webfonts:`
			`:source-highlighter: rouge`
			`:rouge-style: molokai`
			`:sectlinks:`

			`= Wordlist`

Update wordlist docs 2023-03-02 01:04:59 -05:00			`== Description`

Add wordlist document 2023-03-02 00:58:31 -05:00			`The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])`

			`But this list is not sufficient.`
			`It contains profane, negative, or words otherwise unfit for this algorithm.`
			`Because the wordlist required for this_algorithm is relatively small (8194), we can reduce this 53,000 word list substantially.`

Update wordlist docs 2023-03-02 01:04:59 -05:00			`== Processing`

Add wordlist document 2023-03-02 00:58:31 -05:00			Processing steps (source code is available in the `wordlist/` directory):

			* `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by desire to include, which is frequency in this case
			`+`
			`[source]`
			`----`
			`WORD,FREQUENCY`
			`THE,18399669358`
			`OF,12042045526`
			`BE,9032373066`
			`AND,8588851162`
			`----`
			* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
			`+`
			`[source]`
			`----`
			`WORD,LEMMATIZED_WORD,LEMMATIZER`
			`ARE,BE,SPACY`
			`ITS,IT,SPACY`
			`NEED,NEE,SPACY`
			`THOUGHT,THINK,SPACY`
			`SOMETIMES,SOMETIME,SPACY`
			`----`
			* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words of homonyms that the automatic lemmatization failed to capture
			`+`
			`[source]`
			`----`
			`WORD1,WORD2`
			`ADD,ADDS`
			`ADS,ADDS`
			`AFFECTED,EFFECT`
			`AFFECT,EFFECT`
			`AFFECTIONS,AFFECTION`
			`----`
			* `03-exclude` - Words to include. If any word in a lemmatization group is present, the entire group is excluded from the result.
			`Words can be excluded for any reason.`
			`+`
			`[source]`
			`----`
			`WORD`
			`A`
			`AARON`
			`ABA`
			`ABANDON`
			`ABANDONING`
			`----`
			* `04-deduplicated-words` - The final list of words and the associated numeric value
			`+`
			`[source]`
			`----`
			`WORD,NUMBER`
			`THE,1`
			`OF,2`
			`BEE,3`
			`ARE,3`
			`BE,3`
			`----`
Update wordlist docs 2023-03-02 01:04:59 -05:00
			`== Usage`

			If you really do want to tinker with the wordlist, you just need python and 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spacy^] dependency (more specifically, the link:https://spacy.io/models[`en_core_web_trf`^] accurate `en` module).

			`[source,sh,title='sh']`
			`----`
			`$ cd ./wordlist/`

			`# Create the environment`
			`$ python3 -m virtualenv -p "$(which python3)" venv`
			`$ . venv/bin/activate`

			`# Install the ~4GiB of data required`
			`(venv) $ pip install -r requirements.txt`

			`# Lemmatize all the words from 00-frequency-list.csv.gz`
			`(venv) $ ./01-lemmatized-words.py`
			`# Read your lemmatzied words`
			`(venv) $ zcat 01-lemmatized-words.csv.gz \| less`

			`# Generate the final wordlist from all previous files`
			`(venv) $ ./04-deduplicated-words.py`
			`----`