
// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"

:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:

= Wordlist

== Description

The wordlist for xpin begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]).

On its own, this list is not sufficient.
It contains profane or negative words, as well as words otherwise unfit for this algorithm.
Because the wordlist required for xpin is relatively small (8194 words), we can reduce this 53,000-word list substantially.

== Processing

Processing steps (source code is available in the `wordlist/` directory):

* `00-frequency-list` - A base list of candidate words (not necessarily including words from step 02), sorted by how desirable each word is to include, which is frequency in this case
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and the associated numeric value
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----
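
The steps above can be sketched in Python as follows.
This is a minimal illustration, not the project's `04-deduplicated-words.py`: the function name `build_wordlist`, its arguments, and the rule that a group takes its number from its first appearance in frequency order are all assumptions made for the example.

[source,python]
----
def build_wordlist(frequency_words, lemma_pairs, excluded_words):
    """Group related words, drop excluded groups, and number the rest.

    Hypothetical helper: it mirrors the documented steps, not the real scripts.
    """
    parent = {}

    def find(word):
        # Follow parent links to the representative word of the group.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    # Steps 01 and 02: each pair merges two words into one group.
    for left, right in lemma_pairs:
        parent[find(left)] = find(right)

    # Step 03: one excluded word removes its entire group.
    excluded_roots = {find(word) for word in excluded_words}

    # Step 00 order first, then words that only appear in the pairs.
    ordered = list(frequency_words) + [w for pair in lemma_pairs for w in pair]

    # Step 04: every word in a group shares the group's number.
    group_numbers = {}
    result = {}
    for word in ordered:
        root = find(word)
        if root in excluded_roots or word in result:
            continue
        if root not in group_numbers:
            group_numbers[root] = len(group_numbers) + 1
        result[word] = group_numbers[root]
    return result


numbers = build_wordlist(
    frequency_words=["THE", "OF", "BE", "AND"],
    lemma_pairs=[("ARE", "BE"), ("BEE", "BE")],
    excluded_words={"AND"},
)
print(numbers)  # {'THE': 1, 'OF': 2, 'BE': 3, 'ARE': 3, 'BEE': 3}
----

In this sketch, `BEE` never appears in the frequency list but still receives number 3 because a pair ties it to `BE`, while `AND` is dropped because it is in the exclusion set.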

== Usage

If you really do want to tinker with the wordlist, you just need Python and about 4GiB of storage, almost exclusively due to the link:https://spacy.io/[spaCy^] dependency (more specifically, the accuracy-focused English model, link:https://spacy.io/models[`en_core_web_trf`^]).

[source,sh,title='sh']
----
$ cd ./wordlist/

# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate

# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt

# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py
# Read your lemmatized words
(venv) $ zcat 01-lemmatized-words.csv.gz | less

# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----
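
To inspect the lemmatization output from Python rather than `less`, a short sketch like the one below may help.
It assumes the compressed file starts with the `WORD,LEMMATIZED_WORD,LEMMATIZER` header row shown earlier; the function name `words_by_lemma` is hypothetical and not part of the project.

[source,python]
----
import csv
import gzip
from collections import defaultdict


def words_by_lemma(path):
    """Map each lemma to the list of words that lemmatize to it."""
    groups = defaultdict(list)
    # Mode "rt" decompresses to text so the csv module can parse it directly.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        # Assumes a header row naming the WORD and LEMMATIZED_WORD columns.
        for row in csv.DictReader(handle):
            groups[row["LEMMATIZED_WORD"]].append(row["WORD"])
    return dict(groups)
----

Calling `words_by_lemma("01-lemmatized-words.csv.gz")` from the `wordlist/` directory would then return a dictionary keyed by lemma, which makes odd lemmatizations that need an entry in `02-custom-lemmatizations` easier to spot.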

++++
<style>
#header, #content, #footnotes, #footer {
    max-width: unset !important;
}
.hll {
    background-color: #ff0;
}
</style>
++++