this_algorithm/docs/WORDLIST.adoc
2023-03-02 18:55:40 -05:00

// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:
= Wordlist
== Description
The wordlist for xpin starts from the frequency list in link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]).
That list cannot be used as-is.
It contains words that are profane, negative, or otherwise unfit for this algorithm.
Because the wordlist required for xpin is relatively small (8194 words), this 53,000-word list can be reduced substantially.
== Processing
Processing steps (source code is available in the `wordlist/` directory):
* `00-frequency-list` - A base list of most candidate words (not necessarily including words from step 02), sorted by how desirable each word is to include, which in this case means by frequency
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - Words paired with their lemma, so that both represent the same underlying value; rows may appear in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and the associated numeric value
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----
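
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `wordlist/`: the function name `build_wordlist` and its parameters are hypothetical, the grouping uses a simple union-find, and it assumes that groups are numbered in the order their first member appears in the frequency list.

```python
from collections import defaultdict


def build_wordlist(frequency_words, lemma_pairs, excluded_words, size):
    """Return (word, number) rows; every word in a group shares one number.

    frequency_words: words sorted from most to least desirable (step 00).
    lemma_pairs:     (word, other_word) pairs from steps 01 and 02.
    excluded_words:  words from step 03; one hit excludes the whole group.
    size:            maximum number of groups to keep.
    """
    parent = {}

    def find(word):
        # Follow parent links until reaching the representative of the group.
        parent.setdefault(word, word)
        while parent[word] != word:
            word = parent[word]
        return word

    # Merge every pair of words that represent the same underlying value.
    for word_a, word_b in lemma_pairs:
        root_a, root_b = find(word_a), find(word_b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Register words that never appear in a lemmatization pair.
    for word in frequency_words:
        find(word)

    excluded_roots = {find(word) for word in excluded_words}

    # Collect the members of each group under its representative.
    groups = defaultdict(list)
    for word in list(parent):
        groups[find(word)].append(word)

    rows = []
    kept_roots = set()
    for word in frequency_words:
        root = find(word)
        if root in excluded_roots or root in kept_roots:
            continue
        if len(kept_roots) == size:
            break
        kept_roots.add(root)
        number = len(kept_roots)
        rows.extend((member, number) for member in sorted(groups[root]))
    return rows


frequency = ["THE", "OF", "BE", "AND", "ARE", "A", "ABANDON"]
pairs = [("ARE", "BE"), ("BEE", "BE"), ("ABANDONING", "ABANDON")]
excluded = {"A", "ABANDONING"}
for word, number in build_wordlist(frequency, pairs, excluded, size=3):
    print(f"{word},{number}")
```

In this sketch `ABANDON` is dropped even though only `ABANDONING` is listed for exclusion, because the two words belong to the same lemmatization group. `BEE` receives a number despite being absent from the frequency list, because a custom pair ties it to `BE`.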
== Usage
If you really do want to tinker with the wordlist, you just need Python and 4GiB of storage. Almost all of that storage goes to the link:https://spacy.io/[spaCy^] dependency (more specifically, the accurate English model, link:https://spacy.io/models[`en_core_web_trf`^]).
[source,sh,title='sh']
----
$ cd ./wordlist/
# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate
# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt
# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py
# Read your lemmatized words
(venv) $ zcat 01-lemmatized-words.csv.gz | less
# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----
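
To inspect the result from Python rather than with `zcat`, a small loader like the one below may help. It is a sketch under assumptions: the helper name `load_wordlist` is hypothetical, and it assumes the final step writes a gzip-compressed CSV with the `WORD,NUMBER` header shown earlier, in the same way step 01 writes `01-lemmatized-words.csv.gz`.

```python
import csv
import gzip


def load_wordlist(path):
    """Map each WORD to its NUMBER from a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # DictReader uses the header row (WORD,NUMBER) for the field names.
        return {row["WORD"]: int(row["NUMBER"]) for row in csv.DictReader(handle)}
```

Calling `load_wordlist` on the generated file (its exact name is not confirmed here) would return a dictionary in which all words of one lemmatization group map to the same number.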
++++
<style>
#header, #content, #footnotes, #footer {
max-width: unset !important;
}
.hll {
background-color: #ff0;
}
</style>
++++