// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"

include::common.adoc.template[]
:toc:
:stem:
= Wordlist

== Description

The wordlist for xpin begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])

But this list is not sufficient on its own.
It contains profane or negative words, as well as words otherwise unfit for this algorithm.
Because the wordlist required for xpin is relatively small (8194 words), we can reduce this 53,000-word list substantially.

== Processing

Processing steps (source code is available in the `wordlist/` directory):

* `00-frequency-list` - A base list of as many candidate words as possible (not necessarily including words from step 02), sorted by desire to include, which in this case means frequency
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and the associated numeric value
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----
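
The grouping and numbering behavior of the steps above can be sketched in plain Python. This is a minimal illustration of the described semantics, not the actual `04-deduplicated-words.py`: lemmatization pairs merge words into groups (union-find), any group containing an excluded word is dropped whole, and surviving groups are numbered in frequency order so every member of a group shares one number.

[source,python]
----
def build_wordlist(freq_words, lemma_pairs, excluded):
    """Sketch of the deduplication semantics described above.

    freq_words:  words sorted most-frequent first
    lemma_pairs: (word, lemma) pairs representing the same underlying value
    excluded:    words whose entire lemmatization group must be dropped
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    def union(a, b):
        parent[find(a)] = find(b)

    for word, lemma in lemma_pairs:
        union(word, lemma)

    # If any word in a group is excluded, the whole group is excluded.
    excluded_roots = {find(word) for word in excluded}

    numbers = {}  # group root -> assigned number
    result = []
    for word in freq_words:  # frequency order decides the numbering
        root = find(word)
        if root in excluded_roots:
            continue
        if root not in numbers:
            numbers[root] = len(numbers) + 1
        result.append((word, numbers[root]))
    return result
----

For example, `build_wordlist(["THE", "OF", "ARE", "BE", "AARON"], [("ARE", "BE")], {"AARON"})` yields `[("THE", 1), ("OF", 2), ("ARE", 3), ("BE", 3)]`: `ARE` and `BE` share a number, and `AARON` is dropped.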

== Usage

If you really do want to tinker with the wordlist, you just need Python and about 4 GiB of storage, almost exclusively due to the link:https://spacy.io/[spacy^] dependency (more specifically, its link:https://spacy.io/models[`en_core_web_trf`^] accurate English model).

[source,sh,title='sh']
----
$ cd ./wordlist/

# Create the environment
$ python3 -m virtualenv -p "$(which python3)" venv
$ . venv/bin/activate

# Install the ~4GiB of data required
(venv) $ pip install -r requirements.txt

# Lemmatize all the words from 00-frequency-list.csv.gz
(venv) $ ./01-lemmatized-words.py
# Read your lemmatized words
(venv) $ zcat 01-lemmatized-words.csv.gz | less

# Generate the final wordlist from all previous files
(venv) $ ./04-deduplicated-words.py
----
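
The intermediate stages are all gzipped CSV files, so they can also be inspected from Python instead of `zcat`. This is a hypothetical helper (not part of the repository); the file name and header columns are taken from the examples above.

[source,python]
----
import csv
import gzip


def read_stage(path):
    """Read one gzipped CSV stage file into a list of dicts keyed by header."""
    with gzip.open(path, "rt", newline="") as fh:
        return list(csv.DictReader(fh))
----

For example, `read_stage("01-lemmatized-words.csv.gz")` would yield rows like `{"WORD": "ARE", "LEMMATIZED_WORD": "BE", "LEMMATIZER": "SPACY"}`.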

++++
<style>
#header, #content, #footnotes, #footer {
  max-width: unset !important;
}
.hll {
  background-color: #ff0;
}
</style>
++++