Add wordlist document

Austen Adler 2023-03-02 00:58:31 -05:00
parent d8e39fa626
commit e9fa9be7a2
2 changed files with 83 additions and 8 deletions


@@ -539,19 +539,20 @@ image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods a
==== Wordlist generation
The final wordlist can be generated by running the scripts in `./wordlist/`, which bring together the full list of words, the nltk-lemmatized words, and the manually annotated words into one list.
See link:WORDLIST.html[WORDLIST] for more information.
The output is of the format:
[source]
----
WORD,NUMBER
THE,1
OF,2
HIM,3
HYMN,3
APPLE,4
APPLES,4
----
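In this format, homonyms and lemmatization groups share a single number (HIM/HYMN both map to 3, APPLE/APPLES to 4). A minimal sketch of reading the format into a number-to-words table; the inline sample data stands in for the generated file, which is not named here:

```python
import csv
import io
from collections import defaultdict

# Sample rows in the WORD,NUMBER format shown above.
data = "THE,1\nOF,2\nHIM,3\nHYMN,3\nAPPLE,4\nAPPLES,4\n"

# Map each number to every word that shares it: homonyms and
# lemmatization groups collapse to one value.
by_number = defaultdict(list)
for word, number in csv.reader(io.StringIO(data)):
    by_number[int(number)].append(word)

print(by_number[3])  # ['HIM', 'HYMN']
```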
=== Implementation [[implementation]]

docs/WORDLIST.adoc (new file, 74 lines)

@@ -0,0 +1,74 @@
// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:
= Wordlist
The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])
This list alone is not sufficient: it contains profane, negative, and other words unfit for this algorithm.
Because the wordlist required for this_algorithm is relatively small (8194 words), we can reduce this 53,000-word list substantially.
Processing steps (source code is available in the `wordlist/` directory):
* `00-frequency-list` - A base list of candidate words (not necessarily including every word from step 02), sorted by preference for inclusion, which in this case means by frequency
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - List of automatically lemmatized words, each of which should represent the same underlying value as its lemma, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for homonyms or related word pairs that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to exclude. If any word of a lemmatization group is present in this list, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and their associated numeric values
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----
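The four steps above can be sketched as a single pass: lemmatization links (steps 01 and 02) merge words into groups, exclusions (step 03) drop whole groups, and a frequency-ordered walk assigns each surviving group its number (step 04). This is a hedged sketch under assumed data shapes; the actual scripts in `wordlist/` may differ.

```python
def build_wordlist(frequency, lemma_pairs, exclusions, limit=8194):
    """frequency: words, most frequent first (step 00);
    lemma_pairs: (word, lemma) links (steps 01 and 02);
    exclusions: excluded words (step 03)."""
    # Union-find so chains of lemmatizations collapse into one group.
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    for word, lemma in lemma_pairs:
        union(word, lemma)

    # If any member of a group is excluded, the whole group goes (step 03).
    excluded_roots = {find(w) for w in exclusions}

    # Walk by frequency: the first surviving member of a group claims the
    # next number; later members reuse it (step 04).
    numbers, rows = {}, []
    next_number = 1
    for word in frequency:
        root = find(word)
        if root in excluded_roots:
            continue
        if root not in numbers:
            if next_number > limit:
                continue
            numbers[root] = next_number
            next_number += 1
        rows.append((word, numbers[root]))
    return rows


rows = build_wordlist(
    ["THE", "OF", "HIM", "HYMN", "APPLE", "APPLES"],
    [("HYMN", "HIM"), ("APPLES", "APPLE")],
    set(),
)
# Reproduces the sample output above: HIM/HYMN share 3, APPLE/APPLES share 4.
```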