Add wordlist document

Austen Adler 2023-03-02 00:58:31 -05:00
parent d8e39fa626
commit e9fa9be7a2
2 changed files with 83 additions and 8 deletions


@@ -539,19 +539,20 @@ image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods a
==== Wordlist generation
The final wordlist can be generated by running the scripts in `./wordlist/`, which bring together the full list of words, the nltk-lemmatized words, and the manually annotated words into one list.
See link:WORDLIST.html[WORDLIST] for more information.
The output is of the format:
[source]
----
WORD,NUMBER
THE,1
OF,2
HIM,3
HYMN,3
APPLE,4
APPLES,4
----
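In this format, homonyms and lemmatization groups share a single number (HIM/HYMN both map to 3, APPLE/APPLES to 4). A minimal sketch of reading the format into a number-to-words table; the inline sample data stands in for the generated file, which is not named here:

```python
import csv
import io
from collections import defaultdict

# Sample rows in the WORD,NUMBER format shown above.
data = "THE,1\nOF,2\nHIM,3\nHYMN,3\nAPPLE,4\nAPPLES,4\n"

# Map each number to every word that shares it: homonyms and
# lemmatization groups collapse to one value.
by_number = defaultdict(list)
for word, number in csv.reader(io.StringIO(data)):
    by_number[int(number)].append(word)

print(by_number[3])  # ['HIM', 'HYMN']
```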
=== Implementation [[implementation]]

docs/WORDLIST.adoc (new file, 74 lines)

@@ -0,0 +1,74 @@
// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
:toc:
:nofooter:
:!webfonts:
:source-highlighter: rouge
:rouge-style: molokai
:sectlinks:
= Wordlist
The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])
This list alone is not sufficient: it contains profane, negative, and other words unfit for this algorithm.
Because the wordlist required for this_algorithm is relatively small (8194 words), we can reduce this 53,000-word list substantially.
Processing steps (source code is available in the `wordlist/` directory):
* `00-frequency-list` - A base list of candidate words (not necessarily including every word from step 02), sorted by preference for inclusion, which in this case means by frequency
+
[source]
----
WORD,FREQUENCY
THE,18399669358
OF,12042045526
BE,9032373066
AND,8588851162
----
* `01-lemmatized-words` - List of automatically lemmatized words, each of which should represent the same underlying value as its lemma, in any order
+
[source]
----
WORD,LEMMATIZED_WORD,LEMMATIZER
ARE,BE,SPACY
ITS,IT,SPACY
NEED,NEE,SPACY
THOUGHT,THINK,SPACY
SOMETIMES,SOMETIME,SPACY
----
* `02-custom-lemmatizations` - List of custom lemmatizations, used for homonyms or related word pairs that the automatic lemmatization failed to capture
+
[source]
----
WORD1,WORD2
ADD,ADDS
ADS,ADDS
AFFECTED,EFFECT
AFFECT,EFFECT
AFFECTIONS,AFFECTION
----
* `03-exclude` - Words to exclude. If any word of a lemmatization group is present in this list, the entire group is excluded from the result.
Words can be excluded for any reason.
+
[source]
----
WORD
A
AARON
ABA
ABANDON
ABANDONING
----
* `04-deduplicated-words` - The final list of words and their associated numeric values
+
[source]
----
WORD,NUMBER
THE,1
OF,2
BEE,3
ARE,3
BE,3
----
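The four steps above can be sketched as a single pass: lemmatization links (steps 01 and 02) merge words into groups, exclusions (step 03) drop whole groups, and a frequency-ordered walk assigns each surviving group its number (step 04). This is a hedged sketch under assumed data shapes; the actual scripts in `wordlist/` may differ.

```python
def build_wordlist(frequency, lemma_pairs, exclusions, limit=8194):
    """frequency: words, most frequent first (step 00);
    lemma_pairs: (word, lemma) links (steps 01 and 02);
    exclusions: excluded words (step 03)."""
    # Union-find so chains of lemmatizations collapse into one group.
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    for word, lemma in lemma_pairs:
        union(word, lemma)

    # If any member of a group is excluded, the whole group goes (step 03).
    excluded_roots = {find(w) for w in exclusions}

    # Walk by frequency: the first surviving member of a group claims the
    # next number; later members reuse it (step 04).
    numbers, rows = {}, []
    next_number = 1
    for word in frequency:
        root = find(word)
        if root in excluded_roots:
            continue
        if root not in numbers:
            if next_number > limit:
                continue
            numbers[root] = next_number
            next_number += 1
        rows.append((word, numbers[root]))
    return rows


rows = build_wordlist(
    ["THE", "OF", "HIM", "HYMN", "APPLE", "APPLES"],
    [("HYMN", "HIM"), ("APPLES", "APPLE")],
    set(),
)
# Reproduces the sample output above: HIM/HYMN share 3, APPLE/APPLES share 4.
```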