Update docs

2023-02-16 20:51:16 -05:00
commit e2778d281c
parent 97f2cc5d05
3 changed files with 49 additions and 5 deletions
--- a/docs/DESIGN.adoc
+++ b/docs/DESIGN.adoc
@@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does

 Considerations when designing a wordlist:

-. Word complexity
-. Plural vs singular
-. Homonyms
-. Different language
-. Repetition
+. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
+. Plural and singular confusion (`take`/`takes`)
+. Homonyms (`him`/`hymn`)
+. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
+. Bad words/negative words (`DEATH` should probably be excluded)
+. Different languages (if the algorithm could ever be made non-English)
+
+this_algorithm has a relatively small wordlist of length ~8192, so it is feasible to map all plural/singular and homonym words to the same value to prevent confusion.
+
+NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
+
+==== Wordlist
+
+this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
+Only the first few thousand words from `frequency-all.txt.gz` were used.
+
+==== Lemmatization
+
+Lemmatization is the process of reducing words into a common form.
+This includes mapping plural words to their singular forms (`apples` to `apple`).
+
+The link:https://www.nltk.org/[nltk^] Python package and link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
+
+==== Annotated Wordlist
+
+Complementary to automated lemmatization is manual lemmatization and word removal.
+See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.
+
+For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
+Not pictured is the last column which allows custom mappings:
+
+image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]
+
+==== Wordlist generation
+
+The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk-lemmatized words, and manually annotated words into one list.
+
+The output is of the format:
+
+[source]
+----
+word,number
+the,1
+of,2
+him,3
+hymn,3
+apple,4
+apples,4
+----

 === Implementation [[implementation]]

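The generation step described in the "Wordlist generation" section could look roughly like the sketch below. It is not the notebook's actual code: `build_wordlist`, the `canonical` callback, and the `MANUAL` mapping are hypothetical names standing in for the nltk and spreadsheet mappings, and the frequency-ordered input is a toy list.

```python
from typing import Callable, Iterable, List, Tuple


def build_wordlist(
    words_by_frequency: Iterable[str],
    canonical: Callable[[str], str],
    excluded: frozenset = frozenset(),
) -> List[Tuple[str, int]]:
    """Assign a number to each word; words sharing a canonical form share a number."""
    numbers = {}   # canonical form -> assigned number
    seen = set()   # words already emitted, to skip duplicates in the input
    rows = []
    for word in words_by_frequency:
        word = word.lower()
        if word in excluded or word in seen:
            continue
        seen.add(word)
        key = canonical(word)
        if key not in numbers:
            # Numbers are handed out in order of first appearance, starting at 1.
            numbers[key] = len(numbers) + 1
        rows.append((word, numbers[key]))
    return rows


# Hypothetical stand-in for the combined automatic and manual mappings.
MANUAL = {"hymn": "him", "apples": "apple"}

rows = build_wordlist(
    ["the", "of", "him", "death", "hymn", "apple", "apples"],
    canonical=lambda w: MANUAL.get(w, w),
    excluded=frozenset({"death"}),
)

print("word,number")
for word, number in rows:
    print(f"{word},{number}")
```

In this sketch an excluded word never consumes a number, so the numbering stays dense.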
--- a/docs/annotated_wordlist_example.png
+++ b/docs/annotated_wordlist_example.png
--- a/docs/annotated_words.ods
+++ b/docs/annotated_words.ods
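
As a reading aid for the `word,number` output format added to `docs/DESIGN.adoc`, here is a minimal sketch of how a consumer might load it. The function name `load_wordlist` is hypothetical, and picking the first-listed word as the representative for each number is an assumption, not something the commit specifies.

```python
import csv
import io

# Sample copied from the output format shown in docs/DESIGN.adoc.
SAMPLE = """word,number
the,1
of,2
him,3
hymn,3
apple,4
apples,4
"""


def load_wordlist(text):
    """Return (word -> number, number -> representative word) lookups."""
    word_to_number = {}
    number_to_word = {}
    for row in csv.DictReader(io.StringIO(text)):
        number = int(row["number"])
        word_to_number[row["word"]] = number
        # Assumption: the first word listed for a number is the one used for output.
        number_to_word.setdefault(number, row["word"])
    return word_to_number, number_to_word


word_to_number, number_to_word = load_wordlist(SAMPLE)

# Homonyms and singular/plural pairs resolve to the same value.
print(word_to_number["him"], word_to_number["hymn"])
print(number_to_word[word_to_number["apples"]])
```

Because several words can share one number, the mapping from words to numbers is many-to-one, while the reverse lookup needs a single representative per number.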