Update docs

2023-02-16 20:51:16 -05:00 · 2023-02-16 20:51:16 -05:00 · e2778d281c
commit e2778d281c
parent 97f2cc5d05
3 changed files with 49 additions and 5 deletions
--- a/docs/DESIGN.adoc
+++ b/docs/DESIGN.adoc
@@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does
 Considerations when designing a wordlist:
-. Word complexity
+. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
-. Plural vs singular
+. Plural and singular confusion (`take`/`takes`)
-. Homonyms
+. Homonyms (`him`/`hymn`)
-. Different language
+. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
-. Repetition
+. Bad words/negative words (`DEATH` should probably be excluded)
 . Different languages (if the algorithm could ever be made non-English)
 this_algorithm has a relatively small wordlist of length ~8192, so it is feasible to map all plural/singular and homonym words to the same value, to prevent confusion.
 NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
 ==== Wordlist
 this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
 Only the first few thousand words from `frequency-all.txt.gz` were used.
 ==== Lemmatization
 Lemmatization is the process of reducing words into a common form.
 This includes mapping plural words to their singular forms.
 The link:https://www.nltk.org/[nltk^] Python package and link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
 ==== Annotated Wordlist
 Complementary to automated lemmatization is manual lemmatization and word removal.
 See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.
 For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
 Not pictured is the last column which allows custom mappings:
 image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]
 ==== Wordlist generation
 The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk lemmatized words, and manually annotated words into one list.
 The output is of the format:
 [source]
 ----
 word,number
 the,1
 of,2
 him,3
 hymn,3
 apple,4
 apples,4
 ----
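As a rough illustration of how such an output could be produced, the sketch below assigns one number per group of equivalent words. It is not the notebook's actual code: the names `build_wordlist` and `to_csv`, the representation of groups as lists, and the sample inputs are all assumptions made for this example.

```python
def build_wordlist(frequency_ordered, groups, excluded):
    """Return (word, number) rows where equivalent words share a number.

    frequency_ordered: words sorted from most to least frequent (assumed input).
    groups: lists of words that should map to the same value, e.g.
            homonyms or singular/plural pairs (hypothetical representation).
    excluded: words to drop entirely, e.g. negative words.
    """
    # Point every grouped word at the first word of its group.
    canonical = {}
    for group in groups:
        for word in group:
            canonical[word] = group[0]

    numbers = {}  # canonical word -> assigned number
    rows = []
    for word in frequency_ordered:
        if word in excluded:
            continue
        representative = canonical.get(word, word)
        if representative not in numbers:
            # Numbers start at 1 and follow frequency order.
            numbers[representative] = len(numbers) + 1
        rows.append((word, numbers[representative]))

    # sorted() is stable, so words sharing a number keep frequency order.
    return sorted(rows, key=lambda row: row[1])


def to_csv(rows):
    """Render rows in the `word,number` format shown above."""
    lines = ["word,number"]
    lines.extend(f"{word},{number}" for word, number in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    rows = build_wordlist(
        ["the", "of", "him", "death", "apple", "apples", "hymn"],
        groups=[["him", "hymn"], ["apple", "apples"]],
        excluded={"death"},
    )
    print(to_csv(rows))
```

With these sample inputs, `hymn` receives the same number as `him` even though it appears later in the frequency list, and `death` is dropped before numbering so it leaves no gap.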
 === Implementation [[implementation]]
--- a/docs/annotated_wordlist_example.png
+++ b/docs/annotated_wordlist_example.png
--- a/docs/annotated_words.ods
+++ b/docs/annotated_words.ods