diff --git a/docs/DESIGN.adoc b/docs/DESIGN.adoc
index 6afbbc5..740979d 100644
--- a/docs/DESIGN.adoc
+++ b/docs/DESIGN.adoc
@@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does
 
 Considerations when designing a wordlist:
 
-. Word complexity
-. Plural vs singular
-. Homonyms
-. Different language
-. Repetition
+. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
+. Plural and singular confusion (`take`/`takes`)
+. Homonyms (`him`/`hymn`)
+. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
+. Bad words/negative words (`DEATH` should probably be excluded)
+. Different languages (if the algorithm could ever be made non-English)
+
+this_algorithm has a relatively small wordlist (~8192 words), so it is feasible to map plural/singular pairs and homonyms to the same value to prevent confusion.
+
+NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
+
+==== Wordlist
+
+this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
+Only the first few thousand words from `frequency-all.txt.gz` were used.
+
+==== Lemmatization
+
+Lemmatization is the process of reducing words to a common form.
+This includes mapping plural words to their singular form.
+
+The link:https://www.nltk.org/[nltk^] Python package and link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
+
+==== Annotated Wordlist
+
+Complementary to automated lemmatization is manual lemmatization and word removal.
+See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.
+
+For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
+Not pictured is the last column, which allows custom mappings:
+
+image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]
+
+==== Wordlist generation
+
+The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk-lemmatized words, and manually annotated words into one list.
+
+The output has the format:
+
+[source]
+----
+word,number
+the,1
+of,2
+him,3
+hymn,3
+apple,4
+apples,4
+----
 
 === Implementation [[implementation]]
 
diff --git a/docs/annotated_wordlist_example.png b/docs/annotated_wordlist_example.png
new file mode 100644
index 0000000..38c19c1
Binary files /dev/null and b/docs/annotated_wordlist_example.png differ
diff --git a/docs/annotated_words.ods b/docs/annotated_words.ods
index 24a54e0..667868e 100644
Binary files a/docs/annotated_words.ods and b/docs/annotated_words.ods differ
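
To illustrate the "Wordlist generation" step described in the `DESIGN.adoc` hunk, here is a minimal Python sketch of how a frequency-ordered word list, automated lemmas, and manual annotations might be merged into the `word,number` format. It is not the contents of `wordlist-new.ipynb`. The function names (`build_wordlist`, `to_csv`), the parameter names, and the rule that manual mappings take precedence over automated lemmas are all assumptions made for illustration, and the lemma mapping is passed in as a plain dictionary rather than computed with nltk.

```python
import csv
import io


def build_wordlist(words, lemma_of, manual_map, excluded):
    """Assign one number to every group of words sharing a canonical form.

    words:      iterable of words, assumed to be in descending frequency order
    lemma_of:   dict of automated lemmas, e.g. {"apples": "apple"} (hypothetical input)
    manual_map: dict of manual mappings, e.g. {"hymn": "him"} (hypothetical input)
    excluded:   set of words to drop entirely
    """
    number_of = {}  # canonical form -> assigned number
    rows = []
    for word in words:
        if word in excluded:
            continue
        # Assumption: a manual mapping overrides the automated lemma.
        canonical = manual_map.get(word, lemma_of.get(word, word))
        if canonical not in number_of:
            number_of[canonical] = len(number_of) + 1
        rows.append((word, number_of[canonical]))
    return rows


def to_csv(rows):
    """Render rows as CSV text with a `word,number` header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "number"])
    writer.writerows(rows)
    return buffer.getvalue()


example_rows = build_wordlist(
    words=["the", "of", "him", "hymn", "apple", "apples", "death"],
    lemma_of={"apples": "apple"},
    manual_map={"hymn": "him"},
    excluded={"death"},
)
print(to_csv(example_rows), end="")
```

With these example inputs, `him`/`hymn` share the number 3, `apple`/`apples` share the number 4, and `death` is dropped, which reproduces the sample output shown in the diff.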