Update docs

Austen Adler 2023-02-16 20:51:16 -05:00
parent 97f2cc5d05
commit e2778d281c
3 changed files with 49 additions and 5 deletions


@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does
Considerations when designing a wordlist:
. Word complexity
. Plural vs singular
. Homonyms
. Different language
. Repetition
. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
. Plural and singular confusion (`take`/`takes`)
. Homonyms (`him`/`hymn`)
. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
. Bad words/negative words (`DEATH` should probably be excluded)
. Different languages (if the algorithm is ever adapted to non-English languages)
this_algorithm has a relatively small wordlist of about 8192 words, so it is feasible to map all plural/singular variants and homonyms to the same value to prevent confusion.
NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
==== Wordlist
this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
Only the first few thousand words from `frequency-all.txt.gz` were used.
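As a rough illustration of "taking the first few thousand words", the sketch below reads the most frequent alphabetic words from a gzip-compressed frequency file. The column layout (rank, then word, then counts, with `#` marking header lines) and the helper name `read_top_words` are assumptions made for this example, not details taken from the project.

```python
import gzip


def read_top_words(path, limit):
    """Return up to `limit` lowercase alphabetic words, in file order.

    Assumption: whitespace-separated columns with the word in the second
    column, and '#' at the start of header/comment lines.
    """
    words = []
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2 or fields[0].startswith("#"):
                continue
            word = fields[1].lower()
            # Skip punctuation, numbers, and entries already collected.
            if not word.isalpha() or word in seen:
                continue
            seen.add(word)
            words.append(word)
            if len(words) >= limit:
                break
    return words
```

Because the file is assumed to be sorted by descending frequency, stopping after `limit` accepted words yields the most common ones.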
==== Lemmatization
Lemmatization is the process of reducing words into a common form.
This includes mapping the singular and plural forms of a word to the same lemma.
The link:https://www.nltk.org/[nltk^] Python package and the link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
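A minimal sketch of how lemmatization can group words follows. The helper names `group_by_lemma` and `wordnet_noun_lemmatizer` are hypothetical and are not taken from the project's notebook; the only nltk call used is `WordNetLemmatizer.lemmatize`, which needs the WordNet corpus to be downloaded first (`nltk.download("wordnet")`).

```python
def group_by_lemma(words, lemmatize):
    """Map each word to the first-seen word that shares its lemma."""
    representative = {}  # lemma -> first word seen with that lemma
    mapping = {}
    for word in words:
        lemma = lemmatize(word)
        mapping[word] = representative.setdefault(lemma, word)
    return mapping


def wordnet_noun_lemmatizer():
    """Build a noun lemmatizer backed by nltk's WordNetLemmatizer.

    nltk is imported lazily so group_by_lemma can be used without it.
    """
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # pos="n" treats every word as a noun; verbs would need pos="v".
    return lambda word: lemmatizer.lemmatize(word, pos="n")
```

Passing the lemmatizer in as a function keeps the grouping logic independent of nltk, so manual mappings or a different lemmatizer could be substituted.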
==== Annotated Wordlist
Complementary to automated lemmatization is manual lemmatization and word removal.
See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded and records manual mappings for words that WordNet did not map.
For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
Not pictured is the last column which allows custom mappings:
image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]
==== Wordlist generation
The final wordlist can be generated with `./docs/wordlist-new.ipynb`, which brings together the full list of words, the nltk-lemmatized words, and the manually annotated words into one list.
The output is of the format:
[source]
----
word,number
the,1
of,2
him,3
hymn,3
apple,4
apples,4
----
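The combination step can be sketched roughly as below. This is not the notebook's code: the function names `build_wordlist` and `to_csv`, their parameters, and the rule that manual mappings take precedence over automated lemmas are all assumptions made for illustration.

```python
import csv
import io


def build_wordlist(words, lemma_of, excluded=(), custom=None):
    """Assign numbers so that words with one canonical form share a number.

    words:    words in descending frequency order
    lemma_of: dict of word -> canonical form from automated lemmatization
    excluded: words dropped entirely (for example, negative words)
    custom:   manual word -> canonical form overrides (for example, homonyms)
    """
    custom = custom or {}
    excluded = set(excluded)
    numbers = {}  # canonical form -> assigned number
    rows = []
    for word in words:
        if word in excluded:
            continue
        # Assumed precedence: manual mapping, then lemma, then the word itself.
        canonical = custom.get(word, lemma_of.get(word, word))
        number = numbers.setdefault(canonical, len(numbers) + 1)
        rows.append((word, number))
    return rows


def to_csv(rows):
    """Render (word, number) rows in the `word,number` format."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "number"])
    writer.writerows(rows)
    return buffer.getvalue()
```

For instance, calling `build_wordlist` with the words `the, of, death, him, hymn, apple, apples`, a lemma map of `apples` to `apple`, `death` excluded, and a manual mapping of `hymn` to `him` produces the same rows as the example output above.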
=== Implementation [[implementation]]
