Update docs
parent
97f2cc5d05
commit
e2778d281c
@@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does
Considerations when designing a wordlist:

. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
. Plural and singular confusion (`take`/`takes`)
. Homonyms (`him`/`hymn`)
. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
. Bad words/negative words (`DEATH` should probably be excluded)
. Different languages (if the algorithm is ever made non-English)

this_algorithm has a relatively small wordlist of length ~8192, so it is feasible to map all plural/singular and homonym words to the same value to prevent confusion.

NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
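
A minimal sketch of why the shared values help (the words and numbers here are made up, not taken from the real wordlist): homonyms and singular/plural pairs carry the same value, so a misheard word still decodes to the same data.

[source,python]
----
# Hypothetical fragment of a wordlist in which homonyms and
# singular/plural pairs share a value (numbers are made up).
WORD_TO_VALUE = {
    "him": 3, "hymn": 3,      # homonyms -> same value
    "apple": 4, "apples": 4,  # singular/plural -> same value
}

# Decoding picks one canonical spelling per value.
VALUE_TO_WORD = {3: "him", 4: "apple"}

def encode(word: str) -> int:
    return WORD_TO_VALUE[word.lower()]

# "HYMN" and "him" carry the same value, so the decoded data is identical.
assert encode("HYMN") == encode("him") == 3
print(VALUE_TO_WORD[encode("HYMN")])  # -> him
----
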
==== Wordlist

this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
Only the first few thousand words from `frequency-all.txt.gz` were used.
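
As a rough sketch of that step (assuming whitespace-separated lines with the word in the second column and `#`-prefixed comment lines; the real column layout and cutoff should be checked against the file):

[source,python]
----
import gzip

N = 20_000  # assumed cutoff; "the first few thousand words" per the text above

def load_top_words(path: str, n: int = N) -> list[str]:
    """Read the first n words from the gwordlist frequency file."""
    words = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header/comment lines
            parts = line.split()
            if len(parts) >= 2:
                words.append(parts[1].lower())  # assumed: word is column 2
            if len(words) >= n:
                break
    return words

top_words = load_top_words("frequency-all.txt.gz")
----
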

==== Lemmatization

Lemmatization is the process of reducing words to a common base form.
This includes mapping plural and singular forms of a word to the same base form.

The link:https://www.nltk.org/[nltk^] Python package and link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
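
For example, nltk's WordNet lemmatizer can be used as below (a minimal illustration; the exact calls in the wordlist notebook may differ):

[source,python]
----
import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet data must be downloaded once before first use.
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

# Plural nouns reduce to their singular lemma.
print(lemmatizer.lemmatize("apples"))          # -> apple

# Verbs need their part of speech stated explicitly.
print(lemmatizer.lemmatize("takes", pos="v"))  # -> take
----
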
==== Annotated Wordlist

Complementary to automated lemmatization is manual lemmatization and word removal.
See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.

For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
Not pictured is the last column, which allows custom mappings:

image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]

==== Wordlist generation

The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk-lemmatized words, and manually annotated words into one list.
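
As a rough sketch of the kind of merge the notebook performs (the function, argument names, and data here are hypothetical, not the notebook's actual code):

[source,python]
----
# Hypothetical sketch: frequency-ordered words, automatic lemmas, and
# manual annotations collapse into one shared number per word group.
def build_wordlist(words, lemma_of, excluded, manual_map):
    """words: frequency-ordered words; lemma_of: word -> lemma (from nltk);
    excluded: words marked for removal; manual_map: manual overrides."""
    value_of = {}   # canonical word -> assigned number
    rows = []       # (word, number) pairs, matching the output format below
    next_value = 1
    for word in words:
        if word in excluded:
            continue
        canonical = manual_map.get(word, lemma_of.get(word, word))
        if canonical not in value_of:
            value_of[canonical] = next_value
            next_value += 1
        rows.append((word, value_of[canonical]))
    return rows

# Tiny made-up input mirroring the example output shown below.
rows = build_wordlist(
    words=["the", "of", "him", "hymn", "apple", "apples", "death"],
    lemma_of={"apples": "apple"},
    excluded={"death"},
    manual_map={"hymn": "him"},
)
# rows == [("the", 1), ("of", 2), ("him", 3), ("hymn", 3), ("apple", 4), ("apples", 4)]
----
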

The output is of the format:

[source]
----
word,number
the,1
of,2
him,3
hymn,3
apple,4
apples,4
----
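
One way to consume this format (a sketch, not part of the published tooling; the path is hypothetical) is to build an encode map from every word and a decode map from the first word listed for each number:

[source,python]
----
import csv

def load_wordlist(path):
    """Build encode/decode maps from the word,number CSV described above."""
    word_to_num = {}
    num_to_word = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # header row: word,number
            word, num = row["word"], int(row["number"])
            word_to_num[word] = num
            # The first word seen for a number becomes the canonical decode.
            num_to_word.setdefault(num, word)
    return word_to_num, num_to_word

word_to_num, num_to_word = load_wordlist("wordlist.csv")
assert word_to_num["hymn"] == word_to_num["him"]
assert num_to_word[word_to_num["apples"]] == "apple"
----
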

=== Implementation [[implementation]]
BIN docs/annotated_wordlist_example.png Normal file (21 KiB)
Binary file not shown.