Update docs
parent
97f2cc5d05
commit
e2778d281c
@ -502,11 +502,55 @@ Note that this is just what each component is responsible for encoding, but does
|
|||||||
|
|
||||||
Considerations when designing a wordlist:
|
Considerations when designing a wordlist:
|
||||||
|
|
||||||
. Word complexity
|
. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
|
||||||
. Plural vs singular
|
. Plural and singular confusion (`take`/`takes`)
|
||||||
. Homonyms
|
. Homonyms (`him`/`hymn`)
|
||||||
. Different language
|
. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
|
||||||
. Repetition
|
. Bad words/negative words (`DEATH` should probably be excluded)
|
||||||
|
. Different languages (if the algorithm could ever be made non-English)
|
||||||
|
|
||||||
|
this_algorithm has a relatively small wordlist of length ~8192, so it is feasible to map all plural/singular and homonym words to the same value to prevent confusion.
|
||||||
|
|
||||||
|
NOTE: this_algorithm will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.
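The many-to-one mapping described above can be sketched as follows. This is a minimal illustration, not this_algorithm's actual implementation: the names `GROUPS`, `build_lookup`, and `decode` are hypothetical, and the choice of the first word in each group as the canonical form is an assumption.

```python
# Hypothetical groups of confusable words (assumption: the first entry
# in each group is the canonical form used when producing output).
GROUPS = [
    ["the"],
    ["of"],
    ["him", "hymn"],      # homonyms share one value
    ["apple", "apples"],  # singular/plural share one value
]


def build_lookup(groups):
    """Return (word -> value, value -> canonical word) for the given groups."""
    word_to_value = {}
    value_to_word = {}
    for value, group in enumerate(groups):
        value_to_word[value] = group[0]
        for word in group:
            # Store upper-case keys so lookups are case-insensitive.
            word_to_value[word.upper()] = value
    return word_to_value, value_to_word


WORD_TO_VALUE, VALUE_TO_WORD = build_lookup(GROUPS)


def decode(words):
    """Map each word to its value; repeated words are allowed."""
    return [WORD_TO_VALUE[w.upper()] for w in words]
```

With this layout, `decode(["APPLE", "APPLE"])` is accepted like any other input, and a user who types `hymn` instead of `him` still gets the same value.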
|
||||||
|
|
||||||
|
==== Wordlist
|
||||||
|
|
||||||
|
this_algorithm used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
|
||||||
|
Only the first few thousand words from `frequency-all.txt.gz` were used.
|
||||||
|
|
||||||
|
==== Lemmatization
|
||||||
|
|
||||||
|
Lemmatization is the process of reducing words into a common form.
|
||||||
|
This includes mapping plural words to their singular form.
|
||||||
|
|
||||||
|
The link:https://www.nltk.org/[nltk^] Python package and link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.
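As a rough sketch of how automated lemmatization could group words, the snippet below buckets words by their lemma. The helper names (`group_by_lemma`, `nltk_noun_lemmatizer`) are hypothetical, and treating every word as a noun is an assumption made for illustration; the actual notebook may use different settings. `nltk_noun_lemmatizer` requires nltk and its WordNet corpus to be installed.

```python
from collections import defaultdict


def group_by_lemma(words, lemmatize):
    """Group words that share a lemma; `lemmatize` is any str -> str callable."""
    groups = defaultdict(list)
    for word in words:
        groups[lemmatize(word)].append(word)
    return dict(groups)


def nltk_noun_lemmatizer():
    """Return a lemmatizer backed by nltk's WordNetLemmatizer.

    Assumption: words are treated as nouns (pos="n"). Needs the WordNet
    corpus, e.g. via nltk.download("wordnet").
    """
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return lambda word: lemmatizer.lemmatize(word.lower(), pos="n")
```

Passing the lemmatizer in as a callable keeps the grouping logic independent of nltk, so manual mappings can be layered on top by wrapping the callable.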
|
||||||
|
|
||||||
|
==== Annotated Wordlist
|
||||||
|
|
||||||
|
Complementary to automated lemmatization is manual lemmatization and word removal.
|
||||||
|
See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.
|
||||||
|
|
||||||
|
For example, in the screenshot below, we want to drop "Church" because it's a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
|
||||||
|
Not pictured is the last column, which allows custom mappings:
|
||||||
|
|
||||||
|
image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]
|
||||||
|
|
||||||
|
==== Wordlist generation
|
||||||
|
|
||||||
|
The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk-lemmatized words, and manually annotated words into one list.
|
||||||
|
|
||||||
|
The output is of the format:
|
||||||
|
|
||||||
|
[source]
|
||||||
|
----
|
||||||
|
word,number
|
||||||
|
the,1
|
||||||
|
of,2
|
||||||
|
him,3
|
||||||
|
hymn,3
|
||||||
|
apple,4
|
||||||
|
apples,4
|
||||||
|
----
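A consumer of this output could load it roughly as follows. This is a sketch under assumptions: the function name `load_wordlist` is hypothetical, and the sample text simply repeats the example rows above rather than real generated data.

```python
import csv
import io

# Sample rows copied from the format example; not real generated output.
SAMPLE = """word,number
the,1
of,2
him,3
hymn,3
apple,4
apples,4
"""


def load_wordlist(text):
    """Parse `word,number` CSV text into two lookup tables."""
    word_to_number = {}
    number_to_words = {}
    for row in csv.DictReader(io.StringIO(text)):
        number = int(row["number"])
        word_to_number[row["word"]] = number
        # Several words may share a number (homonyms, singular/plural).
        number_to_words.setdefault(number, []).append(row["word"])
    return word_to_number, number_to_words


word_to_number, number_to_words = load_wordlist(SAMPLE)
```

Keeping both directions makes it easy to check that every word resolves to exactly one number while a number may list several accepted spellings.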
|
||||||
|
|
||||||
=== Implementation [[implementation]]
|
=== Implementation [[implementation]]
|
||||||
|
|
||||||
|
BIN
docs/annotated_wordlist_example.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 21 KiB |