From e9fa9be7a249b35dec2ce4e65304d1cba3265ae2 Mon Sep 17 00:00:00 2001
From: Austen Adler
Date: Thu, 2 Mar 2023 00:58:31 -0500
Subject: [PATCH] Add wordlist document

---
 docs/DESIGN.adoc   | 17 ++++++-----
 docs/WORDLIST.adoc | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+), 8 deletions(-)
 create mode 100644 docs/WORDLIST.adoc

diff --git a/docs/DESIGN.adoc b/docs/DESIGN.adoc
index 0ef2861..8efc15b 100644
--- a/docs/DESIGN.adoc
+++ b/docs/DESIGN.adoc
@@ -539,19 +539,20 @@ image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods a
 
 ==== Wordlist generation
 
-The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk lemmatized words, and manual annotated words into one list.
+The final wordlist can be generated by running the scripts in `./wordlist/`, which bring together the full list of words, nltk lemmatized words, and manually annotated words into one list.
+See link:WORDLIST.html[WORDLIST] for more information.
 
 The output is of the format:
 
 [source]
 ----
-word,number
-the,1
-of,2
-him,3
-hymn,3
-apple,4
-apples,4
+WORD,NUMBER
+THE,1
+OF,2
+HIM,3
+HYMN,3
+APPLE,4
+APPLES,4
 ----
 
 === Implementation [[implementation]]
diff --git a/docs/WORDLIST.adoc b/docs/WORDLIST.adoc
new file mode 100644
index 0000000..7d51c0f
--- /dev/null
+++ b/docs/WORDLIST.adoc
@@ -0,0 +1,74 @@
+// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
+
+:toc:
+:nofooter:
+:!webfonts:
+:source-highlighter: rouge
+:rouge-style: molokai
+:sectlinks:
+
+= Wordlist
+
+The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^])
+
+But this list is not sufficient.
+It contains words that are profane, negative, or otherwise unfit for this algorithm.
+Because the wordlist required for this_algorithm is relatively small (8194 words), we can reduce this 53,000-word list substantially.
+
+Processing steps (source code is available in the `wordlist/` directory):
+
+* `00-frequency-list` - A base list of most possible words (not necessarily including words from step 02), sorted by how desirable each word is to include, which is frequency in this case
++
+[source]
+----
+WORD,FREQUENCY
+THE,18399669358
+OF,12042045526
+BE,9032373066
+AND,8588851162
+----
+* `01-lemmatized-words` - List of words that should be lemmatized and represent the same underlying value, in any order
++
+[source]
+----
+WORD,LEMMATIZED_WORD,LEMMATIZER
+ARE,BE,SPACY
+ITS,IT,SPACY
+NEED,NEE,SPACY
+THOUGHT,THINK,SPACY
+SOMETIMES,SOMETIME,SPACY
+----
+* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
++
+[source]
+----
+WORD1,WORD2
+ADD,ADDS
+ADS,ADDS
+AFFECTED,EFFECT
+AFFECT,EFFECT
+AFFECTIONS,AFFECTION
+----
+* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result.
+Words can be excluded for any reason.
++
+[source]
+----
+WORD
+A
+AARON
+ABA
+ABANDON
+ABANDONING
+----
+* `04-deduplicated-words` - The final list of words and the associated numeric value
++
+[source]
+----
+WORD,NUMBER
+THE,1
+OF,2
+BEE,3
+ARE,3
+BE,3
+----
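Review note: the processing steps listed in WORDLIST.adoc can be read as a small grouping problem. The sketch below is not the code in `wordlist/`; it is a minimal illustration, assuming that steps 01 and 02 both reduce to pairs of words that must share a number, that a group is dropped when any member appears in step 03, and that numbers are assigned in frequency order. The function name `build_wordlist` and its parameters are hypothetical.

```python
from collections import defaultdict


def build_wordlist(frequency_words, lemma_pairs, excluded, size):
    """Return (word, number) rows; words in one group share a number.

    frequency_words: words ordered from most to least desirable (step 00).
    lemma_pairs: (word, other_word) pairs that share a value (steps 01, 02).
    excluded: words whose whole group is dropped (step 03).
    size: maximum number of distinct values to assign (an assumption).
    """
    parent = {}

    def find(word):
        # Union-find lookup with path halving; unseen words start alone.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    # Merge every pair of words that should represent the same value.
    for left, right in lemma_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # A group is excluded if any one of its members is excluded.
    excluded_roots = {find(word) for word in excluded}

    # Collect the members of every group known so far.
    members = defaultdict(list)
    for word in list(parent):
        members[find(word)].append(word)

    rows = []
    seen_roots = set()
    for word in frequency_words:
        root = find(word)
        if root in excluded_roots or root in seen_roots:
            continue
        if len(seen_roots) >= size:
            break
        seen_roots.add(root)
        number = len(seen_roots)
        # Emit the whole group, including members absent from step 00.
        group = set(members.get(root, [])) | {word}
        rows.extend((member, number) for member in sorted(group))
    return rows


if __name__ == "__main__":
    rows = build_wordlist(
        ["THE", "OF", "BE", "AND"],
        [("ARE", "BE"), ("BEE", "BE")],
        {"A"},
        size=3,
    )
    print("WORD,NUMBER")
    for word, number in rows:
        print(f"{word},{number}")
```

Members of a group are emitted in alphabetical order here only to keep the output deterministic; the order within a group in `04-deduplicated-words` may differ.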