Add wordlist document
parent d8e39fa626
commit e9fa9be7a2
@@ -539,19 +539,20 @@ image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods a
 
 ==== Wordlist generation
 
-The final wordlist can be generated by using `./docs/wordlist-new.ipynb`, which brings together the full list of words, nltk lemmatized words, and manual annotated words into one list.
+The final wordlist can be generated by running the scripts in `./wordlist/`, which brings together the full list of words, nltk lemmatized words, and manual annotated words into one list.
+See link:WORDLIST.html[WORDLIST] for more information
 
 The output is of the format:
 
 [source]
 ----
-word,number
-the,1
-of,2
-him,3
-hymn,3
-apple,4
-apples,4
+WORD,NUMBER
+THE,1
+OF,2
+HIM,3
+HYMN,3
+APPLE,4
+APPLES,4
 ----
 
 === Implementation [[implementation]]
74
docs/WORDLIST.adoc
Normal file
@@ -0,0 +1,74 @@
+// echo WORDLIST.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg WORDLIST.adoc; printf 'Done ($(date -Isecond))\n'"
+
+:toc:
+:nofooter:
+:!webfonts:
+:source-highlighter: rouge
+:rouge-style: molokai
+:sectlinks:
+
+= Wordlist
+
+The wordlist for this_algorithm begins with the wordlist from link:https://github.com/ps-kostikov/english-word-frequency/[ps-kostikov/english-word-frequency^] (link:https://github.com/ps-kostikov/english-word-frequency/blob/master/data/frequency_list.txt[data/frequency_list.txt^]).
+
+But this list is not usable as-is.
+It contains profane or negative words, and words otherwise unfit for this algorithm.
+Because the wordlist required for this_algorithm is relatively small (8194 words), we can reduce this 53,000-word list substantially.
+
+Processing steps (source code is available in the `wordlist/` directory):
+
+* `00-frequency-list` - A base list containing most of the candidate words (not necessarily including words from step 02), sorted by how desirable each word is to include, which in this case is its frequency
++
+[source]
+----
+WORD,FREQUENCY
+THE,18399669358
+OF,12042045526
+BE,9032373066
+AND,8588851162
+----
+* `01-lemmatized-words` - List of words that should be lemmatized together because they represent the same underlying value, in any order
++
+[source]
+----
+WORD,LEMMATIZED_WORD,LEMMATIZER
+ARE,BE,SPACY
+ITS,IT,SPACY
+NEED,NEE,SPACY
+THOUGHT,THINK,SPACY
+SOMETIMES,SOMETIME,SPACY
+----
+* `02-custom-lemmatizations` - List of custom lemmatizations, used for any words or homonyms that the automatic lemmatization failed to capture
++
+[source]
+----
+WORD1,WORD2
+ADD,ADDS
+ADS,ADDS
+AFFECTED,EFFECT
+AFFECT,EFFECT
+AFFECTIONS,AFFECTION
+----
+* `03-exclude` - Words to exclude. If any word in a lemmatization group is present, the entire group is excluded from the result.
+Words can be excluded for any reason.
++
+[source]
+----
+WORD
+A
+AARON
+ABA
+ABANDON
+ABANDONING
+----
+* `04-deduplicated-words` - The final list of words and their associated numeric values
++
+[source]
+----
+WORD,NUMBER
+THE,1
+OF,2
+BEE,3
+ARE,3
+BE,3
+----
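The processing steps listed in `docs/WORDLIST.adoc` could fit together roughly as in the sketch below. It is not the code from the `wordlist/` directory: the names `find` and `build_wordlist`, the optional `limit` parameter, and the in-memory row format are all assumptions made for illustration, and the frequency given for `ABANDONING` is made up. It assumes that rows from steps 01 and 02 are treated alike as pairs of equivalent words, that groups are numbered in descending order of frequency, and that members of a group are emitted alphabetically.

```python
from collections import defaultdict


def find(parent, word):
    """Return the representative word of the group that `word` belongs to."""
    parent.setdefault(word, word)
    while parent[word] != word:
        word = parent[word]
    return word


def build_wordlist(frequencies, pairs, excluded, limit=None):
    """Group equivalent words, drop excluded groups, and number the rest.

    frequencies: (word, frequency) rows, as in 00-frequency-list
    pairs: (word, word) rows, assumed to be merged from steps 01 and 02
    excluded: words from 03-exclude
    limit: hypothetical cap on the number of groups (None means no cap)
    """
    parent = {}
    # Steps 01 and 02: merge each pair of equivalent words into one group.
    for first, second in pairs:
        parent[find(parent, first)] = find(parent, second)

    # Step 03: a group is dropped if any of its members is excluded.
    banned = {find(parent, word) for word in excluded}

    # Step 00: walk the words from most to least frequent and number each
    # surviving group the first time one of its members is seen.
    numbers = {}
    for word, _ in sorted(frequencies, key=lambda row: row[1], reverse=True):
        root = find(parent, word)
        if root in banned or root in numbers:
            continue
        if limit is not None and len(numbers) >= limit:
            break
        numbers[root] = len(numbers) + 1

    # Step 04: emit every member of every numbered group.
    members = defaultdict(list)
    for word in list(parent):
        members[find(parent, word)].append(word)
    return [
        (word, number)
        for root, number in numbers.items()
        for word in sorted(members[root])
    ]


frequencies = [
    ("THE", 18399669358),
    ("OF", 12042045526),
    ("BE", 9032373066),
    ("AND", 8588851162),
    ("ABANDONING", 1),  # made-up frequency, for illustration only
]
pairs = [("ARE", "BE"), ("BEE", "BE"), ("ABANDONING", "ABANDON")]
excluded = {"ABANDON"}

for word, number in build_wordlist(frequencies, pairs, excluded):
    print(f"{word},{number}")
```

In this sketch, excluding `ABANDON` also removes `ABANDONING`, because the two share a group, while `ARE`, `BE`, and `BEE` all receive the same number.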