// echo DESIGN.adoc | entr sh -c "podman run --rm -it --network none -v "${PWD}:/documents/" asciidoctor/docker-asciidoctor asciidoctor -r asciidoctor-mathematical -a mathematical-format=svg DESIGN.adoc; printf 'Done ($(date -Isecond))\n'"

include::common.adoc.template[]

:toc:
:stem:

= Design

The goal of this document is to walk through how the design was chosen.

If you want to see the algorithm definition, go to link:algorithm[ALGORITHM].

== 10,000 meter view

This project allows anyone to address ~1 square meter of land by bringing together:

. Memorizability (the format is encoded like an address; ex: `2891 APPLE SPONGE BALLOON`)
. Accuracy (between `0.73` and `1.52` square meters of resolution; no point is too close to or too far from any other point)
. Accessibility (multiple interfaces, open source, and offline operation)

Using these techniques:

. Mapping the two-dimensional surface area of Earth to a one-dimensional integer with Hilbert curves, thanks to Google's link:https://s2geometry.io/[S2 Geometry^] addressing scheme and the unofficial Rust link:https://lib.rs/crates/s2[s2^] crate
. Encoding the entire space of stem:[4.22*10^14] points using a small wordlist, so only very common words are used
. Keeping the implementation small and efficient so it can run offline on minimal hardware

== Goals [[goals]]

The list is in order of importance.

. Create a memorizable, address-like mapping at human scale (ex: houses, shops, roads)
+
About 1 square meter per point on Earth
. Accessible, open source, and offline. The only requirement of this project should be to have a computing device.
** Provide many user-friendly interfaces for *all* common use cases
** Decode/encode to standards like UTM and lat/lon
** Small binary, with low minimum CPU requirements
** Minimum (including zero) network traffic if possible
. Optimize memorizability
** Use only common words by using a smaller wordlist
** Map homonyms and singular/plural words to identical values
. Equally distribute points
.. The internal data structure can, therefore, not be lat/lon directly (see <<algorithm>>)
. Locality (TODO: I don't know if this is a good idea or not)

=== Non-goals [[non-goals]]

. By-hand encoding/decoding
** This is a nice feature of Xaddress, but it is not feasible with my algorithm
. Variable-resolution grid size
** It is not important that this algorithm support higher- or lower-resolution mappings. The resolution is fixed.

== Comparison [[comparison]]

Yes, this is yet another standard, but I believe it has significant improvements that no other existing algorithm can provide.

For a more detailed comparison, see link:https://wiki.openstreetmap.org/wiki/What3words[the OpenStreetMap wiki^].

.Comparison with similar algorithms
[%header,cols="h,,,,"]
|===
|
|xpin
|link:https://what3words.com/[what3words^]
|link:https://xaddress.org/[Xaddress^]
|link:https://maps.google.com/pluscodes/[Plus Codes^]/link:https://en.wikipedia.org/wiki/Open_Location_Code[Open Location Codes^]

|Format
m|2891 APPLE SPONGE BALLOON
m|///clip.apples.leap
m|7150 MAGICAL PEARL
m|849VCWC8+R9

|Open Source
|Yes
|No
|Yes
|Yes

|Memorizable footnote:[This is subjective, of course. I am defining this to mean similar enough to an address, which I consider memorizable.]
|Yes
|Yes (shorter than xpin)
|Yes (shorter than xpin)
|No

|Small wordlist footnote:[This is important for a few reasons. Firstly, a smaller wordlist means you can exclusively use very common words (for example, clicking around for 30 seconds on what3words, link:https://what3words.com/rampage.unanswerable.desirability[`///balaclava.jostles.ghoulish`^] was found as an address. These words are not commonly used). Secondly, it is easier to translate the common words to other languages. Thirdly, it allows plural words and homonyms to be mapped to the same point easily.]
|Yes (under 5,000)
|No (25,000-40,000)
|No (~200k)
|N/A

|Compact (can be recorded in a small number of bytes)
|No
|No
|No
|Yes

|Relatively uniform grid size footnote:[The variance in distance between two points is low. Non-uniform grid size usually comes from mapping linearly to latitude/longitude. For example, 1° of longitude at the equator is a much longer distance than 1° of longitude near a pole.]
|Yes (points range from .73m^2^ to 1.52m^2^) footnote:[Mappings are uniform in Hilbert space, but variance comes from reprojecting back to Earth's non-spherical surface.] footnote:[This is because a link:https://en.wikipedia.org/wiki/Hilbert_curve[Hilbert curve^] is used to evenly distribute points instead of mapping linearly to latitude/longitude. See link:https://s2geometry.io/resources/s2cell_statistics[S2 Cell Statistics^] (level 23) for more.]
|? footnote:[TODO: Does anybody know the answer to this?]
|No (11 meters at the equator, far smaller near the poles) footnote:[Minimum point resolution is lat 0.0001, lon 0.0001. Multiplying by the circumference at the equator, stem:[4.0*10^7*.0001/360=11] meters at the equator. For the circle around 90° north with an 11m radius, stem:[11*2*3.14*.0001/360=1.9*10^-5] meters (though Xaddress is not addressable here since it is not in a country).]
|N/A? footnote:[TODO: I don't know]

|Whole-Earth coverage
|Yes
|Yes
|No
|Yes

|Encode/decode offline
|Yes
|Yes (closed-source app is required)
|Yes (though no app exists)
|Yes

|Locality (similar addresses are physically nearby)
|? (This is being considered. Either option is possible)
|No
|No
|?

|===

== Algorithm [[algorithm]]

The algorithm was selected by thinking through each of these parts in order.

=== Addressing Scheme [[addressing-scheme]]

The original motivation for creating a new algorithm is an issue with existing, similar projects that map to latitude/longitude: mapping linearly to spherical coordinates like lat/lon produces a non-uniform distribution of points.

The arc length of 1° around the equator is

[stem]
++++
d_(0 text(°))=d_(text(circumference)) * theta / (360 text(°))
= (4.01*10^7 text(m)) / (360 text(°))
= 1.11*10^5 text(m)
++++

While the arc length of 1° of longitude at latitude 89.9° is approximately

[stem]
++++
d_(89.9 text(°))
= cos(89.9 text(°)) * d_(text(circumference)) * theta / (360 text(°))
= 194 text(m)
++++

Any algorithm that maps linearly to lat/lon therefore does not have a uniform distribution of points.
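The shrinking arc length can be sketched numerically. This is a small illustration (not project code), assuming a spherical Earth with the stem:[4.01*10^7] m circumference used above:

```rust
// Sketch (not project code): arc length of 1° of longitude at a given
// latitude, assuming a spherical Earth of circumference 4.01*10^7 m.
const CIRCUMFERENCE_M: f64 = 4.01e7;

fn arc_length_of_one_degree_lon(lat_deg: f64) -> f64 {
    // One degree of longitude spans 1/360 of the circle at that latitude,
    // and the circle's radius scales with cos(latitude).
    lat_deg.to_radians().cos() * CIRCUMFERENCE_M / 360.0
}

fn main() {
    let equator = arc_length_of_one_degree_lon(0.0); // ~1.11*10^5 m
    let near_pole = arc_length_of_one_degree_lon(89.9); // ~194 m
    println!("{equator:.0} {near_pole:.0}");
}
```

The three-orders-of-magnitude gap between the two results is exactly the non-uniformity the rest of this section is designed to avoid.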

Xaddress does not address this, and what3words created a proprietary algorithm so that more-dense locations have more points.
The aim of this algorithm is to fairly distribute points across the globe and encode to a reasonable resolution that is as functional as or more functional than what3words and Xaddress.

The addressing scheme I want to use, in principle, maps Earth's surface to a link:https://en.wikipedia.org/wiki/Hilbert_curve[Hilbert curve^].
This is done by treating Earth as a perfect sphere, then projecting a Hilbert curve onto each of the 6 faces of the bounding box (cube) of the sphere.
A more graphical representation of this Earth Cube can be found on Google's S2 Geometry website link:https://s2geometry.io/resources/earthcube[here^].
This application only uses the addressing scheme of the S2 geometry library, and no other features.

Next, the coordinates are reprojected a few times in order to more accurately represent Earth's surface.
Again, all of these come from link:https://s2geometry.io[S2].
They are described link:https://s2geometry.io/devguide/s2cell_hierarchy#coordinate-systems[here^].

.Hilbert curves projected and transformed onto Earth's surface. From link:https://s2.sidewalklabs.com/planetaryview/[sidewalklabs planetary view^]
image::./32g2t7pp.png[Image of Earth with 6 Hilbert curve projections]

=== Search space [[search-space]]

S2 addresses map to a cell at a given level. Two cell-level candidates were chosen due to their size and average resolution.

.Statistics of the two candidates for cell levels for this project. More can be seen link:https://s2geometry.io/resources/s2cell_statistics[here^]
|===
|Level |Min area |Max area |Average area |Number of cells |Bits required

|22
|2.90m^2^
|6.08m^2^
|4.83m^2^
|1.05*10^14^
|47

|23
|0.73m^2^
|1.52m^2^
|1.21m^2^
|4.22*10^14^
|49
|===

The last column indicates the number of cells that need to be mappable by this algorithm.
That is, I need a format whose values map to all 1.05*10^14^ or 4.22*10^14^ cells.
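The cell counts above follow directly from the cube construction: 6 faces, each subdivided into 4 children per level. As a sketch (not project code):

```rust
// Sketch (not project code): the number of S2 cells at a given level.
// Each of the 6 cube faces is split into 4 children per level, so level L
// has 6 * 4^L cells; representing them takes 3 face bits + 2 bits per level.
fn num_cells(level: u32) -> f64 {
    6.0 * 4f64.powi(level as i32)
}

fn bits_required(level: u32) -> u32 {
    3 + 2 * level
}

fn main() {
    // Level 22: ~1.055e14 cells, 47 bits; level 23: ~4.222e14 cells, 49 bits
    println!("{:.3e} {}", num_cells(22), bits_required(22));
    println!("{:.3e} {}", num_cells(23), bits_required(23));
}
```

These reproduce the "Number of cells" and "Bits required" columns in the table.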

=== Wordlist Size [[wordlist-size]]

Goal 1 of this project is to make the address memorizable.
Using only digits or letters is not memorizable, and there are already better methods to map to lat/lon (just use lat/lon if you only want to use numbers; use Plus Codes if you only want to use letters and numbers).

A wordlist would be a good candidate for memorizability since current addresses are word-based.
Xaddress has a similar idea - map a number and some words together to make a memorizable address.

In order to find a good wordlist that can map to all cells, a format needs to be selected.
Given some format of known values, the minimum size of the wordlist can be found.

For example, if I wanted to see how many words I need to support level 22 with a format like `1234 WORD1 WORD2` (4-digit number, then 1 word, then a second word), I could compute

[stem]
++++
n_(22)
= 1.05*10^14
= 10^4 * n_(text(w)) * n_(text(w))
++++

Where stem:[n_(22)] is the number of cells at level 22, stem:[10^4] is the number of possible 4-digit numbers (0000-9999), and stem:[n_(text(w))] is the number of words in the wordlist.
Therefore, the number of words required in a wordlist to support this would be

[stem]
++++
1.05*10^14 = 10^4 * n_(text(w)) * n_(text(w))
= 10^4 * n_(text(w))^2
++++

[stem]
++++
n_(text(w)) = sqrt(1.05*10^14 / 10^4)
= 1.025*10^5
++++

I would need a wordlist of about 102,500 words to support an algorithm like this.
I compare the wordlist sizes for different formats below.

The general formula is

[stem]
++++
n_(text(w)) = (text(total_combinations)/text(num_prefix))^(1/text(num_words))
++++

Where stem:[text(num_prefix)] is the number of number/letter combinations in the prefix and stem:[text(total_combinations)] is the total number of cells, stem:[n_(22)] or stem:[n_(23)].
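The general formula can be sketched as a small function (not project code) and checked against the worked example above:

```rust
// Sketch (not project code): minimum wordlist size for a given format,
// n_w = (total_combinations / num_prefix)^(1 / num_words).
fn min_wordlist_size(total_combinations: f64, num_prefix: f64, num_words: f64) -> f64 {
    (total_combinations / num_prefix).powf(1.0 / num_words)
}

fn main() {
    let n22 = 1.05e14; // cells at level 22
    let n23 = 4.22e14; // cells at level 23

    // `1234 WORD1 WORD2`: 10^4 prefixes, 2 words -> ~102,500 words
    println!("{:.0}", min_wordlist_size(n22, 1e4, 2.0));

    // `(1-1024) WORD WORD WORD`: 1024 prefixes, 3 words -> ~7,440 words
    println!("{:.0}", min_wordlist_size(n23, 1024.0, 3.0));
}
```

Plugging each format from the table below into this function reproduces its wordlist-size columns.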

.Comparison of approximate wordlist size for different formats
[%header,cols="m,,,"]
|===
|Format |n~22~ (thousand) |n~23~ (thousand) |Consider?

|999 WORD WORD
|324.04
|649.62
|No, wordlist too big

|WORD WORD WORD
|47.18
|75.01
|No, this is the exact format what3words uses

|(0-999)(A-Z0-9) WORD WORD footnote:wordlistsize[The letters `OD0Q LI1 Z2 S5 8B` would be excluded from the alphanumeric list, making the number of alphanumeric characters 36-13=23]
|62.40
|125.00
|No, wordlist too big

|(0-999)(A-Z0-9)(A-Z0-9) WORD WORD footnote:wordlistsize[]
|14.10
|28.20
|No, wordlist too big

|(1-128) WORD WORD WORD
|9.36
|14.88
|No, restricting numbers to 128 is not worth it

|999 WORD WORD WORD
|4.72
|7.50
|Maybe

|WORD WORD WORD WORD
|3.20
|4.53
|No, this does not look like an address

|9999 WORD WORD WORD
|2.19
|3.48
|Maybe

|(3-9A-Y)(3-9A-Y)(3-9A-Y) WORD WORD footnote:wordlistsize[]
|92.90
|186.00
|No, wordlist too big

|(3-9A-Y)(3-9A-Y)(3-9A-Y) WORD WORD WORD footnote:wordlistsize[]
|2.05
|3.26
|Maybe -- contender

|(1-1024) WORD WORD WORD
|4.68
|7.44
|Maybe -- contender
|===

TODO: Decide if the `(3-9A-Y)(3-9A-Y)(3-9A-Y) WORD WORD WORD` variation makes sense.
The alphanumeric component needs to encode 13 bits if the word components only encode 12 bits each (stem:[49-12*3=13]).
stem:[log_2(23^3) = log_2(12.17*10^3) = 13.57], so there is enough room.

NOTE: This project will use the `(1-1024) WORD0 WORD1 WORD2` variation (1 number component and 3 word components).

It is longer than Xaddress and what3words, but with the tradeoff of having a significantly smaller dictionary than both.
It requires a larger wordlist than the `9999 WORD WORD WORD` variation, but it allows versioning.


=== Reversibility [[reversibility]]

Xaddress allows addresses to be encoded as either `WORD1 WORD2 0000` or `0000 WORD2 WORD1`.
This might make more sense in locations where numbers come before the word portions of addresses.
TODO: I need better reasoning here with examples.

NOTE: This algorithm will allow exactly two encoding types.
The `0000 WORD0 WORD1 WORD2` and `WORD2 WORD1 WORD0 0000` formats are equivalent.

=== Versioning [[versioning]]

If there is ever a version update, the protocol should be able to support it in some fashion.
Old addresses should be decodable by new decoders, and decoders should also be able to report that they cannot decode newer versions.

Therefore, some version identifier should be embedded within the code.

I believe 2 bits of version information is sufficient.
These versions allow for adding different languages, adding different processing techniques for words, or any other generic change, since the version is read first.

NOTE: Least significant bits 11-12 of the number component will be used for versioning when the number component is parsed as a 32-bit unsigned integer.

For this version, version 0, both bits will be 0.
For example, the bits responsible for determining the algorithm version of this project for the address `382 WORD WORD WORD` are:

[source]
----
# 382 parsed as a 32-bit unsigned integer yields:
0000 0000 0000 0000 0000 0001 0111 1110
# Version bits are:      ^^
# Data bits are:           ^^ ^^^^ ^^^^

# Therefore, this address is using version 0
----
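The bit positions above translate into two masks. A minimal sketch (not project code, and the function names are illustrative):

```rust
// Sketch (not project code): extracting version and data from the number
// component, with the version in least significant bits 11-12 and the data
// in least significant bits 1-10, per the layout described above.
fn version(number: u32) -> u32 {
    (number >> 10) & 0b11
}

fn data_bits(number: u32) -> u32 {
    number & 0x3FF // bits 1-10
}

fn main() {
    // `382 WORD WORD WORD` from the example: version 0, all 10 data bits used
    println!("version={} data={}", version(382), data_bits(382));
}
```

A decoder that sees a nonzero version it does not understand can then refuse to decode, as required above.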

=== Locality [[locality]]

Due to the use of a Hilbert curve in the CellID mapping (the step immediately before encoding as `999 WORD WORD WORD`), it is possible for nearby points to have similar addresses.
This is similar to the real world, where `123 Random Road, Washington D.C.` is close to `222 Random Road, Washington D.C.`.

This is not always a good idea.
For example, imagine if the wordlist did not have distinct enough words.
The address `111 word word word` would then have a similar location to `111 words word word`, which may cause confusion, whereas if they were in significantly distant locations, there would be less confusion.

There are a few options:

[cols=",a,a,a"]
|===
|Option |Example |Pros |Cons

|Intentionally scramble addresses to avoid locality
|Knowing that CellID-encoded addresses have locality, intentionally randomize bit order or word order so that nearby locations have significantly different addresses.

`1234 APPLE GRAPE ORANGE` and `1234 SPONGE GREEN FACE` may be close together or may be far apart
|* Users will be used to drastically different addresses, so there will be no confusion when close addresses are dissimilar or distant addresses are similar
|* Loses the ability for humans to see that locations are nearby from looking at the addresses alone

|Preserve locality in some components
|`1234 APPLE GRAPE ORANGE` and `1234 SPONGE GREEN ORANGE` are relatively close together because `ORANGE` is equivalent and `GRAPE` and `GREEN` are similar.
For example, words 2 and 3 could be analogous to country and state.
|* Allows for some form of at-a-glance distance comparison
|* Might lead to confusion because close addresses are not always given similar names

|Similarity implies locality in all components
|`1234 APPLE GRAPE ORANGE` and `1234 SPONGE GREEN ORANGE` are somewhat close together because `ORANGE` is equivalent and `GRAPE` and `GREEN` are similar.
For example, the digits could be the smallest resolution on a per-adjacent-cell basis, and word1, word2, and word3 could be analogous to a local city, state, and country respectively
|* Simple implementation - just alphabetize the wordlist
* Allows for rough estimations of closeness
* It might be easier to memorize multiple locations if only some components change for nearby areas
|* (TODO: Confirm) Locality is broken at the prime meridian, so locations close to the prime meridian will have significantly different addresses
* Confusing property of similar addresses indicating closeness, but differing addresses not necessarily indicating distance

|Sameness implies locality in all components + scramble addresses
|`1234 APPLE GRAPE ORANGE` and `1234 SPONGE GREEN ORANGE` are somewhat close together because `ORANGE` is equivalent, but that is the only equivalent component, so nothing else can be inferred
|* May reduce confusion when trying to memorize close locations (as it may be hard to remember many addresses with very similar, but not the same, words)
* Simple implementation - just randomize the wordlist
* Behaves similarly to addresses in the United States, where city, state, zip code, and country are all included
* No confusing property of dissimilar addresses not representing nearby locations
|* ?
|===

The domain of each component of the address `0000 WORD0 WORD1 WORD2` is as follows:

. `0000` - Responsible for the least significant bits in the encoded layout (smallest area)
. `WORD0`
. `WORD1`
. `WORD2` - Responsible for the most significant bits in the encoded layout (largest area)

NOTE: This algorithm will preserve locality in all components by requiring that sameness implies locality in every component.

See <<encoded-layout>> for how this algorithm implements locality specifically.

=== Encoded Layout [[encoded-layout]]

The layout of the encoded address determines which bits map to which parts of the decoded string.

The CellID is a 64-bit unsigned integer and is the representation directly adjacent to the final address-like encoding.
CellIDs are represented as:

[source]
----
Example: Face 2, level 23

# Most significant 3 bits are for the face
face_number = 0b010

# This algorithm is always level 23
data_bits = level * 2 = 23 * 2 = 46

# The bit after the data bits is always 1
# All subsequent bits are always 0

Bit                     : 64              48               32              16             1
                        : |               |                |               |              |
                        : 01001011101010001011100010010011 10010011001001001100000000000000
Face number             : ^^^
Data bits               :    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^
Bit after data bits (1) :                                                   ^
All remaining bits (0)  :                                                    ^^^^^^^^^^^^^^
----

There are 6 faces at the top-level encoding, which take 3 bits to represent.
Each level subdivides a cell once horizontally and once vertically (a 2x2 split), which takes 2 bits per level.

The number of bits that encode the actual address is therefore

[stem]
++++
n_text(total_bits_required)
= n_text(face_bits) + n_text(subdivision_bits)
= 3 + 2*l
++++

Where stem:[l] is the subdivision level, such as 22 or 23.

Therefore, stem:[3+23(2)=49] bits are required to represent level 23 and stem:[3+22(2)=47] bits are required to represent level 22.

NOTE: Level 23 will be used for this project, which requires 49 bits to fully represent.
This does not include the <<versioning>> bits.

Since this algorithm will only ever use level 23, we know that bit 15 will always be 1 and bits 1-14 will always be 0, so these can be excluded from our encoding/decoding process.

The only 4 components in this address that can be used to represent a position are the number component and the three word components.

The number component will be parsed as a 32-bit unsigned integer.
For version 0, least significant bits 1-10 will be used for data.
Therefore, stem:[b_text(number)=10]

[stem]
++++
49
=b_text(number) + 3 * b_text(word)
=10 + 3 * b_text(word)
++++

[stem]
++++
b_text(word)
=13
++++

Therefore, each word component needs to represent 13 bits of information, or stem:[2^13=8192] words.

Using the responsibilities of each component we set in <<locality>>, the final layout can be determined.

NOTE: The layout of our encoding will be:

(From the example above)

[source]
----
All remaining bits (0)  :                                                    vvvvvvvvvvvvvv
Bit after data bits (1) :                                                   v
Data bits               :    vvvvvvvvvvvvvvvvvvvvvvvvvvvvv vvvvvvvvvvvvvvvvv
Face number             : vvv
Bit                     : 64              48               32              16             1
                        : |               |                |               |              |
                        : 01001011101010001011100010010011 10010011001001001100000000000000
Not represented         :                                                   ^^^^^^^^^^^^^^^
0000 (10 bits)          :                                         ^^^^^^^^^^
WORD0 (13 bits)         :                           ^^^^^^ ^^^^^^^
WORD1 (13 bits)         :              ^^^^^^^^^^^^^
WORD2 (13 bits)         : ^^^^^^^^^^^^^
----

Note that this is just what each component is responsible for encoding; it does not specify exactly how to encode the selected bits.
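The split above can be sketched as plain bit manipulation. This is an illustration (not the normative implementation), assuming the number component holds the 10 least significant position bits and `WORD2` the 13 most significant:

```rust
// Sketch (not project code): splitting the 49 position bits (3 face bits +
// 46 level-23 data bits) into `0000 WORD0 WORD1 WORD2` components and
// recombining them. The number component is least significant; WORD2 is
// most significant (largest area), matching the layout diagram.
fn split(position: u64) -> (u16, u16, u16, u16) {
    let number = (position & 0x3FF) as u16; // 10 bits
    let word0 = ((position >> 10) & 0x1FFF) as u16; // 13 bits
    let word1 = ((position >> 23) & 0x1FFF) as u16; // 13 bits
    let word2 = ((position >> 36) & 0x1FFF) as u16; // 13 bits
    (number, word0, word1, word2)
}

fn join(number: u16, word0: u16, word1: u16, word2: u16) -> u64 {
    (number as u64)
        | ((word0 as u64) << 10)
        | ((word1 as u64) << 23)
        | ((word2 as u64) << 36)
}

fn main() {
    let position: u64 = (1 << 48) | (7 << 23) | 42; // an arbitrary 49-bit value
    let (n, w0, w1, w2) = split(position);
    assert_eq!(join(n, w0, w1, w2), position);
    println!("number={n} word0={w0} word1={w1} word2={w2}");
}
```

Because each word index fits in 13 bits, it maps directly into the 8192-entry wordlist discussed next.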

=== Wordlist Selection [[wordlist-selection]]

Considerations when designing a wordlist:

. Word complexity (`SESQUIPEDALIAN` might not be reasonable to include)
. Plural and singular confusion (`take`/`takes`)
. Homonyms (`him`/`hymn`)
. Repetition (`1823 APPLE APPLE BLUE` might be confusing)
. Bad words/negative words (`DEATH` should probably be excluded)
. Different languages (if the algorithm is ever made non-English)

xpin has a relatively small wordlist of length ~8192, so it is feasible to map all plural/singular and homonym words to the same value, to prevent confusion.

NOTE: xpin will disallow complex words, map homonyms and singular/plural words to the same value, and allow repetition.

==== Wordlist

xpin used link:https://github.com/hackerb9/gwordlist[hackerb9/gwordlist^] for the wordlist and associated frequencies.
Only the first few thousand words from `frequency-all.txt.gz` were used.

==== Lemmatization

Lemmatization is the process of reducing words to a common form.
This includes mapping singular words to plurals.

The link:https://www.nltk.org/[nltk^] Python package and the link:https://wordnet.princeton.edu/[WordNet^] database (used via link:https://www.nltk.org/_modules/nltk/stem/wordnet.html[nltk.stem.wordnet^]) were used to automate some of the lemmatization.

==== Annotated Wordlist

Complementary to automated lemmatization is manual lemmatization and word removal.
See `docs/annotated_words.ods` for a link to the annotated spreadsheet, which marks words to be excluded as well as some manual mappings of words that WordNet did not map.

For example, in the screenshot below, we want to drop "Church" because it is a proper noun (but also because it could be seen as negative), keep "West" despite it being a capitalized word, and exclude "death" because it is negative.
Not pictured is the last column, which allows custom mappings:

image::./annotated_wordlist_example.png[Screenshot of docs/annotated_words.ods as an example]

==== Wordlist generation

The final wordlist can be generated by running the scripts in `./wordlist/`, which bring together the full list of words, the nltk-lemmatized words, and the manually annotated words into one list.
See link:wordlist[WORDLIST] for more information.

The output is of the format:

[source]
----
WORD,NUMBER
THE,1
OF,2
HIM,3
HYMN,3
APPLE,4
APPLES,4
----
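Note that the mapping is many-to-one: several words can share a value, but only one spelling per value is used when encoding. A sketch (not project code) of loading this format:

```rust
// Sketch (not project code): loading the generated wordlist so that
// homonyms and plurals (e.g. HIM/HYMN, APPLE/APPLES) decode to the same
// value, while encoding uses the first listed word as the canonical one.
use std::collections::HashMap;

fn parse_wordlist(csv: &str) -> (HashMap<String, u32>, HashMap<u32, String>) {
    let mut decode = HashMap::new(); // WORD -> value (many-to-one)
    let mut encode = HashMap::new(); // value -> canonical WORD (one-to-one)
    for line in csv.lines().skip(1) {
        // skip the WORD,NUMBER header
        if let Some((word, num)) = line.split_once(',') {
            let value: u32 = num.trim().parse().expect("bad number");
            decode.insert(word.to_string(), value);
            encode.entry(value).or_insert_with(|| word.to_string());
        }
    }
    (decode, encode)
}

fn main() {
    let csv = "WORD,NUMBER\nTHE,1\nOF,2\nHIM,3\nHYMN,3\nAPPLE,4\nAPPLES,4\n";
    let (decode, encode) = parse_wordlist(csv);
    assert_eq!(decode["HIM"], decode["HYMN"]); // homonyms share a value
    assert_eq!(encode[&4], "APPLE"); // canonical spelling for encoding
}
```

This is why lemmatization matters: `111 APPLE ...` and `111 APPLES ...` decode to the same cell.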

=== Implementation [[implementation]]

The link:https://s2geometry.io/[S2 Geometry^] project's addressing scheme is exactly the format we want.
However, I couldn't find a concrete description of the math behind the projections without looking at the source code.
The implementation used in this algorithm cannot change, because addresses might not map to the same location if it does.

Therefore, I will define the algorithm below so it is independent from S2's link:https://github.com/google/s2geometry/blob/master/src/s2/s2point.h[Point^] and link:https://github.com/google/s2geometry/blob/master/src/s2/s2cell_id.h[CellID^] source code in case S2 ever changes their design, but it will be almost exactly the same as the current implementations.
All credit for the projections should go to the S2 team.

.TODO
[%header,cols="m,a,"]
|===
|Name |Format |Description

|(number, word1, word2, word3)
|number ∈ [1, 9999]

word1, word2, word3 ∈ wordlist

len(wordlist) ≅ 2000
|

|(cellid)
|cellid ∈ [0, TODO] ∩ ℤ
|Cell ID: A 64-bit encoding of a face and a Hilbert curve parameter on that face, as discussed above.
The Hilbert curve parameter implicitly encodes both the position of a cell and its subdivision level.

|(face, i, j)
|face ∈ [0, 5] ∩ ℤ

i, j ∈ [0, 2^30^-1] ∩ ℤ
|Leaf-cell coordinates: The leaf cells are the subsquares that result after 30 levels of Hilbert curve subdivision, consisting of a 2^30^x2^30^ array on each face.
stem:[i] and stem:[j] are integers in the range [0, 2^30^-1] that identify a particular leaf cell.
The (i, j) coordinate system is right-handed on every face, and the faces are oriented such that Hilbert curves connect continuously from one face to the next.

|(face, s, t)
|face ∈ [0, 5] ∩ ℤ

s, t ∈ [0, 1] ∩ ℝ
|Cell-space coordinates: stem:[s] and stem:[t] are real numbers in the range [0, 1] that identify a point on the given face.
For example, the point stem:[(s, t) = (0.5, 0.5)] corresponds to the center of the cell at level 0.
Cells in (s, t)-coordinates are perfectly square and subdivided around their center point, just like the Hilbert curve construction.

|(face, u, v)
|face ∈ [0, 5] ∩ ℤ

u, v ∈ [-1, 1] ∩ ℝ
|Cube-space coordinates: To make the cells at each level more uniform in size after they are projected onto the sphere, we apply a nonlinear transformation of the form stem:[u=f(s)], stem:[v=f(t)] before projecting points onto the sphere.
This function also scales the stem:[(u,v)]-coordinates so that each face covers the biunit square [-1,1]×[-1,1].
Cells in stem:[(u,v)]-coordinates are rectangular, and are not necessarily subdivided around their center point (because of the nonlinear transformation stem:[f]).

|(x, y, z)
|x, y, z ∈ [-1, 1] ∩ ℝ
|Spherical point: The final S2Point is obtained by projecting the (face, u, v) coordinates onto the unit sphere.
Cells in stem:[(x,y,z)]-coordinates are quadrilaterals bounded by four spherical geodesic edges.

|(lat, lon)
|lat ∈ [-90, 90]

lon ∈ [-180, 180]
|Latitude/longitude in degrees.
|===

The encoding and decoding code comes from a modified link:https://docs.rs/s2/latest/s2/[S2 Rust^] and link:https://github.com/golang/geo/blob/master/s2/cellid.go[S2 Go^].
It might seem like duplication, but separating the math from the code allows this algorithm to be replicated in any language.

Sample source code:

[source,rust]
----
// Translating lat, lon to xpin

impl<'a> From<&'a Point> for CellID {
    fn from(p: &'a Point) -> Self {
        let (f, u, v) = xyz_to_face_uv(&p.0);
        let i = st_to_ij(uv_to_st(u));
        let j = st_to_ij(uv_to_st(v));
        CellID::from_face_ij(f, i, j) // Important
    }
}
impl<'a> From<&'a LatLng> for CellID {
    fn from(ll: &'a LatLng) -> Self {
        let p: Point = ll.into();
        Self::from(&p)
    }
}
impl<'a> From<&'a CellID> for LatLng {
    fn from(id: &'a CellID) -> Self {
        LatLng::from(Point::from(id))
    }
}
impl<'a> From<&'a LatLng> for Point {
    fn from(ll: &'a LatLng) -> Self {
        let phi = ll.lat.rad();
        let theta = ll.lng.rad();
        let cosphi = phi.cos();
        Point(Vector {
            x: theta.cos() * cosphi,
            y: theta.sin() * cosphi,
            z: phi.sin(),
        })
    }
}
impl<'a> From<&'a CellID> for Point {
    fn from(id: &'a CellID) -> Self {
        Point(id.raw_point().normalize()) // Important
    }
}
impl CellID {
    pub fn raw_point(&self) -> Vector {
        let (face, si, ti) = self.face_siti();
        face_uv_to_xyz(
            face,
            st_to_uv(siti_to_st(si as u64)),
            st_to_uv(siti_to_st(ti as u64)),
        )
    }
    fn face_siti(&self) -> (u8, u32, u32) {
        let (face, i, j, _) = self.face_ij_orientation(); // <= Important
        let delta = if self.is_leaf() {
            1
        } else if (i ^ (self.0 as u32 >> 2)) & 1 != 0 {
            2
        } else {
            0
        };
        (face, 2 * i + delta, 2 * j + delta)
    }
    pub fn is_leaf(&self) -> bool {
        self.0 & 1 != 0
    }
}
----

.Helper functions
[source,rust]
----
pub fn siti_to_st(si: u64) -> f64 {
    if si > MAX_SITI {
        1f64
    } else {
        (si as f64) / (MAX_SITI as f64)
    }
}

pub fn face_uv_to_xyz(face: u8, u: f64, v: f64) -> Vector {
    match face {
        0 => Vector::new(1., u, v),
        1 => Vector::new(-u, 1., v),
        2 => Vector::new(-u, -v, 1.),
        3 => Vector::new(-1., -v, -u),
        4 => Vector::new(v, -1., -u),
        5 => Vector::new(v, u, -1.),
        _ => unimplemented!(),
    }
}

pub fn st_to_uv(s: f64) -> f64 {
    if s >= 0.5 {
        (1. / 3.) * (4. * s * s - 1.)
    } else {
        (1. / 3.) * (1. - 4. * (1. - s) * (1. - s))
    }
}

pub fn uv_to_st(u: f64) -> f64 {
    if u >= 0. {
        0.5 * (1. + 3. * u).sqrt()
    } else {
        1. - 0.5 * (1. - 3. * u).sqrt()
    }
}

pub fn xyz_to_face_uv(r: &Vector) -> (u8, f64, f64) {
    let f = face(r);
    let (u, v) = valid_face_xyz_to_uv(f, r);
    (f, u, v)
}

// The inverse of face_uv_to_xyz for a point known to be on the given face,
// per the S2 Go implementation.
pub fn valid_face_xyz_to_uv(face: u8, r: &Vector) -> (f64, f64) {
    match face {
        0 => (r.y / r.x, r.z / r.x),
        1 => (-r.x / r.y, r.z / r.y),
        2 => (-r.x / r.z, -r.y / r.z),
        3 => (r.z / r.x, r.y / r.x),
        4 => (r.z / r.y, -r.x / r.y),
        5 => (-r.y / r.z, -r.x / r.z),
        _ => unimplemented!(),
    }
}

fn st_to_ij(s: f64) -> u32 {
    clamp((MAX_SIZE as f64 * s).floor() as u32, 0, MAX_SIZE_I32 - 1)
}
----
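Since the whole design depends on `st_to_uv` and `uv_to_st` being exact inverses, that property is worth checking in isolation. The following standalone sketch copies just those two functions so it compiles on its own:

```rust
// Sketch: a standalone round-trip check of the quadratic projection
// functions quoted above (copied here so this block is self-contained).
pub fn st_to_uv(s: f64) -> f64 {
    if s >= 0.5 {
        (1. / 3.) * (4. * s * s - 1.)
    } else {
        (1. / 3.) * (1. - 4. * (1. - s) * (1. - s))
    }
}

pub fn uv_to_st(u: f64) -> f64 {
    if u >= 0. {
        0.5 * (1. + 3. * u).sqrt()
    } else {
        1. - 0.5 * (1. - 3. * u).sqrt()
    }
}

fn main() {
    // uv_to_st inverts st_to_uv across the whole [0, 1] range of s
    for k in 0..=100 {
        let s = k as f64 / 100.0;
        assert!((uv_to_st(st_to_uv(s)) - s).abs() < 1e-12);
    }
    // The cell-space midpoint maps to the center of the face
    assert_eq!(st_to_uv(0.5), 0.0);
    println!("round trip ok");
}
```

Any reimplementation in another language should pass the same round-trip check before being trusted for encoding.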

=== Multi-encoding

TODO: Describe more

Due to the <<locality>> property, nearby locations will likely have common suffixes (`123 APPLE ORANGE GRAPE` will be close to `876 APPLE ORANGE GRAPE`).
This is a useful property that might be used to encode multiple addresses together.

For example, one might consider an encoding like `123 AND 876 APPLE ORANGE GRAPE` to encode both addresses above, with the `AND` conjoining keyword causing a fork at its position:

* `111 AND 222 A B C` => `111 A B C` and `222 A B C`
* `111 A AND 222 D B C` => `111 A B C` and `222 D B C`
* `111 A B AND 222 D E C` => `111 A B C` and `222 D E C`
* `111 A B C AND 222 D E F` => `111 A B C` and `222 D E F`

If two addresses have the same component, it might look slightly strange:

* `111 A B AND 111 D E C` => `111 A B C` and `111 D E C`

It must also be noted that ordering might cause issues.
For example, to encode `111 A B C`, `222 D E F`, and `333 A B C`:

* Without ordering, it's simple: `111 AND 333 A B C AND 222 D E F`
* With ordering, it's complicated: `111 A B C AND 222 D E F AND 333 A B C`

=== Compact hashing/Emojis

TODO: Consider this more

It would be useful, especially in multi-encoding, to generate a hash that maps to one or two (no more than 3) small pictures that can be used for verification.
This would allow verification at a quick glance.

Considerations:

* The list of emojis/pictures _must_ be large in order to be effective.
* When using emoji, the exact same emoji set must be used on every device.
If this does not happen, different emoji sets can cause issues: one person might describe a "blue rocket" on their device, which might be displayed as a green rocket on another device.

== Sample Data [[sample-data]]

In order to test this algorithm, I want to ensure the conversions are correct.
I will generate sample data and test against the link:https://github.com/google/s2geometry/[S2 C++ source code^] to ensure the CellIDs match for every tested latitude/longitude.

.Types of test data
* Standard latitude/longitude in the ranges (-90,90) and (-180,180) - randomly selected
* Non-normalized latitude/longitude in the ranges [-1000,1000] and [-1000,1000] - used to make sure latitude/longitude normalization logic is correct
* Corner cases - 0, ±1, ±90, and ±180, plus some variation

All of these cases are generated with this code:

[source,python]
----
include::../test-data/00-generate-test-data.py[]
----

== Interfaces [[interfaces]]

Goal 2 of this project is to provide many interfaces that are easy to use.

All interfaces *must* use minimal resources (network, CPU, RAM), so they can run on virtually any device.

.Some Ideas for Interfaces
. Command line
** `xpin -e -90,180` \=> `\...`
** Useful for developers, test data generation, etc.
** Easiest to write
. HTTP API
** Simple HTTP endpoint for JSON/XML encoding/decoding
** Will not require accounts and will not be resource intensive
. JavaScript/HTML file
** Downloadable `.html` file with embedded CSS and JavaScript that can encode/decode offline
. Offline PWA
** Useful for smartphones
** Should work offline if possible, but also interface with Google Maps/JAWG/OSM if there is internet connectivity
** Allows users to translate any link (Google Maps/OSM/Apple Maps) to xpin as a share target
. OSMAnd/other existing mapping applications
** Requires input from these applications

++++
<style>
#header, #content, #footnotes, #footer {
  max-width: unset !important;
}
.hll {
  background-color: #ff0;
}
</style>
++++