Unicode normalization and alphabet case
Unicode normalization
Unicode brings together the character sets historically used in many countries.
As a result, it contains more than one code point for what is essentially the same letter.
For example, the letter "a" (U+0061) also exists as U+FF41, "Fullwidth Latin Small Letter A", which is used on Japanese computers.
Japanese katakana letters are also troublesome: the letter "ア" is encoded both as U+30A2 and, in its halfwidth form, as U+FF71.
These duplicates should be unified before indexing, so that a search finds a word regardless of which variant was used to write it.
The jsngram.text2 module provides a function that performs Unicode normalization.
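To make the idea concrete, here is a minimal sketch using Python's standard unicodedata module with the NFKC form. This is an assumption for illustration only: the text does not say which normalization form jsngram.text2 applies, and the helper name unify is hypothetical, not part of jsngram.

```python
import unicodedata


def unify(text):
    """Hypothetical helper (not the jsngram API).

    Folds compatibility variants such as fullwidth Latin letters and
    halfwidth katakana into their ordinary forms using NFKC.
    """
    return unicodedata.normalize('NFKC', text)


fullwidth_a = '\uff41'         # FULLWIDTH LATIN SMALL LETTER A
halfwidth_katakana = '\uff71'  # HALFWIDTH KATAKANA LETTER A

# Both variants collapse to the ordinary code points named in the text.
assert unify(fullwidth_a) == '\u0061'         # "a"
assert unify(halfwidth_katakana) == '\u30a2'  # "ア"
```

Applying the same function to both the indexed text and the search query is what lets the two spellings meet in the index.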
Alphabet case
Alphabetic characters are always indexed in lower case.
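The effect of lowercasing can be sketched with Python's built-in str.lower. The helper name index_key is hypothetical and is not taken from jsngram, which may implement this step differently.

```python
def index_key(token):
    """Hypothetical helper (not the jsngram API): lowercase a token before indexing."""
    return token.lower()


# Differently cased spellings map to the same index entry.
assert index_key('Unicode') == index_key('UNICODE') == 'unicode'

# Characters without case, such as katakana, pass through unchanged.
assert index_key('\u30a2') == '\u30a2'
```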