Unicode normalization and case js-py-ngram-full-text-search

Unicode normalization and alphabet case

Unicode normalization

Unicode merges character sets that evolved separately in many countries, so it contains duplicate entries for what are effectively the same letters. For example, the Latin letter "a" is U+0061, but it also exists as U+FF41, "Fullwidth Latin Small Letter A", which is used on Japanese computers. Japanese Katakana letters are troublesome in the same way: the letter "ア" exists as both U+30A2 (the standard form) and U+FF71 (the halfwidth form "ｱ"). These duplicates should be unified before indexing so that a search matches whichever form was typed.

The jsngram.text2 module provides a function that performs Unicode normalization.
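
To illustrate the idea, here is a minimal sketch that uses only the Python standard library. It is not the jsngram.text2 API: the helper name normalize_text is hypothetical, and the choice of the NFKC normalization form is an assumption, picked because it folds the fullwidth and halfwidth duplicates described above into a single code point.

```python
import unicodedata


def normalize_text(text):
    """Hypothetical helper (not part of jsngram): fold compatibility
    duplicates such as fullwidth Latin and halfwidth Katakana.

    Assumption: NFKC is used as the normalization form.
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth "a" (U+FF41) becomes the ordinary "a" (U+0061).
print(normalize_text("\uff41") == "\u0061")

# Halfwidth Katakana "a" (U+FF71) becomes the standard form (U+30A2).
print(normalize_text("\uff71") == "\u30a2")
```

The same function should be applied both to documents at indexing time and to query strings at search time, otherwise the two sides may still disagree on which form of a letter they contain.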

Alphabet case

Alphabetic characters are always indexed in lower case.
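
As a rough sketch of how case folding could be combined with normalization before indexing, the following uses only the Python standard library. The function name to_index_form is hypothetical, and both the NFKC form and the order of the two steps are assumptions rather than a description of how jsngram works internally.

```python
import unicodedata


def to_index_form(text):
    """Hypothetical helper (not part of jsngram): normalize, then lower-case.

    Assumptions: NFKC normalization, followed by str.lower().
    """
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.lower()


# Mixed-case ASCII is folded to lower case.
print(to_index_form("Unicode") == "unicode")

# Fullwidth capital "A" (U+FF21) ends up as the ordinary lower-case "a".
print(to_index_form("\uff21") == "a")
```

If query strings pass through the same function, a search for "Unicode" and a search for "unicode" look up the same index entries.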