Unicode normalization and alphabet case
Unicode normalization
Unicode incorporates characters from many older national character sets.  
As a result, it contains more than one code point for what is effectively the same letter.  
For example, the letter "a" U+0061 also exists as 
U+FF41, "Fullwidth Latin Small Letter A", 
which is common in Japanese computing environments.  
Japanese Katakana letters have the same problem.  
The letter "ア" is encoded both as U+30A2 and as the halfwidth form "ｱ" U+FF71.  
Such variants should be unified before indexing, 
so that a search matches whichever form the text or the query uses.  
      
The jsngram.text2 module provides a function to perform Unicode normalization.  
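The effect of unifying these variants can be sketched with Python's standard unicodedata module. This is only an illustration: it assumes compatibility normalization (NFKC), and the helper name unify is hypothetical. It is not the jsngram.text2 function, whose name and exact normalization form are not given here.

```python
import unicodedata


def unify(text: str) -> str:
    """Fold compatibility variants (fullwidth, halfwidth) into one form.

    Hypothetical helper for illustration; NFKC is an assumed choice.
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth Latin small letter a (U+FF41) becomes the ordinary "a" (U+0061).
print(unify("\uff41") == "\u0061")

# Halfwidth Katakana a (U+FF71) becomes the regular Katakana "ア" (U+30A2).
print(unify("\uff71") == "\u30a2")
```

Both comparisons hold under NFKC, so text written with either variant would produce the same index entries once it is normalized this way.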
      
Alphabet case
Alphabetic characters are always indexed in lower case.
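A minimal sketch of what lower-case indexing implies for matching, assuming plain str.lower() folding. The helper to_index_form is hypothetical and is not part of jsngram; the library's actual case-folding routine is not described here.

```python
def to_index_form(text: str) -> str:
    """Return the lower-case form under which text would be indexed.

    Hypothetical helper; assumes simple str.lower() folding.
    """
    return text.lower()


document = "Unicode Normalization"
query = "UNICODE"

# Once both sides are folded to lower case, the query matches
# regardless of how either was originally capitalized.
print(to_index_form(query) in to_index_form(document))
```

Under this assumption, a query has to go through the same folding as the indexed text; otherwise an upper-case query would find nothing in a lower-case index.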