How white space in text is handled
Default action
Spaces, tabs, line feeds, and punctuation marks in both English and Japanese are treated as delimiters. Text is split at the delimiters before indexing, so no key spans a delimiter. A side effect is that "3.14" cannot be searched for as a whole.
For example, "my cat." is indexed as "my", "ca", "at", "m", "y", "c", "a" and "t". By contrast, "my cat!" is indexed as "my", "ca", "at", "t!", "m", "y", "c", "a", "t" and "!"; here the "!" stays inside the keys. And "3.14" is indexed as "3", "14", "1" and "4".
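The splitting described above can be sketched in a few lines of Python. This is only an illustration, not jsngram's own code: the function name ngram_keys and the DELIMITERS pattern are assumptions made for this example, and the pattern deliberately leaves out "!" so that the results match the examples above.

```python
import re

# Hypothetical delimiter pattern for illustration only: white space plus a
# few English and Japanese punctuation marks. It is NOT taken from jsngram.
DELIMITERS = re.compile(r"[\s,.，．、。]+")


def ngram_keys(text, n=2):
    """Return the set of 1..n character keys that do not cross a delimiter."""
    keys = set()
    # Split first, so that every key comes from a single delimiter-free chunk.
    for chunk in DELIMITERS.split(text):
        for size in range(1, n + 1):
            for i in range(len(chunk) - size + 1):
                keys.add(chunk[i:i + size])
    return keys


print(sorted(ngram_keys("my cat.")))
print(sorted(ngram_keys("3.14")))
```

Because "3" and "14" end up in separate chunks, no key contains the ".", which is why "3.14" cannot be found as one string.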
Customize
The ignore property of the jsngram.jsngram.JsNgram class controls this behavior.
You can add or remove delimiter characters.
The system also works with no delimiters at all.
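To see what changing the delimiters does to the keys, here is a small standalone sketch. It does not use the JsNgram class or its ignore property; the function ngram_keys and its delimiters argument (a regular expression string, or None for no delimiters) are hypothetical names chosen for this example.

```python
import re


def ngram_keys(text, delimiters, n=2):
    """Collect 1..n character keys; `delimiters` is a regex string or None."""
    # With no delimiters, the whole text is treated as a single chunk.
    chunks = re.split(delimiters, text) if delimiters else [text]
    return {
        chunk[i:i + size]
        for chunk in chunks
        for size in range(1, n + 1)
        for i in range(len(chunk) - size + 1)
    }


with_dot = ngram_keys("3.14", r"[\s.]+")   # "." is a delimiter
without_dot = ngram_keys("3.14", r"\s+")   # "." removed from the delimiters
no_delims = ngram_keys("a b", None)        # no delimiters at all

print(sorted(with_dot))
print(sorted(without_dot))
print(sorted(no_delims))
```

Once "." is no longer a delimiter, keys such as "3." and ".1" appear, so the "." becomes part of the index. With no delimiters, even the space is indexed, in keys such as "a " and " b".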
Preparation
If you want the index to include spaces or line feeds, some preprocessing before indexing is recommended. Collapse consecutive spaces into a single space, and likewise reduce consecutive line feeds to one. If the text is sensitive to such changes, this preprocessing is not appropriate.
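A minimal sketch of such preprocessing follows. The function name normalize_whitespace is an assumption made for this example and is not part of jsngram. It also assumes that line feeds are plain "\n" characters and that "reduce to one" means collapsing runs of them.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces to one space and runs of line feeds to one."""
    text = re.sub(r" {2,}", " ", text)    # "a   b"   -> "a b"
    text = re.sub(r"\n{2,}", "\n", text)  # "a\n\n\nb" -> "a\nb"
    return text


print(repr(normalize_whitespace("my   cat\n\n\nsleeps")))
```

Run this on the text before handing it to the indexer, and apply the same normalization to search words so that both sides agree.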