N-gram JSON index structure

N-gram

A notable advantage of the N-gram method is that it works without any word delimiters. Japanese text does not put delimiters between words, so it is hard to extract "words" mechanically to use as the minimum unit of an index key.

This library is optimized for bigrams (size 2), though the size can be changed. By default, indexes of sizes 2 and 1 are generated, so the text "cat" is indexed by "ca", "at", "c", "a" and "t".
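The key generation described above can be sketched as follows. This is only an illustration: the function name extractNgrams and its parameters are hypothetical, not the library's API, and the sketch assumes characters are counted as JavaScript string units (UTF-16 code units).

```javascript
// Hypothetical helper, not the library's API: list every n-gram of the
// given sizes together with its 0-based position in the text.
function extractNgrams(text, sizes = [2, 1]) {
  const entries = [];
  for (const size of sizes) {
    // Slide a window of `size` characters over the text.
    for (let pos = 0; pos + size <= text.length; pos++) {
      entries.push([text.slice(pos, pos + size), pos]);
    }
  }
  return entries;
}

const keys = extractNgrams("cat").map(([key]) => key);
console.log(keys); // [ 'ca', 'at', 'c', 'a', 't' ]
```

Passing a different sizes array, for example [3], would produce trigrams instead.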

This library stores indexes in JSON files. Everything is plain text and easy to handle from any scripting language, including JavaScript.

Index file name

To make reads fast, the library places index files in a hierarchy of directories. Each index key is encoded as an ASCII string of hexadecimal Unicode numbers. For example, the key "ca" becomes "00630061", because "c" is U+0063 and "a" is U+0061. This makes all Unicode characters, even control characters, safe to use as index keys.
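A minimal sketch of this encoding is shown below. The name encodeKey is hypothetical. The sketch assumes four hexadecimal digits per UTF-16 code unit, as in the "ca" example, and lowercase hex letters; the source does not state how the library handles letter case or characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical helper, not the library's API: encode each UTF-16 code unit
// of the key as four hexadecimal digits (lowercase is an assumption).
function encodeKey(key) {
  let encoded = "";
  for (let i = 0; i < key.length; i++) {
    encoded += key.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return encoded;
}

console.log(encodeKey("ca")); // 00630061
```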

Files can also be placed flat in a single directory, but by default they are placed in a tree of directories, because too many files in a single directory slow down the file system. The index file for the key above is "./00/63/00/61.json".
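The mapping from an encoded key to its place in the directory tree might look like the sketch below. The name encodedKeyToPath is hypothetical, and splitting the encoded key into two-character segments is inferred from the single example path given here.

```javascript
// Hypothetical helper, not the library's API: split the encoded key into
// two-character segments and use them as nested directory names.
function encodedKeyToPath(encoded) {
  const segments = encoded.match(/.{1,2}/g) || [];
  return "./" + segments.join("/") + ".json";
}

console.log(encodedKeyToPath("00630061")); // ./00/63/00/61.json
```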

Index file format

Each index file is a JSON array of arrays. Each inner array has two elements: an identifier and a position. The identifier is a unique ID for a single file, and it may be a file path. The position is the location of the key string within that file. The outer array lists every pair in which the key string was found.

[
  ["a/b.txt", 129],
  ["c/d/e.txt", 148]
]
      

For example, when the JSON file for the key "ca" is as above, it means that "ca" is found in "a/b.txt" at position 129 and in "c/d/e.txt" at position 148. The position is the offset of the occurrence from the beginning of the text, with 0 for the first character.
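To show how such pairs could be produced, here is a rough in-memory sketch. The function buildIndex and the two sample texts are made up for illustration and are not taken from the library; the sketch only assumes the default sizes 2 and 1 and 0-based positions as described above.

```javascript
// Hypothetical sketch, not the library's API: map each key to the list of
// [identifier, position] pairs where it occurs.
function buildIndex(docs, sizes = [2, 1]) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const size of sizes) {
      for (let pos = 0; pos + size <= text.length; pos++) {
        const key = text.slice(pos, pos + size);
        if (!index.has(key)) index.set(key, []);
        index.get(key).push([id, pos]);
      }
    }
  }
  return index;
}

// Made-up sample contents, keyed by file path.
const index = buildIndex({ "a/b.txt": "a cat", "c/d/e.txt": "cat" });
console.log(JSON.stringify(index.get("ca")));
// [["a/b.txt",2],["c/d/e.txt",0]]
```

Each value in the map has the same shape as the content of one index file, so writing it out with JSON.stringify would give a file in the format described above.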

Example

The online demo page shows all index files generated for short example contents.