N-gram json builder

Python built-in json builder

Writing an index to a text file with json.dump is simple and fast, but the whole index object must fit in the machine's physical memory. For a small collection of texts this works nicely. But if you want to handle a huge number of documents, such as everything stored on a terabyte drive, the index will exceed physical memory and this approach breaks down.

The to_json() method uses this approach.
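As a minimal sketch of the in-memory approach (the index shape shown here is illustrative, not necessarily jsngram's exact format): the whole inverted index is held in one dict and written out in a single call, so memory use grows with the corpus.

```python
import json

# Toy inverted index: n-gram -> list of [document id, position] entries.
# The entire dict must fit in physical memory before json.dump can run.
index = {
    "ab": [["doc1", 0], ["doc2", 3]],
    "bc": [["doc1", 1]],
}

# One call writes the whole index; simple and fast for small corpora.
with open("index.json", "w", encoding="utf-8") as f:
    json.dump(index, f, ensure_ascii=False)

# Reading it back also loads the whole index at once.
with open("index.json", encoding="utf-8") as f:
    restored = json.load(f)
```

This is exactly why the approach fails at scale: both the dump and the load are all-or-nothing.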

Recordable json builder

The jsngram.json2 module gives an additive way of building a JSON index. With it, a huge number of documents costs nothing but processing time. You can control how many files are accumulated in a single in-memory object, and flush the result to the JSON files after each batch is appended. This keeps the object from growing too large.

The add_files_to_json() method uses this approach.
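The batched flush idea can be sketched in plain Python as follows. This is a hypothetical illustration of the technique, not jsngram's actual implementation: the helper names (`flush_batch`, `index_documents`) and the one-file-per-n-gram layout are assumptions, and real n-grams would need escaping before use as file names.

```python
import json
import os

def flush_batch(batch, out_dir):
    """Append a batch of entries to per-n-gram JSON files.

    `batch` maps n-gram -> list of new [doc_id, position] entries.
    Only the current batch lives in memory; each n-gram file on disk
    is read, extended, and rewritten.  (Hypothetical helper; assumes
    n-grams are safe to use as file names.)
    """
    os.makedirs(out_dir, exist_ok=True)
    for gram, entries in batch.items():
        path = os.path.join(out_dir, gram + ".json")
        existing = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                existing = json.load(f)
        existing.extend(entries)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(existing, f, ensure_ascii=False)

def index_documents(docs, n, out_dir, batch_size=100):
    """Index (doc_id, text) pairs, flushing every `batch_size` documents."""
    batch = {}
    for count, (doc_id, text) in enumerate(docs, 1):
        for pos in range(len(text) - n + 1):
            batch.setdefault(text[pos:pos + n], []).append([doc_id, pos])
        if count % batch_size == 0:
            flush_batch(batch, out_dir)  # flush, then start a fresh object
            batch = {}
    if batch:
        flush_batch(batch, out_dir)
```

Because each flush empties the in-memory object, peak memory depends on the batch size rather than on the total number of documents.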

Differential builder

Rebuilding the index from scratch takes a long time, perhaps several hours. It would be nice to have a way of making differential updates when only a few documents have changed, but this is not implemented yet.

Because we already have the recordable way described above, the differential way would not be so hard. Just remove the old index entries and append new ones, or compare old and new to compute a minimal set of changes and modify the JSON files accordingly. Either way, to remove old entries, we need to keep the previous version of each document that was used to build them.
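The remove-then-append idea could look roughly like this. Everything here is a hypothetical sketch of the unimplemented feature: the function names are invented, and it assumes the same one-file-per-n-gram layout as the recordable builder plus access to the previously indexed text of each document.

```python
import json
import os

def remove_doc(doc_id, old_text, n, out_dir):
    """Strip a document's old entries from the index.

    The previously indexed text tells us exactly which n-gram files
    mention this document, so only those files are touched.
    (Hypothetical sketch; not part of jsngram today.)
    """
    grams = {old_text[i:i + n] for i in range(len(old_text) - n + 1)}
    for gram in grams:
        path = os.path.join(out_dir, gram + ".json")
        if not os.path.exists(path):
            continue
        with open(path, encoding="utf-8") as f:
            entries = json.load(f)
        entries = [e for e in entries if e[0] != doc_id]
        with open(path, "w", encoding="utf-8") as f:
            json.dump(entries, f, ensure_ascii=False)

def update_doc(doc_id, old_text, new_text, n, out_dir):
    """Differential update: remove old entries, then append new ones."""
    os.makedirs(out_dir, exist_ok=True)
    remove_doc(doc_id, old_text, n, out_dir)
    for pos in range(len(new_text) - n + 1):
        gram = new_text[pos:pos + n]
        path = os.path.join(out_dir, gram + ".json")
        entries = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                entries = json.load(f)
        entries.append([doc_id, pos])
        with open(path, "w", encoding="utf-8") as f:
            json.dump(entries, f, ensure_ascii=False)
```

The cost is proportional to the size of the changed documents, not the whole corpus, which is the whole point of a differential builder.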