N-gram JSON builders
Python built-in JSON builder
Writing the index to a text file with json.dump
is simple and fast.
The problem is the size of the in-memory object compared to the physical memory of the machine.
For a small amount of text it works nicely.
But if you want to handle a large (huge) number of documents,
such as everything stored on a terabyte drive,
the whole index will exceed physical memory
and this approach becomes unusable.
The to_json() method uses this approach.
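To illustrate the idea, here is a generic sketch of building a small character n-gram index in memory and dumping it in a single call. This is not jsngram's actual code; the function names are for illustration only.

```python
import json
import os
import tempfile

def ngrams(text, n=2):
    """Overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in set(ngrams(text, n)):
            index.setdefault(gram, []).append(doc_id)
    return index

docs = {"d1": "hello", "d2": "help"}
index = build_index(docs)

# One json.dump call writes the whole index -- simple and fast,
# but the entire dict has to fit in physical memory first.
path = os.path.join(tempfile.mkdtemp(), "index.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(index, f, ensure_ascii=False)
```

The single dump is the whole story: there is no way to emit part of the index and release memory, which is exactly the limitation described above.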
Recordable JSON builder
The jsngram.json2 module
provides an additive way of building JSON files.
With it, a huge number of documents costs nothing but processing time:
you can control how many files are accumulated in a single in-memory object,
and flush the result to the JSON files after each append.
This keeps the object from growing too large.
The add_files_to_json() method uses this approach.
Differential builder
Rebuilding the index from scratch takes a long time, perhaps several hours. It would be nice to have differential updates for when only a limited set of documents has changed, but this is not implemented yet.
Because we already have the recordable (additive) way described above, the differential way would not be hard: remove the old index entries and append the new ones, or compare the two to compute a minimal set of changes and modify the JSON files in place. Either way, to remove the old entries we need to keep the previous versions of the documents that were used to build them.