

Warning: This documentation is for an older version of Terrier.

The inverted index data structure contains a collection of postings lists, a data structure which maintains information about the occurrence of terms in documents. These are represented within Terrier as implementations of two specific interfaces, namely Posting and IterablePosting. Terrier supports four different types of payload contained within each Posting (or children interfaces):

- document ids
- term frequencies
- field frequencies, a.k.a. term frequency by field (e.g.: URL, title, body, incoming anchor text) (implemented by FieldPosting)
- term positions within the document (implemented by BlockPosting)

By default, Terrier compresses posting lists as a stream of postings. It uses Elias' Gamma compression schema (codec) to compress doc ids and term positions, and the Unary codec to compress term and field frequencies. The particular compression configuration is defined by the CompressionConfiguration class. For more information, please refer to .DirectInvertedOutputStream (and children) for documentation on postings compression, and to .bit.BasicIterablePosting (and children) for documentation on postings decompression.

New in version 4.0, Terrier supports more modern compression codecs, such as the state-of-the-art PForDelta codec. In particular, a new integer compression layer allows the transparent use of compression schemes from JavaFastPFOR by Daniel Lemire, and Kamikaze by LinkedIn. Indeed, the new integer compression layer defines a new CompressionConfiguration (namely IntegerCodecCompressionConfiguration), which can be configured to use various codecs for each compression payload (document ids, term frequencies, field frequencies, term positions).

Codec descriptions (class names are given relative to .codec):

- JavaFastPFOR's Frame-of-Reference implementation
- JavaFastPFOR's FastPFOR implementation - NB: A larger chunk-size is recommended for this codec.

When using these codecs, the Terrier infrastructure (de)compresses postings in chunks. The size of these chunks can be set at indexing time using the properties .chunk.size for the direct index, and .chunk.size for the inverted index.

Terrier 4.0 can perform classical two-pass indexing (i.e. bin/trec_terrier.sh -i) using the aforementioned codecs. To do so, some properties have to be set. For instance, the direct and inverted index can be stored compressed in blocks of 1024 postings using the NewPFD codec.

You can also plug a new compression schema into Terrier by implementing your own CompressionConfiguration. If IntegerCodec meets your requirements, you can implement it and use IntegerCodecCompressionConfiguration directly.

The properties that control compression are described as follows:

- The class that defines the compression configuration to be used on the inverted (direct) index at indexing time. Only classical indexing supports pluggable compression. Default: CompressionFactory$BitCompressionConfiguration.
- Number of postings to be compressed at a time (used only with IntegerCodecCompressionConfiguration).
- The codec to be used to compress document identifiers in the inverted index (used only with IntegerCodecCompressionConfiguration). For the direct index, the codec to be used for the term identifiers.
- The codec to be used to compress term frequencies in the inverted (direct) index (used only with IntegerCodecCompressionConfiguration).
- The codec to be used to compress field frequencies in the inverted (direct) index (used only with IntegerCodecCompressionConfiguration, optional).
- The codec to be used to compress term positions in the inverted (direct) index (used only with IntegerCodecCompressionConfiguration, optional).

Inverted indices built by single-pass (i.e. bin/trec_terrier.sh -i -j) or MapReduce (i.e. bin/trec_terrier.sh -i -H) indexing can be re-compressed using the InvertedIndexRecompresser class. For example, one can re-compress an inverted index using the OptPFD codec. This is done by running InvertedIndexRecompresser with the corresponding properties set. Please note that InvertedIndexRecompresser overwrites the original inverted index with the re-compressed one.
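To make the default bit-level codecs mentioned above more concrete, the following is a minimal sketch of unary and Elias gamma coding on bit strings. It is an illustration only, not Terrier's implementation: the class name BitCodecSketch and its methods are hypothetical, and the bit convention used here (one-bits terminated by a zero-bit) is an assumption that may differ from the layout Terrier writes to disk.

```java
// Hypothetical illustration class; not part of Terrier.
final class BitCodecSketch {

    // Unary code for x >= 1: (x - 1) one-bits followed by a terminating zero-bit.
    // Assumed convention; other implementations swap the roles of 0 and 1.
    static String unary(int x) {
        if (x < 1) {
            throw new IllegalArgumentException("x must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < x; i++) {
            sb.append('1');
        }
        return sb.append('0').toString();
    }

    // Elias gamma code for x >= 1: the bit length of x in unary,
    // followed by the binary digits of x without its leading one-bit.
    static String gamma(int x) {
        if (x < 1) {
            throw new IllegalArgumentException("x must be >= 1");
        }
        String binary = Integer.toBinaryString(x);
        return unary(binary.length()) + binary.substring(1);
    }

    // Decodes a single gamma code produced by gamma().
    static int decodeGamma(String bits) {
        int n = bits.indexOf('0') + 1;   // unary prefix gives the bit length of the value
        int value = 1;                   // the implicit leading one-bit
        for (int i = n; i < 2 * n - 1; i++) {
            value = (value << 1) | (bits.charAt(i) - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(unary(3));               // 110
        System.out.println(gamma(5));               // 11001
        System.out.println(decodeGamma(gamma(9)));  // 9
    }
}
```

Small values get short codes under both schemes, which is why they suit frequencies and gaps between sorted identifiers, where small numbers dominate.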

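The chunked (de)compression described above can be pictured with a small sketch. It splits an ascending list of document ids into chunks of a given size and stores gaps between consecutive ids, which is the kind of small-integer block that PForDelta-style codecs are designed to compress. The class ChunkingSketch and its methods are hypothetical, and the use of gaps carried across chunk boundaries is an assumption made for illustration; it does not claim to reproduce how Terrier lays out its chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Terrier.
final class ChunkingSketch {

    // Splits ascending doc ids into chunks of at most chunkSize entries
    // and replaces each id with its gap from the previous id.
    static List<int[]> toGapChunks(int[] docIds, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        List<int[]> chunks = new ArrayList<>();
        int previous = 0;
        for (int start = 0; start < docIds.length; start += chunkSize) {
            int len = Math.min(chunkSize, docIds.length - start);
            int[] gaps = new int[len];
            for (int i = 0; i < len; i++) {
                gaps[i] = docIds[start + i] - previous;
                previous = docIds[start + i];
            }
            chunks.add(gaps);
        }
        return chunks;
    }

    // Rebuilds the original doc ids by summing the gaps chunk by chunk.
    static int[] fromGapChunks(List<int[]> chunks) {
        int total = 0;
        for (int[] chunk : chunks) {
            total += chunk.length;
        }
        int[] docIds = new int[total];
        int previous = 0;
        int out = 0;
        for (int[] chunk : chunks) {
            for (int gap : chunk) {
                previous += gap;
                docIds[out++] = previous;
            }
        }
        return docIds;
    }
}
```

In this picture, a larger chunk size gives a block codec more values to amortise its per-block overhead over, at the cost of decompressing more postings at a time.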