CIFF + PyTerrier

The Common Index File Format (CIFF) is a binary data exchange format that lets open-source search engines interoperate by sharing index structures.

CIFF Project

pyterrier-ciff gives access to CIFF files. It provides the following core functionality:

  • Build CIFF indexes from built indexes. [example]

  • Build CIFF indexes from learned sparse retrieval models. [example]

  • Parse CIFF files to get the postings and document records. [example]

  • Share and load CIFF files to/from HuggingFace datasets. [example]

Quick Start

You can install pyterrier-ciff with pip:

Install pyterrier-ciff
$ pip install pyterrier-ciff

Building from an Index

Many indexes, such as those from Terrier and PISA, provide a get_corpus_iter() method that iterates through the sparse representations. You can use these methods with pyterrier_ciff.index() to construct a CIFF file:

Build a CIFF index from a Terrier index
>>> import pyterrier as pt
>>> import pyterrier_ciff
>>> terrier_index = pt.IndexFactory.of('my_index.terrier')
>>> pyterrier_ciff.index(terrier_index, 'my_index.ciff')
CiffIndex('my_index.ciff')

Note

pyterrier_ciff.index() uses reasonable default settings. Use CiffIndexer if you need more control over how the CIFF file is constructed.

Share and Load with HuggingFace Datasets

CiffIndex allows you to share your CIFF files on HuggingFace datasets using to_hf:

Upload a CIFF index to HuggingFace
>>> from pyterrier_ciff import CiffIndex
>>> index = CiffIndex('my_index.ciff')
>>> index.to_hf('username/my_index.ciff')

Danger

Note that uploads to HuggingFace Datasets are public by default. Be sure not to upload an index that you are not allowed to share!

Similarly, you can download CIFF indexes that others have shared on HuggingFace using from_hf:

Load a CIFF index from HuggingFace
>>> from pyterrier_ciff import CiffIndex
>>> index = CiffIndex.from_hf('username/my_index.ciff')

You can find a list of available CIFF artifacts on HuggingFace datasets.

Note

to_hf and from_hf are provided by PyTerrier’s Artifact API.

API Documentation

CiffIndex is the primary class for interacting with CIFF in PyTerrier.

class pyterrier_ciff.CiffIndex(path)[source]

Represents a CIFF “index” file.

CIFF files are a compact binary format for storing and sharing inverted indexes using Protocol Buffers.

Create a reference to a CIFF index.

Parameters:

path – The path to the CIFF file or directory containing the CIFF file. If the path does not exist, it must be built using indexer() before it can be used.

indexer(*, scale=100.0, description='pyterrier-ciff', verbose=True)[source]

Create a CIFF indexer.

The indexer accepts an iterable with docno and toks fields.

Return type:

Indexer

Parameters:
  • scale – The scaling factor for term frequencies. Defaults to 100.

  • description – The description of the index. Defaults to ‘pyterrier-ciff’.

  • verbose – Whether to show a progress bar. Defaults to True.

built()[source]

Check if the index has been built.

Return type:

bool

ciff_file_path()[source]

Get the path to the CIFF file.

Return type:

Path

header()[source]

Get the header of the CIFF file (if it has been built).

Return type:

Header

records_iter()[source]

Iterate over the PostingsList and DocRecord records in the CIFF file (if it has been built).

Return type:

Iterator[Union[PostingsList, DocRecord]]
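
Because records_iter() returns a mixed stream, a consumer usually has to tell the two record kinds apart. The sketch below is one possible way to do that, not part of the library: split_records is a hypothetical helper, and it assumes that only PostingsList carries a postings attribute, as the attribute lists in this documentation suggest. Stand-in objects are used so the example runs without a CIFF file.

```python
from types import SimpleNamespace


def split_records(records):
    """Separate a mixed record stream into postings lists and doc records.

    Hypothetical helper: dispatches on the `postings` attribute, which
    only PostingsList is documented to carry (an assumption).
    """
    postings_lists, doc_records = [], []
    for record in records:
        if hasattr(record, "postings"):
            postings_lists.append(record)
        else:
            doc_records.append(record)
    return postings_lists, doc_records


# Stand-ins mimicking the documented attributes. With a built index you
# would pass CiffIndex('my_index.ciff').records_iter() instead.
stream = [
    SimpleNamespace(term="a", df=1, cf=2, postings=[]),
    SimpleNamespace(docid=0, collection_docid="100", doclength=143),
]
terms, docs = split_records(stream)
print(len(terms), "postings lists,", len(docs), "doc records")
```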

class pyterrier_ciff.CiffIndexer(index, *, scale=100.0, description='pyterrier-ciff', verbose=True)[source]

An indexer that produces a CiffIndex.

Create a CIFF indexer.

Parameters:
  • index – A CIFF index object or the path to the CIFF file to create.

  • scale – The scaling factor for term frequencies. Defaults to 100.

  • description – The description of the index. Defaults to ‘pyterrier-ciff’.

  • verbose – Whether to show a progress bar. Defaults to True.

index(inp)[source]

Index the input documents.

Return type:

CiffIndex

Parameters:

inp – An iterable with docno and toks fields.

Returns:

The built CIFF index.
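
As an illustration of the expected input shape, the sketch below builds records with docno and toks fields from raw text. The to_records helper and its whitespace tokenization are hypothetical and only meant to show the structure; the toks layout (a dict mapping each term to a weight) follows the invert() example in this documentation.

```python
from collections import Counter


def to_records(texts):
    """Yield {'docno', 'toks'} dicts from a mapping of docno -> text.

    Hypothetical helper: toks maps each lowercased, whitespace-separated
    token to its raw count in the document.
    """
    for docno, text in texts.items():
        counts = Counter(text.lower().split())
        yield {"docno": docno, "toks": dict(counts)}


records = list(to_records({"100": "a b b", "101": "b c"}))
for record in records:
    print(record)
```

Records of this shape could then be passed to CiffIndexer.index(). Since scale defaults to 100, raw counts would presumably need a scale of 1 to be stored unchanged; that is an inference from the invert() example rather than something stated here.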

pyterrier_ciff.index(inp, ciff_path)[source]

Index the input to the provided path.

Return type:

CiffIndex

Parameters:
  • inp – An iterable of documents to index to CIFF, or an object that exposes a get_corpus_iter method.

  • ciff_path – The path to write the CIFF index.

Returns:

The built CIFF index.

pyterrier_ciff.invert(inp, *, scale=100.0, verbose=False)[source]

Inverts the provided stream of documents, yielding “doc” and “term” records as they are finalized.

The function yields all documents before terms. It also assigns dids and tids starting from 0 and increasing by 1.

Return type:

Iterator[InvertRecord]

Invert a stream of documents using invert()
>>> from pyterrier_ciff import invert
>>> docs = [
...   {"docno": "100", "toks": {"a": 0.02, "b": 1.41}},
...   {"docno": "101", "toks": {"b": 2.15, "c": -3.83, "d": 4.65, "e": 0.42}},
... ]
>>> for record in invert(docs):
...   print(record)
InvertRecord(type='doc', data=InvertDoc(did=0, docno='100', tids=array([0, 1]), tfs=array([2, 141])))
InvertRecord(type='doc', data=InvertDoc(did=1, docno='101', tids=array([1, 2, 3]), tfs=array([215, 465, 42])))
InvertRecord(type='term', data=InvertTerm(tid=0, term='a', dids=array([0]), tfs=array([2])))
InvertRecord(type='term', data=InvertTerm(tid=1, term='b', dids=array([0, 1]), tfs=array([141, 215])))
InvertRecord(type='term', data=InvertTerm(tid=2, term='d', dids=array([1]), tfs=array([465])))
InvertRecord(type='term', data=InvertTerm(tid=3, term='e', dids=array([1]), tfs=array([42])))
Parameters:
  • inp – An iterable with docno and toks fields.

  • scale – The scaling factor for term frequencies. Defaults to 100.

  • verbose – Whether to show a progress bar. Defaults to False.
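
To make the inversion step concrete, here is a minimal in-memory sketch that reproduces the ids and frequencies of the example above. It is not the library's implementation: invert_sketch is a hypothetical name, plain tuples stand in for InvertRecord, and two behaviours are assumptions inferred from the example output, namely that weights are multiplied by scale and rounded to integers, and that terms whose scaled value is not positive (such as "c") are dropped.

```python
def invert_sketch(docs, scale=100.0):
    """Toy inversion: yields all 'doc' tuples first, then all 'term' tuples."""
    term_ids = {}   # term -> tid, assigned on first retained occurrence
    terms = []      # terms[tid] = term string
    postings = []   # postings[tid] = ([dids], [tfs])
    for did, doc in enumerate(docs):
        tids, tfs = [], []
        for term, weight in doc["toks"].items():
            tf = int(round(weight * scale))  # assumption: round to nearest int
            if tf <= 0:                      # assumption: drop non-positive values
                continue
            if term not in term_ids:
                term_ids[term] = len(terms)
                terms.append(term)
                postings.append(([], []))
            tid = term_ids[term]
            tids.append(tid)
            tfs.append(tf)
            postings[tid][0].append(did)
            postings[tid][1].append(tf)
        yield ("doc", did, doc["docno"], tids, tfs)
    for tid, term in enumerate(terms):
        dids, tfs = postings[tid]
        yield ("term", tid, term, dids, tfs)


example_docs = [
    {"docno": "100", "toks": {"a": 0.02, "b": 1.41}},
    {"docno": "101", "toks": {"b": 2.15, "c": -3.83, "d": 4.65, "e": 0.42}},
]
inverted = list(invert_sketch(example_docs))
for record in inverted:
    print(record)
```

Unlike this sketch, which holds every postings list in memory until the end, the documented function yields records as they are finalized.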

Protobuf Bindings

The following classes are auto-generated by Protobuf and are returned by CiffIndex.

class pyterrier_ciff.Header

A CIFF header, which provides metadata about the CIFF index.

Attributes:
  • version (int): Version.

  • num_postings_lists (int): Exactly the number of PostingsList messages present in the CIFF index.

  • num_docs (int): Exactly the number of DocRecord messages present in the CIFF index.

  • total_postings_lists (int): The total number of postings lists in the collection, representing the vocabulary size. This might differ from num_postings_lists, for example when the CIFF file only includes the postings lists of query terms.

  • total_docs (int): The total number of documents in the collection, which might differ from num_docs for reasons similar to the above.

  • total_terms_in_collection (int): The total number of terms across the entire collection, calculated as the sum of all document lengths.

  • average_doclength (float): The average length of documents in the collection, stored explicitly for a desired level of precision.

  • description (str): A human-readable description of this index, detailing aspects like the exporting application, document processing, and tokenization pipeline.
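
The relationship between the collection-level statistics can be shown with a small worked example. The document lengths below are hypothetical values chosen for illustration; the arithmetic follows the attribute descriptions above (the total is the sum of all document lengths, and the average is that total divided by the number of documents).

```python
# Hypothetical document lengths for a two-document collection.
doclengths = [143, 722]

total_docs = len(doclengths)
total_terms_in_collection = sum(doclengths)
average_doclength = total_terms_in_collection / total_docs

print(total_docs, total_terms_in_collection, average_doclength)
```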

class pyterrier_ciff.PostingsList

A CIFF postings list, which holds the postings and metadata for an individual term.

Attributes:
  • term (str) – The term.

  • df (int) – The document frequency, representing the number of documents containing the term.

  • cf (int) – The collection frequency, representing the total occurrences of the term across the collection.

  • postings (List[ Posting ]) – A list of postings associated with the term.
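
As a quick illustration of how df and cf relate to the postings, the sketch below derives both from a list of (docid, tf) pairs. The pairs are plain tuples standing in for Posting messages, with values borrowed from term "b" in the invert() example.

```python
# (docid, tf) pairs for a single term; tuples stand in for Posting messages.
postings = [(0, 141), (1, 215)]

df = len(postings)                  # number of documents containing the term
cf = sum(tf for _, tf in postings)  # total occurrences across the collection

print(df, cf)
```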

class pyterrier_ciff.Posting

A CIFF posting, which holds the term frequency (or impact score) for a document.

Attributes:
  • docid (int) – The delta-gap compressed document ID.

  • tf (int) – Term frequency within the document.
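
Since docid values are stored as gaps, recovering absolute document IDs takes a running sum. The sketch below assumes the usual delta-gap convention, in which the first value is the absolute ID and each later value is the difference from its predecessor; decode_docids and encode_docids are hypothetical helpers, not part of pyterrier-ciff.

```python
from itertools import accumulate


def decode_docids(gaps):
    """Turn delta gaps into absolute document IDs with a running sum."""
    return list(accumulate(gaps))


def encode_docids(docids):
    """Inverse of decode_docids for a sorted list of absolute IDs."""
    previous = [0] + docids[:-1]
    return [docid - prev for prev, docid in zip(previous, docids)]


absolute = decode_docids([3, 2, 7])
print(absolute)
print(encode_docids(absolute))
```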

class pyterrier_ciff.DocRecord

A CIFF document record, which holds information about a document.

Attributes:
  • docid (int) – Refers to the document ID in the postings lists.

  • collection_docid (str) – Refers to a document ID in the external collection.

  • doclength (int) – The length of the document.

Acknowledgements

This extension builds upon the CIFF initiative. If you use it, please be sure to cite CIFF:

Citation

Lin et al. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. SIGIR 2020. [link]
@inproceedings{DBLP:conf/sigir/LinMKMMSTV20,
  author       = {Jimmy Lin and
                  Joel M. Mackenzie and
                  Chris Kamphuis and
                  Craig Macdonald and
                  Antonio Mallia and
                  Michal Siedlaczek and
                  Andrew Trotman and
                  Arjen P. de Vries},
  editor       = {Jimmy X. Huang and
                  Yi Chang and
                  Xueqi Cheng and
                  Jaap Kamps and
                  Vanessa Murdock and
                  Ji{-}Rong Wen and
                  Yiqun Liu},
  title        = {Supporting Interoperability Between Open-Source Search Engines with
                  the Common Index File Format},
  booktitle    = {Proceedings of the 43rd International {ACM} {SIGIR} conference on
                  research and development in Information Retrieval, {SIGIR} 2020, Virtual
                  Event, China, July 25-30, 2020},
  pages        = {2149--2152},
  publisher    = {{ACM}},
  year         = {2020},
  url          = {https://doi.org/10.1145/3397271.3401404},
  doi          = {10.1145/3397271.3401404},
  timestamp    = {Wed, 07 Dec 2022 23:08:55 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/LinMKMMSTV20.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

This extension was written by Sean MacAvaney at the University of Glasgow. Check out the GitHub repository for a full list of contributors.