CIFF + PyTerrier¶
The Common Index File Format (CIFF) is an effort to define a binary data exchange format that lets open-source search engines interoperate by sharing index structures.
pyterrier-ciff gives access to CIFF files. It provides the following core functionality:
Build CIFF files from existing indexes. [example]
Build CIFF indexes from learned sparse retrieval models. [example]
Parse CIFF files to get the postings and document records. [example]
Share and load CIFF files to/from HuggingFace datasets. [example]
Quick Start¶
You can install pyterrier-ciff with pip:
$ pip install pyterrier-ciff
Building from an Index¶
Many indexes, such as those from Terrier and PISA, provide a get_corpus_iter() method that iterates through the sparse representations. You can use these methods with pyterrier_ciff.index() to construct a CIFF file:
>>> import pyterrier as pt
>>> import pyterrier_ciff
>>> terrier_index = pt.IndexFactory.of('my_index.terrier')
>>> pyterrier_ciff.index(terrier_index, 'my_index.ciff')
CiffIndex('my_index.ciff')
>>> from pyterrier_pisa import PisaIndex
>>> import pyterrier_ciff
>>> pisa_index = PisaIndex('my_index.pisa')
>>> pyterrier_ciff.index(pisa_index, 'my_index.ciff')
CiffIndex('my_index.ciff')
Note
pyterrier_ciff.index() uses reasonable default settings. Use CiffIndexer if you need more control over how the CIFF file is constructed.
API Documentation¶
CiffIndex is the primary class for interacting with CIFF in PyTerrier.
- class pyterrier_ciff.CiffIndex(path)[source]¶
Represents a CIFF “index” file.
CIFF files are a compact binary format for storing and sharing inverted indexes using Protocol Buffers.
Create a reference to a CIFF index.
- Parameters:
path – The path to the CIFF file or directory containing the CIFF file. If the path does not exist, it must be built using indexer() before it can be used.
- indexer(*, scale=100.0, description='pyterrier-ciff', verbose=True)[source]¶
Create a CIFF indexer.
The indexer accepts an iterable of records with docno and toks fields.
- Return type:
CiffIndexer
- Parameters:
scale – The scaling factor for term frequencies. Defaults to 100.
description – The description of the index. Defaults to ‘pyterrier-ciff’.
verbose – Whether to show a progress bar. Defaults to True.
- records_iter()[source]¶
Iterate over the PostingsList and DocRecord records in the CIFF file (if it has been built).
- Return type:
Iterator[Union[PostingsList, DocRecord]]
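Because records_iter() yields two kinds of records in one stream, a consumer typically has to tell them apart. The following is a minimal sketch of one way to do that, not part of pyterrier-ciff: the helper name summarize_records is hypothetical, and the demo uses SimpleNamespace stand-ins that only mimic the attribute names documented for PostingsList and DocRecord.

```python
from types import SimpleNamespace


def summarize_records(records):
    """Count postings lists, postings, and document records in a record stream."""
    summary = {"postings_lists": 0, "postings": 0, "docs": 0}
    for rec in records:
        if hasattr(rec, "postings"):
            # PostingsList-like record: one term with its postings
            summary["postings_lists"] += 1
            summary["postings"] += len(rec.postings)
        elif hasattr(rec, "collection_docid"):
            # DocRecord-like record: one document
            summary["docs"] += 1
    return summary


# Stand-in records for illustration only (not real protobuf messages).
demo_records = [
    SimpleNamespace(term="a", df=1, cf=2,
                    postings=[SimpleNamespace(docid=0, tf=2)]),
    SimpleNamespace(term="b", df=2, cf=356,
                    postings=[SimpleNamespace(docid=0, tf=141),
                              SimpleNamespace(docid=1, tf=215)]),
    SimpleNamespace(docid=0, collection_docid="100", doclength=143),
    SimpleNamespace(docid=1, collection_docid="101", doclength=722),
]

demo_summary = summarize_records(demo_records)
print(demo_summary)
```

With a built index, the same helper could presumably be given the iterator returned by records_iter() on a CiffIndex instead of the stand-in list.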
- class pyterrier_ciff.CiffIndexer(index, *, scale=100.0, description='pyterrier-ciff', verbose=True)[source]¶
An indexer that produces a CiffIndex.
Create a CIFF indexer.
- Parameters:
index – A CIFF index object or the path to the CIFF file to create.
scale – The scaling factor for term frequencies. Defaults to 100.
description – The description of the index. Defaults to ‘pyterrier-ciff’.
verbose – Whether to show a progress bar. Defaults to True.
- pyterrier_ciff.index(inp, ciff_path)[source]¶
Index the input to the provided path.
- Return type:
CiffIndex
- Parameters:
inp – An iterable of documents to index to CIFF, or an object that exposes a get_corpus_iter method.
ciff_path – The path to write the CIFF index.
- Returns:
The built CIFF index.
- pyterrier_ciff.invert(inp, *, scale=100.0, verbose=False)[source]¶
Inverts the provided stream of documents, yielding “doc” and “term” records as they are finalized.
The function yields all documents before terms. It also assigns dids and tids from 0 increasing by 1.
- Return type:
Iterator[InvertRecord]
>>> from pyterrier_ciff import invert
>>> docs = [
...     {"docno": "100", "toks": {"a": 0.02, "b": 1.41}},
...     {"docno": "101", "toks": {"b": 2.15, "c": -3.83, "d": 4.65, "e": 0.42}},
... ]
>>> for record in invert(docs):
...     print(record)
InvertRecord(type='doc', data=InvertDoc(did=0, docno='100', tids=array([0, 1]), tfs=array([2, 141])))
InvertRecord(type='doc', data=InvertDoc(did=1, docno='101', tids=array([1, 2, 3]), tfs=array([215, 465, 42])))
InvertRecord(type='term', data=InvertTerm(tid=0, term='a', dids=array([0]), tfs=array([2])))
InvertRecord(type='term', data=InvertTerm(tid=1, term='b', dids=array([0, 1]), tfs=array([141, 215])))
InvertRecord(type='term', data=InvertTerm(tid=2, term='d', dids=array([1]), tfs=array([465])))
InvertRecord(type='term', data=InvertTerm(tid=3, term='e', dids=array([1]), tfs=array([42])))
- Parameters:
inp – An iterable with docno and toks fields.
scale – The scaling factor for term frequencies. Defaults to 100.
verbose – Whether to show a progress bar. Defaults to False.
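To make the effect of scale concrete, here is a toy pure-Python sketch of the inversion step. It is not the library's implementation: the function name invert_sketch is hypothetical, and three behaviours are assumptions inferred from the example output above (weights are multiplied by scale and rounded to integers, non-positive results are dropped, and tids are assigned in first-seen order).

```python
def invert_sketch(docs, scale=100.0):
    """Toy inversion returning (doc_records, term_records) as plain tuples."""
    term_ids = {}     # term -> tid, assigned in first-seen order (assumption)
    terms = []        # tid -> term
    postings = []     # tid -> ([dids], [tfs])
    doc_records = []
    for did, doc in enumerate(docs):
        tids, tfs = [], []
        for term, weight in doc["toks"].items():
            tf = round(weight * scale)  # rounding is an assumption
            if tf <= 0:
                continue  # non-positive weights are dropped, as "c" is in the example
            if term not in term_ids:
                term_ids[term] = len(terms)
                terms.append(term)
                postings.append(([], []))
            tid = term_ids[term]
            tids.append(tid)
            tfs.append(tf)
            postings[tid][0].append(did)
            postings[tid][1].append(tf)
        doc_records.append((did, doc["docno"], tids, tfs))
    term_records = [
        (tid, terms[tid], dids, term_tfs)
        for tid, (dids, term_tfs) in enumerate(postings)
    ]
    return doc_records, term_records


docs = [
    {"docno": "100", "toks": {"a": 0.02, "b": 1.41}},
    {"docno": "101", "toks": {"b": 2.15, "c": -3.83, "d": 4.65, "e": 0.42}},
]
doc_records, term_records = invert_sketch(docs)
for record in doc_records + term_records:
    print(record)
```

Unlike invert(), this sketch holds everything in memory and returns plain lists rather than streaming InvertRecord objects; it is only meant to show how float weights become integer term frequencies.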
Protobuf Bindings¶
The following classes are auto-generated by Protobuf and are returned by CiffIndex.
- class pyterrier_ciff.Header¶
A CIFF header, which provides metadata about the CIFF index.
- Attributes:
version (int): Version.
num_postings_lists (int): Exactly the number of PostingsList messages present in the CIFF index.
num_docs (int): Exactly the number of DocRecord messages present in the CIFF index.
total_postings_lists (int): The total number of postings lists in the collection, representing the vocabulary size. This might differ from num_postings_lists as it may only include postings lists of query terms.
total_docs (int): The total number of documents in the collection, which might differ from num_docs for reasons similar to the above.
total_terms_in_collection (int): The total number of terms across the entire collection, calculated as the sum of all document lengths.
average_doclength (float): The average length of documents in the collection, stored explicitly for a desired level of precision.
description (str): A human-readable description of this index, detailing aspects like the exporting application, document processing, and tokenization pipeline.
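The relationship between the collection-level header fields can be illustrated with a small calculation. This is a sketch under the definitions given above, using a hypothetical helper name (collection_stats) and made-up document lengths; it does not read or write a real Header message.

```python
def collection_stats(doc_lengths):
    """Derive header-style collection statistics from a list of document lengths."""
    total_docs = len(doc_lengths)
    # total_terms_in_collection is the sum of all document lengths
    total_terms_in_collection = sum(doc_lengths)
    # average_doclength is the mean document length (0.0 for an empty collection)
    average_doclength = (
        total_terms_in_collection / total_docs if total_docs else 0.0
    )
    return {
        "total_docs": total_docs,
        "total_terms_in_collection": total_terms_in_collection,
        "average_doclength": average_doclength,
    }


stats = collection_stats([3, 5, 4])  # made-up lengths for three documents
print(stats)
```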
- class pyterrier_ciff.PostingsList¶
A CIFF postings list, which holds the postings and metadata for an individual term.
- Attributes:
term (str) – The term.
df (int) – The document frequency, representing the number of documents containing the term.
cf (int) – The collection frequency, representing the total occurrences of the term across the collection.
postings (List[Posting]) – A list of postings associated with the term.
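As a rough illustration of how df and cf relate to the postings, the sketch below derives both from a list of term frequencies. It assumes one posting per document containing the term and that cf is the sum of the stored tf values; the function name df_and_cf is hypothetical, and the input mirrors the term 'b' from the invert() example.

```python
def df_and_cf(tfs):
    """Return (df, cf) for a postings list given its per-document tf values."""
    df = len(tfs)  # one posting per document that contains the term
    cf = sum(tfs)  # total occurrences (or summed impact scores) across documents
    return df, cf


b_df, b_cf = df_and_cf([141, 215])
print(b_df, b_cf)
```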
- class pyterrier_ciff.Posting¶
A CIFF posting, which holds the term frequency (or impact score) for a document.
- Attributes:
docid (int) – The delta-gap compressed document ID.
tf (int) – Term frequency within the document.
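Since docid is stored delta-gap compressed, recovering absolute document IDs means taking a running sum of the stored values. The sketch below shows that idea on plain integers, assuming the first value in a postings list is a gap from 0 (that is, the absolute ID); the helper names are hypothetical and not part of pyterrier-ciff.

```python
from itertools import accumulate


def decode_docid_gaps(gaps):
    """Turn delta-gap encoded docids into absolute docids via a running sum."""
    return list(accumulate(gaps))


def encode_docids(docids):
    """Inverse operation: turn sorted absolute docids into delta gaps."""
    gaps = []
    previous = 0
    for docid in docids:
        gaps.append(docid - previous)
        previous = docid
    return gaps


absolute_ids = decode_docid_gaps([3, 2, 5])
round_trip = encode_docids(absolute_ids)
print(absolute_ids, round_trip)
```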
- class pyterrier_ciff.DocRecord¶
A CIFF document record, which holds information about a document.
- Attributes:
docid (int) – Refers to the document ID in the postings lists.
collection_docid (str) – Refers to a document ID in the external collection.
doclength (int) – The length of the document.
Acknowledgements¶
This extension builds upon the CIFF initiative. If you use it, please be sure to cite CIFF:
Citation
Lin et al. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. SIGIR 2020. [link]
@inproceedings{DBLP:conf/sigir/LinMKMMSTV20,
  author = {Jimmy Lin and Joel M. Mackenzie and Chris Kamphuis and Craig Macdonald and Antonio Mallia and Michal Siedlaczek and Andrew Trotman and Arjen P. de Vries},
  editor = {Jimmy X. Huang and Yi Chang and Xueqi Cheng and Jaap Kamps and Vanessa Murdock and Ji{-}Rong Wen and Yiqun Liu},
  title = {Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format},
  booktitle = {Proceedings of the 43rd International {ACM} {SIGIR} conference on research and development in Information Retrieval, {SIGIR} 2020, Virtual Event, China, July 25-30, 2020},
  pages = {2149--2152},
  publisher = {{ACM}},
  year = {2020},
  url = {https://doi.org/10.1145/3397271.3401404},
  doi = {10.1145/3397271.3401404},
  timestamp = {Wed, 07 Dec 2022 23:08:55 +0100},
  biburl = {https://dblp.org/rec/conf/sigir/LinMKMMSTV20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
This extension was written by Sean MacAvaney at the University of Glasgow. Check out the GitHub for a full list of contributors.