CIFF + PyTerrier
=====================================

    The Common Index File Format (CIFF) represents an attempt to build a binary data exchange format for
    open-source search engines to interoperate by sharing index structures.

    -- `CIFF Project `_

`pyterrier-ciff `__ gives access to `CIFF `_ files. It provides the following core functionality:

- Build CIFF indexes from existing indexes. `[example] <#building-from-an-index>`__
- Build CIFF indexes from learned sparse retrieval models. `[example] <#building-from-learned-sparse-models>`__
- Parse CIFF files to get the postings and document records. `[example] <#parsing-ciff-files>`__
- Share and load CIFF files to/from HuggingFace datasets. `[example] <#share-and-load-with-huggingface-datasets>`__
- Load CIFF files from the CIFF Hub. `[example] <#loading-ciff-from-the-ciff-hub>`__

Quick Start
-------------------------------------

You can install ``pyterrier-ciff`` with pip:

.. code-block:: console
   :caption: Install ``pyterrier-ciff``

   $ pip install pyterrier-ciff

Building from an Index
-------------------------------------

Many indexes, such as those from Terrier and PISA, provide a ``get_corpus_iter()`` method that iterates through
the sparse representations. You can use these methods with :func:`pyterrier_ciff.index` to construct a CIFF file:

.. tabs::

   .. tab:: Terrier

      .. code-block:: python
         :caption: Build a CIFF index from a Terrier index

         >>> import pyterrier as pt
         >>> import pyterrier_ciff
         >>> terrier_index = pt.IndexFactory.of('my_index.terrier')
         >>> pyterrier_ciff.index(terrier_index, 'my_index.ciff')
         CiffIndex('my_index.ciff')

   .. tab:: PISA

      .. code-block:: python
         :caption: Build a CIFF index from a PISA index

         >>> from pyterrier_pisa import PisaIndex
         >>> import pyterrier_ciff
         >>> pisa_index = PisaIndex('my_index.pisa')
         >>> pyterrier_ciff.index(pisa_index, 'my_index.ciff')
         CiffIndex('my_index.ciff')

.. note::

   :func:`pyterrier_ciff.index` uses reasonable default settings.
   You can customize these settings with :class:`~pyterrier_ciff.CiffIndexer` if you need more control over how
   the CIFF file is constructed.

Building from Learned Sparse Models
----------------------------------------

You can also build a CIFF index from learned sparse retrieval models, such as those from the `pyt-splade `__ package.

.. code-block:: python
   :caption: Build a CIFF index from a SPLADE model

   >>> from pyterrier_ciff import CiffIndex
   >>> from pyt_splade import Splade
   >>> ciff_index = CiffIndex('splade_index.ciff')
   >>> splade = Splade()
   >>> pipeline = splade >> ciff_index
   >>> my_documents = [{'docno': '0', 'text': 'PyTerrier example with SPLADE and CIFF.'}] # or load a dataset
   >>> pipeline.index(my_documents)

Share and Load with HuggingFace Datasets
----------------------------------------

:class:`~pyterrier_ciff.CiffIndex` allows you to share your CIFF files on HuggingFace datasets using ``to_hf``:

.. code-block:: python
   :caption: Upload a CIFF index to HuggingFace

   >>> from pyterrier_ciff import CiffIndex
   >>> index = CiffIndex('my_index.ciff')
   >>> index.to_hf('username/my_index.ciff')

.. danger::

   Note that uploads to HuggingFace Datasets are public by default. Be sure not to upload an index that you
   are not allowed to share!

Similarly, you can download CIFF indexes that others have shared on HuggingFace using ``from_hf``:

.. code-block:: python
   :caption: Load a CIFF index from HuggingFace

   >>> from pyterrier_ciff import CiffIndex
   >>> index = CiffIndex.from_hf('username/my_index.ciff')

You can find `a list of available CIFF artifacts `__ on HuggingFace datasets.

.. note::

   ``to_hf`` and ``from_hf`` are provided by PyTerrier's Artifact API.

Loading CIFF from the CIFF Hub
----------------------------------------

You can also load a :class:`~pyterrier_ciff.CiffIndex` from the `CIFF Hub `__ using the
:func:`~pyterrier_ciff.CiffIndex.from_ciff_hub` method:

.. code-block:: python
   :caption: Load a CIFF file from the CIFF Hub

   >>> from pyterrier_ciff import CiffIndex
   >>> CiffIndex.from_ciff_hub('csv-30k/bp-csv-30k')

API Documentation
----------------------------------------

:class:`~pyterrier_ciff.CiffIndex` is the primary class for interacting with CIFF in PyTerrier.

.. autoclass:: pyterrier_ciff.CiffIndex
   :members:

.. autoclass:: pyterrier_ciff.CiffIndexer
   :members:

.. autofunction:: pyterrier_ciff.index

.. autofunction:: pyterrier_ciff.invert

Protobuf Bindings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following classes are auto-generated by Protobuf and are returned by :class:`~pyterrier_ciff.CiffIndex`.

.. autoclass:: pyterrier_ciff.Header
   :members:

   A CIFF header, which provides metadata about the CIFF index.

   Attributes:

   - ``version`` (int): Version.
   - ``num_postings_lists`` (int): Exactly the number of PostingsList messages present in the CIFF index.
   - ``num_docs`` (int): Exactly the number of DocRecord messages present in the CIFF index.
   - ``total_postings_lists`` (int): The total number of postings lists in the collection, representing the
     vocabulary size. This might differ from ``num_postings_lists`` since the CIFF file may only include the
     postings lists of query terms.
   - ``total_docs`` (int): The total number of documents in the collection, which might differ from ``num_docs``
     for reasons similar to the above.
   - ``total_terms_in_collection`` (int): The total number of terms across the entire collection, calculated as
     the sum of all document lengths.
   - ``average_doclength`` (float): The average length of documents in the collection, stored explicitly for a
     desired level of precision.
   - ``description`` (str): A human-readable description of this index, detailing aspects like the exporting
     application, document processing, and tokenization pipeline.

.. autoclass:: pyterrier_ciff.PostingsList
   :members:

   A CIFF postings list, which holds the postings and metadata for an individual term.

   Attributes:

   - **term** (*str*) -- The term.
   - **df** (*int*) -- The document frequency, representing the number of documents containing the term.
   - **cf** (*int*) -- The collection frequency, representing the total occurrences of the term across the
     collection.
   - **postings** (*List[* :class:`~pyterrier_ciff.Posting` *]*) -- A list of postings associated with the term.

.. autoclass:: pyterrier_ciff.Posting
   :members:

   A CIFF posting, which holds the term frequency (or impact score) for a document.

   Attributes:

   - **docid** (*int*) -- The delta-gap compressed document ID.
   - **tf** (*int*) -- Term frequency within the document.

.. autoclass:: pyterrier_ciff.DocRecord
   :members:

   A CIFF document record, which holds information about a document.

   Attributes:

   - **docid** (*int*) -- Refers to the document ID in the postings lists.
   - **collection_docid** (*str*) -- Refers to a document ID in the external collection.
   - **doclength** (*int*) -- The length of the document.

Acknowledgements
-------------------------------------

This extension builds upon the CIFF initiative. If you use it, please be sure to cite CIFF:

.. cite.dblp:: conf/sigir/LinMKMMSTV20

This extension was written by `Sean MacAvaney `__ at the University of Glasgow.
Check out the GitHub repository for `a full list of contributors `__.
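
Decoding Delta-Gap Document IDs
-------------------------------------

The ``docid`` field of a posting is described in the API documentation as delta-gap compressed. The sketch below
illustrates what that means in plain Python. It assumes the usual convention that the first value in a postings
list is an absolute document ID and every later value is the difference from the previous ID. The helper names
``decode_docid_gaps`` and ``encode_docid_gaps`` are hypothetical and are not part of ``pyterrier-ciff``.

.. code-block:: python
   :caption: Illustrative sketch of delta-gap decoding (not part of the package API)

   from itertools import accumulate
   from typing import Iterable, List


   def decode_docid_gaps(gaps: Iterable[int]) -> List[int]:
       """Recover absolute docids from gaps by taking a running sum."""
       return list(accumulate(gaps))


   def encode_docid_gaps(docids: Iterable[int]) -> List[int]:
       """Convert sorted absolute docids into gaps (inverse of decoding)."""
       gaps = []
       previous = 0
       for docid in docids:
           gaps.append(docid - previous)
           previous = docid
       return gaps


   # Hypothetical gap values, as they might appear in one postings list.
   gaps = [3, 2, 5, 1]
   docids = decode_docid_gaps(gaps)
   print(docids)  # [3, 5, 10, 11]

Storing gaps rather than absolute IDs keeps the numbers small, which typically helps variable-length integer
encodings compress postings lists.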