PISA + PyTerrier

pyterrier-pisa provides PyTerrier bindings to the PISA engine. PISA provides very efficient sparse indexing and retrieval.

Getting Started

You can install pyterrier-pisa using pip:

Install pyterrier-pisa with pip
pip install pyterrier-pisa

Attention

pyterrier-pisa is only available on Linux (manylinux2010_x86_64) platforms at this time. There are pre-built images for Python 3.8-3.11 on PyPI.

The main class is PisaIndex. It provides functionality for indexing and retrieval.

Indexing

You can easily index corpora from PyTerrier datasets:

Index using PISA
import pyterrier as pt
from pyterrier_pisa import PisaIndex

# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())

You can also select which text field(s) to index. If not specified, all fields of type str will be indexed.

Choosing the fields to index with PISA
dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
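
The field-selection rule described above can be pictured with a small plain-Python sketch. The helper `select_text` is hypothetical and not part of pyterrier-pisa; skipping the `docno` field and joining fields with a single space are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Union

def select_text(doc: Dict[str, object],
                text_field: Optional[Union[str, List[str]]] = None) -> str:
    """Hypothetical helper: assemble the text to index for one document."""
    if text_field is None:
        # No field given: take every str-valued field.
        # Assumption: the document identifier ('docno') is not indexed as text.
        fields = [k for k, v in doc.items() if k != 'docno' and isinstance(v, str)]
    elif isinstance(text_field, str):
        fields = [text_field]
    else:
        fields = list(text_field)
    # Assumption: the selected fields are joined with a single space.
    return ' '.join(str(doc[f]) for f in fields)

doc = {'docno': 'd1', 'title': 'Covid symptoms', 'abstract': 'Fever and cough.', 'year': 2020}
print(select_text(doc, ['title', 'abstract']))  # Covid symptoms Fever and cough.
```

In this sketch the non-string `year` field is ignored when no `text_field` is given, mirroring the "all fields of type str" rule.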

Retrieval

From an index, you can build retrieval transformers:

Constructing PISA retrieval transformers
dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)

These retrievers support all the typical pipeline operations.

Search:

Searching with a PISA retriever
bm25.search('covid symptoms')
#     qid           query     docno     score
# 0     1  covid symptoms  a6avr09j  6.273450
# 1     1  covid symptoms  hdxs9dgu  6.272374
# 2     1  covid symptoms  zxq7dl9t  6.272374
# ..   ..             ...       ...       ...
# 999   1  covid symptoms  m8wggdc7  4.690651

Batch retrieval:

Batch retrieval with a PISA retriever
print(dph(dataset.get_topics('title')))
#       qid                     query     docno     score
# 0       1        coronavirus origin  8ccl9aui  9.329109
# 1       1        coronavirus origin  es7q6c90  9.260190
# 2       1        coronavirus origin  8l411r1w  8.862670
# ...    ..                       ...       ...       ...
# 49999  50  mrna vaccine coronavirus  eyitkr3s  5.610429

Experiment:

Conducting an experiment with PISA retrievers
from pyterrier.measures import *
pt.Experiment(
  [dph, bm25, pl2, qld],
  dataset.get_topics('title'),
  dataset.get_qrels(),
  [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
  names=['dph', 'bm25', 'pl2', 'qld']
)
#    name   nDCG@10    P@5  P(rel=2)@5       mrt
# 0   dph  0.623450  0.720       0.548  1.101846
# 1  bm25  0.624923  0.728       0.572  0.880318
# 2   pl2  0.536506  0.632       0.456  1.123883
# 3   qld  0.570032  0.676       0.504  0.974924
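
To make the measures in the experiment above more concrete, the following is a minimal plain-Python sketch of precision at k (with a minimum relevance level, as in `P(rel=2)@5`) and nDCG at k. The function names and toy data are hypothetical, and the nDCG variant shown (linear gain, log2 discount) is an assumption; the measure implementations used by `pt.Experiment` may differ in details.

```python
import math
from typing import Dict, List

def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int, rel: int = 1) -> float:
    """Fraction of the top-k documents whose relevance label is at least `rel`."""
    hits = sum(1 for docno in ranking[:k] if qrels.get(docno, 0) >= rel)
    return hits / k

def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """nDCG with linear gain and a log2 rank discount (one common variant)."""
    dcg = sum(qrels.get(docno, 0) / math.log2(i + 2)
              for i, docno in enumerate(ranking[:k]))
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Toy relevance labels and a toy ranking for a single query.
qrels = {'a': 2, 'b': 1, 'c': 0, 'd': 2}
ranking = ['a', 'c', 'b', 'e', 'd']

print(precision_at_k(ranking, qrels, 5))         # 0.6
print(precision_at_k(ranking, qrels, 5, rel=2))  # 0.4
print(ndcg_at_k(ranking, qrels, 10))
```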

Extras

  • You can upload/download indexes to/from Hugging Face Hub using to_hf() and from_hf().

  • You can access PISA’s tokenizers and stemmers using tokenize().

API Documentation

class pyterrier_pisa.PisaIndex(path, text_field=None, stemmer=None, index_encoding=None, batch_size=100000, stops=None, threads=1, overwrite=False)[source]

Represents a PISA index.

This object acts as a factory for indexing and retrieval transformers over the index.

Parameters:
  • path – The path to the PISA index

  • text_field – The field to use for indexing. If None, all string fields are concatenated.

  • stemmer – The stemmer to use. Defaults to porter2 for new indexes and the stemmer used for construction for existing indexes

  • index_encoding – The index encoding to use. Defaults to block_simdbp.

  • batch_size – The batch size to use during indexing. Defaults to 100,000.

  • stops – The stopword list to use. Defaults to the Terrier stopword list.

  • threads – The number of threads to use during indexing and retrieval. Defaults to 1.

  • overwrite – If True, the index will be overwritten if it already exists. Defaults to False.
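
As a rough illustration of what a batch size means for indexing, the sketch below groups a document iterator into fixed-size batches. The `batched` helper is hypothetical and is not how pyterrier-pisa is known to process documents internally; it only shows the general idea of consuming a corpus iterator in chunks.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar('T')

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Hypothetical helper: yield lists of up to `batch_size` items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

docs = ({'docno': str(i), 'text': f'document {i}'} for i in range(5))
for batch in batched(docs, 2):
    print([d['docno'] for d in batch])
```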

to_hf(repo, *, branch=None, pretty_name=None)

Upload this artifact to Hugging Face Hub.

Return type:

None

Parameters:
  • repo – The Hugging Face repository name.

  • branch – The branch or tag of the repository to upload to. (Default: main) A branch can also be provided directly in the repository name using owner/repo@branch.

  • pretty_name – The human-readable name of the artifact. (Default: the repository name)
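
The `owner/repo@branch` shorthand described above can be sketched as a small parsing step. The helper `split_repo_branch` is hypothetical, and raising an error when the inline branch conflicts with an explicit `branch` argument is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple

def split_repo_branch(repo: str, branch: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: resolve a repository name and branch."""
    if '@' in repo:
        name, _, inline_branch = repo.partition('@')
        # Assumption: conflicting branch specifications are treated as an error.
        if branch is not None and branch != inline_branch:
            raise ValueError(f'conflicting branches: {branch!r} and {inline_branch!r}')
        return name, inline_branch
    # Default branch is main, as stated in the parameter description.
    return repo, branch if branch is not None else 'main'

print(split_repo_branch('owner/repo'))           # ('owner/repo', 'main')
print(split_repo_branch('owner/repo@v1'))        # ('owner/repo', 'v1')
print(split_repo_branch('owner/repo', 'dev'))    # ('owner/repo', 'dev')
```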

Upload a PISA index to HuggingFace Hub

classmethod from_hf(repo, branch=None, *, expected_sha256=None)

Load an artifact from Hugging Face Hub.

Return type:

Artifact

Parameters:
  • repo – The Hugging Face repository name.

  • branch – The branch or tag of the repository to load. (Default: main). A branch can also be provided directly in the repository name using owner/repo@branch.

  • expected_sha256 – The expected SHA-256 hash of the artifact. If provided, the downloaded artifact will be verified against this hash and an error will be raised if the hash does not match.
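
The hash check described for `expected_sha256` can be sketched with the standard library. The helper `verify_sha256` is hypothetical; the exception type and the case-insensitive comparison are assumptions, and the real implementation may hash the download differently (for example, in streamed chunks).

```python
import hashlib
from typing import Optional

def verify_sha256(data: bytes, expected_sha256: Optional[str] = None) -> None:
    """Hypothetical helper: raise if `data` does not match the expected hash."""
    if expected_sha256 is None:
        return  # No hash provided, so nothing to verify.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f'SHA-256 mismatch: expected {expected_sha256}, got {actual}')

payload = b'example artifact bytes'
good_hash = hashlib.sha256(payload).hexdigest()
verify_sha256(payload, good_hash)  # passes silently
```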

Load a PISA index from HuggingFace Hub

built()[source]

Returns True if the index has been built.

index(it)[source]

Indexes a collection of documents.

bm25(k1=0.9, b=0.4, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)[source]

Creates a BM25 retrieval transformer over this index.

Parameters:
  • k1 – BM25 k1 parameter

  • b – BM25 b parameter

  • num_results – number of results to return per query

  • verbose – if True, print progress

  • threads – number of threads to use

  • query_algorithm – the query algorithm to use

  • query_weighted – if True, the query is weighted

  • toks_scale – scale factor to apply to toks fields
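
For intuition about the `k1` and `b` parameters, here is a minimal plain-Python sketch of a textbook BM25 scoring function, using the default values from the signature above. This is not PISA's implementation, and the exact IDF and normalization formulas used by PISA may differ; the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def bm25_score(query_terms: List[str], doc_terms: List[str],
               doc_freqs: Dict[str, int], num_docs: int, avg_doc_len: float,
               k1: float = 0.9, b: float = 0.4) -> float:
    """Textbook BM25: k1 controls term-frequency saturation, b length normalization."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        df = doc_freqs.get(term, 0)
        if f == 0 or df == 0:
            continue  # Term absent from the document or the collection.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        norm = f + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * f * (k1 + 1.0) / norm
    return score

# Toy corpus of pre-tokenized documents.
docs = {
    'd1': ['covid', 'symptoms', 'fever'],
    'd2': ['covid', 'vaccine', 'trial'],
    'd3': ['weather', 'report', 'today'],
}
doc_freqs = Counter(t for terms in docs.values() for t in set(terms))
avg_len = sum(len(terms) for terms in docs.values()) / len(docs)
scores = {d: bm25_score(['covid', 'symptoms'], terms, doc_freqs, len(docs), avg_len)
          for d, terms in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['d1', 'd2', 'd3']
```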

dph(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)[source]

Creates a DPH retrieval transformer over this index.

Parameters:
  • num_results – number of results to return per query

  • verbose – if True, print progress

  • threads – number of threads to use

  • query_algorithm – the query algorithm to use

  • query_weighted – if True, the query is weighted

  • toks_scale – scale factor to apply to toks fields

pl2(c=1.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)[source]

Creates a PL2 retrieval transformer over this index.

Parameters:
  • c – PL2 c parameter

  • num_results – number of results to return per query

  • verbose – if True, print progress

  • threads – number of threads to use

  • query_algorithm – the query algorithm to use

  • query_weighted – if True, the query is weighted

  • toks_scale – scale factor to apply to toks fields

qld(mu=1000.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)[source]

Creates a QLD retrieval transformer over this index.

Parameters:
  • mu – QLD mu parameter

  • num_results – number of results to return per query

  • verbose – if True, print progress

  • threads – number of threads to use

  • query_algorithm – the query algorithm to use

  • query_weighted – if True, the query is weighted

  • toks_scale – scale factor to apply to toks fields
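
To show the role of `mu`, the following is a minimal sketch of query likelihood with Dirichlet smoothing in its basic textbook form. It is not PISA's implementation, which may use a rank-equivalent or otherwise different formulation; the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def qld_score(query_terms: List[str], doc_terms: List[str],
              collection_freqs: Dict[str, int], collection_len: int,
              mu: float = 1000.0) -> float:
    """Sum of log P(term | document), smoothed toward the collection model by mu."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        cf = collection_freqs.get(term, 0)
        if cf == 0:
            continue  # Terms unseen in the collection are skipped in this sketch.
        p_collection = cf / collection_len
        score += math.log((tf.get(term, 0) + mu * p_collection) / (doc_len + mu))
    return score

# Toy corpus of pre-tokenized documents.
docs = {
    'd1': ['covid', 'symptoms', 'fever'],
    'd2': ['covid', 'vaccine', 'trial'],
    'd3': ['weather', 'report', 'today'],
}
collection_freqs = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(len(terms) for terms in docs.values())
scores = {d: qld_score(['covid', 'symptoms'], terms, collection_freqs, collection_len)
          for d, terms in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['d1', 'd2', 'd3']
```

A larger `mu` pulls every document's estimate closer to the collection model, which reduces the influence of raw term counts in short documents.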

quantized(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)[source]

Creates a quantized retrieval transformer over this index.

This transformer is used for scoring as a dot product (e.g., for learned sparse retrieval).

Parameters:
  • num_results – number of results to return per query

  • verbose – if True, print progress

  • threads – number of threads to use

  • query_algorithm – the query algorithm to use

  • query_weighted – if True, the query is weighted

  • toks_scale – scale factor to apply to toks fields
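
Scoring as a dot product can be sketched over sparse term-weight dictionaries. The function name and the example weights below are hypothetical, and this plain-Python version only illustrates the arithmetic; it says nothing about how PISA stores or traverses its quantized postings.

```python
from typing import Dict

def dot_product_score(query_weights: Dict[str, float],
                      doc_weights: Dict[str, int]) -> float:
    """Sum of query weight times document weight over the terms they share."""
    return sum(weight * doc_weights[term]
               for term, weight in query_weights.items()
               if term in doc_weights)

query_weights = {'covid': 1.5, 'fever': 0.5}
doc_weights = {'covid': 120, 'cough': 80}  # integer impacts stored in the index
print(dot_product_score(query_weights, doc_weights))  # 180.0
```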

num_terms()[source]

Returns the number of terms in the index.

num_docs()[source]

Returns the number of documents in the index.

static from_ciff(ciff_file, index_path, overwrite=False, stemmer=PisaStemmer.porter2)[source]

Creates a PISA index from a CIFF file.

Parameters:
  • ciff_file – The path to the CIFF file

  • index_path – The path to the index

  • overwrite – If True, the index will be overwritten if it already exists. Defaults to False.

  • stemmer – The stemmer to use. Defaults to porter2.

to_ciff(ciff_file, description='from pyterrier_pisa')[source]

Converts this index to a CIFF file.

Parameters:
  • ciff_file – The path to the CIFF file

  • description – The description to write to the CIFF file.

get_corpus_iter(field='toks', verbose=True)[source]

Iterates over the indexed corpus, yielding a dictionary for each document.

Parameters:
  • field – The field name to yield. Defaults to ‘toks’.

  • verbose – If True, print progress.

indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None)[source]

Create an indexer for this index.

Parameters:
  • text_field – The field name to index. Defaults to ‘text’.

  • mode – The indexing mode to use. Defaults to PisaIndexingMode.create.

  • threads – The number of threads to use. Defaults to the number of threads used to create the index.

  • batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.

toks_indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None, scale=100.0)[source]

Create an indexer over pre-tokenized text for this index.

Parameters:
  • text_field – The field name to index. Defaults to ‘toks’.

  • mode – The indexing mode to use. Defaults to PisaIndexingMode.create.

  • threads – The number of threads to use. Defaults to the number of threads used to create the index.

  • batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.

  • scale – The scale factor to apply to the token counts. Defaults to 100.
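
One way to picture the scale factor is as a conversion from floating-point term weights to the integer counts an inverted index stores. The sketch below is an illustration only: the `scale_toks` helper is hypothetical, and both rounding to the nearest integer and dropping terms that scale to zero are assumptions, not documented behavior of `toks_indexer`.

```python
from typing import Dict

def scale_toks(toks: Dict[str, float], scale: float = 100.0) -> Dict[str, int]:
    """Hypothetical helper: multiply weights by `scale` and convert to integers."""
    # Assumption: values are rounded to the nearest integer.
    scaled = {term: int(round(weight * scale)) for term, weight in toks.items()}
    # Assumption: terms whose scaled weight is zero are not indexed.
    return {term: count for term, count in scaled.items() if count > 0}

doc = {'docno': 'd1', 'toks': {'covid': 1.27, 'fever': 0.004, 'cough': 0.5}}
print(scale_toks(doc['toks']))  # {'covid': 127, 'cough': 50}
```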

tokenize(text)[source]

Tokenize a string using the stemmer of this index.

Return type:

List[str]

Parameters:

text – The text to tokenize

enum pyterrier_pisa.PisaStemmer(value)[source]

Represents a built-in stemming function from PISA

Valid values are as follows:

none = <PisaStemmer.none: 'none'>
porter2 = <PisaStemmer.porter2: 'porter2'>
krovetz = <PisaStemmer.krovetz: 'krovetz'>
enum pyterrier_pisa.PisaScorer(value)[source]

Represents a built-in scoring function from PISA

Valid values are as follows:

bm25 = <PisaScorer.bm25: 'bm25'>
dph = <PisaScorer.dph: 'dph'>
pl2 = <PisaScorer.pl2: 'pl2'>
qld = <PisaScorer.qld: 'qld'>
quantized = <PisaScorer.quantized: 'quantized'>
enum pyterrier_pisa.PisaIndexEncoding(value)[source]

Represents a built-in index encoding type from PISA.

Valid values are as follows:

ef = <PisaIndexEncoding.ef: 'ef'>
single = <PisaIndexEncoding.single: 'single'>
pefuniform = <PisaIndexEncoding.pefuniform: 'pefuniform'>
pefopt = <PisaIndexEncoding.pefopt: 'pefopt'>
block_optpfor = <PisaIndexEncoding.block_optpfor: 'block_optpfor'>
block_varintg8iu = <PisaIndexEncoding.block_varintg8iu: 'block_varintg8iu'>
block_streamvbyte = <PisaIndexEncoding.block_streamvbyte: 'block_streamvbyte'>
block_maskedvbyte = <PisaIndexEncoding.block_maskedvbyte: 'block_maskedvbyte'>
block_interpolative = <PisaIndexEncoding.block_interpolative: 'block_interpolative'>
block_qmx = <PisaIndexEncoding.block_qmx: 'block_qmx'>
block_varintgb = <PisaIndexEncoding.block_varintgb: 'block_varintgb'>
block_simple8b = <PisaIndexEncoding.block_simple8b: 'block_simple8b'>
block_simple16 = <PisaIndexEncoding.block_simple16: 'block_simple16'>
block_simdbp = <PisaIndexEncoding.block_simdbp: 'block_simdbp'>
enum pyterrier_pisa.PisaStopwords(value)[source]

Represents which set of stopwords to use during retrieval

Valid values are as follows:

terrier = <PisaStopwords.terrier: 'terrier'>
none = <PisaStopwords.none: 'none'>
pyterrier_pisa.tokenize(text, stemmer=PisaStemmer.none)[source]

Tokenizes a string using the specified stemmer.

Return type:

List[str]

Parameters:
  • text – The text to tokenize

  • stemmer – The stemmer to use. Defaults to no stemming

References

Citation

Mallia et al. PISA: Performant Indexes and Search for Academia. OSIRRC@SIGIR 2019.
@inproceedings{DBLP:conf/sigir/MalliaSMS19,
  author       = {Antonio Mallia and
                  Michal Siedlaczek and
                  Joel M. Mackenzie and
                  Torsten Suel},
  editor       = {Ryan Clancy and
                  Nicola Ferro and
                  Claudia Hauff and
                  Jimmy Lin and
                  Tetsuya Sakai and
                  Ze Zhong Wu},
  title        = {{PISA:} Performant Indexes and Search for Academia},
  booktitle    = {Proceedings of the Open-Source {IR} Replicability Challenge co-located
                  with 42nd International {ACM} {SIGIR} Conference on Research and Development
                  in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25,
                  2019},
  series       = {{CEUR} Workshop Proceedings},
  volume       = {2409},
  pages        = {50--56},
  publisher    = {CEUR-WS.org},
  year         = {2019},
  url          = {https://ceur-ws.org/Vol-2409/docker08.pdf},
  timestamp    = {Fri, 10 Mar 2023 16:22:17 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/MalliaSMS19.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

MacAvaney and Macdonald. A Python Interface to PISA!. SIGIR 2022.
@inproceedings{DBLP:conf/sigir/MacAvaneyM22,
  author       = {Sean MacAvaney and
                  Craig Macdonald},
  editor       = {Enrique Amig{\'{o}} and
                  Pablo Castells and
                  Julio Gonzalo and
                  Ben Carterette and
                  J. Shane Culpepper and
                  Gabriella Kazai},
  title        = {A Python Interface to PISA!},
  booktitle    = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research
                  and Development in Information Retrieval, Madrid, Spain, July 11 -
                  15, 2022},
  pages        = {3339--3344},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3477495.3531656},
  doi          = {10.1145/3477495.3531656},
  timestamp    = {Sat, 09 Jul 2022 09:25:34 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/MacAvaneyM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}