PISA + PyTerrier

pyterrier-pisa provides PyTerrier bindings to the PISA engine. PISA provides very efficient sparse indexing and retrieval.

Getting Started

You can install pyterrier-pisa using pip:

Install pyterrier-pisa with pip

pip install pyterrier-pisa

Attention

pyterrier-pisa is currently only available on Linux (manylinux2010_x86_64) platforms. Pre-built wheels for Python 3.8-3.11 are available on PyPI.

The main class is PisaIndex. It provides functionality for indexing and retrieval.

Indexing

You can easily index corpora from PyTerrier datasets:

Index using PISA

import pyterrier as pt
from pyterrier_pisa import PisaIndex

# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())

You can also select which text field(s) to index. If not specified, all fields of type str will be indexed.

Choosing the fields to index with PISA

dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
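The corpus does not have to come from a PyTerrier dataset: per the index(it) signature in the API documentation, any iterable of dictionaries can be indexed. The following is a minimal sketch, assuming each document carries a docno key (the usual PyTerrier convention) alongside its text fields. The sample rows, the index path, and the helper names iter_docs and build_index are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_docs(rows: Iterable[Tuple[str, str, str]]) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per document, in the shape index() consumes."""
    for docno, title, abstract in rows:
        yield {'docno': docno, 'title': title, 'abstract': abstract}


def build_index(path: str, docs: Iterable[Dict[str, str]]):
    """Index the title and abstract fields of docs into a PISA index at path."""
    # Imported lazily so iter_docs can be used without pyterrier-pisa installed.
    from pyterrier_pisa import PisaIndex
    index = PisaIndex(path, text_field=['title', 'abstract'])
    index.index(docs)
    return index


# Hypothetical sample data.
rows = [
    ('d1', 'Covid symptoms', 'Fever and cough are common.'),
    ('d2', 'Vaccine trials', 'An mRNA vaccine was evaluated.'),
]
docs = list(iter_docs(rows))
# build_index('./my-pisa-index', docs)  # run on a Linux machine with pyterrier-pisa installed
```

Calling index_inputs() on the index should report which columns the indexer actually expects, which is a useful check before indexing custom data.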

Retrieval

From an index, you can build retrieval transformers:

Constructing PISA retrieval transformers

dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)

These retrievers support all the typical pipeline operations.

Search:

Searching with a PISA retriever

bm25.search('covid symptoms')
#     qid           query     docno     score
# 0     1  covid symptoms  a6avr09j  6.273450
# 1     1  covid symptoms  hdxs9dgu  6.272374
# 2     1  covid symptoms  zxq7dl9t  6.272374
# ..   ..             ...       ...       ...
# 999   1  covid symptoms  m8wggdc7  4.690651

Batch retrieval:

Batch retrieval with a PISA retriever

print(dph(dataset.get_topics('title')))
#       qid                     query     docno     score
# 0       1        coronavirus origin  8ccl9aui  9.329109
# 1       1        coronavirus origin  es7q6c90  9.260190
# 2       1        coronavirus origin  8l411r1w  8.862670
# ...    ..                       ...       ...       ...
# 49999  50  mrna vaccine coronavirus  eyitkr3s  5.610429

Experiment:

Conducting an experiment with PISA retrievers

from pyterrier.measures import *
pt.Experiment(
  [dph, bm25, pl2, qld],
  dataset.get_topics('title'),
  dataset.get_qrels(),
  [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
  names=['dph', 'bm25', 'pl2', 'qld']
)
#    name   nDCG@10    P@5  P(rel=2)@5       mrt
# 0   dph  0.623450  0.720       0.548  1.101846
# 1  bm25  0.624923  0.728       0.572  0.880318
# 2   pl2  0.536506  0.632       0.456  1.123883
# 3   qld  0.570032  0.676       0.504  0.974924
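To make the measure names concrete: P@5 is the fraction of the top five results that are relevant, and P(rel=2)@5 counts a result as relevant only if its judged relevance grade is at least 2. The sketch below is a plain-Python illustration of that definition, not the implementation used by pt.Experiment. The function name and the sample judgments are hypothetical.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 5, rel: int = 1) -> float:
    """Fraction of the top-k docnos whose relevance grade is >= rel."""
    top_k = ranking[:k]
    hits = sum(1 for docno in top_k if qrels.get(docno, 0) >= rel)
    return hits / k


# Hypothetical ranking and graded judgments for a single query.
ranking = ['a', 'b', 'c', 'd', 'e', 'f']
qrels = {'a': 2, 'b': 1, 'c': 0, 'd': 2, 'f': 2}

p5 = precision_at_k(ranking, qrels, k=5)              # a, b and d count as relevant
p5_rel2 = precision_at_k(ranking, qrels, k=5, rel=2)  # only a and d reach grade 2
```

Because raising the relevance threshold can only remove hits, P(rel=2)@5 is never higher than P@5 for the same run, which matches the results table above.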

Extras

You can upload/download indexes to/from HuggingFace Hub using to_hf() and from_hf().
You can access PISA’s tokenizers and stemmers using tokenize().

API Documentation

class pyterrier_pisa.PisaIndex(path, text_field=None, stemmer=None, index_encoding=None, batch_size=100000, stops=None, threads=1, overwrite=False)

Represents a PISA index.

This object acts as a factory for indexing and retrieval transformers over the index.

Parameters:

path (str) – The path to the PISA index
text_field (str) – The field to use for indexing. If None, all string fields are concatenated.
stemmer (PisaStemmer | str | None) – The stemmer to use. Defaults to porter2 for new indexes; for existing indexes, defaults to the stemmer the index was built with.
index_encoding (PisaIndexEncoding | str | None) – The index encoding to use. Defaults to block_simdbp.
batch_size (int) – The batch size to use during indexing. Defaults to 100,000.
stops (PisaStopwords | List[str] | None) – The stopword list to use. Defaults to the Terrier stopword list.
threads (int) – The number of threads to use during indexing and retrieval. Defaults to 1.
overwrite (bool) – If True, the index will be overwritten if it already exists. Defaults to False.

to_hf(repo, *, branch=None, pretty_name=None, private=None)

Upload this artifact to Hugging Face Hub.

Return type:

None

Parameters:

repo (str) – The Hugging Face repository name.
branch (str | None) – The branch or tag of the repository to upload to. (Default: main) A branch can also be provided directly in the repository name using owner/repo@branch.
pretty_name (str | None) – The human-readable name of the artifact. (Default: the repository name)
private (bool | None) – Whether to make the repository private. New repositories default to public unless the organization’s default is private. No change to the repository’s visibility will be made if private=None (default).

Upload a PISA index to HuggingFace Hub

classmethod from_hf(repo, branch=None, *, expected_sha256=None, **kwargs)

Load an artifact from Hugging Face Hub.

Return type:

Artifact

Parameters:

repo (str) – The Hugging Face repository name.
branch (str | None) – The branch or tag of the repository to load. (Default: main). A branch can also be provided directly in the repository name using owner/repo@branch.
expected_sha256 (str | None) – The expected SHA-256 hash of the artifact. If provided, the downloaded artifact will be verified against this hash and an error will be raised if the hash does not match.
**kwargs (Any) – Arguments that will be passed to the constructor of the artifact class.

Load a PISA index from HuggingFace Hub
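A hedged sketch of how loading with hash verification might look. from_hf and expected_sha256 come from the signature above; the repository name is a placeholder and the helper names are hypothetical. Exactly which bytes the library hashes is an assumption, so sha256_of_file only shows how a SHA-256 hex digest of a file is typically computed.

```python
import hashlib
import os
import tempfile
from typing import Optional


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(repo: str, expected_sha256: Optional[str] = None):
    """Load a PISA index from Hugging Face Hub, optionally verifying its hash."""
    # Imported lazily; pyterrier-pisa is only available on Linux.
    from pyterrier_pisa import PisaIndex
    return PisaIndex.from_hf(repo, expected_sha256=expected_sha256)


# Demonstrate the digest helper on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'abc')
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)

# index = load_index('owner/repo')  # placeholder repository name
```

Pinning expected_sha256 is mainly useful for reproducibility: if the artifact on the Hub changes, loading fails with an error rather than silently using different data.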

built(): Returns True if the index has been built.

index_inputs(): Returns the expected input columns for indexing.

index(it)

Indexes a collection of documents.

Parameters: it (Iterable[Dict])

bm25(k1=0.9, b=0.4, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0, precompute_impact=False)

Creates a BM25 retrieval transformer over this index.

Parameters:

k1 – BM25 k1 parameter
b – BM25 b parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
precompute_impact – if True, pre-compute impact scores, which speeds up retrieval. Defaults to False.
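
For orientation, the roles of k1 and b can be seen in the textbook BM25 term score: k1 controls how quickly repeated occurrences of a term saturate, and b controls how strongly scores are normalized by document length. The sketch below uses one common formulation. Whether PISA uses exactly this variant (in particular the IDF component) is an assumption, and the function name and sample statistics are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=0.9, b=0.4):
    """Score one query term against one document with a common BM25 variant."""
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = 1.0 - b + b * (doc_len / avg_doc_len)  # document length normalization
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Hypothetical collection statistics.
stats = dict(avg_doc_len=100.0, df=50, num_docs=10_000)

once = bm25_term_score(tf=1, doc_len=100, **stats)      # one occurrence, average length
many = bm25_term_score(tf=10, doc_len=100, **stats)     # ten occurrences, average length
long_doc = bm25_term_score(tf=1, doc_len=400, **stats)  # one occurrence, long document
```

Ten occurrences score higher than one, but far less than ten times higher, and the same single occurrence is worth less in a document four times the average length.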

dph(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)

Creates a DPH retrieval transformer over this index.

Parameters:

num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields

pl2(c=1.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)

Creates a PL2 retrieval transformer over this index.

Parameters:

c – PL2 c parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields

qld(mu=1000.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)

Creates a QLD retrieval transformer over this index.

Parameters:

mu – QLD mu parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields

quantized(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)

Creates a quantized retrieval transformer over this index.

This transformer scores documents as a dot product (e.g., for learned sparse retrieval).

Parameters:

num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
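
As a rough illustration of what scoring as a dot product means here: each document stores an integer impact per term, the query supplies a weight per term, and the score is the sum of the products over the terms they share. The sketch below is plain Python under that assumption and does not reflect how PISA implements this internally. The names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def dot_product_score(query: Dict[str, float], doc_impacts: Dict[str, int]) -> float:
    """Sum query_weight * stored_impact over the terms the query and document share."""
    return sum(w * doc_impacts[t] for t, w in query.items() if t in doc_impacts)


def rank(query: Dict[str, float], docs: Dict[str, Dict[str, int]],
         num_results: int = 1000) -> List[Tuple[str, float]]:
    """Return up to num_results (docno, score) pairs, highest score first."""
    scored = [(docno, dot_product_score(query, impacts)) for docno, impacts in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]  # drop documents with no matching term
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:num_results]


# Hypothetical quantized document vectors and a weighted query.
docs = {
    'd1': {'covid': 120, 'symptom': 80},
    'd2': {'covid': 40, 'vaccine': 150},
    'd3': {'influenza': 90},
}
query = {'covid': 1.5, 'symptom': 2.0}
results = rank(query, docs, num_results=2)
```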

num_terms(): Returns the number of terms in the index.

num_docs(): Returns the number of documents in the index.

static from_ciff(ciff_file, index_path, overwrite=False, stemmer=PisaStemmer.porter2)

Creates a PISA index from a CIFF file.

Parameters:

ciff_file (str) – The path to the CIFF file
index_path – The path to the index
overwrite (bool) – If True, the index will be overwritten if it already exists. Defaults to False.
stemmer – The stemmer to use. Defaults to porter2.

to_ciff(ciff_file, description='from pyterrier_pisa')

Converts this index to a CIFF file.

Parameters:

ciff_file (str) – The path to the CIFF file
description (str) – The description to write to the CIFF file.

get_corpus_iter(field='toks', verbose=True)

Iterates over the indexed corpus, yielding a dictionary for each document.

Parameters:

field – The field name to yield. Defaults to ‘toks’.
verbose – If True, print progress.

indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None)

Create an indexer for this index.

Parameters:

text_field – The field name to index. Defaults to ‘text’.
mode – The indexing mode to use. Defaults to PisaIndexingMode.create.
threads – The number of threads to use. Defaults to the number of threads used to create the index.
batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.

toks_indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None, scale=100.0)

Create an indexer over pre-tokenized text for this index.

Parameters:

text_field – The field name to index. Defaults to ‘toks’.
mode – The indexing mode to use. Defaults to PisaIndexingMode.create.
threads – The number of threads to use. Defaults to the number of threads used to create the index.
batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.
scale – The scale factor to apply to the token counts. Defaults to 100.
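
The scale parameter matters when the pre-tokenized input carries fractional term weights, as learned sparse models typically produce, while the index stores integer counts. A minimal sketch of the idea, assuming each weight is multiplied by scale and truncated to an integer. The exact rounding used by the library is an assumption, as are the function name, the document shape, and the sample weights.

```python
from typing import Dict


def scale_toks(toks: Dict[str, float], scale: float = 100.0) -> Dict[str, int]:
    """Convert fractional term weights to integer counts, dropping terms that scale to zero."""
    scaled = {term: int(weight * scale) for term, weight in toks.items()}
    return {term: count for term, count in scaled.items() if count > 0}


# Hypothetical pre-tokenized document with a 'toks' field of term weights.
doc = {'docno': 'd1', 'toks': {'covid': 1.25, 'symptom': 0.5, 'the': 0.001}}
counts = scale_toks(doc['toks'])
```

Under this assumption, a larger scale preserves more of the weight resolution at the cost of larger stored values, and very small weights vanish entirely at low scales.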

tokenize(text)

Tokenize a string using the stemmer of this index.

Return type: List[str]
Parameters: text (str) – The text to tokenize

enum pyterrier_pisa.PisaStemmer(value)

Represents a built-in stemming function from PISA.

Valid values are as follows:

none = <PisaStemmer.none: 'none'>

porter2 = <PisaStemmer.porter2: 'porter2'>

krovetz = <PisaStemmer.krovetz: 'krovetz'>

enum pyterrier_pisa.PisaScorer(value)

Represents a built-in scoring function from PISA.

Valid values are as follows:

bm25 = <PisaScorer.bm25: 'bm25'>

dph = <PisaScorer.dph: 'dph'>

pl2 = <PisaScorer.pl2: 'pl2'>

qld = <PisaScorer.qld: 'qld'>

quantized = <PisaScorer.quantized: 'quantized'>

enum pyterrier_pisa.PisaIndexEncoding(value)

Represents a built-in index encoding type from PISA.

Valid values are as follows:

ef = <PisaIndexEncoding.ef: 'ef'>

single = <PisaIndexEncoding.single: 'single'>

pefuniform = <PisaIndexEncoding.pefuniform: 'pefuniform'>

pefopt = <PisaIndexEncoding.pefopt: 'pefopt'>

block_optpfor = <PisaIndexEncoding.block_optpfor: 'block_optpfor'>

block_varintg8iu = <PisaIndexEncoding.block_varintg8iu: 'block_varintg8iu'>

block_streamvbyte = <PisaIndexEncoding.block_streamvbyte: 'block_streamvbyte'>

block_maskedvbyte = <PisaIndexEncoding.block_maskedvbyte: 'block_maskedvbyte'>

block_interpolative = <PisaIndexEncoding.block_interpolative: 'block_interpolative'>

block_qmx = <PisaIndexEncoding.block_qmx: 'block_qmx'>

block_varintgb = <PisaIndexEncoding.block_varintgb: 'block_varintgb'>

block_simple8b = <PisaIndexEncoding.block_simple8b: 'block_simple8b'>

block_simple16 = <PisaIndexEncoding.block_simple16: 'block_simple16'>

block_simdbp = <PisaIndexEncoding.block_simdbp: 'block_simdbp'>

enum pyterrier_pisa.PisaQueryAlgorithm(value)

Represents a built-in query algorithm.

Valid values are as follows:

wand = <PisaQueryAlgorithm.wand: 'wand'>

block_max_wand = <PisaQueryAlgorithm.block_max_wand: 'block_max_wand'>

block_max_maxscore = <PisaQueryAlgorithm.block_max_maxscore: 'block_max_maxscore'>

block_max_ranked_and = <PisaQueryAlgorithm.block_max_ranked_and: 'block_max_ranked_and'>

ranked_and = <PisaQueryAlgorithm.ranked_and: 'ranked_and'>

ranked_or = <PisaQueryAlgorithm.ranked_or: 'ranked_or'>

maxscore = <PisaQueryAlgorithm.maxscore: 'maxscore'>

enum pyterrier_pisa.PisaStopwords(value)

Represents which set of stopwords to use during retrieval.

Valid values are as follows:

terrier = <PisaStopwords.terrier: 'terrier'>

lucene = <PisaStopwords.lucene: 'lucene'>

none = <PisaStopwords.none: 'none'>

pyterrier_pisa.tokenize(text, stemmer=PisaStemmer.none)

Tokenizes a string using the specified stemmer.

Return type:

List[str]

Parameters:

text (str) – The text to tokenize
stemmer (PisaStemmer) – The stemmer to use. Defaults to PisaStemmer.none (no stemming).
