PISA + PyTerrier
pyterrier-pisa provides PyTerrier bindings to the PISA engine. PISA provides very efficient sparse indexing and retrieval.
Getting Started
You can install pyterrier-pisa using pip:
pip install pyterrier-pisa
Attention
pyterrier-pisa is only available on Linux (manylinux2010_x86_64) platforms at this time.
There are pre-built images for Python 3.8-3.11 on PyPI.
The main class is PisaIndex. It provides functionality for indexing and retrieval.
Indexing
You can easily index corpora from PyTerrier datasets:
import pyterrier as pt
from pyterrier_pisa import PisaIndex
# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())
You can also select which text field(s) to index. If not specified, all fields of type str will be indexed.
dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
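As a rough illustration of what selecting text fields means, the sketch below joins the chosen fields of a document record into a single string, falling back to every str-valued field when none is given. The helper name concat_text_fields, the single-space separator, and the exclusion of docno are assumptions made for illustration; this is not pyterrier-pisa's actual implementation.

```python
def concat_text_fields(doc, text_field=None):
    """Join the selected string fields of a document into one text blob.

    Hypothetical helper: it mirrors the documented behaviour (None means
    "all str fields"), but the separator, the field order, and skipping
    'docno' are assumptions.
    """
    if text_field is None:
        # Fall back to every str-valued field except the document id.
        fields = [k for k, v in doc.items() if isinstance(v, str) and k != 'docno']
    elif isinstance(text_field, str):
        fields = [text_field]
    else:
        fields = list(text_field)
    return ' '.join(doc[f] for f in fields)


doc = {'docno': 'd1', 'title': 'Covid symptoms', 'abstract': 'Fever and cough.', 'year': 2020}
print(concat_text_fields(doc, ['title', 'abstract']))  # Covid symptoms Fever and cough.
print(concat_text_fields(doc))                         # Covid symptoms Fever and cough.
```

The non-string year field is ignored in the default case, which is the practical consequence of indexing only fields of type str.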
Retrieval
From an index, you can build retrieval transformers:
dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
These retrievers support all the typical pipeline operations.
Search:
bm25.search('covid symptoms')
# qid query docno score
# 0 1 covid symptoms a6avr09j 6.273450
# 1 1 covid symptoms hdxs9dgu 6.272374
# 2 1 covid symptoms zxq7dl9t 6.272374
# .. .. ... ... ...
# 999 1 covid symptoms m8wggdc7 4.690651
Batch retrieval:
print(dph(dataset.get_topics('title')))
# qid query docno score
# 0 1 coronavirus origin 8ccl9aui 9.329109
# 1 1 coronavirus origin es7q6c90 9.260190
# 2 1 coronavirus origin 8l411r1w 8.862670
# ... .. ... ... ...
# 49999 50 mrna vaccine coronavirus eyitkr3s 5.610429
Experiment:
from pyterrier.measures import *
pt.Experiment(
    [dph, bm25, pl2, qld],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
    names=['dph', 'bm25', 'pl2', 'qld']
)
# name nDCG@10 P@5 P(rel=2)@5 mrt
# 0 dph 0.623450 0.720 0.548 1.101846
# 1 bm25 0.624923 0.728 0.572 0.880318
# 2 pl2 0.536506 0.632 0.456 1.123883
# 3 qld 0.570032 0.676 0.504 0.974924
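To make the difference between the P@5 and P(rel=2)@5 columns concrete, here is a minimal pure-Python sketch of precision at k with a minimum relevance label. The function name precision_at_k and the toy qrels and ranking are illustrative assumptions; the experiment above computes its measures through pyterrier.measures, not through this code.

```python
def precision_at_k(ranked_docnos, qrels, k=5, rel=1):
    """Fraction of the top-k documents whose relevance label is >= rel.

    Unjudged documents are treated as non-relevant (label 0).
    """
    top = ranked_docnos[:k]
    hits = sum(1 for docno in top if qrels.get(docno, 0) >= rel)
    return hits / k


# Toy judgments and ranking, invented for illustration only.
qrels = {'a': 2, 'b': 1, 'c': 0, 'd': 2}
ranking = ['a', 'b', 'x', 'c', 'd', 'e']

print(precision_at_k(ranking, qrels, k=5))         # 0.6
print(precision_at_k(ranking, qrels, k=5, rel=2))  # 0.4
```

Raising the relevance threshold can only keep or lower the score, which is consistent with the P(rel=2)@5 column being lower than P@5 for every system in the table.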
Extras
You can upload/download indexes to/from HuggingFace Hub using to_hf() and from_hf().
You can access PISA’s tokenizers and stemmers using tokenize().
API Documentation
- class pyterrier_pisa.PisaIndex(path, text_field=None, stemmer=None, index_encoding=None, batch_size=100000, stops=None, threads=1, overwrite=False)
Represents a PISA index.
This object acts as a factory for indexing and retrieval transformers over the index.
- Parameters:
path – The path to the PISA index
text_field – The field to use for indexing. If None, all string fields are concatenated.
stemmer – The stemmer to use. Defaults to porter2 for new indexes and the stemmer used for construction for existing indexes.
index_encoding – The index encoding to use. Defaults to block_simdbp.
batch_size – The batch size to use during indexing. Defaults to 100,000.
stops – The stopword list to use. Defaults to the Terrier stopword list.
threads – The number of threads to use during indexing and retrieval. Defaults to 1.
overwrite – If True, the index will be overwritten if it already exists. Defaults to False.
- to_hf(repo, *, branch=None, pretty_name=None)
Upload this artifact to Hugging Face Hub.
- Return type:
None
- Parameters:
repo – The Hugging Face repository name.
branch – The branch or tag of the repository to upload to. (Default: main) A branch can also be provided directly in the repository name using owner/repo@branch.
pretty_name – The human-readable name of the artifact. (Default: the repository name)
Upload a PISA index to HuggingFace Hub
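The owner/repo@branch shorthand described above can be pictured with the small sketch below. The helper split_repo_branch is hypothetical, and letting an inline branch take precedence over the branch argument is an assumption; the library's real resolution logic may differ.

```python
def split_repo_branch(repo, branch=None):
    """Split an 'owner/repo@branch' string into (repo, branch).

    Hypothetical helper. Assumptions: an inline '@branch' takes precedence
    over the `branch` argument, and 'main' is used when neither is given
    (matching the documented default).
    """
    if '@' in repo:
        name, _, inline_branch = repo.partition('@')
        return name, inline_branch
    return repo, branch or 'main'


print(split_repo_branch('owner/repo'))               # ('owner/repo', 'main')
print(split_repo_branch('owner/repo@dev'))           # ('owner/repo', 'dev')
print(split_repo_branch('owner/repo', branch='v2'))  # ('owner/repo', 'v2')
```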
- classmethod from_hf(repo, branch=None, *, expected_sha256=None)
Load an artifact from Hugging Face Hub.
- Return type:
- Parameters:
repo – The Hugging Face repository name.
branch – The branch or tag of the repository to load. (Default: main) A branch can also be provided directly in the repository name using owner/repo@branch.
expected_sha256 – The expected SHA-256 hash of the artifact. If provided, the downloaded artifact will be verified against this hash and an error will be raised if the hash does not match.
Load a PISA index from HuggingFace Hub
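The expected_sha256 check amounts to hashing the downloaded bytes and comparing the digest with the expected one. The sketch below shows that idea with the standard library only; verify_sha256 is a hypothetical helper, and the exception type and case-insensitive comparison are assumptions rather than documented behaviour of from_hf().

```python
import hashlib


def verify_sha256(data, expected_sha256=None):
    """Return the SHA-256 hex digest of `data`, raising on a mismatch.

    Hypothetical helper: raising ValueError and ignoring letter case in
    the expected digest are assumptions made for this sketch.
    """
    actual = hashlib.sha256(data).hexdigest()
    if expected_sha256 is not None and actual != expected_sha256.lower():
        raise ValueError(f'SHA-256 mismatch: expected {expected_sha256}, got {actual}')
    return actual


# SHA-256 of the empty byte string is a well-known constant.
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
print(verify_sha256(b'', EMPTY_SHA256) == EMPTY_SHA256)  # True
```

Pinning the hash this way guards against loading an index that was changed or corrupted after you first validated it.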
- bm25(k1=0.9, b=0.4, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)
Creates a BM25 retrieval transformer over this index.
- Parameters:
k1 – BM25 k1 parameter
b – BM25 b parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
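For readers unfamiliar with the k1 and b parameters, the sketch below scores a single term with a textbook BM25 formula: k1 controls how quickly repeated occurrences saturate, and b controls how strongly long documents are penalised. This is the common textbook form with one widely used IDF variant; it is an assumption that PISA's scorer follows the same shape, and its exact variant may differ.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    tf: term frequency in the document; df: number of documents containing
    the term. Not PISA's implementation, just the standard formula.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: 1.0 for an average-length document.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, df=10, num_docs=1000)
long_doc = bm25_term_score(tf=3, doc_len=500, avg_doc_len=100, df=10, num_docs=1000)
print(short_doc > long_doc)  # True: same tf, but the longer document scores lower
```

Setting b=0 removes length normalisation entirely, so both documents above would receive the same score.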
- dph(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)
Creates a DPH retrieval transformer over this index.
- Parameters:
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
- pl2(c=1.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)
Creates a PL2 retrieval transformer over this index.
- Parameters:
c – PL2 c parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
- qld(mu=1000.0, num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)
Creates a QLD retrieval transformer over this index.
- Parameters:
mu – QLD mu parameter
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
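The role of mu can be seen in the textbook query-likelihood formula with Dirichlet smoothing, sketched below for a single term. The function name and argument names are illustrative, and it is an assumption that PISA's QLD scorer matches this form exactly.

```python
import math


def qld_term_score(tf, doc_len, collection_prob, mu=1000.0):
    """Log-probability of one query term under a Dirichlet-smoothed document model.

    collection_prob is the term's relative frequency in the whole collection.
    Textbook formula, not PISA's implementation.
    """
    return math.log((tf + mu * collection_prob) / (doc_len + mu))


# A larger mu pulls the estimate towards the collection model,
# so the document's own term count matters less.
low_mu = qld_term_score(tf=5, doc_len=100, collection_prob=0.001, mu=10.0)
high_mu = qld_term_score(tf=5, doc_len=100, collection_prob=0.001, mu=1e9)
print(low_mu > high_mu)  # True
```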
- quantized(num_results=1000, verbose=False, threads=None, query_algorithm=None, query_weighted=None, toks_scale=100.0)
Creates a quantized retrieval transformer over this index.
This transformer is used for scoring as a dot product (e.g., for learned sparse retrieval).
- Parameters:
num_results – number of results to return per query
verbose – if True, print progress
threads – number of threads to use
query_algorithm – the query algorithm to use
query_weighted – if True, the query is weighted
toks_scale – scale factor to apply to toks fields
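Dot-product scoring, as mentioned for the quantized transformer, can be sketched as summing the products of query and document weights over the terms they share. The function dot_product_score and the toy weights are assumptions for illustration; the real scoring happens inside PISA over the stored integer impacts.

```python
def dot_product_score(query_toks, doc_toks):
    """Sum of query_weight * doc_weight over terms present in both mappings."""
    return sum(weight * doc_toks[tok] for tok, weight in query_toks.items() if tok in doc_toks)


# Toy integer impacts, invented for illustration only.
query = {'covid': 2, 'symptom': 1}
document = {'covid': 150, 'fever': 80}
print(dot_product_score(query, document))  # 300
```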
- static from_ciff(ciff_file, index_path, overwrite=False, stemmer=PisaStemmer.porter2)
Creates a PISA index from a CIFF file.
- Parameters:
ciff_file – The path to the CIFF file
index_path – The path to the index
overwrite – If True, the index will be overwritten if it already exists. Defaults to False.
stemmer – The stemmer to use. Defaults to porter2.
- to_ciff(ciff_file, description='from pyterrier_pisa')
Converts this index to a CIFF file.
- Parameters:
ciff_file – The path to the CIFF file
description – The description to write to the CIFF file.
- get_corpus_iter(field='toks', verbose=True)
Iterates over the indexed corpus, yielding a dictionary for each document.
- Parameters:
field – The field name to yield. Defaults to ‘toks’.
verbose – If True, print progress.
- indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None)
Create an indexer for this index.
- Parameters:
text_field – The field name to index. Defaults to ‘text’.
mode – The indexing mode to use. Defaults to PisaIndexingMode.create.
threads – The number of threads to use. Defaults to the number of threads used to create the index.
batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.
- toks_indexer(text_field=None, mode=PisaIndexingMode.create, threads=None, batch_size=None, scale=100.0)
Create an indexer over pre-tokenized text for this index.
- Parameters:
text_field – The field name to index. Defaults to ‘toks’.
mode – The indexing mode to use. Defaults to PisaIndexingMode.create.
threads – The number of threads to use. Defaults to the number of threads used to create the index.
batch_size – The batch size to use during indexing. Defaults to the batch size used to create the index.
scale – The scale factor to apply to the token counts. Defaults to 100.
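One way to picture the scale parameter: pre-tokenized documents often carry floating-point term weights, while an inverted index stores integer counts, so each weight is multiplied by the scale factor and converted to an integer. The helper scale_toks below is hypothetical, and truncating with int() (rather than rounding) is an assumption about how the conversion is done.

```python
def scale_toks(toks, scale=100.0):
    """Convert float token weights into integer pseudo term frequencies.

    Hypothetical helper: truncation via int() is an assumption.
    """
    return {tok: int(weight * scale) for tok, weight in toks.items()}


doc_toks = {'covid': 1.5, 'fever': 0.25}
print(scale_toks(doc_toks))             # {'covid': 150, 'fever': 25}
print(scale_toks(doc_toks, scale=4.0))  # {'covid': 6, 'fever': 1}
```

A smaller scale factor discards more precision: with scale=1.0, the weight 0.25 would collapse to 0.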
- enum pyterrier_pisa.PisaStemmer(value)
Represents a built-in stemming function from PISA.
Valid values are as follows:
- none = <PisaStemmer.none: 'none'>
- porter2 = <PisaStemmer.porter2: 'porter2'>
- krovetz = <PisaStemmer.krovetz: 'krovetz'>
- enum pyterrier_pisa.PisaScorer(value)
Represents a built-in scoring function from PISA.
Valid values are as follows:
- bm25 = <PisaScorer.bm25: 'bm25'>
- dph = <PisaScorer.dph: 'dph'>
- pl2 = <PisaScorer.pl2: 'pl2'>
- qld = <PisaScorer.qld: 'qld'>
- quantized = <PisaScorer.quantized: 'quantized'>
- enum pyterrier_pisa.PisaIndexEncoding(value)
Represents a built-in index encoding type from PISA.
Valid values are as follows:
- ef = <PisaIndexEncoding.ef: 'ef'>
- single = <PisaIndexEncoding.single: 'single'>
- pefuniform = <PisaIndexEncoding.pefuniform: 'pefuniform'>
- pefopt = <PisaIndexEncoding.pefopt: 'pefopt'>
- block_optpfor = <PisaIndexEncoding.block_optpfor: 'block_optpfor'>
- block_varintg8iu = <PisaIndexEncoding.block_varintg8iu: 'block_varintg8iu'>
- block_streamvbyte = <PisaIndexEncoding.block_streamvbyte: 'block_streamvbyte'>
- block_maskedvbyte = <PisaIndexEncoding.block_maskedvbyte: 'block_maskedvbyte'>
- block_interpolative = <PisaIndexEncoding.block_interpolative: 'block_interpolative'>
- block_qmx = <PisaIndexEncoding.block_qmx: 'block_qmx'>
- block_varintgb = <PisaIndexEncoding.block_varintgb: 'block_varintgb'>
- block_simple8b = <PisaIndexEncoding.block_simple8b: 'block_simple8b'>
- block_simple16 = <PisaIndexEncoding.block_simple16: 'block_simple16'>
- block_simdbp = <PisaIndexEncoding.block_simdbp: 'block_simdbp'>
References
Citation
Mallia et al. PISA: Performant Indexes and Search for Academia. OSIRRC@SIGIR 2019. [link]
@inproceedings{DBLP:conf/sigir/MalliaSMS19, author = {Antonio Mallia and Michal Siedlaczek and Joel M. Mackenzie and Torsten Suel}, editor = {Ryan Clancy and Nicola Ferro and Claudia Hauff and Jimmy Lin and Tetsuya Sakai and Ze Zhong Wu}, title = {{PISA:} Performant Indexes and Search for Academia}, booktitle = {Proceedings of the Open-Source {IR} Replicability Challenge co-located with 42nd International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019}, series = {{CEUR} Workshop Proceedings}, volume = {2409}, pages = {50--56}, publisher = {CEUR-WS.org}, year = {2019}, url = {https://ceur-ws.org/Vol-2409/docker08.pdf}, timestamp = {Fri, 10 Mar 2023 16:22:17 +0100}, biburl = {https://dblp.org/rec/conf/sigir/MalliaSMS19.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
Citation
MacAvaney and Macdonald. A Python Interface to PISA!. SIGIR 2022. [link]
@inproceedings{DBLP:conf/sigir/MacAvaneyM22, author = {Sean MacAvaney and Craig Macdonald}, editor = {Enrique Amig{\'{o}} and Pablo Castells and Julio Gonzalo and Ben Carterette and J. Shane Culpepper and Gabriella Kazai}, title = {A Python Interface to PISA!}, booktitle = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022}, pages = {3339--3344}, publisher = {{ACM}}, year = {2022}, url = {https://doi.org/10.1145/3477495.3531656}, doi = {10.1145/3477495.3531656}, timestamp = {Sat, 09 Jul 2022 09:25:34 +0200}, biburl = {https://dblp.org/rec/conf/sigir/MacAvaneyM22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }