BMP API Reference
- class bmp.pyterrier.BmpIndex(path)
Represents a Block-Max Pruning Index stored on disk.
Citation
Mallia et al. Faster Learned Sparse Retrieval with Block-Max Pruning. SIGIR 2024.
@inproceedings{DBLP:conf/sigir/MalliaST24,
  author = {Antonio Mallia and Torsten Suel and Nicola Tonellotto},
  editor = {Grace Hui Yang and Hongning Wang and Sam Han and Claudia Hauff and Guido Zuccon and Yi Zhang},
  title = {Faster Learned Sparse Retrieval with Block-Max Pruning},
  booktitle = {Proceedings of the 47th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, {SIGIR} 2024, Washington DC, USA, July 14-18, 2024},
  pages = {2411--2415},
  publisher = {{ACM}},
  year = {2024},
  url = {https://doi.org/10.1145/3626772.3657906},
  doi = {10.1145/3626772.3657906},
  timestamp = {Sun, 19 Jan 2025 13:11:15 +0100},
  biburl = {https://dblp.org/rec/conf/sigir/MalliaST24.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
path (str) – Path to the index directory.
- built()
Checks whether the index has been built.
- Returns:
True if the index exists on disk, False otherwise.
- Return type:
bool
- indexer(*, bsize=32, compress_range=False, scale_float=100.0)
Creates a bmp.pyterrier.BmpIndexer for indexing documents.
- Parameters:
bsize (int) – Block size for block-max pruning.
compress_range (bool) – Whether to compress the index.
scale_float (float) – Scaling factor for converting float token values into integers.
- Returns:
The indexer instance.
- Return type:
BmpIndexer
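To make the role of scale_float concrete, here is a minimal sketch of float-to-integer scaling. It is an illustration only, not the library's code: the helper name quantize_toks is hypothetical, token weights are assumed to be a mapping from token to float, and rounding to the nearest integer is an assumption (the actual conversion may truncate or clamp instead).

```python
from typing import Dict


def quantize_toks(toks: Dict[str, float], scale_float: float = 100.0) -> Dict[str, int]:
    """Hypothetical helper: multiply each float weight by scale_float and round.

    Rounding to nearest is an assumption made for this sketch.
    """
    return {tok: int(round(weight * scale_float)) for tok, weight in toks.items()}


doc_toks = {"block": 1.2, "max": 0.704, "pruning": 0.031}

# Default scale keeps two decimal places of each weight.
print(quantize_toks(doc_toks))

# A smaller scale loses resolution for small weights.
print(quantize_toks(doc_toks, scale_float=10.0))
```

In this sketch a larger scale_float preserves more of the distinction between small weights; with a scale of 10.0 the weight 0.031 collapses to 0.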
- index(inp)
Index the documents with default settings.
- Return type:
- Parameters:
inp (Iterable[Dict[str, Any]]) – An iterable of documents (dicts containing docno and toks keys) to index.
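A minimal sketch of building an index with default settings might look like the following. The document contents and the helper names example_docs and build_example_index are hypothetical, and the shape of toks (a mapping from token to float weight) is an assumption based on the "float token values" wording above; only BmpIndex, built() and index() come from this reference.

```python
from typing import Any, Dict, List


def example_docs() -> List[Dict[str, Any]]:
    # Assumed document shape: "toks" maps each token to a float weight.
    return [
        {"docno": "d1", "toks": {"block": 1.2, "max": 0.7, "pruning": 0.9}},
        {"docno": "d2", "toks": {"sparse": 1.5, "retrieval": 1.1}},
    ]


def build_example_index(path: str):
    # Imported lazily so the sketch can be loaded without bmp installed.
    from bmp.pyterrier import BmpIndex

    index = BmpIndex(path)
    if not index.built():  # skip indexing when the index already exists on disk
        index.index(example_docs())
    return index
```

Calling build_example_index("example.bmp") (a placeholder path) would index the two toy documents on the first run and reuse the on-disk index afterwards.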
- retriever(*, num_results=1000, alpha=1.0, beta=1.0)
Creates a bmp.pyterrier.BmpRetriever for this index.
- Parameters:
num_results (int) – The number of results to retrieve per query.
alpha (float) – Block termination threshold (terminate retrieval when the maximum block score is less than alpha of the threshold). Decreasing this value increases the chance that documents are missed, but speeds up retrieval by pruning more blocks. For exact retrieval, use alpha=1.0 (default).
beta (float) – Query term pruning factor (keeps the top beta weight of query terms). Decreasing this value introduces score approximation error, but reduces computational cost. For exact scoring, use beta=1.0 (default).
- Returns:
The retriever instance.
- Return type:
BmpRetriever
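The effect of beta can be illustrated with a toy model of query term pruning. This is a sketch under assumptions, not the library's implementation: it reads "keeps the top beta weight of query terms" as keeping the highest-weighted terms until their cumulative weight reaches beta times the total query weight, and the function name prune_query_terms is hypothetical.

```python
from typing import Dict


def prune_query_terms(query_toks: Dict[str, float], beta: float = 1.0) -> Dict[str, float]:
    """Toy model: keep the heaviest terms until beta of the total weight is covered."""
    budget = beta * sum(query_toks.values())
    kept: Dict[str, float] = {}
    covered = 0.0
    # Visit terms from highest to lowest weight.
    for tok, weight in sorted(query_toks.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= budget:
            break  # the weight budget is already covered; drop the remaining terms
        kept[tok] = weight
        covered += weight
    return kept


query = {"block": 5.0, "max": 3.0, "pruning": 2.0}
print(prune_query_terms(query, beta=1.0))  # all terms kept, scoring stays exact
print(prune_query_terms(query, beta=0.5))  # only the heaviest term survives
```

Under this reading, dropped terms no longer contribute to document scores, which is where the score approximation error comes from.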
- transform(inp)
Retrieve documents from the index for the given queries using default settings (exact retrieval).
- Parameters:
inp (DataFrame) – A DataFrame containing queries with a query_toks column.
- Returns:
DataFrame containing retrieved documents with docno, score, and rank columns.
- Return type:
DataFrame
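As a rough usage sketch, retrieval over a built index could look like this. The query contents and the helper names example_queries and search are hypothetical; the qid column follows the usual PyTerrier convention and the shape of query_toks (token to float weight) is an assumption. The index argument is expected to be a BmpIndex that has already been built.

```python
from typing import Any, Dict, List


def example_queries() -> List[Dict[str, Any]]:
    # Assumed query shape: "query_toks" maps each query token to a float weight.
    return [
        {"qid": "q1", "query_toks": {"block": 1.0, "pruning": 0.8}},
        {"qid": "q2", "query_toks": {"sparse": 1.3}},
    ]


def search(index, *, num_results: int = 10):
    # pandas is imported lazily so the sketch loads without it installed.
    import pandas as pd

    queries = pd.DataFrame(example_queries())
    # Default settings: exact retrieval, up to 1000 results per query.
    exact = index.transform(queries)
    # Same queries through an explicitly configured retriever.
    top = index.retriever(num_results=num_results).transform(queries)
    return exact, top
```

Both returned frames are expected to carry the docno, score, and rank columns described above.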
- class bmp.pyterrier.BmpIndexer(bmp_index, bsize=32, compress_range=False, scale_float=100.0)
An indexer for a BMP index.
- Parameters:
bmp_index (BmpIndex) – BMP index object to create.
bsize (int) – Block size for block-max pruning.
compress_range (bool) – Whether to compress the index.
scale_float (float) – Scaling factor for converting float token values into integers.
- Returns:
The indexer instance.
- Return type:
- class bmp.pyterrier.BmpRetriever(bmp_index, *, num_results=1000, alpha=1.0, beta=1.0)
A transformer that retrieves over a BMP index.
- Parameters:
bmp_index (BmpIndex) – BMP index object to retrieve over.
num_results (int) – The number of results to retrieve per query.
alpha (float) – Block termination threshold (terminate retrieval when the maximum block score is less than alpha of the threshold). Decreasing this value increases the chance that documents are missed, but speeds up retrieval by pruning more blocks. For exact retrieval, use alpha=1.0 (default).
beta (float) – Query term pruning factor (keeps the top beta weight of query terms). Decreasing this value introduces score approximation error, but reduces computational cost. For exact scoring, use beta=1.0 (default).
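The trade-off behind alpha can be seen in a toy model of block-level early termination. This sketch is not the library's implementation and makes several assumptions: document scores are given up front, blocks are visited in descending order of their maximum score, and processing stops once alpha times the block maximum no longer exceeds the current top-k threshold. That stopping rule is one reading chosen because it matches the documented effect (a lower alpha prunes more blocks); the function name toy_block_max_search is hypothetical.

```python
import heapq
from typing import List, Tuple

Block = List[Tuple[str, float]]  # non-empty list of (docno, score) pairs


def toy_block_max_search(blocks: List[Block], k: int, alpha: float = 1.0):
    """Toy top-k search over scored blocks with alpha-scaled early termination."""
    # Visit blocks from the highest maximum score to the lowest.
    ordered = sorted(blocks, key=lambda b: max(s for _, s in b), reverse=True)
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top-k
    visited = 0
    for block in ordered:
        upper = max(s for _, s in block)
        # The threshold is the k-th best score so far (none until k docs are found).
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if alpha * upper <= threshold:
            break  # assumed rule: no remaining block is considered worth scoring
        visited += 1
        for docno, score in block:
            if len(heap) < k:
                heapq.heappush(heap, (score, docno))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, docno))
    ranked = [(docno, score) for score, docno in sorted(heap, reverse=True)]
    return ranked, visited


blocks = [
    [("d1", 9.0), ("d2", 3.0)],
    [("d3", 8.0), ("d4", 1.0)],
    [("d5", 4.0), ("d6", 2.0)],
]
print(toy_block_max_search(blocks, k=2, alpha=1.0))  # exact: d1 and d3, two blocks visited
print(toy_block_max_search(blocks, k=2, alpha=0.3))  # approximate: d3 is missed, one block visited
```

With alpha=1.0 the toy search returns the true top two documents; with alpha=0.3 it stops after the first block and misses d3, trading accuracy for fewer blocks scored.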