Doc2Query + PyTerrier

pyterrier-doc2query provides PyTerrier transformers for Doc2Query and related methods.

Install pyterrier-doc2query with pip
pip install pyterrier-doc2query

What does it do?

A Doc2Query transformer takes the text of each document and generates queries that the document is likely to answer.

import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query()
sample_doc = "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated"
doc2query([{"docno" : "d1", "text" : sample_doc}])

The resulting dataframe will have an additional "querygen" column, which contains the generated queries, such as:

docno | querygen
“d1”  | “what was the importance of the manhattan project to the united states atom project? what influenced the success of the united states why was the manhattan project a success? why was it important”

As a PyTerrier transformer, there are many ways to introduce Doc2Query into a PyTerrier retrieval process.

By default, the plugin loads macavaney/doc2query-t5-base-msmarco, which is the checkpoint released by the original authors, converted to PyTorch format. You can load a different T5 model by passing its HuggingFace model name (or a path to a model on the file system) as the first argument:

doc2query = pyterrier_doc2query.Doc2Query('some/other/model')

Using Doc2Query for Indexing

You can index with Doc2Query by piping the results from Doc2Query into an indexer. For instance,

Build a Terrier index over documents expanded with Doc2Query
import pyterrier as pt
from pyterrier_doc2query import Doc2Query
dataset = pt.get_dataset("irds:vaswani")
doc2query = Doc2Query(append=True) # append generated queries to the original document text
index_loc = './doc2query_index'    # directory in which to build the Terrier index
indexer = doc2query >> pt.IterDictIndexer(index_loc)
indexer.index(dataset.get_corpus_iter())
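Once indexing completes, the expanded index can be searched like any other Terrier index. Below is a minimal sketch, reusing the index_loc directory and the vaswani dataset from the snippet above; BM25 is just an illustrative choice of weighting model.

Retrieve from the Doc2Query-expanded index
bm25 = pt.terrier.Retriever(index_loc, wmodel="BM25")
bm25(dataset.get_topics())  # run the vaswani topics against the expanded index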

The generation process is expensive. Consider using pyterrier_caching.IndexerCache to cache the generated queries in case you need them again.

Two-step indexing with Doc2Query: cache the generated queries, then index
import pyterrier as pt
from pyterrier_caching import IndexerCache
from pyterrier_doc2query import Doc2Query
dataset = pt.get_dataset("irds:vaswani")
doc2query = Doc2Query(append=True) # append generated queries to the original document text
# Step 1: Generate queries and cache them
cache = IndexerCache('doc2query.cache')
(doc2query >> cache).index(dataset.get_corpus_iter())
# Step 2: Index from the cache
indexer = pt.IterDictIndexer('doc2query.terrier')
indexer.index(cache.get_corpus_iter())

Doc2Query–: When Less is More

The performance of Doc2Query can be significantly improved by removing queries that are not relevant to the documents that generated them. This involves first scoring the generated queries (using QueryScorer) and then filtering out the least relevant ones (using QueryFilter).

Scoring and filtering queries from Doc2Query
import pyterrier as pt
from pyterrier_doc2query import Doc2Query, QueryScorer, QueryFilter
from pyterrier_dr import ElectraScorer

dataset = pt.get_dataset("irds:msmarco-passage")
doc2query = Doc2Query(append=False, num_samples=5)  # keep the generated queries separate so they can be scored
scorer = ElectraScorer()
indexer = pt.IterDictIndexer('./index')
# t=3.21484375 is the 70th percentile of generated-query scores on MS MARCO
pipeline = doc2query >> QueryScorer(scorer) >> QueryFilter(t=3.21484375) >> indexer

pipeline.index(dataset.get_corpus_iter())
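If you are indexing a corpus other than MS MARCO, the fixed threshold above may not be appropriate. The sketch below estimates a threshold from a sample of your own corpus; it assumes QueryScorer stores its per-document score arrays in a querygen_score column, which you should verify against the version you have installed.

Estimating a corpus-specific filter threshold (sketch)
import itertools
import numpy as np
import pandas as pd

# Score the generated queries for a small sample of documents (reusing doc2query and scorer from above)
sample = pd.DataFrame(list(itertools.islice(dataset.get_corpus_iter(), 1000)))
scored = (doc2query >> QueryScorer(scorer))(sample)
# Assumption: QueryScorer writes one array of scores per document to 'querygen_score'
all_scores = np.concatenate([np.atleast_1d(s) for s in scored['querygen_score']])
t = float(np.percentile(all_scores, 70))  # keep roughly the top 30% of generated queries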

We’ve also released pre-computed filter scores for various models as HuggingFace datasets.

Using Doc2Query for Retrieval

Doc2Query can also be applied at retrieval time (i.e., to retrieved documents) rather than at indexing time, for instance in conjunction with pt.text.scorer() to re-rank retrieved documents using the generated queries.

Re-rank documents using Doc2Query
import pyterrier as pt
import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query()

dataset = pt.get_dataset("irds:vaswani")
bm25 = pt.terrier.Retriever.from_dataset("vaswani", "terrier_stemmed", wmodel="BM25")
pipeline = bm25 >> pt.get_text(dataset) >> doc2query >> pt.text.scorer(body_attr="querygen", wmodel="BM25")
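To check whether the Doc2Query re-ranking stage helps on this collection, the two systems can be compared with a standard PyTerrier experiment. A minimal sketch, reusing bm25 and pipeline from above; the metrics are an arbitrary choice.

Compare BM25 against the Doc2Query re-ranking pipeline
pt.Experiment(
    [bm25, pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> Doc2Query rerank"],
)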

API Documentation

Core Functionality

class pyterrier_doc2query.Doc2Query(*args, **kwargs)

A Transformer that generates queries from documents.

Parameters:
  • checkpoint – The checkpoint to use for the model. Defaults to ‘macavaney/doc2query-t5-base-msmarco’.

  • num_samples – The number of queries to generate per document.

  • batch_size – The batch size to use for inference.

  • doc_attr – The attribute in the input DataFrame that contains the documents.

  • append – If True, the generated queries are appended to the documents. Otherwise, the queries are stored in a separate attribute.

  • out_attr – The attribute in the output DataFrame to store the generated queries.

  • verbose – If True, displays a progress bar.

  • fast_tokenizer – If True, uses the fast tokenizer.

  • device – The device to use for inference. If None, defaults to ‘cuda’ if available, otherwise ‘cpu’.
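
As an illustration of these options, the following sketch constructs a Doc2Query transformer that generates several queries per document and keeps them in a separate column; all values are arbitrary examples, not recommended settings.

Constructing Doc2Query with explicit options (sketch)
import pyterrier_doc2query

doc2query = pyterrier_doc2query.Doc2Query(
    num_samples=10,       # generate 10 queries per document
    batch_size=8,         # documents per inference batch
    append=False,         # keep the queries in a separate column rather than appending to the text
    out_attr='querygen',  # output column for the generated queries
    verbose=True,         # show a progress bar
)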

Filtering Poor Generated Queries

class pyterrier_doc2query.QueryScorer(scorer)

A Transformer that scores queries generated by Doc2Query with the provided scorer transformer.

Parameters:

scorer – A pyterrier Transformer that takes a DataFrame with columns ‘query’, ‘text’, ‘qid’ and returns a DataFrame with columns ‘qid’, ‘score’.

class pyterrier_doc2query.QueryFilter(t, append=True)

A Transformer that filters out queries based on their scores (from QueryScorer) and the threshold t.

Parameters:
  • t – The threshold to filter queries by. The score must be larger than this value to pass the filter.

  • append – If True, the queries that pass the filter are appended to the document text. Otherwise, they remain in a separate attribute.

Caching

class pyterrier_doc2query.Doc2QueryStore(path)

An Indexer that caches and loads the queries generated by Doc2Query.

See also

This cache is deprecated in favor of pyterrier_caching.IndexerCache.

Parameters:

path – The path to the cache.

class pyterrier_doc2query.QueryScoreStore(path)

An Indexer that caches and loads generated query scores from QueryScorer.

Parameters:

path – The path to the cache.

References

Citation

Nogueira et al. Document Expansion by Query Prediction. arXiv 2019. [link]
@article{DBLP:journals/corr/abs-1904-08375,
  author       = {Rodrigo Frassetto Nogueira and
                  Wei Yang and
                  Jimmy Lin and
                  Kyunghyun Cho},
  title        = {Document Expansion by Query Prediction},
  journal      = {CoRR},
  volume       = {abs/1904.08375},
  year         = {2019},
  url          = {http://arxiv.org/abs/1904.08375},
  eprinttype    = {arXiv},
  eprint       = {1904.08375},
  timestamp    = {Fri, 20 May 2022 15:34:53 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1904-08375.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Nogueira and Lin. From doc2query to docTTTTTquery. 2019. [link]

Citation

Gospodinov et al. Doc2Query-: When Less is More. ECIR (2) 2023. [link]
@inproceedings{DBLP:conf/ecir/GospodinovMM23,
  author       = {Mitko Gospodinov and
                  Sean MacAvaney and
                  Craig Macdonald},
  editor       = {Jaap Kamps and
                  Lorraine Goeuriot and
                  Fabio Crestani and
                  Maria Maistro and
                  Hideo Joho and
                  Brian Davis and
                  Cathal Gurrin and
                  Udo Kruschwitz and
                  Annalina Caputo},
  title        = {Doc2Query-: When Less is More},
  booktitle    = {Advances in Information Retrieval - 45th European Conference on Information
                  Retrieval, {ECIR} 2023, Dublin, Ireland, April 2-6, 2023, Proceedings,
                  Part {II}},
  series       = {Lecture Notes in Computer Science},
  volume       = {13981},
  pages        = {414--422},
  publisher    = {Springer},
  year         = {2023},
  url          = {https://doi.org/10.1007/978-3-031-28238-6\_31},
  doi          = {10.1007/978-3-031-28238-6\_31},
  timestamp    = {Tue, 21 Mar 2023 16:23:57 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/GospodinovMM23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}