Working with Document Texts

Many modern retrieval techniques operate directly on the text of documents. PyTerrier supports these forms of interaction.

Indexing and Retrieval of Text in Terrier indices

If you are using a Terrier index for your first-stage ranking, you will want to record the text of the documents in the MetaIndex. The following configuration demonstrates saving the text of the documents in the Terrier index MetaIndex when indexing a TREC-formatted corpus:

files = []  # list of filenames to be indexed
indexer = pt.TRECCollectionIndexer(INDEX_DIR,
    # record that we save additional document metadata called 'text'
    meta= {'docno' : 26, 'text' : 2048},
    # The tags from which to save the text. ELSE is special tag name, which means anything not consumed by other tags.
    meta_tags = {'text' : 'ELSE'},
    verbose=True)
indexref = indexer.index(files)
index = pt.IndexFactory.of(indexref)

On the other hand, for a TSV-formatted corpus such as MSMARCO passages, indexing is easier using IterDictIndexer:

def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            # strip the trailing newline and split on the first tab only
            docno, passage = l.rstrip("\n").split("\t", 1)
            yield {'docno' : docno, 'text' : passage}

iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})
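
The generator above yields one dictionary per document. The following is a minimal sketch of the same parsing logic on an in-memory TSV, using only the standard library. The helper name parse_tsv and the two sample rows are illustrative assumptions, not part of PyTerrier or of the MSMARCO corpus:

```python
import io

def parse_tsv(lines):
    """Yield {'docno', 'text'} dicts from 'docno<TAB>passage' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split on the first tab only, so tabs inside the passage are preserved
        docno, passage = line.split("\t", 1)
        yield {"docno": docno, "text": passage}

# hypothetical two-document corpus
sample = io.StringIO("0\tfirst passage text\n1\tsecond passage\twith a tab\n")
docs = list(parse_tsv(sample))
print(docs[0])
```

Each yielded dictionary has the same keys as the ones passed to the indexer above.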

During retrieval you will need to have the text stored as an attribute in your dataframes.

This can be achieved in one of several ways:
  • requesting document metadata when using BatchRetrieve

  • adding document metadata later using get_text()

BatchRetrieve accepts a metadata keyword-argument which allows for additional metadata attributes to be retrieved.

Alternatively, the pt.text.get_text() transformer can be used, which can extract metadata from a Terrier index or IRDSDataset for documents already retrieved. The main advantage of using IRDSDataset is that it supports all document fields, not just those that were included as meta fields when indexing.
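
Conceptually, adding text after retrieval is a lookup by docno. The sketch below illustrates that idea with plain Python dictionaries; add_text, results and docstore are hypothetical names chosen for illustration, and this is not how PyTerrier implements get_text():

```python
def add_text(results, docstore, fields=("text",)):
    """Attach the requested document fields to each result row, keyed by docno."""
    enriched = []
    for row in results:
        doc = docstore[row["docno"]]
        enriched.append({**row, **{field: doc[field] for field in fields}})
    return enriched

# hypothetical retrieved results (no text yet) and a docno -> fields lookup
results = [
    {"qid": "q1", "docno": "d2", "score": 3.2},
    {"qid": "q1", "docno": "d1", "score": 1.7},
]
docstore = {
    "d1": {"title": "First", "text": "text of the first document"},
    "d2": {"title": "Second", "text": "text of the second document"},
}
with_text = add_text(results, docstore)
print(with_text[0]["text"])  # text of the second document
```

The result order and scores are left untouched; only the extra attributes are added.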

Examples:

# the following pipelines are equivalent
pipe1 = pt.BatchRetrieve(index, metadata=["docno", "body"])

pipe2 = pt.BatchRetrieve(index) >> pt.text.get_text(index, "body")

dataset = pt.get_dataset('irds:vaswani')
pipe3 = pt.BatchRetrieve(index) >> pt.text.get_text(dataset, "text")

pyterrier.text.get_text(indexlike, metadata='body', by_query=False, verbose=False)

A utility transformer for obtaining the text of documents (or other document metadata) from Terrier’s MetaIndex or an IRDSDataset docstore.

Return type:

Transformer

Parameters:
  • indexlike – a Terrier index or IRDSDataset to retrieve the metadata from

  • metadata (list(str) or str) – the metadata key, or a list of the metadata keys, to retrieve from the index. Defaults to ‘body’.

  • by_query (bool) – whether the dataframe should be processed one query at a time, rather than all at once. Defaults to False, which means that all document metadata will be fetched at once.

  • verbose (bool) – whether to print a tqdm progress bar. Defaults to False. Has no effect when by_query=False.

Example:

pipe = ( pt.BatchRetrieve(index, wmodel="DPH")
    >> pt.text.get_text(index)
    >> pt.text.scorer(wmodel="DPH") )

Scoring query/text similarity

pyterrier.text.scorer(*args, **kwargs)

Scores documents with respect to a query, without creating an index first. This is an alias for pt.TextScorer(). Internally, a Terrier memory index is created from the provided texts and then used for scoring.

Example:

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
textscorerTf = pt.text.scorer(body_attr="text", wmodel="Tf")
rtr = textscorerTf.transform(df)
# rtr will have a score for each document for the query "chemical reactions" based on the provided document contents
# both attain score 1, as, after stemming, they both contain one occurrence of the query term 'chemical'
# ["q1", "chemical reactions", "d1", "professor protor poured the chemicals", 0, 1]
# ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats", 0, 1]
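
To make the arithmetic behind these Tf scores concrete, here is a minimal pure-Python sketch. It is not Terrier's implementation: crude_stem is a hypothetical stand-in for a real stemmer that only strips a trailing "s", and tf_score is an illustrative helper name:

```python
import re

def crude_stem(token):
    # crude stand-in for a real stemmer (assumption): strip one trailing 's'
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def tf_score(query, text):
    """Sum, over the distinct query terms, of each term's frequency in the text."""
    doc_terms = [crude_stem(t) for t in re.findall(r"\w+", text.lower())]
    query_terms = {crude_stem(t) for t in re.findall(r"\w+", query.lower())}
    return sum(doc_terms.count(q) for q in query_terms)

rows = [
    ("q1", "chemical reactions", "d1", "professor protor poured the chemicals"),
    ("q1", "chemical reactions", "d2", "chemical brothers turned up the beats"),
]
scores = {docno: tf_score(query, text) for _, query, docno, text in rows}
print(scores)  # {'d1': 1, 'd2': 1}
```

Both documents contain a single occurrence of the stemmed term "chemical" and none of "reaction", hence the score of 1 for each.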

For calculating the scores of documents using any weighting model with the concept of IDF, it may be useful to make use of an existing Terrier index for background statistics:

textscorerTfIdf = pt.text.scorer(body_attr="text", wmodel="TF_IDF", background_index=index)

Return type:

Transformer

Other text scorers are available in the form of neural re-rankers, which are provided separately from PyTerrier; see Neural Rankers and Rerankers.

Working with Passages rather than Documents

As documents can be long, relevant content may only be found in a small portion of the document. Moreover, some models are more suited to operating on small parts of the document. For this reason, passage-based retrieval techniques have been conceived. PyTerrier supports the creation of passages from longer documents, and the aggregation of scores from these passages.

pyterrier.text.sliding(text_attr='body', length=150, stride=75, join=' ', prepend_attr='title', tokenizer=None, **kwargs)

A useful transformer for splitting long documents into smaller passages within a pipeline. This applies a sliding window over the text, where each passage is the given number of tokens long. Passages can overlap if the stride is set smaller than the length. In applying this transformer, docnos are altered by adding ‘%p’ and a passage number. A score for each original document can then be recovered using aggregation functions, such as max_passage().

For the purposes of obtaining passages of a given length, the tokenisation can be controlled. By default, tokenisation takes place by splitting on whitespace, i.e. based on the Python regular expression re.compile(r'\s+'). However, more fine-grained tokenisation can be applied by passing an object matching the HuggingFace Transformers Tokenizer API as the tokenizer keyword argument. In short, the tokenizer object must have a .tokenize(str) -> list[str] method and a .convert_tokens_to_string(list[str]) -> str method for detokenisation.

Return type:

Transformer

Parameters:
  • text_attr (str) – what is the name of the dataframe attribute containing the main text of the document to be split into passages. Default is ‘body’.

  • length (int) – how many tokens in each passage. Default is 150.

  • stride (int) – how many tokens to advance each passage by. Default is 75.

  • prepend_attr (str) – the name of another document attribute, such as the title of the document, to prepend to each passage, following [Dai2019]. Defaults to ‘title’.

  • title_attr (str) – what is the name of the dataframe attribute containing the title of the document to be split into passages. Default is ‘title’. Only used if prepend_title is set to True.

  • tokenizer (obj) – which model to use for tokenizing. The object must have a .tokenize(str) -> list[str] method for tokenization and .convert_tokens_to_string(list[str]) -> str for detokenization. Default is None, in which case tokenisation is performed by splitting on one-or-more whitespace characters, i.e. based on the Python regular expression re.compile(r'\s+').

Example:

pipe = ( pt.BatchRetrieve(index, wmodel="DPH", metadata=["docno", "body"])
    >> pt.text.sliding(length=128, stride=64, prepend_attr=None)
    >> pt.text.scorer(wmodel="DPH")
    >> pt.text.max_passage() )

# tokenizer model
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
pipe = (pt.BatchRetrieve(index, wmodel="DPH", metadata=["docno", "body"])
    >> pt.text.sliding(length=128, stride=64, prepend_attr=None, tokenizer=tok)
    >> pt.text.scorer(wmodel="DPH")
    >> pt.text.max_passage() )

Example Inputs and Outputs:

Consider the following dataframe with one or more documents:

qid   docno   text
q1    d1      a b c d

The result of applying pyterrier.text.sliding(text_attr='text', length=2, stride=1, prepend_attr=None) would be:

qid   docno   text
q1    d1%p1   a b
q1    d1%p2   b c
q1    d1%p3   c d
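
The windowing itself can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than PyTerrier's code: sliding_passages is a hypothetical helper, passage numbers start at 1 to match the table above, and a trailing window shorter than length is simply dropped (how PyTerrier treats that tail is not confirmed here):

```python
import re

def sliding_passages(docno, text, length=150, stride=75):
    """Split text into overlapping whitespace-token windows of `length` tokens."""
    tokens = [t for t in re.split(r"\s+", text) if t]
    if len(tokens) <= length:
        # short documents become a single passage
        return [(f"{docno}%p1", " ".join(tokens))]
    passages = []
    # keep full windows only; a shorter trailing window is dropped in this sketch
    starts = range(0, len(tokens) - length + 1, stride)
    for number, start in enumerate(starts, start=1):
        passages.append((f"{docno}%p{number}", " ".join(tokens[start:start + length])))
    return passages

print(sliding_passages("d1", "a b c d", length=2, stride=1))
# [('d1%p1', 'a b'), ('d1%p2', 'b c'), ('d1%p3', 'c d')]
```

With stride equal to length the windows do not overlap; a smaller stride makes consecutive passages share tokens.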

pyterrier.text.max_passage()

Scores each document based on the maximum score of any constituent passage. Applied after a sliding window transformation has been scored.

Return type:

Transformer

pyterrier.text.first_passage()

Scores each document based on the score of the first passage of that document. Note that this transformer is rarely used in conjunction with the sliding window transformer, as all passages would need to be scored, only for the first one to be used.

Return type:

Transformer

pyterrier.text.mean_passage()

Scores each document based on the mean score of all constituent passages. Applied after a sliding window transformation has been scored.

Return type:

Transformer

pyterrier.text.kmaxavg_passage(k)

Scores each document based on the average score of its k top-scoring passages. This generalises both mean_passage() and max_passage(). Proposed in [Chen2020].

Return type:

Transformer

Parameters:

k (int) – The number of top-scored passages for each document to use when scoring

Examples

Assume that a retrieval pipeline, such as sliding() followed by scorer(), returns a dataframe that looks like this:

qid   docno   rank   score
q1    d1%p5   0      5.0
q1    d2%p4   1      4.0
q1    d1%p3   2      3.0
q1    d1%p1   3      1.0

The output of the max_passage() transformer would be:

qid   docno   rank   score
q1    d1      0      5.0
q1    d2      1      4.0

The output of the mean_passage() transformer would be:

qid   docno   rank   score
q1    d2      0      4.0
q1    d1      1      3.0

Here d1 averages its three passage scores (5.0, 3.0 and 1.0) to 3.0, so d2 ranks first.

The output of the first_passage() transformer would be:

qid   docno   rank   score
q1    d2      0      4.0
q1    d1      1      1.0

Finally, the output of the kmaxavg_passage(2) transformer would be:

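qid   docno   rank   score
q1    d1      0      4.0
q1    d2      1      4.0

Both documents score 4.0: d1 averages its two highest passage scores (5.0 and 3.0), while d2 has only a single passage. The relative rank of the two tied documents depends on how ties are broken.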
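
The four aggregation strategies can be compared side by side with a small pure-Python sketch over the passage scores of the input dataframe above. The helpers group_passage_scores and aggregate are hypothetical names used for illustration; this is not PyTerrier's implementation, and "first" is assumed here to mean the lowest passage number available for a document:

```python
from collections import defaultdict

def group_passage_scores(rows):
    """Group (passage_docno, score) rows by parent docno, keyed by passage number."""
    grouped = defaultdict(dict)
    for passage_docno, score in rows:
        docno, _, pid = passage_docno.rpartition("%p")
        grouped[docno][int(pid)] = score
    return grouped

def aggregate(rows, how, k=None):
    """Return {docno: score} under the 'max', 'mean', 'first' or 'kmaxavg' strategy."""
    out = {}
    for docno, by_pid in group_passage_scores(rows).items():
        scores = list(by_pid.values())
        if how == "max":
            out[docno] = max(scores)
        elif how == "mean":
            out[docno] = sum(scores) / len(scores)
        elif how == "first":
            out[docno] = by_pid[min(by_pid)]  # lowest passage number
        elif how == "kmaxavg":
            top = sorted(scores, reverse=True)[:k]
            out[docno] = sum(top) / len(top)
        else:
            raise ValueError(f"unknown strategy: {how}")
    return out

rows = [("d1%p5", 5.0), ("d2%p4", 4.0), ("d1%p3", 3.0), ("d1%p1", 1.0)]
for how in ("max", "mean", "first"):
    print(how, aggregate(rows, how))
print("kmaxavg", aggregate(rows, "kmaxavg", k=2))
```

Sorting each returned dictionary by descending score gives the document ranking for that strategy.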

Query-biased Summarisation (Snippets)

pyterrier.text.snippets(text_scorer_pipe, text_attr='text', summary_attr='summary', num_psgs=5, joinstr='...')

Performs query-biased summarisation (snippet generation) by applying the specified text scoring pipeline.

Return type:

Transformer

Parameters:
  • text_scorer_pipe (Transformer) – the pipeline for scoring passages in response to the query. Normally this applies passaging.

  • text_attr (str) – what is the name of the attribute that contains the text of the document

  • summary_attr (str) – what is the name of the attribute that should contain the query-biased summary for that document

  • num_psgs (int) – how many passages to select for the summary of each document

  • joinstr (str) – how to join passages for a given document together

Example:

# retrieve documents with text
br = pt.BatchRetrieve(index, metadata=['docno', 'text'])

# use Tf as a passage scorer on sliding window passages
psg_scorer = (
    pt.text.sliding(text_attr='text', length=15, prepend_attr=None)
    >> pt.text.scorer(body_attr="text", wmodel='Tf', takes='docs')
)

# use psg_scorer for performing query-biased summarisation on docs retrieved by br
retr_pipe = br >> pt.text.snippets(psg_scorer)
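
The idea behind query-biased summarisation can be sketched without PyTerrier: split the document into passages, score each passage against the query, keep the best few, and join them. The function below is a hypothetical illustration with a simple term-count scorer; joining the selected passages in document order is a choice made for this sketch, not a statement about how snippets() orders them:

```python
import re

def snippet(query, text, length=15, stride=15, num_psgs=5, joinstr="..."):
    """Build a summary from the num_psgs passages sharing most terms with the query."""
    tokens = text.split()
    query_terms = set(re.findall(r"\w+", query.lower()))
    # fixed-size windows; stride == length gives non-overlapping passages
    passages = [tokens[i:i + length] for i in range(0, len(tokens), stride)]

    def score(passage):
        # number of tokens in the passage that are query terms
        return sum(1 for t in passage if re.sub(r"\W", "", t.lower()) in query_terms)

    ranked = sorted(range(len(passages)), key=lambda i: score(passages[i]), reverse=True)
    chosen = sorted(ranked[:num_psgs])  # restore document order
    return joinstr.join(" ".join(passages[i]) for i in chosen)

text = "the terrier platform builds an inverted index for fast retrieval of documents"
summary = snippet("terrier index", text, length=3, stride=3, num_psgs=2)
print(summary)  # the terrier platform...index for fast
```

Swapping the score function for a stronger passage scorer changes which passages are selected, while the selection and joining steps stay the same.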

Examples of Sentence-Transformers

Here we demonstrate the use of pt.apply.doc_score(..., batch_size=128) to allow an easy application of Sentence Transformers for reranking BM25 results:

import pandas as pd
from sentence_transformers import CrossEncoder, SentenceTransformer
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
bimodel = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def _crossencoder_apply(df : pd.DataFrame):
    return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

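def _biencoder_apply(df : pd.DataFrame):
    from sentence_transformers.util import cos_sim
    query_embs = bimodel.encode(df['query'].tolist())
    doc_embs = bimodel.encode(df['text'].tolist())
    # cos_sim returns the full query x document similarity matrix; the diagonal
    # holds the score of each row's own (query, text) pair, which remains correct
    # even when a batch spans more than one query
    scores = cos_sim(query_embs, doc_embs)
    return scores.diagonal().numpy()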

bi_encT = pt.apply.doc_score(_biencoder_apply, batch_size=128)

pt.Experiment(
    [ bm25, bm25 >> bi_encT, bm25 >> cross_encT ],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"],
    names=["BM25", "BM25 >> BiEncoder", "BM25 >> CrossEncoder"]
)
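
As we understand it, a batch-wise scoring function should return one score per row of the dataframe it receives. The following pure-Python sketch shows such row-wise cosine scoring on toy 2-d vectors; the vectors and the helper names cosine and rowwise_scores are illustrative assumptions and stand in for real sentence embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rowwise_scores(query_vecs, doc_vecs):
    """One score per row: row i pairs query i with document i."""
    return [cosine(q, d) for q, d in zip(query_vecs, doc_vecs)]

# toy embeddings for a batch of three rows spanning two different queries
query_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(rowwise_scores(query_vecs, doc_vecs))  # [1.0, 0.0, 1.0]
```

The third row is scored against its own query vector, not against the query of the first row, which matters whenever a batch contains more than one query.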

You can browse the whole notebook or try it yourself on Colab.

References

  • [Chen2020] ICIP at TREC-2020 Deep Learning Track. X. Chen et al. Proceedings of TREC 2020.

  • [Dai2019] Deeper Text Understanding for IR with Contextual Neural Language Modeling. Z. Dai & J. Callan. Proceedings of SIGIR 2019.