Working with Document Texts¶
Many modern retrieval techniques operate directly on the text of documents. PyTerrier supports these forms of interaction.
Indexing and Retrieval of Text in Terrier indices¶
If you are using a Terrier index for your first-stage ranking, you will want to record the text of the documents in the MetaIndex. The following configuration demonstrates saving the text of the documents in the Terrier index MetaIndex when indexing a TREC-formatted corpus:
files = []  # list of filenames to be indexed
indexer = pt.TRECCollectionIndexer(INDEX_DIR,
    # record that we save additional document metadata called 'text'
    meta={'docno' : 26, 'text' : 2048},
    # the tags from which to save the text. ELSE is a special tag name, which means anything not consumed by other tags
    meta_tags={'text' : 'ELSE'},
    verbose=True)
indexref = indexer.index(files)
index = pt.IndexFactory.of(indexref)
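To confirm that the text was stored, the MetaIndex can be inspected directly. A quick, illustrative check (assuming the index built above):

# illustrative check that the 'text' metadata was recorded, using the index built above
meta = index.getMetaIndex()
print(meta.getKeys())           # should include 'docno' and 'text'
print(meta.getItem("text", 0))  # the stored text of the first document (docid 0)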
On the other hand, for a TSV-formatted corpus such as MSMARCO passages, indexing is easier using IterDictIndexer:
def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            docno, passage = l.split("\t")
            yield {'docno' : docno, 'text' : passage}

iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})
During retrieval you will need to have the text stored as an attribute in your dataframes. This can be achieved in one of several ways:

- requesting document metadata when using BatchRetrieve
- adding document metadata later using get_text()
BatchRetrieve accepts a metadata keyword-argument which allows for additional metadata attributes to be retrieved.
Alternatively, the pt.text.get_text() transformer can be used, which can extract metadata from a Terrier index or IRDSDataset for documents already retrieved. The main advantage of using IRDSDataset is that it supports all document fields, not just those that were included as meta fields when indexing.
Examples:
# the following pipelines are equivalent
pipe1 = pt.BatchRetrieve(index, metadata=["docno", "body"])
pipe2 = pt.BatchRetrieve(index) >> pt.text.get_text(index, "body")
dataset = pt.get_dataset('irds:vaswani')
pipe3 = pt.BatchRetrieve(index) >> pt.text.get_text(dataset, "text")
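Running one of these pipelines attaches the requested metadata as a dataframe column. A brief usage sketch (the query text is illustrative):

res = pipe2.search("chemical reactions")   # query text is illustrative
print(res.columns)  # includes 'qid', 'docno', 'score', 'rank' and the added 'body'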
- pyterrier.text.get_text(indexlike, metadata='body', by_query=False, verbose=False)[source]¶
A utility transformer for obtaining the text of documents (or other document metadata) from Terrier’s MetaIndex or an IRDSDataset docstore.
- Return type: Transformer
- Parameters
indexlike – a Terrier index or IRDSDataset to retrieve the metadata from
metadata (list(str) or str) – a list of strings of the metadata keys to retrieve from the index. Defaults to [“body”]
by_query (bool) – whether the dataframe should be processed one query at a time, rather than all at once. Defaults to false, which means that all document metadata will be fetched at once.
verbose (bool) – whether to print a tqdm progress bar. Defaults to false. Has no effect when by_query=False
Example:
pipe = ( pt.BatchRetrieve(index, wmodel="DPH")
    >> pt.text.get_text(index)
    >> pt.text.scorer(wmodel="DPH") )
Scoring query/text similarity¶
- pyterrier.text.scorer(*args, **kwargs)[source]¶
This allows scoring of the documents with respect to a query, without creating an index first. This is an alias to pt.TextScorer(). Internally, a Terrier memory index is created, before being used for scoring.
Example:
df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
textscorerTf = pt.text.scorer(body_attr="text", wmodel="Tf")
rtr = textscorerTf.transform(df)
# rtr will have a score for each document for the query "chemical reactions" based on the provided document contents
# both attain score 1, as, after stemming, they both contain one occurrence of the query term 'chemical'
# ["q1", "chemical reactions", "d1", "professor protor poured the chemicals", 0, 1]
# ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats", 0, 1]
For calculating the scores of documents using any weighting model with the concept of IDF, it may be useful to make use of an existing Terrier index for background statistics:
textscorerTfIdf = pt.text.scorer(body_attr="text", wmodel="TF_IDF", background_index=index)
- Return type: Transformer
Other text scorers are available in the form of neural re-rankers, separate from PyTerrier; see Neural Rankers and Rerankers.
Working with Passages rather than Documents¶
As documents are long, relevant content may only be found in a small portion of the document. Moreover, some models are more suited to operating on small parts of the document. For this reason, passage-based retrieval techniques have been conceived. PyTerrier supports the creation of passages from longer documents, and for the aggregation of scores from these passages.
- pyterrier.text.sliding(text_attr='body', length=150, stride=75, join=' ', prepend_attr='title', **kwargs)[source]¶
A useful transformer for splitting long documents into smaller passages within a pipeline. This applies a sliding window over the text, where each passage is the given number of tokens long. Passages can overlap, if the stride is set smaller than the length. In applying this transformer, docnos are altered by adding ‘%p’ and a passage number. The original scores for each document can be recovered by aggregation functions, such as max_passage().
For the purposes of obtaining passages of a given length, tokenisation is performed simply by splitting on one-or-more spaces, i.e. based on the Python regular expression
re.compile(r'\s+')
- Return type: Transformer
- Parameters
text_attr (str) – what is the name of the dataframe attribute containing the main text of the document to be split into passages. Default is ‘body’.
length (int) – how many tokens in each passage. Default is 150.
stride (int) – how many tokens to advance each passage by. Default is 75.
prepend_attr (str) – whether to prepend another document attribute, such as the title of the document, to the text of each passage, following [Dai2019]. Defaults to ‘title’.
title_attr (str) – what is the name of the dataframe attribute containing the title of the document to be split into passages. Default is ‘title’. Only used if prepend_attr is set.
Example:
pipe = ( pt.BatchRetrieve(index, wmodel="DPH", metadata=["docno", "body"])
    >> pt.text.sliding(length=128, stride=64, prepend_attr=None)
    >> pt.text.scorer(wmodel="DPH")
    >> pt.text.max_passage() )
Example Inputs and Outputs:
Consider the following dataframe with one or more documents:
qid | docno | text
---|---|---
q1 | d1 | a b c d
The result of applying pyterrier.text.sliding(length=2, stride=1, prepend_attr=None) would be:
qid | docno | text
---|---|---
q1 | d1%p1 | a b
q1 | d1%p2 | b c
q1 | d1%p3 | c d
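This behaviour can also be reproduced directly on a dataframe. A minimal sketch, assuming pt is an imported and initialised PyTerrier:

import pandas as pd

# a minimal sketch reproducing the example above
df = pd.DataFrame([["q1", "d1", "a b c d"]], columns=["qid", "docno", "text"])
passages = pt.text.sliding(text_attr="text", length=2, stride=1, prepend_attr=None).transform(df)
# passages.docno is now d1%p1, d1%p2, d1%p3, with texts 'a b', 'b c', 'c d'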
- pyterrier.text.max_passage()[source]¶
Scores each document based on the maximum score of any constituent passage. Applied after a sliding window transformation has been scored.
- Return type: Transformer
- pyterrier.text.first_passage()[source]¶
Scores each document based on the score of the first passage of that document. Note that this transformer is rarely used in conjunction with the sliding window transformer, as all passages would need to be scored, only for the first one to be used.
- Return type: Transformer
- pyterrier.text.mean_passage()[source]¶
Scores each document based on the mean score of all constituent passages. Applied after a sliding window transformation has been scored.
- Return type: Transformer
- pyterrier.text.kmaxavg_passage(k)[source]¶
Scores each document based on the average score of the top scoring k passages. Generalises combination of mean_passage() and max_passage(). Proposed in [Chen2020].
- Return type: Transformer
- Parameters
k (int) – The number of top-scored passages for each document to use when scoring
Examples¶
Assume that a retrieval pipeline, such as sliding() followed by scorer(), returns a dataframe that looks like this:
qid | docno | rank | score
---|---|---|---
q1 | d1%p5 | 0 | 5.0
q1 | d2%p4 | 1 | 4.0
q1 | d1%p3 | 2 | 3.0
q1 | d1%p1 | 3 | 1.0
The output of the max_passage() transformer would be:
qid | docno | rank | score
---|---|---|---
q1 | d1 | 0 | 5.0
q1 | d2 | 1 | 4.0
The output of the mean_passage() transformer would be (d1’s mean over its three passages is (5.0 + 3.0 + 1.0)/3 = 3.0):
qid | docno | rank | score
---|---|---|---
q1 | d2 | 0 | 4.0
q1 | d1 | 1 | 3.0
The output of the first_passage() transformer would be:
qid | docno | rank | score
---|---|---|---
q1 | d2 | 0 | 4.0
q1 | d1 | 1 | 1.0
Finally, the output of the kmaxavg_passage(2) transformer would be (d1 scores the mean of its two highest-scoring passages, (5.0 + 3.0)/2 = 4.0, while d2 has only a single passage to average):
qid | docno | rank | score
---|---|---|---
q1 | d1 | 0 | 4.0
q1 | d2 | 1 | 4.0
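These aggregators can also be applied directly to a dataframe like the one above. A minimal sketch, assuming pt is an imported and initialised PyTerrier:

import pandas as pd

# a minimal sketch applying max_passage() to the input table above
df = pd.DataFrame([
    ["q1", "chemical reactions", "d1%p5", 0, 5.0],
    ["q1", "chemical reactions", "d2%p4", 1, 4.0],
    ["q1", "chemical reactions", "d1%p3", 2, 3.0],
    ["q1", "chemical reactions", "d1%p1", 3, 1.0],
], columns=["qid", "query", "docno", "rank", "score"])
res = pt.text.max_passage().transform(df)
# res scores each document by its best passage: d1 -> 5.0, d2 -> 4.0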
Query-biased Summarisation (Snippets)¶
- pyterrier.text.snippets(text_scorer_pipe, text_attr='text', summary_attr='summary', num_psgs=5, joinstr='...')[source]¶
Applies query-biased summarisation (snippets), by applying the specified text scoring pipeline.
- Return type: Transformer
- Parameters
text_scorer_pipe (Transformer) – the pipeline for scoring passages in response to the query. Normally this applies passaging.
text_attr (str) – what is the name of the attribute that contains the text of the document
summary_attr (str) – what is the name of the attribute that should contain the query-biased summary for that document
num_psgs (int) – how many passages to select for the summary of each document
joinstr (str) – how to join passages for a given document together
Example:
# retrieve documents with text
br = pt.BatchRetrieve(index, metadata=['docno', 'text'])

# use Tf as a passage scorer on sliding window passages
psg_scorer = (
    pt.text.sliding(text_attr='text', length=15, prepend_attr=None)
    >> pt.text.scorer(body_attr="text", wmodel='Tf', takes='docs')
)

# use psg_scorer for performing query-biased summarisation on docs retrieved by br
retr_pipe = br >> pt.text.snippets(psg_scorer)
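A brief usage sketch (the query text is illustrative): each retrieved document gains a summary_attr column (‘summary’ by default) containing its query-biased snippet:

res = retr_pipe.search("chemical reactions")   # query text is illustrative
print(res[["docno", "summary"]].head())        # 'summary' holds the query-biased snippet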
Examples of Sentence-Transformers¶
Here we demonstrate the use of pt.apply.doc_score(..., batch_size=128) to allow an easy application of Sentence Transformers for reranking BM25 results:
import pandas as pd
from sentence_transformers import CrossEncoder, SentenceTransformer
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
bimodel = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def _crossencoder_apply(df : pd.DataFrame):
    return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

def _biencoder_apply(df : pd.DataFrame):
    from sentence_transformers.util import cos_sim
    query_embs = bimodel.encode(df['query'].values)
    doc_embs = bimodel.encode(df['text'].values)
    scores = cos_sim(query_embs, doc_embs)
    return scores[0]

bi_encT = pt.apply.doc_score(_biencoder_apply, batch_size=128)

pt.Experiment(
    [ bm25, bm25 >> bi_encT, bm25 >> cross_encT ],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"],
    names=["BM25", "BM25 >> BiEncoder", "BM25 >> CrossEncoder"]
)
You can browse the whole notebook or try it yourself on Colab.
References¶
[Chen2020] ICIP at TREC-2020 Deep Learning Track, X. Chen et al. Proceedings of TREC 2020.
[Dai2019] Deeper Text Understanding for IR with Contextual Neural Language Modeling. Z. Dai & J. Callan. Proceedings of SIGIR 2019.