Working with Document Texts

Many modern retrieval techniques are concerned with operating directly on the text of documents. PyTerrier supports these forms of interactions.

Indexing and Retrieval of Text in Terrier indices

If you are using a Terrier index for your first-stage ranking, you will want to record the text of the documents in the MetaIndex. The following configuration saves the document text as a 'text' metadata field in the Terrier index MetaIndex when indexing a TREC-formatted corpus:

files = []  # list of filenames to be indexed
indexer = pt.TRECCollectionIndexer(INDEX_DIR,
    # record that we save additional document metadata called 'text'
    meta= {'docno' : 26, 'text' : 2048},
    # The tags from which to save the text. ELSE is a special tag name, which means anything not consumed by other tags.
    meta_tags = {'text' : 'ELSE'},
    verbose=True)
indexref = indexer.index(files)
index = pt.IndexFactory.of(indexref)

On the other hand, for a TSV-formatted corpus such as MSMARCO passages, indexing is easier using IterDictIndexer:

def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            # strip the trailing newline so that it is not stored as part of the passage
            docno, passage = l.rstrip("\n").split("\t", 1)
            yield {'docno' : docno, 'text' : passage}

iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})
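
The integer values given in meta are, as we understand it, the maximum lengths (in characters) reserved for each metadata field, so they need to be large enough for the longest value you expect to store. As a rough sketch (the helper name max_field_lengths is hypothetical and not part of PyTerrier), you could scan a sample of the documents before indexing to choose suitable sizes:

```python
from typing import Dict, Iterable


def max_field_lengths(docs: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Return the length (in characters) of the longest value seen for each field."""
    longest: Dict[str, int] = {}
    for doc in docs:
        for field, value in doc.items():
            longest[field] = max(longest.get(field, 0), len(value))
    return longest


# A tiny stand-in for the dictionaries yielded by a generator such as msmarco_generate()
sample = [
    {"docno": "0", "text": "The presence of communication amid scientific minds"},
    {"docno": "1234567", "text": "A short passage"},
]
print(max_field_lengths(sample))
```

The reported lengths can then be rounded up generously when filling in the meta dictionary.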

During retrieval you will need to have the text stored as an attribute in your dataframes.

This can be achieved in one of several ways:

requesting document metadata when using Retriever
adding document metadata later using get_text()

Retriever accepts a metadata keyword-argument which allows for additional metadata attributes to be retrieved.

Alternatively, the pt.text.get_text() transformer can be used, which can extract metadata from a Terrier index or IRDSDataset for documents already retrieved. The main advantage of using IRDSDataset is that it supports all document fields, not just those that were included as meta fields when indexing.

Examples:

# the following pipelines are equivalent
pipe1 = pt.terrier.Retriever(index, metadata=["docno", "body"])

pipe2 = pt.terrier.Retriever(index) >> pt.text.get_text(index, "body")

dataset = pt.get_dataset('irds:vaswani')
pipe3 = pt.terrier.Retriever(index) >> pt.text.get_text(dataset, "text")

pyterrier.text.get_text(indexlike, metadata='*', by_query=False, verbose=False, **kwargs)

A utility transformer for obtaining the text of documents (or other document metadata) from Terrier’s MetaIndex or an IRDSDataset docstore.

Parameters:

indexlike (HasTextLoader | str) – an object that provides a .text_loader() factory method, such as a Terrier index or IRDSDataset. If a str is provided, it will try to load a Terrier index from the provided path.
metadata (str | List[str] | Literal['*']) – The names of the fields to load. If a list of strings, all fields are provided. If a single string, this single field is provided. If the special value of ‘*’ (default), all available fields are provided.
by_query (bool) – whether to process the dataframe one query at a time, rather than all at once. Defaults to False, which means that all document metadata will be fetched at once.
verbose (bool) – whether to print a tqdm progress bar. When by_query=True, prints progress by query. Otherwise, the behaviour is defined by the provided indexlike.
kwargs (Any) – other arguments to pass through to the text_loader.

Returns:

a transformer that loads the text of documents from the provided indexlike.

Return type:

pt.Transformer

Raises:

ValueError – if indexlike does not provide a .text_loader() method.

Example (Terrier Index):

index = pt.IndexFactory.of("./index/")
pipe = ( pt.terrier.Retriever(index, wmodel="DPH")
    >> pt.text.get_text(index) # load text using a PyTerrier index
    >> pt.text.scorer(wmodel="DPH") )

Example (IR Datasets):

# see https://github.com/terrierteam/pyterrier_t5
from pyterrier_t5 import MonoT5ReRanker
bm25 = pt.terrier.Retriever.from_dataset(pt.get_dataset('msmarcov2_passage'), wmodel='BM25')
# load text using IR Datasets
loader = pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), ['text'])
monoT5 = bm25 >> loader >> MonoT5ReRanker()
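
Conceptually, a text loader looks up each retrieved docno in a document store and attaches the requested fields as new columns. The plain-Python sketch below illustrates that idea only; the function add_text and the in-memory docstore are hypothetical stand-ins, not PyTerrier's implementation:

```python
from typing import Dict, List


def add_text(results: List[dict], docstore: Dict[str, Dict[str, str]],
             fields: List[str]) -> List[dict]:
    """Return copies of the result rows with the requested fields attached."""
    enriched = []
    for row in results:
        doc = docstore[row["docno"]]          # look up the document by its docno
        extra = {f: doc[f] for f in fields}   # keep only the requested fields
        enriched.append({**row, **extra})
    return enriched


docstore = {
    "d1": {"title": "Chemistry", "text": "professor protor poured the chemicals"},
    "d2": {"title": "Music", "text": "chemical brothers turned up the beats"},
}
results = [
    {"qid": "q1", "docno": "d2", "rank": 0, "score": 2.5},
    {"qid": "q1", "docno": "d1", "rank": 1, "score": 1.0},
]
for row in add_text(results, docstore, ["text"]):
    print(row["docno"], row["text"])
```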

Scoring query/text similarity

pyterrier.text.scorer(*args, **kwargs)

This allows scoring of the documents with respect to a query, without creating an index first. This is an alias to pt.TextScorer(). Internally, a Terrier memory index is created, before being used for scoring.

Parameters:

body_attr – which dataframe input column contains the text of the document. Default is “body”.
wmodel – name of the weighting model to use for scoring.
background_index – An optional background index to use for collection statistics. If a weighting model such as BM25 or TF_IDF or PL2 is used without setting the background_index, the background statistics will be calculated from the dataframe, which is usually not the desired behaviour.
args – other arguments to pass through to the TextScorer.
kwargs – other arguments to pass through to the TextScorer.

Returns:

a transformer that scores the documents with respect to a query.

Return type:

pt.Transformer

Example:

import pandas as pd

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
textscorerTf = pt.text.scorer(body_attr="text", wmodel="Tf")
rtr = textscorerTf.transform(df)
# rtr will have a score for each document for the query "chemical reactions" based on the provided document contents
# both attain score 1, as, after stemming, they both contain one occurrence of the query term 'chemical'
# ["q1", "chemical reactions", "d1", "professor protor poured the chemicals", 0, 1]
# ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats", 0, 1]

For calculating the scores of documents using any weighting model with the concept of IDF, it is strongly advised to make use of an existing Terrier index for background statistics. Without a background index, IDF will be calculated based on the supplied dataframe (for models such as BM25, this can lead to negative scores):

textscorerTfIdf = pt.text.scorer(body_attr="text", wmodel="TF_IDF", background_index=index)
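
To see why a background index matters, consider the Robertson/Sparck Jones IDF that underlies many BM25 formulations, log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The sketch below assumes this form (the exact formula used by a given Terrier weighting model may differ) and shows that the IDF turns negative when a term occurs in more than half of the documents, which is common when the statistics come from a small dataframe of retrieved documents:

```python
import math


def rsj_idf(num_docs: int, doc_freq: int) -> float:
    """Robertson/Sparck Jones IDF; an assumed form, for illustration only."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Statistics taken from a two-document dataframe where both documents contain the term
print(rsj_idf(num_docs=2, doc_freq=2))
# Statistics taken from a (hypothetical) background collection of one million documents
print(rsj_idf(num_docs=1_000_000, doc_freq=2))
```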

One possible pipeline is to retrieve documents, get their text, and then re-score them using a text-based scorer such as BM25 or even MonoT5 from pyterrier_t5.

[Pipeline schematic: BM25 Retriever >> IRDSTextLoader (dataset 'vaswani', fields ['text']) >> TextScorer]

[Pipeline schematic: BM25 Retriever >> RankCutoff (% 100) >> IRDSTextLoader (dataset 'vaswani', fields ['text']) >> MonoT5ReRanker]

Other text scorers are available in the form of neural re-rankers, which are provided in packages separate from PyTerrier; see Neural Rankers and Rerankers.

Working with Passages rather than Documents

As documents can be long, relevant content may only be found in a small portion of the document. Moreover, some models are more suited to operating on small parts of the document. For this reason, passage-based retrieval techniques have been conceived. PyTerrier supports the creation of passages from longer documents, and the aggregation of scores from these passages.

pyterrier.text.sliding(text_attr='body', length=150, stride=75, join=' ', prepend_attr='title', tokenizer=None, **kwargs)

A useful transformer for splitting long documents into smaller passages within a pipeline. This applies a sliding window over the text, where each passage is the given number of tokens long. Passages can overlap if the stride is set smaller than the length. In applying this transformer, docnos are altered by adding ‘%p’ and a passage number. Scores for the original documents can be recovered by aggregation functions, such as max_passage().

For the purposes of obtaining passages of a given length, the tokenisation can be controlled. By default, tokenisation takes place by splitting on whitespace, i.e. based on the Python regular expression re.compile(r'\s+'). However, more fine-grained tokenisation can be applied by passing an object matching the HuggingFace Transformers Tokenizer API as the tokenizer keyword argument. In short, the tokenizer object must have a .tokenize(str) -> list[str] method and .convert_tokens_to_string(list[str]) -> str for detokenisation.

Parameters:

text_attr – what is the name of the dataframe attribute containing the main text of the document to be split into passages. Default is ‘body’.
length – how many tokens in each passage. Default is 150.
stride – how many tokens to advance each passage by. Default is 75.
join – how to join the tokens of the passage together. Default is ‘ ‘.
prepend_attr – the name of another document attribute, such as the title of the document, to prepend to each passage, following [Dai2019]. Defaults to ‘title’. Set to None to prepend nothing.
tokenizer – which model to use for tokenizing. The object must have a .tokenize(str) -> list[str] method for tokenization and .convert_tokens_to_string(list[str]) -> str for detokenization. Default is None, in which case tokenisation is performed by splitting on one or more whitespace characters, i.e. based on the Python regular expression re.compile(r'\s+').
kwargs – other arguments to pass through to the SlidingWindowPassager.

Returns:

a transformer that splits the documents into passages.

Return type:

pt.Transformer

Raises:

KeyError – if the text_attr or prepend_attr columns are not found in the input dataframe.

Example:

pipe = ( pt.terrier.Retriever(index, wmodel="DPH", metadata=["docno", "body"])
    >> pt.text.sliding(length=128, stride=64, prepend_attr=None)
    >> pt.text.scorer(wmodel="DPH")
    >> pt.text.max_passage() )

# tokenizer model
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
pipe = (pt.terrier.Retriever(index, wmodel="DPH", metadata=["docno", "body"])
    >> pt.text.sliding(length=128, stride=64, prepend_attr=None, tokenizer=tok)
    >> pt.text.scorer(wmodel="DPH")
    >> pt.text.max_passage() )
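
Any object providing the two methods described above should be usable as a tokenizer, not only HuggingFace tokenizers. A minimal sketch of such an object follows; the class LowercaseWordTokenizer is hypothetical and exists only to illustrate the expected interface:

```python
import re
from typing import List


class LowercaseWordTokenizer:
    """Hypothetical tokenizer exposing the two methods described above."""

    def tokenize(self, text: str) -> List[str]:
        # split on runs of non-word characters and lowercase each token
        return [tok.lower() for tok in re.split(r"\W+", text) if tok]

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        # detokenise by joining with single spaces
        return " ".join(tokens)


tok = LowercaseWordTokenizer()
tokens = tok.tokenize("Chemical reactions, explained!")
print(tokens)
print(tok.convert_tokens_to_string(tokens[:2]))
```

Assuming that sliding() calls only these two methods, as the description states, such an object could be passed as tokenizer=tok.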

Example Inputs and Outputs:

Consider the following dataframe with one or more documents:

qid	docno	text
q1	d1	a b c d

The result of applying pyterrier.text.sliding(text_attr='text', length=2, stride=1, prepend_attr=None) would be:

qid	docno	text
q1	d1%p1	a b
q1	d1%p2	b c
q1	d1%p3	c d
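
The windowing behaviour in this example can be sketched in plain Python. This is an illustration rather than the library's implementation: the function sliding_passages is hypothetical, passage numbering follows the example above, and the handling of short documents and trailing partial windows is an assumption.

```python
import re
from typing import List, Tuple


def sliding_passages(docno: str, text: str, length: int, stride: int) -> List[Tuple[str, str]]:
    """Split text into (passage docno, passage text) pairs using a sliding window."""
    tokens = [t for t in re.split(r"\s+", text) if t]
    if len(tokens) <= length:
        # assumption: a document shorter than one window becomes a single passage
        starts = [0]
    else:
        # assumption: only full windows are kept, so a trailing partial window is dropped
        starts = range(0, len(tokens) - length + 1, stride)
    return [
        (f"{docno}%p{number}", " ".join(tokens[start:start + length]))
        for number, start in enumerate(starts, start=1)
    ]


for passage in sliding_passages("d1", "a b c d", length=2, stride=1):
    print(passage)
```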

pyterrier.text.max_passage()

Scores each document based on the maximum score of any constituent passage. Applied after a sliding window transformation has been scored.

Return type: Transformer

pyterrier.text.first_passage()

Scores each document based on the score of the first passage of that document. Note that this transformer is rarely used in conjunction with the sliding window transformer, as all passages would need to be scored, only for the first one to be used.

Return type: Transformer

pyterrier.text.mean_passage()

Scores each document based on the mean score of all constituent passages. Applied after a sliding window transformation has been scored.

Return type: Transformer

pyterrier.text.kmaxavg_passage(k)

Scores each document based on the average score of the top scoring k passages. Generalises the combination of mean_passage() and max_passage(). Proposed in [Chen2020].

Parameters: k (int) – The number of top-scored passages for each document to use when scoring
Return type: Transformer

Examples

Assume that a retrieval pipeline, such as sliding() followed by scorer(), returns a dataframe that looks like this:

qid	docno	rank	score
q1	d1%p5	0	5.0
q1	d2%p4	1	4.0
q1	d1%p3	2	3.0
q1	d1%p1	3	1.0

The output of the max_passage() transformer would be:

qid	docno	rank	score
q1	d1	0	5.0
q1	d2	1	4.0

The output of the mean_passage() transformer would be:

qid	docno	rank	score
q1	d2	0	4.0
q1	d1	1	3.0

The output of the first_passage() transformer would be:

qid	docno	rank	score
q1	d2	0	4.0
q1	d1	1	1.0

Finally, the output of the kmaxavg_passage(2) transformer would be:

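qid	docno	rank	score
q1	d1	0	4.0
q1	d2	1	4.0

Here both documents obtain a score of 4.0 (d1 from the average of 5.0 and 3.0, d2 from its single passage), so the relative order of the two tied documents is illustrative.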
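
The arithmetic behind these aggregations can be checked with a short plain-Python sketch over the example passage scores. The helpers aggregate and kmaxavg are hypothetical and only mirror the descriptions of the transformers above; in particular, averaging over fewer than k passages when a document has fewer than k is an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (passage docno, score) pairs, as in the example dataframe above
passages = [("d1%p5", 5.0), ("d2%p4", 4.0), ("d1%p3", 3.0), ("d1%p1", 1.0)]


def aggregate(rows: List[Tuple[str, float]],
              combine: Callable[[List[float]], float]) -> Dict[str, float]:
    """Group passage scores by parent docno and combine them into one score."""
    by_doc = defaultdict(list)
    for passage_docno, score in rows:
        docno, passage_number = passage_docno.rsplit("%p", 1)
        by_doc[docno].append((int(passage_number), score))
    # hand the scores to `combine` in passage order
    return {d: combine([s for _, s in sorted(ps)]) for d, ps in by_doc.items()}


def kmaxavg(k: int) -> Callable[[List[float]], float]:
    """Average of the k highest passage scores (fewer if the document has fewer passages)."""
    def combine(scores: List[float]) -> float:
        top = sorted(scores, reverse=True)[:k]
        return sum(top) / len(top)
    return combine


print("max  ", aggregate(passages, max))
print("mean ", aggregate(passages, lambda s: sum(s) / len(s)))
print("first", aggregate(passages, lambda s: s[0]))
print("kmax2", aggregate(passages, kmaxavg(2)))
```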

Example Pipelines

A typical passage-based retrieval pipeline might look like this:

from pyterrier_t5 import MonoT5ReRanker
index = pt.terrier.TerrierIndex.from_hf('pyterrier/vaswani.terrier')
bm25 = index.bm25()
passage_pipeline = (
    bm25 % 100 >>
    pt.text.get_text(pt.get_dataset('irds:vaswani'), "text") >>
    pt.text.sliding(length=100, stride=50, text_attr='text', prepend_attr=None) >>
    MonoT5ReRanker() >>
    pt.text.max_passage()
)

So while the index retrieves documents, MonoT5 is applied to passages, and then the passage scores are aggregated back to document scores using pt.text.max_passage().

Alternatively, you can apply passaging at indexing time, and then use passage-level retrieval followed by aggregation:

from pyterrier_t5 import MonoT5ReRanker
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index")
indexer.index(document_corpus)
passage_index = pt.terrier.TerrierIndex("./index")
passage_pipeline = (
    passage_index.bm25() % 100 >>
    MonoT5ReRanker() >>
    pt.text.max_passage()
)

Here, passage_pipeline returns documents rather than passages. Experiments on TREC Robust 2004 have shown that passage indexing and retrieval does not benefit effectiveness compared to document-level indexing and retrieval when using a strong re-ranker such as MonoT5 [DBLP:journals/tweb/WangMTO23].

Query-biased Summarisation (Snippets)

pyterrier.text.snippets(text_scorer_pipe, text_attr='text', summary_attr='summary', num_psgs=5, joinstr='...')

Applies query-biased summarisation (snippets) by applying the specified text scoring pipeline. Takes a dataframe with the columns [‘qid’, ‘query’, ‘docno’, text_attr], and returns a dataframe with the columns [‘qid’, ‘query’, ‘docno’, text_attr, summary_attr]. The summary_attr column contains the query-biased summary for that document: up to num_psgs passages, joined together with the specified joinstr.

Parameters:

text_scorer_pipe (Transformer) – the pipeline for scoring passages in response to the query. Normally this applies passaging. The pipeline should take a dataframe with the columns [‘qid’, ‘query’, ‘docno’, text_attr] and return a dataframe with the columns [‘qid’, ‘query’, ‘docno’, text_attr, ‘score’, ‘rank’], where these are smaller passages than the input df.
text_attr (str) – what is the name of the attribute that contains the text of the document
summary_attr (str) – what is the name of the attribute that should contain the query-biased summary for that document
num_psgs (int) – how many passages to select for the summary of each document
joinstr (str) – how to join passages for a given document together

Return type:

Transformer

Example:

# retrieve documents with text
br = pt.terrier.Retriever(index, metadata=['docno', 'text'])

# use Tf as a passage scorer on sliding window passages
psg_scorer = (
    pt.text.sliding(text_attr='text', length=15, prepend_attr=None)
    >> pt.text.scorer(body_attr="text", wmodel='Tf', takes='docs')
)

# use psg_scorer for performing query-biased summarisation on docs retrieved by br
retr_pipe = br >> pt.text.snippets(psg_scorer)
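
The idea behind query-biased summarisation can be illustrated without PyTerrier: split the document into passages, score each passage against the query, and join the best ones. The sketch below is a simplified stand-in (the function query_biased_summary is hypothetical; it uses non-overlapping windows, a raw term-frequency score without stemming, and lists the selected passages in score order, all of which are assumptions rather than the behaviour of snippets()):

```python
import re
from typing import List


def query_biased_summary(query: str, text: str, length: int = 5,
                         num_psgs: int = 2, joinstr: str = "...") -> str:
    """Pick the passages that mention the query terms most often and join them."""
    query_terms = set(query.lower().split())
    tokens = text.split()
    # non-overlapping windows of `length` tokens (a simplification of sliding())
    passages = [tokens[i:i + length] for i in range(0, len(tokens), length)]

    # term-frequency score: how many tokens in the passage are query terms
    def tf(passage: List[str]) -> int:
        return sum(1 for tok in passage if re.sub(r"\W", "", tok.lower()) in query_terms)

    # the sort is stable, so passages with equal scores stay in document order
    best = sorted(passages, key=tf, reverse=True)[:num_psgs]
    return joinstr.join(" ".join(p) for p in best)


doc = ("Chemical reactions release energy. The weather was mild that day. "
       "Some reactions need a chemical catalyst to proceed quickly.")
print(query_biased_summary("chemical reactions", doc))
```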

Examples of Sentence-Transformers

Here we demonstrate the use of pt.apply.doc_score(..., batch_size=128) to allow an easy application of Sentence Transformers for reranking BM25 results:

import pandas as pd
from sentence_transformers import CrossEncoder, SentenceTransformer
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
bimodel = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def _crossencoder_apply(df : pd.DataFrame):
    return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

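def _biencoder_apply(df : pd.DataFrame):
    from sentence_transformers.util import cos_sim
    query_embs = bimodel.encode(df['query'].values)
    doc_embs = bimodel.encode(df['text'].values)
    # cos_sim returns the full query-by-document similarity matrix for the batch;
    # the diagonal holds the score of each row's query against that row's document,
    # which stays correct when a batch spans more than one query
    scores = cos_sim(query_embs, doc_embs)
    return scores.diagonal()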

bi_encT = pt.apply.doc_score(_biencoder_apply, batch_size=128)

pt.Experiment(
    [ bm25, bm25 >> bi_encT, bm25 >> cross_encT ],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"],
    names=["BM25", "BM25 >> BiEncoder", "BM25 >> CrossEncoder"]
)

You can browse the whole notebook or try it yourself on Colab.

References

[Chen2020] Chen et al. ICIP at TREC-2020 Deep Learning Track. TREC 2020.

@inproceedings{DBLP:conf/trec/ChenHSCH020,
  author       = {Xuanang Chen and
                  Ben He and
                  Le Sun and
                  Yingfei Sun},
  editor       = {Ellen M. Voorhees and
                  Angela Ellis},
  title        = {{ICIP} at {TREC-2020} Deep Learning Track},
  booktitle    = {Proceedings of the Twenty-Ninth Text REtrieval Conference, {TREC}
                  2020, Virtual Event [Gaithersburg, Maryland, USA], November 16-20,
                  2020},
  series       = {{NIST} Special Publication},
  volume       = {1266},
  publisher    = {National Institute of Standards and Technology {(NIST)}},
  year         = {2020},
  url          = {https://trec.nist.gov/pubs/trec29/papers/ICIP.DL.pdf},
  timestamp    = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl       = {https://dblp.org/rec/conf/trec/ChenHSCH020.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

[Dai2019] Dai and Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.

@inproceedings{DBLP:conf/sigir/DaiC19,
  author       = {Zhuyun Dai and
                  Jamie Callan},
  editor       = {Benjamin Piwowarski and
                  Max Chevalier and
                  {\'{E}}ric Gaussier and
                  Yoelle Maarek and
                  Jian{-}Yun Nie and
                  Falk Scholer},
  title        = {Deeper Text Understanding for {IR} with Contextual Neural Language
                  Modeling},
  booktitle    = {Proceedings of the 42nd International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2019, Paris,
                  France, July 21-25, 2019},
  pages        = {985--988},
  publisher    = {{ACM}},
  year         = {2019},
  url          = {https://doi.org/10.1145/3331184.3331303},
  doi          = {10.1145/3331184.3331303},
  timestamp    = {Thu, 25 Apr 2024 15:20:33 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/DaiC19.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}