PyTerrier demonstration for msmarco_passage

This notebook demonstrates retrieval using PyTerrier on the MSMARCO Passage Ranking corpus.

About the corpus: a passage ranking task over a corpus of 8.8 million passages released by Microsoft, which are ranked by their relevance to natural-language questions. The same corpus is used by the TREC Deep Learning track.

In [1]:
#!pip install -q python-terrier
import pyterrier as pt
if not pt.started():
    pt.init()

from pyterrier.measures import *
dataset = pt.get_dataset('msmarco_passage')
        
PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

Systems using index variant terrier_stemmed

Terrier's default Porter stemming applied, with stopwords removed.

In [2]:
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='DPH')
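Both weighting models here are classic lexical schemes: BM25 combines an inverse-document-frequency component with saturated, length-normalised term frequency, while DPH is a parameter-free Divergence from Randomness model. As a rough illustration of the BM25 side only, here is a minimal sketch in plain Python over a hypothetical toy corpus (Terrier's actual implementation differs in details such as its IDF formulation):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document for a query with classic Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency of t
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(t)
        # term-frequency saturation with document-length normalisation
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

# Hypothetical toy corpus, pre-tokenised and stemmed
corpus = [["chemic", "reaction", "exotherm"],
          ["exotherm", "process", "heat"],
          ["passag", "rank", "corpus"]]
print(bm25_score(["exotherm", "heat"], corpus[1], corpus))
```

The k1=1.2, b=0.75 defaults are the commonly used Okapi settings; the index variants above differ only in how documents and term frequencies are produced, not in this scoring step.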

Systems using index variant terrier_stemmed_text

Terrier's default Porter stemming applied, with stopwords removed. The passage text is also stored in the MetaIndex to facilitate BERT-based reranking.

In [3]:
import onir_pt
# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face
vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
In [4]:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])

bm25_bert_terrier_stemmed_text = bm25_terrier_stemmed_text >> vanilla_bert
11:17:23.605 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.
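The `>>` operator above composes two stages into a pipeline: BM25 produces a candidate ranking whose rows (including the stored text) are then rescored by the BERT reranker. The composition idea can be sketched independently of PyTerrier with plain callables (hypothetical toy stages and scores, not PyTerrier's actual transformer classes):

```python
class Stage:
    """Toy pipeline stage: wraps a function over a list of result rows."""
    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        # self >> other: feed this stage's output into the next stage
        return Stage(lambda rows: other.fn(self.fn(rows)))

    def __call__(self, rows):
        return self.fn(rows)

# Hypothetical first-stage retriever: returns candidates with lexical scores
retrieve = Stage(lambda rows: [{"docno": "d1", "score": 2.0},
                               {"docno": "d2", "score": 1.0}])

# Hypothetical reranker scores (a real reranker would derive these from the text)
neural_scores = {"d1": 0.2, "d2": 0.9}
rerank = Stage(lambda rows: sorted(
    [{**r, "score": neural_scores[r["docno"]]} for r in rows],
    key=lambda r: r["score"], reverse=True))

pipeline = retrieve >> rerank
print(pipeline([]))  # d2 now outranks d1 after reranking
```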

Systems using index variant terrier_stemmed_docT5query

Terrier index using docT5query document expansion. Porter stemming and stopword removal applied.

In [5]:
bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_docT5query', wmodel='BM25')
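docT5query expands each passage before indexing by appending queries that a T5 model predicts the passage could answer, so the expansion terms contribute to BM25's term frequencies at retrieval time. A minimal sketch of the expansion step, with hypothetical hand-written predictions standing in for the T5 model's generated queries:

```python
def expand_passage(text, predicted_queries):
    """Append model-predicted queries to the passage text before indexing."""
    return text + " " + " ".join(predicted_queries)

passage = "The Manhattan Project produced the first nuclear weapons during WWII."
# Hypothetical predictions; the real approach generates these with a trained T5 model
queries = ["what was the manhattan project",
           "who made the first nuclear weapon"]
expanded = expand_passage(passage, queries)
print(expanded)
```

The expanded text is what gets indexed; queries mentioning e.g. "weapon" in the singular can now match the passage lexically.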

Systems using index variant terrier_stemmed_deepct

Terrier index using DeepCT term re-weighting. Porter stemming and stopword removal applied.

In [6]:
bm25_terrier_stemmed_deepct = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_deepct', wmodel='BM25')
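DeepCT takes the opposite approach to docT5query: rather than adding new terms, it re-weights terms already in the passage. A BERT-based model estimates each term's context-specific importance, and a scaled, integerised version of that weight replaces the term frequency recorded in the index. A toy sketch of the re-weighting idea, using hypothetical importance scores in place of the model's predictions:

```python
def deepct_pseudo_doc(term_weights, scale=10):
    """Repeat each term in proportion to its predicted importance, so a
    standard indexer records the re-weighted term frequency."""
    doc = []
    for term, weight in term_weights.items():
        doc.extend([term] * max(1, round(weight * scale)))
    return doc

# Hypothetical importance scores in [0, 1] for one passage;
# the real DeepCT model predicts these with BERT
weights = {"exotherm": 0.9, "reaction": 0.4, "the": 0.05}
print(deepct_pseudo_doc(weights))
```

Important terms dominate the pseudo-document, so BM25 over this index behaves like BM25 with learned term weights.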

Evaluation on trec-2019 topics and qrels

43 topics used in the TREC 2019 Deep Learning track Passage Ranking task, with deep relevance judgements.

In [7]:
pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, bm25_terrier_stemmed_text, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query, bm25_terrier_stemmed_deepct],
    pt.get_dataset('msmarco_passage').get_topics('test-2019'),
    pt.get_dataset('msmarco_passage').get_qrels('test-2019'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'bm25_terrier_stemmed_text', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query', 'bm25_terrier_stemmed_deepct'])
        
11:17:26.746 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
config file not found: config
[2021-09-23 11:19:03,772][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 11:19:06,355][onir_pt][DEBUG] [starting] batches
[2021-09-23 11:22:36,997][onir_pt][DEBUG] [finished] batches: [03:31] [10502it] [49.86it/s]
Out[7]:
   name                             RR(rel=2)   nDCG@10  nDCG@100  AP(rel=2)
0  bm25_terrier_stemmed              0.641565  0.479540  0.487416   0.286448
1  dph_terrier_stemmed               0.667307  0.502513  0.485995   0.308977
2  bm25_terrier_stemmed_text         0.641565  0.479540  0.487416   0.286448
3  bm25_bert_terrier_stemmed_text    0.829457  0.685453  0.635066   0.444111
4  bm25_terrier_stemmed_docT5query   0.761370  0.630835  0.592220   0.404429
5  bm25_terrier_stemmed_deepct       0.689009  0.534393  0.521540   0.323135
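The measure names follow the ir_measures notation: RR(rel=2) and AP(rel=2) count only documents judged at least 2 as relevant (the Deep Learning track qrels are graded 0-3), while nDCG uses the graded labels directly. A quick sketch of RR and nDCG@k on a single toy ranking with hypothetical graded judgements (this uses the standard log2 discount with linear gain; gain formulations vary between tools):

```python
import math

def rr(labels, rel=2):
    """Reciprocal rank of the first result judged >= rel."""
    for rank, label in enumerate(labels, start=1):
        if label >= rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(labels, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal reordering."""
    dcg = sum(l / math.log2(r + 1) for r, l in enumerate(labels[:k], start=1))
    ideal = sorted(labels, reverse=True)
    idcg = sum(l / math.log2(r + 1) for r, l in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical graded judgements (0-3) down one ranked list
labels = [1, 3, 0, 2]
print(rr(labels), ndcg_at_k(labels, 10))
```

With rel=2, the label-1 passage at rank 1 does not count, so RR is 1/2 even though the top result is marginally relevant.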

Evaluation on trec-2020 topics and qrels

54 topics used in the TREC 2020 Deep Learning track Passage Ranking task, with deep relevance judgements.

In [8]:
pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, bm25_terrier_stemmed_text, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query, bm25_terrier_stemmed_deepct],
    pt.get_dataset('trec-deep-learning-passages').get_topics('test-2020'),
    pt.get_dataset('trec-deep-learning-passages').get_qrels('test-2020'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'bm25_terrier_stemmed_text', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query', 'bm25_terrier_stemmed_deepct'])
        
Downloading msmarco_passage topics to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_passage/msmarco-test2020-queries.tsv.gz
msmarco-test2020-queries.tsv.gz: 100%|██████████| 4.03k/4.03k [2ms<0ms, 2.23MiB/s]
11:23:10.043 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
Downloading msmarco_passage qrels to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_passage/2020qrels-docs.txt
2020qrels-docs.txt: 100%|██████████| 213k/213k [308ms<0ms, 710kiB/s]  
[2021-09-23 11:25:57,041][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 11:25:57,048][onir_pt][DEBUG] [starting] batches
[2021-09-23 11:30:18,381][onir_pt][DEBUG] [finished] batches: [04:21] [12894it] [49.34it/s]
Out[8]:
   name                             RR(rel=2)   nDCG@10  nDCG@100  AP(rel=2)
0  bm25_terrier_stemmed              0.618666  0.493627  0.502562   0.292988
1  dph_terrier_stemmed               0.592541  0.450545  0.472645   0.270193
2  bm25_terrier_stemmed_text         0.618666  0.493627  0.502562   0.292988
3  bm25_bert_terrier_stemmed_text    0.806944  0.669242  0.637182   0.466918
4  bm25_terrier_stemmed_docT5query   0.743408  0.622808  0.609271   0.408175
5  bm25_terrier_stemmed_deepct       0.690660  0.550106  0.561069   0.349373

Evaluation on dev.small topics and qrels

6,980 topics with sparse relevance judgements (typically one judged passage per query).

In [9]:
pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, bm25_terrier_stemmed_text, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query, bm25_terrier_stemmed_deepct],
    pt.get_dataset('msmarco_passage').get_topics('dev.small'),
    pt.get_dataset('msmarco_passage').get_qrels('dev.small'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=['recip_rank'],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'bm25_terrier_stemmed_text', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query', 'bm25_terrier_stemmed_deepct'])
        
Downloading msmarco_passage tars to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_passage/collectionandqueries.tar.gz
collectionandqueries.tar.gz: 100%|██████████| 0.99G/0.99G [05:46<0ms, 3.06MiB/s]  
11:37:19.447 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
[2021-09-23 12:12:15,654][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 12:12:15,660][onir_pt][DEBUG] [starting] batches
[2021-09-23 12:27:36,988][onir_pt][DEBUG] [finished] batches: [15:21] [46046it] [49.98it/s]
[2021-09-23 12:28:03,169][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 12:28:03,176][onir_pt][DEBUG] [starting] batches
[2021-09-23 12:43:32,823][onir_pt][DEBUG] [finished] batches: [15:30] [46275it] [49.78it/s]
[2021-09-23 12:43:58,010][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 12:43:58,017][onir_pt][DEBUG] [starting] batches
[2021-09-23 14:06:59,974][onir_pt][DEBUG] [finished] batches: [16:12] [48414it] [49.81it/s]
[2021-09-23 14:07:26,766][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 14:07:26,773][onir_pt][DEBUG] [starting] batches
[2021-09-23 14:40:44,655][onir_pt][DEBUG] [finished] batches: [16:29] [50000it] [50.55it/s]
[2021-09-23 14:41:13,554][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 14:41:13,562][onir_pt][DEBUG] [starting] batches
[2021-09-23 16:08:21,520][onir_pt][DEBUG] [finished] batches: [17:24] [49115it] [47.04it/s]
[2021-09-23 16:08:56,748][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 16:08:56,756][onir_pt][DEBUG] [starting] batches
[2021-09-23 16:25:49,492][onir_pt][DEBUG] [finished] batches: [16:53] [48618it] [48.01it/s]
[2021-09-23 16:26:18,557][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 16:26:18,564][onir_pt][DEBUG] [starting] batches
[2021-09-23 17:01:56,893][onir_pt][DEBUG] [finished] batches: [17:51] [49793it] [46.48it/s]
[2021-09-23 17:02:28,679][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 17:02:28,688][onir_pt][DEBUG] [starting] batches
[2021-09-23 21:39:09,808][onir_pt][DEBUG] [finished] batches: [16:30] [48439it] [48.93it/s]
[2021-09-23 21:39:39,784][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-23 21:39:39,795][onir_pt][DEBUG] [starting] batches
batches:  87%|████████▋ | 35986/41480 [12:09<01:51, 49.35it/s]
Out[9]:
   name                             recip_rank
0  bm25_terrier_stemmed               0.196212
1  dph_terrier_stemmed                0.184955
2  bm25_terrier_stemmed_text          0.196212
3  bm25_bert_terrier_stemmed_text     0.358612
4  bm25_terrier_stemmed_docT5query    0.284790
5  bm25_terrier_stemmed_deepct        0.254599
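recip_rank here is trec_eval's reciprocal rank without a rank cutoff; the official MS MARCO leaderboard metric is MRR@10, which truncates each ranking at depth 10 before averaging over queries. A sketch of cutoff MRR over a few hypothetical topics with single-passage judgements:

```python
def mrr(runs, qrels, cutoff=10):
    """Mean reciprocal rank at a cutoff, averaged over all queries."""
    total = 0.0
    for qid, ranked in runs.items():
        relevant = qrels.get(qid, set())
        score = 0.0
        for rank, docno in enumerate(ranked[:cutoff], start=1):
            if docno in relevant:
                score = 1.0 / rank
                break
        total += score
    return total / len(runs)

# Hypothetical rankings and single-judgement qrels for three topics
runs = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"], "q3": ["d4"]}
qrels = {"q1": {"d7"}, "q2": {"d2"}, "q3": {"d9"}}  # q3's relevant passage was not retrieved
print(mrr(runs, qrels))
```

Averaging 1/2, 1 and 0 over the three topics gives 0.5; with sparse judgements a query contributes nothing whenever its single relevant passage is missed, which is why dev.small scores are much lower than the deeply judged TREC numbers above.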