# PyTerrier demonstration for msmarco_document

This notebook demonstrates retrieval using PyTerrier on the MSMARCO Document Ranking corpus.

About the corpus: a document ranking corpus containing 3.2 million documents, also used by the TREC Deep Learning track.

In [1]:

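#!pip install -q python-terrier
import pyterrier as pt

# Start the JVM that backs Terrier, unless it is already running.
if not pt.started():
    pt.init()

# Bring the measure objects used below (RR, nDCG, AP) into scope.
from pyterrier.measures import *
dataset = pt.get_dataset('msmarco_document')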

PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


## Systems using index variant terrier_stemmed
This index applies Terrier's default Porter stemming and removes stopwords.

In [2]:
# BM25 and DPH retrievers over the prebuilt index, each returning the top 100 documents per query.
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='BM25', num_results=100)

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='DPH', num_results=100)

# Pseudo-relevance feedback: retrieve with DPH, expand the query with Bo1, then retrieve again with DPH.
dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarco_document').get_index('terrier_stemmed')) >> dph_terrier_stemmed
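The `>>` operator in the cell above chains stages so that the output of one becomes the input of the next. The following is a minimal pure-Python sketch of that idea, not PyTerrier's implementation: the `Stage` class, the three-document `DOCS` collection, and the `retrieve` and `expand` functions are all hypothetical stand-ins, and the expansion step simply adds the terms of the top-ranked document rather than applying Bo1 term weighting.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical toy collection, standing in for a real index.
DOCS = {
    "d1": "terrier search engine",
    "d2": "terrier dog breed",
    "d3": "search engine ranking",
}


class Stage:
    """A toy pipeline stage wrapping a function over a list of rows."""

    def __init__(self, fn: Callable[[List[Row]], List[Row]]):
        self.fn = fn

    def __call__(self, rows: List[Row]) -> List[Row]:
        return self.fn(rows)

    def __rshift__(self, other: "Stage") -> "Stage":
        # a >> b builds a new stage that runs a, then feeds its output to b.
        return Stage(lambda rows: other(self(rows)))


def retrieve(rows: List[Row]) -> List[Row]:
    """Score each document by the number of distinct query terms it contains."""
    out = []
    for row in rows:
        terms = set(str(row["query"]).split())
        for docno, text in DOCS.items():
            score = len(terms & set(text.split()))
            if score > 0:
                out.append({"qid": row["qid"], "query": row["query"],
                            "docno": docno, "score": score})
    # Best score first within each query; ties broken by docno.
    return sorted(out, key=lambda r: (r["qid"], -r["score"], r["docno"]))


def expand(rows: List[Row]) -> List[Row]:
    """Add the unseen terms of each query's top-ranked document to the query."""
    top: Dict[object, Row] = {}
    for row in rows:  # rows are sorted, so the first row per qid is the top hit
        top.setdefault(row["qid"], row)
    out = []
    for qid, row in top.items():
        seen = str(row["query"]).split()
        new_terms = [t for t in DOCS[str(row["docno"])].split() if t not in seen]
        out.append({"qid": qid, "query": " ".join(seen + new_terms)})
    return out


# Same shape as the notebook's pipeline: retrieve, expand, retrieve again.
pipeline = Stage(retrieve) >> Stage(expand) >> Stage(retrieve)
results = pipeline([{"qid": "q1", "query": "search"}])
for r in results:
    print(r["docno"], r["score"])
```

In this toy run the first pass cannot reach `d2`, because it shares no term with the query `search`; after expansion with the terms of the top hit, the second pass retrieves it. That recall gain is the motivation for retrieving twice.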


## Systems using index variant terrier_stemmed_text
This index applies Terrier's default Porter stemming and removes stopwords. The document text is also saved in the MetaIndex to facilitate BERT-based reranking.

In [3]:
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git
import onir_pt
# Let's use a Vanilla BERT ranker from OpenNIR, with the Capreolus model available from Hugging Face.
vanilla_bert = onir_pt.reranker('hgf4_joint',  text_field='body', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


In [4]:
bm25_bert_terrier_stemmed_text = (
    pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'title', 'body'], num_results=100)
    >> pt.text.sliding(length=128, stride=64, prepend_attr='title')
    >> vanilla_bert
    >> pt.text.max_passage()
)


14:00:11.859 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - File /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_document/index/terrier_stemmed_text/data.meta-0.fsomapfile containing reverse meta mapping for keydocno is missing. Reverse lookups for this key will be disabled
14:00:11.890 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 4.3 GiB of memory would be required.
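The pipeline in the cell above splits each retrieved document into overlapping passages, scores every passage with BERT, and keeps the best passage score per document. The sketch below illustrates that windowing and aggregation in plain Python. It is a simplification under stated assumptions, not PyTerrier's code: tokens are whitespace-separated, handling of the final window may differ from `pt.text.sliding`, and the `docno%pN` passage identifier format is assumed for illustration.

```python
from typing import Dict, List, Tuple


def sliding_passages(title: str, body: str,
                     length: int = 128, stride: int = 64) -> List[str]:
    """Split body into windows of `length` tokens, stepping by `stride`,
    and prepend the title to every window."""
    tokens = body.split()
    passages = []
    start = 0
    while True:
        window = tokens[start:start + length]
        passages.append((title + " " + " ".join(window)).strip())
        if start + length >= len(tokens):
            break  # this window already reaches the end of the body
        start += stride
    return passages


def max_passage(scored: List[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse passage scores to one score per document: the maximum."""
    best: Dict[str, float] = {}
    for passage_id, score in scored:
        # Assumed identifier format: "<docno>%p<passage number>".
        docno = passage_id.rsplit("%p", 1)[0]
        best[docno] = max(score, best.get(docno, float("-inf")))
    return best


# A 10-token body with length=4 and stride=2 gives windows at 0, 2, 4, 6.
body = " ".join(f"w{i}" for i in range(10))
passages = sliding_passages("T", body, length=4, stride=2)

# Hypothetical passage scores, standing in for the BERT reranker's output.
doc_scores = max_passage([("D1%p0", 0.2), ("D1%p1", 0.9), ("D2%p0", 0.4)])
print(len(passages), doc_scores)
```

With a stride of half the window length, as in the notebook (`length=128`, `stride=64`), every token after the first half-window appears in two passages, so relevant text that straddles a window boundary is still seen whole by the reranker.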


## Systems using index variant terrier_stemmed_docT5query
A Terrier index built using docT5query, with Porter stemming and stopword removal applied.

In [5]:
bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_docT5query', wmodel='BM25', num_results=100)
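docT5query is a document expansion technique: queries that a document might answer are predicted by a model and appended to the document text before indexing, so that lexical retrievers such as BM25 can match terms the original text never used. The toy sketch below shows only that append step. The predicted queries are hard-coded hypothetical values, and `expand_document` and `matches` are illustrative helpers, not part of PyTerrier or of the prebuilt index.

```python
from typing import List


def expand_document(text: str, predicted_queries: List[str]) -> str:
    """Append predicted queries to the document text before indexing."""
    return " ".join([text] + predicted_queries)


def matches(query: str, text: str) -> bool:
    """Toy lexical match: every query term occurs in the text."""
    doc_terms = set(text.lower().split())
    return all(term in doc_terms for term in query.lower().split())


original = "The femur is the longest bone in the human body"
# Hypothetical model output; a real docT5query model would generate these.
predicted = ["what is the thigh bone called", "longest bone in human body"]
expanded = expand_document(original, predicted)

print(matches("thigh bone", original))  # the original never says "thigh"
print(matches("thigh bone", expanded))  # the expansion adds the missing term
```

Because the expansion happens at indexing time, the query-time retriever in the cell above is an unmodified BM25 run against a different index.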


## Evaluation on trec-2019 topics and qrels
The 43 topics used in the TREC 2019 Deep Learning track Document Ranking task, with deep judgements.

In [6]:

pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, dph_bo1_terrier_stemmed, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query],
    pt.get_dataset('msmarco_document').get_topics('test'),
    pt.get_dataset('msmarco_document').get_qrels('test'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[RR, nDCG@10, nDCG@100, AP],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'dph_bo1_terrier_stemmed', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query'])
        

14:00:12.551 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
config file not found: config
[2021-09-28 14:01:22,324][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 14:01:25,000][onir_pt][DEBUG] [starting] batches
[2021-09-28 14:05:31,950][onir_pt][DEBUG] [finished] batches: [04:07] [6754it] [27.35it/s]


| | name | RR | nDCG@10 | nDCG@100 | AP |
|---|---|---|---|---|---|
| 0 | bm25_terrier_stemmed | 0.874123 | 0.538802 | 0.512328 | 0.248334 |
| 1 | dph_terrier_stemmed | 0.907946 | 0.543129 | 0.515971 | 0.249372 |
| 2 | dph_bo1_terrier_stemmed | 0.902952 | 0.601469 | 0.574638 | 0.297022 |
| 3 | bm25_bert_terrier_stemmed_text | 0.965116 | 0.650781 | 0.546319 | 0.290109 |
| 4 | bm25_terrier_stemmed_docT5query | 0.888372 | 0.599915 | 0.556793 | 0.274672 |
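The measures reported above (RR, nDCG@k, AP) are each computed per topic and then averaged over topics. The sketch below gives textbook per-topic definitions in plain Python to make the columns easier to read. It is an illustration under assumptions, not the evaluation code PyTerrier calls: any grade above 0 is treated as relevant for RR and AP, and nDCG uses the grade itself as gain with a log2 rank discount. The example ranking and judgements are invented.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranking: List[str], qrels: Dict[str, int]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if qrels.get(docno, 0) > 0:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: List[str], qrels: Dict[str, int]) -> float:
    """Mean of precision@rank over the ranks of relevant documents,
    normalised by the total number of relevant documents."""
    num_relevant = sum(1 for grade in qrels.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if qrels.get(docno, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the top k results divided by the DCG of an ideal ordering."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [qrels.get(docno, 0) for docno in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented example: one topic, graded judgements, a three-document ranking.
ranking = ["d3", "d1", "d2"]
qrels = {"d1": 2, "d2": 1, "d4": 1}

rr = reciprocal_rank(ranking, qrels)
ap = average_precision(ranking, qrels)
ndcg = ndcg_at_k(ranking, qrels, k=3)
print(rr, round(ap, 4), round(ndcg, 4))
```

RR looks only at the first relevant hit, which is why it can be very high for a reranker that fixes the top of the list while AP, which rewards finding all relevant documents, stays much lower.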


## Evaluation on trec-2020 topics and qrels
The 45 topics used in the TREC 2020 Deep Learning track Document Ranking task, with deep judgements.

In [7]:

pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, dph_bo1_terrier_stemmed, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query],
    pt.get_dataset('msmarco_document').get_topics('test-2020'),
    pt.get_dataset('msmarco_document').get_qrels('test-2020'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[RR, nDCG@10, nDCG@100, AP],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'dph_bo1_terrier_stemmed', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query'])
        

Downloading msmarco_document topics to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_document/msmarco-test2020-queries.tsv.gz
msmarco-test2020-queries.tsv.gz: 100%|██████████| 4.03k/4.03k [2ms<0ms, 1.88MiB/s]
14:05:59.347 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
Downloading msmarco_document qrels to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_document/2020qrels-docs.txt
2020qrels-docs.txt: 100%|██████████| 179k/179k [375ms<0ms, 488kiB/s]
[2021-09-28 14:08:51,628][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 14:08:51,637][onir_pt][DEBUG] [starting] batches
[2021-09-28 14:13:21,421][onir_pt][DEBUG] [finished] batches: [04:30] [7501it] [27.80it/s]


| | name | RR | nDCG@10 | nDCG@100 | AP |
|---|---|---|---|---|---|
| 0 | bm25_terrier_stemmed | 0.849841 | 0.531539 | 0.56281 | 0.376637 |
| 1 | dph_terrier_stemmed | 0.831323 | 0.514913 | 0.563194 | 0.379489 |
| 2 | dph_bo1_terrier_stemmed | 0.82328 | 0.513813 | 0.57077 | 0.390954 |
| 3 | bm25_bert_terrier_stemmed_text | 0.936111 | 0.608011 | 0.609786 | 0.415357 |
| 4 | bm25_terrier_stemmed_docT5query | 0.882685 | 0.554608 | 0.594002 | 0.404781 |


## Evaluation on dev topics and qrels
The 5193 dev topics, with sparse judgements.

In [8]:

pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, dph_bo1_terrier_stemmed, bm25_bert_terrier_stemmed_text, bm25_terrier_stemmed_docT5query],
    pt.get_dataset('msmarco_document').get_topics('dev'),
    pt.get_dataset('msmarco_document').get_qrels('dev'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[RR],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'dph_bo1_terrier_stemmed', 'bm25_bert_terrier_stemmed_text', 'bm25_terrier_stemmed_docT5query'])
        

Downloading msmarco_document topics to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_document/msmarco-docdev-queries.tsv.gz
msmarco-docdev-queries.tsv.gz: 100%|██████████| 89.7k/89.7k [302ms<0ms, 305kiB/s]
14:13:50.208 [main] WARN org.terrier.applications.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8
Downloading msmarco_document qrels to /local/tr.collections/data.terrier.org/pyterrier_prebuilt/pyt_home/corpora/msmarco_document/msmarco-docdev-qrels.tsv.gz
msmarco-docdev-qrels.tsv.gz: 100%|██████████| 37.6k/37.6k [153ms<0ms, 253kiB/s]
[2021-09-28 15:39:35,931][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 15:39:35,941][onir_pt][DEBUG] [starting] batches
[2021-09-28 15:58:48,532][onir_pt][DEBUG] [finished] batches: [19:13] [31228it] [27.09it/s]
[2021-09-28 16:04:26,494][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 16:04:26,504][onir_pt][DEBUG] [starting] batches
[2021-09-28 16:53:00,078][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 16:53:00,088][onir_pt][DEBUG] [starting] batches
[2021-09-28 17:36:06,364][onir_pt][DEBUG] [finished] batches: [19:48] [32847it] [27.64it/s]
[2021-09-28 18:24:00,915][onir_pt][DEBUG] [finished] batches: [19:55] [32885it] [27.51it/s]
[2021-09-28 18:27:41,405][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 18:27:41,416][onir_pt][DEBUG] [starting] batches
[2021-09-28 19:13:57,414][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 19:13:57,422][onir_pt][DEBUG] [starting] batches
[2021-09-28 20:27:33,560][onir_pt][DEBUG] [finished] batches: [20:21] [32511it] [26.64it/s]
[2021-09-28 21:42:56,711][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-28 21:42:56,720][onir_pt][DEBUG] [starting] batches
[2021-09-28 22:25:12,704][onir_pt][DEBUG] [finished] batches: [19:27] [32377it] [27.74it/s]
[2021-09-28 23:56:20,463][onir_pt][DEBUG] [finished] batches: [19:45] [32935it] [27.78it/s]
[2021-09-29 00:21:52,673][onir_pt][DEBUG] using GPU (deterministic)
[2021-09-29 00:21:52,682][onir_pt][DEBUG] [starting] batches
batches:  86%|████████▌ | 27178/31732 [16:09<02:42, 28.06it/s]

| | name | RR |
|---|---|---|
| 0 | bm25_terrier_stemmed | 0.269925 |
| 1 | dph_terrier_stemmed | 0.273157 |
| 2 | dph_bo1_terrier_stemmed | 0.252609 |
| 3 | bm25_bert_terrier_stemmed_text | 0.350838 |
| 4 | bm25_terrier_stemmed_docT5query | 0.303864 |
