PyTerrier Data Repository

MSMARCO Passage Ranking

A passage ranking task over a corpus of 8.8 million passages released by Microsoft; the passages are to be ranked by their relevance to questions. Also used by the TREC Deep Learning track.

Retrieval notebooks: View, Download


We have 7 index variants for this dataset:


Last Update 2021-06-12, 1.4GB

Terrier index with Terrier's default Porter stemming and stopword removal applied.

Browse index

Use this for retrieval in PyTerrier:

import pyterrier as pt  # older PyTerrier releases also need pt.init() before first use

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='DPH')
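For intuition about what a BM25 weighting model computes, here is a minimal sketch of the textbook Okapi BM25 formula over a toy in-memory corpus. It is not Terrier's implementation: the function name `bm25_score`, the toy passages, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and Terrier's exact variant and defaults may differ.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one tokenised document for a query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # terms absent from the document contribute nothing
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF that stays non-negative (the "+1" variant).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Toy passages, already tokenised (hypothetical data).
corpus = [
    ["chemical", "reaction", "rate"],
    ["stock", "market", "rate", "today"],
    ["chemical", "bond", "energy", "chemical"],
]
query = ["chemical", "reaction"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# Indices of the passages, best first.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

The first passage matches both query terms, including the rarer one, so it ranks ahead of the passage that only repeats the more common term.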


Last Update 2021-06-12, 2.3GB

Terrier index using DeepCT, with Porter stemming and stopword removal applied. This index was built from the MSMARCO files linked from the authors' original repository. To create indices for other corpora, use the pyterrier_deepct plugin.

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_deepct = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_deepct', wmodel='BM25')
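As a rough illustration of how DeepCT-style term weights can end up in an ordinary inverted index, the sketch below turns per-term importance scores into a pseudo-passage in which each term is repeated in proportion to its weight, so that a standard indexer records the weight as a term frequency. The helper name `weights_to_pseudo_passage`, the scale factor of 100, and the example weights are assumptions for illustration, not taken from the index described here.

```python
def weights_to_pseudo_passage(term_weights, scale=100):
    """Repeat each term round(weight * scale) times; terms rounding to 0 vanish."""
    tokens = []
    for term, weight in term_weights.items():
        tokens.extend([term] * round(weight * scale))
    return " ".join(tokens)

# Hypothetical predicted importances for one passage.
predicted = {"osteoporosis": 0.04, "bone": 0.02, "the": 0.0}
pseudo_passage = weights_to_pseudo_passage(predicted)
```

Indexing `pseudo_passage` instead of the raw text means a weighting model such as BM25 sees the predicted importance wherever it would normally see a raw term count.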


Last Update 2021-06-12, 2.3GB

Terrier index using docT5query, with Porter stemming and stopword removal applied. This index was built from the MSMARCO files linked from the authors' original repository. To create indices for other corpora, use the pyterrier_doc2query plugin.

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_docT5query', wmodel='BM25')
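The idea behind a docT5query-expanded index can be sketched in a few lines: queries that a model predicts a passage might answer are appended to the passage text before indexing, so that the passage can match vocabulary it does not itself contain. The function `expand_passage` and the example queries below are hypothetical; a real run would generate the queries with a trained T5 model.

```python
def expand_passage(text, predicted_queries):
    """Append generated queries to the passage text before indexing."""
    return " ".join([text, *predicted_queries])

passage = "The Manhattan Project produced the first nuclear weapons."
# Hypothetical model output, hard-coded here for illustration only.
queries = ["what was the manhattan project", "who built the first atomic bomb"]
expanded = expand_passage(passage, queries)
```

After expansion the passage contains "atomic bomb", a phrase absent from the original text, which a lexical model such as BM25 can now match.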


Last Update 2021-08-06, 3.4GB

Terrier index with Terrier's default Porter stemming and stopword removal applied. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Browse index

Use this for retrieval in PyTerrier:

#!pip install git+

import onir_pt

# Let's use a vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])

bm25_bert_terrier_stemmed_text = (
    bm25_terrier_stemmed_text
    >> vanilla_bert)
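To show why the passage text needs to travel with the first-stage results, here is a toy retrieve-then-rerank sketch in plain Python. It does not use PyTerrier or OpenNIR: `first_stage`, `rerank`, and `toy_scorer` are hypothetical stand-ins, with `toy_scorer` playing the role that a BERT model would play in a real pipeline.

```python
def first_stage(query, index, k=2):
    """Cheap lexical stage: count query-term matches, return docno AND text."""
    q = set(query.lower().split())
    hits = [
        {"docno": docno, "text": text,
         "score": sum(tok in q for tok in text.lower().split())}
        for docno, text in index.items()
    ]
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]

def rerank(query, hits, score_fn):
    """Second stage: rescore each candidate from its text, then re-sort."""
    rescored = [dict(h, score=score_fn(query, h["text"])) for h in hits]
    rescored.sort(key=lambda h: h["score"], reverse=True)
    return rescored

def toy_scorer(query, text):
    # Stand-in for a neural model: fraction of passage tokens that are query terms.
    q = set(query.lower().split())
    tokens = text.lower().split()
    return sum(tok in q for tok in tokens) / len(tokens)

# Hypothetical three-passage collection.
index = {
    "d1": "chemical plants and chemical waste and chemical safety rules",
    "d2": "a chemical reaction",
    "d3": "stock markets fell today",
}
candidates = first_stage("chemical reaction", index)
reranked = rerank("chemical reaction", candidates, toy_scorer)
```

The second stage can only score a candidate because each hit carries its `text` field, which is the role played by requesting `'text'` in the `metadata` argument above.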


Last Update 2021-08-06, 2.1GB

Terrier index, no stemming, no stopword removal.

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='DPH')


Last Update 2021-06-12, 3.1GB

Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Browse index

Use this for retrieval in PyTerrier:

#!pip install git+

import onir_pt

# Let's use a vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed_text', wmodel='BM25', metadata=['docno', 'text'])

bm25_bert_terrier_unstemmed_text = (
    bm25_terrier_unstemmed_text
    >> vanilla_bert)


Last Update 2021-09-29, 25.4GB

ANCE dense retrieval index, built using the model trained by the original ANCE authors. Uses the pyterrier_ance plugin.

Browse index

Use this for retrieval in PyTerrier:

#!pip install --upgrade git+

from pyterrier_ance import ANCERetrieval

ance = ANCERetrieval.from_dataset('msmarco_passage', 'ance')
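Dense retrieval differs from the Terrier indices above in that passages and queries are compared as vectors rather than by shared terms. The sketch below shows the core operation as an exhaustive inner-product search in plain Python. The three-dimensional vectors are hypothetical, `dense_search` is an illustrative helper rather than part of pyterrier_ance, and a real system would use a trained encoder and an optimised vector index.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_search(query_vec, passage_vecs, k=2):
    """Exhaustive inner-product search; returns (docno, score) pairs, best first."""
    scored = [(docno, dot(query_vec, vec)) for docno, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings; a real encoder yields hundreds of dimensions.
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
top = dense_search(query_vec, passage_vecs)
```

Storing one floating-point vector per passage is consistent with a dense index being much larger than the inverted indices listed earlier.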