PyTerrier Data Repository

MSMARCO v2 Passage Ranking

A revised corpus of 138M passages released by Microsoft in July 2021. The task is to rank the passages by their relevance to a question. The corpus is also used by the TREC 2021 Deep Learning track.

Retrieval notebooks: View, Download

Variants

We have 2 index variants for this dataset:

terrier_stemmed

Last Update: 2021-08-08. Size: 36.2GB

Terrier index with the default settings: Porter stemming applied and stopwords removed.
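To make the difference between the stemmed and unstemmed variants concrete, here is a toy sketch of what stopword removal and stemming do to a query before it is matched against the index. The stopword list and the `toy_stem` helper are hypothetical stand-ins written for this illustration; Terrier uses its own stopword list and the real Porter stemmer, which has many more rules.

```python
# Toy illustration only: not Terrier's stopword list, not the real Porter algorithm.
STOPWORDS = {"the", "a", "of", "is", "are", "to", "in", "what"}  # hypothetical, tiny list


def toy_stem(token):
    """Strip a few common suffixes; a crude stand-in for Porter stemming."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stemmed_terms(text):
    """Roughly what a stemmed, stopword-free index sees."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def unstemmed_terms(text):
    """Roughly what an index with no stemming and no stopword removal sees."""
    return text.lower().split()


query = "what is the ranking of passages"
print(stemmed_terms(query))
print(unstemmed_terms(query))
```

With stemming, a query term such as "ranking" can match passages containing "rank" or "ranks"; without it, only the exact surface form matches.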

Browse index

Use this for retrieval in PyTerrier:

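#!pip install python-terrier
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git

import pyterrier as pt
import onir_pt

# Older PyTerrier releases need the JVM started explicitly before any index is opened.
if not pt.started():
    pt.init()

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})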

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25')

# BM25 first stage, then fetch the passage text and rerank with BERT.
# wmodel is passed explicitly; without it the retriever uses its default weighting model, not BM25.
bm25_bert_terrier_stemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH')

# DPH first stage, then fetch the passage text and rerank with BERT.
dph_bert_terrier_stemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)
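The pipelines above chain three stages with `>>`: retrieve candidates, attach the passage text, then rerank. The following is a minimal pure-Python sketch of that retrieve-then-rerank pattern, not PyTerrier's implementation. The `Stage` class, the three-passage `DOCS` collection, and both scoring functions are hypothetical and exist only to show how the output of one stage feeds the next.

```python
class Stage:
    """Wraps a function that maps a list of result rows to a list of result rows."""

    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds its output to b.
        return Stage(lambda rows: other.fn(self.fn(rows)))

    def __call__(self, rows):
        return self.fn(rows)


# Hypothetical three-passage collection.
DOCS = {
    "d1": "bm25 ranks passages by term statistics",
    "d2": "bert rerankers read the passage text",
    "d3": "cooking pasta at home",
}


def first_stage(rows):
    """Cheap candidate generation: score by the number of shared words."""
    out = []
    for row in rows:
        query_words = set(row["query"].split())
        for docno, text in DOCS.items():
            overlap = len(query_words & set(text.split()))
            if overlap:
                out.append({**row, "docno": docno, "score": float(overlap)})
    return out


def add_text(rows):
    """Attach the passage text, which the first stage does not return."""
    return [{**r, "text": DOCS[r["docno"]]} for r in rows]


def rerank(rows):
    """Stand-in for a neural reranker: fraction of passage words found in the query."""
    rescored = []
    for r in rows:
        words = r["text"].split()
        hits = len(set(r["query"].split()) & set(words))
        rescored.append({**r, "score": hits / len(words)})
    return sorted(rescored, key=lambda r: (r["qid"], -r["score"]))


pipeline = Stage(first_stage) >> Stage(add_text) >> Stage(rerank)
results = pipeline([{"qid": "q1", "query": "passages bert text"}])
for r in results:
    print(r["qid"], r["docno"], round(r["score"], 3))
```

The middle stage matters because a text-based reranker needs the passage content, which an inverted-index retriever typically does not return by default.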

terrier_unstemmed

Last Update: 2021-08-06. Size: 39.9GB

Terrier index, no stemming, no stopword removal.

Browse index

Use this for retrieval in PyTerrier:

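#!pip install python-terrier
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git

import pyterrier as pt
import onir_pt

# Older PyTerrier releases need the JVM started explicitly before any index is opened.
if not pt.started():
    pt.init()

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})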

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25')

# BM25 first stage, then fetch the passage text and rerank with BERT.
# wmodel is passed explicitly; without it the retriever uses its default weighting model, not BM25.
bm25_bert_terrier_unstemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH')

# DPH first stage, then fetch the passage text and rerank with BERT.
dph_bert_terrier_unstemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)
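Both variants are queried with the BM25 and DPH weighting models. For intuition about what a BM25 score measures, here is a textbook-style sketch over an in-memory collection. It is not Terrier's implementation, whose exact formula differs in its details; the `bm25_scores` function, the whitespace tokenisation, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Textbook BM25 over a dict of docno -> text. Illustrative only."""
    tokenised = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised.values()) / n_docs

    # Document frequency: in how many passages each term occurs.
    df = Counter()
    for toks in tokenised.values():
        df.update(set(toks))

    scores = {}
    for docno, toks in tokenised.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long passages are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


toy_docs = {
    "a": "deep learning track",
    "b": "passage ranking passage",
    "c": "cooking",
}
ranked = sorted(bm25_scores("passage ranking", toy_docs).items(), key=lambda kv: -kv[1])
print(ranked)
```

Passages that contain none of the query terms score zero, which is why the first stage only returns candidates that share at least one indexed term with the query.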