MSMARCO v2 Passage Ranking
A revised corpus of 138M passages, released by Microsoft in July 2021, to be ranked by their relevance to natural-language questions. Also used by the TREC 2021 Deep Learning track.
Variants
We have 2 index variants for this dataset:
terrier_stemmed
Terrier index with the default Porter stemming and stopword removal.
Use this for retrieval in PyTerrier:
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git
import pyterrier as pt
if not pt.started():
    pt.init()
import onir_pt

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Huggingface
vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# BM25 retrieval only
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25')

# BM25 retrieval, then fetch the passage text and rerank with BERT
# (wmodel is set explicitly so the first stage really is BM25, not the default weighting model)
bm25_bert_terrier_stemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)

# DPH retrieval only
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH')

# DPH retrieval, then fetch the passage text and rerank with BERT
dph_bert_terrier_stemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)
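The pipelines above chain a first-stage retriever, a text lookup, and a reranker with the >> operator. As a rough illustration of that retrieve-then-rerank pattern, here is a minimal pure-Python sketch. It is not PyTerrier's implementation: Stage, first_stage, attach_text, rerank, and the DOCS collection are all hypothetical names invented for this example, and the scoring functions are toy stand-ins for BM25/DPH and BERT.

```python
class Stage:
    """Hypothetical pipeline stage: wraps a function (query, results) -> results."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, query, results=None):
        return self.fn(query, results)

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds a's output to b.
        return Stage(lambda query, results=None: other(query, self(query, results)))


# Toy collection standing in for the passage corpus (invented for illustration).
DOCS = {
    "d1": "bert bert bert and many other unrelated words here",
    "d2": "bert reranks passages",
    "d3": "cooking recipes",
}


def first_stage(query, _results):
    # Cheap lexical score: total occurrences of the query terms in each document.
    scored = []
    for docno, text in DOCS.items():
        tokens = text.split()
        score = sum(tokens.count(term) for term in query.split())
        if score > 0:
            scored.append((docno, score))
    return sorted(scored, key=lambda pair: -pair[1])


def attach_text(query, results):
    # Look up the passage text, since the first stage only returns ids and scores.
    return [(docno, score, DOCS[docno]) for docno, score in results]


def rerank(query, results):
    # Toy reranker: length-normalised term count, standing in for a neural model.
    rescored = []
    for docno, _old_score, text in results:
        tokens = text.split()
        hits = sum(tokens.count(term) for term in query.split())
        rescored.append((docno, hits / len(tokens)))
    return sorted(rescored, key=lambda pair: -pair[1])


pipeline = Stage(first_stage) >> Stage(attach_text) >> Stage(rerank)
first_stage_ranking = [docno for docno, _ in first_stage("bert passages", None)]
reranked = [docno for docno, _ in pipeline("bert passages")]
print(first_stage_ranking, reranked)
```

In this toy setup the first stage prefers the longer document that repeats one query term, while the reranker promotes the shorter document that matches both terms. This mirrors why the text lookup sits between the two stages: the reranker needs the passage text, which the first stage does not return.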
terrier_unstemmed
Terrier index, no stemming, no stopword removal.
Use this for retrieval in PyTerrier:
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git
import pyterrier as pt
if not pt.started():
    pt.init()
import onir_pt

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Huggingface
vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# BM25 retrieval only
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25')

# BM25 retrieval, then fetch the passage text and rerank with BERT
# (wmodel is set explicitly so the first stage really is BM25, not the default weighting model)
bm25_bert_terrier_unstemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)

# DPH retrieval only
dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH')

# DPH retrieval, then fetch the passage text and rerank with BERT
dph_bert_terrier_unstemmed = (
    pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH')
    >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text')
    >> vanilla_bert)
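To make the difference between the two index variants concrete, the sketch below shows roughly which terms each would see for the same text. It is only an approximation under stated assumptions: STOPWORDS is a tiny invented list (Terrier's own stopword list is much longer), and crude_stem is a simplified suffix stripper standing in for Porter stemming, not the real algorithm.

```python
import re

# Tiny illustrative stopword list (an assumption, not Terrier's actual list).
STOPWORDS = {"the", "of", "a", "an", "is", "are", "in", "to", "what", "for"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for Porter stemming, NOT the real algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def stemmed_terms(text):
    # Approximates the terrier_stemmed variant: drop stopwords, then stem.
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


def unstemmed_terms(text):
    # Approximates the terrier_unstemmed variant: keep every token as written.
    return tokenize(text)


query = "what are the causes of flooding"
print(stemmed_terms(query))
print(unstemmed_terms(query))
```

With the stemmed variant, "flooding" and "floods" would collapse to the same term, which tends to help recall. The unstemmed variant keeps exact word forms and stopwords, which can matter when the precise wording of the question carries meaning.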