MSMARCO Document Ranking
A document ranking corpus of 3.2 million documents, also used by the TREC Deep Learning track.
Variants
We have 5 index variants for this dataset:
- terrier_stemmed
- terrier_stemmed_docT5query
- terrier_stemmed_text
- terrier_unstemmed
- terrier_unstemmed_text
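All variants are downloaded automatically on first use through the dataset API. As a minimal sketch, a variant can also be fetched and opened directly to inspect its statistics:

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('msmarco_document')
index_ref = dataset.get_index('terrier_stemmed')  # downloads the pre-built index on first use
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())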
terrier_stemmed
Terrier index with Terrier's default Porter stemming and stopword removal.
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='BM25', num_results=100)
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='DPH', num_results=100)
dph_bo1_terrier_stemmed = (
dph_terrier_stemmed
>> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarco_document').get_index('terrier_stemmed'))
>> dph_terrier_stemmed)
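These pipelines can be compared with pt.Experiment. A sketch, assuming the dataset's 'dev' topics and qrels variants; MSMARCO dev judgments are sparse, so reciprocal rank is the customary measure:

dataset = pt.get_dataset('msmarco_document')
pt.Experiment(
    [bm25_terrier_stemmed, dph_bo1_terrier_stemmed],
    dataset.get_topics('dev'),  # assumption: 'dev' topics/qrels variants are available
    dataset.get_qrels('dev'),
    eval_metrics=['recip_rank', 'map'],
    names=['BM25', 'DPH + Bo1 QE'])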
terrier_stemmed_docT5query
Terrier index using docT5query document expansion, with Porter stemming and stopword removal applied. This index was built from the expanded MSMARCO files linked from the authors' original repository. To create docT5query indices for other corpora, use the pyterrier_doc2query plugin (a sketch follows the example below).
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_docT5query', wmodel='BM25', num_results=100)
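To build a similar index over another corpus, the following rough sketch uses the pyterrier_doc2query plugin; the argument names follow the plugin's examples and should be checked against its README. Doc2Query reads a 'text' attribute, so the MSMARCO 'body' field is mapped onto it first:

#!pip install git+https://github.com/terrierteam/pyterrier_doc2query.git
from pyterrier_doc2query import Doc2Query
dataset = pt.get_dataset('irds:msmarco-document')
doc2query = Doc2Query(append=True)  # append the generated queries to each document's text
indexer = (
    pt.apply.text(lambda row: row['body'])  # copy 'body' into the 'text' attribute Doc2Query expects
    >> doc2query
    >> pt.IterDictIndexer('./msmarco_doc2query_index'))
indexer.index(dataset.get_corpus_iter())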
terrier_stemmed_text
Terrier index with Terrier's default Porter stemming and stopword removal. The document text is also saved in the MetaIndex to facilitate BERT-based reranking.
Use this for retrieval in PyTerrier:
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git
import onir_pt
# Let's use a Vanilla BERT ranker from OpenNIR, with the Capreolus model available on Hugging Face
vanilla_bert = onir_pt.reranker('hgf4_joint', text_field='body', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})
bm25_bert_terrier_stemmed_text = (
pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'title', 'body'], num_results=100)
>> pt.text.sliding(length=128, stride=64, prepend_attr='title')
>> vanilla_bert
>> pt.text.max_passage())
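The composed pipeline behaves like any other PyTerrier transformer, so a single query can be run with .search(); for example:

res = bm25_bert_terrier_stemmed_text.search('what is durable medical equipment')
print(res[['docno', 'score', 'rank']].head())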
terrier_unstemmed
Terrier index with no stemming and no stopword removal.
Use this for retrieval in PyTerrier:
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed', wmodel='BM25', num_results=100)
dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed', wmodel='DPH', num_results=100)
dph_bo1_terrier_unstemmed = (
dph_terrier_unstemmed
>> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarco_document').get_index('terrier_unstemmed'))
>> dph_terrier_unstemmed)
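To measure what stemming contributes on this corpus, the unstemmed pipelines can be compared against their stemmed counterparts defined above; again a sketch assuming the 'dev' topics and qrels variants:

dataset = pt.get_dataset('msmarco_document')
pt.Experiment(
    [dph_bo1_terrier_stemmed, dph_bo1_terrier_unstemmed],
    dataset.get_topics('dev'),
    dataset.get_qrels('dev'),
    eval_metrics=['recip_rank'],
    names=['DPH + Bo1 (stemmed)', 'DPH + Bo1 (unstemmed)'])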
terrier_unstemmed_text
Terrier index with no stemming and no stopword removal. The document text is also saved in the MetaIndex to facilitate BERT-based reranking.
Use this for retrieval in PyTerrier:
#!pip install git+https://github.com/Georgetown-IR-Lab/OpenNIR.git
import onir_pt
# Let's use a Vanilla BERT ranker from OpenNIR, with the Capreolus model available on Hugging Face
vanilla_bert = onir_pt.reranker('hgf4_joint', text_field='body', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})
bm25_bert_terrier_unstemmed_text = (
pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed_text', wmodel='BM25', metadata=['docno', 'title', 'body'], num_results=100)
>> pt.text.sliding(length=128, stride=64, prepend_attr='title')
>> vanilla_bert
>> pt.text.max_passage())
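As with the sparse pipelines, the reranker can be evaluated against its BM25 first stage. A sketch, assuming the dataset's 'test' topics and qrels variants (the TREC Deep Learning judgments), where nDCG@10 is the usual measure:

dataset = pt.get_dataset('msmarco_document')
bm25_only = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed_text', wmodel='BM25', num_results=100)
pt.Experiment(
    [bm25_only, bm25_bert_terrier_unstemmed_text],
    dataset.get_topics('test'),  # assumption: 'test' corresponds to the TREC DL topics
    dataset.get_qrels('test'),
    eval_metrics=['ndcg_cut_10', 'map'],
    names=['BM25', 'BM25 >> VanillaBERT'])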