MSMARCOv2 Document Ranking
Version 2 of the MSMARCO document ranking corpus, containing 11.9 million documents. It is also used by the TREC 2021 Deep Learning track.
Variants
We have 2 index variants for this dataset:
terrier_stemmed
Terrier index using the default Porter stemming, with stopwords removed.
Use this for retrieval in PyTerrier:
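import pyterrier as pt  # older PyTerrier releases also need pt.init() before first use

# BM25 and DPH retrievers over the pre-built index
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='BM25')
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='DPH')

# DPH retrieval, Bo1 query expansion on the top-ranked documents, then DPH again with the expanded query
dph_bo1_terrier_stemmed = (
    dph_terrier_stemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarcov2_document').get_index('terrier_stemmed'))
    >> dph_terrier_stemmed)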
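The last pipeline above retrieves with DPH, expands the query with Bo1, and retrieves again with the expanded query. The following is a minimal pure-Python sketch of that retrieve, expand, retrieve pattern, not PyTerrier's implementation. The toy corpus, the function names, and the term-count ranker standing in for DPH are all assumptions made for illustration, and unlike Terrier the sketch appends expansion terms without reweighting the query. The Bo1 weight used is tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n), with P_n the collection frequency of the term divided by the number of documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already tokenised); not MSMARCO data.
DOCS = {
    "d1": "neural ranking models for document retrieval".split(),
    "d2": "document retrieval with neural networks".split(),
    "d3": "classic document retrieval with inverted indexes".split(),
    "d4": "cooking pasta with tomato sauce".split(),
}

def retrieve(query_terms, docs):
    """Stand-in ranker (NOT DPH): score = total occurrences of the query terms."""
    scores = {d: sum(toks.count(t) for t in set(query_terms)) for d, toks in docs.items()}
    matched = [d for d in scores if scores[d] > 0]
    # Highest score first; ties broken by document id for determinism.
    return sorted(matched, key=lambda d: (-scores[d], d))

def bo1_expand(query_terms, ranked, docs, fb_docs=3, fb_terms=10):
    """Append the fb_terms highest-weighted Bo1 terms from the top fb_docs documents."""
    n_docs = len(docs)
    coll_tf = Counter(t for toks in docs.values() for t in toks)
    fb_tf = Counter(t for d in ranked[:fb_docs] for t in docs[d])
    weights = {}
    for term, tf_x in fb_tf.items():
        p_n = coll_tf[term] / n_docs
        weights[term] = tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    top = sorted(weights, key=lambda t: (-weights[t], t))[:fb_terms]
    return list(query_terms) + [t for t in top if t not in query_terms]

def pipeline(query_terms, docs):
    """First-pass retrieval >> query expansion >> second-pass retrieval."""
    first = retrieve(query_terms, docs)
    expanded = bo1_expand(query_terms, first, docs, fb_docs=2, fb_terms=3)
    return first, expanded, retrieve(expanded, docs)

first, expanded, reranked = pipeline(["neural"], DOCS)
```

In this toy run the expanded query picks up terms from the two feedback documents, so the second pass can also return a document that never mentions the original query term.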
terrier_stemmed_positions
Terrier index using the default Porter stemming, with stopwords removed. Term positions are stored to support proximity and phrase queries.
Use this for retrieval in PyTerrier:
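import pyterrier as pt  # older PyTerrier releases also need pt.init() before first use

# BM25 and DPH retrievers over the pre-built positional index
bm25_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed_positions', wmodel='BM25')
dph_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed_positions', wmodel='DPH')

# DPH retrieval, Bo1 query expansion on the top-ranked documents, then DPH again with the expanded query
dph_bo1_terrier_stemmed_positions = (
    dph_terrier_stemmed_positions
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarcov2_document').get_index('terrier_stemmed_positions'))
    >> dph_terrier_stemmed_positions)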
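To illustrate why the terrier_stemmed_positions variant stores position information, here is a small pure-Python sketch of a positional index answering a phrase query. It is a toy model only and says nothing about Terrier's actual index format. The documents and the function names build_positional_index and phrase_match are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for already-tokenised documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted ids of documents where the phrase terms occur consecutively."""
    if not phrase:
        return []
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, starts in index.get(first, {}).items():
        for start in starts:
            # Term i of the remainder must appear at position start + i + 1.
            if all(start + i + 1 in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(rest)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical toy documents; all three contain both query terms.
toy_docs = {
    "d1": "deep learning for ranking".split(),
    "d2": "learning deep ranking models".split(),
    "d3": "ranking with deep learning".split(),
}
toy_index = build_positional_index(toy_docs)
phrase_hits = phrase_match(toy_index, ["deep", "learning"])
```

An index without positions could only report that all three toy documents contain both terms. With positions, the document where the terms appear in the wrong order is excluded.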