TREC COVID
A collection of scientific articles related to COVID-19. This dataset uses the 2020-07-16 release of CORD-19, which is the version used by the TREC COVID complete benchmark.
Variants
We have 8 index variants for this dataset:
- terrier_stemmed
- terrier_stemmed_positions
- terrier_stemmed_text
- terrier_unstemmed
- terrier_unstemmed_positions
- terrier_unstemmed_text
- ance_msmarco_psg
- colbert_uog44k
terrier_stemmed
Terrier's default Porter stemming, with stopwords removed.
Use this for retrieval in PyTerrier:
# The snippets on this page assume PyTerrier has been imported as pt.
import pyterrier as pt
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH')
# DPH retrieval, then Bo1 query expansion, then DPH retrieval with the expanded query
dph_bo1_terrier_stemmed = (
    dph_terrier_stemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed'))
    >> dph_terrier_stemmed)
terrier_stemmed_positions
Terrier's default Porter stemming, no stopword removal. Position information is saved for proximity or phrase queries.
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='BM25')
dph_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='DPH')
dph_bo1_terrier_stemmed_positions = (
    dph_terrier_stemmed_positions
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_positions'))
    >> dph_terrier_stemmed_positions)
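To illustrate why saved position information matters for phrase queries, the following is a minimal pure-Python sketch of a positional index. It is a toy illustration only and does not reflect Terrier's actual data structures; the names build_positional_index and phrase_match and the two example documents are hypothetical, and no stemming is applied.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {docno: [positions]} for a dict of docno -> text."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(docno, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted docnos where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # A candidate document must contain every term of the phrase.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    matches = []
    for docno in candidates:
        starts = index[terms[0]][docno]
        # The phrase matches if term i is found at position start + i for some start.
        if any(all(start + i in index[t][docno] for i, t in enumerate(terms))
               for start in starts):
            matches.append(docno)
    return sorted(matches)

# Hypothetical toy documents, not taken from CORD-19.
docs = {
    "d1": "coronavirus origin in bats",
    "d2": "origin of the novel coronavirus",
}
index = build_positional_index(docs)
phrase_hits = phrase_match(index, "coronavirus origin")
```

Both toy documents contain the two query terms, but only d1 contains them adjacent and in order. Without positions, an index can only report that both terms occur somewhere in each document.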
terrier_stemmed_text
Terrier's default Porter stemming, with stopwords removed. Text is also saved in the MetaIndex to facilitate BERT-based reranking.
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='BM25')
dph_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='DPH')
dph_bo1_terrier_stemmed_text = (
    dph_terrier_stemmed_text
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_text'))
    >> dph_terrier_stemmed_text)
terrier_unstemmed
Terrier index, no stemming, no stopword removal.
Use this for retrieval in PyTerrier:
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='BM25')
dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='DPH')
dph_bo1_terrier_unstemmed = (
    dph_terrier_unstemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed'))
    >> dph_terrier_unstemmed)
terrier_unstemmed_positions
Terrier index, no stemming, no stopword removal. Position information is saved for proximity or phrase queries.
Use this for retrieval in PyTerrier:
bm25_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='BM25')
dph_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='DPH')
dph_bo1_terrier_unstemmed_positions = (
    dph_terrier_unstemmed_positions
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_positions'))
    >> dph_terrier_unstemmed_positions)
terrier_unstemmed_text
Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.
Use this for retrieval in PyTerrier:
bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='BM25')
dph_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='DPH')
dph_bo1_terrier_unstemmed_text = (
    dph_terrier_unstemmed_text
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_text'))
    >> dph_terrier_unstemmed_text)
ance_msmarco_psg
ANCE dense retrieval index using the model trained by the original ANCE authors. Uses the pyterrier_ance plugin. Since most documents exceed the maximum length supported by ANCE, passages were constructed with a sliding window of 150 tokens (stride 75, with the title prepended to each passage). Passage scores therefore need to be aggregated into document scores, e.g., using pt.text.max_passage().
Use this for retrieval in PyTerrier:
#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git
from pyterrier_ance import ANCERetrieval
ance = (
    ANCERetrieval.from_dataset('trec-covid', 'ance_msmarco_psg')
    >> pt.text.max_passage())
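The sliding-window passaging described above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions rather than the code that built the index: the function name sliding_passages is hypothetical, tokens are assumed to be whitespace-separated words, and the "%p" passage-id suffix follows the convention that PyTerrier's passaging utilities use.

```python
def sliding_passages(docno, title, body, length=150, stride=75):
    """Split body into overlapping windows of `length` whitespace tokens,
    prepending the title to each window.

    Returns a list of (passage_id, passage_text) pairs.
    """
    tokens = body.split()
    passages = []
    start = 0
    while True:
        window = tokens[start:start + length]
        passage_id = f"{docno}%p{len(passages)}"
        passage_text = (title + " " + " ".join(window)).strip()
        passages.append((passage_id, passage_text))
        # Stop once the current window reaches the end of the document.
        if start + length >= len(tokens):
            break
        start += stride
    return passages

# A hypothetical 400-token document yields windows starting at 0, 75, 150, 225, 300.
body = " ".join(f"w{i}" for i in range(400))
passages = sliding_passages("doc1", "Title", body)
```

With a stride of half the window length, every token away from the document edges appears in two passages, so a relevant span that straddles one window boundary is still seen whole in the neighbouring window.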
colbert_uog44k
ColBERT dense retrieval index using the model trained by UoG for the TREC 2020 DL track. Uses the pyterrier_colbert plugin. Since most documents exceed the maximum length supported by ColBERT, passages were constructed with a sliding window of 150 tokens (stride 75, with the title prepended to each passage). Passage scores therefore need to be aggregated into document scores, e.g., using pt.text.max_passage().
Use this for retrieval in PyTerrier:
#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git
from pyterrier_colbert.ranking import ColBERTFactory
colbert_e2e = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').end_to_end()
    >> pt.text.max_passage())
colbert_prf_rank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=False)
    >> pt.text.max_passage())
colbert_prf_rerank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=True)
    >> pt.text.max_passage())
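For intuition about what the max-passage aggregation step at the end of each pipeline does, here is a minimal pure-Python sketch. It is not PyTerrier's implementation: the function name max_passage_scores and the example scores are hypothetical, and passage ids are assumed to take the form "<docno>%p<n>".

```python
def max_passage_scores(passage_scores):
    """Collapse (passage_id, score) pairs into one score per parent document,
    keeping the highest-scoring passage of each document.

    Returns (docno, score) pairs sorted by descending score.
    """
    best = {}
    for passage_id, score in passage_scores:
        # Strip the "%p<n>" suffix to recover the parent document id.
        docno = passage_id.rsplit("%p", 1)[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical passage-level scores for two documents.
ranking = max_passage_scores([("d1%p0", 1.2), ("d1%p1", 3.4), ("d2%p0", 2.0)])
# ranking == [("d1", 3.4), ("d2", 2.0)]
```

Taking the maximum ranks a document by its single best passage. Other aggregations, such as the mean or the first passage, are possible and may suit some tasks better.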