PyTerrier Data Repository

TREC COVID

A collection of scientific articles related to COVID-19. This uses the 2020-07-16 version of CORD-19, the version used by the TREC COVID complete benchmark.

Variants

We have 8 index variants for this dataset:

terrier_stemmed
terrier_stemmed_positions
terrier_stemmed_text
terrier_unstemmed
terrier_unstemmed_positions
terrier_unstemmed_text
ance_msmarco_psg
colbert_uog44k

terrier_stemmed

Last Update 2021-09-30, 75.0MB

Indexed with Terrier's default Porter stemming, with stopwords removed.

Use this for retrieval in PyTerrier:

import pyterrier as pt  # older PyTerrier releases also need pt.init() before use

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH')

dph_bo1_terrier_stemmed = (
    dph_terrier_stemmed 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed')) 
    >> dph_terrier_stemmed)

terrier_stemmed_positions

Last Update 2021-09-30, 129.8MB

Terrier's default Porter stemming, no stopword removal. Position information is saved for proximity or phrase queries.
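
To illustrate why stored positions matter for phrase queries, here is a minimal pure-Python sketch of a positional index. It is not how Terrier stores or matches positions internally; the function names `build_positional_index` and `phrase_match` are hypothetical, tokenisation is plain lowercased whitespace splitting, and no stemming is applied.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {docno: [positions]} (hypothetical helper, whitespace tokens)."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][docno].append(pos)
    return index


def phrase_match(index, phrase):
    """Return docnos in which the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for docno in sorted(candidates):
        starts = index[terms[0]][docno]
        # The phrase matches if, for some start, term k sits at position start + k.
        if any(all(s + k in index[t][docno] for k, t in enumerate(terms))
               for s in starts):
            hits.append(docno)
    return hits


docs = {
    "d1": "the spike protein binds ACE2",
    "d2": "protein in the spike region",
}
index = build_positional_index(docs)
print(phrase_match(index, "spike protein"))
```

Both toy documents contain "spike" and "protein", but only d1 has them adjacent and in order. An index without positions cannot make that distinction.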

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='BM25')

dph_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='DPH')

dph_bo1_terrier_stemmed_positions = (
    dph_terrier_stemmed_positions 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_positions')) 
    >> dph_terrier_stemmed_positions)

terrier_stemmed_text

Last Update 2021-09-30, 184.9MB

Indexed with Terrier's default Porter stemming, with stopwords removed. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='BM25')

dph_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='DPH')

dph_bo1_terrier_stemmed_text = (
    dph_terrier_stemmed_text 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_text')) 
    >> dph_terrier_stemmed_text)

terrier_unstemmed

Last Update 2021-09-30, 97.7MB

Terrier index, no stemming, no stopword removal

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='DPH')

dph_bo1_terrier_unstemmed = (
    dph_terrier_unstemmed 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed')) 
    >> dph_terrier_unstemmed)

terrier_unstemmed_positions

Last Update 2021-09-30, 189.1MB

Terrier index, no stemming, no stopword removal. Position information is saved for proximity or phrase queries.

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='BM25')

dph_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='DPH')

dph_bo1_terrier_unstemmed_positions = (
    dph_terrier_unstemmed_positions 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_positions')) 
    >> dph_terrier_unstemmed_positions)

terrier_unstemmed_text

Last Update 2021-09-30, 207.6MB

Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='BM25')

dph_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='DPH')

dph_bo1_terrier_unstemmed_text = (
    dph_terrier_unstemmed_text 
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_text')) 
    >> dph_terrier_unstemmed_text)

ance_msmarco_psg

Last Update 2021-10-01, 777.5MB

ANCE dense retrieval index using the model trained by the original ANCE authors. Uses the pyterrier_ance plugin. Since most documents exceed the maximum length supported by ANCE, a sliding window of 150 tokens was used (stride 75, prepending title) to construct passages. As such, passage scores need to be aggregated, e.g., using pt.text.max_passage().
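
To make the passaging described above concrete, here is a minimal sketch of a 150-token sliding window with stride 75 and the title prepended to each passage. It is an illustration only: whitespace tokenisation and the `docno%pN` passage identifier scheme are assumptions, the tokeniser actually used to build this index is not stated here, and the function name `sliding_window_passages` is hypothetical.

```python
def sliding_window_passages(docno, title, body, length=150, stride=75):
    """Split body into overlapping passages, prepending the title to each.

    Assumes whitespace tokens and a 'docno%pN' passage id (both assumptions).
    """
    tokens = body.split()
    passages = []
    for i, start in enumerate(range(0, max(len(tokens), 1), stride)):
        window = tokens[start:start + length]
        text = f"{title} {' '.join(window)}".strip()
        passages.append({"docno": f"{docno}%p{i}", "text": text})
        # Stop once the current window has reached the end of the document.
        if start + length >= len(tokens):
            break
    return passages


# A 300-token toy document: windows start at tokens 0, 75 and 150.
body = " ".join(f"t{i}" for i in range(300))
passages = sliding_window_passages("d1", "Title", body)
print([p["docno"] for p in passages])
```

Because consecutive windows overlap by half, a sentence cut off at the end of one passage appears whole in the next. The cost is roughly twice as many passages to encode and score.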

Use this for retrieval in PyTerrier:

#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

import pyterrier as pt
from pyterrier_ance import ANCERetrieval

ance = (
    ANCERetrieval.from_dataset('trec-covid', 'ance_msmarco_psg') 
    >> pt.text.max_passage())

colbert_uog44k

Last Update 2021-10-01, 10.5GB

ColBERT dense retrieval index using the model trained by UoG for the TREC 2020 DL track. Uses the pyterrier_colbert plugin. Since most documents exceed the maximum length supported by ColBERT, a sliding window of 150 tokens was used (stride 75, prepending title) to construct passages. As such, passage scores need to be aggregated, e.g., using pt.text.max_passage().
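
The aggregation step mentioned above can be sketched in a few lines of plain Python. This is not the implementation of pt.text.max_passage(); it only illustrates the idea for a single query, assuming passage identifiers of the form `docno%pN` (an assumption) and a hypothetical function name `max_passage_scores`.

```python
def max_passage_scores(passage_scores):
    """Collapse (passage_id, score) pairs to one score per document.

    Each document receives the score of its best passage; results are
    returned as (docno, score) pairs sorted by descending score.
    Assumes passage ids look like 'docno%pN' (an assumption).
    """
    best = {}
    for passage_id, score in passage_scores:
        docno = passage_id.split("%p")[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Toy passage ranking for one query.
ranked = max_passage_scores([("d1%p0", 1.2), ("d1%p1", 3.4), ("d2%p0", 2.0)])
print(ranked)  # [('d1', 3.4), ('d2', 2.0)]
```

Taking the maximum rewards a document for containing one highly relevant passage. Other aggregations, such as the mean or the first passage, would favour documents that are relevant throughout or relevant early.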

Use this for retrieval in PyTerrier:

#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

import pyterrier as pt
from pyterrier_colbert.ranking import ColBERTFactory

colbert_e2e = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').end_to_end() 
    >> pt.text.max_passage())

colbert_prf_rank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=False) 
    >> pt.text.max_passage())

colbert_prf_rerank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=True) 
    >> pt.text.max_passage())