PyTerrier Data Repository

TREC COVID

A collection of scientific articles related to COVID-19. This uses the 2020-07-16 version of the CORD-19 dataset, which is the version used by the TREC COVID complete benchmark.

Variants

We have 8 index variants for this dataset:

terrier_stemmed

Last Update: 2021-09-30, 75.0MB

Terrier's default Porter stemming, and stopwords removed

Use this for retrieval in PyTerrier:

import pyterrier as pt  # every snippet on this page assumes this import

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH')

dph_bo1_terrier_stemmed = (
    dph_terrier_stemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed'))
    >> dph_terrier_stemmed)

terrier_stemmed_positions

Last Update: 2021-09-30, 129.8MB

Terrier's default Porter stemming, no stopword removal. Position information is saved for proximity or phrase queries.
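
To illustrate why stored positions matter, here is a minimal pure-Python sketch of a positional index and a phrase match over it. It is a toy illustration only, not how Terrier stores or queries positions; the function names build_positional_index and phrase_match, and the example documents, are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {docno: [positions]} for a dict of docno -> token list."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][docno].append(pos)
    return index


def phrase_match(index, phrase):
    """Return the sorted docnos in which the phrase terms occur consecutively."""
    if not phrase:
        return []
    first, rest = phrase[0], phrase[1:]
    matches = []
    for docno, starts in index.get(first, {}).items():
        for start in starts:
            # Every later phrase term must sit exactly `offset` positions after `start`.
            if all(start + offset in index.get(term, {}).get(docno, [])
                   for offset, term in enumerate(rest, start=1)):
                matches.append(docno)
                break
    return sorted(matches)


# Hypothetical, already-tokenised documents.
docs = {
    "d1": ["spike", "protein", "bind", "ace2"],
    "d2": ["protein", "spike", "mutat"],
}
index = build_positional_index(docs)
print(phrase_match(index, ["spike", "protein"]))
```

Without positions, both documents would look identical for this query, since each contains both terms; only the recorded offsets distinguish the exact phrase.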

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='BM25')

dph_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_positions', wmodel='DPH')

dph_bo1_terrier_stemmed_positions = (
    dph_terrier_stemmed_positions
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_positions'))
    >> dph_terrier_stemmed_positions)

terrier_stemmed_text

Last Update: 2021-09-30, 184.9MB

Terrier's default Porter stemming, and stopwords removed. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='BM25')

dph_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed_text', wmodel='DPH')

dph_bo1_terrier_stemmed_text = (
    dph_terrier_stemmed_text
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed_text'))
    >> dph_terrier_stemmed_text)

terrier_unstemmed

Last Update: 2021-09-30, 97.7MB

Terrier index, no stemming, no stopword removal

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='DPH')

dph_bo1_terrier_unstemmed = (
    dph_terrier_unstemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed'))
    >> dph_terrier_unstemmed)

terrier_unstemmed_positions

Last Update: 2021-09-30, 189.1MB

Terrier index, no stemming, no stopword removal. Position information is saved for proximity or phrase queries.

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='BM25')

dph_terrier_unstemmed_positions = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_positions', wmodel='DPH')

dph_bo1_terrier_unstemmed_positions = (
    dph_terrier_unstemmed_positions
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_positions'))
    >> dph_terrier_unstemmed_positions)

terrier_unstemmed_text

Last Update: 2021-09-30, 207.6MB

Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='BM25')

dph_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed_text', wmodel='DPH')

dph_bo1_terrier_unstemmed_text = (
    dph_terrier_unstemmed_text
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed_text'))
    >> dph_terrier_unstemmed_text)

ance_msmarco_psg

Last Update: 2021-10-01, 777.5MB

ANCE dense retrieval index built with the model trained by the original ANCE authors. Uses the pyterrier_ance plugin. Since most documents exceed the maximum length supported by ANCE, passages were constructed with a sliding window of 150 tokens (stride 75, title prepended). Passage scores therefore need to be aggregated into document scores, e.g., using pt.text.max_passage().
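
The sliding-window passage construction described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used to build the index: tokens are taken to be whitespace-separated words, the function name sliding_passages is hypothetical, and the "%pN" passage-id suffix is assumed to follow PyTerrier's passaging convention.

```python
def sliding_passages(docno, title, body, length=150, stride=75):
    """Split `body` into overlapping windows of `length` tokens, advancing by `stride`.

    Returns (passage_id, text) pairs; the title is prepended to every passage.
    Assumption: a "token" is a whitespace-separated word.
    """
    tokens = body.split()
    passages = []
    start = 0
    while True:
        window = tokens[start:start + length]
        text = " ".join(([title] if title else []) + window)
        passages.append((f"{docno}%p{len(passages)}", text))
        if start + length >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return passages


# Small window sizes so the overlap is easy to see.
for passage_id, text in sliding_passages("doc1", "T", "a b c d e f g", length=4, stride=2):
    print(passage_id, text)
```

With a stride of half the window length, as in the 150/75 setting, every token away from the document edges appears in two passages, so content that straddles a window boundary is still seen whole by at least one passage.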

Use this for retrieval in PyTerrier:

#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

from pyterrier_ance import ANCERetrieval

ance = (
    ANCERetrieval.from_dataset('trec-covid', 'ance_msmarco_psg')
    >> pt.text.max_passage())

colbert_uog44k

Last Update: 2021-10-01, 10.5GB

ColBERT dense retrieval index built with the model trained by UoG for the TREC 2020 DL track. Uses the pyterrier_colbert plugin. Since most documents exceed the maximum length supported by ColBERT, passages were constructed with a sliding window of 150 tokens (stride 75, title prepended). Passage scores therefore need to be aggregated into document scores, e.g., using pt.text.max_passage().
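
As a rough picture of what max-passage aggregation does, the sketch below collapses passage-level scores to one score per document by keeping the best-scoring passage. It is a simplified stand-in written for illustration, not the implementation of pt.text.max_passage(); the function name aggregate_max_passage is hypothetical, and the passage ids are assumed to have the form "docno%pN".

```python
def aggregate_max_passage(passage_scores):
    """Collapse (passage_id, score) pairs to one score per document.

    Each document keeps the score of its highest-scoring passage.
    Assumption: passage ids look like "<docno>%p<N>".
    """
    best = {}
    for passage_id, score in passage_scores:
        docno = passage_id.rsplit("%p", 1)[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    # Highest-scoring documents first, as in a ranked result list.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical passage-level scores for two documents.
scores = [("d1%p0", 1.2), ("d2%p0", 2.0), ("d1%p1", 3.4)]
print(aggregate_max_passage(scores))
```

Taking the maximum treats a document as relevant when any one of its passages matches the query well; averaging or summing passage scores are alternative aggregations with different trade-offs.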

Use this for retrieval in PyTerrier:

#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

from pyterrier_colbert.ranking import ColBERTFactory

colbert_e2e = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').end_to_end()
    >> pt.text.max_passage())

colbert_prf_rank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=False)
    >> pt.text.max_passage())

colbert_prf_rerank = (
    ColBERTFactory.from_dataset('trec-covid', 'colbert_uog44k').colbert_prf(rerank=True)
    >> pt.text.max_passage())