This notebook demonstrates retrieval using PyTerrier on the TREC COVID corpus.
About the corpus: a collection of scientific articles related to COVID-19. This notebook uses the 2020-07-16 version of CORD-19, which is the version used by the TREC COVID complete benchmark.
#!pip install -q python-terrier
import pyterrier as pt
if not pt.started():
    pt.init()
from pyterrier.measures import *
dataset = pt.get_dataset('trec-covid')
Terrier index with Terrier's default Porter stemming and stopword removal
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH')
dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed')) >> dph_terrier_stemmed
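The pipeline above retrieves with DPH, expands the query with Bo1 using the top-ranked documents, and retrieves again. As a rough illustration of the expansion step only, here is a minimal pure-Python sketch of Bo1 term weighting as it is commonly stated: w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n), with P_n = F / N. The function name `bo1_weights` and the toy statistics are hypothetical, and this is not Terrier's implementation, which also normalises the weights and selects a fixed number of expansion terms.

```python
import math
from collections import Counter

def bo1_weights(feedback_docs, collection_tf, num_docs):
    """Score candidate expansion terms with the Bo1 formula (sketch).

    feedback_docs: list of token lists for the top-ranked (pseudo-relevant) documents
    collection_tf: term -> frequency of the term in the whole collection (F)
    num_docs:      number of documents in the collection (N)
    """
    # tf_x: frequency of each term within the feedback documents
    tf_x = Counter(term for doc in feedback_docs for term in doc)
    weights = {}
    for term, tfx in tf_x.items():
        p_n = collection_tf[term] / num_docs
        weights[term] = tfx * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return weights

# Toy statistics (made up for illustration, not taken from CORD-19)
toy_feedback = [["covid", "vaccine", "trial"], ["vaccine", "efficacy"]]
toy_collection_tf = {"covid": 900, "vaccine": 50, "trial": 200, "efficacy": 20}
toy_weights = bo1_weights(toy_feedback, toy_collection_tf, num_docs=1000)

# Terms that are frequent in the feedback set but rare in the collection score highest
best_terms = sorted(toy_weights, key=toy_weights.get, reverse=True)
print(best_terms)
```

In this toy setting a term that is common across the whole collection ("covid") receives a low weight, which is why expansion tends to add more specific terms to the query.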
Terrier index, no stemming, no stopword removal
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='BM25')
dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='DPH')
dph_bo1_terrier_unstemmed = dph_terrier_unstemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed')) >> dph_terrier_unstemmed
The 50 topics from the TREC COVID task, with deep judgements, using the natural-language "description" queries.
pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, dph_bo1_terrier_stemmed, bm25_terrier_unstemmed, dph_terrier_unstemmed, dph_bo1_terrier_unstemmed],
    pt.get_dataset('irds:cord19/trec-covid').get_topics('description'),
    pt.get_dataset('irds:cord19/trec-covid').get_qrels(),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[nDCG@10, P@5, P(rel=2)@5, AP],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'dph_bo1_terrier_stemmed', 'bm25_terrier_unstemmed', 'dph_terrier_unstemmed', 'dph_bo1_terrier_unstemmed'])
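To make the reported measures easier to read, here is a minimal pure-Python sketch of P@k (with a minimum relevance level, as in `P(rel=2)@5`) and nDCG@k for a single query. It assumes graded judgements where 0 means not relevant, treats unjudged documents as not relevant, and uses linear gain with a log2 rank discount. The helper names and toy data are hypothetical; this is not the code that `pt.Experiment` runs internally.

```python
import math

def precision_at_k(ranked_docnos, qrels, k, min_rel=1):
    """Fraction of the top-k documents whose judged relevance is at least min_rel."""
    hits = sum(1 for docno in ranked_docnos[:k] if qrels.get(docno, 0) >= min_rel)
    return hits / k

def ndcg_at_k(ranked_docnos, qrels, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ordering of the judgements."""
    dcg = sum(qrels.get(docno, 0) / math.log2(rank + 2)
              for rank, docno in enumerate(ranked_docnos[:k]))
    ideal_gains = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Toy judgements and ranking for one query (made up for illustration)
toy_qrels = {"d1": 2, "d2": 1, "d3": 0, "d4": 2}
toy_ranking = ["d2", "d3", "d1", "d5", "d4"]  # d5 is unjudged

p_at_5 = precision_at_k(toy_ranking, toy_qrels, k=5)                  # relevance >= 1 counts
p_rel2_at_5 = precision_at_k(toy_ranking, toy_qrels, k=5, min_rel=2)  # only relevance >= 2 counts
ndcg_at_10 = ndcg_at_k(toy_ranking, toy_qrels, k=10)
print(p_at_5, p_rel2_at_5, ndcg_at_10)
```

The stricter `min_rel=2` variant can only be equal to or lower than plain P@5 for the same ranking, since it counts a subset of the same documents.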