Vaswani
The Vaswani NPL corpus is a small test collection of around 11,000 abstracts that has been used by the Glasgow IR group for many years (created 1990). Because of its small size, it backs many of the test cases in both Terrier and PyTerrier.
Variants
We have 5 index variants for this dataset:
terrier_stemmed
Terrier's default configuration: Porter stemming, with stopwords removed.
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='DPH')
dph_bo1_terrier_stemmed = (
    dph_terrier_stemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_stemmed'))
    >> dph_terrier_stemmed)
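The three pipelines above can be compared on the topics and relevance judgements that ship with the dataset. The following is a minimal sketch rather than an official recipe: the function name compare_vaswani_baselines and the choice of metrics are illustrative assumptions, and running it requires PyTerrier, Java, and network access to download the index.

```python
def compare_vaswani_baselines():
    """Evaluate BM25, DPH and DPH+Bo1 on the Vaswani topics and qrels.

    Hypothetical helper for illustration; returns a pandas DataFrame
    with one row per system.
    """
    # Imported lazily so that merely defining this function does not start a JVM.
    import pyterrier as pt

    # Older PyTerrier releases need an explicit init(); this guard is an
    # assumption about the installed version and is harmless if Java is running.
    if not pt.started():
        pt.init()

    dataset = pt.get_dataset('vaswani')
    bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
    dph = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='DPH')
    # Retrieve, expand the query with Bo1 using the top-ranked documents, retrieve again.
    dph_bo1 = (
        dph
        >> pt.rewrite.Bo1QueryExpansion(dataset.get_index('terrier_stemmed'))
        >> dph)

    # The metric names are an illustrative choice, not a recommendation from the source.
    return pt.Experiment(
        [bm25, dph, dph_bo1],
        dataset.get_topics(),
        dataset.get_qrels(),
        eval_metrics=['map', 'ndcg_cut_10'],
        names=['BM25', 'DPH', 'DPH+Bo1'],
    )
```

Calling compare_vaswani_baselines() should return a table of effectiveness scores, one row per system. The same pattern applies to the other Terrier variants by changing the variant name.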
terrier_stemmed_text
Terrier's default configuration: Porter stemming, with stopwords removed. The document text is also saved in the MetaIndex to facilitate BERT-based reranking.
Use this for retrieval in PyTerrier:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed_text', wmodel='BM25')
dph_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed_text', wmodel='DPH')
dph_bo1_terrier_stemmed_text = (
    dph_terrier_stemmed_text
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_stemmed_text'))
    >> dph_terrier_stemmed_text)
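Because this variant keeps the document text in the MetaIndex, a retriever can return that text alongside each result, and a reranker can then score query and text pairs. The sketch below illustrates the idea with a toy term-overlap scorer standing in for a neural model. It assumes the metadata key is named 'text'; overlap_score and build_text_reranker are hypothetical names introduced here, not part of PyTerrier.

```python
def overlap_score(row):
    """Toy scorer: fraction of query terms that also occur in the document text.

    A stand-in for a neural reranker; it only needs 'query' and 'text' fields.
    """
    query_terms = set(row['query'].lower().split())
    doc_terms = set(row['text'].lower().split())
    return len(query_terms & doc_terms) / max(len(query_terms), 1)


def build_text_reranker():
    """Build a BM25 >> rerank pipeline over the terrier_stemmed_text variant."""
    # Imported lazily so that merely defining this function does not start a JVM.
    import pyterrier as pt

    # Older PyTerrier releases need an explicit init(); assumption about the version.
    if not pt.started():
        pt.init()

    # Assumption: the text is stored under the metadata key 'text'.
    bm25_with_text = pt.BatchRetrieve.from_dataset(
        'vaswani', 'terrier_stemmed_text', wmodel='BM25',
        metadata=['docno', 'text'])

    # pt.apply.doc_score replaces each result's score with the function's output.
    return bm25_with_text >> pt.apply.doc_score(overlap_score)
```

In practice the toy scorer would be swapped for a BERT-based reranking transformer; the retrieval stage and the metadata argument stay the same.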
terrier_unstemmed
Terrier index with no stemming and no stopword removal.
Use this for retrieval in PyTerrier:
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_unstemmed', wmodel='BM25')
dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_unstemmed', wmodel='DPH')
dph_bo1_terrier_unstemmed = (
    dph_terrier_unstemmed
    >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_unstemmed'))
    >> dph_terrier_unstemmed)
ance_msmarco_psg
ANCE dense retrieval index using the model trained by the original ANCE authors. Uses the pyterrier_ance plugin.
Use this for retrieval in PyTerrier:
#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git
from pyterrier_ance import ANCERetrieval
ance = ANCERetrieval.from_dataset('vaswani', 'ance_msmarco_psg')
colbert_uog44k
ColBERT dense retrieval index using the model trained by UoG for the TREC 2020 DL track. Uses the pyterrier_colbert plugin.
Use this for retrieval in PyTerrier:
#!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git
from pyterrier_colbert.ranking import ColBERTFactory
colbert_e2e = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').end_to_end()
colbert_prf_rank = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').prf(rerank=False)
colbert_prf_rerank = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').prf(rerank=True)