CITREC – Open Evaluation Framework for Citation-based and Text-based Similarity Measures

CITREC is an open evaluation framework for citation-based and text-based similarity measures.

Overview Paper:

B. Gipp, N. Meuschke, and M. Lipinski,
“CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central,”
in Proceedings of the iConference 2015, Newport Beach, California, 2015.

Summary

CITREC prepares the data of two formerly separate collections for citation-based analyses and provides the tools necessary for evaluating similarity measures. The first collection is the PubMed Central Open Access Subset (PMC OAS); the second is the collection used for the Genomics Tracks at the Text REtrieval Conferences (TREC) ’06 and ’07 (see the overview paper of the TREC Genomics collection).

CITREC extends the PMC OAS and TREC Genomics collections by providing:

  1. citation and reference information that includes the position of in-text citations for documents in both collections;
  2. code and pre-computed scores for 35 citation-based and text-based similarity measures;
  3. two gold standards based on Medical Subject Headings (MeSH) descriptors and the relevance feedback gathered for the TREC Genomics collection;
  4. a web-based system (Literature Recommendation Evaluator – LRE) that allows evaluating similarity measures regarding their ability to identify documents that are relevant to user-defined information needs;
  5. tools to statistically analyze and compare the scores that individual similarity measures yield (a minimal sketch of such a comparison follows this list).
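
For illustration, the following minimal Java sketch shows the kind of statistical comparison item 5 refers to: it computes Spearman's rank correlation between the scores that two similarity measures assign to the same document pairs. All class names, identifiers, and values are hypothetical and are not part of the CITREC code base.

    import java.util.*;

    /**
     * Minimal sketch (not the CITREC API): compares two similarity measures by
     * computing Spearman's rank correlation between their scores for the same
     * document pairs.
     */
    public class MeasureComparison {

        /** Assigns ranks (1 = highest score); assumes no ties for brevity. */
        static double[] ranks(double[] scores) {
            Integer[] order = new Integer[scores.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
            double[] r = new double[scores.length];
            for (int pos = 0; pos < order.length; pos++) r[order[pos]] = pos + 1;
            return r;
        }

        /** Spearman's rho via the rank-difference formula (valid without ties). */
        static double spearman(double[] x, double[] y) {
            double[] rx = ranks(x), ry = ranks(y);
            double sumD2 = 0;
            for (int i = 0; i < x.length; i++) {
                double d = rx[i] - ry[i];
                sumD2 += d * d;
            }
            int n = x.length;
            return 1.0 - 6.0 * sumD2 / (n * ((double) n * n - 1));
        }

        public static void main(String[] args) {
            // Hypothetical scores of two measures for the same five document pairs.
            double[] coCitation  = {0.9, 0.4, 0.7, 0.1, 0.3};
            double[] bibCoupling = {0.8, 0.5, 0.6, 0.2, 0.1};
            System.out.printf("Spearman rho = %.3f%n", spearman(coCitation, bibCoupling));
        }
    }

A high correlation between two measures suggests that they capture similar notions of document relatedness; CITREC's analysis tools support comparisons of this kind across all included measures.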

Documentation

  1. Database Overview and Tutorial explaining the structure of the CITREC database and demonstrating the usage of the demo system.
  2. Overview of Similarity Tables listing the similarity measures included in the CITREC framework and explaining the naming conventions for the database tables that contain the similarity scores calculated using the individual measures.
  3. Parser Documentation explaining the procedures for data extraction and cleaning.
  4. LRE Documentation describing the web-based SciPlore Literature Recommendation Evaluator, which allows conducting surveys to gather relevance feedback and establish gold-standard datasets.

Data

PubMed Central Open Access Subset

TREC Genomics collection

Source Code

Analysis Code (Java)

The Java source code includes:

  1. parsers for the PMC OAS and the TREC Genomics collection as well as tools to retrieve MeSH and article metadata from NCBI resources (package org.sciplore.citrec.dataimport)
  2. tools to statistically evaluate retrieval results using a top-k or a rank-based analysis (package org.sciplore.citrec.eval; a sketch of such an analysis follows this list)
  3. implementations of similarity measures and code to calculate the MeSH-based gold standard (package org.sciplore.citrec.sim)
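
To illustrate the evaluation tooling mentioned in item 2, here is a minimal, self-contained Java sketch of a top-k analysis (precision at rank k). The class name and the data are hypothetical; the actual classes in org.sciplore.citrec.eval may differ.

    import java.util.*;

    /**
     * Minimal sketch of a top-k analysis (precision at rank k), using
     * hypothetical data; not part of org.sciplore.citrec.eval.
     */
    public class TopKExample {

        /** Fraction of the k highest-ranked documents that are relevant. */
        static double precisionAtK(List<String> ranking, Set<String> relevant, int k) {
            int hits = 0;
            for (String docId : ranking.subList(0, Math.min(k, ranking.size()))) {
                if (relevant.contains(docId)) hits++;
            }
            return hits / (double) k;
        }

        public static void main(String[] args) {
            // Hypothetical ranking produced by one similarity measure for a query
            // document, and a hypothetical gold-standard set of relevant documents.
            List<String> ranking = List.of("PMC1001", "PMC1002", "PMC1003", "PMC1004", "PMC1005");
            Set<String> relevant = Set.of("PMC1001", "PMC1003", "PMC1009");
            System.out.printf("P@5 = %.2f%n", precisionAtK(ranking, relevant, 5)); // prints 0.40
        }
    }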

LRE Code (PHP)

The source code of the Literature Recommendation Evaluator (LRE) uses the Symfony 2 PHP framework.

Example

This Excel spreadsheet exemplifies a possible evaluation using CITREC data. The spreadsheet compares the scores that different similarity measures yield as a function of the maximum Co-Citation score.

FAQ

Q: Which bibliometric or scientometric methods does CITREC cover?
A: CITREC includes 35 citation-based (link-based), text-based, and semantic concept-based similarity measures. For each of these measures, we provide an implementation (Java code) and pre-computed similarity scores for all documents in our collection. The 35 similarity measures represent variants of the following approaches, e.g., derived by using different weighting parameters: Amsler, Bibliographic Coupling, Co-Citation, Contextual Co-Citation, Citation Proximity Analysis, Linkthrough, Lucene More Like This, and MeSH similarity. The Overview of Similarity Tables presents all 35 measures included in CITREC.
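
As an illustration of one of the approaches named above, the following minimal Java sketch computes the raw Bibliographic Coupling strength of two documents, i.e., the number of references they share. All identifiers are hypothetical; CITREC's actual implementations in org.sciplore.citrec.sim also cover weighted variants and may differ in detail.

    import java.util.*;

    /**
     * Minimal sketch of Bibliographic Coupling with hypothetical data;
     * CITREC's implementations in org.sciplore.citrec.sim may differ.
     */
    public class BibliographicCouplingSketch {

        /** Coupling strength = number of references the two documents share. */
        static int couplingStrength(Set<String> refsA, Set<String> refsB) {
            Set<String> shared = new HashSet<>(refsA);
            shared.retainAll(refsB);
            return shared.size();
        }

        public static void main(String[] args) {
            // Hypothetical reference lists of two documents.
            Set<String> refsA = Set.of("PMC10", "PMC20", "PMC30");
            Set<String> refsB = Set.of("PMC20", "PMC30", "PMC40");
            System.out.println("Coupling strength: " + couplingStrength(refsA, refsB)); // prints 2
        }
    }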

Q: What are the advantages and disadvantages of using CITREC?
A: Advantages:

    • CITREC includes many similarity measures and provides their implementations as well as pre-computed scores for a large document collection.
    • CITREC covers citation-based (link-based), text-based, and semantic concept-based similarity measures.
    • CITREC includes two gold standards:
      • One is automatically derived from expert classifications (MeSH descriptors) and can therefore be re-computed for new documents covered by PubMed (see the sketch after this answer).
      • One was manually created by domain experts and includes relevance judgments for documents and passages; being manually created, it is static.
    • CITREC includes code to perform evaluations and to add more similarity measures and documents to the CITREC framework.
    • CITREC includes a web application (LRE) to conduct user studies.
    • All data and code are open source.

Disadvantages:

  • CITREC’s document collection currently covers only the life sciences.
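
To illustrate how a gold standard can be derived automatically from MeSH descriptors, the following minimal Java sketch scores two documents by the Jaccard overlap of their descriptor sets. This is a deliberate simplification with hypothetical data; CITREC's actual MeSH-based gold standard is more elaborate (see the paper and the documentation).

    import java.util.*;

    /**
     * Minimal sketch of a MeSH-based gold-standard score, simplified to the
     * Jaccard overlap of descriptor sets; CITREC's actual gold standard is
     * more elaborate.
     */
    public class MeshGoldStandardSketch {

        /** Jaccard similarity of the MeSH descriptor sets of two documents. */
        static double jaccard(Set<String> meshA, Set<String> meshB) {
            if (meshA.isEmpty() && meshB.isEmpty()) return 0.0;
            Set<String> intersection = new HashSet<>(meshA);
            intersection.retainAll(meshB);
            Set<String> union = new HashSet<>(meshA);
            union.addAll(meshB);
            return intersection.size() / (double) union.size();
        }

        public static void main(String[] args) {
            // Hypothetical MeSH descriptors assigned to two PubMed documents.
            Set<String> docA = Set.of("Genomics", "Humans", "Gene Expression Profiling");
            Set<String> docB = Set.of("Genomics", "Humans", "Neoplasms");
            System.out.printf("MeSH similarity = %.2f%n", jaccard(docA, docB)); // prints 0.50
        }
    }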

Q: How does CITREC differ from other bibliometric tools like CiteSpace?
A: CITREC focuses on comparatively evaluating the retrieval performance of similarity measures, e.g., to determine which measures are most suitable for realizing a search or recommendation system. Tools like CiteSpace focus on descriptive bibliometric analyses of research fields or document collections and on visualizing the results. Such tools serve to answer research questions like “What are the citation or collaboration patterns in a certain research field?” CITREC includes many variants of measures to allow in-depth experiments for finding the most suitable approach for a specific retrieval task. CITREC also makes it easy for researchers to implement their own measures and compare them to all the measures already included in the framework.

Contribute

CITREC is an open-source project published under the GNU General Public License (GPL) version 2. We warmly invite you to contribute to the continuous development of the framework by sharing results and resources related to CITREC.

If you have performed an evaluation using CITREC, developed a similarity measure, a parser, or any other tool that you would like to share, we would be happy to acknowledge and share your work on this page. If you are interested in making your resources available through this page, please contact us at meuschke{at}uni-wuppertal.de.

Document Collections and Metadata

Below, we link to the sources of the full texts and metadata that we combined, processed, and enhanced as part of the CITREC framework. Please observe the individual licenses of the publishers!

Related Projects

  • ParsCit – Citation Parser
  • SimPack – Java Library for Similarity Measures
  • WebLA – Java package for handling web graphs that implements popular algorithms such as PageRank, HITS, Co-Citation Similarity, and SimRank.

Acknowledgements

We thank everyone who contributed to the creation of the TREC Genomics test collection. Without this great work, the realization of the CITREC framework would not have been possible.

Contact

If you experience any problems or would like to contribute to this project, please send us an email:
meuschke@uni-goettingen.de