In a recent paper posted to the bioRxiv* preprint server, researchers reveal the development of an open-source database that provides data on coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) resources.
Outbreak.info: A standardized, searchable platform to discover and explore COVID-19 resources and data. Image Credit: Studio.c/ Shutterstock
With the ongoing COVID-19 pandemic causing devastation on a global scale, scientists and public health systems alike have been working together to address the challenges the pandemic entails and develop policies to control it.
Since the pandemic began, scientific research has grown at an unparalleled pace, from exploring and testing therapeutic drugs to developing vaccines against SARS-CoV-2. Data suggest that over 52,000 peer-reviewed articles were published during the first year of the COVID-19 crisis, compared with around 1,000 during the initial 12 months of the SARS outbreak in 2002.
The staggering magnitude of research data on COVID-19 and SARS-CoV-2, which continues to expand, requires a combined database to house the research data from across various available repositories in a standardized, searchable, interpretable, and easy-to-access interface.
The pandemic has led to the creation of several databases. For instance, numerous websites report COVID-19 cases across different geographical regions, with data mostly contributed by volunteers.
LitCovid is a hub of the COVID-19 literature, while the data on clinical trials are stored at the National Clinical Trials (NCT) registry. Because these resources are held in separate places, a common library that provides access to COVID-19 resources assembled from various sources is required to aid scientific research.
In the present paper, the authors describe the development of outbreak.info. This website hosts COVID-19 research data assembled by collecting metadata from 14 repositories, bringing together COVID-19 resources that would otherwise remain scattered across hundreds of disparate sources on the internet.
The database hosts data resources from over 200,000 publications, clinical trials, and other related datasets. The collected resources were standardized by developing a schema that prioritizes five classes of COVID-19 research data: publications, datasets, clinical trials, analyses, and protocols.
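To make the idea of a shared schema concrete, the following is a minimal sketch of how one standardized record could be represented and checked against the five classes. The class name, field names, and example values are illustrative assumptions and are not taken from the outbreak.info schema itself.

```python
from dataclasses import dataclass, field

# The five resource classes named in the article.
RESOURCE_TYPES = {"Publication", "Dataset", "ClinicalTrial", "Analysis", "Protocol"}


@dataclass
class Resource:
    """One standardized record. All field names are illustrative assumptions."""
    resource_id: str
    resource_type: str
    name: str
    source: str
    keywords: list = field(default_factory=list)

    def __post_init__(self):
        # Reject anything outside the five prioritized classes.
        if self.resource_type not in RESOURCE_TYPES:
            raise ValueError(f"unknown resource type: {self.resource_type!r}")


# A made-up example record, not a real entry from the database.
record = Resource(
    resource_id="example-001",
    resource_type="Publication",
    name="An example preprint",
    source="bioRxiv",
)
```

Validating the class at construction time means a record from any repository either fits one of the five classes or is rejected before it reaches the shared library.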
Number of resources in outbreak.info as a function of date.
Metadata is ingested into the website in two ways: the first uses BioThings software development kit (SDK) data plugins, and the second allows submissions via an online form. A nested list of topic-based categories was developed from the initial list provided by LitCovid, resulting in 11 broad categories and 24 specific child categories. Epidemiological data were ingested from Johns Hopkins University (JHU) and the New York Times (NYT), and genomics data were integrated from the GISAID database.
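A nested category list of this kind can be pictured as a mapping from each broad category to its child categories. The sketch below is only an illustration: the article reports the counts (11 broad, 24 child) but not the names, so every category name here is a hypothetical placeholder.

```python
# Hypothetical category names; only the overall structure mirrors the article.
TOPIC_TREE = {
    "Treatment": ["Pharmaceutical Treatments", "Medical Care"],
    "Prevention": ["Vaccines", "Public Health Interventions"],
    "Transmission": [],  # a broad category may have no child categories
}


def parent_of(child, tree):
    """Return the broad category that contains `child`, or None if absent."""
    for parent, children in tree.items():
        if child in children:
            return parent
    return None


def count_categories(tree):
    """Return (number of broad categories, number of child categories)."""
    return len(tree), sum(len(children) for children in tree.values())
```

With a structure like this, a resource tagged with a specific child category can also be found under its broad parent category.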
After developing the schema, the researchers created data plugins, or parsers, to import metadata from 14 repositories and ingest it into outbreak.info. These parsers run daily to keep the information current. The most extensive data class was publications collected from LitCovid and the preprint servers bioRxiv and medRxiv. The clinical trial data from the NCT and World Health Organization (WHO) formed the second largest library. The “protocols” class compiled data from two resources, Protocols.io and NCT protocols, while the datasets library sourced its information from Zenodo, the Protein Data Bank (PDB), Figshare, and Harvard Dataverse.
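The role of a parser can be sketched as a function that reads one repository's raw metadata and yields records in a common shape. This is a simplified illustration and not the BioThings SDK plugin interface; the input keys ("id", "title") and the output field names are assumptions chosen for the example.

```python
def parse_source(raw_records, source_name, resource_type):
    """Yield records in a common shape from one repository's raw metadata.

    `raw_records` is any iterable of dicts. The input keys ("id", "title")
    are hypothetical and would differ for each real repository.
    """
    for raw in raw_records:
        title = (raw.get("title") or "").strip()
        if not title:
            continue  # skip entries without a usable title
        yield {
            # Prefix the identifier with the source to avoid cross-source clashes.
            "_id": f"{source_name}:{raw['id']}",
            "@type": resource_type,
            "name": title,
            "source": source_name,
        }


# Made-up sample input standing in for one repository's export.
sample = [{"id": "123", "title": " Example dataset "}, {"id": "124", "title": ""}]
records = list(parse_source(sample, "Zenodo", "Dataset"))
```

Because each parser only has to translate its own source into the common shape, a scheduled job could rerun all of them daily without the parsers needing to know about one another.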
A. Distribution of resources by resource type and source. B. Heterogeneous and filterable resources (i.e., publications, clinical trials, datasets, etc.) resulting from a single search of the phrase “Delta Variant”.
Data available from Imperial College London (ICL) were imported to fill the “Analysis” class. The database also includes a feature that allows submissions from volunteers in the community. Other features include interactive visualizations of the epidemiological data imported from JHU and NYT. Although many other sources compile epidemiological information from JHU, the interface on outbreak.info is built specifically to support research.
The authors of the present work have created a database that makes COVID-19 and SARS-CoV-2 resources easy to access. The massive expansion of research and epidemiological data necessitates a shared library that houses information from many sources in an easy-to-use, searchable, standardized, and interpretable interface. This has been achieved by creating outbreak.info, a feature-rich website that allows contributions from the community. Furthermore, the integration of data compiled from various repositories into a single database allows quick exploration and retrieval of COVID-19 resources irrespective of their source.
In summary, the authors created a website that comprises three components: 1) a searchable interface for COVID-19 resources, 2) a tool to explore epidemiological data and spatiotemporal trends, and 3) surveillance reports on SARS-CoV-2 variants and mutations. The website also offers public application programming interfaces (APIs) to allow access to resource data.
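As a rough illustration of what access through a public API involves, the sketch below builds a search URL for a phrase and an optional resource class. The host, path, and parameter names are placeholders and are not the documented outbreak.info endpoints; the official API documentation should be consulted for the real ones.

```python
from urllib.parse import urlencode

# Placeholder host and path: NOT the documented outbreak.info endpoint.
BASE_URL = "https://api.example.org/resources/query"


def build_query_url(phrase, resource_type=None, size=10, base_url=BASE_URL):
    """Build a search URL; the parameter names (q, size, type) are assumptions."""
    params = {"q": phrase, "size": size}
    if resource_type:
        params["type"] = resource_type
    # urlencode escapes spaces and special characters in the search phrase.
    return f"{base_url}?{urlencode(params)}"


url = build_query_url("Delta Variant", resource_type="Dataset")
```

The example only constructs the request URL and makes no network call, so it can be adapted once the actual endpoint and parameters are known.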
What is Outbreak.info? The Open-Source Hub of COVID-19 Data & Research
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
- Tsueng, Ginger, Julia Mullen, Manar Alkuzweny, Marco Alvarado Cano, Benjamin Rush, Emily Haag, Outbreak Curators, et al. “Outbreak.Info: A Standardized, Searchable Platform to Discover and Explore COVID-19 Resources and Data.” bioRxiv, January 21, 2022. DOI: https://doi.org/10.1101/2022.01.20.477133, https://www.biorxiv.org/content/10.1101/2022.01.20.477133v1