DrivR-Base: A Feature Extraction Toolkit For Variant Effect Prediction Model Construction

doi:10.1093/bioinformatics/btae197

ISSN: 1367-4811

Amy Francis, Colin Campbell, Tom R Gaunt

Computational Mathematics
Computational Theory and Mathematics
Computer Science Applications
Molecular Biology
Biochemistry
Statistics and Probability

Abstract

Motivation

Recent advances in sequencing technologies have led to the discovery of numerous variants in the human genome. However, understanding their precise roles in disease remains challenging because their functional mechanisms are complex. Various methods have emerged to predict the pathogenic significance of these variants. Typically, these methods take an integrative approach, drawing on diverse data sources that provide insight into genomic function. Despite the abundance of publicly available data sources and databases, navigating, extracting, and pre-processing features for machine learning models can be challenging and time-consuming. Furthermore, researchers often invest substantial effort in feature extraction, only to discover later that the features are uninformative.

Results

We introduce DrivR-Base, a resource that efficiently extracts and integrates molecular information (features) related to single nucleotide variants. These features describe the genomic position of a variant and its associated protein position. They are derived from a wide array of databases and tools, including structural properties obtained from AlphaFold, regulatory information sourced from ENCODE, and predicted variant consequences from Variant Effect Predictor. DrivR-Base is deployable via a Docker container to ensure reproducibility and ease of access across diverse computational environments. The resulting features can be used as input for machine learning models designed to predict the pathogenic impact of human genome variants in disease. These feature sets also have applications beyond variant effect prediction, including haploinsufficiency prediction and the development of drug repurposing tools. We describe the resource’s development, practical applications, and potential for future expansion and enhancement.
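To illustrate what integrating features from several sources can look like in practice, the following is a minimal sketch that joins per-source feature tables into one row per variant, ready to be passed to a machine learning model. It is not taken from the DrivR-Base code base: the function name `merge_feature_tables`, the variant key layout, and the source and feature names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumed key layout: (chromosome, position, reference allele, alternate allele).
VariantKey = Tuple[str, int, str, str]
# One table per data source: variant -> {feature name -> value}.
FeatureTable = Dict[VariantKey, Dict[str, float]]


def merge_feature_tables(
    tables: Dict[str, FeatureTable],
) -> Tuple[List[str], Dict[VariantKey, List[float]]]:
    """Join per-source feature tables into one feature row per variant.

    Columns are identified by (source, feature) pairs so that features with
    the same name from different sources cannot collide. A variant missing
    from a source gets NaN for that source's columns.
    """
    # Collect every (source, feature) pair seen in any table, in a stable order.
    column_ids = sorted(
        {
            (source, name)
            for source, table in tables.items()
            for features in table.values()
            for name in features
        }
    )
    # Collect every variant seen in any table.
    variants = sorted({key for table in tables.values() for key in table})

    rows: Dict[VariantKey, List[float]] = {}
    for key in variants:
        rows[key] = [
            tables[source].get(key, {}).get(name, float("nan"))
            for source, name in column_ids
        ]

    column_names = [f"{source}.{name}" for source, name in column_ids]
    return column_names, rows
```

Keying every table on the same variant tuple keeps the join independent of how each source was queried, and filling gaps with NaN leaves the choice of imputation strategy to the downstream model.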

Availability and Implementation

DrivR-Base source code is available at https://github.com/amyfrancis97/DrivR-Base.

Supplementary Information

Supplementary data are available at Bioinformatics online.
