Data-intensive and translational bioinformatics

High-throughput technologies have transformed biomedicine into a data-intensive discipline. This has shifted the focus from traditional data generation and hypothesis testing to more data-driven research, and bioinformatics data analysis has become the bottleneck in many projects. The field is characterized by growing data sets and poorly scalable software, which threatens to severely constrain many biomedical projects. Our group develops new methods and applications to meet the demands of high-throughput biology and drug discovery, using high-throughput e-infrastructures and Big Data analytics frameworks, with the goal of “next-generation bioinformatics”.

Principal investigator: Associate Professor Ola Spjuth

Figure: Growing data sets require new methods and e-infrastructures to allow for scalable analysis and to cope with the data deluge.

Highlighted projects:

Large-scale predictive modeling in drug discovery

This project aims to develop computational methods, tools and predictive models to aid the drug discovery process on large data sets. Methods include ligand-based and structure-based approaches such as QSAR (machine learning) and docking, with applications including prediction of drug safety, toxicology, interactions, target profiles and secondary pharmacology. To analyze large-scale data we use high-performance computing, cloud computing resources, and data analytics platforms such as Apache Hadoop and Apache Spark. We also use and develop scientific workflow systems such as Luigi and BPipe to automate and streamline analysis. The work is carried out in collaboration with AstraZeneca R&D, Maastricht University (NL), and Karolinska Institutet. We aim to make models and tools available from the Bioclipse workbench. We are also founding partners of the OpenTox association and associated partners of the consortia OpenPhacts and e-nanomapper.

Figure: Data is extracted from various data sources, and we use high-performance computing, cloud computing, workflows and big data frameworks to train predictive models, which are published in the Bioclipse workbench for easy, user-friendly access with graphical interpretations.

Prediction of metabolism

This project aims to develop methods for predicting site-of-metabolism and metabolites based on chemical structure. Using data mining techniques we have developed the tool MetaPrint2D for site-of-metabolism prediction. The project aims to improve these models and to extend them to predict putative metabolites. The work is carried out in close collaboration with AstraZeneca R&D, and models and tools are available from the Bioclipse workbench.

Figure: Prediction of site-of-metabolism with the MetaPrint2D method in Bioclipse.

Translational bioinformatics

Translational bioinformatics is defined as: “The development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health”. Our group carries out research on translating massively parallel sequencing into clinical practice through automated bioinformatics analysis, informatics solutions, and reporting systems. Projects include long-read amplicon sequencing of chronic myeloid leukemia (CML), TP53, and multi-drug resistant bacteria. We are also part of the joint SeRC-eSSENCE flagship project “e-Science for Cancer Prevention and Control” (eCPC). Collaborators include the National Genomics Infrastructure (NGI), Uppsala Academic Hospital, and Karolinska Institutet.

Figure: Screenshot of the system we developed for translating long-read amplicon sequencing into a decision aid for chronic myeloid leukemia (CML), showing mutation frequencies in the Philadelphia chromosome.