## Abstract

Rapid advances in technology have led to a wealth of large-scale molecular omics datasets. Integrating such data offers an unprecedented opportunity to assess molecular interactions at multiple functional levels and provide a more comprehensive understanding of the biological pathways involved in different diseases subgroups. However, multiple omics data integration is a challenging task due to the heterogeneity in the different platforms used. There is a need to address the complex and correlated nature of different data-types, in order to identify a robust and reliable multi-omics signature that can predict a phenotype of interest.

We introduce a novel multivariate dimension reduction method for multiple omics integration, classification and identification of a multi-omics molecular signature. DIABLO - Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies, models the correlation structure between omics datasets, resulting in an improved ability to associate biomarkers across multiple functional levels to phenotypes of interest. We demonstrate the capabilities of DIABLO using simulated data and studies of breast cancer and asthma, integrating up to four types of omics datasets to identify relevant biomarkers, while still retaining competitive classification and predictive performance compared to existing methods.

Our statistical integrative framework can benefit a diverse range of research areas with varying types of study designs, as well as enabling module-based analyses. Importantly, graphical outputs of our method assist in the interpretation of such complex analyses and provide significant biological insights.

## Background

Systems biology approaches combine information from different biological components in order to unravel complex processes involved in health and disease [1,2]. Such biological processes are comprised of interactions between different biological layers such as the genome, methylome, transcriptome, proteome and metabolome. These interactions are often missed when each omics level is studied in isolation, leading to an increased number of false positives, loss of information (false negatives) and irreproducible findings [3]. The advent of technological advances coupled with decreasing experimental costs have however made it possible to obtain multiple high dimensional omics datasets, such as transcriptomics, proteomics, and metabolomics, for the same group of individuals or biological samples. Systems approaches including multivariate approaches [4], Bayesian methods [5], and network analyses [6] have been used to combine data originating from different biological layers with the aim to provide a holistic and accurate depiction of molecular processes within biological systems. For example, multivariate methods such as multiple co-inertia analysis (MCIA), used to integrate gene and protein expression of the NCI-60 cell line, revealed pathways not uncovered by single-omics analyses [7]. Bayesian network algorithms have been used to integrate datasets consisting of expression, variation and interaction data from yeast, and identified causal regulators of networks and novel biological mechanisms of expression quantitative trait loci (eQTL) hot spots [8]; condition-specific regulatory networks produced using multiple datasets have been shown to be more accurate than using individual datasets [9]. Others have used modular approaches that reduce high dimensional data to modules (clusters) representing distinct functional processes [10,11]. The next logical step was to determine whether the identified interactions, modules, and networks differed between different disease conditions [12,13]. To that end, differential modular or network analyses were proposed to determine whether modules or networks were statistically different between groups. However, such inferential methods are not predictive models and, as such, cannot be used to classify new subjects into different phenotypic groups.

Machine learning algorithms construct predictive models by “learning” from the data a classification rule that is then used to predict or assign the class membership of new individuals. Common classification methods include discriminant analysis, neural networks, decision trees, support vector machine (SVM) and random forest (RF) to name a few, where performance is assessed using indices such as the classification error rate, prediction accuracy, and the area under the receiver operating curve. A study comparing 176 classifiers showed that RF and SVM led to superior performance accuracy [14]. Linear penalized regression models such as elastic net have been also proposed to simultaneously perform shrinkage and variable selection, thereby resulting in a parsimonious linear model with improved predictive performance [15]. These methods are suited to single dataset analyses, whereas methods that can construct predictive models from multiple high dimensional omics datasets are required as multiple sources of information captured via different data-types becomes available for the same individuals.

Omics data integration as defined by Ritchie *et al.* [16] refers to combining multiple omics datasets in order to develop multivariate models that are predictive of complex traits or phenotypes. However, the task is non-trivial given the numerous analytical challenges. The high dimensionality of each dataset requires not only efficient computational techniques, but also the development of novel methods which are able to identify relevant information; a ‘multi-omics molecular signature’ from the tens of thousands of predictors that are measured [17]. Additionally, the small number of samples compared to the large number of predictors limits statistical power and accuracy of current methods, which may lead to non-reproducible molecular signatures. Finally, while the same samples are profiled within a single study, the different omics platforms employed have their own inherent platform-specific artifacts such as variation between manufacturers and omics technologies. Data heterogeneity is therefore a major obstacle to combining multiple omics studies [17].

Current data integration frameworks enabling the identification of multi-omics molecular signatures in a data-driven analysis include concatenation-based [18] and model-based integration (*e.g*. ensemble classifiers) [19] (**Figure 1A-B**). Concatenation-based integration combines multiple datasets into a single large dataset, with the aim to predict a phenotype of interest. Model-based integration approaches such as ensemble classification construct a predictive model on each individual dataset before combining the model predictions. None of these approaches however account or model relationships between datasets and thus limit our understanding of molecular interactions at multiple functional levels. Therefore, there is a crucial need for novel integrative modeling methods, that can identify a multi-omics molecular signature by borrowing discriminatory strength from complementary information, across multiple functional levels while providing greater insight into disease mechanisms.

We introduce a multivariate dimension reduction discriminant analysis method, DIABLO (Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies, **Figure 1C**) as part of the mixOmics Data Integration Project (http://mixomics.org/) [20,21]. DIABLO aims to maximize the common or correlated information between multiple datasets whilst identifying in an optimal manner the key omics variables (mRNA, miRNA, CpGs, proteins, metabolites, etc.) that explain and reliably classify disease sub-groups or phenotypes of interest. DIABLO builds on Projection to Latent Structure models (PLS) [22], substantially extends both sparse PLS-Discriminant Analysis [23] to multi-omics analyses and sparse generalized canonical correlation analysis [24] to a discriminant analysis framework. In contrary to existing penalized matrix decomposition methods [25] DIABLO models and maximizes the correlation between pairs of pre-specified omics datasets to unravel similar functional relationships between those omics data [26]. In addition, DIABLO provides appealing features by 1) allowing the user to specify the number of variables to select from each dataset 2) constructing a predictive multi-omics model that can be applied to classify new samples even if some datasets are missing, and by 3) allowing for the assessment of the classification performance of the predictive model. The dimension reduction process enables visualization of the samples, as well as biologically relevant variables. DIABLO is a highly flexible method that can handle classical single time point experimental designs, as well as cross-over or repeated measures study designs. Modular-based analysis can also be used in conjunction with DIABLO by inputting pathway-based module matrices [11] instead of omics matrices.

We demonstrate the ability of DIABLO to select relevant, correlated and discriminatory biomarkers, using synthetic data as well as multi-omics datasets from human breast cancer and asthma case studies. In those studies, we integrate up to four omics datasets and show that DIABLO has competitive classification performance with existing single-omics methods and multi-omics integrative frameworks. Importantly, DIABLO yields improved biological insights of multi-omics signatures as we demonstrate in both case studies.

## Results

### The mixDIABLO integrative framework

We describe the mixDIABLO pipeline in **Figure 2** to integrate multiple omics datasets and identify a multi-omics biomarker panel, assess the predictive performance of the model, and generate visualizations to aid in the interpretation of the results. The first step inputs multiple omics datasets measured on the same individuals, that were previously normalized and filtered with optional preprocessing steps such multilevel transformation (for repeated measures study designs) and module-transformations (for Pathway analyses, as described in **Methods**). Prior to multivariate data integration, exploratory and unsupervised data analyses of each omics dataset with Principal Component Analysis (PCA), or sparse PCA built only on a smaller subset of variables [27] can be useful to visualize and understand the major sources of variation in each dataset to be integrated. Then, the omics datasets, along with the phenotype information indicating the class membership of each sample (two or more groups) are input in the multivariate integrative method DIABLO. DIABLO is a multivariate dimension reduction method that seeks for latent components – linear combinations of variables from each omics dataset, that are maximally correlated as specified in a design matrix. The design matrix indicates which datasets should be connected such that their pair-wise correlations are maximized (**Figure 1C**). The design can be determined according to *prior* knowledge (e.g. mRNA and miRNA datasets can be assumed to be connected since miRNAs regulate mRNA expression), or using our proposed data-driven approach that indicates when to connect pairs of datasets (**see Methods**). The identification of a multi-omics panel is performed via l1 penalties that shrink the variable coefficients defining the latent components to zero (**see Methods**). The performance of the DIABLO model and associated multi-omics panel is then assessed using cross-validation repeated several times to ensure reliable evaluation and the balanced error rate (BER) or area under the receiver operating curve (AUC, for two groups) are reported. Lastly, numerous visualizations are proposed to provide insights into the multi-omics panel and guide the interpretation of the selected omics variables, including sample and variable plots (**see Methods**).

### Validation of the DIABLO method on synthetic data

An extensive simulation study was performed to numerically validate the DIABLO classification performance method and its ability to identify a highly correlated and discriminatory signature, as well as investigate the impact of the design matrix on the selected variables (**Additional file 1**). Briefly, two design matrices were tested; the full design (all datasets were connected) and the null design (where no datasets were connected, **see Methods**). Three datasets were generated with equal numbers of observations (n=100, divided into two sample groups) and 150 variables. Amongst the 150 variables, 100 variables were deemed ‘irrelevant’, i.e. not correlated between datasets and not discriminatory, whereas the 50 remaining variables were either 1) correlated across all three datasets but not discriminatory between groups (called *CorNonDis*), 2) correlated and discriminatory (*CorDis*) or 3) not correlated but discriminatory (*NonCorDis*). Several scenarios were considered to simulate these three datasets, such as different fold-change values between the two sample groups (varied from 0 to 1.5) as well as levels of noise. Furthermore, the strength of correlation was varied based on specified variance-covariance matrices (see details in **Additional file 1**). Simulated datasets were generated 20 times for each type of variable and DIABLO was applied with the full and null design selecting 50 variables with one component. We used 10x5-fold cross-validation to evaluate the performance of the integrative model on the different simulated scenarios.

As expected DIABLO with a full design correctly identified a greater proportion of *CorNonDis* variables compared to a null design (**Figure 3A**). The difference increased further with the correlation strength between the variables. DIABLO with a full design also selected a greater proportion of *CorDis* variables compared with a null design, however this difference decreased as the fold-change increased, while no difference was found between the full and null designs when *NonCorDis* variables were simulated. Interestingly, we observed very similar classification error rates between the full and null design for the *CorDis* and *NonCorDis* variables (**Figure 3B**). The error rate was lower when the datasets contained *NonCorDis* instead of *CorDis* variables. As expected, when DIABLO was applied to datasets including only *CorNonDis* and irrelevant variables we observed a random prediction of the model (error rate ~ 50%).

In summary, the design matrix is an important parameter in the model, as it affected the types of variables selected by DIABLO. The purpose of the full design is to select highly correlated variables regardless of their discriminatory power whereas the null design selects discriminatory variables regardless of the correlation structure between variables. The error rate was similar in both designs, although the presence of highly correlated and discriminatory variables led to a slight increase in the error rate.

### Comparisons with existing single-omics classifiers and multi-omics integrative classifiers using human breast cancer data

The subtypes of breast cancer (Luminal A, Luminal B, Her2-enriched and Basal-like) [28] have been the most replicated subtypes of human breast cancer [29] and a risk model based on the expression levels of 50 genes (PAM50) has been shown to successfully predict breast cancer subtypes [30]. This biomarker panel has been developed using the NanoString platform, called the Prosigna™ test and has been approved by the Food and Drug Administration (FDA) [31]. We integrated human breast cancer datasets (mRNA without PAM50 genes, miRNA, methylation and proteins) from The Cancer Genome Atlas (TCGA) [32] in order to achieve a systems characterization of PAM50 breast cancer subtypes with other types of omics datasets. We compared the classification performance of single-omics methods, integrative concatenation and ensemble-based approaches and DIABLO. The TCGA study was divided into a training and test set, where the training set was determined so as to include all samples in the proteomics dataset (**see Methods**). Most of the training samples were obtained in 2010, whereas the test samples were mainly obtained from 2011 to 2013 (**Table S1, Additional file 2**). The training set consisted of 379 subjects (76-Basal, 38-Her2, 188-LumA, and 77-LumB) with four omics datasets, whereas the test set consisted of 610 subjects (102-Basal, 40-Her2, 346-LumA, and 122-LumB) with only three omics datasets (mRNA, miRNA, and methylation). The omics datasets consisted of 2,000 mRNAs, 184 miRNAs, 2,000 CpG probes, and 142 proteins (**see Methods for preprocessing of datasets**). We checked that the range of expression values within each omics dataset was consistent between the training and test set (**Figure S1, Additional file 3**).

For single-omics analyses, the penalized regularization method [15] Elastic net (Enet), random forest (RF) [33] and support vector machine (SVM) [34] were used to identify biomarker panels for each omics dataset (mRNA, miRNA, CpGs and proteins), respectively. In Enet the sparsity parameter (lasso penalty) was set to 1 to determine the smallest possible biomarker panel, whereas RF and SVM do not perform variable selection and retained all variables (**Figure 4A**). Multi-omics biomarker panels (with equal number of variables from each omics dataset) were constructed using DIABLO such that the total number of variables was similar to the single-omics biomarker panels. The classification performance of single-omics biomarker panels was compared with DIABLO by using a 50x5-fold cross validation in the training set (**Figure 4B**). Enet gave the best performance for the mRNA panel (BER = 12.8±0.9%), however, DIABLO out-performed all single-omics methods for the other types of omics panels; miRNA, CpGs, and proteins panels, with respective training BERs of 14.2±1.9%, 14.1±1.3% and 13.0±1.8%.

For the integrative methods we applied Enet, SVM and RF in the concatenation or ensemble frameworks. For similar size panels, and on the training set, Concatenation-Enet and Ensemble-Enet led to the lowest BERs (11.4±1.1% and 11.9±1.1% respectively) compared to the DIABLO panels (DIABLO9 and DIABLO11, BER = 16.4±2.1% and 13.5±1.7% respectively, **Figure 4B**). SVM and RF performed better than DIABLO using the Concatenation framework, but worse when used with the Ensemble framework. The main limitation of the Concatenation method is that all omics datasets must also be available in the test set, which in this study is missing the proteomics data. In addition, we observed that the Concatenation-based panel was heavily biased towards variables from the most discriminatory mRNA dataset (**Figure 4C**). As such, the Concatenation-Enet multi-omics signature consisted of 62% mRNAs and 4% miRNAs, 29% CpGs and 5% proteins. Figure 4D shows a large number of unique non-overlapping set of features between the methods. Most importantly, when examining the correlation between variables identified by each method as displayed in the circos plot (**see Methods**) we observed substantially fewer inter and intra-associations in the Concatenation and Ensemble-based approaches compared to DIABLO (**Figure 4E**). Therefore, the multi-omics biomarker panel selected by DIABLO was not only predictive of breast cancer subtypes, but also included highly correlated molecular features spanning different biological layers. Interestingly, although DIABLO did not substantially out-perform existing methods, it also did not under-perform. For example, out of all panels in **Figure 4**, the best performing panel was RF including all 2000 mRNA transcripts with a training and test performance of 13.0±1.1% and 12.0%. The second best performing panel was the DIABLO7 panel with a training and test BER of 13.0±1.8% and 12.2%, based on 60 features (15 from each omics dataset). Therefore, we conclude that DIABLO performs competitively with current methods with an enhanced focus on selecting discriminatory and correlated multi-omics variables.

### Multi-omics biomarker panel predicts the PAM50 breast cancer subtypes

We demonstrate our mixDIABLO pipeline presented in Figure 2 to identify a multi-omics biomarker signature predictive of the PAM50 human breast cancer subtypes and determine its biological significance (**Figure 5A**). The path diagram of **Figure 5A**, was determined by using a correlation cut-off 0.8 for between dataset pair-wise correlations (**Figure S2, Additional file 4**) The panel size of the Enet classifier, which is dependent on the elastic net penalty (Lasso penalty set to 1), remained quite large (hundreds of variables). DIABLO on the other hand can fit a very sparse model that contains only a few variables in each dataset. **Figure 5B** shows the tuning step to set the optimal number of variables in each dataset to be selected with a minimum BER (**see Methods**), resulting in a multi-omics panel of 9 variables selected from each dataset (BER = 13.4±1.5%). **Figure 5C** shows the sample clustering of the subjects in the training cohort, with the Basal group clearly separated from the rest of the subjects and a significant overlap between the Luminal groups. Using a correlation cut-off of 0.7, DIABLO identified 44 pair-wise associations between miRNA and other omics variables (mRNA, CpGs and proteins), amongst which 34 (77%) were negative correlations (**Figure 5D**). Similar to **Figure 5C**, the heatmap in **Figure 5E** shows tight clustering of Basal and Her2 samples, and intermixing of the Luminal A and B samples. We then performed a gene-set enrichment analysis for each set of 9 variables separately using curated gene sets and oncogenic signatures (C2 and C6 collections) (**see Methods**). The top 25 ranked pathways from the C2 and C6 collections were mainly enriched with the mRNA and proteins (**Figure 5F**). The top ranked pathways consisted of several breast cancer-related pathways such as *breast cancer ESR1 up, breast cancer basal down, breast cancer basal vs. luminal, breast cancer luminal vs. basal up,* and *breast cancer relapse in bone up.* Given the highly correlated nature of the variables identified through DIABLO, our analyses and identified molecular signature may suggest novel biologically plausible roles of the selected CpGs and miRNAs in breast cancer.

### A holistic view of molecular processes in blood during allergen inhalation change

Next we showed the utility of DIABLO to a repeated measures study, incorporating cell-types and pathway-based modules using molecular data from 14 asthmatic individuals undergoing allergen inhalation challenge [35,36] (**Figure 6A**). Blood samples were collected prior to (pre) and 2 hours after (post) allergen challenge and profiled for cell-type frequencies (9 cell-types), leukocyte gene transcript expression and plasma metabolite abundances. A module based approach (also known as eigengene summarization [11]) was used to transform both the gene expression and metabolite datasets into pathway datasets. Consequently, each variable in those two datasets now represented the pathway activity expression level for each sample instead of direct gene/metabolite expression. The mRNA dataset was transformed into a Kyoto Encyclopedia of Genes and Genomes (KEGG) dataset whereas the metabolite dataset was transformed into a metabolite pathway dataset (**see Methods**). We observed that metabolite modules were highly correlated with cell-counts and gene modules (Pearson correlation > 0.8) (**Figure S3, Additional file 5**). To account for the repeated measures experimental design, a variance decomposition technique was applied to all datasets (**see Methods**). **Figure 6B** shows our chosen DIABLO design to identify correlated sets of cells, gene and metabolite modules that were altered after allergen inhalation challenge. The DIABLO model identified 2 cell-types, 10 gene and metabolite modules across two components. We compared the performance of DIABLO with variance decomposition for the repeated experimental design (AUC=99%, leave-one-out cross-validation) or with no variable decomposition (AUC=85%), suggesting a high individual variability that is greater than the pre-post challenge differences (**Figure 6C**). **Figure 6D** shows that the DIABLO method is able to maximize the correlation between components from each omics dataset or module as specified in the design matrix, and the first component shows a clear separation between pre- and post-challenge samples. Interestingly, many asthma-related cell-types and molecular pathways were identified by DIABLO, as represented in **Figure 6E** The selected cell-types, eosinophils and basophils, are considered hallmarks of allergic asthma [37]. The selected gene-module pathways included *Asthma* KEGG pathway (**Figure S4, Additional file 6**) even though individual gene members were not significantly altered postchallenge (**Figure S5, Additional file 7**). DIABLO also selected the *Valine, leucine and isoleucine (branched-chain amino acids, BCAAs) biosynthesis* gene module and the *Valine, leucine and isoleucine metabolism* metabolite module, where the activity of both significantly increased post-challenge (**Figure S6, Additional file 8**). These findings depict common molecular processes that span different biological layers, and suggest the ability of our DIABLO method to identify features that suggest a mechanistic link with response to allergen challenge.

## Discussion

Classification algorithms *a priori* do not focus on incorporating biological information and therefore, any derived discriminatory markers (“biomarkers”) may not mechanistically link the underlying biology to the phenotype. To address this concern, we developed DIABLO, an integrative classification method which not only identifies subsets of discriminatory molecules from each omics dataset, but also aims to more plausibly model the correlation structure between them, assuming that correlation implies similar functional relationships [26].

DIABLO promotes a compromise between a performance-driven and biologically-driven multi-omics biomarker panel. For example, for the single-omics analyses, the mRNA dataset was found to be the most discriminatory and led to superior performance using Enet, SVM and RF compared to the multi-omics DIABLO panel which included equal numbers of each type of omics variables. The high performance of the mRNA dataset may be due to the fact that the PAM50 gene classifier was developed using gene expression data (even though the PAM50 genes were removed) and thus, genes correlated with the PAM50 genes may be driving the classification signal. Therefore, although DIABLO provided an enhanced set of correlated omics variables, its classification performance was hindered by variables with less discriminatory power. On the other hand, DIABLO out-performed the other single-omics panels (miRNA, CpGs and proteins), which may be explained by the integrative focus of DIABLO, where stronger discriminatory omics variables may compensate for weaker ones. While the existing integrative schemes using Enet out-performed DIABLO, they presented strong limitations, such as an over representation of selected mRNAs (Concatenation-Enet) and panels with a large number of features (Ensemble-Enet). Furthermore, the Concatenation method could not be objectively assessed in the test set that was missing proteomics data, while the Ensemble methods could only be assessed based on three omics datasets. The DIABLO classifier, however, was built by integrating all four datasets and tested using three datasets. The integration task in DIABLO models the information contained in the proteomics dataset, even though it was missing in the test set. Therefore, DIABLO gives a competitive performance compared to existing integrative methods with added user-friendly benefits of: 1) user specified number of features, 2) strong correlation between the variables identified and 3) ability to make predictions on new data even when some new datasets are missing. Lastly, although we have shown comparable performance of DIABLO with existing methods for this particular human breast cancer data, conclusions may vary with other datasets from other biological studies.

A challenge that limits the clinical translatability of multi-omics biomarker panels is the increased number of features that need to be assessed in combination. DIABLO is able to build a sparse classifier (36 variables, 9 from each omics space) with a competitive performance compared to existing methods which contained hundreds of omics-specific variables. Our analyses revealed a multi-omics characterization of the PAM50 breast cancer subtypes by combining selected mRNA, miRNA, CpGs, and proteins as validated by the gene-set enrichment analyses. Although the top ranked pathways (many related to breast cancer) were enriched with mRNA and proteins, this suggests that the identified miRNA and CpGs may also play a role in determining the PAM50 subtypes. Therefore, DIABLO helps generate novel hypotheses that arise from evidence through multiple biological layers of information and may lead to more robust signatures compared to those generated via single-omics analyses.

We also demonstrate the utility of DIABLO to study molecular processes across different omics layers by combining DIABLO with a modular approach. Often biologists are interested in identifying significant molecular pathways that are dysregulated between disease groups instead of identifying biomarker signatures that contain a limited number of features which restricts gene-set enrichment analyses. DIABLO identified known cell-types and pathwaysin the context of asthma, such as eosinophils/basophils and the *Asthma KEGG pathway,* as well as, novel pathways such as the *valine, leucine and isoleucine (branched-chain amino acids, BCAAs) pathway.* The mechanistic relationship between BCAAs and allergic responses is a novel finding and their role as potential predictors of allergic inflammation needs to be further investigated. This proof-of-concept study demonstrates the ability of DIABLO to uncover common relationships between different biological layers, resulting in novel hypotheses to be validated in the laboratory.

Despite the multi-purpose nature of DIABLO, we acknowledge some limitations of the method. The linearity assumption between the selected omics variables and the response may not be valid in some biological research areas, and the further development of kernel-based methods to model non-linear relationships between omics levels and the response may overcome this problem. The other limitation that is also encountered with other machine learning algorithms is the tuning of the parameters. The optimal number of variables to select from each dataset, can be computationally intensive, as we have used repeated cross-validation to ensure unbiased classification error rate evaluation. A grid approach was deemed reasonable and provided very good performance results, but may still be suboptimal as we had to restrict the grid space. Finally, and similar to other methods, DIABLO suffers from potential technical artifacts of the data, such as batch effects, presence of confounding variables and differences in noise levels with respect to the different technologies used for each omics dataset. Therefore, we recommend exploratory preliminary analyses for each single-omics dataset to address technical factors that may affect downstream analyses using DIABLO.

Nowadays, system biologists, computational biologists and bioinformaticians dealing with multi-omics studies face the challenge of ‘missing’ the biological question, thus relying on data-driven statistical approaches in the absence of specific hypotheses to be tested. Our study shows that a precise biological question is crucial to perform integrative analyses, as it will aid the choice of the design in the DIABLO model, whether to use of variance decomposition and whether to use pathway-based modules. Our proposed pipeline has strong potential to identify multi-omics signatures that discriminate multiple phenotypic groups and can be interpreted through the use of various graphical outputs. Our ultimate goal is that those identified molecular signatures will help in generating novel biological hypotheses to be tested and validated back in the laboratory, thus eventually filling the gap of the missing biological question.

## Conclusions

We introduced DIABLO, a dimension reduction multivariate method to integrate several omics datasets measured on the same set of samples, while accounting for the heterogeneity between omics platforms. The aim of DIABLO is to classify samples according to known phenotypic groups, to identify a small but robust multi-omics molecular signature that can predict phenotypic groups in new test samples.

To our knowledge, DIABLO is the only integrative classification method that models the correlation structure between omics data spaces, thus improving biological insights by linking biology to phenotype. We propose a flexible framework for different data-types that can be applied to any type of datasets (not only omics), various study designs and pathway-based module analyses. The mixDIABLO framework will allow researchers to explore datasets, build multi-omic panels, assess the performance of these integrative statistical models, create visualizations to assist in the interpretation of these models in the biological context, and, ultimately, generate novel hypotheses to be validated in the laboratory.

## Methods

### Code availability and software tool requirements

The DIABLO framework is implemented in the mixOmics R package [20,21]. mixOmics currently includes 15 multivariate methodologies, for single-omics analysis and integration of two datasets. All scripts/tutorials can be found on the webpage (http://www.mixomics.org/mixDIABLO). All analyses were performed using the R statistical computing program [43] (version 3.3.1) and the mixOmics package (version 6.0.0).

## Statistical methods and analysis

### General multivariate framework to integrate multiple datasets measured on the same samples

DIABLO extends sparse generalized canonical correlation analysis (sGCCA) [24] to a classification framework. sGCCA is a multivariate dimension reduction technique that uses singular value decomposition and selects co-expressed (correlated) variables from several omics datasets in a computationally and statistically efficient manner. sGCCA maximizes the covariance between linear combinations of variables (latent component scores) and projects the data into the smaller dimensional subspace spanned by the components. The selection of the correlated molecules across omics levels is performed internally in sGCCA with l_{1}–penalization on the variable coefficient vector defining the linear combinations. *Note that since all latent components are scaled in the algorithm, sGCCA maximizes the correlation between components. However, we will retain the term ‘covariance’ instead of ‘correlation’ throughout this section to present the general sGCCA framework*.

Denote *K* normalized, centered and scaled datasets *X _{1} (n* x

*p*), …,

_{1}*X*(

_{K}*n*x

*p*), measuring the expression levels of

_{K}*p*, …,

_{1}, p_{2}*p*omics variables on the same

_{K}*n*samples,

*k = 1, …, K*, sGCCA solves the optimization function: where

*c*indicates whether to maximize the covariance between the datasets

_{jk}*X*and

_{k}*X*according to the design matrix, with

_{j}*c*= 0 (no relationship modelled between the datasets) or

_{jk}*c*= 1 otherwise,

_{jk}

**a**^{k}is the variable coefficient vector for each dataset

*X*,

_{k}*λ*

_{k}is a non-negative parameter that controls the amount of shrinkage and thus the number of non-zero coefficients in

**a**

^{k}. Similar to Lasso [44] or 1

_{1}–penalized multivariate model for one single-omics dataset [23], the 1

_{1}penalization improves the interpretability of the component scores

*X*that is now only defined on a subset of omics variables with a non-zero coefficient from the omics dataset

_{k}**a**^{k}*X*. The result is the identification of variables that are highly correlated between and within omics datasets.

_{k}Equation (1) describes the sGCCA model for the first dimension. Once the first set of coefficient vectors and associated component scores are obtained, residual matrices are calculated during the ‘deflation’ step for the second dimension, such that , where is the original centered and scaled data matrix. The subsequent set of components scores and coefficient vectors are then obtained by substituting *X _{k}* by in (1). This process is repeated until a sufficient number of dimensions (or set of components) is achieved.

The underlying assumption of the sGCCA model is that the major source of common biological variation can be extracted via the component scores *X _{k} a^{k}*, while any unwanted variation due to heterogeneity across the datasets

*X*does not impact the statistical model. The optimization problem (1) is solved using a monotonically convergent algorithm [24].

_{K}### DIABLO for supervised classification analysis and prediction

To extend sGCCA for a classification framework, we substitute one omics dataset *X _{k}* in (1) with a dummy indicator matrix

*Y*of size (

*n*×

*G*), where

*G*is the number of phenotype groups that indicate the class membership of each sample. In addition, and for easier use of the method, the l

_{1}penalty parameter

*λ*

_{k}was replaced by the number of variables to select in each dataset and each component, as there is a direct correspondence between both parameters.

The class membership of a new sample *i* which is measured across the different types of omics datasets is predicted using the fitted sGCCA model with the estimated variable coefficients vectors *â*^{k} to estimate the predicted scores . To each dataset *k* corresponds a predicted continuous score *t ^{k,i}* which assigns a predicted class using a distance such as the Maximum, Centroids or Mahalanobis [20], as described in Lê Cao

*et al*. [23] and in the mixOmics package. Each component

*t*associated to each dataset

^{k,i}*k*predicts the class membership of the new sample

*i*, and the consensus class membership across all

*K*datasets is determined using either a majority vote or by averaging all

*t*across all

^{k,i}*K*datasets before using the prediction distance of choice (average prediction scheme). In case of ties in the majority vote scheme, ‘NA’ is allocated as a prediction. Because the class prediction relies on individual vote from each omics set, DIABLO is highly flexible and thus allows for some missing datasets

*X*during the prediction step. In our two studies we used the centroid distance for the majority vote scheme (breast cancer study) and the maximum distance for the average vote scheme (asthma study) during performance evaluation and test set prediction.

_{k}### Design matrix in DIABLO

The design matrix *C* is a KxK matrix of zeros and ones which specifies whether the covariance between two datasets should be maximized in the DIABLO model, as presented in equation (1). In our simulation study we evaluated two different scenarios: a null design is when no datasets are connected, and a full design is when all datasets are connected:

Note that internal to the DIABLO method, the design always links each dataset to the outcome *Y*. For the two case studies (breast cancer and asthma) the design matrix was computed based on our proposed method (see below * Parameters tuning*).

### Parameters tuning

The first parameter to tune in the design matrix C, which can be determined using either prior biological knowledge, or a data-driven approach. The latter approach uses PLS method implemented in mixOmics that models pair-wise associations between omics datasets. If the correlation between the first component of each omics dataset is above a given threshold (e.g. 0.8) then a connection between those datasets is included in the DIABLO design.

The second parameter to tune is the total number of components. In several analyses we found that G – 1 components were sufficient to extract sufficient information to discriminate all phenotype groups [23], but this can be assessed by evaluating the model performance across all specified components (described below) as well as using graphical outputs such as sample plots to visualize the discriminatory ability of each component.

Finally, the third set of parameters to tune is the number of variables to select per dataset and per component. Such tuning can rapidly become cumbersome, as there might be numerous combinations of selection sizes to evaluate across all *K* datasets. For the breast cancer study, we used 5-fold cross-validation repeated 50 times to evaluate the performance of the model over a grid of different possible values of variables to select. The performance of the model for a given set of parameters (including number of component and number of variables to select) was based on the balanced classification error rate using majority vote or average prediction schemes with centroids distance. In our experience, the number of variables to select in each dataset provides less of an improvement on the error rate compared to tuning the number of components. Therefore, even a grid composed of a small number of variables (<50 with steps of 5 or 10) may suffice as it does not substantially change the classification performance. Also, the variable selection size can also be guided according to the downstream biological interpretation to be performed. For example, a gene-set enrichment analysis may require a larger signature than a literature-search interpretation.

### Visualization outputs with DIABLO

Several types of graphical outputs were made available in mixOmics to improve the interpretation of the DIABLO results.

#### Sample plots

Pairs of components associated to each dataset are used to represent the samples projected in the space spanned by those components in each individual omics dataset. The sample plot enables the user to visualize the ability of the DIABLO model to extract common information at the sample level for each dataset, as well as to visualize the discriminatory power of each data type to separate the phenotypic groups. The scatterplot matrix (**Figure 5C, Figure 6D**) represents correlation between components for the same dimension but across all omics datasets to verify that the model maximizes the correlation as indicated in the design matrix. Since DIABLO is a supervised method, separation of subjects of different phenotypic groups can be seen using this type of plot.

#### Variable plots

To visualize selected variables, we proposed circos plot (**Figure 5D**) to represent correlations between and within variables from each dataset at the variable level. The association between variables is computed using a similarity score that is analogous to a Pearson correlation coefficient, as previously described in [45]. For each omics dataset, DIABLO produces a variable coefficient matrix of size (*p _{k} x H*), where

*H*is the total number of components in the model. The product of any two matrices approximates the association score between variables of the two omics datasets. The association between variables is displayed as a color coded link inside the plot to represent a positive or negative correlation above a user-specified threshold. The selected variables are represented on the side of the circos plot, with side colors indicating each omics type, optional line plots represent the expression levels in each phenotypic group. When we compared several approaches that do not output latent components (e.g. Enet) we calculated instead a Pearson correlation matrix, where each link represents a Pearson correlation coefficient.

#### Clustered Image Map (CIM)

A clustered image map [45] based on the Euclidean distance and the complete linkage displays an unsupervised clustering between the selected variables (centered and scaled) and the samples. Color bars represent the sample phenotypic groups (columns) and the type of omics (rows) variables.

### Gene-set enrichment analyses

Significance of enrichment was determined using a hypergeometric test of the overlap between the selected features (mapped to official HUGO gene symbols or official miRNA symbols) and the various gene sets contained in the collections. In order to carry out the comparison, each feature set was mapped back to official HUGO gene symbols. This was done as follows across the respective data types: 1) mRNA – gene symbols used as-is. 2) DNA methylation – features were mapped to coding gene symbol manually from downloaded annotation file. 3) Protein – features mapped to coding gene symbol manually from downloaded annotation file. 4) miRNA – a previously described strategy was used [46]. Briefly, all gene sets were mapped back to a set of miRNAs associated with them, using a database of computationally predicted target genes for each miRNA (e.g. if a gene set is composed of genes A, B and C, genes A and B are targets of miRNA X, while gene C is a target of miRNA Y and Z, the new gene set will be made up of miRNA X, Y and Z. This effectively deals with deduplication issues.) Enrichment of the miRNA features was then assessed against these transformed gene sets.

The following collections were used as gene-sets for the enrichment analysis [47]: 1) **C2** is a collection of curated gene sets such as Pathway Interaction DB (PID), Biocarta (BIOCARTA), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome (REACTOME). 2) **C6** is a collection of oncogenic gene sets (signatures of cellular pathways which are often dysregulated in cancer).

### Input data in DIABLO

While DIABLO does not assume particular data distributions, all datasets should be normalized appropriately according to each omics platform and preprocessed if necessary (see normalization steps described below for each case study). Samples should be represented in rows in the data matrices and match the same sample across omics datasets. The phenotype outcome Y is a factor indicating the class membership of each sample. The R function, in mixOmics will internally center and scale each variable as is conventionally performed in PLS-based models and will create the dummy matrix outcome from Y. A multilevel variance decomposition option is available for repeated measures study designs (see below).

## Data description and preprocessing

### Breast cancer multi-omics study

#### Datasets accession

The level 3 TCGA data (version 2015_11_01) were retrieved from firebrowse.org hosted by the Broad Institute. The clinical data file (Merge_Clinical) was downloaded from the Primary tab of the BRCA Clinical Archives. The mRNA RSEM normalized dataset (illuminahiseq_rnaseqv2-RSEM_genes_normalized) was downloaded from the Primary tab of the BRCA mRNASeq Archives. The miRNA datasets (illuminahiseq_mirnaseqmiR_gene_expression and illuminaga_mirnaseq-miR_gene_expression) were downloaded from the Primary tab of the BRCA miRSeq Archives. The reverse phase protein array dataset (mda_rppa_core-protein_normalization) was downloaded from the Primary tab of the BRCA RPPA Archives. The beta values for the methylation datasets (humanmethylation27-within_bioassay_data_set_function and humanmethylation450-within_bioassay_data_set_function MD5) were downloaded from the Primary tab of the BRCA Methylation Archives.

#### Data processing

Clinical data were present for 1098 subjects for 3,703 variables. Un-annotated (29) transcripts were removed from the mRNA dataset (20,502 genes x 1212 samples). Two transcripts corresponded to *SLC35E2*, therefore one of the transcripts was re-labelled *SLC35E2.rep*. The miRNA datasets (1,046 miRNA x 1190 samples) was derived using two different Illumina technologies, the Illumina Genome Analyzer (341 samples) and the Illumina HiSeq (849 samples). The read counts instead of the reads_per_million_miRNA_mapped were used. The proteomics dataset obtained using a reverse phase protein array consisted of 142 proteins for 410 samples. The methylation data was derived from two different platform, the Illumina Methylation 27 (27,578 CpG probes x 343 subjects) and the Illumina 450K (485,577 CpG probes x 885 subjects). There were 25,978 CpG probes in common between the platforms. The PAM50 labels for 1182 samples were obtained from the TCGA staff.

Since some samples were derived from the same individuals, all datasets were restricted to samples coming from the primary solid tumor (sample type code 01) and to the first vial (vial code A), resulting in the following datasets for mRNA (20,502 genes x 1080 subjects), miRNA (1,046 miRNAs x 1066 subjects), proteins (142 proteins x 403 subjects), CpGs (25,978 CpG probes x 1066 subjects) and 1049 subjects with PAM50 subtypes present.

#### Training and test cohorts

There were 387 subjects (Basal: 76, Her2: 38, LumA: 188, LumB: 77 and Normal: 8) common between the clinical, mRNA, miRNA, proteomics, methylation and PAM50 label datasets. The biomarker analysis was performed using 4 molecular datasets, mRNA, miRNA, CpGs and proteins. Since the proteomics dataset was the limiting dataset, the test datasets only consisted of the mRNA, miRNA and CpG data matrices. The test cohort consisted of 638 subjects; Basal: 102, Her2: 40, LumA: 346, LumB: 122 and Normal: 28. Given the limited number of normal subjects, they were not used in the biomarker analysis.

#### Normalization and pre-filtering

The count data for the mRNA dataset was normalized to log2-counts per million (logCPM), similar to limma voom [48]:

After library size normalization, genes with counts less than 0 were removed. In addition, the 3000 most variable genes based on the median absolute deviation (MAD) were retained for downstream analysis. The PAM50 genes were also removed from the mRNA dataset prior to analyses. Similarly, the miRNA count data was normalized to logCPM and miRNA transcripts with counts less than 0 were also removed. The CpG probes containing missing data were removed from the methylation data and the 2000 most variable probes based on MAD were retained for downstream analysis.

## Asthma multi-omics study

### Datasets accession

Paired blood samples were obtained from 14 asthmatic individuals undergoing allergen inhalation challenge as previously described[49]. Cell counts were obtained from a hematolyzer (percentage of Neutrophils, Lymphocytes, Monocytes, Eosinophils and Basophils) and DNA methylation analysis (percentage of T regulatory cells, T cells, B cells and Th17 cells). Gene expression profiling was performed using Affymetrix Human Gene 1.0 ST (GSE40240). Metabolite profiling was performed by Metabolon Inc. (Durham, North Carolina, USA). All asthma data have been published as part of previous studies[35,36].

### Normalization

Microarray data was normalized using Robust MultiArray Average (RMA), consisting of background correction, quantile normalization and probe summarization using median polish. Preprocessing of mass spectrometry data including data extraction, peak-identification and data preprocessing for quality control and compound identification was performed by Metabolon Inc. (Durham, North Carolina, USA).

### Modular analysis

Eigengene summarization is a common approach to decompose a n by p dataset (where n is the number of samples and p is the number of variables in a module), to a component (linear combination of all p variables) that represents the summarized expression of genes in the module [11]. For the asthma study, 15,683 genes were reduced to 229 KEGG pathways and 292 metabolites were reduced to 60 metabolic pathways using eigengene summarization.

### Multilevel transformation for repeated measures study designs

For multivariate analyses, A multilevel approach separates the within subject variation matrix (*X _{w}*) and the between subject variation (

*X*) for a given dataset (

_{b}*X*) [50], ie.

*X = X*. In the case of a two-repeated measured problem (e.g. pre vs post challenge), the within subject variation matrix is similar to calculating the net difference for each individual between the data obtained for pre and post challenge. For each omics dataset, the within-subject variation matrix was extracted prior to applying DIABLO. In the asthma study, the multilevel approach (called variance decomposition step) was applied to the cell-type, gene and metabolite module datasets.

_{w}+ X_{b}## Declarations

### Acknowledgements

The authors would like to thank Dr. Kevin Chang (University of Auckland) for some preliminary exploratory analyses of the breast cancer dataset. We would also like to thank Mr. Chao Liu (University of Queensland) for obtaining the PAM50 phenotypic information for the TCGA datasets.

### Competing interests

The authors declare no competing interests.

### Funding

AS is the recipient of the Canadian Institutes of Health Research Doctoral Award – Frederick Banting and Charles Best Canada Graduate Scholarship and the Michael Smith Foreign Study Supplement award. KALC was supported in part by the National Health and Medical Research Council (NHMRC) Career Development fellowship (APP1087415).

### Authors’ contributions

AS performed the data pre-processing, the statistical analyses and developed the DIABLO method. BG implemented the R scripts for DIABLO and graphical outputs, CPS performed the gene enrichment analyses, MV implemented the circos plots, FR and BG implemented the R scripts in mixOmics along with the S3 functions, SJT supervised AS and participated in the design of the study. KALC supervised AS, BG, MV and FR, participated in the development of the DIABLO method and provided statistical advice. AS and KALC edited the manuscript, with editorial input from SJT and CPS.

### Additional files

**Additional file 1:** Simulation Study. (PDF 382KB)

**Additional file 2: Table S1.** Year of collection for TCGA breast cancer samples. (PDF 34KB)

**Additional file 3: Figure S1.** Overlap of expression between train and test sets. (PDF 54KB)

**Additional file 4: Figure S2.** Design matrix used for Figure 5. (PDF 71KB)

**Additional file 5: Figure S3.** Design matrix used for Figure 6. (PDF 64KB)

**Additional file 6: Figure S4.** Asthma KEGG pathway. (PDF 100 KB)

**Additional file 7: Figure S5.** Volcano plot of genes in the Asthma KEGG pathway. (PDF 65KB)

**Additional file 8: Figure S6.** Valine, leucine and isoleucine gene and metabolite pathway. (PDF 98KB)

## Additional Files

### Additional File 1

#### Simulation study

The purpose of this simulation was to study the effects of the design matrix on the variables selected by DIABLO and their corresponding error rates. Two designs were tested: the null design where no datasets were connected and the full design where all datasets were connected. Three datasets (X, Y, Z) for two phenotypic groups (Y) were generated of equal sizes 100 observations by 150 variables. 100 out of the 150 variables were noisy irrelevant variables and we assessed the ability of DIABLO to identify the true 50 discriminative and/or correlated variables and evaluated the classification error rate for different simulation scenarios where both correlation and noise levels were varied.

To that end we simulated four types of variables, namely non-discriminatory but correlated variables, discriminatory (i.e. explaining the phenotype of interest) but correlated variables, discriminatory non-correlated variables and irrelevant (noisy) variables, as described below:

**Non-discriminatory correlated variables**We generated covariance matrices Σ of size

*p*x^{1}*p*with different correlation strength λ [∈(0,1)], where^{1}*p*= 50:^{1}We denote

**X**a_{nonDisCor}*p*-dimensional random vector, generated from the multivariate normal distribution^{1}**X**~ N(μ, Σ) where μ = (0, …, 0) as those variables are not discriminatory. We then randomly allocate_{nonDisCor}**X**into X_{nonDisCor}^{1}_{nonDisCor}, Y^{1}_{nonDisCor}, and Z^{1}_{nonDiscor}, each of size 100 observations and*p*variables for group 1, and similarly for group 2 with^{1}**X**into X_{nonDisCor2}^{2}_{nonDisCor}, Y^{2}_{nonDisCor}and Z^{2}_{nonDisCor}. Note since the mean of the two groups are equivalent these variables are not discriminatory but are correlated.**Discriminatory non-correlated variables**Generate a kxk covariance matrix: , where k = 50

Using the kxk covariance matrix generate

**x**, and_{DisNonCor1}**x**, k-dimensional random vectors; [X_{DisNonCor2}_{1}, X_{2}, …, X_{k}] with a multivariate normal distribution:**x**~ N(μ_{DisNonCor1}_{1}, Σ) &**x**~ N(μ_{DisNonCor2}_{2}, Σ) where , where c is the fold-change ranging from 0, 1, 2, 3Randomly divide

**x**into X_{DisNonCor1}_{dis1}, Y_{dis1}, and Z_{dis1}of size 100 observations, and 50 variables for group 1 and**x**into X_{DisNonCor2}_{dis2}, Y_{dis2}, and Z_{dis2}of size 100 observations, and 50 variables for group 2.Generate irrelevant variables

Generate a kxk covariance matrix: , where k = 100

Using the kxk covariance matrix generate

**x**be a k-dimensional random vector;_{irv}**x**= [X_{irv}_{1}, X_{2}, …, X_{k}] with a multivariate normal distribution:Randomly divide

**x**into X_{irv}_{irv1}, Y_{irv1}, and Z_{irv1}of size 100 observations, and 100 variables for group 1 and X_{irv2}, Y_{irv2}, and Z_{irv2}of size 100 observations, and 100 variables for group 2.Generate noise

Range variance from low noise to high noise, change diagonal values to 0, to 1.5:

Using the pxp covariance matrix generate

**x**be a k-dimensional random vector;_{noise}**x**= [X_{noise}_{1}, X_{2}, …, X_{k}] with a multivariate normal distribution:Divide

**x**into X_{noise}_{noise1}, Y_{noise1}, and Z_{noise1}of size 100 observations (group 1 and group2), and 150 variables (given type of variable (50) + 100 irrelevant variables)).

#### DIABLO analysis

We then ran DIABLO integrating the three simulated data sets X, Y, Z with equal numbers of observations (n=100) and including Y as the categorical outcome(class vector indicating that 100 samples belong to group1 and 100 samples to group2). Those data sets include 150 variables each, among which 100 variables were deemed irrelevant variables (not correlated between datasets, and not discriminatory), and the remaining 50 variables were either 1) correlated across all three datasets but not discriminatory between groups (CorNonDis), or 2) correlated and discriminatory (CorDis) or 3) not correlated but discriminatory (NonCorDis).

Separate scenarios of three datasets were constructed by varying the fold-change between the two groups (from 0 to 1.5) and the levels of noise (0-no noise, 1-low noise, 5-medium noise and 15-high noise). In addition, the strength of correlation varied for the parameter λ used to generate the covariance matrices. The figure below depicts three covariance matrices for low (λ=0.75), medium (λ=0.91) and highly (λ=0.98) correlated variables.

The DIABLO analysis included 1 component and 50 variables to be selected and we compared the performance in terms of classification error rate and identification of the true relevant variables when using different matrix designs, namely a full design or a null design.

The strength of the correlation between variables was varied to determine its effects on the number of correctly identified variables by the DIABLO classifier with respect to the design matrix specification.

### Additional File 2

The majority of samples in the training set were collected in 2010, whereas the majority of samples in the test set were collected in 2011.

### Additional File 3

The range of expression for each omic dataset was similar between the training and test set. Note, there was no protein expression data present for subjects in the test set.

### Additional File 4

A correlation cut-off of 0.8 was used to determine the connectivity for the design matrix.

### Additional File 5

### Additional File 6

Asthma KEGG pathway.

Red depicts up-regulated genes whereas green depicts down-regulate genes after allergen inhalation challenge.

### Additional File 7

Volcano plot of genes in the Asthma KEGG pathway.

All genes in the asthma pathway (many not represented on the KEGG pathway diagram in Supplementary Figure 6A. The volcano plot shows that apart from HLA-DPB1 no other genes within the Asthma pathway was significant at the nominal p-value cut-off of 0.05. After correcting for multiple testing, HLA-DPB1 corresponding to an Benjamini Hochberg False Discovery Rate (BH-FDR) of 0.46.

### Additional File 8

## List of abbreviations

- DIABLO,
- Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies;
- AUC,
- area under the receiver operating curve;
- PLS,
- Projection to Latent Structure models;
- sPLS-DA,
- sparse PLS-Discriminant Analysis,
- sGCCA,
- sparse generalized canonical correlation analysis;
- PCA,
- Principal Component Analysis;
- BER,
- Balanced Error Rate;
- Enet,
- elastic net;
- RF,
- random forest;
- SVM,
- support vector machine;
- KEGG,
- Kyoto Encyclopedia of Genes and Genomes;