Multiblock Discriminant Analysis for Integrative Genomic Study

Article in BioMed Research International, May 2015. DOI: 10.1155/2015/783592
Research Article

Mingon Kang (1), Dong-Chul Kim (2), Chunyu Liu (3), and Jean Gao (1)

(1) Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
(2) Department of Computer Science, University of Texas-Pan American, Edinburg, TX 78539, USA
(3) Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 66012, USA

Correspondence should be addressed to Jean Gao; gao@uta.edu

Received 9 December 2014; Accepted 21 April 2015

Academic Editor: Klaus Wimmers

Copyright © 2015 Mingon Kang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Human diseases are abnormal medical conditions in which multiple biological components are involved in complicated ways. Nevertheless, most research contributions have been made with a single type of genetic data, such as Single Nucleotide Polymorphism (SNP) or Copy Number Variation (CNV). Epigenetic modifications and transcriptional regulation, as well as genomic variants, have to be considered to fully exploit the knowledge of complex human diseases. We call the collection of such multiple heterogeneous data "multiblock data." In this paper, we propose a novel Multiblock Discriminant Analysis (MultiDA) method that provides a new integrative genomic model for multiblock analysis and an efficient algorithm for discriminant analysis. The integrative genomic model is built by exploiting representative genomic data, including SNP, CNV, DNA methylation, and gene expression. The efficient algorithm for the discriminant analysis identifies discriminative factors of the multiblock data. Discriminant analysis is essential for discovering biomarkers in computational biology. The performance of the proposed MultiDA was assessed by intensive simulation experiments, in which it outperformed the related methods. As a target application, we applied MultiDA to human brain data of psychiatric disorders. The findings and the gene regulatory network derived from the experiment are discussed.

1. Introduction

Human diseases involve complex processes that include interactions among multiple biological layers, such as genetic, epigenetic, and transcriptional regulation. Research based on a single type of biological data produces results that are insufficient to fully exploit the knowledge of complex human diseases. Prior research shows that a study must consider multiple types of biological data comprehensively in order to gain an in-depth understanding of the complex mechanisms of human diseases and to identify disease markers. Recent advances in high-throughput technologies, such as DNA microarray and sequencing technologies, efficiently profile various types of genomic data. The genomic data include Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), DNA methylation (DM), and gene expression (GE). Integrative genomic analysis of these heterogeneous genomic data plays an important role in profiling a global view of a biological system as well as in identifying significant markers of human diseases.

However, most research has focused solely on a single type of genomic data. Genome-Wide Association Studies (GWAS) examine genetic loci that are associated with a trait (e.g., major diseases) using SNP data [1, 2]. GWAS normally compare the SNP arrays of two groups: disease (case) and normal (control) samples. If a genetic variation at a locus differs significantly between the disease samples and the controls, the SNP is considered to be associated with the disease. Expression Quantitative Trait Loci (eQTL) studies, in turn, have been actively conducted to identify genetic loci that regulate gene expression [3]. Combining gene microarray data with GWAS not only enables the capture of gene regulatory interactions but also provides insight into the genetic mechanism that regulates gene expression variations. However, both GWAS and eQTL mapping studies still leave the "missing heritability" problem unresolved [4].

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 783592, 10 pages, http://dx.doi.org/10.1155/2015/783592


In addition to SNP, Copy Number Variation (CNV) and DNA methylation (DM) have also been highlighted as key factors that affect the regulation of gene expression. CNV is a structural alteration of DNA in which specific regions of the genome are deleted or duplicated on chromosomes. Although CNV is frequently observed even in healthy individuals, it is hypothesized that the variants may cause diseases by directly affecting gene dosage and gene expression [5, 6]. Specifically, whole-genome association studies of the relationship between CNV and diseases reported that gene expression levels in CNV regions are strongly related to the deletion or duplication of the regions [6]. Typically, the deletion of either particular regions within a gene or regulatory regions of a gene may result in lower gene expression than normal. DM is an epigenetic modification that occurs by the addition of a methyl group to the cytosine or adenine of DNA. DM inhibits transcription of genes with high levels of 5-methylcytosine in their promoter region or recruits proteins, such as histone deacetylases, that can modify histones [7, 8]. DM consequently changes gene expression levels even when the DNA bases are the same.

Thus, recent research has actively extended GWAS and eQTL mapping studies to integrative association studies with multiple types of genomic data. Most integrative genomic research focuses on identifying genetic, epigenetic, or posttranscriptional factors that control the regulation of gene expression (or microRNA) by considering the complex interactions of SNP, CNV, and DM [9–11]. Specifically, The Cancer Genome Atlas [9] conducted a large-scale multidimensional analysis with SNP, CNV, DM, and GE to provide comprehensive genomic characterizations of brain cancer. In Aure et al.'s work [10], the combined effects of CNV and DM were examined to identify their association with alterations of miRNA expression in breast tumors. Wagner et al. [11] studied the relationship between SNP, DM, and GE via multiple eQTL analyses.

Most of the integration approaches have used step-by-step processes. Ordinarily, such approaches filter candidate markers by statistical techniques in the first step and then find the final markers that satisfy certain criteria in the remaining stages [12–15]. This type of integration method often increases "type II errors" at each step; that is, it fails to find informative markers by incorrectly identifying them as insignificant. Moreover, these methods consider neither the interaction effects of the multiblock data nor the underlying mechanism.

Hence, research has recently started to shift toward systematic models that integrate and analyze the heterogeneous data comprehensively rather than through simple stepwise processes [16–18]. Multiblock methods based on Partial Least Squares (PLS) and Generalized Canonical Correlation Analysis (GCCA) are representative. A sparse variant of PLS was proposed that penalizes both the feature and the sample dimensions to identify "regulatory modules" [16]. Such PLS-based methods, which maximize the covariance between latent variables, often fail to detect significant factors when their intensities are weak. Furthermore, the method does not consider discriminant analysis of the disease. A sparse multiblock analysis method derived from Generalized Canonical Correlation Analysis (SGCCA) was developed to identify multiblock association models while considering the relationship between the different data blocks, such as cis-regulated mutations [17]. That work builds a hybrid model by combining GWAS and eQTL models rather than a multiblock integration model. Another data integration approach utilizes multiple feature selection methods, such as Principal Component Analysis (PCA), PLS, and LASSO [18]. The authors extracted the important factors using dimension reduction and feature selection methods and applied them to Cox survival models. However, the combined effects of the multiblock data were ignored in this approach.

To tackle these limitations, we propose a novel Multiblock Discriminant Analysis (MultiDA) method for integrative genomic study. The proposed method, MultiDA, makes the following main contributions:

(i) A new integrative genomic model for discriminant analysis is introduced by exploiting class information.

(ii) A sophisticated optimal solution is developed to solve the discriminant analysis problem in the integrative genomic model.

First, we built a novel integrative genomic model for discriminant analysis. The class data is considered as one block, and the total squared correlation, including the class block, is maximized. Introducing the class block into the multiblock model enables us to perform discriminant analysis within the integrative genomic model. Second, we propose a sophisticated method to solve the discriminant analysis problem in the new integrative genomic model. Discriminant analysis is essential for identifying biomarkers of human diseases in computational biology, yet it has been overlooked in multiblock analysis. The efficient algorithm for the discriminant analysis and the assessment of its performance are explored in this paper.

2. Methods

2.1. Notation. We suppose that there are $J$ blocks of data, all measured on the same set of $N$ observations. A block consists of a group of features that share common properties or represent one aspect of the sample. The multiblock data is denoted by $\mathbf{X} = \{\mathbf{X}_1, \ldots, \mathbf{X}_J\}$. The $j$th block $\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$ consists of $P_j$ zero-mean column vectors. A matrix $\mathbf{C} = \{c_{jk} \mid c_{jk} \in \{0, 1\},\ 1 \le j, k \le J\}$ is a binary matrix that determines the linkage between the blocks, where $c_{jk} = 1$ if block $j$ and block $k$ are connected and $0$ otherwise. In the proposed integrative genomic model, SNP, CNV, DM, GE, and the class label (case or control) of the samples are considered as the multiblock components. For simplicity, $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ represent SNP, CNV, DM, GE, and class label, respectively.

Throughout this paper, we use $i$ for the index of the sample and $j$, $k$ for the indices of the blocks. The index $(\ell)$ denotes a column vector of a matrix or an element of a vector. For instance, $\mathbf{X}_{i(\ell)}$ and $\mathbf{a}_{i(\ell)}$ represent the $\ell$th column vector of the matrix $\mathbf{X}_i$ and the $\ell$th element of the vector $\mathbf{a}_i$, respectively. Figure 1 illustrates the conceptual overview of the multiblock data and framework.

[Figure 1 image: blocks $\mathbf{X}_1$ to $\mathbf{X}_5$ (SNP, CNV, DNA methylation, gene expression, disease), each with a latent variable $\mathbf{v}_1$ to $\mathbf{v}_5$; discriminant analysis links gene expression and disease.]

Figure 1: The conceptual graphic representation of the integrative genomic model. A rectangle represents a manipulated variable, and a circle represents a latent variable. The graphic representation illustrates the structure model that shows the relationship between SNP, CNV, DNA methylation, gene expression, and disease phenotype.

2.2. Multiblock Discriminant Analysis. Multiblock Discriminant Analysis (MultiDA) builds a sparse association model by not only maximizing the total squared correlations between the blocks but also taking into account the discriminative factors in the model. MultiDA considers a linear subspace, which is a low-dimensional basis of the data. The linear subspaces of the blocks that maximize the total squared correlations, under sparsity regularization, identify the significant factors of the association model. The linear subspace (or latent variable) $\mathbf{v}_j$ of the $j$th block is represented by

$$\mathbf{v}_j = \mathbf{X}_j \boldsymbol{\alpha}_j, \tag{1}$$

where $\boldsymbol{\alpha}_j$ is a loading vector. We then introduce sparse regularization (elastic net penalization) on the loading vector to reduce the chance of including insignificant variables and to improve interpretability. Sparse regularization is especially advantageous when the number of features is much larger than the number of samples ($N \ll P_j$). Therefore, the basic objective function can be represented as

$$
\begin{aligned}
\arg\max_{\boldsymbol{\alpha}_j}\ & \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} c_{jk}\,
\frac{\boldsymbol{\alpha}_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_k \boldsymbol{\alpha}_k\, \boldsymbol{\alpha}_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_k \boldsymbol{\alpha}_k}
{\boldsymbol{\alpha}_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \boldsymbol{\alpha}_j\, \boldsymbol{\alpha}_k^{\top} \mathbf{X}_k^{\top} \mathbf{X}_k \boldsymbol{\alpha}_k} \\
\text{s.t.}\ & \boldsymbol{\alpha}_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \boldsymbol{\alpha}_j = 1, \quad
|\boldsymbol{\alpha}_j| \le t_1, \quad
\|\boldsymbol{\alpha}_j\|_2 \le t_2, \quad j = 1, \ldots, J,
\end{aligned}
\tag{2}
$$

where $|\cdot|$ and $\|\cdot\|_2$ represent the $\ell_1$-norm and $\ell_2$-norm of the vectors, respectively, and $t_1$ and $t_2$ are the shrinkage parameters that determine the sparsity. Note that the basic objective function is equivalent to Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. Since the integrative genomic model aims to represent gene expression regulated by the combinations of SNP, CNV, and DM, the matrix $\mathbf{C}$ can be defined as

$$
\mathbf{C} =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}.
\tag{3}
$$
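As a rough illustration of the objective in (2), the following NumPy sketch evaluates the weighted sum of squared correlations between latent variables for a given set of loading vectors. The function name `total_squared_correlation`, the toy block sizes, and the random stand-in data are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def total_squared_correlation(blocks, loadings, C):
    """Sum of c_jk * corr(v_j, v_k)^2 over linked block pairs, with v_j = X_j @ alpha_j."""
    latents = [X @ a for X, a in zip(blocks, loadings)]
    total = 0.0
    J = len(blocks)
    for j in range(J):
        for k in range(J):
            if j == k or C[j, k] == 0:
                continue
            num = (latents[j] @ latents[k]) ** 2
            den = (latents[j] @ latents[j]) * (latents[k] @ latents[k])
            total += C[j, k] * num / den
    return total

# Design matrix of equation (3): SNP, CNV, DM and class label each link to GE only.
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

rng = np.random.default_rng(0)
N = 50
dims = [6, 5, 4, 3, 1]                      # toy feature counts per block (hypothetical)
blocks = [rng.standard_normal((N, p)) for p in dims]
blocks = [X - X.mean(axis=0) for X in blocks]   # zero-mean columns, as in Section 2.1
loadings = [rng.standard_normal(p) for p in dims]
score = total_squared_correlation(blocks, loadings, C)
```

Because each term is a ratio, the score does not change when a loading vector is rescaled, which is why the unit-variance constraint in (2) can be imposed without loss of generality.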

We further consolidate the model by (1) introducing a weight matrix on the correlations to balance the model and (2) providing discriminant analysis in the integrative genomic model. We also provide a sophisticated solution of the model, whereas SGCCA heuristically estimates the optimal solution by following Wold's algorithm in the previous work [17].

2.2.1. Weight Matrix for the Balance of the Model. The weight matrix of the correlations between the blocks, $\mathbf{D} = \{d_{jk} \mid d_{jk} \in \mathbb{R},\ 1 \le j, k \le J\}$, is introduced in the model. In the original multiblock model, the correlation between the gene expression and class label blocks tends to be overlooked. Instead, the sum of the squared pairwise correlations of $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ contributes a large portion. The correlation weight matrix $\mathbf{D}$ gives an equal balance to the total squared correlations. In this paper, the weight matrix is defined as

$$
\mathbf{D} =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 3 \\
0 & 0 & 0 & 3 & 0
\end{bmatrix},
\tag{4}
$$

where the correlation between the gene expression and class label blocks is weighted three times more than the others. The matrix $\mathbf{D}$ then simply replaces the matrix $\mathbf{C}$.
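A small NumPy sketch, under the assumption that blocks are indexed from zero in the order SNP, CNV, DM, GE, class label, shows how the weight matrix in (4) can be derived from the design matrix in (3). The index names `GE` and `CLASS` are introduced here only for readability. One way to read the choice of the weight 3 is that the three genomic links into gene expression and the single class link then carry the same total weight.

```python
import numpy as np

GE, CLASS = 3, 4   # zero-based indices of the gene expression and class label blocks

C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

# Start from the binary design matrix and up-weight the GE / class label link.
D = C.astype(float)
D[GE, CLASS] = D[CLASS, GE] = 3.0

genomic_weight = D[GE, :3].sum()   # SNP-GE, CNV-GE and DM-GE links together
class_weight = D[GE, CLASS]        # single GE-class link
```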

2.2.2. Discriminant Analysis. In the proposed integrative genomic model, we need to find discriminative genes that characterize diseases. However, the integrative genomic model is composed of combinations of multiple linear regression models. Thus, discriminant analysis methods such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed on the basis of the linear regression model, and it was shown to provide equal or superior performance compared to other discriminant methods. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas Xiang et al. considered a multiclass problem and developed a sparse version with $\ell_{2,1}$-norm regularization, we reformulated the sparse method with elastic net penalization to suit our needs. In DLSR, the slack variable is introduced into the ordinary linear regression problem:

$$\mathbf{X}\mathbf{a} = \mathbf{y} + \mathbf{b} \odot \mathbf{m}, \tag{5}$$

where $\mathbf{y}$ is a dependent variable ($y_i \in \{-1, 1\}$, $\mathbf{y} \in \mathbb{R}^{N}$), $\mathbf{X}$ is a multivariate independent variable ($\mathbf{X} \in \mathbb{R}^{N \times p}$), and $\mathbf{a}$ is a coefficient vector ($\mathbf{a} \in \mathbb{R}^{p}$). $\mathbf{b}$ is the direction of the class, whose element $b_i = -1$ if $y_i = -1$ and $1$ otherwise. The Hadamard product $\odot$ of the direction vector $\mathbf{b}$ and the slack variable vector $\mathbf{m}$ determines the distance between classes. Since both are added elementwise to $\mathbf{y}$, $\mathbf{b}$ and $\mathbf{m}$ have one element per sample ($\mathbf{b}, \mathbf{m} \in \mathbb{R}^{N}$). The optimal solution is covered in the next section.
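To make the role of the slack variable in (5) concrete, the sketch below builds the relaxed regression target $\mathbf{y} + \mathbf{b} \odot \mathbf{m}$ for a toy label vector. The helper name `dlsr_targets` and the slack values are hypothetical and chosen only for illustration.

```python
import numpy as np

def dlsr_targets(y, m):
    """Return the relaxed regression target y + b * m of equation (5)."""
    b = np.where(y == -1, -1.0, 1.0)   # direction vector: -1 for class -1, +1 otherwise
    return y + b * m

y = np.array([-1.0, -1.0, 1.0, 1.0])
m = np.array([0.0, 0.5, 0.0, 2.0])     # nonnegative slack per sample (hypothetical values)
targets = dlsr_targets(y, m)
```

With nonnegative slack, each target moves away from the other class, so the two classes are never pushed closer together than the original labels -1 and 1.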

2.2.3. The Objective Function of MultiDA. We finally obtain the objective function of MultiDA:

$$
\begin{aligned}
\arg\max_{\boldsymbol{\alpha}_j}\ & \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk}\,
\frac{\boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k\, \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k}
{\boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j \boldsymbol{\alpha}_j\, \boldsymbol{\alpha}_k^{\top} \boldsymbol{\chi}_k^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k} \\
\text{s.t.}\ & \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j \boldsymbol{\alpha}_j = 1, \quad
|\boldsymbol{\alpha}_j| \le t_1, \quad
\|\boldsymbol{\alpha}_j\|_2 \le t_2, \quad j = 1, \ldots, J,
\end{aligned}
\tag{6}
$$

where $\boldsymbol{\chi}_j$ is defined as

$$
\boldsymbol{\chi}_j =
\begin{cases}
\mathbf{X}_j + \mathbf{b} \odot \mathbf{m} & \text{if } j = 5, \\
\mathbf{X}_j & \text{otherwise.}
\end{cases}
\tag{7}
$$

This setting enables one to perform discriminant analysis between the gene expression and disease blocks.

2.3. Optimization. The optimal solution of (6) can be obtained from the Lagrangian function:

$$
\begin{aligned}
\mathcal{L} = {} & -\sum_{j}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk}\, \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k\, \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k
+ \sum_{j}^{J} z_j \left( \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j \boldsymbol{\alpha}_j - 1 \right) \\
& + \sum_{j}^{J} \lambda_j\, |\boldsymbol{\alpha}_j|
+ \sum_{j}^{J} \frac{1 - \lambda_j}{2}\, \|\boldsymbol{\alpha}_j\|^2,
\end{aligned}
\tag{8}
$$

where $z_j$ and $\lambda_j$ are the Lagrangian multipliers. The Lagrangian function (8) is convex, although not differentiable. Therefore, a local optimum of (8) provides a global solution. Setting the partial derivatives of the Lagrangian function with respect to $\boldsymbol{\alpha}_j$ and $z_j$ to zero gives

$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\alpha}_j}
= -\sum_{k}^{J} d_{jk} \left( \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k \right) \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k
+ z_j \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j \boldsymbol{\alpha}_j
+ \lambda_j \mathbf{s}_j + (1 - \lambda_j) \boldsymbol{\alpha}_j = 0,
\tag{9}
$$

$$
\frac{\partial \mathcal{L}}{\partial z_j}
= \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j \boldsymbol{\alpha}_j - 1 = 0,
\tag{10}
$$

where $\mathbf{s}_j$ is the vector of the signs of the elements of $\boldsymbol{\alpha}_j$, and constant factors are absorbed into the multipliers. Equation (10) is the derivative with respect to the multiplier $z_j$ of the unit-variance constraint, which recovers that constraint. Although the stationary equations have no closed-form solution, the optimal solution can be estimated by an iterative algorithm.

We can simplify (9) with the inner component

$$
\boldsymbol{\upsilon}_j = \sum_{k,\, k \neq j}^{J} d_{jk} \left( \boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k \right) \boldsymbol{\chi}_k \boldsymbol{\alpha}_k.
\tag{11}
$$

Then, by substituting the inner component $\boldsymbol{\upsilon}_j$ into (9), the solution for $\boldsymbol{\alpha}_j$ can be written as

$$
\boldsymbol{\alpha}_j = \left[ z_j \left( \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_j + \frac{1 - \lambda_j}{z_j} \mathbf{I} \right) \right]^{-1} \left( \boldsymbol{\chi}_j^{\top} \boldsymbol{\upsilon}_j - \lambda_j \mathbf{s}_j \right).
\tag{12}
$$

In (11), $(\boldsymbol{\alpha}_j^{\top} \boldsymbol{\chi}_j^{\top} \boldsymbol{\chi}_k \boldsymbol{\alpha}_k)$ is a scalar that measures the correlation between the latent variables of the $j$th and $k$th blocks. Therefore, the inner component is computed from the $\boldsymbol{\alpha}_j$ of the previous iteration, and a new $\boldsymbol{\alpha}_j$ is then computed at each iteration.

Equation (12) is the normal equation of the regression of $\boldsymbol{\upsilon}_j$ on $\boldsymbol{\chi}_j$ with ridge and shrinkage parameters [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$
\alpha_{j(\ell)} = \operatorname{sign}\left( \boldsymbol{\chi}_{j(\ell)}^{\top} \boldsymbol{\upsilon}_j \right) \left( \left| \boldsymbol{\chi}_{j(\ell)}^{\top} \boldsymbol{\upsilon}_j \right| - \lambda_j \right)_{+},
\tag{13}
$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, $1$ if $x \ge 0$ and $-1$ otherwise, and $(x)_{+}$ returns only the positive part of $x$ (i.e., $x$ if $x \ge 0$ and $0$ otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes the mean squared error. The parameter $z_j$ can be ignored because the solution $\boldsymbol{\alpha}_j$ is normalized to satisfy (10):

$$
\boldsymbol{\alpha}_j = \frac{\sqrt{N}\, \boldsymbol{\alpha}_j}{\left\| \boldsymbol{\chi}_j \boldsymbol{\alpha}_j \right\|}.
\tag{14}
$$
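The two update steps in (13) and (14) are short enough to sketch directly in NumPy. The function names `soft_threshold` and `normalize` and the small example matrix are assumptions for illustration; the paper does not prescribe an implementation.

```python
import numpy as np

def soft_threshold(X, inner, lam):
    """Equation (13): threshold the projection of each column of X on the inner component.

    np.sign(0) is 0 rather than 1, which makes no difference here because
    the thresholded magnitude is already 0 in that case (for lam >= 0).
    """
    z = X.T @ inner
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def normalize(X, alpha):
    """Equation (14): rescale alpha so that ||X @ alpha|| equals sqrt(N)."""
    norm = np.linalg.norm(X @ alpha)
    if norm == 0.0:
        return alpha                     # nothing to rescale if every loading is zero
    return np.sqrt(X.shape[0]) * alpha / norm

X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [-1.0, 0.0],
              [0.0, -2.0]])
inner = np.array([1.0, 0.1, -1.0, -0.1])

alpha = soft_threshold(X, inner, lam=0.5)   # second feature is thresholded to zero
alpha_n = normalize(X, alpha)
```

In this toy case the projections are 2.0 and 0.4, so a threshold of 0.5 keeps only the first feature, which is the sparsity mechanism the elastic net penalty is meant to provide.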

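For the discriminant analysis between the gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\boldsymbol{\alpha}_4$ can be estimated by solving the following least squares problem, which is a minimization, consistent with the Lagrangian below:

$$
\begin{aligned}
\arg\min_{\boldsymbol{\alpha}_4,\, \mathbf{m}}\ & \frac{1}{2} \left\| \boldsymbol{\chi}_4 \boldsymbol{\alpha}_4 - \left( \boldsymbol{\upsilon}_5 + \mathbf{b} \odot \mathbf{m} \right) \right\|^2 \\
\text{s.t.}\ & |\boldsymbol{\alpha}_4| \le \xi_1, \quad \|\boldsymbol{\alpha}_4\|_2 \le \xi_2.
\end{aligned}
\tag{15}
$$

The Lagrangian function of (15) is $\mathcal{L} = \frac{1}{2} \|\boldsymbol{\chi}_4 \boldsymbol{\alpha}_4 - \boldsymbol{\upsilon}_5 - \mathbf{b} \odot \mathbf{m}\|^2 + \lambda_4 |\boldsymbol{\alpha}_4| + \frac{1 - \lambda_4}{2} \|\boldsymbol{\alpha}_4\|^2$. The derivative of the Lagrangian function with respect to $\boldsymbol{\alpha}_4$ is

$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\alpha}_4}
= \boldsymbol{\chi}_4^{\top} \boldsymbol{\chi}_4 \boldsymbol{\alpha}_4 - \boldsymbol{\chi}_4^{\top} \boldsymbol{\gamma} + \lambda_4 \mathbf{s} + (1 - \lambda_4) \boldsymbol{\alpha}_4 = 0,
\tag{16}
$$

where $\mathbf{s}$ is the sign of $\boldsymbol{\alpha}_4$ and $\boldsymbol{\gamma} = \boldsymbol{\upsilon}_5 + \mathbf{b} \odot \mathbf{m}$. Thus, the equation for $\boldsymbol{\alpha}_4$ becomes

$$
\boldsymbol{\alpha}_4 = \left( \boldsymbol{\chi}_4^{\top} \boldsymbol{\chi}_4 + (1 - \lambda_4) \mathbf{I} \right)^{-1} \left( \boldsymbol{\chi}_4^{\top} \boldsymbol{\gamma} - \lambda_4 \mathbf{s} \right).
\tag{17}
$$

Finally, the optimal solution of $\boldsymbol{\alpha}_4$ for the discriminant analysis is

$$
\alpha_{4(\ell)} = \operatorname{sign}\left( \boldsymbol{\chi}_{4(\ell)}^{\top} \boldsymbol{\gamma} \right) \left( \left| \boldsymbol{\chi}_{4(\ell)}^{\top} \boldsymbol{\gamma} \right| - \lambda_4 \right)_{+}.
\tag{18}
$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes the mean squared error, like the other $\lambda_j$. The optimal solution of $\mathbf{m}$ follows directly from [19]:

$$
\mathbf{m} = \max\left( \mathbf{b} \odot \left( \boldsymbol{\chi}_4 \boldsymbol{\alpha}_4 - \boldsymbol{\upsilon}_5 \right),\ \mathbf{0} \right).
\tag{19}
$$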
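The alternation between (18) and (19) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `discriminant_step` is hypothetical, the class labels are used as a stand-in for the inner component $\boldsymbol{\upsilon}_5$, and the loading is renormalized as in (14) after each update.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def discriminant_step(X4, target, b, lam, n_iter=20):
    """Alternate equations (18) and (19) for a fixed target vector (upsilon_5 in the text)."""
    N = X4.shape[0]
    m = np.zeros(N)
    alpha = np.zeros(X4.shape[1])
    for _ in range(n_iter):
        gamma = target + b * m                           # gamma = upsilon_5 + b * m
        alpha = soft_threshold(X4.T @ gamma, lam)        # equation (18)
        norm = np.linalg.norm(X4 @ alpha)
        if norm > 0:
            alpha = np.sqrt(N) * alpha / norm            # normalization as in equation (14)
        m = np.maximum(b * (X4 @ alpha - target), 0.0)   # equation (19)
    return alpha, m

# Toy data: the first feature follows the labels, the second is orthogonal to them.
y = np.array([-1.0, -1.0, 1.0, 1.0])
X4 = np.array([[-1.0, 1.0],
               [-2.0, -1.0],
               [1.0, 1.0],
               [2.0, -1.0]])
alpha, m = discriminant_step(X4, target=y, b=y, lam=1.0)
```

On this toy input the uninformative second feature receives a zero loading, and slack is assigned only to the samples whose fitted value already lies beyond its label on the correct side.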

The algorithm is summarized in Algorithm 1. In the algorithm, $r$ represents the rank of the subspace, which determines the dimension of the subspace. For instance, $\boldsymbol{\alpha}_j^{r}$ is the $r$th rank of $\boldsymbol{\alpha}_j$. MultiDA optimizes the first-rank subspace and iterates the optimization until the multiblock data has no remaining information. In lines 10–14 of Algorithm 1, Wold's procedure guarantees convergence [22].

(1) For all blocks, normalize the loading vectors: $\boldsymbol{\alpha}_j^{0} = \sqrt{N}\, \boldsymbol{\alpha}_j^{0} / \|\boldsymbol{\chi}_j \boldsymbol{\alpha}_j^{0}\|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\boldsymbol{\alpha}_j$ by (18) and (19)
(8)         update $\boldsymbol{\chi}_k = \mathbf{X}_k + \mathbf{b} \odot \mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\boldsymbol{\upsilon}_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \boldsymbol{\alpha}_j^{r\top} \boldsymbol{\chi}_j^{r\top} \boldsymbol{\chi}_k^{r} \boldsymbol{\alpha}_k^{r+1} \right) \boldsymbol{\chi}_k^{r} \boldsymbol{\alpha}_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\boldsymbol{\upsilon}_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \boldsymbol{\alpha}_j^{r\top} \boldsymbol{\chi}_j^{r\top} \boldsymbol{\chi}_k^{r} \boldsymbol{\alpha}_k^{r} \right) \boldsymbol{\chi}_k^{r} \boldsymbol{\alpha}_k^{r}$
(14)      end if
(15)      Compute $\boldsymbol{\alpha}_j^{r+1}$ by UST: $\alpha_{j(\ell)}^{r+1} = \operatorname{sign}\left( \boldsymbol{\chi}_{j(\ell)}^{\top} \boldsymbol{\upsilon}_j \right) \left( \left| \boldsymbol{\chi}_{j(\ell)}^{\top} \boldsymbol{\upsilon}_j \right| - \lambda_j \right)_{+}$
(16)      Normalize $\boldsymbol{\alpha}_j^{r+1}$: $\boldsymbol{\alpha}_j^{r+1} = \sqrt{N}\, \boldsymbol{\alpha}_j^{r+1} / \|\boldsymbol{\chi}_j \boldsymbol{\alpha}_j^{r+1}\|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J} \boldsymbol{\alpha}_j^{r}$ converges

Algorithm 1: Discriminant multiblock analysis.
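The core loop of Algorithm 1 can be sketched in NumPy under simplifying assumptions. The sketch below covers only the inner component (11), the soft-thresholding update (13), and the normalization (14), with blocks updated in sequence so that blocks $k < j$ already use their new loadings. It omits the discriminant step in lines 6 to 9 and handles a single rank. The function name `multiblock_loop`, the all-ones initialization, the rule of keeping the previous loading when every entry is thresholded away, and the toy data are all assumptions of this example.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def normalize(X, alpha):
    norm = np.linalg.norm(X @ alpha)
    return alpha if norm == 0 else np.sqrt(X.shape[0]) * alpha / norm

def multiblock_loop(blocks, D, lam, max_iter=100, tol=1e-8):
    """Simplified first-rank loop: inner component (11), UST (13), normalization (14)."""
    J = len(blocks)
    alphas = [normalize(X, np.ones(X.shape[1])) for X in blocks]
    for _ in range(max_iter):
        previous = [a.copy() for a in alphas]
        for j in range(J):
            v_j = blocks[j] @ alphas[j]
            inner = np.zeros(blocks[j].shape[0])
            for k in range(J):
                if k == j or D[j, k] == 0:
                    continue
                v_k = blocks[k] @ alphas[k]        # already updated when k < j
                inner += D[j, k] * (v_j @ v_k) * v_k
            candidate = soft_threshold(blocks[j].T @ inner, lam[j])
            if np.any(candidate):                  # keep the old loading if all entries vanish
                alphas[j] = normalize(blocks[j], candidate)
        change = sum(np.linalg.norm(a - p) for a, p in zip(alphas, previous))
        if change < tol:
            break
    return alphas

# Toy example: three blocks whose first feature shares a common signal.
rng = np.random.default_rng(0)
N = 100
signal = rng.standard_normal(N)
blocks = []
for p in (4, 5, 3):
    X = rng.standard_normal((N, p))
    X[:, 0] = signal + 0.1 * rng.standard_normal(N)
    blocks.append(X - X.mean(axis=0))
D = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)
alphas = multiblock_loop(blocks, D, lam=[5.0, 5.0, 5.0])
```

Note that the scale of the thresholds depends on $N$ and on the normalization of the latent variables, so the values passed as `lam` here are placeholders rather than recommended settings; the text selects them by cross-validation.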

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminative factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression relative to normal (control) samples.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, on simulated data of various complexities. The generation schemes of the simulation data were extended from previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. $\text{Type}_1(\mu)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\mathbf{I}_{p \times p}$), where $\mathbf{I}_{p \times p}$ is a $p \times p$ identity matrix. $\text{Type}_2(\mu, \delta)$ generates more complicated data than $\text{Type}_1(\mu)$. In $\text{Type}_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $1_{\delta}$. Given a uniformly distributed random value ($u$), $1_{\delta} = 1$ if $u \le \delta$ and $0$ otherwise. $\text{Type}_3(\mu, \rho)$ considers multicollinear data in which more than two variables are highly correlated. The matrix data are generated by the multivariate normal distribution $\mathcal{N}(\mu, \Sigma_{p \times p})$. The covariance structure $\Sigma_{p \times p}$ is built by a first-order autoregressive process. $\text{Type}_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\sigma$).


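Table 1: Generation functions.

| Function | Model |
|---|---|
| $\text{Type}_1(\mu)$ | $\mathbf{x} = \mu + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\text{Type}_2(\mu, \delta)$ | $\mathbf{x} = \mu + 1_{\delta} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\text{Type}_3(\mu, \rho)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma_{p \times p})$ |
| $\text{Type}_4(\mu, \sigma)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \sigma \mathbf{I}_{p \times p})$ |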
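The four generation functions in Table 1 can be sketched with NumPy as below. Several details are assumptions of this example rather than statements from the paper: the indicator $1_{\delta}$ is drawn independently for every entry, the first-order autoregressive covariance is taken to be $\Sigma_{ab} = \rho^{|a-b|}$, and $\sigma$ is treated as a variance. The decimal values used for the first block follow the reading of Table 2 in which the extracted "24", "06", and "08" are 2.4, 0.6, and 0.8.

```python
import numpy as np

def type1(mu, n, p, rng):
    """x = mu + eps, eps ~ N(0, I)."""
    return mu + rng.standard_normal((n, p))

def type2(mu, delta, n, p, rng):
    """x = mu + 1_delta + eps, where 1_delta = 1 when a uniform draw is <= delta."""
    indicator = (rng.uniform(size=(n, p)) <= delta).astype(float)
    return mu + indicator + rng.standard_normal((n, p))

def type3(mu, rho, n, p, rng):
    """x ~ N(mu, Sigma) with an assumed AR(1) covariance Sigma[a, b] = rho ** |a - b|."""
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.full(p, float(mu)), sigma, size=n)

def type4(mu, sigma, n, p, rng):
    """x ~ N(mu, sigma * I), with sigma interpreted as a variance."""
    return mu + np.sqrt(sigma) * rng.standard_normal((n, p))

rng = np.random.default_rng(0)
N = 500
# First block of the simulation scheme: 5 + 5 + 30 + 60 = 100 columns.
X1 = np.hstack([
    type1(2.4, N, 5, rng),
    type1(-2.6, N, 5, rng),
    type2(1.0, 0.6, N, 30, rng),
    type3(0.0, 0.8, N, 60, rng),
])
```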

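Table 2: Scheme of the simulation data.

| Simulation data | Generation model type | Column index |
|---|---|---|
| $\mathbf{X}_1$ | $\mathbf{x}_i = \text{Type}_1(2.4)$ | $1 \le \ell \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(-2.6)$ | $6 \le \ell \le 10$ |
| | $\mathbf{x}_i = \text{Type}_2(1, 0.6)$ | $11 \le \ell \le 40$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.8)$ | $41 \le \ell \le 100$ |
| $\mathbf{X}_2$ | $\mathbf{x}_i = \text{Type}_1(3)$ | $1 \le \ell \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(4)$ | $6 \le \ell \le 10$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.9)$ | $11 \le \ell \le 60$ |
| | $\mathbf{x}_i = \text{Type}_4(2, 2)$ | $61 \le \ell \le 200$ |
| $\mathbf{X}_3$ | $\mathbf{x}_i = \text{Type}_1(5)$ | $1 \le \ell \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(-3)$ | $6 \le \ell \le 10$ |
| | $\mathbf{x}_i = \text{Type}_4(0, 1)$ | $11 \le \ell \le 210$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.9)$ | $211 \le \ell \le 300$ |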

The first three blocks ($\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by $\text{Type}_1(2.4)$ and the following five columns by $\text{Type}_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, $\text{Type}_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated as multicollinear random variables, $\text{Type}_3(0, 0.8)$. Then we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3} \mathbf{X}_j \mathbf{B}_j + \boldsymbol{\Xi}$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\boldsymbol{\Xi}$ is an $N \times P_4$ normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant in explaining $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4 \times 1}$ (all zeros except the first ten elements), the probability of disease $\pi$ was computed by

$$
\pi = \frac{\exp\left( \mathbf{X}_4 \mathbf{B}_4 \right)}{1 + \exp\left( \mathbf{X}_4 \mathbf{B}_4 \right)}.
\tag{20}
$$

Then the binary class label block was generated from the Bernoulli distribution with probability $\pi$.
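A minimal NumPy sketch of this generation step is given below. The first three blocks are replaced by standard normal placeholders, and the nonzero entries of the loading matrices and of $\mathbf{B}_4$ are drawn from arbitrary ranges; the paper does not state these values, so they are assumptions made only so that the example runs.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dims, P4 = 500, [100, 200, 300], 50

# Placeholders for X1..X3; the text builds these from the Table 2 scheme instead.
blocks = [rng.standard_normal((N, p)) for p in dims]

# X4 = sum_j X_j B_j + Xi, where only the first ten rows of each B_j are nonzero.
X4 = rng.standard_normal((N, P4))                      # noise matrix Xi (N x P4)
for X, p in zip(blocks, dims):
    B = np.zeros((p, P4))
    B[:10, :] = rng.uniform(0.5, 1.0, size=(10, P4))   # hypothetical nonzero loadings
    X4 += X @ B

# Class labels: logistic probability of equation (20), then a Bernoulli draw.
B4 = np.zeros(P4)
B4[:10] = rng.uniform(-0.1, 0.1, size=10)              # hypothetical coefficients
logits = X4 @ B4
pi = 1.0 / (1.0 + np.exp(-logits))                     # same value as exp(x) / (1 + exp(x))
labels = rng.binomial(1, pi)
```

The form `1 / (1 + exp(-x))` is algebraically identical to (20) and is used here because it is the more common way to write the logistic function in code.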

The simulation study was run with 50 replications to assess reproducibility. We compared the performance of MultiDA with the related methods Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between an independent variable $\mathbf{X}$ and a response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3]$), and the GE block was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We evaluated the methods by how well they correctly identify the significant factors of the integrative association model. Given the ground truth, we computed a confusion matrix and measured the True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the true negatives far outnumber the false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) are not reported in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method, MultiDA (0.93 ± 0.03), and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This supports the view that the multiblock methods reduce false negatives, that is, significant factors incorrectly identified as insignificant. MultiDA performed best in PPV and ACCU, with 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values indicate fewer false positives, that is, insignificant factors incorrectly identified as significant. The PPV and ACCU were 0.48 ± 0.15 and 0.89 ± 0.14 for SCCA and 0.54 ± 0.08 and 0.94 ± 0.01 for SGCCA, respectively.
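For reference, the three reported measures can be computed from a selected-feature mask and a ground-truth mask as in the sketch below. The function name `selection_metrics` and the example masks are hypothetical; the example numbers are not results from the paper.

```python
import numpy as np

def selection_metrics(selected, truth):
    """Return (TPR, PPV, ACCU) for boolean masks of selected and truly significant features."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(selected & truth)
    fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth)
    tn = np.sum(~selected & ~truth)
    tpr = tp / (tp + fn)               # share of significant features that were found
    ppv = tp / (tp + fp)               # share of selected features that are significant
    accu = (tp + tn) / selected.size
    return tpr, ppv, accu

# Toy case: 100 features, the first 10 are truly significant.
truth = np.zeros(100, dtype=bool)
truth[:10] = True
selected = np.zeros(100, dtype=bool)
selected[:9] = True        # 9 true positives, 1 false negative
selected[20:26] = True     # 6 false positives
tpr, ppv, accu = selection_metrics(selected, truth)
```

In this toy case the accuracy stays high even with six false positives because true negatives dominate, which is the same reason the text omits TNR and NPV. The function assumes at least one selected and one truly significant feature; otherwise the ratios are undefined.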

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained for three major psychiatric disorders, schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as for a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 control samples were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in preparation for this experiment. For each individual, 10,760 SNPs (after removing highly correlated ones), 1,028 CNVs, 20,769 DNA methylations, and 19,767 gene expressions were examined. Because recent research reported that genetic effects may be largely shared among major psychiatric disorders, such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and applied MultiDA to identify discriminative factors against the control [25, 26].

The multiblock data were analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, for which high correlation between the connected blocks was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method: we chose the genes physically located near the selected SNPs and the genes corresponding to the selected CNVs and DNA methylations. Notable genes among the results of MultiDA are listed in Table 3, together with the data source of each gene and the literature relating it to psychiatric disorders.

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among a number of the retrieved interactions, we take note of one gene


Table 3: The gene results from MultiDA with psychiatric disorders.

Gene      Chromosome  Location   Source  ID          MAF   Reference
HTR7      10          10q21-q24  GE      7934970     -     [28]
APOE      19          19q13.2    DM      cg14123992  -     [29]
TRPM1     15          15q13.3    DM      cg18085517  -     -
EPHB1     3           3q21-q23   CNV     CNP12652    -     -
NPY       7           7p15.1     CNV     CNP2267     -     [30]
QKI       6           6q26       SNP     rs1336225   0.18  -
SLC15A1   13          13q32.3    SNP     rs9517421   0.17  [31]
NPAS3     14          14q13.1    SNP     rs1124910   0.25  [32]
C15orf53  15          15q14      SNP     rs1433876   0.29  [33]

Figure 2: Performance comparison of SCCA, SGCCA, and MultiDA in the simulation study. (a) True Positive Rate. (b) Positive Predictive Value. (c) Accuracy.


Figure 3: The gene regulatory network searched with the gene results in the STRING database. The network comprises CES1, HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, and AKR1D1. The legend shows the data source of each gene (gene expression, SNP, CNV, or DNA methylation).

regulatory network illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation set, NPY and CES1 are from the CNV set, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene expression by impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene encodes 5-hydroxytryptamine (serotonin) receptor 7; serotonin is a major neurotransmitter in the central nervous system, and a number of studies have related the gene to bipolar disorder and schizophrenia [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, whereas previous studies reported the gene through GWAS on the SNP data block. The gene may interact strongly with the other heterogeneous data, which is why it is considered significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction; NPY encodes neuropeptide Y, which is also a neurotransmitter in the brain and is known to play an important role in emotional processing [30]. A large number of psychiatric disorder susceptibility genes have been associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may be a susceptibility gene for psychiatric disorders. Previous research [35] found that Adcy8 is a susceptibility gene for avoidance behavior in mice and also found indirect evidence that it confers susceptibility to human mood disorders. Our result supports this claim.

4. Conclusion

In this paper we developed a novel Multiblock Discriminant Analysis method to dissect the mechanisms of complex human diseases using multiple types of genetic data. A genomic association study with a single type of data may fall short of identifying the mechanisms of diseases. MultiDA, in contrast, enables comprehensive analysis using multiple genetic data. Moreover, MultiDA supports the special setting of binary class data, in which it detects discriminative factors in the integrative genomic model. The simulation experiments support the strong performance of the proposed method. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. From the large number of variables in each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures a global profile of the mechanism that conventional single-block or two-block methods fail to detect. This tool for integrative genomic study can be flexibly extended to new types of data as new high-throughput technologies emerge.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, “Genome-wide association studies for common diseases and complex traits,” Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, “Copy number variants, diseases and gene expression,” Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, “Revealing the architecture of gene regulation: the promise of eQTL studies,” Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, “Epigenetic inheritance and the missing heritability problem,” Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk et al., “Copy number variation: new insights in genome diversity,” Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, “Human copy number variation and complex genetic disease,” Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, “Cancer epigenetics: modifications, screening, and therapy,” Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, “DNA methylation and its basic function,” Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer et al., “Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors,” Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, “The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts,” Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas et al., “Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations,” PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson et al., “An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits,” PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, “Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets,” Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, “Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma,” American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, “Identifying multi-layer gene regulatory modules from multi-dimensional genomic data,” Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, “Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders,” in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC ’13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, “Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA,” Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative least squares regression for multiclass classification and feature selection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, “Regularized generalized canonical correlation analysis,” Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, “PLS path modelling: computation of latent variables with the estimation mode B,” Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, “A sparse PLS for variable selection when integrating omics data,” Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, “Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis,” Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, “A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder,” Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, “Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis,” The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., “STRING v9.1: protein-protein interaction networks, with increased coverage and integration,” Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, “Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder,” Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, “ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype,” Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, “The NPY system in stress, anxiety and depression,” Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., “Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5),” BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., “Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder,” Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., “The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?” Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, “Functions of DNA methylation: islands, start sites, gene bodies and beyond,” Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., “Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder,” Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


Research Article
Multiblock Discriminant Analysis for Integrative Genomic Study

Mingon Kang,1 Dong-Chul Kim,2 Chunyu Liu,3 and Jean Gao1

1 Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
2 Department of Computer Science, University of Texas-Pan American, Edinburg, TX 78539, USA
3 Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 66012, USA

Correspondence should be addressed to Jean Gao; gao@uta.edu

Received 9 December 2014; Accepted 21 April 2015

Academic Editor: Klaus Wimmers

Copyright © 2015 Mingon Kang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Human diseases are abnormal medical conditions in which multiple biological components are involved in complicated ways. Nevertheless, most research contributions have been made with a single type of genetic data, such as Single Nucleotide Polymorphism (SNP) or Copy Number Variation (CNV). To fully exploit the knowledge of complex human diseases, epigenetic modifications and transcriptional regulation have to be considered as well as the genomic variants. We call the collection of the multiple heterogeneous data “multiblock data.” In this paper we propose a novel Multiblock Discriminant Analysis (MultiDA) method that provides a new integrative genomic model for multiblock analysis and an efficient algorithm for discriminant analysis. The integrative genomic model is built by exploiting representative genomic data including SNP, CNV, DNA methylation, and gene expression. The efficient algorithm for discriminant analysis identifies discriminative factors of the multiblock data. Discriminant analysis is essential for discovering biomarkers in computational biology. The performance of the proposed MultiDA was assessed by intensive simulation experiments, in which it outperformed the related methods. As a target application, we applied MultiDA to human brain data of psychiatric disorders. The findings and the gene regulatory network derived from the experiment are discussed.

1. Introduction

Human diseases involve complex processes that include interactions among multiple biological layers such as genetic, epigenetic, and transcriptional regulation. Research based on a single type of biological data is insufficient to fully exploit the knowledge of complex human diseases. Prior research shows that a comprehensive consideration of multiple types of biological data is essential for an in-depth understanding of the complex mechanisms of human diseases and for the identification of disease markers. Recent advances in high-throughput technologies such as DNA microarray and sequencing technologies efficiently profile various types of genomic data. The genomic data include Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), DNA methylation (DM), and gene expression (GE). Integrative genomic analysis of the heterogeneous genomic data plays an important role in profiling a global view of a biological system as well as in identifying significant markers of human diseases.

However, most research has focused solely on a single type of genomic data. Genome-Wide Association Studies (GWAS) examine genetic loci that are associated with a trait (e.g., major diseases) using SNP data [1, 2]. GWAS normally compare the SNP arrays of two groups, disease (case) and normal (control) samples. If a genetic variation at a locus differs with statistical significance between the disease samples and the controls, the SNP is considered associated with the disease. Expression Quantitative Trait Loci (eQTL) studies, in turn, have been actively pursued to identify genetic loci that regulate gene expression [3]. Combining gene microarray data with GWAS not only enables the capture of gene regulatory interactions but also provides insight into the genetic mechanism that regulates gene expression variations. However, both GWAS and eQTL mapping studies still leave a “missing heritability” problem [4].

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 783592, 10 pages, http://dx.doi.org/10.1155/2015/783592


In addition to SNP, Copy Number Variation (CNV) and DNA methylation (DM) have also been highlighted as key factors that affect gene expression regulation. CNV is a structural alteration of DNA in which specific regions of the genome are deleted or duplicated on chromosomes. Although CNV is frequently observed even in healthy individuals, it is hypothesized that the variants may cause diseases by directly affecting gene dosage and gene expression [5, 6]. Specifically, whole-genome association studies of the relationship between CNV and diseases reported that gene expression levels in CNV regions are strongly related to the deletion or duplication of the regions [6]. Typically, the deletion of either particular regions within a gene or regulatory regions of a gene may result in lower gene expression than normal. DM is an epigenetic modification that occurs by the addition of a methyl group to the cytosine or adenine of DNA. DM inhibits transcription of genes with high levels of 5-methylcytosine in their promoter region or recruits proteins, such as histone deacetylases, that can modify histones [7, 8]. DM consequently changes gene expression levels even when the DNA bases are the same.

Thus, recent research has actively extended GWAS and eQTL mapping studies to integrative association studies with multiple types of genomic data. Most integrative genomic research focuses on identifying genetic, epigenetic, or posttranscriptional factors that control gene expression (or microRNA) regulation by considering the complex interactions of SNP, CNV, and DM [9–11]. Specifically, the Cancer Genome Atlas [9] conducted large-scale multidimensional analysis with SNP, CNV, DM, and GE to provide comprehensive genomic characterizations of brain cancer. In Aure et al.’s work [10], the combined effects of CNV and DM were examined to identify associations with alterations of miRNA expression in breast tumors. Wagner et al. [11] studied the relationship between SNP, DM, and GE via multiple eQTL analysis.

Most integration approaches have used step-by-step processes. Ordinarily, such approaches filter candidate markers by using statistical techniques in the first step and find the final markers that satisfy certain criteria in the remaining stages [12–15]. This type of integration method often accumulates “type II errors” at each step; that is, it fails to find informative markers by incorrectly identifying them as insignificant. Moreover, these approaches do not consider interaction effects of the multiblock data, so the underlying mechanism is not taken into account.

Hence, research has recently started to shift toward approaches that use systematic models to integrate and analyze the heterogeneous data comprehensively, rather than through simple stepwise processes [16–18]. Multiblock methods based on Partial Least Squares (PLS) and Generalized Canonical Correlation Analysis (GCCA) are representative. A sparse variant of PLS was proposed that penalizes both the feature and sample dimensions to identify “regulatory modules” [16]. Such PLS-based methods, which maximize the covariance between latent variables, often fail to detect significant factors when their intensities are weak. Furthermore, the method does not consider discriminant analysis of the disease. A sparse multiblock analysis method derived from Generalized Canonical Correlation Analysis (SGCCA) was developed to identify multiblock association models while considering the relationship between the different data blocks, such as cis-regulated mutations [17]. This work builds a hybrid model by combining GWAS and eQTL models rather than a multiblock integration model. Another data integration approach utilizes multiple feature selection methods such as Principal Component Analysis (PCA), PLS, and LASSO [18]. The authors extracted the important factors using dimension reduction and feature selection methods and applied them to Cox survival models. However, combination effects of the multiblock data were ignored in this approach.

To tackle these limitations, we propose a novel Multiblock Discriminant Analysis (MultiDA) method for integrative genomic study. The proposed method makes the following main contributions:

(i) A new integrative genomic model for discriminant analysis is introduced by exploiting class information.

(ii) An efficient optimization algorithm is developed to solve the discriminant analysis problem in the integrative genomic model.

First, we build a novel integrative genomic model for discriminant analysis. The class data is considered as one block, and the total squared correlation including the class block is maximized. Introducing the class block into the multiblock model enables us to perform discriminant analysis in the integrative genomic model. Second, we propose a method to solve the discriminant analysis problem in the new integrative genomic model. Discriminant analysis is essential for identifying biomarkers of human diseases in computational biology, yet it has been overlooked in multiblock analysis. The efficient algorithm for the discriminant analysis and an assessment of its performance are explored in this paper.

2. Methods

2.1. Notation. We suppose that there are $J$ blocks of data, measured on the same set of $N$ observations. A block consists of a group of features that share common properties or represent one aspect of the sample. The multiblock data are denoted by $\mathbf{X} = \{\mathbf{X}_1, \ldots, \mathbf{X}_J\}$. The $j$th block $\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$ consists of $P_j$ zero-mean column vectors. A matrix $\mathbf{C} = \{c_{jk} \mid c_{jk} \in \{0, 1\},\ 1 \le j, k \le J\}$ is a binary matrix that determines the linkage between the blocks, where $c_{jk} = 1$ if block $j$ and block $k$ are connected and $c_{jk} = 0$ otherwise. In the proposed integrative genomic model, SNP, CNV, DM, GE, and the class label (case or control) of the samples are considered as the multiblock components. For simplicity, $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ represent SNP, CNV, DM, GE, and class label, respectively.

Throughout this paper we use $i$ for the index of the sample and $j, k$ for the indices of the blocks. $(\imath)$ is used to denote a column vector of a matrix or an element of a vector. For instance, $\mathbf{X}_{i(\imath)}$ and $a_{i(\imath)}$ represent the $\imath$th column vector of the matrix $\mathbf{X}_i$ and the $\imath$th element of the vector $\mathbf{a}_i$, respectively. Figure 1 illustrates the conceptual overview of the multiblock data and framework.

Figure 1: The conceptual graphic representation of the integrative genomic model. A rectangle represents a manipulated variable and a circle represents a latent variable. The graphic representation illustrates the structure model that shows the relationship between SNP, CNV, DNA methylation, gene expression, and disease phenotype. (The figure shows the blocks $\mathbf{X}_1, \ldots, \mathbf{X}_5$, their latent variables $\mathbf{v}_1, \ldots, \mathbf{v}_5$, and a component labeled “Discriminant analysis.”)

2.2. Multiblock Discriminant Analysis. Multiblock Discriminant Analysis (MultiDA) builds a sparse association model by not only maximizing the total squared correlations between the blocks but also taking into account the discriminative factors in the model. MultiDA considers a linear subspace, which is a low-dimensional basis of the data. The linear subspaces of the blocks that maximize the total squared correlations under sparsity regularization identify the significant factors of the association model. The linear subspace (or latent variable) $\mathbf{v}_j$ of the $j$th block is represented by

$$\mathbf{v}_j = \mathbf{X}_j \alpha_j, \tag{1}$$

where $\alpha_j$ is a loading vector. We then introduce sparse regularization (elastic net penalization) on the loading vector to reduce the chance of including insignificant variables and to improve interpretability. Sparse regularization is especially advantageous when the number of features is much larger than the number of samples ($N \ll P_j$). Therefore the basic objective function can be represented as

$$
\begin{aligned}
\operatorname*{argmax}_{\alpha_j}\ & \sum_{j=1}^{J} \sum_{k=1,\, k \neq j}^{J} c_{jk}\, \frac{\left(\alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_k \alpha_k\right)^2}{\left(\alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \alpha_j\right)\left(\alpha_k^{\top} \mathbf{X}_k^{\top} \mathbf{X}_k \alpha_k\right)} \\
\text{s.t.}\ & \alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \alpha_j = 1,\quad |\alpha_j| \le t_1,\quad \|\alpha_j\|^2 \le t_2,\quad j = 1, \ldots, J,
\end{aligned}
\tag{2}
$$

where $|\cdot|$ and $\|\cdot\|^2$ represent the $\ell_1$-norm and the squared $\ell_2$-norm of a vector, respectively, and $t_1$ and $t_2$ are the shrinkage parameters that determine the sparsity. Note that the basic objective function is equivalent to Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. Since the integrative genomic model aims to represent gene expression regulated by the combinations of SNP, CNV, and DM, the matrix $\mathbf{C}$ can be defined as

$$
\mathbf{C} =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}.
\tag{3}
$$
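As an illustration of how the design matrix enters the basic objective, the following sketch evaluates the sum of squared correlations in (2) for given loading vectors. It is only a sketch under stated assumptions: the function name `total_squared_correlation`, the block sizes, and the random toy data are hypothetical, and no sparsity constraint is applied.

```python
import numpy as np

# Design matrix C of (3); block order: SNP, CNV, DM, GE, class label.
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

def total_squared_correlation(blocks, loadings, C):
    """Objective of (2): squared correlations of latent variables over linked pairs."""
    latents = [X @ a for X, a in zip(blocks, loadings)]   # v_j = X_j alpha_j
    total = 0.0
    for j in range(len(blocks)):
        for k in range(len(blocks)):
            if j != k and C[j, k]:
                num = (latents[j] @ latents[k]) ** 2
                den = (latents[j] @ latents[j]) * (latents[k] @ latents[k])
                total += C[j, k] * num / den
    return total

# Hypothetical toy data: five column-centred blocks on N = 50 samples.
rng = np.random.default_rng(0)
sizes = [8, 6, 7, 5, 1]
blocks = []
for p in sizes:
    X = rng.standard_normal((50, p))
    blocks.append(X - X.mean(axis=0))
loadings = [rng.standard_normal(p) for p in sizes]
value = total_squared_correlation(blocks, loadings, C)
```

Because each term is a squared correlation, the value lies between 0 and the number of nonzero entries of $\mathbf{C}$, and it does not change when a loading vector is rescaled; the constraint $\alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \alpha_j = 1$ only fixes that scale.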

We further consolidate the model by (i) introducing a weight matrix of the correlations to balance the model and (ii) providing discriminant analysis in the integrative genomic model. We also provide a solution procedure for the model, whereas SGCCA heuristically estimated the optimal solution by following Wold’s algorithm in the previous work [17].

2.2.1. Weight Matrix for the Balance of the Model. The weight matrix of the correlations between the blocks, $\mathbf{D} = \{d_{jk} \mid d_{jk} \in \mathbb{R},\ 1 \le j, k \le J\}$, is introduced in the model. In the original multiblock model the correlation between the gene expression and class label blocks tends to be overlooked. Instead, the sum of the squared pairwise correlations of $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ contributes a large portion. The correlation weight matrix $\mathbf{D}$ balances the total squared correlations. In this paper the weight matrix is defined as

$$
\mathbf{D} =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 3 \\
0 & 0 & 0 & 3 & 0
\end{bmatrix},
\tag{4}
$$

where the correlation between the gene expression and class label blocks is weighted three times more than the others. The matrix $\mathbf{D}$ then simply replaces the matrix $\mathbf{C}$.
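One way to read the weight of 3 in (4) is that it matches the number of genomic blocks linked to gene expression, so that the class link carries as much total weight as the three genomic links together. The sketch below builds the weight matrix from the design matrix under that reading; the rule itself is our assumption for illustration and is not stated explicitly in the text.

```python
import numpy as np

# Design matrix of (3); block order: SNP, CNV, DM, GE, class label.
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

GE, CLASS = 3, 4          # zero-based indices of the GE and class label blocks
genomic = [0, 1, 2]       # SNP, CNV, DM

# Assumed rule: weight the GE-class link by the number of genomic links to GE.
D = C.copy()
D[GE, CLASS] = D[CLASS, GE] = C[GE, genomic].sum()

genomic_weight = D[GE, genomic].sum()   # total weight of the SNP, CNV, DM links
class_weight = D[GE, CLASS]             # weight of the class label link
```

With these weights the discriminative term and the three genomic terms contribute on the same scale to the weighted sum of squared correlations.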

2.2.2. Discriminant Analysis. In the proposed integrative genomic model we need to find discriminative genes that characterize diseases. However, the integrative genomic model is comprised of combinations of multiple linear regression models. Thus discriminant analysis methods such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed on the basis of the linear regression model, and it has been shown to provide equal or superior performance compared to other discriminant methods [19]. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas Xiang et al. considered a multiclass problem and developed its sparse version with $\ell_{2,1}$-norm regularization, we reformulated the sparse method with elastic net penalization to suit our needs. In DLSR the slack variable is introduced into the ordinary linear regression problem:

$$\mathbf{X}\mathbf{a} = \mathbf{y} + \mathbf{b} \odot \mathbf{m}, \tag{5}$$

where $\mathbf{y}$ is a dependent variable ($y_i \in \{-1, 1\}$, $\mathbf{y} \in \mathbb{R}^N$), $\mathbf{X}$ is a multivariate independent variable ($\mathbf{X} \in \mathbb{R}^{N \times p}$), and $\mathbf{a}$ is a coefficient vector ($\mathbf{a} \in \mathbb{R}^p$). $\mathbf{b}$ is the direction of the class, whose element $b_i = -1$ if $y_i = -1$ and $b_i = 1$ otherwise ($\mathbf{b} \in \mathbb{R}^N$). The Hadamard product $\odot$ of the direction vector $\mathbf{b}$ and the slack variable vector $\mathbf{m}$ ($\mathbf{m} \in \mathbb{R}^N$) determines the distance between classes. The optimal solution will be covered in the next section.
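To make the role of the slack variables in (5) concrete, the sketch below computes, for a fixed coefficient vector, the slack that minimizes the squared residual of (5) when the slack is constrained to be nonnegative. The nonnegativity constraint follows the usual DLSR formulation and is an assumption here, as are the function name `optimal_slack` and the toy values; this is not the full MultiDA update.

```python
import numpy as np

def optimal_slack(X, a, y):
    """For fixed a, minimise ||X a - y - b * m||^2 over m >= 0, element-wise."""
    b = np.where(y == -1, -1.0, 1.0)     # class direction
    residual = X @ a - y
    # Since b_i^2 = 1, the minimiser per sample is max(b_i * r_i, 0).
    m = np.maximum(b * residual, 0.0)
    return b, m

# Hypothetical toy problem with one feature and four samples.
y = np.array([-1.0, -1.0, 1.0, 1.0])
X = np.array([[-2.0], [-0.5], [0.5], [3.0]])
a = np.array([1.0])
b, m = optimal_slack(X, a, y)
targets = y + b * m       # relaxed regression targets
```

Samples whose prediction already lies beyond their label on the correct side (the first and the last) receive a positive slack, so their targets move away from the class boundary instead of being pulled back toward $\pm 1$; the other targets stay at the labels.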

2.2.3. The Objective Function of MultiDA. We finally obtain the objective function of MultiDA:

$$
\begin{aligned}
\operatorname*{argmax}_{\alpha_j}\ & \sum_{j=1}^{J} \sum_{k=1,\, k \neq j}^{J} d_{jk}\, \frac{\left(\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k\right)^2}{\left(\alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j\right)\left(\alpha_k^{\top} \chi_k^{\top} \chi_k \alpha_k\right)} \\
\text{s.t.}\ & \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j = 1,\quad |\alpha_j| \le t_1,\quad \|\alpha_j\|^2 \le t_2,\quad j = 1, \ldots, J,
\end{aligned}
\tag{6}
$$

where $\chi_j$ is defined as

$$
\chi_j =
\begin{cases}
\mathbf{X}_j + \mathbf{b} \odot \mathbf{m} & \text{if } j = 5, \\
\mathbf{X}_j & \text{otherwise.}
\end{cases}
\tag{7}
$$

This setting enables one to perform discriminant analysis between the gene expression and disease blocks.
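The substitution in (7) can be sketched as below. The helper `augment_blocks` and the toy blocks are hypothetical; the sketch assumes that the class label block is stored as an $N \times 1$ column and uses a zero-based index for the fifth block.

```python
import numpy as np

def augment_blocks(blocks, b, m, class_index=4):
    """Return chi_j of (7): only the class label block receives the slack term."""
    chi = [X.copy() for X in blocks]
    chi[class_index] = blocks[class_index] + (b * m).reshape(-1, 1)
    return chi

# Hypothetical toy blocks: four constant genomic blocks and a label column.
y = np.array([-1.0, -1.0, 1.0, 1.0]).reshape(-1, 1)
blocks = [np.full((4, 2), float(j)) for j in range(4)] + [y]
b = y.ravel()                          # class direction
m = np.array([1.0, 0.0, 0.0, 2.0])     # illustrative nonnegative slack values
chi = augment_blocks(blocks, b, m)
```

The genomic blocks are passed through unchanged, so the slack variables affect the objective only through the link between gene expression and the class label block.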

2.3. Optimization. The optimal solution of (6) can be obtained from the Lagrangian function

$$
\begin{aligned}
\mathcal{L} = {} & -\sum_{j}^{J} \sum_{k=1,\, k \neq j}^{J} d_{jk} \left(\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k\right)^2 + \sum_{j}^{J} z_j \left(\alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1\right) \\
& + \sum_{j}^{J} \lambda_j \left|\alpha_j\right| + \sum_{j}^{J} \frac{(1 - \lambda_j)}{2} \left\|\alpha_j\right\|^2,
\end{aligned}
\tag{8}
$$

where $z_j$ and $\lambda_j$ are the Lagrangian multipliers. The Lagrangian function (8) is convex although not differentiable. Therefore the local optimum of (8) provides a global solution. The partial derivatives of the Lagrangian function with respect to $\alpha_j$ and $z_j$ are

$$
\frac{\partial \mathcal{L}}{\partial \alpha_j} = -\sum_{k}^{J} d_{jk} \left(\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k\right) \chi_j^{\top} \chi_k \alpha_k + z_j \chi_j^{\top} \chi_j \alpha_j + \lambda_j \mathbf{s}_j + (1 - \lambda_j)\, \alpha_j = 0,
\tag{9}
$$

$$
\frac{\partial \mathcal{L}}{\partial z_j} = \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1 = 0,
\tag{10}
$$

where $\mathbf{s}_j$ is the sign vector of $\alpha_j$. Although the stationary equations have no closed-form solutions, the optimal solution can be estimated by an iterative algorithm.

Equation (9) can be simplified with the inner component

$$
\upsilon_j = \sum_{k,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{\top}\chi_j^{\top}\chi_k\alpha_k \right) \chi_k\alpha_k.
\tag{11}
$$

Then, by introducing the inner component $\upsilon_j$ into (9), the solution of $\alpha_j$ can be written as

$$
\alpha_j = \left[ z_j \left( \chi_j^{\top}\chi_j + \frac{1 - \lambda_j}{z_j} \right) \right]^{-1} \left( \chi_j^{\top}\upsilon_j - \lambda_j \mathbf{s}_j \right).
\tag{12}
$$
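The step from (9) to (12) can be spelled out as follows, taking the sum in (9) over $k \neq j$ as in (11). The identity matrix $\mathbf{I}$ is written explicitly here to make the dimensions clear; it is a symbol introduced for this worked step and does not appear in the original equations.

$$
\begin{aligned}
-\chi_j^{\top}\upsilon_j + z_j \chi_j^{\top}\chi_j\alpha_j + \lambda_j \mathbf{s}_j + (1 - \lambda_j)\alpha_j &= 0 \\
\left( z_j \chi_j^{\top}\chi_j + (1 - \lambda_j)\mathbf{I} \right) \alpha_j &= \chi_j^{\top}\upsilon_j - \lambda_j \mathbf{s}_j
\end{aligned}
$$

Factoring $z_j$ out of the bracket on the left and inverting gives the form of (12).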

In (11), $(\alpha_j^{\top}\chi_j^{\top}\chi_k\alpha_k)$ is a squared correlation between the latent variables of the $j$th and $k$th blocks, which is a scalar. Therefore, the inner component is computed with $\alpha_j$ from the previous iteration, and a new $\alpha_j$ is then updated iteratively.
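The inner component in (11) is a weighted sum of the other blocks' latent variables. The following is a minimal sketch of that computation, assuming the blocks and loading vectors are stored as Python lists of NumPy arrays and the weights $d_{jk}$ as a matrix `D`; the name `inner_component` is hypothetical.

```python
import numpy as np

def inner_component(blocks, alphas, D, j):
    """Inner component upsilon_j of block j, following (11).

    Uses whatever loading vectors are currently stored in `alphas`,
    i.e. the values from the previous iteration.
    """
    v_j = blocks[j] @ alphas[j]                   # latent variable of block j
    upsilon = np.zeros(blocks[j].shape[0])
    for k, (X_k, a_k) in enumerate(zip(blocks, alphas)):
        if k == j:
            continue
        v_k = X_k @ a_k                           # latent variable of block k
        upsilon += D[j, k] * (v_j @ v_k) * v_k    # scalar weight times v_k
    return upsilon
```

Because the bracketed term is a scalar, the result is an $N$-dimensional vector regardless of the block widths $P_k$.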

Equation (12) is the normal equation of the regression of $\upsilon_j$ on $\chi_j$ with ridge and shrinkage parameters [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$
\alpha_j(\imath) = \operatorname{sign}\left( \chi_j^{\top}(\imath)\,\upsilon_j \right) \left( \left| \chi_j^{\top}(\imath)\,\upsilon_j \right| - \lambda_j \right)_{+},
\tag{13}
$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, $1$ if $x \ge 0$ or $-1$ otherwise, and $(x)_{+}$ returns only positive values of $x$ (i.e., $x$ if $x \ge 0$ or $0$ otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes mean squared errors. The parameter $z_j$ can be ignored because the solution of $\alpha_j$ is normalized by (10):

$$
\alpha_j = \frac{\sqrt{N}\,\alpha_j}{\left\| \chi_j\alpha_j \right\|}.
\tag{14}
$$
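The update in (13) followed by the normalization in (14) can be sketched as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and `np.sign` returns 0 at exactly 0 rather than 1, which makes no difference here because the thresholded magnitude is 0 in that case whenever $\lambda_j \ge 0$.

```python
import numpy as np

def soft_threshold(x, lam):
    """Univariate soft-thresholding: sign(x) * (|x| - lam)_+, elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def update_loading(chi_j, upsilon_j, lam_j):
    """One UST update (13) followed by the normalization (14)."""
    alpha = soft_threshold(chi_j.T @ upsilon_j, lam_j)
    norm = np.linalg.norm(chi_j @ alpha)
    if norm == 0.0:
        # Every coefficient was thresholded to zero; nothing to normalize.
        return alpha
    return np.sqrt(chi_j.shape[0]) * alpha / norm
```

After normalization, $\alpha_j^{\top}\chi_j^{\top}\chi_j\alpha_j = N$, so the latent variable $\chi_j\alpha_j$ has the same scale in every block.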

For the discriminant analysis between gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\alpha_4$ can be estimated by solving the following optimization problem:

$$
\begin{aligned}
\arg\min_{\alpha_4,\, \mathbf{m}} \; & \frac{1}{2} \left\| \chi_4\alpha_4 - (\upsilon_5 + \mathbf{b} \odot \mathbf{m}) \right\|^{2} \\
\text{s.t.} \; & |\alpha_4| \le \xi_1, \quad \|\alpha_4\|^{2} \le \xi_2.
\end{aligned}
\tag{15}
$$

The Lagrangian function of (15) is $\mathcal{L} = (1/2)\|\chi_4\alpha_4 - \upsilon_5 - \mathbf{b} \odot \mathbf{m}\|^{2} + \lambda_4|\alpha_4| + ((1 - \lambda_4)/2)\|\alpha_4\|^{2}$. The derivative of the Lagrangian function with respect to $\alpha_4$ is

$$
\frac{\partial \mathcal{L}}{\partial \alpha_4} = \chi_4^{\top}\chi_4\alpha_4 - \chi_4^{\top}\gamma + \lambda_4 \mathbf{s} + (1 - \lambda_4)\alpha_4 = 0,
\tag{16}
$$

where $\mathbf{s}$ is the sign of $\alpha_4$ and $\gamma = \upsilon_5 + \mathbf{b} \odot \mathbf{m}$. Thus, the equation of $\alpha_4$ becomes

$$
\alpha_4 = \left( \chi_4^{\top}\chi_4 + 1 - \lambda_4 \right)^{-1} \left( \chi_4^{\top}\gamma - \lambda_4 \mathbf{s} \right).
\tag{17}
$$

Finally, the optimal solution of $\alpha_4$ for the discriminative analysis is

$$
\alpha_4(\imath) = \operatorname{sign}\left( \chi_4^{\top}(\imath)\,\gamma \right) \left( \left| \chi_4^{\top}(\imath)\,\gamma \right| - \lambda_4 \right)_{+}.
\tag{18}
$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes mean squared errors, like the other $\lambda_j$'s. The optimal solutions of $\mathbf{m}$ are simply derived from [19]:

$$
\mathbf{m} = \max\left( \mathbf{b} \odot (\chi_4\alpha_4 - \upsilon_5),\, 0 \right).
\tag{19}
$$

The brief algorithm is described in Algorithm 1. In the algorithm, $r$ represents a rank of the subspace, which determines the dimension of the subspace. For instance, $\alpha_j^{r}$ is the $r$th rank of $\alpha_j$. MultiDA optimizes the first rank subspace and iterates the optimization until the multiblock has no information. In lines 10–14 of Algorithm 1, Wold's procedure guarantees the convergence [22].

(1) For all blocks, normalize loading vectors: $\alpha_j^{0} = \sqrt{N}\,\alpha_j^{0} / \|\chi_j\alpha_j^{0}\|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\alpha_j$ by (18) and (19)
(8)         update $\chi_k = \mathbf{X}_k + \mathbf{b} \odot \mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{r\top}\chi_j^{r\top}\chi_k^{r}\alpha_k^{r+1} \right) \chi_k^{r}\alpha_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{r\top}\chi_j^{r\top}\chi_k^{r}\alpha_k^{r} \right) \chi_k^{r}\alpha_k^{r}$
(14)      end if
(15)      Compute $\alpha_j^{r+1}$ by UST: $\alpha_j^{r+1}(\imath) = \operatorname{sign}\left( \chi_j^{\top}(\imath)\,\upsilon_j \right) \left( \left| \chi_j^{\top}(\imath)\,\upsilon_j \right| - \lambda_j \right)_{+}$
(16)      Normalize $\alpha_j^{r+1}$: $\alpha_j^{r+1} = \sqrt{N}\,\alpha_j^{r+1} / \|\chi_j\alpha_j^{r+1}\|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J} \alpha_j^{r}$ converges

Algorithm 1: Discriminant multiblock analysis.
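To make the flow of Algorithm 1 concrete, the following is a simplified sketch of the iteration for a single component. It is not the authors' implementation and makes several assumptions: the discriminant step (lines 6 to 9) is omitted, the convergence check uses the largest change in any loading coefficient, the all-ones initialization is arbitrary, and all function names are hypothetical. Blocks with a smaller index are updated first, so their new loadings are used when later blocks are processed, as in lines 10 to 14.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def normalize(chi, alpha):
    norm = np.linalg.norm(chi @ alpha)
    return alpha if norm == 0.0 else np.sqrt(chi.shape[0]) * alpha / norm

def fit_first_component(blocks, D, lams, max_iter=100, tol=1e-8):
    """Iterate the UST update and normalization over all blocks until stable."""
    alphas = [normalize(X, np.ones(X.shape[1])) for X in blocks]
    for _ in range(max_iter):
        previous = [a.copy() for a in alphas]
        for j, X_j in enumerate(blocks):
            v_j = X_j @ alphas[j]
            upsilon = np.zeros(X_j.shape[0])
            for k, X_k in enumerate(blocks):
                if k != j and D[j, k] != 0:
                    v_k = X_k @ alphas[k]     # already updated when k < j
                    upsilon += D[j, k] * (v_j @ v_k) * v_k
            alphas[j] = normalize(X_j, soft_threshold(X_j.T @ upsilon, lams[j]))
        change = max(np.max(np.abs(a - p)) for a, p in zip(alphas, previous))
        if change < tol:
            break
    return alphas
```

Note that the scale of `X_j.T @ upsilon` grows with $N$, so the thresholds in `lams` have to be chosen relative to that scale, for example by the cross-validation described above.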

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression against normal patients.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, through simulated data. Simulation data of various complexities were considered. The generation schemes of the simulation data for the assessment were extended from previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. $\text{Type}_1(\mu)$ generates $p$-dimensional normally distributed random variables of a given mean ($\mu$) and a variance ($\mathbf{I}_{p \times p}$), where $\mathbf{I}_{p \times p}$ is a $p \times p$ identity matrix. $\text{Type}_2(\mu, \delta)$ generates more complicated data than $\text{Type}_1(\mu)$. In $\text{Type}_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $1_{\delta}$. Given a uniformly distributed random value ($u$), $1_{\delta} = 1$ if $u \le \delta$ or $0$ otherwise. $\text{Type}_3(\mu, \rho)$ considers multicollinearity data in which more than two variables are highly correlated. The matrix data are generated by the multivariate normal distribution $\mathcal{N}(\mu, \Sigma_{p \times p})$. The covariance structure $\Sigma_{p \times p}$ is built by the first-order autoregressive process. $\text{Type}_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables from a given mean ($\mu$) and a variance ($\sigma$).

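Table 1: Generation functions.

| Function | Model |
|---|---|
| $\text{Type}_1(\mu)$ | $\mathbf{x} = \mu + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\text{Type}_2(\mu, \delta)$ | $\mathbf{x} = \mu + 1_{\delta} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\text{Type}_3(\mu, \rho)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma_{p \times p})$ |
| $\text{Type}_4(\mu, \sigma)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \sigma \mathbf{I}_{p \times p})$ |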
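The four generation functions in Table 1 could be sketched in NumPy as follows. Several details are assumptions made for illustration, since the text does not fix them: the indicator $1_{\delta}$ is drawn independently for every entry, the first-order autoregressive covariance is taken to be $\Sigma_{ab} = \rho^{|a - b|}$ with unit variances, and the function names and the `rng` argument are hypothetical.

```python
import numpy as np

def type1(rng, n, p, mu):
    """x = mu + eps, eps ~ N(0, I)."""
    return mu + rng.standard_normal((n, p))

def type2(rng, n, p, mu, delta):
    """x = mu + 1_delta + eps, where 1_delta = 1 if u <= delta for uniform u."""
    indicator = (rng.uniform(size=(n, p)) <= delta).astype(float)
    return mu + indicator + rng.standard_normal((n, p))

def type3(rng, n, p, mu, rho):
    """x ~ N(mu, Sigma), with an AR(1) covariance Sigma[a, b] = rho**|a - b|."""
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.full(p, mu), cov, size=n)

def type4(rng, n, p, mu, sigma):
    """x ~ N(mu, sigma * I), where sigma is a variance."""
    return mu + np.sqrt(sigma) * rng.standard_normal((n, p))
```

A block is then assembled by stacking the outputs column-wise, for example with `np.hstack`, following the column ranges of the simulation scheme.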

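Table 2: Scheme of the simulation data.

| Simulation data | Generation model type | Column index |
|---|---|---|
| $\mathbf{X}_1$ | $\mathbf{x}_i = \text{Type}_1(2.4)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(-2.6)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_i = \text{Type}_2(1, 0.6)$ | $11 \le \imath \le 40$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.8)$ | $41 \le \imath \le 100$ |
| $\mathbf{X}_2$ | $\mathbf{x}_i = \text{Type}_1(3)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(4)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.9)$ | $11 \le \imath \le 60$ |
| | $\mathbf{x}_i = \text{Type}_4(2, 2)$ | $61 \le \imath \le 200$ |
| $\mathbf{X}_3$ | $\mathbf{x}_i = \text{Type}_1(5)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_i = \text{Type}_1(-3)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_i = \text{Type}_4(0, 1)$ | $11 \le \imath \le 210$ |
| | $\mathbf{x}_i = \text{Type}_3(0, 0.9)$ | $211 \le \imath \le 300$ |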

The first three multiblocks ($\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by $\text{Type}_1(2.4)$ and the following five columns by $\text{Type}_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, $\text{Type}_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated by the multicollinearity random variables $\text{Type}_3(0, 0.8)$. Then we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3} \mathbf{X}_j \mathbf{B}_j + \Xi$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\Xi$ is an $N \times P_4$ dimensional normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant to explain $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4 \times 1}$ (all zeros but the first ten), the probability of disease $\pi$ was computed by using

$$
\pi = \frac{\exp(\mathbf{X}_4\mathbf{B}_4)}{1 + \exp(\mathbf{X}_4\mathbf{B}_4)}.
\tag{20}
$$

Then the binary class label block was generated using the Bernoulli distribution with the probability $\pi$.
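The generation of the gene expression block and the class labels described above can be sketched as follows. The text does not state how the nonzero coefficients or the noise scale were chosen, so drawing them from a standard normal distribution, the `noise_sd` parameter, and the function name are all assumptions made for illustration. The probability in (20) is evaluated through the equivalent form $\tfrac{1}{2}(1 + \tanh(x/2))$ to avoid overflow for large linear predictors.

```python
import numpy as np

def simulate_ge_and_labels(rng, blocks, p4=50, n_sig=10, noise_sd=1.0):
    """Simulate X4 = sum_j X_j B_j + noise, then Bernoulli labels via (20)."""
    n = blocks[0].shape[0]
    X4 = noise_sd * rng.standard_normal((n, p4))        # noise matrix, N x P4
    for X in blocks:
        B = np.zeros((X.shape[1], p4))
        B[:n_sig, :] = rng.standard_normal((n_sig, p4)) # first ten rows nonzero
        X4 = X4 + X @ B
    B4 = np.zeros(p4)
    B4[:n_sig] = rng.standard_normal(n_sig)             # all zeros but the first ten
    logits = X4 @ B4
    pi = 0.5 * (1.0 + np.tanh(0.5 * logits))            # same value as (20)
    y = rng.binomial(1, pi)                             # Bernoulli draw per sample
    return X4, pi, y
```

Only the first `n_sig` rows of each `B` are nonzero, which is what makes the first ten variables of every block the ground-truth significant factors.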

The simulation study was examined with 50 replications to assess the reproducibility. We compared the performance of MultiDA with the related methods Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between the independent variable $\mathbf{X}$ and the response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3]$) and the block GE was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined the performance by how well the methods correctly identify significant factors of the integrative association model. Given a ground truth, we computed a confusion matrix and measured True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the number of true negatives is much larger than the number of false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA (0.93 ± 0.03) and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This suggests that the multiblock methods reduce false negatives, that is, significant factors incorrectly identified as insignificant. MultiDA showed the best performance in PPV and ACCU, producing 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values represent fewer false positives, that is, insignificant factors incorrectly identified as significant. The PPV and ACCU were 0.48 ± 0.15 and 0.89 ± 0.14 for SCCA and 0.54 ± 0.08 and 0.94 ± 0.01 for SGCCA, respectively.
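The three measures can be computed from a selection mask and a ground-truth mask as sketched below. The function name and the boolean-mask representation are assumptions for illustration; a variable counts as selected when its loading coefficient is nonzero.

```python
import numpy as np

def selection_metrics(selected, truth):
    """Return (TPR, PPV, accuracy) for boolean selection and ground-truth masks."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(selected & truth)        # significant and selected
    fp = np.sum(selected & ~truth)       # insignificant but selected
    fn = np.sum(~selected & truth)       # significant but missed
    tn = np.sum(~selected & ~truth)      # insignificant and not selected
    tpr = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    accuracy = (tp + tn) / selected.size
    return tpr, ppv, accuracy
```

For instance, with 100 variables of which the first 10 are truly significant, selecting 9 of those 10 plus 6 insignificant ones gives a TPR of 0.9, a PPV of 0.6, and an accuracy of 0.93.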

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained for three major psychiatric disorders, schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as for a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 samples of control were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in the preparation of this experiment. For each individual, 10760 SNPs (after removing highly correlated ones), 1028 CNVs, 20769 DNA methylations, and 19767 gene expressions were examined. Because recent research reported that genetic effects may be largely shared in major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminative factors against the control [25, 26].

The multiblock data was analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, where the high correlation between the connections was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the result of CNV and the DNA methylation were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of the gene and literature regarding the psychiatric disorders are described.

Table 3: The gene results from MultiDA with psychiatric disorders.

| Gene | Chromosome | Location | Source | ID | MAF | Reference |
|---|---|---|---|---|---|---|
| HTR7 | 10 | 10q21-q24 | GE | 7934970 | | [28] |
| APOE | 19 | 19q13.2 | DM | cg14123992 | | [29] |
| TRPM1 | 15 | 15q13.3 | DM | cg18085517 | | |
| EPHB1 | 3 | 3q21-q23 | CNV | CNP12652 | | |
| NPY | 7 | 7p15.1 | CNV | CNP2267 | | [30] |
| QKI | 6 | 6q26 | SNP | rs1336225 | 0.18 | |
| SLC15A1 | 13 | 13q32.3 | SNP | rs9517421 | 0.17 | [31] |
| NPAS3 | 14 | 14q13.1 | SNP | rs1124910 | 0.25 | [32] |
| C15orf53 | 15 | 15q14 | SNP | rs1433876 | 0.29 | [33] |

Figure 2: Performance comparison in simulation study, with one panel per measure comparing SCCA, SGCCA, and MultiDA. (a) True Positive Rate. (b) Positive Predictive Value. (c) Accuracy.

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among a number of the retrieved interactions, we take note of one gene regulatory network, illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation, impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) encodes a receptor for a major neurotransmitter in the central nervous system, and a number of studies relating it to bipolar disorder and schizophrenia have been reported [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while previous research reported the gene with GWAS on the SNP data block. The gene may have strong interactions with other heterogeneous data and is consequently considered to be significant in the integrative model, which supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction; the NPY gene also encodes a neurotransmitter in the brain and is known to play an important role in the emotional process [30]. A large number of psychiatric disorder susceptibility genes were associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may potentially be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and that it is indirectly associated with susceptibility to human mood disorders. Our result supports their claim.

Figure 3: The gene regulatory network searched with the gene results by STRING database. The legend shows the data source of the gene (gene expression, SNP, CNV, or DNA methylation). Nodes shown: CES1, HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, and AKR1D1.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis method in order to dissect the mechanism of complex human disease using multiple genetic data. A genomic association study with a single type of data may fall short of identifying the mechanisms of the diseases. On the other hand, MultiDA enables comprehensive analysis using multiple genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it detects discriminative factors in the integrative genomic model. The simulation experiments support the performance of the proposed method. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. Among the large number of variables of each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures the global profile of the mechanism that conventional single- or two-block methods fail to detect. This promising tool for the integrative genomic study can provide flexible extensibility for new types of data in an era in which new high-throughput technologies keep superseding older ones.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, "Copy number variants, diseases and gene expression," Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, "Revealing the architecture of gene regulation: the promise of eQTL studies," Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, "Epigenetic inheritance and the missing heritability problem," Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk et al., "Copy number variation: new insights in genome diversity," Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, "Human copy number variation and complex genetic disease," Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, "Cancer epigenetics: modifications, screening, and therapy," Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, "DNA methylation and its basic function," Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer et al., "Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors," Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, "The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts," Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas et al., "Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations," PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson et al., "An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits," PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets," Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, "Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma," American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data," Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


In addition to SNP, Copy Number Variation (CNV) and DNA methylation (DM) have also been highlighted as key factors that affect gene expression regulation. CNV is a structural alteration of DNA in which specific regions of the genome are deleted or duplicated on chromosomes. Although CNV is frequently observed even in healthy individuals, it is hypothesized that the variants may cause diseases by directly affecting gene dosage and gene expression [5, 6]. Specifically, whole-genome association studies of the relationship between CNV and diseases reported that gene expression levels in CNV regions are strongly related to the deletion or duplication of the regions [6]. Typically, the deletion of either particular regions within a gene or regulatory regions of a gene may result in lower gene expression than normal. DM is an epigenetic modification that occurs through the addition of a methyl group to the cytosine or adenine of DNA. DM inhibits transcription of genes with high levels of 5-methylcytosine in their promoter region or recruits proteins, such as histone deacetylases, that can modify histones [7, 8]. DM consequently changes gene expression levels even for the same DNA bases.

Thus, recent research has actively extended GWAS and eQTL mapping studies to integrative association studies with multiple types of genomic data. Most integrative genomic research focuses on identifying genetic, epigenetic, or posttranscriptional factors that control gene expression regulation (or microRNA) by considering the complex interactions of SNP, CNV, and DM [9–11]. Specifically, the Cancer Genome Atlas [9] conducted large-scale multidimensional analysis with SNP, CNV, DM, and GE to provide comprehensive genomic characterizations for brain cancer. In Aure et al.'s work [10], the combination effects of CNV and DM were examined to identify the association with alterations of miRNA expression in breast tumors. Wagner et al. [11] studied the relationship between SNP, DM, and GE via multiple eQTL analysis.

Most of the integration approaches have used step-by-step processes. Ordinarily, these approaches filter candidate markers by using statistical techniques at the first step and find the final markers that satisfy certain criteria at the remaining stages [12–15]. This type of integration method often accumulates "type II errors" at each step; that is, it fails to find informative markers by incorrectly identifying them as insignificant. Moreover, such methods do not consider interaction effects of the multiblock data, so the underlying mechanism is not considered.

Hence, research has recently started to shift toward approaches using systematic models in order to integrate and analyze the heterogeneous data comprehensively, rather than through simple step-wise processes [16–18]. Multiblock methods of Partial Least Squares (PLS) and Generalized Canonical Correlation Analysis (GCCA) are representative methods. A sparse version of PLS was proposed that penalizes both feature and sample dimensions to identify "regulatory modules" [16]. Such PLS-based methods, which maximize the covariance between latent variables, often fail to detect significant factors when their intensities are weak. Furthermore, the method does not consider the discriminant analysis of the disease. A sparse multiblock analysis method derived from Generalized Canonical Correlation Analysis (SGCCA) was developed to identify multiblock association models while considering the relationship between the different data blocks, such as cis-regulated mutations [17]. This work builds a hybrid model by combining both GWAS and eQTL models rather than a multiblock integration model. A data integration approach was also suggested that utilizes multiple feature selection methods such as Principal Component Analysis (PCA), PLS, and LASSO [18]. The authors extracted the important factors using the dimension reduction and feature selection methods and applied them to Cox survival models. However, combination effects of the multiblock data were ignored in this approach.

To tackle these limitations, we propose a novel Multiblock Discriminant Analysis (MultiDA) method for the integrative genomic study. The proposed method, MultiDA, makes the following main contributions:

(i) A new integrative genomic model for the discriminant analysis is introduced by exploiting class information.

(ii) A sophisticated optimal solution is developed to solve the discriminant analysis problem in the integrative genomic model.

First, we built a novel integrative genomic model for the discriminant analysis. The class data is considered as one block, and the total squared correlation including the class block is maximized. The introduction of the class block to the multiblock model enables us to perform discriminant analysis in the integrative genomic model. Secondly, we propose a sophisticated method to solve the discriminant analysis problem in the new integrative genomic model. The discriminant analysis is essential in identifying biomarkers of human diseases in computational biology. Regardless, it has been overlooked in the multiblock analysis. The efficient algorithm for the discriminant analysis and the assessment of its performance are explored in this paper.

2. Methods

2.1. Notation. We suppose that there are $J$ multiblock data. The multiblock data are measured on $N$ numbers of the same set of observations. A block consists of a group of features that share common properties or represent one aspect of the sample. The multiblock data is denoted by $\mathbf{X} = \{\mathbf{X}_1, \ldots, \mathbf{X}_J\}$. The $j$th block data $\mathbf{X}_j$ is $P_j$-dimensional zero mean column vectors, $\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$. A matrix $\mathbf{C} = \{c_{jk} \mid c_{jk} \in \{0, 1\},\ 1 \le j, k \le J\}$ is a binary matrix that determines the linkage between the multiblock, where $c_{jk} = 1$ if the block $j$ and the block $k$ are connected or $0$ otherwise. In the proposed integrative genomic model, SNP, CNV, DM, GE, and class label (case or control) of the samples are considered as the multiblock components. For simplicity, $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ represent SNP, CNV, DM, GE, and class label, respectively. Throughout this paper, we use $i$ for the index of the sample and $j$, $k$ for the multiblock. $(\imath)$ is used to denote a column vector of a matrix or an element of a vector. For instance, $\mathbf{X}_i(\imath)$ and $\mathbf{a}_i(\imath)$ represent the $\imath$th column vector of the matrix $\mathbf{X}_i$ and the $\imath$th element of the vector $\mathbf{a}_i$, respectively. Figure 1 illustrates the conceptual overview of the multiblock data and framework.

Figure 1: The conceptual graphic representation of the integrative genomic model. A rectangle represents a manipulated variable and a circle represents a latent variable. The graphic representation illustrates the structure model that shows the relationship between SNP, CNV, DNA methylation, gene expression, and disease phenotype. The blocks are labeled $\mathbf{X}_1$ to $\mathbf{X}_5$ and the latent variables $\mathbf{v}_1$ to $\mathbf{v}_5$; the link between gene expression and disease is marked as the discriminant analysis.

2.2. Multiblock Discriminant Analysis. Multiblock Discriminant Analysis (MultiDA) builds a sparse association model by not only maximizing the total squared correlations between the blocks but also taking into account the discriminative factors in the model. MultiDA considers a linear subspace, which is a low-dimensional basis of the data. The linear subspaces of the blocks that maximize the total squared correlations identify the significant factors of the association model under sparsity regularization. The linear subspace (or latent variable) $\mathbf{v}_j$ of the $j$th block is represented by

$$\mathbf{v}_j = \mathbf{X}_j \alpha_j, \tag{1}$$

where $\alpha_j$ is a loading vector. We then introduce sparse regularization (elastic net penalization) on the loading vector to reduce the chance of including insignificant variables and to improve interpretability. Sparse regularization is especially advantageous when the number of features is much larger than the number of samples ($N \ll P_j$). Therefore, the basic objective function can be represented as

$$\arg\max_{\alpha_j} \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} c_{jk} \frac{\alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_k \alpha_k \, \alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_k \alpha_k}{\alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \alpha_j \, \alpha_k^{\top} \mathbf{X}_k^{\top} \mathbf{X}_k \alpha_k}$$

$$\text{s.t.} \quad \alpha_j^{\top} \mathbf{X}_j^{\top} \mathbf{X}_j \alpha_j = 1, \quad |\alpha_j| \le t_1, \quad \|\alpha_j\|_2 \le t_2, \quad j = 1, \ldots, J, \tag{2}$$

where $|\cdot|$ and $\|\cdot\|_2$ represent the $\ell_1$-norm and $\ell_2$-norm of the vectors, respectively, and $t_1$ and $t_2$ are the shrinkage parameters that determine the sparsity. Note that the basic objective function is equivalent to the Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. Since the integrative genomic model aims to represent gene expression regulated by the combinations of SNP, CNV, and DM, the matrix $\mathbf{C}$ can be defined as

$$\mathbf{C} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}. \tag{3}$$
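As an illustration of how the latent variables in (1) and the design matrix in (3) enter the objective in (2), the following is a minimal NumPy sketch. It only evaluates the unpenalized sum of squared correlations for given loading vectors; the function name, the block sizes, and the random toy data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

# Design matrix C from Eq. (3): SNP, CNV, DM, and the class block each link to GE (block 4).
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

def total_squared_correlation(blocks, loadings, design):
    """Sum, over linked block pairs, of the squared correlation of their latent variables."""
    latents = [X @ a for X, a in zip(blocks, loadings)]  # v_j = X_j alpha_j, Eq. (1)
    total = 0.0
    for j, v_j in enumerate(latents):
        for k, v_k in enumerate(latents):
            if j == k or design[j, k] == 0:
                continue
            total += design[j, k] * (v_j @ v_k) ** 2 / ((v_j @ v_j) * (v_k @ v_k))
    return total

# Toy data (hypothetical sizes): 20 samples and five column-centered blocks.
rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3, 1]
blocks = [rng.standard_normal((20, p)) for p in sizes]
blocks = [X - X.mean(axis=0) for X in blocks]
loadings = [rng.standard_normal(p) for p in sizes]
value = total_squared_correlation(blocks, loadings, C)
print(value)
```

Each squared correlation lies between 0 and 1, so the value is bounded above by the sum of the entries of the design matrix.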

We further consolidate the model by (1) introducing a weight matrix of the correlation to balance the model and (2) providing discriminant analysis in the integrative genomic model. We also derive an explicit iterative solution of the model, whereas SGCCA heuristically estimates the optimal solution by following Wold's algorithm in the previous work [17].

2.2.1. Weight Matrix for the Balance of the Model. The weight matrix of the correlation between the blocks, $\mathbf{D} = \{d_{jk} \mid d_{jk} \in \mathbb{R},\ 1 \le j, k \le J\}$, is introduced into the model. In the original multiblock model, the correlation between the gene expression and class label blocks tends to be overlooked. Instead, the sum of the squared pairwise correlations of $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ contributes a large portion of the objective. The correlation weight matrix $\mathbf{D}$ balances the total squared correlations. In this paper, the weight matrix is defined as

$$\mathbf{D} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 3 \\ 0 & 0 & 0 & 3 & 0 \end{bmatrix}, \tag{4}$$

where the correlation between the gene expression and class label blocks is weighted three times more heavily than the others. The matrix $\mathbf{D}$ then simply replaces the matrix $\mathbf{C}$.
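The balancing effect of the weights in (4) can be checked with a few lines of arithmetic. The sketch below compares the share of the total link weight carried by the gene expression and class label pair under the binary design matrix and under the weight matrix; the helper name `class_link_share` is an assumption made for this example.

```python
import numpy as np

# Binary design matrix from Eq. (3).
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

# Weight matrix from Eq. (4): the GE-class link (blocks 4 and 5) is weighted by 3.
D = C.copy()
D[3, 4] = D[4, 3] = 3

def class_link_share(weights):
    """Fraction of the total link weight carried by the GE-class pair."""
    return (weights[3, 4] + weights[4, 3]) / weights.sum()

print(class_link_share(C))  # 0.25
print(class_link_share(D))  # 0.5
```

With the binary matrix, the three genomic links together outweigh the single class link three to one; with the weights in (4), the two groups carry equal total weight.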

2.2.2. Discriminant Analysis. In the proposed integrative genomic model, we need to find discriminative genes that characterize diseases. However, the integrative genomic model is composed of combinations of multiple linear regression models. Thus, discriminant analysis methods such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded directly into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed based on the linear regression model, and it was reported to provide equal or superior performance compared to other discriminant methods. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas they considered a multiclass problem and developed a sparse version with $\ell_{2,1}$-norm regularization, we reformulated the sparse method with elastic net penalization to suit our model. In DLSR, the slack variable is introduced into the ordinary linear regression problem:

$$\mathbf{X}\mathbf{a} = \mathbf{y} + \mathbf{b} \odot \mathbf{m}, \tag{5}$$

where $\mathbf{y}$ is a dependent variable ($y_i \in \{-1, 1\}$, $\mathbf{y} \in \mathbb{R}^N$), $\mathbf{X}$ is a multivariate independent variable ($\mathbf{X} \in \mathbb{R}^{N \times p}$), and $\mathbf{a}$ is a coefficient vector ($\mathbf{a} \in \mathbb{R}^p$). $\mathbf{b}$ is the direction of the class, whose element $b_i = -1$ if $y_i = -1$ and $1$ otherwise ($\mathbf{b} \in \mathbb{R}^N$). The Hadamard product $\odot$ of the direction vector $\mathbf{b}$ and the slack variable vector $\mathbf{m}$ ($\mathbf{m} \in \mathbb{R}^N$) determines the distance between classes. The optimal solution is covered in the next section.
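The effect of the slack term in (5) can be seen on a small example. The sketch below builds the relaxed regression targets $\mathbf{y} + \mathbf{b} \odot \mathbf{m}$ for a handful of samples; the function name and the numeric values are assumptions chosen for illustration, and the slack values are taken to be non-negative, as in the update rule (19).

```python
import numpy as np

def relaxed_targets(y, m):
    """Return y + b * m, where b holds the class direction (-1 or +1) of each sample."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    if np.any(m < 0):
        raise ValueError("slack variables are assumed to be non-negative")
    b = np.where(y == -1, -1.0, 1.0)  # direction vector b from Eq. (5)
    return y + b * m

y = [-1, 1, 1, -1]
m = [0.5, 0.0, 2.0, 1.0]
targets = relaxed_targets(y, m)
print(targets)  # the targets become -1.5, 1, 3, -2
```

Negative-class targets can only move further below -1 and positive-class targets further above 1, which is how the slack enlarges the distance between the classes.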

2.2.3. The Objective Function of MultiDA. We finally obtain the objective function of MultiDA:

$$\arg\max_{\alpha_j} \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk} \frac{\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k}{\alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j \, \alpha_k^{\top} \chi_k^{\top} \chi_k \alpha_k}$$

$$\text{s.t.} \quad \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j = 1, \quad |\alpha_j| \le t_1, \quad \|\alpha_j\|_2 \le t_2, \quad j = 1, \ldots, J, \tag{6}$$

where $\chi_j$ is defined as

$$\chi_j = \begin{cases} \mathbf{X}_j + \mathbf{b} \odot \mathbf{m} & \text{if } j = 5, \\ \mathbf{X}_j & \text{otherwise.} \end{cases} \tag{7}$$

This setting enables one to perform discriminant analysis between the gene expression and disease blocks.

2.3. Optimization. The optimal solution of (6) can be obtained from the Lagrangian function

$$\mathcal{L} = -\sum_{j}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk}\, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k + \sum_{j}^{J} z_j \left( \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1 \right) + \sum_{j}^{J} \lambda_j |\alpha_j| + \sum_{j}^{J} \frac{(1 - \lambda_j)}{2} \|\alpha_j\|_2^2, \tag{8}$$

where $z_j$ and $\lambda_j$ are the Lagrangian multipliers. The Lagrangian function (8) is convex, although not differentiable. Therefore, the local optimum of (8) provides a global solution. The stationarity conditions of the Lagrangian function with respect to $\alpha_j$ and $z_j$ are

$$\frac{\partial \mathcal{L}}{\partial \alpha_j} = -\sum_{k}^{J} d_{jk} \left( \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \right) \chi_j^{\top} \chi_k \alpha_k + z_j \chi_j^{\top} \chi_j \alpha_j + \lambda_j \mathbf{s}_j + (1 - \lambda_j) \alpha_j = 0, \tag{9}$$

$$\frac{\partial \mathcal{L}}{\partial z_j} = \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1 = 0, \tag{10}$$

where $\mathbf{s}_j$ is the vector of signs of $\alpha_j$. Although the stationary equations have no closed-form solutions, the optimal solution can be estimated by an iterative algorithm.

We can simplify (9) with the inner component

$$\upsilon_j = \sum_{k,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \right) \chi_k \alpha_k. \tag{11}$$

Then, by introducing the inner component $\upsilon_j$ into (9), the solution of $\alpha_j$ can be written as

$$\alpha_j = \left[ z_j \left( \chi_j^{\top} \chi_j + \frac{1 - \lambda_j}{z_j} \mathbf{I} \right) \right]^{-1} \left( \chi_j^{\top} \upsilon_j - \lambda_j \mathbf{s}_j \right). \tag{12}$$

In (11), $(\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k)$ is the correlation between the latent variables of the $j$th and $k$th blocks, which is a scalar. Therefore, the inner component is computed from the $\alpha_j$ of the previous iteration, and a new $\alpha_j$ is then updated at each iteration. Equation (12) is the normal equation of the regression of $\upsilon_j$ on $\chi_j$ with ridge and shrinkage parameters [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$\alpha_{j(\imath)} = \operatorname{sign}\left( \chi_{j(\imath)}^{\top} \upsilon_j \right) \left( \left| \chi_{j(\imath)}^{\top} \upsilon_j \right| - \lambda_j \right)_{+}, \tag{13}$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, $1$ if $x \ge 0$ and $-1$ otherwise, and $(x)_{+}$ returns only the positive part of $x$ (i.e., $x$ if $x \ge 0$ and $0$ otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes the mean squared error. The parameter $z_j$ can be ignored because the solution of $\alpha_j$ is normalized by (10):

$$\alpha_j = \frac{\sqrt{N}\, \alpha_j}{\left\| \chi_j \alpha_j \right\|}. \tag{14}$$
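The update in (13) followed by the normalization in (14) can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' implementation: the function names and the toy inputs are assumptions, and `np.sign` returns 0 at zero (unlike the sign convention above), which makes no difference here because the thresholded magnitude is also zero in that case for a non-negative threshold.

```python
import numpy as np

def soft_threshold_update(chi, upsilon, lam):
    """Univariate soft-thresholding of Eq. (13), applied to every column of chi at once."""
    z = chi.T @ upsilon  # one inner product per column of chi
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def normalize_loading(chi, alpha):
    """Rescale alpha as in Eq. (14) so that the latent variable chi @ alpha has norm sqrt(N)."""
    scale = np.linalg.norm(chi @ alpha)
    if scale == 0.0:
        return alpha  # every coefficient was thresholded to zero; nothing to rescale
    return np.sqrt(chi.shape[0]) * alpha / scale

# Toy example (hypothetical values): an identity block keeps the arithmetic easy to follow.
chi = np.eye(3)
upsilon = np.array([3.0, -0.5, -2.0])
alpha = soft_threshold_update(chi, upsilon, lam=1.0)
alpha_normalized = normalize_loading(chi, alpha)
print(alpha)  # the small middle coefficient is set to zero
```

Coefficients whose inner product with the inner component is smaller in magnitude than the threshold are removed, which is what yields sparse loading vectors.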

For the discriminant analysis between the gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\alpha_4$ can be estimated by solving the following optimization problem:

$$\arg\min_{\alpha_4,\, \mathbf{m}} \frac{1}{2} \left\| \chi_4 \alpha_4 - (\upsilon_5 + \mathbf{b} \odot \mathbf{m}) \right\|^2 \quad \text{s.t.} \quad |\alpha_4| \le \xi_1, \quad \|\alpha_4\|_2 \le \xi_2. \tag{15}$$

The Lagrangian function of (15) is $\mathcal{L} = \frac{1}{2} \| \chi_4 \alpha_4 - \upsilon_5 - \mathbf{b} \odot \mathbf{m} \|^2 + \lambda_4 |\alpha_4| + \frac{(1 - \lambda_4)}{2} \|\alpha_4\|^2$. The derivative of the Lagrangian function with respect to $\alpha_4$ is

$$\frac{\partial \mathcal{L}}{\partial \alpha_4} = \chi_4^{\top} \chi_4 \alpha_4 - \chi_4^{\top} \gamma + \lambda_4 \mathbf{s} + (1 - \lambda_4) \alpha_4 = 0, \tag{16}$$

where $\mathbf{s}$ is the sign of $\alpha_4$ and $\gamma = \upsilon_5 + \mathbf{b} \odot \mathbf{m}$. Thus, the equation of $\alpha_4$ becomes

$$\alpha_4 = \left( \chi_4^{\top} \chi_4 + (1 - \lambda_4) \mathbf{I} \right)^{-1} \left( \chi_4^{\top} \gamma - \lambda_4 \mathbf{s} \right). \tag{17}$$

Finally, the optimal solution of $\alpha_4$ for the discriminative analysis is

$$\alpha_{4(\imath)} = \operatorname{sign}\left( \chi_{4(\imath)}^{\top} \gamma \right) \left( \left| \chi_{4(\imath)}^{\top} \gamma \right| - \lambda_4 \right)_{+}. \tag{18}$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes the mean squared error, like the other $\lambda_j$'s. The optimal solution of $\mathbf{m}$ is derived from [19]:

$$\mathbf{m} = \max\left( \mathbf{b} \odot (\chi_4 \alpha_4 - \upsilon_5),\ 0 \right). \tag{19}$$
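The slack update in (19) is an elementwise operation and can be sketched as follows. The function name and the numeric values are assumptions made for this example; `fitted` stands for $\chi_4 \alpha_4$.

```python
import numpy as np

def update_slack(b, fitted, upsilon5):
    """Eq. (19): keep the part of the residual that points in the class direction b."""
    b = np.asarray(b, dtype=float)
    residual = np.asarray(fitted, dtype=float) - np.asarray(upsilon5, dtype=float)
    return np.maximum(b * residual, 0.0)

b = [-1, 1, 1, -1]                 # class directions
fitted = [-2.0, 0.5, 3.0, 0.0]     # chi_4 @ alpha_4 (toy values)
upsilon5 = [-1.0, 1.0, 1.0, -1.0]  # inner component of the class block (toy values)
m = update_slack(b, fitted, upsilon5)
print(m)  # slack is positive only for the first and third samples
```

A sample receives positive slack only when its fitted value already lies beyond the target on the side of its own class, so the slack never pulls the two classes toward each other.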

The algorithm is summarized in Algorithm 1. In the algorithm, $r$ represents the rank of the subspace, which determines the dimension of the subspace. For instance, $\alpha_j^r$ is the $r$th rank of $\alpha_j$. MultiDA optimizes the first-rank subspace and iterates the optimization until the multiblock data contain no remaining information. In lines 10–14 of Algorithm 1, Wold's procedure guarantees convergence [22].

(1) For all blocks, normalize the loading vectors: $\alpha_j^0 = \sqrt{N}\, \alpha_j^0 / \|\chi_j \alpha_j^0\|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\alpha_j$ by (18) and (19)
(8)         update $\chi_k = \mathbf{X}_k + \mathbf{b} \odot \mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( {\alpha_j^r}^{\top} {\chi_j^r}^{\top} \chi_k^r \alpha_k^{r+1} \right) \chi_k^r \alpha_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( {\alpha_j^r}^{\top} {\chi_j^r}^{\top} \chi_k^r \alpha_k^{r} \right) \chi_k^r \alpha_k^{r}$
(14)      end if
(15)      Compute $\alpha_j^{r+1}$ by UST: $\alpha_{j(\imath)}^{r+1} = \operatorname{sign}(\chi_{j(\imath)}^{\top} \upsilon_j) \left( |\chi_{j(\imath)}^{\top} \upsilon_j| - \lambda_j \right)_{+}$
(16)      Normalize $\alpha_j^{r+1}$: $\alpha_j^{r+1} = \sqrt{N}\, \alpha_j^{r+1} / \|\chi_j \alpha_j^{r+1}\|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J} \alpha_j^r$ converges

Algorithm 1: Discriminant multiblock analysis.
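To make the update order in lines 10–16 of Algorithm 1 concrete, the following NumPy sketch performs repeated passes over the blocks, using already-updated loadings for blocks $k < j$ and previous loadings for blocks $k > j$. It is a simplified illustration under stated assumptions, not the authors' implementation: the slack-variable step (lines 6–9) is omitted, the data are random toy blocks, all thresholds are set to zero, and the function name `sweep` is hypothetical.

```python
import numpy as np

def sweep(blocks, loadings, weights, lam):
    """One pass over all blocks: inner component (11), UST (13), normalization (14)."""
    n_samples = blocks[0].shape[0]
    for j, X_j in enumerate(blocks):
        v_j = X_j @ loadings[j]
        upsilon = np.zeros(n_samples)
        for k, X_k in enumerate(blocks):
            if k == j or weights[j, k] == 0:
                continue
            # loadings[k] is already updated for k < j and still the old value for k > j
            v_k = X_k @ loadings[k]
            upsilon += weights[j, k] * (v_j @ v_k) * v_k
        z = X_j.T @ upsilon
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam[j], 0.0)
        scale = np.linalg.norm(X_j @ alpha)
        if scale > 0:  # skip the update if thresholding removed every coefficient
            loadings[j] = np.sqrt(n_samples) * alpha / scale
    return loadings

# Weight matrix D from Eq. (4).
D = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 3],
    [0, 0, 0, 3, 0],
])

# Toy data (hypothetical sizes): 30 samples, five column-centered random blocks.
rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3, 1]
blocks = [rng.standard_normal((30, p)) for p in sizes]
blocks = [X - X.mean(axis=0) for X in blocks]
loadings = [rng.standard_normal(p) for p in sizes]
lam = [0.0] * len(blocks)  # no sparsity in this toy run

for _ in range(20):
    loadings = sweep(blocks, loadings, D, lam)
```

After each pass, every latent variable has norm $\sqrt{N}$, which is the scaling produced by line 16 of the algorithm.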

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression relative to normal controls.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, on simulated data of various complexities. The generation schemes of the simulation data were extended from previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. $\text{Type}_1(\mu)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\mathbf{I}_{p \times p}$), where $\mathbf{I}_{p \times p}$ is the $p \times p$ identity matrix. $\text{Type}_2(\mu, \delta)$ generates more complicated data than $\text{Type}_1(\mu)$. In $\text{Type}_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $\mathbb{1}_{\delta}$. Given a uniformly distributed random value ($u$), $\mathbb{1}_{\delta} = 1$ if $u \le \delta$ and $0$ otherwise. $\text{Type}_3(\mu, \rho)$ considers multicollinear data, in which more than two variables are highly correlated. The data matrix is generated from the multivariate normal distribution $\mathcal{N}(\mu, \Sigma_{p \times p})$. The covariance structure $\Sigma_{p \times p}$ is built by a first-order autoregressive process. $\text{Type}_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\sigma$).

Table 1: Generation functions.

Function                     Model
$\text{Type}_1(\mu)$          $\mathbf{x} = \mu + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
$\text{Type}_2(\mu, \delta)$  $\mathbf{x} = \mu + \mathbb{1}_{\delta} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
$\text{Type}_3(\mu, \rho)$    $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma_{p \times p})$
$\text{Type}_4(\mu, \sigma)$  $\mathbf{x} \sim \mathcal{N}(\mu, \sigma \mathbf{I}_{p \times p})$

Table 2: Scheme of the simulation data.

Simulation data   Generation model type                  Column index
$\mathbf{X}_1$    $\mathbf{x}_i = \text{Type}_1(2.4)$     $1 \le \imath \le 5$
                  $\mathbf{x}_i = \text{Type}_1(-2.6)$    $6 \le \imath \le 10$
                  $\mathbf{x}_i = \text{Type}_2(1, 0.6)$  $11 \le \imath \le 40$
                  $\mathbf{x}_i = \text{Type}_3(0, 0.8)$  $41 \le \imath \le 100$
$\mathbf{X}_2$    $\mathbf{x}_i = \text{Type}_1(3)$       $1 \le \imath \le 5$
                  $\mathbf{x}_i = \text{Type}_1(4)$       $6 \le \imath \le 10$
                  $\mathbf{x}_i = \text{Type}_3(0, 0.9)$  $11 \le \imath \le 60$
                  $\mathbf{x}_i = \text{Type}_4(2, 2)$    $61 \le \imath \le 200$
$\mathbf{X}_3$    $\mathbf{x}_i = \text{Type}_1(5)$       $1 \le \imath \le 5$
                  $\mathbf{x}_i = \text{Type}_1(-3)$      $6 \le \imath \le 10$
                  $\mathbf{x}_i = \text{Type}_4(0, 1)$    $11 \le \imath \le 210$
                  $\mathbf{x}_i = \text{Type}_3(0, 0.9)$  $211 \le \imath \le 300$
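The generation functions of Table 1 and the layout of the first block in Table 2 can be sketched in NumPy as follows. Several details are assumptions made for this example because the text does not spell them out: the AR(1) covariance is taken as $\Sigma_{ab} = \rho^{|a-b|}$, the threshold indicator is drawn independently for every entry, the means 2.4 and -2.6 follow the reading of Table 2 used here, and the function names and the random seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def type1(n, p, mu):
    """x = mu + eps, eps ~ N(0, I)."""
    return mu + rng.standard_normal((n, p))

def type2(n, p, mu, delta):
    """x = mu + 1_delta + eps, where 1_delta = 1 if a uniform draw u <= delta."""
    indicator = (rng.uniform(size=(n, p)) <= delta).astype(float)
    return mu + indicator + rng.standard_normal((n, p))

def ar1_cov(p, rho):
    """AR(1) covariance, assumed here to be Sigma[a, b] = rho ** |a - b|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def type3(n, p, mu, rho):
    """x ~ N(mu, Sigma) with an AR(1) covariance, giving correlated neighboring columns."""
    return rng.multivariate_normal(np.full(p, float(mu)), ar1_cov(p, rho), size=n)

def type4(n, p, mu, sigma):
    """x ~ N(mu, sigma * I); sigma is a variance, so the scale is sqrt(sigma)."""
    return mu + np.sqrt(sigma) * rng.standard_normal((n, p))

# First block X_1 of Table 2: 5 + 5 + 30 + 60 = 100 columns for N = 500 samples.
N = 500
X1 = np.hstack([
    type1(N, 5, 2.4),
    type1(N, 5, -2.6),
    type2(N, 30, 1, 0.6),
    type3(N, 60, 0, 0.8),
])
print(X1.shape)  # (500, 100)
```

The other two blocks follow the same pattern with the column ranges and parameters listed in Table 2.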

The first three blocks ($\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by $\text{Type}_1(2.4)$ and the following five columns by $\text{Type}_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, $\text{Type}_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated as multicollinear random variables, $\text{Type}_3(0, 0.8)$. Then, we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3} \mathbf{X}_j \mathbf{B}_j + \Xi$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\Xi$ is an $N \times P_4$ normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant in explaining $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4 \times 1}$ (all zeros but the first ten), the probability of disease $\pi$ was computed by using

$$\pi = \frac{\exp(\mathbf{X}_4 \mathbf{B}_4)}{1 + \exp(\mathbf{X}_4 \mathbf{B}_4)}. \tag{20}$$

Then the binary class label block was generated using the Bernoulli distribution with the probability $\pi$.
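The construction of the gene expression block and the class labels can be sketched as below. The block sizes follow the text, but everything else is an assumption made for this example: the first three blocks are replaced by standard normal stand-ins, the nonzero coefficient values are drawn at random because the paper does not report them, and the noise is taken to have unit variance.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, P4 = 500, [100, 200, 300], 50

# Stand-ins for the simulated blocks X_1, X_2, X_3 (see Table 2 for the actual scheme).
X = [rng.standard_normal((N, p)) for p in P]

# Loading matrices B_j: only the first ten variables of each block are significant.
B = []
for p in P:
    B_j = np.zeros((p, P4))
    B_j[:10, :] = rng.standard_normal((10, P4))  # hypothetical coefficient values
    B.append(B_j)

# Multiblock linear model X_4 = sum_j X_j B_j + noise.
X4 = sum(X_j @ B_j for X_j, B_j in zip(X, B)) + rng.standard_normal((N, P4))

# Coefficient vector B_4: all zeros except the first ten entries.
B4 = np.zeros(P4)
B4[:10] = rng.standard_normal(10)  # hypothetical coefficient values

# Eq. (20) in the equivalent form 1 / (1 + exp(-x)), then Bernoulli sampling.
pi = 1.0 / (1.0 + np.exp(-(X4 @ B4)))
X5 = rng.binomial(1, pi)
print(X4.shape, X5.mean())
```

Each sample receives its own disease probability, so the class labels depend on the genomic blocks only through the gene expression block.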

The simulation study was repeated with 50 replications to assess reproducibility. We compared the performance of MultiDA with the related methods Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between an independent variable $\mathbf{X}$ and a response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = \{\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3\}$) and the GE block was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined performance by how well the methods correctly identify the significant factors of the integrative association model. Given the ground truth, we computed a confusion matrix and measured the True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the number of true negatives is much larger than the number of false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA (0.93 ± 0.03) and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This suggests that the multiblock methods reduce false negatives, that is, significant factors incorrectly identified as insignificant. MultiDA showed the best performance in PPV and ACCU, with 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values correspond to fewer false positives, that is, insignificant factors incorrectly identified as significant. The PPV and ACCU were 0.48 ± 0.15 and 0.89 ± 0.14 for SCCA and 0.54 ± 0.08 and 0.94 ± 0.01 for SGCCA, respectively.
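The three measures can be computed from the sets of truly significant and selected variables as in the sketch below. The function name and the example index sets are assumptions made for illustration; they are not the simulation results reported above.

```python
def selection_metrics(true_support, selected, n_features):
    """TPR, PPV, and accuracy of a variable selection against a known ground truth."""
    true_support, selected = set(true_support), set(selected)
    tp = len(true_support & selected)  # significant and selected
    fp = len(selected - true_support)  # insignificant but selected
    fn = len(true_support - selected)  # significant but missed
    tn = n_features - tp - fp - fn     # insignificant and not selected
    return {
        "TPR": tp / (tp + fn) if tp + fn else float("nan"),
        "PPV": tp / (tp + fp) if tp + fp else float("nan"),
        "ACCU": (tp + tn) / n_features,
    }

# Hypothetical example: 10 significant variables out of 100, 12 variables selected.
truth = range(10)
chosen = list(range(9)) + [15, 20, 30]
metrics = selection_metrics(truth, chosen, n_features=100)
print(metrics)
```

In this example 87 of the 100 variables are true negatives, so accuracy stays high even with three false positives, which illustrates why TNR and NPV carry little information in a sparse setting.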

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained from patients with three major psychiatric disorders, schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as from a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 samples of control were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples for this experiment. For each individual, 10760 SNPs (after removing highly correlated ones), 1028 CNVs, 20769 DNA methylations, and 19767 gene expressions were examined. Because recent research has reported that genetic effects may be largely shared among major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminative factors against the control [25, 26].

The multiblock data were analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, for which high correlations between the connected blocks were found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the CNV and DNA methylation results were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of each gene and the literature regarding the psychiatric disorders are described.

Table 3: The gene results from MultiDA with psychiatric disorders.

Gene       Chromosome   Location    Source   ID           MAF    Reference
HTR7       10           10q21-q24   GE       7934970             [28]
APOE       19           19q13.2     DM       cg14123992          [29]
TRPM1      15           15q13.3     DM       cg18085517
EPHB1      3            3q21-q23    CNV      CNP12652
NPY        7            7p15.1      CNV      CNP2267             [30]
QKI        6            6q26        SNP      rs1336225    0.18
SLC15A1    13           13q32.3     SNP      rs9517421    0.17   [31]
NPAS3      14           14q13.1     SNP      rs1124910    0.25   [32]
C15orf53   15           15q14       SNP      rs1433876    0.29   [33]

Figure 2: Performance comparison of SCCA, SGCCA, and MultiDA in the simulation study. (a) True Positive Rate. (b) Positive Predictive Value. (c) Accuracy.

Figure 3: The gene regulatory network searched with the gene results by the STRING database. The legend shows the data source of each gene (gene expression, SNP, CNV, or DNA methylation).

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among the retrieved interactions, we take note of one gene regulatory network, illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation by impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) encodes a receptor for serotonin, a major neurotransmitter in the central nervous system, and several studies have linked it to bipolar disorder and schizophrenia [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while previous studies reported the gene through GWAS on SNP data. The gene may have strong interactions with other heterogeneous data, which is why it is considered significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction, where the NPY gene encodes neuropeptide Y, which also acts as a neurotransmitter in the brain and is known to play an important role in emotional processing [30]. A large number of psychiatric disorder susceptibility genes have been associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and that it is associated with susceptibility to human mood disorders. Our result supports that claim.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis method in order to dissect the mechanism of complex human disease using multiple types of genetic data. A genomic association study with a single type of data may fall short of identifying the mechanisms of diseases. In contrast, MultiDA enables comprehensive analysis using multiple types of genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it detects discriminative factors in the integrative genomic model. In the simulation experiments, the proposed method achieved a TPR equal to that of SGCCA and the highest PPV and accuracy among the compared methods. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. Among the large number of variables in each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures a global profile of the mechanism that conventional single-block or two-block methods fail to detect. This tool for integrative genomic studies can be extended to new types of data as new high-throughput technologies emerge.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, "Copy number variants, diseases and gene expression," Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, "Revealing the architecture of gene regulation: the promise of eQTL studies," Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, "Epigenetic inheritance and the missing heritability problem," Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk et al., "Copy number variation: new insights in genome diversity," Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, "Human copy number variation and complex genetic disease," Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, "Cancer epigenetics: modifications, screening, and therapy," Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, "DNA methylation and its basic function," Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer et al., "Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors," Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, "The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts," Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas et al., "Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations," PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson et al., "An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility loci for osteoporosis-related traits," PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets," Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, "Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma," American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data," Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


Figure 1: The conceptual graphic representation of the integrative genomic model. A rectangle represents a manipulated variable and a circle represents a latent variable. The graphic representation illustrates the structure model that shows the relationship between SNP, CNV, DNA methylation, gene expression, and disease phenotype.

[Diagram labels: SNP, CNV, DNA methylation, gene expression, disease; data blocks $\mathbf{X}_1, \ldots, \mathbf{X}_5$; latent variables $\mathbf{v}_1, \ldots, \mathbf{v}_5$; discriminant analysis.]

$\mathbf{X}_i$ and the $\imath$th element of the vector $\mathbf{a}_i$, respectively. Figure 1 illustrates the conceptual overview of the multiblock data and framework.

2.2. Multiblock Discriminant Analysis. Multiblock Discriminant Analysis (MultiDA) builds a sparse association model by not only maximizing the total squared correlations between the multiblocks but also taking into account the discriminative factors in the model. MultiDA considers a linear subspace, which is a construction of a low-dimensional basis of the data. The linear subspaces of the multiblock, which maximize the total squared correlations, identify the significant factors of the association model with sparsity regularization. The linear subspace (or latent variable) $\mathbf{v}_j$ of the $j$th block is represented by

$$\mathbf{v}_j = \mathbf{X}_j \alpha_j, \tag{1}$$

where $\alpha_j$ is a loading vector. Then we introduce sparse regularization (elastic net penalization) on the loading vector to reduce the chance of including insignificant variables and to improve their interpretation. The sparse regularization has its advantage especially when the number of features is much larger than the sample number ($N \ll P_j$). Therefore, the basic objective function can be represented as

$$\begin{aligned}
\underset{\alpha_j}{\arg\max}\ & \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} c_{jk} \frac{(\alpha_j^\top \mathbf{X}_j^\top \mathbf{X}_k \alpha_k)^2}{(\alpha_j^\top \mathbf{X}_j^\top \mathbf{X}_j \alpha_j)(\alpha_k^\top \mathbf{X}_k^\top \mathbf{X}_k \alpha_k)} \\
\text{s.t.}\ & \alpha_j^\top \mathbf{X}_j^\top \mathbf{X}_j \alpha_j = 1,\quad |\alpha_j| \le t_1,\quad \|\alpha_j\|^2 \le t_2,\quad j = 1, \ldots, J,
\end{aligned} \tag{2}$$

where $|\cdot|$ and $\|\cdot\|^2$ represent the $\ell_1$-norm and the (squared) $\ell_2$-norm of the vectors, respectively, and $t_1$ and $t_2$ are the shrinkage parameters that determine the sparsity. Note that the basic objective function is equivalent to the Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. Since the integrative genomic model aims to represent gene expression regulated by the combinations of SNP, CNV, and DM, the matrix $\mathbf{C}$ can be defined as

$$\mathbf{C} = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}. \tag{3}$$
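As a rough illustration of the objective in (2), the sketch below evaluates the weighted sum of squared correlations for a given set of loading vectors and the design matrix in (3). It assumes column-centered blocks and uses random stand-in data; the function name `sum_sq_correlations` and the block sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def sum_sq_correlations(blocks, loadings, design):
    """Weighted sum of squared correlations between block latent variables."""
    latents = [X @ a for X, a in zip(blocks, loadings)]  # v_j = X_j alpha_j
    total = 0.0
    for j in range(len(blocks)):
        for k in range(len(blocks)):
            if j == k or design[j, k] == 0:
                continue
            num = float(latents[j] @ latents[k]) ** 2
            den = float(latents[j] @ latents[j]) * float(latents[k] @ latents[k])
            total += design[j, k] * num / den
    return total

# Design matrix C from (3): the fourth block is linked to every other block.
C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((20, p)) for p in (6, 5, 4, 3, 1)]  # stand-in data
loadings = [rng.standard_normal(X.shape[1]) for X in blocks]
value = sum_sq_correlations(blocks, loadings, C)
```

Each ratio lies between 0 and 1, so with the eight nonzero entries of $\mathbf{C}$ the value is bounded by 8; the bound is reached when all latent variables are proportional to each other.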

We further consolidate the model by (1) introducing a weight matrix of the correlation for the balance of the model and (2) providing discriminant analysis in the integrative genomic model. We also provide a more sophisticated solution of the model, whereas SGCCA heuristically estimates the optimal solution by following Wold's algorithm in the previous work [17].

2.2.1. Weight Matrix for the Balance of the Model. The weight matrix of the correlation between the multiblocks, $\mathbf{D} = \{d_{jk} \mid d_{jk} \in \mathbb{R},\ 1 \le j, k \le J\}$, is introduced in the model. In the original multiblock model, the correlation between the gene expression and class label blocks tends to be overlooked. Instead, the sum of the squared pairwise correlations of $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ contributes large portions. The correlation weight matrix $\mathbf{D}$ gives an equal balance of the total squared correlations. In this paper, the weight matrix is defined as

$$\mathbf{D} = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 3 \\
0 & 0 & 0 & 3 & 0
\end{bmatrix}, \tag{4}$$

where the correlation between the gene expression and class label blocks is weighted three times more than the others. Then the matrix $\mathbf{D}$ simply replaces the matrix $\mathbf{C}$.
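One way to read the balance argument is that the three genomic links into the gene expression block carry a total weight of 3, so the single gene expression to class label link is given the same weight. The following sketch builds $\mathbf{D}$ from $\mathbf{C}$ under that reading; the zero-based index names `GE` and `DISEASE` are hypothetical labels for the fourth and fifth blocks.

```python
import numpy as np

GE, DISEASE = 3, 4  # zero-based positions of the gene expression and class label blocks

C = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

D = C.copy()
D[GE, DISEASE] = D[DISEASE, GE] = 3  # weight the GE-disease link three times more

genomic_weight = D[GE, :3].sum()     # total weight of the SNP, CNV and DM links
disease_weight = D[GE, DISEASE]      # weight of the single discriminant link
```

Here `genomic_weight` and `disease_weight` are both 3, which is the sense in which the two groups of correlations are balanced in this sketch.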

2.2.2. Discriminant Analysis. In the proposed integrative genomic model, we need to find discriminative genes that characterize diseases. However, the integrative genomic model is comprised of combinations of multiple linear regression models. Thus, discriminant analysis methods such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed based on the linear regression model, and it has been shown that DLSR provides equal or superior performance compared to other discriminant methods. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas they considered a multiclass problem and developed its sparse version with $\ell_{2,1}$-norm regularization in their work, we reformulated its sparse method with elastic net penalization to suit our own needs. In DLSR, the slack variable is introduced into the ordinary linear regression problem:

$$\mathbf{X}\mathbf{a} = \mathbf{y} + \mathbf{b} \odot \mathbf{m}, \tag{5}$$

where $\mathbf{y}$ is a dependent variable ($y_i \in \{-1, 1\}$, $\mathbf{y} \in \mathbb{R}^N$), $\mathbf{X}$ is a multivariate independent variable ($\mathbf{X} \in \mathbb{R}^{N \times p}$), and $\mathbf{a}$ is a coefficient vector ($\mathbf{a} \in \mathbb{R}^p$). $\mathbf{b}$ is a direction of the class, where its element $b_i = -1$ if $y_i = -1$ or $1$ otherwise ($\mathbf{b} \in \mathbb{R}^N$). The Hadamard product $\odot$ of the direction vector $\mathbf{b}$ and the slack variable vector $\mathbf{m}$ determines the distance between classes ($\mathbf{m} \in \mathbb{R}^N$). The optimal solution will be covered in the next section.

2.2.3. The Objective Function of MultiDA. We finally obtain the objective function of MultiDA:

$$\begin{aligned}
\underset{\alpha_j}{\arg\max}\ & \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk} \frac{(\alpha_j^\top \chi_j^\top \chi_k \alpha_k)^2}{(\alpha_j^\top \chi_j^\top \chi_j \alpha_j)(\alpha_k^\top \chi_k^\top \chi_k \alpha_k)} \\
\text{s.t.}\ & \alpha_j^\top \chi_j^\top \chi_j \alpha_j = 1,\quad |\alpha_j| \le t_1,\quad \|\alpha_j\|^2 \le t_2,\quad j = 1, \ldots, J,
\end{aligned} \tag{6}$$

where $\chi_j$ is defined as

$$\chi_j = \begin{cases}
\mathbf{X}_j + \mathbf{b} \odot \mathbf{m} & \text{if } j = 5, \\
\mathbf{X}_j & \text{otherwise}.
\end{cases} \tag{7}$$

This setting enables one to perform discriminant analysis between the gene expression and disease blocks.

2.3. Optimization. The optimal solution of (6) can be obtained by the Lagrangian function

$$\mathcal{L} = -\sum_{j}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk} (\alpha_j^\top \chi_j^\top \chi_k \alpha_k)^2 + \sum_{j}^{J} z_j (\alpha_j^\top \chi_j^\top \chi_j \alpha_j - 1) + \sum_{j}^{J} \lambda_j |\alpha_j| + \sum_{j}^{J} \frac{(1 - \lambda_j)}{2} \|\alpha_j\|^2, \tag{8}$$

where $z_j$ and $\lambda_j$ are the Lagrangian multipliers. The Lagrangian function (8) is convex although not differentiable. Therefore, the local optimum of (8) provides a global solution. The partial derivatives of the Lagrangian function with respect to $\alpha_j$ and $z_j$ are derived from

$$\frac{\partial \mathcal{L}}{\partial \alpha_j} = -\sum_{k}^{J} d_{jk} (\alpha_j^\top \chi_j^\top \chi_k \alpha_k) \chi_j^\top \chi_k \alpha_k + z_j \chi_j^\top \chi_j \alpha_j + \lambda_j \mathbf{s}_j + (1 - \lambda_j) \alpha_j = 0, \tag{9}$$

$$\frac{\partial \mathcal{L}}{\partial z_j} = \alpha_j^\top \chi_j^\top \chi_j \alpha_j - 1 = 0, \tag{10}$$

where $\mathbf{s}_j$ is the sign vector of $\alpha_j$. Although the stationary equations have no closed-form solutions, the optimal solution can be estimated by an iterative algorithm.

We can simplify (9) with the inner component

$$\upsilon_j = \sum_{k,\, k \neq j}^{J} d_{jk} (\alpha_j^\top \chi_j^\top \chi_k \alpha_k) \chi_k \alpha_k. \tag{11}$$

Then, by introducing the inner component $\upsilon_j$ into (9), the solution of $\alpha_j$ can be written as

$$\alpha_j = \left[ z_j \left( \chi_j^\top \chi_j + \frac{1 - \lambda_j}{z_j} \right) \right]^{-1} (\chi_j^\top \upsilon_j - \lambda_j \mathbf{s}_j). \tag{12}$$

In (11), $(\alpha_j^\top \chi_j^\top \chi_k \alpha_k)$ is a scalar: the correlation between the latent variables of the $j$th and $k$th blocks. Therefore, the inner component is computed from the $\alpha_j$ of the previous iteration, and then the new $\alpha_j$ is updated in iterations.

Equation (12) is the normal equation of the regression of $\upsilon_j$ on $\chi_j$ with ridge and shrinkage parameters [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$\alpha_{j(\imath)} = \operatorname{sign}\left(\chi_{j(\imath)}^\top \upsilon_j\right) \left( \left| \chi_{j(\imath)}^\top \upsilon_j \right| - \lambda_j \right)_+, \tag{13}$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, $1$ if $x \ge 0$ or $-1$ otherwise, and $(x)_+$ returns only positive values of $x$ (i.e., $x$ if $x \ge 0$ or $0$ otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes mean squared errors. The parameter $z_j$ can be ignored because the solution of $\alpha_j$ is normalized by (10):

$$\alpha_j = \frac{\sqrt{N} \alpha_j}{\|\chi_j \alpha_j\|}. \tag{14}$$
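The update in (13) followed by the rescaling in (14) can be sketched in a few lines of NumPy. This is a minimal illustration on a toy matrix, not the authors' implementation; the function names `soft_threshold` and `normalize_loading` are hypothetical. Note that `np.sign(0)` is 0 rather than 1, which makes no difference here because the thresholded magnitude is also 0 in that case.

```python
import numpy as np

def soft_threshold(chi, upsilon, lam):
    """Elementwise UST as in (13): sign(c) * max(|c| - lam, 0), with c = chi^T upsilon."""
    c = chi.T @ upsilon
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def normalize_loading(chi, alpha):
    """Rescale alpha so that ||chi @ alpha|| equals sqrt(N), as in (14)."""
    norm = np.linalg.norm(chi @ alpha)
    if norm == 0.0:
        return alpha  # every coefficient was shrunk to zero; nothing to rescale
    return np.sqrt(chi.shape[0]) * alpha / norm

# Toy block with N = 4 samples and 3 variables (illustrative values only).
chi = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
upsilon = np.array([1.0, 0.5, 1.0, -0.5])

alpha = soft_threshold(chi, upsilon, lam=0.5)   # chi^T upsilon = [2, 0, 2]
alpha_scaled = normalize_loading(chi, alpha)
```

In this toy case the second variable is uncorrelated with the inner component and is set exactly to zero, which is how the penalty produces a sparse loading vector.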

For the discriminant analysis between the gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\alpha_4$ can be estimated by solving the following optimization problem:

$$\begin{aligned}
\underset{\alpha_4,\, \mathbf{m}}{\arg\min}\ & \frac{1}{2} \left\| \chi_4 \alpha_4 - (\upsilon_5 + \mathbf{b} \odot \mathbf{m}) \right\|^2 \\
\text{s.t.}\ & |\alpha_4| \le \xi_1,\quad \|\alpha_4\|^2 \le \xi_2.
\end{aligned} \tag{15}$$

The Lagrangian function of (15) is $\mathcal{L} = (1/2)\|\chi_4 \alpha_4 - \upsilon_5 - \mathbf{b} \odot \mathbf{m}\|^2 + \lambda_4 |\alpha_4| + ((1 - \lambda_4)/2)\|\alpha_4\|^2$. The derivative of the Lagrangian function with respect to $\alpha_4$ is

$$\frac{\partial \mathcal{L}}{\partial \alpha_4} = \chi_4^\top \chi_4 \alpha_4 - \chi_4^\top \gamma + \lambda_4 \mathbf{s} + (1 - \lambda_4) \alpha_4 = 0, \tag{16}$$

where $\mathbf{s}$ is the sign of $\alpha_4$ and $\gamma = \upsilon_5 + \mathbf{b} \odot \mathbf{m}$. Thus, the equation of $\alpha_4$ becomes

$$\alpha_4 = \left( \chi_4^\top \chi_4 + 1 - \lambda_4 \right)^{-1} \left( \chi_4^\top \gamma - \lambda_4 \mathbf{s} \right). \tag{17}$$

Finally, the optimal solution of $\alpha_4$ for the discriminative analysis is

$$\alpha_{4(\imath)} = \operatorname{sign}\left(\chi_{4(\imath)}^\top \gamma\right) \left( \left| \chi_{4(\imath)}^\top \gamma \right| - \lambda_4 \right)_+. \tag{18}$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes mean squared errors, like the other $\lambda_j$'s. The optimal solution of $\mathbf{m}$ is simply derived from [19]:

$$\mathbf{m} = \max\left( \mathbf{b} \odot (\chi_4 \alpha_4 - \upsilon_5),\ 0 \right). \tag{19}$$
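To see what the slack update in (19) does, the sketch below applies it to four samples. As a simplifying assumption, the regression target is taken to be the class label vector itself (standing in for $\upsilon_5$), and `fitted` stands in for $\chi_4 \alpha_4$; the function name `update_slack` and all numbers are hypothetical.

```python
import numpy as np

def update_slack(fitted, target, y):
    """Slack update in the style of (19): m = max(b * (fitted - target), 0)."""
    b = np.where(y == -1, -1.0, 1.0)  # direction vector: -1 for class -1, +1 otherwise
    m = np.maximum(b * (fitted - target), 0.0)
    return b, m

y = np.array([-1, -1, 1, 1])
target = y.astype(float)                    # stand-in for upsilon_5
fitted = np.array([-1.8, -0.4, 0.3, 2.5])   # stand-in for chi_4 @ alpha_4

b, m = update_slack(fitted, target, y)
relaxed_target = target + b * m             # right-hand side of (5)
residual = fitted - relaxed_target
```

Samples whose fitted value already lies beyond its label on the correct side (the first and last here) receive a positive slack and contribute zero residual, while the other two keep their original targets. This is how the slack variables allow the classes to be pushed apart without penalty.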

The brief algorithm is described in Algorithm 1. In the algorithm, $r$ represents a rank of the subspace, which determines the dimension of the subspace. For instance, $\alpha_j^r$ is the $r$th rank of $\alpha_j$. MultiDA optimizes the first rank subspace and iterates the optimization until the multiblock has no information. In lines 10–14 of Algorithm 1, Wold's procedure guarantees the convergence [22].

(1) For all blocks, normalize loading vectors: $\alpha_j^0 = \sqrt{N} \alpha_j^0 / \|\chi_j \alpha_j^0\|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\alpha_j$ by (18) and (19)
(8)         update $\chi_k = \mathbf{X}_k + \mathbf{b} \odot \mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} (\alpha_j^{r\top} \chi_j^{r\top} \chi_k^r \alpha_k^{r+1}) \chi_k^r \alpha_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} (\alpha_j^{r\top} \chi_j^{r\top} \chi_k^r \alpha_k^r) \chi_k^r \alpha_k^r$
(14)      end if
(15)      Compute $\alpha_j^{r+1}$ by UST: $\alpha_{j(\imath)}^{r+1} = \operatorname{sign}(\chi_{j(\imath)}^\top \upsilon_j)(|\chi_{j(\imath)}^\top \upsilon_j| - \lambda_j)_+$
(16)      Normalize $\alpha_j^{r+1}$: $\alpha_j^{r+1} = \sqrt{N} \alpha_j^{r+1} / \|\chi_j \alpha_j^{r+1}\|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J} \alpha_j^r$ converges

Algorithm 1: Discriminant multiblock analysis.
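The core of the loop (inner component, soft-thresholding, normalization) can be sketched as follows. This is a simplified illustration under several assumptions: it omits the discriminant step of lines 6–9, runs a fixed number of sweeps instead of testing convergence, keeps the previous loading if thresholding removes every coefficient, and uses synthetic two-block data. The function names are hypothetical.

```python
import numpy as np

def inner_component(j, blocks, loadings, design):
    """Inner component (11): weighted sum of the neighbouring latent variables."""
    v_j = blocks[j] @ loadings[j]
    upsilon = np.zeros(blocks[j].shape[0])
    for k in range(len(blocks)):
        if k != j and design[j, k] != 0:
            v_k = blocks[k] @ loadings[k]
            upsilon += design[j, k] * float(v_j @ v_k) * v_k
    return upsilon

def sweep(blocks, loadings, design, lam):
    """One pass over all blocks: inner component, UST (13), normalization (14)."""
    n = blocks[0].shape[0]
    for j, X in enumerate(blocks):
        upsilon = inner_component(j, blocks, loadings, design)
        c = X.T @ upsilon
        alpha = np.sign(c) * np.maximum(np.abs(c) - lam[j], 0.0)
        norm = np.linalg.norm(X @ alpha)
        if norm > 0:
            # The updated loading is stored immediately, so later blocks
            # (k < j from their point of view) already see the new value.
            loadings[j] = np.sqrt(n) * alpha / norm
    return loadings

# Two synthetic blocks sharing one signal: column 0 of block 0, column 1 of block 1.
rng = np.random.default_rng(1)
signal = rng.standard_normal(50)
blocks = [
    np.outer(signal, [1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal((50, 3)),
    np.outer(signal, [0.0, 1.0]) + 0.1 * rng.standard_normal((50, 2)),
]
design = np.array([[0, 1], [1, 0]])
loadings = [np.ones(3), np.ones(2)]
for _ in range(20):
    loadings = sweep(blocks, loadings, design, lam=[1.0, 1.0])
```

Updating `loadings[j]` in place mirrors the distinction between lines 11 and 13 of Algorithm 1: blocks already visited in the current sweep contribute their new loadings, and the remaining blocks contribute the old ones.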

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression against normal controls.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, through simulated data. Simulation data of various complexities were considered. The generation schemes of the simulation data for the assessment were extended from the previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. Type$_1(\mu)$ generates $p$-dimensional normally distributed random variables of a given mean ($\mu$) and a variance ($\mathbf{I}_{p \times p}$), where $\mathbf{I}_{p \times p}$ is a $p \times p$ identity matrix. Type$_2(\mu, \delta)$ generates more complicated data than Type$_1(\mu)$. In Type$_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $1_\delta$. Given a uniformly distributed random value ($u$), $1_\delta = 1$ if $u \le \delta$ or $0$ otherwise. Type$_3(\mu, \rho)$ considers multicollinearity data in which more than two variables are highly correlated. The matrix data are generated by the multivariate normal distribution $\mathcal{N}(\mu, \Sigma_{p \times p})$. The covariance structure $\Sigma_{p \times p}$ is built by the first-order autoregressive process. Type$_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables from a given mean ($\mu$) and a variance ($\sigma$).

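Table 1: Generation functions.

| Function | Model |
| --- | --- |
| Type$_1(\mu)$ | $\mathbf{x} = \mu + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| Type$_2(\mu, \delta)$ | $\mathbf{x} = \mu + 1_\delta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| Type$_3(\mu, \rho)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma_{p \times p})$ |
| Type$_4(\mu, \sigma)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \sigma \mathbf{I}_{p \times p})$ |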
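The four generation functions of Table 1 can be sketched with NumPy as below. Two details are assumptions, because the text does not spell them out: the indicator $1_\delta$ is drawn independently for every entry, and the first-order autoregressive covariance is taken to be $\Sigma_{ab} = \rho^{|a - b|}$ with unit variances. The function names are hypothetical.

```python
import numpy as np

def type1(rng, n, p, mu):
    """x = mu + eps, eps ~ N(0, I)."""
    return mu + rng.standard_normal((n, p))

def type2(rng, n, p, mu, delta):
    """x = mu + 1_delta + eps, where 1_delta = 1 if u <= delta (u uniform) else 0."""
    indicator = (rng.uniform(size=(n, p)) <= delta).astype(float)
    return mu + indicator + rng.standard_normal((n, p))

def ar1_covariance(p, rho):
    """Assumed AR(1) structure: Sigma[a, b] = rho ** |a - b|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def type3(rng, n, p, mu, rho):
    """x ~ N(mu, Sigma) with the AR(1) covariance above (multicollinear columns)."""
    return rng.multivariate_normal(np.full(p, mu), ar1_covariance(p, rho), size=n)

def type4(rng, n, p, mu, sigma):
    """x ~ N(mu, sigma * I), treating sigma as a variance."""
    return mu + np.sqrt(sigma) * rng.standard_normal((n, p))

rng = np.random.default_rng(0)
sample = type3(rng, n=500, p=60, mu=0.0, rho=0.8)  # 500 samples, 60 correlated columns
```

Under the assumed covariance, neighbouring columns of a Type$_3$ block have correlation $\rho$ and columns further apart decay geometrically, which produces the multicollinearity the simulation is meant to stress.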

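Table 2: Scheme of the simulation data.

| Simulation data | Generation model type | Column index |
| --- | --- | --- |
| $\mathbf{X}_1$ | $\mathbf{x}_\imath$ = Type$_1(2.4)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_\imath$ = Type$_1(-2.6)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_\imath$ = Type$_2(1, 0.6)$ | $11 \le \imath \le 40$ |
| | $\mathbf{x}_\imath$ = Type$_3(0, 0.8)$ | $41 \le \imath \le 100$ |
| $\mathbf{X}_2$ | $\mathbf{x}_\imath$ = Type$_1(3)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_\imath$ = Type$_1(4)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_\imath$ = Type$_3(0, 0.9)$ | $11 \le \imath \le 60$ |
| | $\mathbf{x}_\imath$ = Type$_4(2, 2)$ | $61 \le \imath \le 200$ |
| $\mathbf{X}_3$ | $\mathbf{x}_\imath$ = Type$_1(5)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_\imath$ = Type$_1(-3)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_\imath$ = Type$_4(0, 1)$ | $11 \le \imath \le 210$ |
| | $\mathbf{x}_\imath$ = Type$_3(0, 0.9)$ | $211 \le \imath \le 300$ |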

The first three multiblocks ($\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by Type$_1(2.4)$ and the following five columns by Type$_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, Type$_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated by the multicollinearity random variables, Type$_3(0, 0.8)$. Then we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3} \mathbf{X}_j \mathbf{B}_j + \Xi$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\Xi$ is an $N \times P_4$ dimensional normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant to explain $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4 \times 1}$ (all zeros but the first ten), the probability of disease $\pi$ was computed by using

$$\pi = \frac{\exp(\mathbf{X}_4 \mathbf{B}_4)}{1 + \exp(\mathbf{X}_4 \mathbf{B}_4)}. \tag{20}$$

Then the binary class label block was generated using the Bernoulli distribution with the probability $\pi$.
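The construction of the gene expression block and the class labels can be sketched as follows. For brevity the three input blocks are plain standard normal stand-ins rather than the Table 2 mixtures, and the nonzero entries of $\mathbf{B}_j$ and $\mathbf{B}_4$ are drawn at random; the paper does not state the values it used, so these are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P4, n_signif = 500, 50, 10
sizes = (100, 200, 300)  # P_1, P_2, P_3

# Stand-in blocks; the paper builds them from the Table 2 generators instead.
X = [rng.standard_normal((N, p)) for p in sizes]

# Loading matrices B_j: only the first ten variables of each block are nonzero.
B = []
for p in sizes:
    B_j = np.zeros((p, P4))
    B_j[:n_signif] = rng.standard_normal((n_signif, P4))  # hypothetical effect sizes
    B.append(B_j)

noise = rng.standard_normal((N, P4))                       # Xi, one row per sample
X4 = sum(X_j @ B_j for X_j, B_j in zip(X, B)) + noise

# Coefficient vector B_4: all zeros but the first ten entries.
B4 = np.zeros(P4)
B4[:n_signif] = rng.standard_normal(n_signif)              # hypothetical values

pi = 1.0 / (1.0 + np.exp(-(X4 @ B4)))  # algebraically equal to (20)
X5 = rng.binomial(1, pi)               # one Bernoulli draw per sample
```

Writing the logistic function as $1 / (1 + e^{-t})$ is equivalent to the form in (20) and avoids evaluating a large positive exponential in both the numerator and the denominator.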

The simulation study was examined with 50 replications to assess the reproducibility. We compared the performance of MultiDA with the related methods, Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between the independent variable $\mathbf{X}$ and the response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = [\mathbf{X}_1\ \mathbf{X}_2\ \mathbf{X}_3]$) and the gene expression (GE) block was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined the performance by how well the methods correctly identify significant factors of the integrative association model. Given a ground truth, we computed a confusion matrix and measured True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the true negatives are much more numerous than the false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA (0.93 ± 0.03) and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This supports that the multiblock methods reduce false negatives, which incorrectly identify the significant as the insignificant. MultiDA showed the best performance in PPV and ACCU, producing 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values represent fewer false positives, which incorrectly identify the insignificant as the significant. The PPV and ACCU were 0.48 ± 0.15 and 0.89 ± 0.14 for SCCA and 0.54 ± 0.08 and 0.94 ± 0.01 for SGCCA, respectively.
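The three measures can be computed from a boolean mask of selected variables and a ground-truth mask, as in the sketch below. The example numbers are made up for illustration and are unrelated to the reported results; the function name `selection_metrics` is hypothetical.

```python
import numpy as np

def selection_metrics(selected, truth):
    """TPR, PPV and accuracy of a selected-variable mask against the ground truth."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(selected & truth)     # significant and selected
    fp = np.sum(selected & ~truth)    # insignificant but selected
    fn = np.sum(~selected & truth)    # significant but missed
    tn = np.sum(~selected & ~truth)   # insignificant and not selected
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    accu = (tp + tn) / selected.size
    return tpr, ppv, accu

truth = np.zeros(100, dtype=bool)
truth[:10] = True            # the first ten variables are truly significant
selected = np.zeros(100, dtype=bool)
selected[:9] = True          # nine true positives
selected[50:56] = True       # six false positives
tpr, ppv, accu = selection_metrics(selected, truth)
```

With 84 true negatives out of 100 variables, the accuracy here (0.93) stays high even though the PPV is only 0.6, which illustrates why measures dominated by true negatives are less informative in a sparse setting. The function does not guard against an empty selection, in which case PPV is undefined.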

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained from patients with three major psychiatric disorders, schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as from a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 samples of control were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in the preparation of this experiment. For each individual, 10760 SNPs (after removing highly correlated ones), 1028 CNVs, 20769 DNA methylations, and 19767 gene expressions were examined. Because recent research reported that genetic effects may be largely shared among major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminative factors against the control [25, 26].

The multiblock data were analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, where a high correlation between the connections was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the results of CNV and DNA methylation were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of each gene and the literature regarding the psychiatric disorders are described.

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among a number of the retrieved interactions, we take note of one gene

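Table 3: The gene results from MultiDA with psychiatric disorders.

| Gene | Chromosome | Location | Source | ID | MAF | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| HTR7 | 10 | 10q21-q24 | GE | 7934970 | | [28] |
| APOE | 19 | 19q13.2 | DM | cg14123992 | | [29] |
| TRPM1 | 15 | 15q13.3 | DM | cg18085517 | | |
| EPHB1 | 3 | 3q21-q23 | CNV | CNP12652 | | |
| NPY | 7 | 7p15.1 | CNV | CNP2267 | | [30] |
| QKI | 6 | 6q26 | SNP | rs1336225 | 0.18 | |
| SLC15A1 | 13 | 13q32.3 | SNP | rs9517421 | 0.17 | [31] |
| NPAS3 | 14 | 14q13.1 | SNP | rs1124910 | 0.25 | [32] |
| C15orf53 | 15 | 15q14 | SNP | rs1433876 | 0.29 | [33] |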

Figure 2: Performance comparison in simulation study: (a) True Positive Rate, (b) Positive Predictive Value, and (c) Accuracy. Each panel compares SCCA, SGCCA, and MultiDA.

Figure 3: The gene regulatory network searched with the gene results by the STRING database. The legend shows the data source of the gene (gene expression, SNP, CNV, or DNA methylation).

[Network nodes: CES1, HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1.]

regulatory network illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation, impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) encodes a receptor for a major neurotransmitter in the central nervous system, and a number of studies relating it to bipolar disorder and schizophrenia have been reported [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while previous studies reported the gene with GWAS on the SNP data block. The gene may have strong incorporated interactions with other heterogeneous data, and it is consequently considered to be significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction, where the NPY gene encodes a neurotransmitter in the brain and is known to play an important role in the emotional process [30]. A large number of psychiatric disorder susceptibility genes were associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may potentially be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and also found that it indirectly induces susceptibility to human mood disorders. Our result supports their claim.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis method in order to dissect the mechanism of complex human disease using multiple genetic data. A genomic association study with a single type of data may fall short of identifying the mechanisms of the diseases. On the other hand, MultiDA enables comprehensive analysis using multiple genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it effectively detects discriminative factors in the integrative genomic model. The simulation experiments support the strong performance of the proposed method. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. Among the large number of variables of each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures the global profile of the mechanism that conventional single- or two-block methods fail to detect. This promising tool for integrative genomic study can be flexibly extended to new types of data as new high-throughput technologies supersede existing ones.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, "Copy number variants, diseases and gene expression," Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, "Revealing the architecture of gene regulation: the promise of eQTL studies," Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, "Epigenetic inheritance and the missing heritability problem," Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk et al., "Copy number variation: new insights in genome diversity," Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, "Human copy number variation and complex genetic disease," Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, "Cancer epigenetics: modifications, screening, and therapy," Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, "DNA methylation and its basic function," Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer et al., "Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors," Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, "The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts," Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas et al., "Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations," PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson et al., "An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits," PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets," Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, "Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma," American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data," Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


weight matrix $\mathbf{D}$ gives an equal balance of the total squared correlations. In this paper, the weight matrix is defined as

$$
\mathbf{D} =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 3 \\
0 & 0 & 0 & 3 & 0
\end{bmatrix}
\tag{4}
$$

where the correlation between the gene expression and class label blocks is weighted three times more heavily than the others. The matrix $\mathbf{D}$ then simply replaces the matrix $\mathbf{C}$.
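As a small illustration of the design in (4), the sketch below stores the weights as a nested list and lists the connected block pairs. The block names and their order (three genetic or epigenetic blocks, then gene expression, then class label) are assumptions made for illustration, as are the helper names `connected_pairs` and `is_symmetric`.

```python
# Design weights d_jk from (4). The block labels are assumed for illustration:
# blocks 1-3 connect only to block 4 (gene expression), and block 4 connects
# to block 5 (class label) with weight 3.
BLOCKS = ["SNP", "CNV", "DNA methylation", "gene expression", "class label"]

D = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 3],
    [0, 0, 0, 3, 0],
]


def is_symmetric(d):
    """Check d_jk == d_kj for all block pairs."""
    n = len(d)
    return all(d[j][k] == d[k][j] for j in range(n) for k in range(n))


def connected_pairs(d):
    """Return (j, k, weight) for each connected pair with j < k (1-based)."""
    n = len(d)
    return [
        (j + 1, k + 1, d[j][k])
        for j in range(n)
        for k in range(j + 1, n)
        if d[j][k] != 0
    ]


pairs = connected_pairs(D)
for j, k, w in pairs:
    print(f"{BLOCKS[j - 1]} <-> {BLOCKS[k - 1]}: weight {w}")
```

Only the gene expression block is linked to every other block, which matches the description of gene expression as the hub of the integrative model.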

2.2.2. Discriminant Analysis. In the proposed integrative genomic model, we need to find discriminative genes that characterize diseases. However, the integrative genomic model is composed of combinations of multiple linear regression models. Thus, discriminant analysis methods such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed based on the linear regression model, and it has been shown to provide equal or superior performance compared to other discriminant methods [19]. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas Xiang et al. considered a multiclass problem and developed a sparse version with $\ell_{2,1}$-norm regularization, we reformulated the sparse method with elastic net penalization to suit our needs. In DLSR, the slack variable is introduced into the ordinary linear regression problem:

$$
\mathbf{X}\mathbf{a} = \mathbf{y} + \mathbf{b} \odot \mathbf{m} \tag{5}
$$

where $\mathbf{y}$ is a dependent variable ($y_i \in \{-1, 1\}$, $\mathbf{y} \in \mathbb{R}^{N}$), $\mathbf{X}$ is a multivariate independent variable ($\mathbf{X} \in \mathbb{R}^{N \times p}$), and $\mathbf{a}$ is a coefficient vector ($\mathbf{a} \in \mathbb{R}^{p}$). $\mathbf{b}$ is the direction of the class, whose element $b_i = -1$ if $y_i = -1$ and $b_i = 1$ otherwise ($\mathbf{b} \in \mathbb{R}^{N}$). The Hadamard product $\odot$ of the direction vector $\mathbf{b}$ and the slack variable vector $\mathbf{m}$ determines the distance between classes ($\mathbf{m} \in \mathbb{R}^{N}$, so that $\mathbf{b} \odot \mathbf{m}$ can be added to $\mathbf{y}$). The optimal solution will be covered in the next section.
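The effect of the slack term in (5) can be seen with a tiny numeric sketch. It assumes nonnegative slack values and uses hypothetical helper names (`direction`, `relaxed_targets`); it is only meant to show that $\mathbf{y} + \mathbf{b} \odot \mathbf{m}$ moves each target away from zero in the direction of its own class.

```python
def direction(y):
    """Class direction b: -1 where y_i == -1, and +1 otherwise."""
    return [-1.0 if yi == -1 else 1.0 for yi in y]


def relaxed_targets(y, m):
    """Regression targets y + b (.) m from (5), with elementwise product."""
    b = direction(y)
    return [yi + bi * mi for yi, bi, mi in zip(y, b, m)]


y = [-1, 1, 1]          # class labels
m = [0.5, 0.0, 2.0]     # nonnegative slack values (illustrative numbers)
targets = relaxed_targets(y, m)
print(targets)          # negative-class target moves left, positive ones move right
```

With zero slack the targets reduce to the ordinary labels, so plain least squares regression is recovered as a special case.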

2.2.3. The Objective Function of MultiDA. We finally obtain the objective function of MultiDA:

$$
\underset{\alpha_j}{\arg\max} \sum_{j=1}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk}\,
\frac{\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k}
     {\alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j \, \alpha_k^{\top} \chi_k^{\top} \chi_k \alpha_k}
\quad \text{s.t.} \quad
\alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j = 1,\ \
|\alpha_j| \le t_1,\ \
\|\alpha_j\|^{2} \le t_2,\ \
j = 1, \dots, J
\tag{6}
$$

where $\chi_j$ is defined as

$$
\chi_j =
\begin{cases}
\mathbf{X}_j + \mathbf{b} \odot \mathbf{m} & \text{if } j = 5 \\
\mathbf{X}_j & \text{otherwise}
\end{cases}
\tag{7}
$$

This setting enables one to perform discriminant analysis between the gene expression and disease blocks.

2.3. Optimization. The optimal solution of (6) can be obtained from the Lagrangian function

$$
\mathcal{L} = -\sum_{j}^{J} \sum_{k=1,\, j \neq k}^{J} d_{jk}\, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \, \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k
+ \sum_{j}^{J} z_j \left( \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1 \right)
+ \sum_{j}^{J} \lambda_j \left| \alpha_j \right|
+ \sum_{j}^{J} \frac{(1 - \lambda_j)}{2} \left\| \alpha_j \right\|^{2}
\tag{8}
$$

where $z_j$ and $\lambda_j$ are the Lagrangian multipliers. The Lagrangian function (8) is convex although not differentiable. Therefore, the local optimum of (8) provides a global solution. The stationary equations are obtained by setting the partial derivatives of the Lagrangian function with respect to $\alpha_j$ and $z_j$ to zero:

$$
\frac{\partial \mathcal{L}}{\partial \alpha_j} = -\sum_{k}^{J} d_{jk} \left( \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \right) \chi_j^{\top} \chi_k \alpha_k + z_j \chi_j^{\top} \chi_j \alpha_j + \lambda_j \mathbf{s}_j + (1 - \lambda_j)\, \alpha_j = 0
\tag{9}
$$

$$
\frac{\partial \mathcal{L}}{\partial z_j} = \alpha_j^{\top} \chi_j^{\top} \chi_j \alpha_j - 1 = 0
\tag{10}
$$

where $\mathbf{s}_j$ is the vector of the signs of $\alpha_j$. Although the stationary equations have no closed-form solutions, the optimal solution can be estimated by an iterative algorithm.

We can simplify (9) with the inner component

$$
\upsilon_j = \sum_{k,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k \right) \chi_k \alpha_k
\tag{11}
$$

Then, by introducing the inner component $\upsilon_j$ into (9), the solution of $\alpha_j$ can be written as

$$
\alpha_j = \left[ z_j \left( \chi_j^{\top} \chi_j + \frac{1 - \lambda_j}{z_j} \right) \right]^{-1} \left( \chi_j^{\top} \upsilon_j - \lambda_j \mathbf{s}_j \right)
\tag{12}
$$

In (11), $(\alpha_j^{\top} \chi_j^{\top} \chi_k \alpha_k)$ is a squared correlation between the latent variables of the $j$th and $k$th blocks, which is a scalar. Therefore, the inner component is computed from the $\alpha_j$ of the previous iteration, and the new $\alpha_j$ is then updated iteratively. Equation (12) is the normal equation of the regression of $\upsilon_j$ on $\chi_j$ with ridge and shrinkage parameters [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$
\alpha_{j(i)} = \operatorname{sign}\left( \chi_{j(i)}^{\top} \upsilon_j \right) \left( \left| \chi_{j(i)}^{\top} \upsilon_j \right| - \lambda_j \right)_{+}
\tag{13}
$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, $1$ if $x \ge 0$ or $-1$ otherwise, and $(x)_{+}$ returns only positive values of $x$ (i.e., $x$ if $x \ge 0$ or $0$ otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes mean squared errors. The parameter $z_j$ can be ignored because the solution of $\alpha_j$ is normalized by (10):

$$
\alpha_j = \frac{\sqrt{N}\, \alpha_j}{\left\| \chi_j \alpha_j \right\|}
\tag{14}
$$
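The two steps in (13) and (14) can be sketched as follows. This is a plain-Python illustration under the assumption that $\chi_{j(i)}$ denotes the $i$th column of $\chi_j$; the function names are hypothetical, and $\lambda_j$ is simply passed in rather than chosen by cross-validation.

```python
import math


def soft_threshold(z, lam):
    """Univariate soft-thresholding: sign(z) * (|z| - lam)_+."""
    magnitude = abs(z) - lam
    if magnitude <= 0:
        return 0.0
    return math.copysign(magnitude, z)


def ust_update(chi, upsilon, lam):
    """Apply (13) to every column i of chi (N x p, list of rows)."""
    n, p = len(chi), len(chi[0])
    return [
        soft_threshold(sum(chi[r][i] * upsilon[r] for r in range(n)), lam)
        for i in range(p)
    ]


def normalize(chi, alpha):
    """Rescale alpha as in (14) so that ||chi alpha||^2 equals N."""
    n = len(chi)
    latent = [sum(x * a for x, a in zip(row, alpha)) for row in chi]
    norm = math.sqrt(sum(v * v for v in latent))
    if norm == 0.0:
        return list(alpha)  # all loadings were thresholded to zero
    return [math.sqrt(n) * a / norm for a in alpha]


chi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # N = 4, p = 2
upsilon = [2.0, 0.5, 1.0, 0.0]
alpha = ust_update(chi, upsilon, 2.0)   # column scores are 3.0 and 1.5
alpha_normalized = normalize(chi, alpha)
print(alpha, alpha_normalized)
```

The second loading is set exactly to zero because its score does not exceed the threshold, which is how the penalty produces a sparse selection of variables.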

For the discriminant analysis between the gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\alpha_4$ can be estimated by solving the following optimization problem, which minimizes the squared error between the gene expression latent variable and the relaxed class target:

$$
\underset{\alpha_4,\, \mathbf{m}}{\arg\min}\ \frac{1}{2} \left\| \chi_4 \alpha_4 - \left( \upsilon_5 + \mathbf{b} \odot \mathbf{m} \right) \right\|^{2}
\quad \text{s.t.} \quad
\left| \alpha_4 \right| \le \xi_1,\ \
\left\| \alpha_4 \right\|^{2} \le \xi_2
\tag{15}
$$

The Lagrangian function of (15) is $\mathcal{L} = \frac{1}{2} \| \chi_4 \alpha_4 - \upsilon_5 - \mathbf{b} \odot \mathbf{m} \|^{2} + \lambda_4 |\alpha_4| + \frac{(1 - \lambda_4)}{2} \| \alpha_4 \|^{2}$. The derivative of the Lagrangian function with respect to $\alpha_4$ is

$$
\frac{\partial \mathcal{L}}{\partial \alpha_4} = \chi_4^{\top} \chi_4 \alpha_4 - \chi_4^{\top} \gamma + \lambda_4 \mathbf{s} + (1 - \lambda_4)\, \alpha_4 = 0
\tag{16}
$$

where $\mathbf{s}$ is the sign of $\alpha_4$ and $\gamma = \upsilon_5 + \mathbf{b} \odot \mathbf{m}$. Thus, the equation of $\alpha_4$ becomes

$$
\alpha_4 = \left( \chi_4^{\top} \chi_4 + 1 - \lambda_4 \right)^{-1} \left( \chi_4^{\top} \gamma - \lambda_4 \mathbf{s} \right)
\tag{17}
$$

Finally, the optimal solution of $\alpha_4$ for the discriminative analysis is

$$
\alpha_{4(i)} = \operatorname{sign}\left( \chi_{4(i)}^{\top} \gamma \right) \left( \left| \chi_{4(i)}^{\top} \gamma \right| - \lambda_4 \right)_{+}
\tag{18}
$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes mean squared errors, like the other $\lambda_j$'s. The optimal solution of $\mathbf{m}$ is simply derived from [19]:

$$
\mathbf{m} = \max \left( \mathbf{b} \odot \left( \chi_4 \alpha_4 - \upsilon_5 \right),\ 0 \right)
\tag{19}
$$
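The slack update in (19) is an elementwise operation, sketched below in plain Python. The vectors `fitted` (standing for $\chi_4 \alpha_4$) and `target` (standing for $\upsilon_5$) hold illustrative numbers, and the function name is a hypothetical choice.

```python
def update_slack(fitted, target, b):
    """Slack update m = max(b (.) (fitted - target), 0), applied elementwise."""
    return [max(bi * (fi - ti), 0.0) for fi, ti, bi in zip(fitted, target, b)]


b = [-1.0, 1.0, 1.0]           # class directions
target = [-1.0, 1.0, 1.0]      # stands in for upsilon_5
fitted = [-1.75, 0.5, 2.5]     # stands in for chi_4 alpha_4
m = update_slack(fitted, target, b)
print(m)
```

Samples whose fitted value already lies beyond the target on the correct side receive positive slack, while a sample that falls short (the second one here) keeps zero slack and is still penalized by the squared error.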

The algorithm is summarized in Algorithm 1. In the algorithm, $r$ represents a rank of the subspace, which determines the dimension of the subspace. For instance, $\alpha_j^{r}$ is the $r$th rank of $\alpha_j$. MultiDA optimizes the first-rank subspace and iterates the optimization until the multiblock data have no information left. In lines 10–14 of Algorithm 1, Wold's procedure guarantees the convergence [22].

(1) For all blocks, normalize the loading vectors: $\alpha_j^{0} = \sqrt{N}\, \alpha_j^{0} / \| \chi_j \alpha_j^{0} \|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\alpha_j$ by (18) and (19)
(8)         update $\chi_k = \mathbf{X}_k + \mathbf{b} \odot \mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{r\top} \chi_j^{r\top} \chi_k^{r} \alpha_k^{r+1} \right) \chi_k^{r} \alpha_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\upsilon_j = \sum_{k=1,\, k \neq j}^{J} d_{jk} \left( \alpha_j^{r\top} \chi_j^{r\top} \chi_k^{r} \alpha_k^{r} \right) \chi_k^{r} \alpha_k^{r}$
(14)      end if
(15)      Compute $\alpha_j^{r+1}$ by UST: $\alpha_{j(i)}^{r+1} = \operatorname{sign}( \chi_{j(i)}^{\top} \upsilon_j ) ( | \chi_{j(i)}^{\top} \upsilon_j | - \lambda_j )_{+}$
(16)      Normalize $\alpha_j^{r+1}$: $\alpha_j^{r+1} = \sqrt{N}\, \alpha_j^{r+1} / \| \chi_j \alpha_j^{r+1} \|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J} \alpha_j^{r}$ converges

Algorithm 1: Discriminant multiblock analysis.

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression that distinguish patients from normal controls.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, on simulated data. Simulation data of various complexities were considered. The generation schemes of the simulation data were extended from previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. $\mathrm{Type}_1(\mu)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\mathbf{I}_{p \times p}$), where $\mathbf{I}_{p \times p}$ is a $p \times p$ identity matrix. $\mathrm{Type}_2(\mu, \delta)$ generates more complicated data than $\mathrm{Type}_1(\mu)$. In $\mathrm{Type}_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $1_{\delta}$. Given a uniformly distributed random value ($u$), $1_{\delta} = 1$ if $u \le \delta$ or $0$ otherwise. $\mathrm{Type}_3(\mu, \rho)$ considers multicollinear data in which more than two variables are highly correlated. The matrix data are generated by the multivariate normal distribution $N(\mu, \Sigma_{p \times p})$. The covariance structure $\Sigma_{p \times p}$ is built by a first-order autoregressive process. $\mathrm{Type}_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\sigma$).


Table 1: Generation functions.

Function                        Model
$\mathrm{Type}_1(\mu)$          $\mathbf{x} = \mu + \epsilon$, $\epsilon \sim N(0, \mathbf{I})$
$\mathrm{Type}_2(\mu, \delta)$  $\mathbf{x} = \mu + 1_{\delta} + \epsilon$, $\epsilon \sim N(0, \mathbf{I})$
$\mathrm{Type}_3(\mu, \rho)$    $\mathbf{x} \sim N(\mu, \Sigma_{p \times p})$
$\mathrm{Type}_4(\mu, \sigma)$  $\mathbf{x} \sim N(\mu, \sigma \mathbf{I}_{p \times p})$

Table 2: Scheme of the simulation data.

Simulation data   Generation model type                    Column index
$\mathbf{X}_1$    $\mathbf{x}_i = \mathrm{Type}_1(2.4)$     $1 \le i \le 5$
                  $\mathbf{x}_i = \mathrm{Type}_1(-2.6)$    $6 \le i \le 10$
                  $\mathbf{x}_i = \mathrm{Type}_2(1, 0.6)$  $11 \le i \le 40$
                  $\mathbf{x}_i = \mathrm{Type}_3(0, 0.8)$  $41 \le i \le 100$
$\mathbf{X}_2$    $\mathbf{x}_i = \mathrm{Type}_1(3)$       $1 \le i \le 5$
                  $\mathbf{x}_i = \mathrm{Type}_1(4)$       $6 \le i \le 10$
                  $\mathbf{x}_i = \mathrm{Type}_3(0, 0.9)$  $11 \le i \le 60$
                  $\mathbf{x}_i = \mathrm{Type}_4(2, 2)$    $61 \le i \le 200$
$\mathbf{X}_3$    $\mathbf{x}_i = \mathrm{Type}_1(5)$       $1 \le i \le 5$
                  $\mathbf{x}_i = \mathrm{Type}_1(-3)$      $6 \le i \le 10$
                  $\mathbf{x}_i = \mathrm{Type}_4(0, 1)$    $11 \le i \le 210$
                  $\mathbf{x}_i = \mathrm{Type}_3(0, 0.9)$  $211 \le i \le 300$
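The generation functions of Table 1 and the column layout of $\mathbf{X}_1$ in Table 2 can be sketched with the Python standard library. Several details are assumptions rather than statements from the paper: the indicator $1_{\delta}$ is drawn independently for every entry, the first-order autoregressive covariance is taken as $\Sigma_{ab} = \rho^{|a-b|}$, the first two means of $\mathbf{X}_1$ are read as 2.4 and -2.6, and all function names are hypothetical.

```python
import math
import random


def type1(mu, p, rng):
    """Type_1(mu): x = mu + eps, eps ~ N(0, I)."""
    return [mu + rng.gauss(0.0, 1.0) for _ in range(p)]


def type2(mu, delta, p, rng):
    """Type_2(mu, delta): x = mu + 1_delta + eps (indicator drawn per entry)."""
    return [
        mu + (1.0 if rng.random() <= delta else 0.0) + rng.gauss(0.0, 1.0)
        for _ in range(p)
    ]


def type3(mu, rho, p, rng):
    """Type_3(mu, rho): AR(1)-correlated normals with Cov = rho ** |a - b| (p >= 1)."""
    scale = math.sqrt(1.0 - rho * rho)
    z = rng.gauss(0.0, 1.0)
    row = [mu + z]
    for _ in range(p - 1):
        # The AR(1) recursion keeps unit variance and lag-1 correlation rho.
        z = rho * z + scale * rng.gauss(0.0, 1.0)
        row.append(mu + z)
    return row


def type4(mu, sigma, p, rng):
    """Type_4(mu, sigma): x ~ N(mu, sigma * I), where sigma is a variance."""
    sd = math.sqrt(sigma)
    return [mu + rng.gauss(0.0, sd) for _ in range(p)]


def simulate_x1_row(rng):
    """One sample (row) of X_1 following Table 2: 5 + 5 + 30 + 60 columns."""
    return (
        type1(2.4, 5, rng)
        + type1(-2.6, 5, rng)
        + type2(1.0, 0.6, 30, rng)
        + type3(0.0, 0.8, 60, rng)
    )


rng = random.Random(0)
x1 = [simulate_x1_row(rng) for _ in range(500)]  # N = 500 samples
print(len(x1), len(x1[0]))
```

The other two blocks would follow the same pattern with the column ranges listed for $\mathbf{X}_2$ and $\mathbf{X}_3$.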

The first three multiblocks ($\mathbf{X}_j \in \mathbb{R}^{N \times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by $\mathrm{Type}_1(2.4)$ and the following five columns by $\mathrm{Type}_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, $\mathrm{Type}_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated by the multicollinear random variables $\mathrm{Type}_3(0, 0.8)$. Then we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3} \mathbf{X}_j \mathbf{B}_j + \Xi$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\Xi$ is an $N \times P_4$-dimensional normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant to explain $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4 \times 1}$ (all zeros but the first ten), the probability of disease $\pi$ was computed by using

$$
\pi = \frac{\exp \left( \mathbf{X}_4 \mathbf{B}_4 \right)}{1 + \exp \left( \mathbf{X}_4 \mathbf{B}_4 \right)}
\tag{20}
$$

Then the binary class label block was generated using the Bernoulli distribution with the probability $\pi$.
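The label generation in (20) amounts to a logistic transform followed by a Bernoulli draw, as in the sketch below. Coding the two classes as 1 and -1 follows the label convention of (5) and is an assumption here, as are the function names and the toy numbers.

```python
import math
import random


def sigmoid(z):
    """Logistic function exp(z) / (1 + exp(z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def disease_probabilities(x4, b4):
    """Probability pi from (20) for every sample (row) of X_4."""
    return [sigmoid(sum(x * b for x, b in zip(row, b4))) for row in x4]


def draw_labels(probabilities, rng):
    """Bernoulli draw per sample: 1 (disease) with probability pi, else -1."""
    return [1 if rng.random() < p else -1 for p in probabilities]


x4 = [[0.0, 0.0, 3.0], [2.0, 1.0, -1.0]]   # two samples, three variables
b4 = [1.0, 1.0, 0.0]                        # only the first two variables matter
probs = disease_probabilities(x4, b4)
labels = draw_labels(probs, random.Random(0))
print(probs, labels)
```

The third variable has a zero coefficient, so it does not influence the disease probability no matter how large its value is.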

The simulation study was repeated with 50 replications to assess the reproducibility. We compared the performance of MultiDA with the related methods Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between the independent variable $\mathbf{X}$ and the response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = [\mathbf{X}_1\ \mathbf{X}_2\ \mathbf{X}_3]$) and the gene expression (GE) block was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined the performance by how well the methods correctly identify significant factors of the integrative association model. Given a ground truth, we computed a confusion matrix and measured True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the true negatives are far more numerous than the false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA (0.93 ± 0.03) and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This supports that the multiblock methods reduce false negatives, which incorrectly identify significant variables as insignificant. MultiDA showed the best performance in PPV and ACCU, producing 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values represent fewer false positives, which incorrectly identify insignificant variables as significant. The PPV and ACCU of SCCA were 0.48 ± 0.15 and 0.89 ± 0.14, and those of SGCCA were 0.54 ± 0.08 and 0.94 ± 0.01, respectively.
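The three measures can be computed from the set of selected variables and the set of truly significant variables, as in the following sketch. The function name, the index-set representation, and the example numbers are illustrative assumptions and do not reproduce the reported results.

```python
def selection_metrics(selected, truth, n_variables):
    """TPR, PPV, and ACCU for a variable selection against a ground truth."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)        # significant and selected
    fp = len(selected - truth)        # insignificant but selected
    fn = len(truth - selected)        # significant but missed
    tn = n_variables - tp - fp - fn   # insignificant and not selected
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "PPV": tp / (tp + fp) if tp + fp else 0.0,
        "ACCU": (tp + tn) / n_variables,
    }


truth = range(10)                             # first ten variables are significant
selected = list(range(9)) + [50, 51, 52]      # nine hits, three false positives
metrics = selection_metrics(selected, truth, n_variables=100)
print(metrics)
```

Because true negatives dominate in a sparse setting, ACCU stays high even when PPV is moderate, which is why PPV is the more informative of the two for judging false positives.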

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained from patients with three major psychiatric disorders, namely schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as from a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 control samples were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in the preparation of this experiment. For each individual, 10760 SNPs (after removing highly correlated ones), 1028 CNVs, 20769 DNA methylations, and 19767 gene expressions were examined. Because recent research reported that genetic effects may be largely shared among major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminative factors against the control [25, 26].

The multiblock data were analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, among which high correlation between the connections was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the selected CNVs and DNA methylations were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of each gene and the literature regarding the psychiatric disorders are described.

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among a number of the retrieved interactions, we take note of one gene


Table 3: The gene results from MultiDA with psychiatric disorders.

Gene       Chromosome   Location    Source   ID           MAF    Reference
HTR7       10           10q21-q24   GE       7934970             [28]
APOE       19           19q13.2     DM       cg14123992          [29]
TRPM1      15           15q13.3     DM       cg18085517
EPHB1      3            3q21-q23    CNV      CNP12652
NPY        7            7p15.1      CNV      CNP2267             [30]
QKI        6            6q26        SNP      rs1336225    0.18
SLC15A1    13           13q32.3     SNP      rs9517421    0.17   [31]
NPAS3      14           14q13.1     SNP      rs1124910    0.25   [32]
C15orf53   15           15q14       SNP      rs1433876    0.29   [33]

Figure 2: Performance comparison of SCCA, SGCCA, and MultiDA in the simulation study. (a) True Positive Rate. (b) Positive Predictive Value. (c) Accuracy.


Figure 3: The gene regulatory network searched with the gene results by the STRING database (nodes: CES1, HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, and AKR1D1). The legend shows the data source of each gene (gene expression, SNP, CNV, or DNA methylation).

regulatory network illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation, impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) encodes a receptor for a major neurotransmitter in the central nervous system, and a number of studies relating it to bipolar disorder and schizophrenia have been reported [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while previous research reported the gene with GWAS on the SNP data block. The gene may have strong interactions with other heterogeneous data, which is why it is considered significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction, where NPY is also a neurotransmitter in the brain and is known to play an important role in emotional processing [30]. A large number of psychiatric disorder susceptibility genes were associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may potentially be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and that it indirectly induces susceptibility to human mood disorders. Our result supports their claim.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis (MultiDA) method in order to dissect the mechanism of complex human disease using multiple genetic data. A genomic association study with a single data type may fall short of identifying the mechanisms of diseases. In contrast, MultiDA enables comprehensive analysis using multiple genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it detects discriminative factors in the integrative genomic model. In the simulation experiments, the proposed method matched SGCCA in TPR and achieved the highest PPV and ACCU among the compared methods. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. Among the large number of variables in each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures the global profile of the mechanism that conventional single-block or two-block methods fail to detect. This promising tool for integrative genomic study can be flexibly extended to new types of data as new high-throughput technologies supersede existing ones.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, "Copy number variants, diseases and gene expression," Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, "Revealing the architecture of gene regulation: the promise of eQTL studies," Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, "Epigenetic inheritance and the missing heritability problem," Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk, et al., "Copy number variation: new insights in genome diversity," Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, "Human copy number variation and complex genetic disease," Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, "Cancer epigenetics: modifications, screening, and therapy," Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, "DNA methylation and its basic function," Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer, et al., "Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors," Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, "The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts," Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas, et al., "Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations," PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson, et al., "An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits," PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets," Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, "Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma," American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data," Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild, et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu, et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson, et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin, et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar, et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


BioMed Research International 5

solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:

$$\alpha_{j(\imath)} = \operatorname{sign}\left(\chi_{j(\imath)}^{\top}\upsilon_j\right)\left(\left|\chi_{j(\imath)}^{\top}\upsilon_j\right| - \lambda_j\right)_{+}, \tag{13}$$

where $\operatorname{sign}(x)$ returns the sign of $x$, that is, 1 if $x \ge 0$ or $-1$ otherwise, and $(x)_{+}$ returns only positive values of $x$ (i.e., $x$ if $x \ge 0$ or 0 otherwise). $\lambda_j$ can be obtained by $K$-fold cross-validation that minimizes mean squared errors. The parameter $z_j$ can be ignored because the solution of $\alpha_j$ is normalized by (10):

$$\alpha_j = \frac{\sqrt{N}\,\alpha_j}{\left\|\chi_j\alpha_j\right\|}. \tag{14}$$
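The update in (13) followed by the rescaling in (14) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `soft_threshold` and `ust_loading` are hypothetical, `X` stands for one block $\chi_j$ with samples in rows, `v` stands for $\upsilon_j$, and the zero-vector guard is an addition that the text does not discuss.

```python
import numpy as np

def soft_threshold(z, lam):
    """Element-wise sign(z) * (|z| - lam)_+ as in (13).

    np.sign(0) is 0 rather than 1, but the product is 0 either way
    whenever lam >= 0, so the result agrees with the definition in the text.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ust_loading(X, v, lam):
    """Threshold X^T v column by column, then rescale so ||X a|| = sqrt(N)."""
    a = soft_threshold(X.T @ v, lam)
    norm = np.linalg.norm(X @ a)
    if norm == 0.0:
        # Assumption: if every coefficient is shrunk to zero, return it as is.
        return a
    return np.sqrt(X.shape[0]) * a / norm

# Small usage example on synthetic data (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
v = X[:, 0] + 0.1 * rng.standard_normal(50)
alpha = ust_loading(X, v, lam=5.0)
print(alpha)
```

Larger values of `lam` set more entries of the loading vector exactly to zero, which is what makes the selected variables easy to read off.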

For the discriminant analysis between the gene expression and disease data blocks, the optimum of the slack variable $\mathbf{m}$ and the loading vector $\alpha_4$ can be estimated by solving the following optimization problem:

$$\underset{\alpha_4,\,\mathbf{m}}{\arg\min}\ \frac{1}{2}\left\|\chi_4\alpha_4 - (\upsilon_5 + \mathbf{b}\odot\mathbf{m})\right\|^2 \quad \text{s.t. } |\alpha_4| \le \xi_1,\ \|\alpha_4\|^2 \le \xi_2. \tag{15}$$

The Lagrangian function of (15) is $\mathcal{L} = (1/2)\|\chi_4\alpha_4 - \upsilon_5 - \mathbf{b}\odot\mathbf{m}\|^2 + \lambda_4|\alpha_4| + ((1-\lambda_4)/2)\|\alpha_4\|^2$. The derivative of the Lagrangian function with respect to $\alpha_4$ is

$$\frac{\partial\mathcal{L}}{\partial\alpha_4} = \chi_4^{\top}\chi_4\alpha_4 - \chi_4^{\top}\gamma + \lambda_4\mathbf{s} + (1-\lambda_4)\alpha_4 = 0, \tag{16}$$

where $\mathbf{s}$ is the sign of $\alpha_4$ and $\gamma = \upsilon_5 + \mathbf{b}\odot\mathbf{m}$. Thus, the equation of $\alpha_4$ becomes

$$\alpha_4 = \left(\chi_4^{\top}\chi_4 + 1 - \lambda_4\right)^{-1}\left(\chi_4^{\top}\gamma - \lambda_4\mathbf{s}\right). \tag{17}$$

Finally, the optimal solution of $\alpha_4$ for the discriminative analysis is

$$\alpha_{4(\imath)} = \operatorname{sign}\left(\chi_{4(\imath)}^{\top}\gamma\right)\left(\left|\chi_{4(\imath)}^{\top}\gamma\right| - \lambda_4\right)_{+}. \tag{18}$$

$\lambda_4$ is also determined by $K$-fold cross-validation that minimizes mean squared errors, like the other $\lambda_j$'s. The optimal solution of $\mathbf{m}$ is simply derived from [19]:

$$\mathbf{m} = \max\left(\mathbf{b}\odot(\chi_4\alpha_4 - \upsilon_5),\ 0\right). \tag{19}$$
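One alternating pass over (18) and (19) might look like the sketch below. It rests on several assumptions that are not fixed by the text here: `b` is taken to be a vector of $+1/-1$ directions, one per sample, as in discriminative least squares regression [19]; `v5` stands for $\upsilon_5$; the name `discriminant_step` is hypothetical; and the subsequent normalization of $\alpha_4$ is left out for brevity.

```python
import numpy as np

def soft_threshold(z, lam):
    """Element-wise sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def discriminant_step(X4, v5, b, m, lam4):
    """One pass: update alpha_4 by (18), then the slack vector m by (19)."""
    gamma = v5 + b * m                            # gamma = v5 + b ⊙ m
    alpha4 = soft_threshold(X4.T @ gamma, lam4)   # equation (18)
    m_new = np.maximum(b * (X4 @ alpha4 - v5), 0.0)  # equation (19)
    return alpha4, m_new

# Tiny worked example with arbitrary numbers.
X4 = 2.0 * np.eye(3)
v5 = np.array([1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0])   # assumed +1 / -1 directions
alpha4, m = discriminant_step(X4, v5, b, m=np.zeros(3), lam4=0.5)
print(alpha4, m)
```

Because of the element-wise maximum, `m` is nonnegative by construction, so the target $\upsilon_5 + \mathbf{b}\odot\mathbf{m}$ can only move in the direction given by `b`.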

The algorithm is summarized in Algorithm 1. In the algorithm, $r$ represents the rank of the subspace, which determines the dimension of the subspace. For instance, $\alpha_j^r$ is the $r$th rank of $\alpha_j$. MultiDA optimizes the first rank subspace and iterates the optimization until no information remains in the multiblock data. In lines 10–14 of Algorithm 1, Wold's procedure guarantees the convergence [22].

(1) For all blocks, normalize the loading vectors: $\alpha_j^0 = \sqrt{N}\,\alpha_j^0 / \|\chi_j\alpha_j^0\|$
(2) $r = 1$
(3) repeat
(4)   for $j = 1$ to $J$ do
(5)     for $k = 1$ to $J$ do
(6)       if block $k$ is binary class data then
(7)         estimate $\mathbf{m}$ and $\alpha_j$ by (18) and (19)
(8)         update $\chi_k = \mathbf{X}_k + \mathbf{b}\odot\mathbf{m}$
(9)       end if
(10)      if $k < j$ then
(11)        $\upsilon_j = \sum_{k=1,\,k\ne j}^{J} d_{jk}\left(\alpha_j^{r\top}\chi_j^{r\top}\chi_k^{r}\alpha_k^{r+1}\right)\chi_k^{r}\alpha_k^{r+1}$
(12)      else if $k > j$ then
(13)        $\upsilon_j = \sum_{k=1,\,k\ne j}^{J} d_{jk}\left(\alpha_j^{r\top}\chi_j^{r\top}\chi_k^{r}\alpha_k^{r}\right)\chi_k^{r}\alpha_k^{r}$
(14)      end if
(15)      Compute $\alpha_j^{r+1}$ by UST: $\alpha_{j(\imath)}^{r+1} = \operatorname{sign}\left(\chi_{j(\imath)}^{\top}\upsilon_j\right)\left(\left|\chi_{j(\imath)}^{\top}\upsilon_j\right| - \lambda_j\right)_{+}$
(16)      Normalize $\alpha_j^{r+1}$: $\alpha_j^{r+1} = \sqrt{N}\,\alpha_j^{r+1} / \|\chi_j\alpha_j^{r+1}\|$
(17)      $r = r + 1$
(18)    end for
(19)  end for
(20) until $\sum_{j=1}^{J}\alpha_j^{r}$ converges

Algorithm 1: Discriminant multiblock analysis.
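The inner component $\upsilon_j$ computed in steps (11) and (13) can be sketched as a single function. The sketch makes assumptions beyond what Algorithm 1 states: `D` is a hypothetical name for a matrix holding the weights $d_{jk}$ (taken here as 1 for connected blocks and 0 otherwise), and the list `alphas` is assumed to hold the already updated loadings for blocks $k < j$ and the previous loadings for blocks $k > j$, so one function covers both branches.

```python
import numpy as np

def inner_component(j, blocks, alphas, D):
    """v_j = sum over k != j of d_jk * (a_j^T X_j^T X_k a_k) * X_k a_k."""
    y_j = blocks[j] @ alphas[j]          # score vector of block j
    v = np.zeros(blocks[j].shape[0])
    for k in range(len(blocks)):
        if k == j or D[j, k] == 0:
            continue                     # skip self and unconnected blocks
        y_k = blocks[k] @ alphas[k]      # score vector of block k
        v += D[j, k] * (y_j @ y_k) * y_k
    return v

# Two toy blocks with three samples each (arbitrary numbers).
blocks = [np.eye(3)[:, :2], np.ones((3, 2))]
alphas = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
D = np.array([[0, 1], [1, 0]])
print(inner_component(0, blocks, alphas, D))
```

Keeping `alphas` as one list that is overwritten in place is what gives the update its sequential character: block $j$ always sees the newest available loadings of the other blocks.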

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression relative to normal controls.

3.1. Simulation Study. We assessed the performance of the proposed method, MultiDA, through simulated data. Simulation data of various complexities were considered. The generation schemes of the simulation data were extended from previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. $\mathrm{Type}_1(\mu)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\mathbf{I}_{p\times p}$), where $\mathbf{I}_{p\times p}$ is a $p \times p$ identity matrix. $\mathrm{Type}_2(\mu, \delta)$ generates more complicated data than $\mathrm{Type}_1(\mu)$. In $\mathrm{Type}_2(\mu, \delta)$, a random model with a threshold ($\delta$) is implemented with the function $\mathbb{1}_{\delta}$: given a uniformly distributed random value ($u$), $\mathbb{1}_{\delta} = 1$ if $u \le \delta$, or 0 otherwise. $\mathrm{Type}_3(\mu, \rho)$ considers multicollinear data in which more than two variables are highly correlated. The matrix data are generated by the multivariate normal distribution $\mathcal{N}(\mu, \Sigma_{p\times p})$, where the covariance structure $\Sigma_{p\times p}$ is built by a first-order autoregressive process. $\mathrm{Type}_4(\mu, \sigma)$ generates $p$-dimensional normally distributed random variables with a given mean ($\mu$) and variance ($\sigma$).


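Table 1: Generation functions.

| Function | Model |
|---|---|
| $\mathrm{Type}_1(\mu)$ | $\mathbf{x} = \mu + \epsilon,\ \epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\mathrm{Type}_2(\mu, \delta)$ | $\mathbf{x} = \mu + \mathbb{1}_{\delta} + \epsilon,\ \epsilon \sim \mathcal{N}(0, \mathbf{I})$ |
| $\mathrm{Type}_3(\mu, \rho)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma_{p\times p})$ |
| $\mathrm{Type}_4(\mu, \sigma)$ | $\mathbf{x} \sim \mathcal{N}(\mu, \sigma\mathbf{I}_{p\times p})$ |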

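Table 2: Scheme of the simulation data.

| Simulation data | Generation model type | Column index |
|---|---|---|
| $\mathbf{X}_1$ | $\mathbf{x}_{\imath} = \mathrm{Type}_1(2.4)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_1(-2.6)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_2(1, 0.6)$ | $11 \le \imath \le 40$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_3(0, 0.8)$ | $41 \le \imath \le 100$ |
| $\mathbf{X}_2$ | $\mathbf{x}_{\imath} = \mathrm{Type}_1(3)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_1(4)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_3(0, 0.9)$ | $11 \le \imath \le 60$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_4(2, 2)$ | $61 \le \imath \le 200$ |
| $\mathbf{X}_3$ | $\mathbf{x}_{\imath} = \mathrm{Type}_1(5)$ | $1 \le \imath \le 5$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_1(-3)$ | $6 \le \imath \le 10$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_4(0, 1)$ | $11 \le \imath \le 210$ |
| | $\mathbf{x}_{\imath} = \mathrm{Type}_3(0, 0.9)$ | $211 \le \imath \le 300$ |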
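The four generation functions and the column scheme of the first block can be sketched as follows. Several details are assumptions, because the text does not spell them out: the AR(1) covariance is taken in its common unit-variance form $\Sigma_{ab} = \rho^{|a-b|}$, the threshold indicator is drawn independently for every entry, and all function names (`type1` to `type4`, `ar1_cov`, `build_X1`) are hypothetical.

```python
import numpy as np

def type1(mu, n, p, rng):
    """x = mu + eps, eps ~ N(0, I)."""
    return mu + rng.standard_normal((n, p))

def type2(mu, delta, n, p, rng):
    """x = mu + 1_delta + eps, with 1_delta = 1 when a uniform draw u <= delta."""
    indicator = (rng.uniform(size=(n, p)) <= delta).astype(float)
    return mu + indicator + rng.standard_normal((n, p))

def ar1_cov(p, rho):
    """Assumed AR(1) covariance: Sigma[a, b] = rho ** |a - b|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def type3(mu, rho, n, p, rng):
    """x ~ N(mu, Sigma) with the AR(1) covariance above."""
    return rng.multivariate_normal(np.full(p, mu), ar1_cov(p, rho), size=n)

def type4(mu, sigma, n, p, rng):
    """x ~ N(mu, sigma * I); sigma is treated as a variance, as in the text."""
    return mu + np.sqrt(sigma) * rng.standard_normal((n, p))

def build_X1(n, rng):
    """Columns 1-5, 6-10, 11-40, and 41-100 of X_1 following Table 2."""
    return np.hstack([
        type1(2.4, n, 5, rng),
        type1(-2.6, n, 5, rng),
        type2(1.0, 0.6, n, 30, rng),
        type3(0.0, 0.8, n, 60, rng),
    ])

X1 = build_X1(500, np.random.default_rng(0))
print(X1.shape)
```

The other two blocks follow the same pattern with their own rows of Table 2.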

The first three multiblocks ($\mathbf{X}_j \in \mathbb{R}^{N\times P_j}$, $1 \le j \le 3$) were simulated by compounding the generation functions as defined in Table 2, where $P_1 = 100$, $P_2 = 200$, $P_3 = 300$, and $N = 500$. For instance, the first five columns of $\mathbf{X}_1$ were generated by $\mathrm{Type}_1(2.4)$ and the following five columns by $\mathrm{Type}_1(-2.6)$. The next 30 columns were generated by the generation model with a threshold, $\mathrm{Type}_2(1, 0.6)$. The remaining columns of $\mathbf{X}_1$ were generated by the multicollinear random variables, $\mathrm{Type}_3(0, 0.8)$. Then we considered the multiblock linear model $\mathbf{X}_4 = \sum_{j=1}^{3}\mathbf{X}_j\mathbf{B}_j + \Xi$, where $\mathbf{B}_j$ is a $P_j \times P_4$ loading matrix and $\Xi$ is an $N \times P_4$ normally distributed noise matrix ($P_4 = 50$). We assumed that only the first ten variables of each block are significant to explain $\mathbf{X}_4$. The fifth block $\mathbf{X}_5$ is the class label block. Given a coefficient vector $\mathbf{B}_4 \in \mathbb{R}^{P_4\times 1}$ (all zeros but the first ten), the probability of disease $\pi$ was computed by using

$$\pi = \frac{\exp(\mathbf{X}_4\mathbf{B}_4)}{1 + \exp(\mathbf{X}_4\mathbf{B}_4)}. \tag{20}$$

Then the binary class label block was generated using the Bernoulli distribution with the probability $\pi$.
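The response block and the class labels described above could be simulated along the following lines. The sketch is illustrative only: the noise is assumed to be standard normal, the nonzero loadings are drawn from a standard normal distribution because their values are not reported, and the function names `simulate_response` and `simulate_labels` are hypothetical.

```python
import numpy as np

def simulate_response(blocks, loadings, rng):
    """X4 = sum_j X_j B_j + Xi, with Xi an N x P4 standard normal noise matrix."""
    X4 = sum(X @ B for X, B in zip(blocks, loadings))
    return X4 + rng.standard_normal(X4.shape)

def simulate_labels(X4, B4, rng):
    """Equation (20) followed by one Bernoulli draw per sample."""
    eta = X4 @ B4
    pi = 1.0 / (1.0 + np.exp(-eta))   # equal to exp(eta) / (1 + exp(eta))
    return pi, rng.binomial(1, pi)

rng = np.random.default_rng(0)
N, P4 = 500, 50
sizes = [100, 200, 300]
blocks = [rng.standard_normal((N, p)) for p in sizes]   # stand-ins for X1..X3

# Only the first ten variables of each block carry signal (assumed values).
loadings = []
for p in sizes:
    B = np.zeros((p, P4))
    B[:10, :] = rng.standard_normal((10, P4))
    loadings.append(B)

X4 = simulate_response(blocks, loadings, rng)
B4 = np.zeros(P4)
B4[:10] = rng.standard_normal(10)                        # assumed values
pi, labels = simulate_labels(X4, B4, rng)
print(X4.shape, labels[:10])
```

Writing the logistic function as `1 / (1 + exp(-eta))` gives the same value as (20) and avoids dividing two very large numbers when `eta` is large and positive.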

The simulation study was run with 50 replications to assess reproducibility. We compared the performance of MultiDA with the related methods Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between the independent variable $\mathbf{X}$ and the response variable $\mathbf{Y}$. In SCCA, the three blocks of data were combined into a single block ($\mathbf{X} = [\mathbf{X}_1\ \mathbf{X}_2\ \mathbf{X}_3]$) and the gene expression (GE) block was considered as the response ($\mathbf{Y} = \mathbf{X}_4$). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix $\mathbf{C}$ was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined the performance by how well the methods correctly identify significant factors of the integrative association model. Given a ground truth, we computed a confusion matrix and measured True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the true negatives are much larger in number than the false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA (0.93 ± 0.03) and the multiblock method SGCCA (0.93 ± 0.03) outperformed SCCA (0.83 ± 0.24) in terms of TPR. This supports that the multiblock methods reduce false negatives, that is, significant factors incorrectly identified as insignificant. MultiDA showed the best performance in PPV and ACCU, producing 0.58 ± 0.07 and 0.95 ± 0.01, respectively. Higher PPV values represent fewer false positives, that is, insignificant factors incorrectly identified as significant. The PPV and ACCU of SCCA were 0.48 ± 0.15 and 0.89 ± 0.14, and those of SGCCA were 0.54 ± 0.08 and 0.94 ± 0.01, respectively.
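For concreteness, the three measures can be computed from a selected-variable mask and a ground-truth mask as in the sketch below. The function name `selection_metrics` and the zero-denominator convention are assumptions made for the example; the numbers in the usage lines are invented and are not results from the study.

```python
import numpy as np

def selection_metrics(selected, truth):
    """Return (TPR, PPV, ACCU) for boolean masks over the same variables."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(selected & truth)      # significant and selected
    fp = np.sum(selected & ~truth)     # insignificant but selected
    fn = np.sum(~selected & truth)     # significant but missed
    tn = np.sum(~selected & ~truth)    # insignificant and not selected
    # Assumed convention: report 0.0 when a denominator is empty.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    accu = (tp + tn) / selected.size
    return float(tpr), float(ppv), float(accu)

# Toy case: 100 variables, the first ten are truly significant.
truth = np.zeros(100, dtype=bool)
truth[:10] = True
selected = np.zeros(100, dtype=bool)
selected[:9] = True        # nine true positives, one miss
selected[20:26] = True     # six false positives
print(selection_metrics(selected, truth))
```

In this toy case the 84 true negatives dominate the count, which illustrates why accuracy stays high in sparse settings even when PPV is moderate.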

3.2. Human Brain Data of Schizophrenia. Human brain data were obtained from patients with three major psychiatric disorders, schizophrenia (SZ), bipolar disorder (BP), and major depression (DP), as well as from a control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 control samples were provided by the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in the preparation of this experiment. For each individual, 10,760 SNPs (after removing highly correlated ones), 1,028 CNVs, 20,769 DNA methylations, and 19,767 gene expressions were examined. Because recent research reported that genetic effects may be largely shared across major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminant factors against the control [25, 26].

The multiblock data were analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, among which high correlation between the connected blocks was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the selected CNVs and DNA methylations were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of each gene and the literature regarding the psychiatric disorders are described.

The gene regulatory network of the genes from the result was searched in the STRING database [27]. Among a number of the retrieved interactions, we take note of one gene


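Table 3: The gene results from MultiDA with psychiatric disorders.

| Gene | Chromosome | Location | Source | ID | MAF | Reference |
|---|---|---|---|---|---|---|
| HTR7 | 10 | 10q21-q24 | GE | 7934970 | | [28] |
| APOE | 19 | 19q13.2 | DM | cg14123992 | | [29] |
| TRPM1 | 15 | 15q13.3 | DM | cg18085517 | | |
| EPHB1 | 3 | 3q21-q23 | CNV | CNP12652 | | |
| NPY | 7 | 7p15.1 | CNV | CNP2267 | | [30] |
| QKI | 6 | 6q26 | SNP | rs1336225 | 0.18 | |
| SLC15A1 | 13 | 13q32.3 | SNP | rs9517421 | 0.17 | [31] |
| NPAS3 | 14 | 14q13.1 | SNP | rs1124910 | 0.25 | [32] |
| C15orf53 | 15 | 15q14 | SNP | rs1433876 | 0.29 | [33] |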

Figure 2: Performance comparison of SCCA, SGCCA, and MultiDA in the simulation study. (a) True Positive Rate. (b) Positive Predictive Value. (c) Accuracy.

Figure 3: The gene regulatory network searched with the gene results by STRING database. The legend shows the data source of the gene (gene expression, SNP, CNV, or DNA methylation).

regulatory network illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation by impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) encodes a receptor for 5-hydroxytryptamine, a major neurotransmitter in the central nervous system, and a number of studies have related it to bipolar disorder and schizophrenia [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while previous studies reported the gene with GWAS on the SNP data block. The gene may have strong interactions with other heterogeneous data, and it is consequently considered to be significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction; the NPY gene also encodes a neurotransmitter in the brain and is known to play an important role in the emotional process [30]. A large number of psychiatric disorder susceptibility genes were associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may potentially be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and that it indirectly induces susceptibility to human mood disorders. Our result supports their claim.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis (MultiDA) method in order to dissect the mechanism of complex human disease using multiple genetic data. A genomic association study with a single type of data may fall short of identifying the mechanisms of diseases. In contrast, MultiDA enables comprehensive analysis using multiple genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it detects discriminative factors in the integrative genomic model. In the simulation experiments, MultiDA matched SGCCA in TPR and achieved the highest PPV and accuracy among the compared methods. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. Among the large number of variables in each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures the global profile of the mechanism that conventional single- or two-block methods fail to detect. This promising tool for integrative genomic study can be flexibly extended to new types of data in the era of new high-throughput technologies.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, “Genome-wide association studies for common diseases and complex traits,” Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, “Copy number variants, diseases and gene expression,” Human Molecular Genetics, vol. 18, no. 1, pp. R1–R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, “Revealing the architecture of gene regulation: the promise of eQTL studies,” Trends in Genetics, vol. 24, no. 8, pp. 408–415, 2008.

[4] M. Slatkin, “Epigenetic inheritance and the missing heritability problem,” Genetics, vol. 182, no. 3, pp. 845–850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk, et al., “Copy number variation: new insights in genome diversity,” Genome Research, vol. 16, no. 8, pp. 949–961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, “Human copy number variation and complex genetic disease,” Annual Review of Genetics, vol. 45, pp. 203–226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, “Cancer epigenetics: modifications, screening, and therapy,” Annual Review of Medicine, vol. 59, pp. 267–280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, “DNA methylation and its basic function,” Neuropsychopharmacology, vol. 38, no. 1, pp. 23–38, 2013.

[9] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, pp. 1061–1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer, et al., “Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors,” Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, “The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts,” Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas, et al., “Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations,” PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson, et al., “An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits,” PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, “Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets,” Genome Research, vol. 22, no. 2, pp. 386–397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, “Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma,” American Journal of Human Genetics, vol. 92, no. 1, pp. 126–130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, “Identifying multi-layer gene regulatory modules from multi-dimensional genomic data,” Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458–2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, “Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders,” in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, “Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA,” Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative least squares regression for multiclass classification and feature selection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, “Regularized generalized canonical correlation analysis,” Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, “PLS path modelling: computation of latent variables with the estimation mode B,” Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, “A sparse PLS for variable selection when integrating omics data,” Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, “Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis,” Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, “A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder,” Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, “Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis,” The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild, et al., “STRING v9.1: protein-protein interaction networks, with increased coverage and integration,” Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, “Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder,” Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, “ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype,” Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, “The NPY system in stress, anxiety and depression,” Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu, et al., “Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5),” BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson, et al., “Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder,” Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin, et al., “The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?” Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, “Functions of DNA methylation: islands, start sites, gene bodies and beyond,” Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar, et al., “Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder,” Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


6 BioMed Research International

Table 1 Generation functions

Function ModelType1(120583) x = 120583 + 120598 120598 simN(0 I)

Type2(120583 120575) x = 120583 + 1

120575+ 120598 120598 simN(0 I)

Type3(120583 120588) x simN(120583Σ

119901times119901)

Type4(120583 120590) x simN(120583 120590I

119901times119901)

Table 2 Scheme of the simulation data

Simulation data Generation model type Column index

X1

x119894= Type

1(24) 1 le 120484 le 5

x119894= Type

1(minus26) 6 le 120484 le 10

x119894= Type

2(1 06) 11 le 120484 le 40

x119894= Type

3(0 08) 41 le 120484 le 100

X2

x119894= Type

1(3) 1 le 120484 le 5

x119894= Type

1(4) 6 le 120484 le 10

x119894= Type

3(0 09) 11 le 120484 le 60

x119894= Type

4(2 2) 61 le 120484 le 200

X3

x119894= Type

1(5) 1 le 120484 le 5

x119894= Type

1(minus3) 6 le 120484 le 10

x119894= Type

4(0 1) 11 le 120484 le 210

x119894= Type

3(0 09) 211 le 120484 le 300

The first three multiblocks (X119895isin R119873times119875119895 1 le 119895 le 3)

were simulated by compounding the generation functions asdefined in Table 2 where 119875

1= 100 119875

2= 200 119875

3= 300

and119873 = 500 For instance the first five columns of X1were

generated by Type1(24) and the following five columns were

by Type1(minus26) The next 30 columns were generated by the

generationmodel with a threshold Type2(1 06)The remain-

ing columns of X1were generated by the multicollinearity

random variables Type3(0 08) Then we considered the

multiblock linear model X4= sum3

119895=1X119895B119895+ Ξ where B

119895is a

119875119895times1198754loadingmatrix andΞ is a119875

119895times1198754dimensional normally

distributed noise matrix (1198754= 50) We assumed that only

the first ten variables of each block are significant to explainX4 The fifth block X

5is class label block Given a coefficient

vector B4isin R1198754times1 (all zeros but the first ten) the probability

of disease 120587 was computed by using

120587 =exp (X

4B4)

1 + exp (X4B4) (20)

Then the binary class label block was generated using theBernoulli distribution with the probability 120587

The simulation study was examined with 50 replicationsto assess the reproducibility We compared the performanceof MultiDA with the related methods Sparse CanonicalCorrelation Analysis (SCCA) [24] and Sparse GeneralizedCanonical Correlation Analysis (SGCCA) [17] SCCA is atwo-block method that maximizes the correlation betweenindependentX and response variableY In SCCA the threeblocks of data were combined into a single block (X =

X1X2X3) and the block GE was considered as response

(Y = X4) The class label block was not considered in SCCA

The multiblock method SGCCA was tuned to be compatible

with the proposed integrative genomic model Note that thesame matrixCwas used in SGCCA but SGCCA did not takethe discriminant analysis into account

We examined the performance by howwell they correctlyidentify significant factors of the integrative associationmodel Given a ground truth we computed a confusion ma-trix and measured True Positive Rate (TPR) Positive Pre-dictive Value (PPV) and Accuracy (ACCU) In the sparsesetting the true negatives are relatively much larger thanfalse positives Therefore True Negative Rates (TNR) andNegative Predictive Values (NPV) were not included inthis paper The results of the simulation experiment areillustrated in Figure 2The proposedmethodMultiDA (093plusmn003) and the multiblock method SGCCA (093 plusmn 003)outperformed SCCA (083 plusmn 024) in terms of TPR Itsupports that the multiblock methods reduce false negativesthat incorrectly identify the significant as the insignificantMultiDA appeared as the best performance in PPV andACCUMultiDA produced 058plusmn007 and 095plusmn001 for PPVand ACCU respectively Higher PPV values represent lowerfalse positives that incorrectly identify the insignificant as thesignificantThePPV andACCUof SCCAwere 048plusmn015 and089 plusmn 014 and were 054 plusmn 008 and 094 plusmn 001 for SGCCArespectively

32 Human Brain Data of Schizophrenia Human brain datawere obtained from three major psychiatric disorders suchas schizophrenia (SZ) bipolar disorder (BP) and majordepression (DP) as well as from control group Specifically39 samples of SZ 35 samples of BP 12 samples of DPand 43 samples of control were provided from the StanleyMedical Research Institute SNP CNV DNA methylationand gene expression data were acquired from the humanprefrontal cortex of the 129 samples in the preparation of thisexperiment For each individual 10760 SNPs after removinghighly correlated ones 1028 CNVs 20769 DNA methyla-tions and 19767 gene expressions were examined Due tothe recent research that reported that genetic effects may belargely shared in major psychiatric disorders such as autismspectrum disorder attention deficit-hyperactivity disorderbipolar disorder major depressive disorder and schizophre-nia we considered those psychiatric diseases together andperformed MultiDA to identify discriminate factors againstthe control [25 26]

Themultiblock data was analyzed byMultiDA As a resultof the analysis 78 SNPs 30 CNVs 47DNAmethylations and35 genes were detected where the high correlation betweenthe connections was found The potential gene markers ofthe psychiatric disorders were inferred from the result ofthe proposed method The genes physically located near theselected SNPs and the genes corresponding to the result ofCNV and the DNA methylation were chosen Significantlyobserved genes among the results of MultiDA are listed inTable 3 where the data source of the gene and literatureregarding the psychiatric disorders are described

The gene regulatory network of the genes from the resultwas searched by STRING database [27] Among a numberof the retrieved interactions we take note of one gene

BioMed Research International 7

Table 3 The gene results fromMultiDA with psychiatric disorders

Gene Chromosome Location Source ID MAF ReferenceHTR7 10 10q21-q24 GE 7934970 [28]APOE 19 19q132 DM cg14123992 [29]TRPM1 15 15q133 DM cg18085517EPHB1 3 3q21-q23 CNV CNP12652NPY 7 7p151 CNV CNP2267 [30]QKI 6 6q26 SNP rs1336225 018SLC15A1 13 13q323 SNP rs9517421 017 [31]NPAS3 14 14q131 SNP rs1124910 025 [32]C15orf53 15 15q14 SNP rs1433876 029 [33]

Figure 2: Performance comparison of SCCA, SGCCA, and MultiDA in the simulation study: (a) true positive rate, (b) positive predictive value, (c) accuracy.
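For reference, the three measures named in Figure 2 can be computed as in the following minimal sketch. It assumes the measures are evaluated on variable selection, that is, on the set of variables a method selects versus the set of truly associated variables; the function name and the toy index sets are hypothetical and are not taken from the simulation study.

```python
def selection_metrics(true_vars, selected_vars, n_total):
    """True positive rate, positive predictive value, and accuracy of a selection.

    true_vars:     indices of variables that are truly associated
    selected_vars: indices of variables picked by the method
    n_total:       total number of candidate variables
    """
    true_vars, selected_vars = set(true_vars), set(selected_vars)
    tp = len(true_vars & selected_vars)   # correctly selected
    fp = len(selected_vars - true_vars)   # selected but not associated
    fn = len(true_vars - selected_vars)   # associated but missed
    tn = n_total - tp - fp - fn           # correctly left out
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    accuracy = (tp + tn) / n_total
    return {"TPR": tpr, "PPV": ppv, "accuracy": accuracy}


# Toy example: 10 candidate variables, 4 truly associated, 3 selected.
metrics = selection_metrics({0, 1, 2, 3}, {2, 3, 4}, n_total=10)
print(metrics)
```

When most candidate variables are unassociated, accuracy is dominated by true negatives, which is one reason to read it alongside the true positive rate and the positive predictive value.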

Figure 3: The gene regulatory network searched with the gene results by the STRING database (nodes: CES1, HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, and AKR1D1). The legend shows the data source of each gene: gene expression, SNP, CNV, or DNA methylation.

regulatory network illustrated in Figure 3. The interaction network consists of the HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 genes. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation data, NPY and CES1 are from the CNV data, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene expression by impeding the binding of transcriptional proteins to the gene [34]. In particular, HTR7 (5-hydroxytryptamine receptor 7) encodes a receptor for serotonin, a major neurotransmitter in the central nervous system, and a number of studies have related it to bipolar disorder and schizophrenia [28]. Interestingly, HTR7 was found in the gene expression data block in this study, whereas previous studies reported the gene through GWAS on the SNP data block. The gene may have strong interactions with the other heterogeneous data blocks, and it is consequently considered significant in the integrative model. This supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, neuroactive ligand-receptor interaction; NPY encodes neuropeptide Y, which also acts as a neurotransmitter in the brain and is known to play an important role in emotional processing [30]. A large number of genes conferring susceptibility to psychiatric disorders have been associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may be a susceptibility gene for the psychiatric disorders. Previous research [35] found that ADCY8 is a susceptibility gene for avoidance behavior in mice and that it indirectly contributes to susceptibility to human mood disorders. Our result supports that claim.
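The observation that ADCY8 interacts with both HTR7 and NPY corresponds to a common-neighbor query on the interaction network. The sketch below encodes only the two interactions stated in the text; the remaining edges of Figure 3 are not reproduced here, and the helper names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene, gene) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def common_neighbors(adjacency, u, v):
    """Genes that interact with both `u` and `v`, in sorted order."""
    return sorted(adjacency.get(u, set()) & adjacency.get(v, set()))


# Only the interactions stated in the text; other Figure 3 edges are omitted.
EDGES = [("ADCY8", "HTR7"), ("ADCY8", "NPY")]

adjacency = build_adjacency(EDGES)
print(common_neighbors(adjacency, "HTR7", "NPY"))
```

With the full edge list exported from STRING, the same query could be run for every pair of selected genes to flag other candidate intermediaries.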

4. Conclusion

In this paper, we developed a novel Multiblock Discriminant Analysis (MultiDA) method to dissect the mechanisms of complex human diseases using multiple types of genetic data. A genomic association study with a single data type may fall short of identifying the mechanisms of a disease, whereas MultiDA enables comprehensive analysis across multiple types of genetic data. Moreover, MultiDA handles the special setting of binary class data, in which it effectively detects discriminative factors in the integrative genomic model. The simulation experiments support the strong performance of the proposed method. As a target application, psychiatric disorder data including SNP, CNV, DNA methylation, and gene expression were analyzed in the integrative genomic model. From the large number of variables in each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed method captures a global profile of the mechanism that conventional single-block or two-block methods fail to detect. This promising tool for integrative genomic study can be flexibly extended to new types of data as new high-throughput technologies emerge.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Reviews Genetics, vol. 6, no. 2, pp. 95-108, 2005.

[2] C. N. Henrichsen, E. Chaignat, and A. Reymond, "Copy number variants, diseases and gene expression," Human Molecular Genetics, vol. 18, no. 1, pp. R1-R8, 2009.

[3] Y. Gilad, S. A. Rifkin, and J. K. Pritchard, "Revealing the architecture of gene regulation: the promise of eQTL studies," Trends in Genetics, vol. 24, no. 8, pp. 408-415, 2008.

[4] M. Slatkin, "Epigenetic inheritance and the missing heritability problem," Genetics, vol. 182, no. 3, pp. 845-850, 2009.

[5] J. L. Freeman, G. H. Perry, L. Feuk et al., "Copy number variation: new insights in genome diversity," Genome Research, vol. 16, no. 8, pp. 949-961, 2006.

[6] S. Girirajan, C. D. Campbell, and E. E. Eichler, "Human copy number variation and complex genetic disease," Annual Review of Genetics, vol. 45, pp. 203-226, 2011.

[7] E. N. Gal-Yam, Y. Saito, G. Egger, and P. A. Jones, "Cancer epigenetics: modifications, screening, and therapy," Annual Review of Medicine, vol. 59, pp. 267-280, 2008.

[8] L. D. Moore, T. Le, and G. Fan, "DNA methylation and its basic function," Neuropsychopharmacology, vol. 38, no. 1, pp. 23-38, 2013.

[9] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061-1068, 2008.

[10] M. R. Aure, S.-K. Leivonen, T. Fleischer et al., "Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors," Genome Biology, vol. 14, no. 11, article R126, 2013.

[11] J. R. Wagner, S. Busche, B. Ge, T. Kwan, T. Pastinen, and M. Blanchette, "The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts," Genome Biology, vol. 15, no. 2, article R37, 2014.

[12] A. C. Nica, S. B. Montgomery, A. S. Dimas et al., "Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations," PLoS Genetics, vol. 6, no. 4, Article ID e1000895, 2010.

[13] Y.-H. Hsu, M. C. Zillikens, S. G. Wilson et al., "An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility Loci for osteoporosis-related traits," PLoS Genetics, vol. 6, no. 6, Article ID e1000977, 2010.

[14] Q. Xiong, N. Ancona, E. R. Hauser, S. Mukherjee, and T. S. Furey, "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets," Genome Research, vol. 22, no. 2, pp. 386-397, 2012.

[15] L. Conde, P. M. Bracci, R. Richardson, S. B. Montgomery, and C. F. Skibola, "Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma," American Journal of Human Genetics, vol. 92, no. 1, pp. 126-130, 2013.

[16] W. Li, S. Zhang, C. C. Liu, and X. J. Zhou, "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data," Bioinformatics, vol. 28, no. 19, Article ID bts476, pp. 2458-2466, 2012.

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490-1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291-303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738-1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257-284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 267-288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275-292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102-106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371-1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808-D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781-786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47-55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213-224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874-884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414-1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484-492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123-1130, 2009.


BioMed Research International 7

Table 3 The gene results fromMultiDA with psychiatric disorders

Gene Chromosome Location Source ID MAF ReferenceHTR7 10 10q21-q24 GE 7934970 [28]APOE 19 19q132 DM cg14123992 [29]TRPM1 15 15q133 DM cg18085517EPHB1 3 3q21-q23 CNV CNP12652NPY 7 7p151 CNV CNP2267 [30]QKI 6 6q26 SNP rs1336225 018SLC15A1 13 13q323 SNP rs9517421 017 [31]NPAS3 14 14q131 SNP rs1124910 025 [32]C15orf53 15 15q14 SNP rs1433876 029 [33]

08

085

09

095

1

SCCA SGCCA MultiDA

True

pos

itive

rate

(a)

03

04

05

06

07

08

SCCA SGCCA MultiDA

Posit

ive p

redi

ctiv

e val

ue

(b)

08

085

09

095

1

SCCA SGCCA MultiDA

Accu

racy

(c)

Figure 2 Performance comparison in simulation study (a) True Positive Rate (b) Positive Predictive Value (c) Accuracy

8 BioMed Research International

CES1

HTR7

ADCY8

HTR1F

NPY

CA2

RYR2

QDPR

AKR1D1

Gene expressionSNP

CNVDNA methylation

Figure 3 The gene regulatory network searched with the gene results by STRING database The legend shows the data source of the gene

regulatory network illustrated in Figure 3 The interactionnetwork consists ofHTR7ADCY8HTR1FNPYCA2RYR2QDPR AKR1D1 and CES1 gene HTR7 is inferred from thegene expression set HTR1F and CA2 are from the DNAmethylation expression NPY and CES1 are from the CNVand the others are from the SNP dataThe negative coefficientof HTR1F in the model may support the widely acceptednotion that DNA methylation suppresses gene regulationimpeding the binding of transcriptional proteins to the gene[34] In particular the HTR7 gene (5-hydroxytryptaminereceptor 7) is a major neurotransmitter in the central nervoussystem and a number of literatures related to bipolar andschizophrenia disorder are reported [28] Interestingly theHTR7 gene was found in the gene expression data block inthis study while the other previous researches reported thegene with GWAS on the SNP data block The gene may havestrong incorporated interactions with other heterogeneousdata which is consequently considered to be significant in theintegrative model It supports the strength of the integrativeapproach Moreover we found that HTR7 and NPY arein the same pathway which is neuroactive ligand-receptorinteraction where the NPY gene is also a neurotransmitterin the brain and is known to play an important role inthe emotional process [30] A large number of psychiatricdisorder susceptible genes were associated with this pathway[25]ADCY8 which interacts with bothHTR7 andNPY maybe potentially a susceptibility gene that causes the psychiatricdisorders In previous research [35] they found that ADCY8

is a susceptibility gene for avoidance behavior on mouse andalso found that it indirectly induces the susceptibility onhuman mood disorders Our result supports their claim

4 Conclusion

In this paper we developed the novel Multiblock Discrim-inant Analysis method in order to dissect the mechanismof complex human disease using multiple genetic data Thegenomic association study with single type data may fallshort of identifying the mechanisms of the diseases On theother hand MultiDA enables comprehensive analysis usingmultiple genetic data Moreover MultiDA provides analysisfor the special setting of binary class data where it greatlydetects discriminative factors in the integrative genomicmodel The simulation experiments support the outstandingperformance of the proposed methods As a target applica-tion psychiatric disorder disease data including SNP CNVDNA methylation and gene expression were analyzed inthe integrative genomic model Among the large number ofvariables of each block candidate biomarkers were proposedas significant components of the diseasemechanismThepro-posed methods capture the global profile of the mechanismthat conventional single or two block methods fail to detectThis promising tool for the integrative genomic study canprovide flexible extensibility for new types of data in the erasuperseding new high-throughput technologies

BioMed Research International 9

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] J N Hirschhorn and M J Daly ldquoGenome-wide associationstudies for common diseases and complex traitsrdquo NatureReviews Genetics vol 6 no 2 pp 95ndash108 2005

[2] CNHenrichsen E Chaignat andAReymond ldquoCopynumbervariants diseases and gene expressionrdquo Human MolecularGenetics vol 18 no 1 pp R1ndashR8 2009

[3] Y Gilad S A Rifkin and J K Pritchard ldquoRevealing thearchitecture of gene regulation the promise of eQTL studiesrdquoTrends in Genetics vol 24 no 8 pp 408ndash415 2008

[4] M Slatkin ldquoEpigenetic inheritance and the missing heritabilityproblemrdquo Genetics vol 182 no 3 pp 845ndash850 2009

[5] J L Freeman G H Perry L Feuk et al ldquoCopy numbervariation new insights in genome diversityrdquo Genome Researchvol 16 no 8 pp 949ndash961 2006

[6] S Girirajan C D Campbell and E E Eichler ldquoHuman copynumber variation and complex genetic diseaserdquo Annual Reviewof Genetics vol 45 pp 203ndash226 2011

[7] E N Gal-Yam Y Saito G Egger and P A Jones ldquoCancerepigenetics modifications screening and therapyrdquo AnnualReview of Medicine vol 59 pp 267ndash280 2008

[8] L D Moore T Le and G Fan ldquoDNAmethylation and its basicfunctionrdquo Neuropsychopharmacology vol 38 no 1 pp 23ndash382013

[9] Cancer Genome Atlas Research Network ldquoComprehensivegenomic characterization defines human glioblastoma genesand core pathwaysrdquo Nature vol 455 pp 1061ndash1068 2008

[10] M R Aure S-K Leivonen T Fleischer et al ldquoIndividualand combined effects of DNA methylation and copy numberalterations on miRNA expression in breast tumorsrdquo GenomeBiology vol 14 no 11 article R126 2013

[11] J R Wagner S Busche B Ge T Kwan T Pastinen andM Blanchette ldquoThe relationship between DNA methylationgenetic and expression inter-individual variation in untrans-formedhumanfibroblastsrdquoGenomeBiology vol 15 no 2 articleR37 2014

[12] A C Nica S B Montgomery A S Dimas et al ldquoCandidatecausal regulatory effects by integration of expression QTLs withcomplex trait genetic associationsrdquo PLoS Genetics vol 6 no 4Article ID e1000895 2010

[13] Y-H Hsu M C Zillikens S G Wilson et al ldquoAn integrationof genome-wide association study and gene expression profilingto prioritize the discovery of novel susceptibility Loci forosteoporosis-related traitsrdquo PLoS Genetics vol 6 no 6 ArticleID e1000977 2010

[14] Q Xiong N Ancona E R Hauser S Mukherjee and TS Furey ldquoIntegrating genetic and gene expression evidenceinto genome-wide association analysis of gene setsrdquo GenomeResearch vol 22 no 2 pp 386ndash397 2012

[15] L Conde P M Bracci R Richardson S B Montgomeryand C F Skibola ldquoIntegrating GWAS and expression datafor functional characterization of disease-associated SNPsan application to follicular lymphomardquo American Journal ofHuman Genetics vol 92 no 1 pp 126ndash130 2013

[16] W Li S Zhang C C Liu and X J Zhou ldquoIdentifying mul-ti-layer gene regulatory modules from multi-dimensional ge-nomic datardquo Bioinformatics vol 28 no 19 Article ID bts476pp 2458ndash2466 2012

[17] M Kang B Zhang X Wu C Liu and J Gao ldquoSparse gen-eralized canonical correlation analysis for biological modelintegration a genetic study of psychiatric disordersrdquo in Pro-ceedings of the 35th Annual International Conference of the IEEEEngineering in Medicine and Biology Society (EMBC rsquo13) pp1490ndash1493 July 2013

[18] Q Zhao X Shi Y Xie J Huang B Shia and SMa ldquoCombiningmultidimensional genomicmeasurements for predicting cancerprognosis observations from TCGArdquo Briefings in Bioinformat-ics vol 16 no 2 pp 291ndash303 2015

[19] S Xiang FNieGMengC Pan andC Zhang ldquoDiscriminativeleast squares regression for multiclass classification and featureselectionrdquo IEEE Transactions on Neural Networks and LearningSystems vol 23 no 11 pp 1738ndash1754 2012

[20] A Tenenhaus and M Tenenhaus ldquoRegularized generalizedcanonical correlation analysisrdquo Psychometrika vol 76 no 2 pp257ndash284 2011

[21] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society Series Bethodological vol58 no 1 pp 267ndash288 1996

[22] MHanafi ldquoPLS pathmodelling computation of latent variableswith the estimation mode Brdquo Computational Statistics vol 22no 2 pp 275ndash292 2007

[23] K-A Le Cao D Rossouw C Robert-Granie and P Besse ldquoAsparse PLS for variable selection when integrating omics datardquoStatistical Applications in Genetics and Molecular Biology vol 7no 1 2008

[24] SWaaijenborg P CVerselewel deWittHamer andAHZwin-derman ldquoQuantifying the association between gene expressionsand DNA-markers by penalized canonical correlation analysisrdquoStatistical Applications in Genetics and Molecular Biology vol 7no 1 2008

[25] P Ragunath R Chitra S Mohammad and P Abhinand ldquoAsystems biological study on the comorbidity of autism spectrumdisorders and bipolar disorderrdquo Bioinformation vol 7 no 3 pp102ndash106 2011

[26] A Serretti and C Fabbri ldquoIdentification of risk loci with sharedeffects on five major psychiatric disorders a genome-wideanalysisrdquoThe Lancet vol 381 no 9875 pp 1371ndash1379 2013

[27] A Franceschini D Szklarczyk S Frankild et al ldquoSTRING v91protein-protein interaction networks with increased coverageand integrationrdquoNucleic Acids Research vol 41 no 1 pp D808ndashD815 2013

[28] Y M J Lin H C Yang T J Lai C S J Fann and H SSun ldquoReceptor mediated effect of serotonergic transmissionin patients with bipolar affective disorderrdquo Journal of MedicalGenetics vol 40 no 10 pp 781ndash786 2003

[29] F Vila-Rodriguez W G Honer S M Innis C L Wellingtonand C L Beasley ldquoApoE and cholesterol in schizophreniaand bipolar disorder comparison of grey and white matterand relation with APOE genotyperdquo Journal of Psychiatry ampNeuroscience vol 36 no 1 pp 47ndash55 2011

[30] M Heilig ldquoThe NPY system in stress anxiety and depressionrdquoNeuropeptides vol 38 no 4 pp 213ndash224 2004

[31] M Maheshwari S L Christian C Liu et al ldquoMutationscreening of two candidate genes from 13q32 in families affectedwith bipolar disorder human peptide transporter (SLC15A1)

10 BioMed Research International

and human glypican5 (GPC5)rdquo BMC Genomics vol 3 article30 2002

[32] B S Pickard A Christoforou P AThomson et al ldquoInteractinghaplotypes at the NPAS3 locus alter risk of schizophrenia andbipolar disorderrdquo Molecular Psychiatry vol 14 no 9 pp 874ndash884 2009

[33] TM Kranz S EkawardhaniM K Lin et al ldquoThe chromosome15q14 locus for bipolar disorder and schizophrenia isC15orf53 amajor candidate generdquo Journal of Psychiatric Research vol 46no 11 pp 1414ndash1420 2012

[34] P A Jones ldquoFunctions of DNAmethylation islands start sitesgene bodies and beyondrdquo Nature Reviews Genetics vol 13 no7 pp 484ndash492 2012

[35] A G de Mooij-van Malsen H A van Lith H Oppelaar etal ldquoInterspecies trait genetics reveals association of Adcy8with mouse avoidance behavior and a human mood disorderrdquoBiological Psychiatry vol 66 no 12 pp 1123ndash1130 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

8 BioMed Research International

CES1

HTR7

ADCY8

HTR1F

NPY

CA2

RYR2

QDPR

AKR1D1

Gene expressionSNP

CNVDNA methylation

Figure 3 The gene regulatory network searched with the gene results by STRING database The legend shows the data source of the gene

regulatory network illustrated in Figure 3 The interactionnetwork consists ofHTR7ADCY8HTR1FNPYCA2RYR2QDPR AKR1D1 and CES1 gene HTR7 is inferred from thegene expression set HTR1F and CA2 are from the DNAmethylation expression NPY and CES1 are from the CNVand the others are from the SNP dataThe negative coefficientof HTR1F in the model may support the widely acceptednotion that DNA methylation suppresses gene regulationimpeding the binding of transcriptional proteins to the gene[34] In particular the HTR7 gene (5-hydroxytryptaminereceptor 7) is a major neurotransmitter in the central nervoussystem and a number of literatures related to bipolar andschizophrenia disorder are reported [28] Interestingly theHTR7 gene was found in the gene expression data block inthis study while the other previous researches reported thegene with GWAS on the SNP data block The gene may havestrong incorporated interactions with other heterogeneousdata which is consequently considered to be significant in theintegrative model It supports the strength of the integrativeapproach Moreover we found that HTR7 and NPY arein the same pathway which is neuroactive ligand-receptorinteraction where the NPY gene is also a neurotransmitterin the brain and is known to play an important role inthe emotional process [30] A large number of psychiatricdisorder susceptible genes were associated with this pathway[25]ADCY8 which interacts with bothHTR7 andNPY maybe potentially a susceptibility gene that causes the psychiatricdisorders In previous research [35] they found that ADCY8

is a susceptibility gene for avoidance behavior on mouse andalso found that it indirectly induces the susceptibility onhuman mood disorders Our result supports their claim

4 Conclusion

In this paper we developed the novel Multiblock Discrim-inant Analysis method in order to dissect the mechanismof complex human disease using multiple genetic data Thegenomic association study with single type data may fallshort of identifying the mechanisms of the diseases On theother hand MultiDA enables comprehensive analysis usingmultiple genetic data Moreover MultiDA provides analysisfor the special setting of binary class data where it greatlydetects discriminative factors in the integrative genomicmodel The simulation experiments support the outstandingperformance of the proposed methods As a target applica-tion psychiatric disorder disease data including SNP CNVDNA methylation and gene expression were analyzed inthe integrative genomic model Among the large number ofvariables of each block candidate biomarkers were proposedas significant components of the diseasemechanismThepro-posed methods capture the global profile of the mechanismthat conventional single or two block methods fail to detectThis promising tool for the integrative genomic study canprovide flexible extensibility for new types of data in the erasuperseding new high-throughput technologies

BioMed Research International 9

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] J N Hirschhorn and M J Daly ldquoGenome-wide associationstudies for common diseases and complex traitsrdquo NatureReviews Genetics vol 6 no 2 pp 95ndash108 2005

[2] CNHenrichsen E Chaignat andAReymond ldquoCopynumbervariants diseases and gene expressionrdquo Human MolecularGenetics vol 18 no 1 pp R1ndashR8 2009

[3] Y Gilad S A Rifkin and J K Pritchard ldquoRevealing thearchitecture of gene regulation the promise of eQTL studiesrdquoTrends in Genetics vol 24 no 8 pp 408ndash415 2008

[4] M Slatkin ldquoEpigenetic inheritance and the missing heritabilityproblemrdquo Genetics vol 182 no 3 pp 845ndash850 2009

[5] J L Freeman G H Perry L Feuk et al ldquoCopy numbervariation new insights in genome diversityrdquo Genome Researchvol 16 no 8 pp 949ndash961 2006

[6] S Girirajan C D Campbell and E E Eichler ldquoHuman copynumber variation and complex genetic diseaserdquo Annual Reviewof Genetics vol 45 pp 203ndash226 2011

[7] E N Gal-Yam Y Saito G Egger and P A Jones ldquoCancerepigenetics modifications screening and therapyrdquo AnnualReview of Medicine vol 59 pp 267ndash280 2008

[8] L D Moore T Le and G Fan ldquoDNAmethylation and its basicfunctionrdquo Neuropsychopharmacology vol 38 no 1 pp 23ndash382013

[9] Cancer Genome Atlas Research Network ldquoComprehensivegenomic characterization defines human glioblastoma genesand core pathwaysrdquo Nature vol 455 pp 1061ndash1068 2008

[10] M R Aure S-K Leivonen T Fleischer et al ldquoIndividualand combined effects of DNA methylation and copy numberalterations on miRNA expression in breast tumorsrdquo GenomeBiology vol 14 no 11 article R126 2013

[11] J R Wagner S Busche B Ge T Kwan T Pastinen andM Blanchette ldquoThe relationship between DNA methylationgenetic and expression inter-individual variation in untrans-formedhumanfibroblastsrdquoGenomeBiology vol 15 no 2 articleR37 2014

[12] A C Nica S B Montgomery A S Dimas et al ldquoCandidatecausal regulatory effects by integration of expression QTLs withcomplex trait genetic associationsrdquo PLoS Genetics vol 6 no 4Article ID e1000895 2010

[13] Y-H Hsu M C Zillikens S G Wilson et al ldquoAn integrationof genome-wide association study and gene expression profilingto prioritize the discovery of novel susceptibility Loci forosteoporosis-related traitsrdquo PLoS Genetics vol 6 no 6 ArticleID e1000977 2010

[14] Q Xiong N Ancona E R Hauser S Mukherjee and TS Furey ldquoIntegrating genetic and gene expression evidenceinto genome-wide association analysis of gene setsrdquo GenomeResearch vol 22 no 2 pp 386ndash397 2012

[15] L Conde P M Bracci R Richardson S B Montgomeryand C F Skibola ldquoIntegrating GWAS and expression datafor functional characterization of disease-associated SNPsan application to follicular lymphomardquo American Journal ofHuman Genetics vol 92 no 1 pp 126ndash130 2013

[16] W Li S Zhang C C Liu and X J Zhou ldquoIdentifying mul-ti-layer gene regulatory modules from multi-dimensional ge-nomic datardquo Bioinformatics vol 28 no 19 Article ID bts476pp 2458ndash2466 2012

[17] M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, "Sparse generalized canonical correlation analysis for biological model integration: a genetic study of psychiatric disorders," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 1490–1493, July 2013.

[18] Q. Zhao, X. Shi, Y. Xie, J. Huang, B. Shia, and S. Ma, "Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA," Briefings in Bioinformatics, vol. 16, no. 2, pp. 291–303, 2015.

[19] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.

[20] A. Tenenhaus and M. Tenenhaus, "Regularized generalized canonical correlation analysis," Psychometrika, vol. 76, no. 2, pp. 257–284, 2011.

[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[22] M. Hanafi, "PLS path modelling: computation of latent variables with the estimation mode B," Computational Statistics, vol. 22, no. 2, pp. 275–292, 2007.

[23] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[24] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman, "Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, 2008.

[25] P. Ragunath, R. Chitra, S. Mohammad, and P. Abhinand, "A systems biological study on the comorbidity of autism spectrum disorders and bipolar disorder," Bioinformation, vol. 7, no. 3, pp. 102–106, 2011.

[26] A. Serretti and C. Fabbri, "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.

[27] A. Franceschini, D. Szklarczyk, S. Frankild et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. 1, pp. D808–D815, 2013.

[28] Y. M. J. Lin, H. C. Yang, T. J. Lai, C. S. J. Fann, and H. S. Sun, "Receptor mediated effect of serotonergic transmission in patients with bipolar affective disorder," Journal of Medical Genetics, vol. 40, no. 10, pp. 781–786, 2003.

[29] F. Vila-Rodriguez, W. G. Honer, S. M. Innis, C. L. Wellington, and C. L. Beasley, "ApoE and cholesterol in schizophrenia and bipolar disorder: comparison of grey and white matter and relation with APOE genotype," Journal of Psychiatry & Neuroscience, vol. 36, no. 1, pp. 47–55, 2011.

[30] M. Heilig, "The NPY system in stress, anxiety and depression," Neuropeptides, vol. 38, no. 4, pp. 213–224, 2004.

[31] M. Maheshwari, S. L. Christian, C. Liu et al., "Mutation screening of two candidate genes from 13q32 in families affected with bipolar disorder: human peptide transporter (SLC15A1) and human glypican5 (GPC5)," BMC Genomics, vol. 3, article 30, 2002.

[32] B. S. Pickard, A. Christoforou, P. A. Thomson et al., "Interacting haplotypes at the NPAS3 locus alter risk of schizophrenia and bipolar disorder," Molecular Psychiatry, vol. 14, no. 9, pp. 874–884, 2009.

[33] T. M. Kranz, S. Ekawardhani, M. K. Lin et al., "The chromosome 15q14 locus for bipolar disorder and schizophrenia: is C15orf53 a major candidate gene?" Journal of Psychiatric Research, vol. 46, no. 11, pp. 1414–1420, 2012.

[34] P. A. Jones, "Functions of DNA methylation: islands, start sites, gene bodies and beyond," Nature Reviews Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[35] A. G. de Mooij-van Malsen, H. A. van Lith, H. Oppelaar et al., "Interspecies trait genetics reveals association of Adcy8 with mouse avoidance behavior and a human mood disorder," Biological Psychiatry, vol. 66, no. 12, pp. 1123–1130, 2009.


