
A Least-Squares Approach to Blind Source Separation in Multispeaker Environments

Robert M. Nickel and Ananth N. Iyer
Department of Electrical Engineering, The Pennsylvania State University,
University Park, PA 16802, USA
Email: {rmn10, ani103}@psu.edu

Abstract—We propose a new approach to the solution of the cocktail party problem (CPP). The goal of the CPP is to isolate the speech signals of individuals who are concurrently talking while being recorded with a properly positioned microphone array. The new approach provides a powerful yet simple alternative to commonly used methods for the separation of speakers. It relies on the existence of so-called exclusive activity periods (EAPs) in the source signals. EAPs are time intervals during which only one source is active and all other sources are inactive (i.e. zero). The existence of EAPs is not guaranteed for arbitrary signal classes. EAPs occur very frequently, however, in recordings of conversational speech. The methods proposed in this paper show how EAPs can be detected and how they can be exploited to improve the performance of blind source separation systems in speech processing applications. We consider both the instantaneous mixture case and the convolutive mixture case. A comparison of the proposed method with other popular source separation methods is drawn. The results show an improved performance of the proposed method over earlier approaches.

Index Terms—blind source separation, least-squares optimization, cocktail party problem, exclusive activity periods.

I. INTRODUCTION

An illustration of the scenario considered in the cocktail party problem (CPP) is shown in figure 1. We are recording an acoustic scene with an array of microphones. If we want to isolate the voice of a single speaker out of a mixture of signals from different sources then it is implicitly necessary to estimate the transmission properties (or equivalently the inverse transmission properties) of all channels between all microphones and all sources.

To date, the most successful approaches to solving the cocktail party problem (CPP) employ blind source separation (BSS) techniques based on an assumption of statistical independence of the sources [1], [2]. The goal is to find an un-mixing system that maximizes a "measure of independence" of the reconstructed source signals. Generally, one distinguishes between methods that consider instantaneous mixing and methods that address convolutive mixing [1]–[3].

This paper is based on "A Novel Approach to Automated Source Separation in Multispeaker Environments," by R. M. Nickel and A. N. Iyer, which appeared in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 629-632, May 14-19, 2006, Toulouse, France. © 2005 IEEE.

Figure 1. An illustration of the scenario considered in the cocktail party problem. In order to isolate the speech signal of a targeted speaker we need to implicitly estimate the transmission channels between all sources and all microphones.

The crux of the aforementioned approach is to find a good (and mathematically tractable) "measure for independence." It was shown that a suitable objective in finding a solution is provided by the minimization of a contrast function, which is a function of the PDF of the observed signals [2], [4]. In the context of speech and audio signal processing a suitable contrast function is usually derived from a likelihood measure and/or the INFOMAX concept [5]. The optimization procedure that minimizes a contrast function (and thus provides a solution to the BSS problem) is generally called an independent component analysis (ICA) [2]. Even though BSS techniques have achieved remarkable results in the separation of mixed audio signals, they are still suboptimal with regard to the separation of speech, partially because they usually do not permit an exploitation of the highly structured nature of speech.

Solutions to the more general convolutive mixture case are significantly more complicated than solutions to the instantaneous mixture case [6]–[10]. Most solutions can be classified as either time domain approaches (see [11] and the references therein) or frequency domain approaches (see [1], [12], [13]). More specifically, we have to distinguish between the domain in which we model the mixing process (mixing domain) and the domain in which we model the statistics of the source signals (source domain). A choice of either time or frequency for each of these domains has significant advantages and disadvantages (see [1]).

JOURNAL OF COMPUTERS, VOL. 2, NO. 7, SEPTEMBER 2007 59

© 2007 ACADEMY PUBLISHER

In this paper we are developing an alternative approach based on a concept introduced by Huang, Benesty, and Chen in 2005 [14]. The alternative approach does not explicitly rely on an independence assumption between sources. Instead, we are assuming the existence of exclusive activity periods (EAPs). EAPs are time intervals during which only one source is active and all other sources are silent. The existence of EAPs is not guaranteed for arbitrary signal classes, but EAPs occur very frequently in recordings of conversational speech.

The caveat of the newly proposed method is that one needs to reliably detect exclusive activity periods from the observed signals. The detection of such periods becomes possible if we focus on the unique structure of speech signals, and in particular the unique structure of voiced segments of the speech signals.

The technical details of the proposed methodology are presented in sections II and III. Section II focuses on the instantaneous mixing case and section III focuses on the convolutive mixing case. An experimental verification and performance analysis is presented in section IV. The results are discussed in sections V and VI.

II. INSTANTANEOUS MIXING

The considered scenario is described by the following mathematical model: we have K speakers in a room, each of which produces a speech signal x_k[n] (for k = 1, 2, ..., K). The K speech signals are captured by M microphones (M ≥ K), each of which must be placed such that the measured acoustic waveforms differ significantly from microphone to microphone¹. The signal recorded at microphone m will be referred to as the observed signal and is denoted as y_m[n] (for m = 1 ... M). To streamline the notation it is beneficial to introduce the following vectors:

x_k = [ x_k[1]  x_k[2]  ...  x_k[P] ]^T,   (1)

y_m = [ y_m[1]  y_m[2]  ...  y_m[P] ]^T,   (2)

where P is the observation segment length in samples. For simplicity we will use the notation x_k and y_m in three ways: (i) to indicate vectors that encompass the entire recording length, (ii) to indicate one of a successive set of 40 msec long segments of the recording, and (iii) to indicate an exclusive activity period which spans multiple successive segments of length 40 msec. Matrices are formed from the vectors as X = [ x_1 x_2 ... x_K ]^T and Y = [ y_1 y_2 ... y_M ]^T. If we assume instantaneous mixing then the connection between the source signal matrix X and the observed signal matrix Y is determined by the constant mixing matrix A:

Y = A · X.   (3)

In general, the solution to the CPP is defined as the task of determining an inverse/de-mixing matrix W such that the matrix X̂ obtained from

X̂ = W · Y   (4)

provides an estimate for the rows of matrix X. The matrix W is considered a de-mixing matrix when the matrix product W · A is a permutation matrix.

¹Appropriate placement of the microphones is crucial to ensure that the resulting estimation problem is not ill-conditioned.

Figure 2. A block diagram of the proposed source separation algorithm. The de-mixing matrix W is estimated from EAP sections of the observed signals Y. The source estimates X̂ are obtained from equation (4).

The proposed source separation algorithm consists of two main parts: i) the detection of EAP segments and ii) the determination of the de-mixing matrix W. A block diagram of the proposed algorithm is shown in figure 2.

A. EAP Detection

The detection of EAPs is facilitated by the following definition of a normalized signal-to-interference ratio (SIR) γ:

γ = (1/(K − 1)) · max_k ( K · (x_k^T x_k) / (Σ_{i=1}^{K} x_i^T x_i) − 1 ).   (5)

If we let the x_k's denote successive 40 msec long segments of the source signals then we obtain an SIR measure γ for each segment. Note that the SIR is normalized such that 0 ≤ γ ≤ 1 (with γ = 1 indicating a perfect EAP event). Unfortunately, we cannot access the true underlying SIR since we cannot measure the source signals x_k directly. Instead, we are estimating γ from the observations y_m. The proposed estimator is based on the following three features:

(i) A Periodicity Measure. The pitch determination algorithm (PDA) developed by Medan et al. [15] aims to determine the pitch of a voiced speech segment. The method involves computing normalized inner products between adjacent, variable length speech segments. The inner product is maximized when the segment length equals the pitch period. The normalized inner product at the pitch provides a measure of periodicity f_p for the given signal segment y_m.

(ii) The Harmonic-to-Signal Ratio. The energy of a voiced speech segment is concentrated around the harmonic frequencies of its pitch. The harmonic-to-signal ratio (HSR) f_h is defined as the ratio of spectral energy around the pitch harmonics versus the overall energy of signal y_m. Its computation is achieved by comparing the energy of the output of a pitch synchronous comb-filter with the total signal energy [16].


(iii) The Spectral Autocorrelation. Spectral autocorrelation measures are used in many PDAs (in addition to temporal correlation measures) to combat pitch-doubling/pitch-halving errors. We are employing the normalized spectral autocorrelation proposed in [16]. The maximum autocorrelation value f_s in the pitch frequency range (50 Hz to 500 Hz) is determined and used as a feature for the SIR estimation.

The SIR is estimated via γ̂ = Φ(f_p, f_h, f_s), in which Φ(·) is a three-dimensional, second-order polynomial. The optimal polynomial coefficients are chosen to minimize the least-squares error between the estimated SIR (ESIR) γ̂ and the true underlying SIR γ for the training set described in section IV. The resulting normalized correlation coefficients between f_p, f_h, f_s, γ̂ and γ are shown in table I. The EAP detection is performed with a threshold test. A segment is flagged as an EAP if γ̂ is greater than or equal to a threshold γ₀ (see section IV). A time segment is considered an exclusive activity period if the corresponding segments y_m are all flagged as EAPs across all channels m = 1 ... M.

TABLE I. CORRELATION COEFFICIENTS FOR SIR ESTIMATION

Feature                         Correlation with γ
Periodicity Measure f_p         0.5838
Harmonic-to-Signal Ratio f_h    0.4360
Spectral Autocorrelation f_s    0.4329
Polynomial Estimate γ̂          0.6338
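The segment-level SIR of equation (5) and the least-squares fit of the polynomial estimator Φ can be sketched in a few lines of numpy. This is a minimal sketch under our own assumptions: the function names true_sir and fit_sir_estimator are ours, and we take "second-order polynomial" to mean all ten trivariate monomials of (f_p, f_h, f_s) up to total degree two.

```python
import numpy as np

def true_sir(segments):
    """Normalized SIR gamma of equation (5) for one time segment.
    `segments` is a list of K source-signal segments x_k (40 msec each)."""
    e = np.array([x @ x for x in segments])   # per-source energies x_k^T x_k
    K = len(e)
    return (K * e.max() / e.sum() - 1.0) / (K - 1)

def fit_sir_estimator(F, gamma):
    """Least-squares fit of the polynomial SIR estimator Phi(fp, fh, fs).
    F is an N x 3 matrix of feature values, gamma the N true SIR values."""
    fp, fh, fs = F.T
    # all monomials of (fp, fh, fs) up to total degree 2
    Phi = np.column_stack([np.ones_like(fp), fp, fh, fs,
                           fp**2, fh**2, fs**2, fp*fh, fp*fs, fh*fs])
    coeff, *_ = np.linalg.lstsq(Phi, gamma, rcond=None)
    return coeff
```

For a perfect EAP (one active source) true_sir returns 1; for K equally strong sources it returns 0.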

B. Estimation of the De-mixing Matrix

During a true exclusive activity period it is readily verified that the observed signals y_i are scalar multiples of the one active source signal x_q. We can obtain an estimate for x_q from each channel i by multiplying the channel output y_i with a channel-specific scalar w_i:

x̂_i^q = w_i y_i.   (6)

Ideally, we want x̂_i^q = x̂_j^q for all i and j, which leaves us with an infinite number of choices for the w_i's if all of the y_i's are truly linearly dependent. Due to imperfect EAP detection and background noise, however, we are generally not able to achieve x̂_i^q = x̂_j^q for all i and j for any set of channel weights w_i. Instead, we may seek to choose the w_i's such that the deviation between each x̂_i^q and their average x̄^q = (1/M) Σ_{j=1}^{M} x̂_j^q is minimized, i.e.

minimize_{w_i}  C_q = Σ_{i=1}^{M} ‖ x̂_i^q − x̄^q ‖²  subject to  ‖ x̄^q ‖² = ζ².   (7)

The constraint is necessary to avoid the trivial (yet meaningless) solution w_i = 0 for all i. Expanding the terms of the cost function leads to

C_q = Σ_{i=1}^{M} w_i² y_i^T y_i − (1/M) Σ_{i=1}^{M} Σ_{k=1}^{M} w_i w_k y_i^T y_k,   (8)

which can be compactly written in matrix notation as

C_q = w^T [ R_d − (1/M) R ] w   (9)

with R = Y Y^T, R_d = diag(R) and w = [ w_1, w_2, ..., w_M ]^T. With Lagrange multiplier λ we can cast equation (7) into the Lagrange function

L(w, λ) = w^T [ M R_d − (1 + λ) R ] w + λ M² ζ².   (10)

Differentiating L(w, λ) with respect to w and equating to zero results in the condition

R_d w = ((λ + 1)/M) R w.   (11)

Condition (11) is satisfied by the generalized eigenvectors of matrices R and R_d (with (λ + 1)/M being the generalized eigenvalues). It is readily shown that out of the M generalized eigenvectors we must choose the one w_q with the smallest eigenvalue to minimize the cost function C_q.

The transpose of the solution w_q establishes the q-th row of the de-mixing matrix W. We must observe at least one EAP from every source to obtain a full estimate of W. The decision of whether a set of disjoint EAPs belongs to the same source or to different sources is made with a simple hierarchical clustering algorithm². If multiple estimates of w_q (from disjoint EAPs) are cast into the same cluster then their (renormalized) centroid is used as the corresponding row in W.
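The estimation of one de-mixing row via the generalized eigenproblem of equation (11) can be sketched as follows. This is a minimal numpy/scipy sketch (the helper name demixing_row is ours); scipy's eigh returns generalized eigenvalues in ascending order, so the first eigenvector is the one with the smallest eigenvalue.

```python
import numpy as np
from scipy.linalg import eigh

def demixing_row(Y_eap):
    """Estimate one row w_q of the de-mixing matrix W from an EAP segment.
    Y_eap is an M x P matrix whose rows are the observed channels during
    an exclusive activity period (only source q active)."""
    R = Y_eap @ Y_eap.T          # M x M Gram matrix of the channels
    Rd = np.diag(np.diag(R))     # its diagonal part
    # generalized eigenproblem Rd w = mu R w  (equation (11), mu = (lambda+1)/M)
    mu, V = eigh(Rd, R)          # eigenvalues sorted in ascending order
    w = V[:, 0]                  # smallest eigenvalue minimizes C_q
    return w / np.linalg.norm(w)
```

For a noisy scalar mixture y_i = a_i x the recovered row satisfies w_i ∝ 1/a_i, so that w applied to the observations reproduces the active source up to scale.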

III. CONVOLUTIVE MIXING

The methods proposed in the previous section for the instantaneous mixing case are readily generalized to the convolutive mixing case. A block diagram that depicts the considered scenario is shown in figure 3. For simplicity we assume that we have M unknown source signals x_i[n] for i = 1 ... M. The transmission path between source i and receiver j is described by the transfer function of a linear time-invariant system with impulse response g_ij[n]. The resulting M observation signals y_j[n] are generated according to:

y_j[n] = Σ_{i=1}^{M} [ Σ_{k=0}^{∞} g_ij[k] x_i[n − k] ]  for j = 1 ... M.   (12)

Equation (12) can also be expressed in the z-domain as

Y(z) = G(z) · X(z),   (13)

in which X(z) is the z-transform of the multichannel signal x[n] = [ x_1[n] x_2[n] ... x_M[n] ]^T, Y(z) is the z-transform of the multichannel signal y[n] = [ y_1[n] y_2[n] ... y_M[n] ]^T, and G(z) is a matrix with the z-transforms of the impulse responses g_ij[n] for all i and j.

The goal in convolutive blind source separation is to find a matrix of transfer functions H(z) such that X̂(z) given by

X̂(z) = H(z) · Y(z)   (14)

²It is assumed that the underlying number of sources is known and that all sources have at least one EAP.


Figure 3. A block diagram of the mixing scenario described by equation (12). The M unknown signal sources are labelled with x_i[n] (for i = 1 ... M). The M observations are labelled with y_j[n] (for j = 1 ... M). (Diagram: sources, mixture model with impulse responses g_ii[n], g_ij[n], g_ji[n], g_jj[n], and observations.)

is (in some appropriate metric) as close to X(z) as possible. Similarly to the method described in section II we are applying the following three-step approach:

1) Find a set of time intervals [n1, n2] during which only one source is active and all other sources are silent, i.e. find all exclusive activity periods (EAPs).

2) Find a set of transfer functions that deconvolve the sources during EAPs.

3) Construct H(z) by combining the results from different EAPs from different sources.
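The mixing model of equation (12) can be simulated with a few lines of numpy. This is a sketch under the assumption of FIR impulse responses g_ij[n]; the helper name convolutive_mix is ours.

```python
import numpy as np

def convolutive_mix(X, G):
    """Simulate the convolutive mixing model of equation (12).
    X: M x N array of source signals x_i[n].
    G: M x M nested list, G[i][j] is the FIR impulse response g_ij[n]
       from source i to microphone j.
    Returns the M x N array of observations y_j[n]."""
    M, N = X.shape
    Y = np.zeros((M, N))
    for j in range(M):
        for i in range(M):
            # y_j[n] += sum_k g_ij[k] x_i[n-k], truncated to N samples
            Y[j] += np.convolve(X[i], G[i][j])[:N]
    return Y
```

With unit-impulse responses on the diagonal and zeros elsewhere the observations reproduce the sources; a delayed impulse g_ij[n] = δ[n−1] shifts the corresponding contribution by one sample.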

One of the caveats of the proposed approach is that EAP detection is slightly more complicated with convolutive mixtures than it is with instantaneous mixtures. We have to revise our EAP detection method first.

A. EAP Estimation for Convolutive Mixtures of Speech

An exclusive activity period is a time interval during which only one source x_i[n] is active and all other sources x_k[n] are silent, i.e. x_k[n] = 0 for k ≠ i. Similarly to the instantaneous mixing case, the estimation of exclusive activity periods for speech sources x_i[n] can be based on the (almost) periodic nature of vocalic sounds. If only a single person is speaking then all observations y_j[n] exhibit time intervals with a periodic structure. When multiple persons are speaking the periodicity is generally destroyed [17].

Again, we use a modification of the robust short-time periodicity measure proposed by Medan et al. [15] (see section II-A). They consider the similarity between two adjacent observation segments of length k:

s_1^j[n, k] = [ y_j[n − k] ... y_j[n − 2] y_j[n − 1] ]^T,   (15)

s_2^j[n, k] = [ y_j[n] y_j[n + 1] ... y_j[n + k − 1] ]^T.   (16)

A correlation measure NCOR_j[n, k] is defined through a normalized inner product of the vectors s_1^j[n, k] and s_2^j[n, k]:

NCOR_j[n, k] = ( s_1^j[n, k]^T s_2^j[n, k] ) / ( ‖ s_1^j[n, k] ‖ · ‖ s_2^j[n, k] ‖ ).   (17)

The normalization ensures that the correlation measure is bounded between zero and one, i.e. 0 ≤ NCOR_j[n, k] ≤ 1. The correlation measure is equal to one at the true period p of a perfectly periodic signal. Less than perfectly periodic signals yield correlation values less than one. As a consequence, we can define a short-time periodicity measure as:

STPM_j[n] = max_{p_min ≤ k ≤ p_max} { NCOR_j[n, k] }.   (18)

The search range for the maximum should be bounded by the typical pitch range of human speech (50 Hz ... 500 Hz). For observation signals that are sampled with sampling frequency F_s we have:

p_min = F_s / 500 Hz  and  p_max = F_s / 50 Hz.   (19)
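Equations (15)-(19) can be sketched as follows (a minimal numpy sketch; the function name stpm is ours):

```python
import numpy as np

def stpm(y, n, fs):
    """Short-time periodicity measure of equations (15)-(19) at sample n.
    y is a 1-D observation signal, fs its sampling frequency in Hz."""
    pmin, pmax = int(fs / 500), int(fs / 50)   # pitch range 50 Hz ... 500 Hz
    best = 0.0
    for k in range(pmin, pmax + 1):
        s1 = y[n - k:n]                        # segment preceding n, eq. (15)
        s2 = y[n:n + k]                        # segment following n, eq. (16)
        denom = np.linalg.norm(s1) * np.linalg.norm(s2)
        if denom > 0:
            best = max(best, float(s1 @ s2) / denom)   # NCOR, equation (17)
    return best                                # STPM_j[n], equation (18)
```

At the true pitch period the two adjacent segments coincide for a periodic signal, so the measure approaches one.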

A second feature that correlates well with EAPs is the so-called short-time zero crossing rate [16]:

STZC_j[n] = Σ_{m=n−L}^{n+L} | sign(y_j[m]) − sign(y_j[m − 1]) |.   (20)

In our notation sign(x) is equal to +1 for x ≥ 0 and −1 for x < 0. The zero crossing rate counts the number of transitions from positive samples to negative samples within the range (n − L − 1) ... (n + L). The range length is usually chosen as L = F_s · 10 msec. Typically, the zero crossing rate is low for EAP sections and high otherwise. For the STZC measure to work properly it is important that possible quantization offsets in the recorded speech signal are removed prior to processing.

A normalized short-time zero crossing measure NZCM_j[n] is constructed with the maximum ZC_j^max and the minimum ZC_j^min of STZC_j[n] over all n:

NZCM_j[n] = ( STZC_j[n] − ZC_j^max ) / ( ZC_j^max − ZC_j^min ).   (21)

The identification of EAP candidates is done as follows: 1) find all times n for which STPM_j[n] ≥ 0.7 for all observations j = 1 ... M, 2) expand the so-found sections in forward and backward direction until STPM_j[n] < 0.6, 3) remove times for which NZCM_j[n] > −0.5, and 4) retain only the intersection of all EAP candidates across all channels j = 1 ... M. An example of STPM_j[n], NZCM_j[n], and the resulting set of EAP candidate sections is shown in figure 6.
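The zero crossing features of equations (20)-(21) can be sketched as follows (the helper name nzcm is ours; the sliding-window sum is realized with a convolution, which may differ from the printed index range by one sample at the window edges):

```python
import numpy as np

def nzcm(y, L):
    """Normalized short-time zero crossing measure (equations (20)-(21)).
    Returns values in [-1, 0]: near -1 in strongly voiced (EAP-like)
    regions, near 0 in noise-like regions."""
    s = np.where(y >= 0, 1.0, -1.0)            # sign with sign(0) = +1
    d = np.abs(np.diff(s))                     # contributes 2 per sign change
    # sliding sum over a window of 2L+1 samples -> STZC contour
    stzc = np.convolve(d, np.ones(2 * L + 1), mode='same')
    return (stzc - stzc.max()) / (stzc.max() - stzc.min())
```

A low-frequency sinusoid (few zero crossings) yields values close to −1, while white noise (frequent sign changes) yields values close to 0, matching the thresholding rule NZCM_j[n] > −0.5 used above.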

B. Blind EAP Deconvolution

In this section we discuss the subproblem of blind system identification under an exclusive activity assumption, i.e. we assume that we have identified a time interval [n1, n2] during which only source x_i[n] is active and all other sources are silent, i.e. x_k[n] = 0 for k ≠ i. Under the EAP assumption we may attempt to reconstruct source x_i[n] from each observation y_j[n] via an appropriately chosen inverse filter h_ij[n]:

x̂_i^j[n] = Σ_{k=0}^{P} h_ij[k] y_j[n − k].   (22)

Ideally x_i[n] = x̂_i^j[n] for all j = 1 ... M. Practically, however, we have x̂_i^k[n] ≠ x̂_i^j[n] for k ≠ j due to noise, imperfect estimation of the EAPs, improper choice of P, non-minimum phase properties of g_ij[n], and so forth.

An estimate E_i of the reconstruction error can be defined with

E_i = Σ_{j=1}^{M} Σ_n | x̂_i^j[n] − x̄_i[n] |²   (23)

and  x̄_i[n] = (1/M) Σ_{j=1}^{M} x̂_i^j[n].   (24)

It is readily seen from equation (23) that a perfect reconstruction with x_i[n] = x̂_i^j[n] yields a minimum error estimate of E_i = 0. Unfortunately, x_i[n] = x̂_i^j[n] may not be the only solution that satisfies E_i = 0 (e.g. if the g_ij[n] are linearly dependent). One may hope, however, that (if the g_ij[n] are sufficiently different) a global minimization of E_i will lead to good estimates for the h_ij[n] for j = 1 ... M.

The computation of the global minimum of E_i is aided by the following notation. We define:

x̂_i^j = [ x̂_i^j[n1 + P]  ...  x̂_i^j[n2 − 1]  x̂_i^j[n2] ]^T,

Y_j = [ y_j[n1]      y_j[n1 + 1]  ···  y_j[n1 + P]
        y_j[n1 + 1]  y_j[n1 + 2]  ···  y_j[n1 + P + 1]
        ⋮            ⋮            ⋱    ⋮
        y_j[n2 − P]  ···          ···  y_j[n2] ],

and  h_i^j = [ h_ij[P]  ...  h_ij[1]  h_ij[0] ]^T.   (25)

Equation (22) can be rewritten as x̂_i^j = Y_j h_i^j and:

x̄_i = (1/M) Σ_{j=1}^{M} x̂_i^j = (1/M) Σ_{j=1}^{M} Y_j h_i^j.   (26)

The error estimate (23) becomes:

E_i = Σ_{j=1}^{M} ‖ Y_j h_i^j − (1/M) Σ_{k=1}^{M} Y_k h_i^k ‖²   (27)

    = Σ_{j=1}^{M} [Y_j h_i^j]^T Y_j h_i^j − (1/M) Σ_{j=1}^{M} Σ_{k=1}^{M} [Y_k h_i^k]^T Y_j h_i^j.

We define the matrices R_jk = [Y_k]^T Y_j and

R_F = [ R_11  R_12  ···  R_1M
        R_21  R_22  ···  R_2M
        ⋮     ⋮     ⋱    ⋮
        R_M1  ···   ···  R_MM ],   (28)

R_D = [ R_11  0     ···  0
        0     R_22  ···  0
        ⋮     ⋮     ⋱    0
        0     ···   0    R_MM ],   (29)

and  H_i = [ [h_i^1]^T  [h_i^2]^T  ...  [h_i^M]^T ]^T.   (30)

Using equations (25) to (30) we can compactly write the error estimate as

E_i = H_i^T [ R_D − (1/M) R_F ] H_i.   (31)

In order to avoid the trivial minimization of equation (31) (with H_i = 0) we constrain the solution to

‖ x̄_i ‖² = (1/M²) H_i^T R_F H_i = 1.   (32)

We have thus reformulated the problem into that of finding the vector H_i that minimizes E_i subject to equation (32).

It is readily shown with Lagrange multipliers that the solution to the above problem is provided by one of the generalized eigenvectors Φ_m of the matrices R_D and R_F:

φ_m R_D Φ_m = R_F Φ_m.   (33)

We assume that the eigenvalues φ_m are sorted in decreasing order φ_1 ≥ φ_2 ≥ φ_3 ≥ φ_4 ≥ ... and that the eigenvectors are normalized such that ‖ Φ_m ‖ = 1 for all m. Choosing Φ_m as the solution leads to an error estimate of E_i = M (M/φ_m − 1). The optimal solution, i.e. the one that minimizes E_i, is thus given by

H_i = Φ_1  and  E_i = M (M/φ_1 − 1).

C. Blind Source Separation

As a result of the methods described in sections III-A and III-B we obtain an inverse filter estimate H_i (with its associated eigenvalue φ_1) for each separately identified EAP section. In a first step we discard all EAP sections (and H_i) for which log_10( M/(M − φ_1) ) was greater than a certain EAP acceptance threshold (EAT, see section V). In a second step we use a simple minimum Euclidean distance hierarchical clustering method [18] to associate each vector H_i with one of the M sources. All vectors associated with the same source k are averaged³ (arithmetic mean) into an average eigenvector H̄_k for each source k = 1 ... M. By extracting the corresponding subvectors h_i^j in analogy to equation (30):

H_i = [ [h_i^1]^T  [h_i^2]^T  ...  [h_i^M]^T ]^T,   (34)

we obtain a complete set of inverse filter vectors h_i^j:

h_i^j = [ h_ij[P]  ...  h_ij[1]  h_ij[0] ]^T.   (35)

An estimate for the mixing matrix G(z) from equation (13) is obtained from:

[ Ĝ(z) ]_ij = 1 / ( Σ_{k=0}^{P} h_ij[k] z^{−k} ),   (36)

where the notation [G]_ij refers to the element of matrix G in row i and column j. An estimate for the demixing matrix H(z) from equation (14) can be obtained by numerically inverting Ĝ(z) via Gaussian elimination [19]. Unfortunately, the inversion process may introduce unstable poles into the transfer functions of H(z). The production of stable filters can be enforced by mirroring poles that fall outside of the unit circle back into the inside of the unit circle. The mirroring process distorts the correct phase response, but leaves the magnitude response of the individual channels intact.

³The weight for each vector was chosen proportional to the number of samples contained in the associated EAP section. Longer EAP sections thus had more weight than shorter ones.
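The pole-mirroring step can be sketched as follows (the helper name stabilize is ours; reflecting a pole across the unit circle preserves the shape of the magnitude response up to a constant gain factor):

```python
import numpy as np

def stabilize(a):
    """Mirror the poles of an all-pole filter 1/A(z) that lie outside the
    unit circle back inside, enforcing stability (section III-C).
    `a` holds the denominator coefficients [a_0, a_1, ..., a_P]."""
    poles = np.roots(a)
    bad = np.abs(poles) > 1.0
    poles[bad] = 1.0 / np.conj(poles[bad])   # reflect across the unit circle
    # rebuild the polynomial from the modified poles; keep the leading gain
    a_new = np.poly(poles) * a[0]
    return np.real_if_close(a_new)
```

A pole at z = 1.25, for example, is moved to z = 0.8, while poles already inside the unit circle are left unchanged.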

IV. EXPERIMENTS

We evaluated the performance of the proposed method with mixing/de-mixing trials over speech data from the TIMIT⁴ database. The TIMIT dataset contains recordings of 10 phonetically rich sentences from 630 speakers of 8 major dialects of American English. The corpus is stored in 16 bit/16 kHz waveform files for each utterance. One subset of files is strictly reserved for training and another subset is reserved for testing.

A. Instantaneous Mixing

During each mixing/de-mixing trial we randomly chose M utterances x_i from the corpus. The M utterances were mixed with a random mixing matrix A according to equation (3) to produce M observations y_i. The elements of A were chosen as independent, uniformly distributed random numbers over the interval [0, 1]. The observations y_i were then subjected to the proposed de-mixing procedure to produce M source estimates x̂_i.

The quality of the de-mixing process was measured with the following signal-to-noise ratio (SNR):

SNR = max_{P ∈ 𝒫} (10/M) Σ_{[i,j] ∈ P} log10 [ (x_i^T x_i) / ( (x_i − η x̂_j)^T (x_i − η x̂_j) ) ].   (37)

The scaling factor η is chosen as η = x_i^T x̂_j / (x̂_j^T x̂_j) to account for the unknown scaling of the reconstructed signals. The parameter P refers to a set of M index pairs [i, j] with i = 1 ... M and j = 1 ... M such that each number between 1 and M is used only once for i and once for j. 𝒫 is the entirety of all possible index sets P. The optimization over 𝒫 accounts for the unknown signal permutations introduced by W·A as discussed in section II.
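A sketch of a permutation- and scale-compensated SNR in the spirit of equation (37) is given below (the helper name separation_snr is ours; the permutation yielding the largest SNR is selected, which realizes the stated goal of compensating the unknown source ordering):

```python
import numpy as np
from itertools import permutations

def separation_snr(X, Xhat):
    """Separation quality measure in the spirit of equation (37).
    X: true sources (rows), Xhat: reconstructed sources (rows).
    Searches all assignments of estimates to sources and returns the
    SNR of the best-matching permutation."""
    M = X.shape[0]
    best = -np.inf
    for perm in permutations(range(M)):
        total = 0.0
        for i, j in enumerate(perm):
            x, xh = X[i], Xhat[j]
            eta = (x @ xh) / (xh @ xh)        # scale compensation factor
            err = x - eta * xh
            total += (10.0 / M) * np.log10((x @ x) / (err @ err))
        best = max(best, total)
    return best
```

Because the reconstructed sources are only determined up to scale and ordering, a swapped and rescaled (but otherwise accurate) reconstruction still scores a high SNR.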

In a first experiment we ran several sets of 256 random mixing/de-mixing trials with M = 2 over the training subset of the corpus. For each set of 256 trials we chose a different EAP decision threshold γ₀ as described in section II. The resulting average SNR (averaged over all trials) as a function of γ₀ is displayed in figure 4. The optimal threshold, i.e. the one that produced the highest average SNR, was found to be γ₀ = 0.85.

In a second experiment we ran several sets of 256 random mixing/de-mixing trials over the testing subset of the corpus. This time, we kept γ₀ fixed at 0.85 and varied the number of sources/channels M. The resulting average SNR (averaged over all 256 trials per M) as a function of M is shown in figure 9 (EAPD).

4The TIMIT database is available through the Linguistic Data Con-sortium (LDC) at the University of Pennsylvania (www.ldc.upenn.edu).

Figure 4. Determination of the EAP decision threshold γ₀. The average SNR is maximized in the two channel case (M = 2) for γ₀ = 0.85. (Plot: average SNR [dB], 65 to 110 dB, versus the SIR threshold, 0.5 to 1.)

Figure 5. An example of two source signals x_1[n] and x_2[n] from the TIMIT database. The signals were aligned to have a 30% overlap in time. (Plot: amplitude versus time, 0 to 4 sec, for channel #1 and channel #2.)

Figure 6. The resulting mixed signals y_1[n] and y_2[n] from the example in figure 5. The upper dashed line in each axis indicates the resulting STPM_j[n] contour (equation (18)) and the lower dashed line indicates the resulting NZCM_j[n] contour (equation (21)). The gray regions indicate the EAP sections that were estimated from the mixed signals.



Figure 7. The resulting demixed signals x̂1[n] and x̂2[n] from the example in figure 5 after application of the proposed EAP method. Both signals are very close to the original source signals x1[n] and x2[n] depicted in figure 5.


Figure 8. The demixing result of a commonly used conventional method of blind source separation after L. Parra and C. Spence [20] (see section V).

B. Convolutive Mixing

Our experiments with convolutive mixing were conducted over the SI-subset of the TIMIT database. The chosen subset consists of recordings from 630 subjects, each uttering 3 phonetically-diverse sentences5. The signals were low-pass filtered and down-sampled to 8 kHz prior to processing. All 3 sentences from the same speaker were concatenated and then truncated to 4 seconds. As a result we obtained a total of 630 different 4-second-long source signals x[n].
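The corpus preparation step can be sketched as follows. This is an illustrative assumption on our part: zero-padding of signals shorter than 4 seconds is not stated in the paper, and the anti-alias filtering before the down-sampling step is omitted for brevity.

```python
import numpy as np

# Hedged sketch of the corpus preparation: the three SI sentences of
# one speaker (already down-sampled to 8 kHz) are concatenated and
# truncated to 4 seconds (32000 samples). Padding is our assumption.
def make_source(sentences_8k, fs=8000, duration=4.0):
    x = np.concatenate(sentences_8k)
    n = int(fs * duration)
    return x[:n] if len(x) >= n else np.pad(x, (0, n - len(x)))
```

Applying this to each of the 630 speakers yields the fixed-length source signals x[n] used in the experiments.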

We ran experiments with different source numbers (M = 2, 3, and 4) and particularly small filter lengths (P + 1 = 5, 7, and 10). Unfortunately, the increased complexity and the poor conditioning of the underlying matrices RF and RD at higher dimensions prevented

5No sentence appears more than once in the subset.

us (as of now) from considering more realistic filter lengths (between 1000 and 100000 taps) of real reverberant environments. The presented theory holds for arbitrary filter lengths; however, one must employ an eigenvector/eigenvalue decomposition algorithm with a sensitivity far beyond the algorithms available to the authors.

For each M the available data was randomly split6 into ⌊630/M⌋ groups of M source signals xi[n]. To simulate conversational speech the signals were partially faded out to obtain a relative time-overlap between signals of roughly 30% (see figure 5). The M source signals of each group were mixed with order-P random minimum phase filters gij[n] according to equation (12). The resulting observations yj[n] for j = 1 . . . M were then used to estimate the inverse filter matrix H(z) according to section III. The reconstructed source signal estimates x̂i[n] for i = 1 . . . M were computed according to equation (14) via X̂(z) = H(z)Y(z).
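The convolutive mixing setup above can be sketched in a few lines. Enforcing minimum phase by reflecting zeros outside the unit circle to their conjugate reciprocals is one common construction and an assumption on our part, not a quote of the paper's implementation.

```python
import numpy as np

# Hedged sketch of equation (12): every observation y_j[n] is a sum of
# the M sources filtered with random order-P minimum phase FIR filters.
def random_min_phase_fir(order, rng):
    g = rng.standard_normal(order + 1)
    zeros = np.roots(g)
    # reflect zeros outside the unit circle to their conjugate reciprocals
    zeros = np.where(np.abs(zeros) > 1.0, 1.0 / np.conj(zeros), zeros)
    return g[0] * np.real(np.poly(zeros))

def mix(sources, order, rng):
    """sources: (M, N) array -> observations y of the same shape."""
    M, N = sources.shape
    y = np.zeros((M, N))
    for j in range(M):
        for i in range(M):
            g = random_min_phase_fir(order, rng)
            y[j] += np.convolve(sources[i], g)[:N]
    return y
```

Minimum phase matters here because it guarantees that each mixing filter has a stable, causal inverse, which the de-mixing stage implicitly relies on.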

The quality of the estimated model was evaluated with the Signal-to-Interference Ratio (SIR in [dB]) between the reconstructed signal x̂i[n] and the original signal xi[n]:

$$\mathrm{SIR}_i = \max_p \left\{ 10 \log_{10} \frac{\sum_n |x_i[n]|^2}{\sum_n \left| x_i[n] - \hat{x}_i[n-p] \right|^2} \right\}. \qquad (38)$$

The evaluation of the SIR was performed under careful consideration of possible numbering permutations between the original signals and the reconstructions.

Figures 5 to 7 show an example for an experiment with two sources. The gray regions in the figures indicate the EAP sections that were estimated from the mixed signals y1[n] and y2[n] from figure 6. Figure 8 shows the result for a commonly used conventional method of blind source separation after L. Parra and C. Spence [20] (see section V).

V. RESULTS

The performance results of the proposed EAP method for the instantaneous mixing case are presented in figure 9. The performance of two other popular BSS methods, FastICA7 [21] and AMUSE7 [11], is provided for comparison. For each method the average SNR and the standard deviation over all 256 trials per M are shown (via I-bars).

The proposed EAPD method clearly outperforms the other two methods for smaller source numbers. As the number of simultaneously talking sources M increases, the number (and quality) of available EAP sections naturally decreases. As a result, the performance of the proposed method declines. All methods performed around the same average SNR for 6 and more simultaneous sources (with a slight edge of FastICA and EAPD over AMUSE). Increased errors in EAP detection, however, lead to a much larger standard deviation of the EAPD method in comparison to FastICA and AMUSE at higher source numbers.

6Every source signal was only used once.

7We employed software provided by the original developers of the FastICA and AMUSE algorithms for the comparison. All third-party software was run with default parameters.

[Plot: average SNR in dB (0–70) versus the number of sources (1–8); curves for EAPD, FICA, and AMUSE.]

Figure 9. An SNR comparison between the proposed method (EAPD) and two other popular BSS methods: FastICA (FICA) [21] and AMUSE [11].

Note that, unlike for the convolutive mixing case, we did not enforce the existence of EAPs through artificial fading (as described in section IV-B). The results are, hence, based on M simultaneously talking speakers. EAPs, in this scenario, are mostly found in very short periods of low-power articulatory transitions (as well as low-power unvoiced sections) within speech. It is not surprising that the availability of such isolated events declines rapidly with increasing M.

The results of the experiments for the convolutive mixing are summarized in tables II and III. Table II lists the average SIR values (AvSIR) that were obtained by averaging the SIRi after equation (38) over all channels and all experiments with the same source number M and filter order P. The third column reports the average SIR values for the proposed EAP method. The fourth column reports the average SIR values that resulted from an application of the popular blind source separation method proposed by L. Parra and C. Spence [20]. The results for the Parra/Spence method were computed with software written by S. Harmeling (MATLAB function convbss.m, endorsed by L. Parra and C. Spence). The last column reports the average SIR between the observations yj[n] and the sources xi[n] as a reference.

Table III provides supplemental information for each experiment. Column three of table III lists the average SIR results that are obtained when the proposed method is applied to the true EAP locations (and not the estimated EAP locations). The fourth column of table III reports the number of instances (in %) in which the numerical inversion of matrix G(z) after equation (36) led to unstable poles that had to be mapped back into the unit circle. Column five reports the chosen value for the EAP acceptance threshold (EAT) as described in section III-C.

It is clearly visible from table II that the proposed method achieves significant improvements over the Parra/Spence method for small complexity tasks with smaller source numbers M and smaller filter orders P. In

Table II

  M   P+1   AvSIR        AvSIR          AvSIR
            EAP Method   Parra/Spence   Mixed Signals
  2    5     16.06         4.96           4.47
  2    7     10.45         5.11           4.50
  2   10      7.37         5.35           4.76
  3    5     11.24         5.17           3.77
  3    7      8.41         5.16           3.88
  3   10      6.22         5.40           4.02
  4    5      6.67         5.17           3.10
  4    7      2.72         5.10           3.26

Table II. Average signal-to-interference ratios for various source numbers M, filter lengths P + 1, and algorithms.

Table III

  M   P+1   AvSIR      Pole       EAT
            True EAP   Mirror %
  2    5     47.29      15.56     4.0
  2    7     26.54      30.79     4.0
  2   10      9.29      55.56     4.0
  3    5     30.53      59.52     3.0
  3    7     19.15      77.14     3.0
  3   10      9.68      95.24     3.0
  4    5     18.11      92.36     3.0
  4    7      9.09      99.36     3.0

Table III. Supplemental statistics about the experiments with various source numbers M and filter orders P.

the best case scenario, for two sources and with a filter length of 5 taps, we can obtain an 11 dB improvement in average signal-to-interference ratio. Unfortunately, the advantage vanishes with larger complexity tasks. The decline is partially due to the increasing number of pole location changes as listed in column four of table III.

A very promising result for future developments is contained in column three of table III. If the result of the EAP estimation is replaced with the true location of the EAPs in the given mixture signals yj[n] for j = 1 . . . M then the average SIR is dramatically improved over the Parra/Spence method even for higher complexity cases. It is thus expected that the method will produce significantly better results if equipped with a more robust EAP detection strategy.

VI. CONCLUSIONS

We presented a new approach to the solution of the cocktail party problem with instantaneous mixing and convolutive mixing (with small filter lengths). Instead of insisting on independence between sources and samples, we exploited the fact that speech is generally characterized by frequent pauses (EAPs). These pauses can be used for a one-channel-at-a-time estimation of the unknown


mixing matrix. Experiments have shown that the proposed method can outperform common BSS methods, especially for smaller source numbers.

Under the instantaneous mixing condition it should be noted that the presented simulation results were obtained under (unrealistically) harsh conditions for the EAP detection. All speakers were talking at the exact same time. The resulting EAP sections were, hence, short and generally of poor quality. More realistically, speakers tend to listen and respond in a dialog, which makes the detection of long, high quality EAP sections much more probable.

For the convolutive mixing case we provided a proof of concept for EAP based blind source separation methods. Some of the methods presented in this paper, especially the EAP detection, are, in their current form, still suboptimal and deserve to be studied in greater detail. Despite its suboptimality, however, the proposed method still improves upon existing strategies (especially for lower complexity tasks).

A caveat of the proposed method is that (in its current form) we have not imposed a constraint that forces the optimal unmixing matrix H(z) to be representative of a stable system. Instead, we employed a simple pole-mirroring strategy that, by itself, is responsible for a substantial part of the performance loss at higher complexity tasks (see table II).
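The pole-mirroring strategy mentioned above can be sketched as follows. Reflecting each unstable pole to its conjugate reciprocal is a common stabilization heuristic; the details below (in particular discarding the resulting gain change) are assumptions on our part, not a quote of the authors' implementation.

```python
import numpy as np

# Hedged sketch of pole mirroring: every pole of the estimated
# unmixing filter with magnitude >= 1 is reflected back into the
# unit circle. This preserves the magnitude response only up to a
# gain factor and alters the phase.
def mirror_poles(a):
    """a: denominator coefficients with a[0] = 1; returns a stable a."""
    poles = np.roots(a)
    poles = np.where(np.abs(poles) >= 1.0, 1.0 / np.conj(poles), poles)
    return np.real(np.poly(poles))

# Example: a denominator with an unstable pole at z = 1.25.
a = np.poly([1.25, 0.5])   # unstable
a_stable = mirror_poles(a)  # pole moved to z = 0.8
```

The phase distortion introduced by this reflection is consistent with the performance loss observed at higher complexity tasks, where the fraction of mirrored poles grows (column four of table III).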

The proposed method may readily be extended to a hybrid system that combines an EAP based approach with independent component analysis (ICA). Since most ICA methods are iterative in nature, it is conceivable that EAP based methods deliver good initialization points for the iterations. In on-line methods one may adaptively decide whether an EAP or ICA method is more appropriately suited to analyze the current frame of observations. A tracking method that switches between EAP and ICA methods may lead to improvements in adaptive blind system identification as well.

REFERENCES

[1] N. Mitianoudis and M. E. Davies, "Audio source separation: solutions and problems," International Journal of Adaptive Control and Signal Processing, vol. 18, no. 3, pp. 299–314, Apr. 2004.

[2] T. W. Lee, Independent Component Analysis: Theory and Applications. Kluwer Academic Publishers, 1998.

[3] R. M. Nickel, "Blind multichannel system identification with applications in speech signal processing," Proc. Int. Conf. on Comp. Intell. for Modelling, Control and Automation (CIMCA), Vienna, Austria, Nov. 2005.

[4] P. Comon, "Independent component analysis: A new concept," Signal Processing, vol. 36, pp. 287–314, 1994.

[5] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.

[6] S. C. Douglas, H. Sawada, and S. Makino, "Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 92–104, Jan. 2005.

[7] M. Z. Ikram and D. R. Morgan, "Permutation inconsistency in blind speech separation: investigation and solutions," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 1–13, Jan. 2005.

[8] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and K. Shikano, "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1135–1146, Oct. 2003.

[9] D. Martinez and A. Bray, "Nonlinear blind source separation using kernels," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 228–235, Jan. 2003.

[10] W. Wang, S. Sanei, and J. A. Chambers, "Penalty function-based joint diagonalization approach for convolutive blind separation of nonstationary sources," IEEE Transactions on Signal Processing, vol. 53, no. 5, pp. 1654–1669, May 2005.

[11] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. Chichester, U.K.: Wiley, 2002.

[12] S. Araki, R. Mukai, S. Makino, T. Nishikawa, and H. Saruwatari, "The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 2, pp. 109–116, Mar. 2003.

[13] R. R. Gharieb and A. Cichocki, "Second-order statistics based blind source separation using a bank of subband filters," Digital Signal Processing, vol. 13, no. 2, pp. 252–274, Apr. 2003.

[14] Y. A. Huang, J. Benesty, and J. Chen, "A blind channel identification-based two-stage approach to separation and dereverberation of speech signals in a reverberant environment," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 882–895, Sept. 2005.

[15] Y. Medan, E. Yair, and D. Chazan, "Super resolution pitch determination of speech signals," IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40–48, Jan. 1991.

[16] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd ed. John Wiley & Sons, Nov. 2004.

[17] J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech Signals. New York: Macmillan, 1993.

[18] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Menlo Park, CA: Wiley-Interscience, 1973.

[19] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: The Johns Hopkins University Press, 1991.

[20] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 320–327, May 2000.

[21] A. Hyvarinen, "A family of fixed-point algorithms for independent component analysis," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'97), 1997.


