
Linear Time Feature Selection for Regularized Least-Squares

Tapio Pahikkala, Antti Airola, and Tapio Salakoski

March 19, 2010

Abstract

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm and its ability to find good quality feature sets.

1 Introduction

Feature selection is one of the fundamental tasks in machine learning. Simply put, given a set of features, the task is to select an informative subset. The selected set can be informative in the sense that it guarantees a good performance of the machine learning method or that it will provide new knowledge about the task in question. Such increased understanding of the data is sought after especially in life sciences, where feature selection is often used, for example, to find genes relevant to the problem under consideration. Other benefits include compressed representations of learned predictors and faster prediction time. This can enable the application of learned models when constrained by limited memory and real-time response demands, as is typically the case in embedded computing, for example.

The feature selection methods are divided by [1] into three classes, the so-called filter, wrapper, and embedded methods. In the filter model feature selection is done as a preprocessing step by measuring only the intrinsic properties of the data. In the wrapper model [2] the features are selected through interaction with a machine learning method. The quality of selected features is evaluated by measuring some estimate of the generalization performance of a prediction model constructed using them. The embedded methods minimize an objective function which is a trade-off between a goodness-of-fit term and a penalty term for a large number of features (see e.g. [3]).

On one hand, the method we propose in this work is a wrapper method, since equivalent results can be obtained by using a black-box learning algorithm together with the considered search and performance estimation strategies. On the other hand, it may also be considered as an embedded method, because the feature subset selection is so deeply integrated in the learning process that our work results in a novel efficient training algorithm for the method.

As suggested by [4], we estimate the generalization performance of models learned with different subsets of features with cross-validation. More specifically, we consider the leave-one-out cross-validation (LOO) approach, where each example in turn is left out of the training set and used for testing. Since the number of possible feature combinations grows exponentially with the number of features, a search strategy over the power set of features is needed. The strategy we choose is the greedy forward selection approach. The method starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. That is, it performs a steepest descent hill-climbing in the space of feature subsets of size up to a given limit k ∈ N and is greedy in the sense that it does not remove features once they have been selected.

Our method is built upon regularized least-squares (RLS), also known as the least-squares support vector machine (LS-SVM) and ridge regression, which is a state-of-the-art machine learning method suitable both for regression and classification [5–11]. An important property of the algorithm is that it has a closed form solution, which can be fully expressed in terms of matrix operations. This allows developing efficient computational shortcuts for the method, since small changes in the training data matrix correspond to low-rank changes in the learned predictor. Especially, it makes possible the development of efficient cross-validation algorithms. An updated predictor, corresponding to one learned from a training set from which one example has been removed, can be obtained via a well-known computationally efficient short-cut (see e.g. [12–14]), which in turn enables the fast computation of LOO-based performance estimates. Analogously to removing the effect of training examples from learned RLS predictors, the effects of a feature can be added or removed by updating the learned RLS predictors via similar computational short-cuts.

Learning a linear RLS predictor with k features and m training examples requires O(min{k^2 m, km^2}) time, since the training can be performed either in primal or dual form depending on whether m > k or vice versa. Given that the computation of the LOO performance requires m retrainings, that the forward selection goes through on the order of n features available for selection in each iteration, and that the forward selection has k iterations, the overall time complexity of the forward selection with the LOO criterion becomes O(min{k^3 m^2 n, k^2 m^3 n}) in case RLS is used as a black-box method.

In the machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO, as well as for other types of cross-validation, have been used to speed up the evaluation of feature subset quality for ordinary non-regularized least-squares (see e.g. [15,16]) and for RLS (see e.g. [17,18]). However, the considered approaches are still computationally quite demanding, since the LOO estimate needs to be re-calculated from scratch for each considered subset of features. Recently, [19] introduced a method which uses additional computational short-cuts for efficient updating of the LOO predictions when adding new features to those already selected. The computational cost of the incremental forward selection procedure is only O(km^2 n) for the method. The speed improvement is notable especially in cases where the aim is to select a large number of features but the training set is small. However, the method is still impractical for large training sets due to its quadratic scaling.

In this paper, we improve upon the above results by showing how greedy forward selection according to the LOO criterion can be carried out in O(kmn) time. Thus, our algorithm, which we call greedy RLS or greedy LS-SVM, is linear in the number of examples, features, and selected features, while it still provides the same solution as the straightforward wrapper approach. As a side effect we obtain a novel training algorithm for linear RLS predictors that is suitable for large-scale learning and guarantees sparse solutions. Computational experiments demonstrate the scalability and efficiency of greedy RLS, and the predictive power of the selected feature sets.

2 Regularized Least-Squares

We start by introducing some notation. Let R^m and R^{n×m}, where n, m ∈ N, denote the sets of real valued column vectors and n×m-matrices, respectively. To denote real valued matrices and vectors we use capital letters and bold lower case letters, respectively. Moreover, index sets are denoted with calligraphic capital letters. By denoting M_i, M_{:,j}, and M_{i,j}, we refer to the ith row, jth column, and i,jth entry of the matrix M ∈ R^{n×m}, respectively. Similarly, for index sets R ⊆ {1, . . . , n} and L ⊆ {1, . . . , m}, we denote the submatrices of M having their rows indexed by R, the columns by L, and the rows by R and columns by L as M_R, M_{:,L}, and M_{R,L}, respectively. We use an analogous notation also for column vectors, that is, v_i refers to the ith entry of the vector v.

Let X ∈ R^{n×m} be a matrix containing the whole feature representation of the examples in the training set, where n is the total number of features and m is the number of training examples. The i,jth entry of X contains the value of the ith feature in the jth training example. Moreover, let y ∈ R^m be a vector containing the labels of the training examples. In binary classification, the labels can be restricted to be either −1 or 1, for example, while they can be any real numbers in regression tasks.

In this paper, we consider linear predictors of type

f(x) = w^T x_S,  (1)

where w is the |S|-dimensional vector representation of the learned predictor and x_S can be considered as a mapping of the data point x into an |S|-dimensional feature space.¹ Note that the vector w only contains entries corresponding to the features indexed by S. The rest of the features of the data points are not used in the prediction phase. The computational complexity of making predictions with (1) and the space complexity of the predictor are both O(k), where k is the desired number of features to be selected, provided that the feature vector representation x_S for the data point x is given.
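As a small, non-authoritative illustration of (1), the following NumPy sketch evaluates such a sparse predictor; the index set S and the weights are invented for the example.

    import numpy as np

    # Hypothetical selected feature indices and learned weights, for illustration only.
    S = [3, 7, 12]                      # |S| = 3 selected features
    w = np.array([0.5, -1.2, 0.8])      # |S|-dimensional predictor

    def predict(x, w, S):
        """Evaluate f(x) = w^T x_S; only the selected entries of x are used."""
        return w @ x[S]

    x = np.random.randn(20)             # a data point with 20 original features
    print(predict(x, w, S))             # O(|S|) time, O(|S|) predictor storage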

Given training data and a set of feature indices S, we find w by minimizing the so-called regularized least-squares (RLS) risk. Minimizing the RLS risk can be expressed as the following problem:

argmin_{w ∈ R^{|S|}}  ((w^T X_S)^T − y)^T ((w^T X_S)^T − y) + λ w^T w.  (2)

The first term in (2), called the empirical risk, measures how well the prediction function fits the training data. The second term is called the regularizer and it controls the tradeoff between the loss on the training set and the complexity of the prediction function.

¹ In the literature, the formula of the linear predictors often also contains a bias term. Here, we assume that if such a bias is used, it will be realized by using an extra constant valued feature in the data points.

A straightforward approach to solve (2) is to set the derivative of the objective function with respect to w to zero. Then, by solving it with respect to w, we get

w = (X_S (X_S)^T + λI)^{-1} X_S y,  (3)

where I ∈ R^{m×m} is the identity matrix. We note (see e.g. [20]) that an equivalent result can be obtained from

w = X_S ((X_S)^T X_S + λI)^{-1} y.  (4)

If the size of the set S of currently selected features is smaller than the number of training examples m, it is computationally beneficial to use the form (3), while using (4) is faster in the opposite case. Namely, the computational complexity of the former is O(|S|^3 + |S|^2 m), while that of the latter is O(m^3 + m^2|S|), and hence the complexity of training a predictor is O(min{|S|^2 m, m^2|S|}).
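As a rough illustration (not the authors' implementation), the following NumPy sketch trains a linear RLS predictor using (3) or (4), whichever is cheaper; linear systems are solved directly instead of forming explicit inverses, which is a standard numerical choice.

    import numpy as np

    def train_rls(X_S, y, lam):
        """Train linear RLS on the selected features.

        X_S : (|S|, m) matrix, rows = selected features, columns = training examples
              (the paper's convention); y : (m,) label vector; lam : regularization parameter.
        Uses the primal form (3) when |S| < m and the dual form (4) otherwise,
        so the cost is O(min{|S|^2 m, m^2 |S|}).
        """
        d, m = X_S.shape
        if d < m:
            # primal: w = (X_S X_S^T + lam I)^{-1} X_S y
            return np.linalg.solve(X_S @ X_S.T + lam * np.eye(d), X_S @ y)
        # dual: w = X_S (X_S^T X_S + lam I)^{-1} y
        return X_S @ np.linalg.solve(X_S.T @ X_S + lam * np.eye(m), y)

    # toy usage with random data (shapes follow the paper: features x examples)
    rng = np.random.default_rng(0)
    X_S, y = rng.standard_normal((5, 100)), rng.standard_normal(100)
    w = train_rls(X_S, y, lam=1.0)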

To support the considerations in the following sections, we introduce the dual form of the prediction function and some extra notation. According to [7], the prediction function (1) can be represented in dual form as follows

f(x) = a^T (X_S)^T x_S.

Here a ∈ R^m is the vector of so-called dual variables, which can be obtained from

a = Gy,

where

G = (K + λI)^{-1}  (5)

and

K = (X_S)^T X_S.  (6)

Note that if S = ∅, K is defined to be a matrix whose every entry is equal to zero.

In the context of kernel-based learning algorithms (see e.g. [21–23]), K and a are usually called the kernel matrix and the vector of dual variables, respectively. The kernel matrix contains the inner products between the feature vectors of all training data points. With kernel methods, the feature vector representation of the data points may even be infinite dimensional as long as there is an efficient, although possibly implicit, way to calculate the value of the inner product, and the dual variables are used to represent the prediction function. However, while the dual representation plays an important role in the considerations of this paper, we restrict our considerations to feature representations of type x_S which can be represented explicitly.

Now let us consider some well-known efficient approaches for evaluating the LOO performance of a trained RLS predictor (see e.g. [12–14,24] and, for the non-regularized case, see e.g. [15,16]). The LOO prediction for the jth training example can be obtained in constant time from

(1 − q_j)^{-1}(f_j − q_j y_j),  (7)

where

f = (w^T X_S)^T

and

q_j = (X_{S,j})^T (X_S (X_S)^T + λI)^{-1} X_{S,j},

provided that the vectors f and q are computed and stored in memory. The time complexity of computing f and q is O(|S|^3 + |S|^2 m), which is the same as that of training RLS via (3). Alternatively, the constant time LOO predictions can be expressed in terms of the dual variables a and the diagonal entries of the matrix G as follows:

y_j − (G_{j,j})^{-1} a_j.  (8)

Here, the time complexity of computing a and the diagonal entries of the matrix G is O(m^3 + m^2|S|), which is equal to the complexity of training RLS via (4). Thus, using either (7) or (8), the LOO performance for the whole training set can be computed in O(min{|S|^2 m, m^2|S|}) time, which is equal to the time complexity of training RLS.
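The two LOO short-cuts can be sketched as follows; this is only an illustration of (7) and (8) on toy data, with the matrix inverses formed explicitly for clarity.

    import numpy as np

    def loo_predictions_primal(X_S, y, lam):
        """LOO predictions via (7): f = X_S^T w and q_j = x_j^T (X_S X_S^T + lam I)^{-1} x_j."""
        d, m = X_S.shape
        A = np.linalg.inv(X_S @ X_S.T + lam * np.eye(d))   # (|S| x |S|)
        w = A @ (X_S @ y)
        f = X_S.T @ w                                       # predictions on the training data
        q = np.sum(X_S * (A @ X_S), axis=0)                 # q_j = x_j^T A x_j, all j at once
        return (f - q * y) / (1.0 - q)

    def loo_predictions_dual(X_S, y, lam):
        """LOO predictions via (8): y_j - a_j / G_jj with G = (K + lam I)^{-1}, a = G y."""
        m = X_S.shape[1]
        G = np.linalg.inv(X_S.T @ X_S + lam * np.eye(m))
        a = G @ y
        return y - a / np.diag(G)

    rng = np.random.default_rng(0)
    X_S, y = rng.standard_normal((3, 50)), rng.standard_normal(50)
    # the two short-cuts agree up to numerical precision
    assert np.allclose(loo_predictions_primal(X_S, y, 1.0), loo_predictions_dual(X_S, y, 1.0))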

3 Feature Selection

In this section, we consider algorithmic implementations of feature selection for RLS with the leave-one-out (LOO) criterion. We start by considering a straightforward wrapper approach which uses RLS as a black-box method in Section 3.1. Next, we recall a previously proposed algorithm which provides results that are equivalent to those of the wrapper approach but which has computational short-cuts to speed up the feature selection in Section 3.2. In Section 3.3, we present greedy RLS, our novel linear time algorithm, which again provides the same results as the wrapper approach but which has much lower computational and memory complexity than the previously proposed algorithms.

3.1 Wrapper Approach

Algorithm 1: Standard wrapper algorithm for RLS

Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w

1  S ← ∅;
2  while |S| < k do
3      e ← ∞;
4      b ← 0;
5      foreach i ∈ {1, . . . , n} \ S do
6          R ← S ∪ {i};
7          e_i ← 0;
8          foreach j ∈ {1, . . . , m} do
9              L ← {1, . . . , m} \ {j};
10             w ← t(X_{R,L}, y_L, λ);
11             p ← w^T X_{R,j};
12             e_i ← e_i + l(y_j, p);
13         end
14         if e_i < e then
15             e ← e_i;
16             b ← i;
17         end
18     end
19     S ← S ∪ {b};
20 end
21 w ← t(X_S, y, λ);

Here, we consider the standard wrapper approach in which RLS is used as a black-box method which is re-trained for every feature set to be evaluated during the selection process and for every round of the LOO cross-validation. In forward selection, the set of all feature sets up to size k is searched using hill climbing in the steepest descent direction (see e.g. [25]), starting from an empty set. The method is greedy in the sense that it adds one feature at a time to the set of selected features but no features are removed from the set at any stage. The wrapper approach is presented in Algorithm 1. In the algorithm description, t(·, ·, ·) denotes the black-box training procedure for RLS which takes a data matrix, a label vector, and a value of the regularization parameter as input and returns a vector representation of the learned predictor w. The function l(·, ·) computes the loss for one training example given its true label and a predicted label obtained from the LOO procedure. Typical losses are, for example, the squared loss for regression and the zero-one error for classification.

Given that the computation of the LOO performance requires m retrainings, that the forward selection goes through O(n) features in each iteration, and that k features are chosen, the overall time complexity of the forward selection with the LOO criterion is O(min{k^3 m^2 n, k^2 m^3 n}). Thus, the wrapper approach is feasible with small training sets and in cases where the aim is to select only a small number of features. However, the at least quadratic complexity with respect to both the size of the training set and the number of selected features makes the standard wrapper approach impractical in large scale learning settings. The space complexity of the wrapper approach is O(nm), which is equal to the cost of storing the data matrix X.

An immediate reduction of the computational complexities considered above can be achieved via the short-cut methods for calculating the LOO performance presented in Section 2. In the machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO, as well as for other types of cross-validation, have been used to speed up the evaluation of feature subset quality for non-regularized least-squares (see e.g. [15,16]) and for RLS (see e.g. [17,18]). For the greedy forward selection, the short-cuts reduce the overall computational complexity to O(min{k^3 mn, k^2 m^2 n}), since LOO can then be calculated as efficiently as training the predictor itself. However, both [18] and [17] considered only the dual formulation (8) of the LOO speed-up, which is expensive for large data sets. Thus, we are not aware of any studies in which the primal formulation (7) would be implemented for the greedy forward selection for RLS.

3.2 Previously Proposed Speed-Up

We now present a feature selection algorithm proposed by [19], which the authors call low-rank updated LS-SVM. The features selected by the algorithm are equal to those selected by the standard wrapper algorithm and by the method proposed by [17], but the method has a lower time complexity with respect to k due to certain matrix calculus based computational short-cuts. The low-rank updated LS-SVM algorithm for feature selection is presented in Algorithm 2.

The low-rank updated LS-SVM is based on updating the matrix G and the vector a as new features are added into S. In the beginning of the forward update algorithm, the set S of selected features is empty. This means that the matrix G contains no information of the data matrix X. Instead, every entry of the matrix K is equal to zero, and hence G and a are initialized to λ^{-1}I and λ^{-1}y, respectively, in lines 2 and 3 of Algorithm 2. When a new feature candidate is evaluated, the LOO performance corresponding to the updated feature set can be calculated using the temporarily updated versions of G and a, denoted in the algorithm as G̃ and ã, respectively. Here, the improved speed is based on two short-cuts, namely, updating the matrix G̃ via the well known Sherman-Morrison-Woodbury (SMW) formula and using the fast LOO computation.

Let us first consider updating the matrix G. Formally, if the ith feature is to be evaluated, where i ∉ S, the algorithm calculates the matrix G̃ corresponding to the feature index set S ∪ {i}. According to (5), the matrix G̃ is

(K + vv^T + λI)^{-1},  (9)

where K is the kernel matrix defined as in (6) without having the ith feature in S, and v = (X_i)^T contains the values of the ith feature for the training data.

Algorithm 2: Low-rank updated LS-SVM

Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w

1  S ← ∅;
2  a ← λ^{-1}y;
3  G ← λ^{-1}I;
4  while |S| < k do
5      e ← ∞;
6      b ← 0;
7      foreach i ∈ {1, . . . , n} \ S do
8          v ← (X_i)^T;
9          G̃ ← G − Gv(1 + v^T Gv)^{-1}(v^T G);
10         ã ← G̃y;
11         e_i ← 0;
12         foreach j ∈ {1, . . . , m} do
13             p ← y_j − (G̃_{j,j})^{-1} ã_j;
14             e_i ← e_i + l(y_j, p);
15         end
16         if e_i < e then
17             e ← e_i;
18             b ← i;
19         end
20     end
21     v ← (X_b)^T;
22     G ← G − Gv(1 + v^T Gv)^{-1}(v^T G);
23     a ← Gy;
24     S ← S ∪ {b};
25 end
26 w ← X_S a;

The time needed to compute (9) directly is cubic in m, which provides no benefit compared to the standard wrapper approach. However, provided that the matrix G is stored in memory, it is possible to speed up the computation of G̃ via the SMW formula. Formally, the matrix G̃ can also be obtained from

G − Gv(1 + v^T Gv)^{-1}(v^T G),  (10)

which requires time quadratic in m, provided that the multiplications are performed in the order indicated by the parentheses. Having calculated G̃, the updated vector of dual variables ã can be obtained from

G̃y  (11)

also with O(m^2) floating point operations.

Now let us consider the efficient evaluation of the features with the LOO performance. Provided that G̃ and ã are stored in memory, the LOO performance of RLS trained with the feature set S ∪ {i} can be calculated via (8). After the feature whose inclusion is the most favorable with respect to the LOO performance is found, it is added to the set of selected features and the RLS solution is updated accordingly in the last four lines of the algorithm.

The outermost loop is run k times, k denoting the number of features to be selected. For each round of the outer loop, the middle loop is run O(n) times, because each of the n − |S| features is evaluated in turn before the best of them is added to the set of selected features S. Due to the efficient formula (8) for LOO computation, the calculation of the LOO predictions for the m training examples requires only O(m) floating point operations in each run of the innermost loop. The complexity of the middle loop is dominated by the O(m^2) time needed for calculating G̃ in line 9 and ã in line 10, and hence the overall time complexity of the low-rank updated LS-SVM algorithm is O(knm^2).
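A minimal NumPy sketch of the SMW-based update used in lines 9-10 (and 22-23) of Algorithm 2, together with a sanity check against direct inversion; variable names are ours.

    import numpy as np

    def add_feature_smw(G, y, v):
        """Update G = (K + lam*I)^{-1} and a = G y when the kernel matrix gains the
        rank-one term v v^T (a new feature), via (10)-(11): O(m^2) instead of O(m^3)."""
        Gv = G @ v
        G_new = G - np.outer(Gv, Gv) / (1.0 + v @ Gv)   # (10), using the symmetry of G
        return G_new, G_new @ y                          # (11)

    # sanity check against re-inverting from scratch on toy data
    rng = np.random.default_rng(0)
    m, lam = 30, 1.0
    X_old = rng.standard_normal((4, m))                  # features already in S
    v, y = rng.standard_normal(m), rng.standard_normal(m)
    G = np.linalg.inv(X_old.T @ X_old + lam * np.eye(m))
    G_new, a_new = add_feature_smw(G, y, v)
    assert np.allclose(G_new, np.linalg.inv(X_old.T @ X_old + np.outer(v, v) + lam * np.eye(m)))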

The complexity with respect to k is only linear, which is much better than that of the standard wrapper approach, and hence the selection of large feature sets is made possible. However, feature selection with large training sets is still infeasible because of the quadratic complexity with respect to m. In fact, the running time of the standard wrapper approach with the LOO short-cut can be shorter than that of the low-rank updated LS-SVM method in cases where the training set is large and the number of features to be selected is small.

The space complexity of the low-rank updated LS-SVM algorithm is O(nm + m^2), because the matrices X and G require O(nm) and O(m^2) space, respectively. Due to the quadratic dependence on m, this space complexity is worse than that of the standard wrapper approach.

3.3 Linear Time Forward Selection Algorithm

Here, we present our novel algorithm for greedy forward selection for RLS with the LOO criterion. We refer to our algorithm as greedy RLS, since, in addition to the feature selection point of view, it can also be considered as a greedy algorithm for learning sparse RLS predictors. Our method resembles the low-rank updated LS-SVM in the sense that it also operates by iteratively updating the vector of dual variables, and hence we define it on the grounds of the considerations of Section 3.2. Pseudo code of greedy RLS is presented in Algorithm 3.

The time and space complexities of the low-rank updated LS-SVM algorithm are quadratic in m, because it involves updating the matrix G (lines 9 and 22 in Algorithm 2) and the vector a (lines 10 and 23 in Algorithm 2). In order to improve the efficiency of the algorithm by getting rid of the quadratic dependence on m in both space and time, we must avoid the explicit calculation and storing of the matrices G and G̃. Next, we present an improvement which solves these problems.

In the middle loop of the low-rank updated LS-SVM algorithm in Algorithm 2, it is assumed that the matrix G constructed according to the current set S of selected features is stored in memory. It is temporarily updated to the matrix G̃ corresponding to the set S ∪ {i}, where i is the index of the feature to be evaluated. We observe that the LOO computations via the formula (8) require the vector of dual variables ã corresponding to S ∪ {i} but only the diagonal elements of G̃.

Let us first consider speeding up the computation of ã. Since we already have a stored in memory, we can take advantage of it when computing ã. That is, since the value of the vector a before the update is Gy, its updated value ã can be, according to (10) and (11), obtained from

a − Gv(1 + v^T Gv)^{-1}(v^T a),  (12)

where v = (X_i)^T. This does not yet help us much, because the form (12) still contains the matrix G explicitly. However, if we assume that, in addition to a, the product Gv is calculated in advance and the result is stored in memory, calculating (12) only requires O(m) time. In fact, since the operation (12) is performed for almost every feature during each round of the greedy forward selection, we assume that we have an m × n cache matrix, denoted by

C = GX^T,  (13)

computed in advance containing the required products for all features, each column corresponding to one product.

Now, given that C is available in memory, we calculate, in O(m) time, a vector

u = C_{:,i}(1 + v^T C_{:,i})^{-1}  (14)

and store it in memory for later use. This is done in lines 10 and 24 of Algorithm 3. Consequently, (12) can be rewritten as

a − u(v^T a),  (15)

and hence the updated vector of dual variables ã can be computed in O(m) time.

We also have to ensure fast computation of the LOO performance. The use of (8) for computing the LOO for the updated feature set requires the diagonal elements of the matrix G̃. Let d and d̃ denote the vectors containing the diagonal elements of G and G̃, respectively. We observe that the SMW formula (10) for calculating G̃ can be rewritten as

G − u(C_{:,i})^T.  (16)

Now, we make the assumption that, instead of having the whole G stored in memory, we have only stored its diagonal elements, that is, the vector d. Then, according to (16), the jth element of the vector d̃ can be computed from

d_j − u_j C_{j,i},  (17)

requiring only constant time, the overall time needed for computing d̃ thus again being O(m).

Thus, provided that we have all the necessary caches available, evaluating each feature requires O(m) time, and hence one pass through the whole set of n features needs O(mn) floating point operations. But we still have to ensure that the caches can be initialized and updated efficiently enough.
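In NumPy terms, the O(m) evaluation of a single candidate feature from the caches a, d, and C could look as follows; this is a sketch on invented toy data, with squared loss as the LOO criterion.

    import numpy as np

    def evaluate_candidate(i, X, y, a, d, C):
        """LOO squared error if feature i were added, in O(m) time, using the caches
        a = G y, d = diag(G), C = G X^T and equations (14), (15), (17), and (8)."""
        v = X[i]                                    # values of feature i over the m examples
        u = C[:, i] / (1.0 + v @ C[:, i])           # (14)
        a_new = a - u * (v @ a)                     # (15): dual variables for S u {i}
        d_new = d - u * C[:, i]                     # (17): diagonal of the updated G
        loo_residuals = a_new / d_new               # y minus the LOO prediction, by (8)
        return np.sum(loo_residuals ** 2)

    # toy usage with the caches of the empty feature set (G = I / lam)
    rng = np.random.default_rng(0)
    n, m, lam = 10, 50, 1.0
    X, y = rng.standard_normal((n, m)), rng.standard_normal(m)
    a, d, C = y / lam, np.ones(m) / lam, X.T / lam
    best = int(np.argmin([evaluate_candidate(i, X, y, a, d, C) for i in range(n)]))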

In the initialization phase of the greedy RLS algorithm (lines 1-4 in Algorithm 3) the set of selected features is empty, and hence the values of a, d, and C are initialized to λ^{-1}y, λ^{-1}1, and λ^{-1}X^T, respectively, where 1 ∈ R^m is a vector having every entry equal to 1. The computational complexity of the initialization phase is dominated by the O(mn) time required for initializing C. Thus, the initialization phase is no more complex than one pass through the features.

When a new feature is added into the set of selected features, the vector a is updated according to (15) and the vector d according to (17). Putting together (13), (14), and (16), the cache matrix C can be updated via

C − u(v^T C),

which requires O(mn) time. This is equally expensive as the fast approach introduced above for trying each feature at a time using LOO as a selection criterion.

Finally, if we are to select altogether k features, the overall time complexity of greedy RLS becomes O(kmn). The space complexity is dominated by the matrices X and C, which both require O(mn) space.
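Putting the pieces together, a minimal NumPy sketch of the whole procedure (following the pseudocode in Algorithm 3 below, with squared loss as the LOO criterion) might look as follows; it is meant as an illustration, not as an optimized implementation.

    import numpy as np

    def greedy_rls(X, y, k, lam):
        """Greedy RLS sketch. X: (n, m) data matrix (rows = features, columns = examples,
        as in the paper), y: (m,) labels, k: number of features to select, lam: regularizer.
        Returns the selected index list S and the sparse weight vector w of (1).
        Runs in O(kmn) time and O(mn) memory."""
        n, m = X.shape
        a = y / lam                    # dual variables for the empty feature set (G = I/lam)
        d = np.ones(m) / lam           # diagonal of G
        C = X.T / lam                  # cache C = G X^T, one column per feature
        S = []
        for _ in range(k):
            best_i, best_err = -1, np.inf
            for i in range(n):
                if i in S:
                    continue
                v = X[i]
                u = C[:, i] / (1.0 + v @ C[:, i])   # (14)
                a_i = a - u * (v @ a)               # (15)
                d_i = d - u * C[:, i]               # (17)
                err = np.sum((a_i / d_i) ** 2)      # squared LOO error via (8)
                if err < best_err:
                    best_i, best_err = i, err
            # permanently add the best feature and update the caches (lines 23-30 of Algorithm 3)
            v = X[best_i]
            u = C[:, best_i] / (1.0 + v @ C[:, best_i])
            a = a - u * (v @ a)
            d = d - u * C[:, best_i]
            C = C - np.outer(u, v @ C)
            S.append(best_i)
        w = X[S] @ a                   # w = X_S a (last line of Algorithm 3)
        return S, w

    # toy usage
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((100, 200)), rng.standard_normal(200)
    S, w = greedy_rls(X, y, k=5, lam=1.0)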

4 Experimental results

In Section 4.1, we explore the computational efficiency of the introduced method and compare it with the best previously proposed implementation. In Section 4.2, we explore the quality of the feature selection process. In addition, we conduct measurements of the degree to which the LOO criterion overfits in Section 4.3.

Algorithm 3: Greedy RLS algorithm proposed by us

Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w

1  a ← λ^{-1}y;
2  d ← λ^{-1}1;
3  C ← λ^{-1}X^T;
4  S ← ∅;
5  while |S| < k do
6      e ← ∞;
7      b ← 0;
8      foreach i ∈ {1, . . . , n} \ S do
9          v ← (X_i)^T;
10         u ← C_{:,i}(1 + v^T C_{:,i})^{-1};
11         ã ← a − u(v^T a);
12         e_i ← 0;
13         foreach j ∈ {1, . . . , m} do
14             d̃_j ← d_j − u_j C_{j,i};
15             p ← y_j − (d̃_j)^{-1} ã_j;
16             e_i ← e_i + l(y_j, p);
17         end
18         if e_i < e then
19             e ← e_i;
20             b ← i;
21         end
22     end
23     v ← (X_b)^T;
24     u ← C_{:,b}(1 + v^T C_{:,b})^{-1};
25     a ← a − u(v^T a);
26     foreach j ∈ {1, . . . , m} do
27         d_j ← d_j − u_j C_{j,b};
28     end
29     C ← C − u(v^T C);
30     S ← S ∪ {b};
31 end
32 w ← X_S a;

4.1 Computational efficiency

First, we present experimental results about the scalability of our method. We use randomly generated data from two normal distributions with 1000 features, of which 50 are selected. The number of examples is varied. In the first experiment we vary the number of training examples between 500 and 5000, and provide a comparison to the low-rank updated LS-SVM method [19]. In the second experiment the training set size is varied between 1000 and 50000. We do not consider the baseline method in the second experiment, as it does not scale up to the considered training set sizes. The runtime experiments were run on a modern desktop computer with a 2.4 GHz Intel Core 2 Duo E6600 processor, 8 GB of main memory, and a 64-bit Ubuntu Linux 9.10 operating system.
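For reference, a timing harness along these lines could be set up as below; it assumes the greedy_rls sketch from Section 3.3 is in scope, and the synthetic two-class data (class-dependent mean shift of 0.5) is our own guess at this kind of setup rather than the exact one used here.

    import time
    import numpy as np

    # Rough re-creation of the scalability experiment: synthetic data from two normal
    # distributions, 1000 features, 50 selected, growing number of examples.
    # greedy_rls refers to the sketch in Section 3.3; timings depend on hardware.
    rng = np.random.default_rng(0)
    n, k, lam = 1000, 50, 1.0
    for m in (500, 1000, 2000):
        y = np.where(rng.random(m) < 0.5, -1.0, 1.0)        # class labels
        X = rng.standard_normal((n, m)) + 0.5 * y           # class-conditional mean shift
        start = time.perf_counter()
        greedy_rls(X, y, k, lam)
        print(m, time.perf_counter() - start, "seconds")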

We note that the running times of these two methods are not affected by the choice of the regularization parameter, or by the distribution of the features or the class labels. This is in contrast to iterative optimization techniques commonly used to train, for example, support vector machines [26]. Thus we can draw general conclusions about the scalability of the methods from experiments with a fixed value for the regularization parameter and synthetic data.

Fig. 1 shows the runtime comparisons with linear scaling on the y-axis, and Fig. 2 with logarithmic scaling. The results are consistent with the algorithmic complexity analysis of the methods. The method of [19] shows quadratic scaling with respect to the number of training examples, while the proposed method scales linearly. Fig. 3 shows the running times for the proposed method for up to 50000 training examples, in which case selecting 50 features out of 1000 took a bit less than twelve minutes.

4.2 Quality of selected features

In this section, we explore the quality of the feature selection process. We recall that our greedy RLS algorithm leads to results equivalent to those of the algorithms proposed by [17] and by [19], while being computationally much more efficient.

Figure 1: Running times in CPU seconds for the proposed greedy RLS method and the low-rank updated LS-SVM of [19]. Linear scaling on the y-axis. [Plot omitted: CPU seconds vs. training set size, 500-5000.]

Figure 2: Running times in CPU seconds for the proposed greedy RLS method and the low-rank updated LS-SVM of [19]. Logarithmic scaling on the y-axis. [Plot omitted: CPU seconds vs. training set size, 500-5000.]

Figure 3: Running times in CPU seconds for the proposed greedy RLS method. [Plot omitted: CPU seconds vs. training set size, up to 50000.]

The aforementioned authors have in their work shown that the approach compares favorably to other commonly used feature selection techniques. However, due to computational constraints, the earlier evaluations have been run on small data sets consisting of only tens or hundreds of examples. We extend the evaluation to large scale learning, with the largest of the considered data sets consisting of more than a hundred thousand training examples. Wrapper selection methods do not typically scale to such data set sizes, especially when using cross-validation as a selection criterion. Thus we consider as a baseline an approach which chooses k features at random. This is a good sanity check, since training RLS with this approach requires only O(min{k^2 m, km^2}) time, which is even less than the time required by greedy RLS.

We consider the binary classification setting: the quality of a selected feature set is measured by the classification accuracy of a model trained on these features, evaluated on independent test data. The experiments are run on a number of publicly available benchmark data sets, whose characteristics are summarized in Table 1. The mnist dataset is originally a multi-class digit recognition task; here we consider the binary classification task of recognizing the digit 5.

Table 1: Data sets

data set        #instances   #features
adult           32561        123
australian      683          14
colon-cancer    62           2000
german.numer    1000         24
ijcnn1          141691       22
mnist5          70000        780

All the presented results are average accuracies over a stratified tenfold cross-validation run on the full data sets. On each of the ten cross-validation rounds, before the feature selection experiment is run, we select the value of the regularization parameter as follows. We train the learner on the training folds using the full feature set, and perform a grid search to choose a suitable regularization parameter value based on leave-one-out performance.

Using the chosen regularization parameter value, we begin the incremental feature selection process on the training folds. One at a time, we select the feature whose addition to the model provides the highest LOO accuracy estimate on the training folds. Each time a feature has been selected, the generalization performance of the updated model is evaluated on the test fold. The process is carried on until all the features have been selected.
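Schematically, and only as a rough sketch, this protocol could be wired up as below. It assumes the greedy_rls and loo_predictions_dual sketches from earlier sections are in scope, uses a power-of-two grid for λ (our choice, not stated in the text), and wastefully re-runs the selector for each subset size instead of tracking test accuracy inside a single greedy run.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def loo_accuracy(X, y, lam):
        # classification accuracy of the LOO predictions (dual short-cut from Section 2)
        return np.mean(np.sign(loo_predictions_dual(X, y, lam)) == y)

    def run_protocol(X, y, k, lam_grid=tuple(2.0 ** p for p in range(-10, 11))):
        """X: (n, m) features-by-examples, y: (m,) labels in {-1, +1}."""
        accs = np.zeros(k)                               # mean test accuracy per subset size
        folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for train_idx, test_idx in folds.split(X.T, y):
            X_tr, y_tr = X[:, train_idx], y[train_idx]
            X_te, y_te = X[:, test_idx], y[test_idx]
            lam = max(lam_grid, key=lambda l: loo_accuracy(X_tr, y_tr, l))
            for num in range(1, k + 1):                  # inefficient but simple: rerun per size
                S, w = greedy_rls(X_tr, y_tr, num, lam)
                accs[num - 1] += np.mean(np.sign(w @ X_te[S]) == y_te) / folds.get_n_splits()
        return accs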

In the presented figures we plot the number of selected features against the accuracy achieved on the test fold, averaged over the ten cross-validation rounds. The results are shown in Fig. 4 for the adult, in Fig. 5 for the australian, in Fig. 6 for the colon-cancer, in Fig. 7 for the german.numer, in Fig. 8 for the ijcnn1, and in Fig. 9 for the mnist5 data set.

On all data sets, the proposed approach clearly outperforms the random selection strategy, suggesting that the leave-one-out criterion leads to selecting informative features. In most cases, with only a small subset of features one can achieve as good performance as with all the features.

Figure 4: Performance on the adult data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 5: Performance on the australian data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 6: Performance on the colon-cancer data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 7: Performance on the german.numer data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 8: Performance on the ijcnn1 data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 9: Performance on the mnist5 data set. [Plot omitted: test accuracy vs. number of selected features for greedy RLS, random selection, and a majority voter baseline.]

Figure 10: Test vs. LOO accuracy on the adult data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

4.3 Overfitting in feature selection

One concern with using the LOO estimate for selecting features is that the estimate may overfit to the training data. If one considers a large enough set of features, it is quite likely that some features will by random chance lead to an improved LOO estimate, while not generalizing outside the training set. We explore to what extent and in which settings this is a problem by comparing the LOO and test accuracies in the previously introduced experimental setting. Again, we plot the number of selected features against the accuracy estimates.

In most cases the LOO and test accuracies are close to identical (Figures 10, 11, 14, and 15). However, on the colon-cancer and german.numer data sets (Figures 12 and 13) we see evidence of overfitting, with LOO providing an over-optimistic evaluation of performance. The effect is especially clear on the colon-cancer data set, which has 2000 features and only 62 examples. The results suggest that reliable feature selection can be problematic on small high-dimensional data sets, of the type often considered, for example, in bioinformatics. On larger data sets the LOO estimate is quite reliable.

Figure 11: Test vs. LOO accuracy on the australian data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

Figure 12: Test vs. LOO accuracy on the colon-cancer data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

Figure 13: Test vs. LOO accuracy on the german.numer data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

Figure 14: Test vs. LOO accuracy on the ijcnn1 data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

Figure 15: Test vs. LOO accuracy on the mnist5 data set. [Plot omitted: test and LOO accuracy vs. number of selected features.]

5 Discussion and Future Directions

In this paper, we present greedy RLS, a linear time algorithm for feature selection for RLS. The algorithm uses LOO as a criterion for evaluating the goodness of the feature subsets and greedy forward selection as a search strategy in the set of possible feature sets. The proposed algorithm opens several directions for future research. In this section, we will briefly discuss some of the directions which we consider to be the most promising ones.

Greedy RLS can quite straightforwardly be generalized to use different types of cross-validation criteria, such as n-fold or repeated n-fold. These are motivated by their smaller variance compared to the leave-one-out, and they have also been shown to have better asymptotic convergence properties for feature subset selection for ordinary least-squares [15]. This generalization can be achieved by using the short-cut methods developed by us [27] and by [28].

In this paper, we focus on the greedy forward selection approach, but an analogous method can also be constructed for backward elimination. However, backward elimination would be computationally much more expensive than forward selection, because an RLS predictor has to be trained first with the whole feature set. In the feature selection literature, approaches which combine both forward and backward steps, such as the floating search methods (see e.g. [29,30]), have also been proposed. Recently, [31] considered a modification of the forward selection for least-squares, which performs corrective steps instead of greedily adding a new feature in each iteration. The considered learning algorithm was ordinary least-squares and the feature subset selection criterion was the empirical risk. The modified approach was shown to have approximately the same computational complexity (in number of iterations) but better performance than greedy forward selection or backward elimination. A rigorous theoretical justification was also presented for the modification. While this paper concentrates on the algorithmic implementation of the feature subset selection for RLS, we aim to investigate the performance of this type of modifications in our future work.

The computational short-cuts presented in this paper are possible due to the closed form solution the RLS learning algorithm has. Such short-cuts are also available for variations of RLS such as RankRLS, an RLS based algorithm for learning to rank proposed by us in [32,33]. In our future work, we also plan to design and implement similar feature selection algorithms for RankRLS.

Analogously to the feature selection methods, many approaches have been developed also for so-called reduced set selection, used in the context of kernel-based learning algorithms (see e.g. [10,34] and references therein). For example, incremental approaches for selecting this set for regularized least-squares have been developed by [35]. Closely related to the reduced set methods, we can also mention the methods traditionally used for selecting centers for radial basis function networks, namely, the algorithms based on the orthogonal least-squares method proposed by [36], for which efficient forward selection methods are also known. In the future, we plan to investigate how well approaches similar to our feature selection algorithm could perform on the tasks of reduced set or center selection.


6 Conclusion

We propose greedy regularized least-squares, a novel training algorithm for sparse linear predictors. The predictors learned by the algorithm are equivalent with those obtained by performing a greedy forward feature selection with the leave-one-out (LOO) criterion for regularized least-squares (RLS), also known as the least-squares support vector machine or ridge regression. That is, the algorithm works like a wrapper type of feature selection method which starts from the empty feature set, and on each iteration adds the feature whose addition provides the best LOO performance. Training a predictor with greedy RLS requires O(kmn) time, where k is the number of non-zero entries in the predictor, m is the number of training examples, and n is the original number of features in the data. This is in contrast to the computational complexity O(min{k^3 m^2 n, k^2 m^3 n}) of using the standard wrapper method with the LOO selection criterion in case RLS is used as a black-box method, and the complexity O(km^2 n) of the method proposed by [19], which is the most efficient of the previously proposed speed-ups. Hereby, greedy RLS is computationally more efficient than the previously known feature selection methods for RLS.

We demonstrate experimentally the computational efficiency of greedy RLS compared to the best previously proposed implementation. In addition, we explore the quality of the feature selection process and measure the degree to which the LOO criterion overfits with different sizes of data sets.

We have made freely available a software package called RLScore (available at http://www.tucs.fi/RLScore) out of our previously proposed RLS based machine learning algorithms. An implementation of the greedy RLS algorithm will also be made available as a part of this software.

References

[1] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[2] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.

[3] S. S. Keerthi and S. K. Shevade, “A fast tracking algorithm for generalized LARS/LASSO,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1826–1830, 2007.

[4] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in Proceedings of the Eleventh International Conference on Machine Learning, W. W. Cohen and H. Hirsch, Eds. San Francisco, CA: Morgan Kaufmann Publishers, 1994, pp. 121–129.

[5] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, pp. 55–67, 1970.

[6] T. Poggio and F. Girosi, “Networks for approximation and learning,” Proceedings of the IEEE, vol. 78, no. 9, 1990.

[7] C. Saunders, A. Gammerman, and V. Vovk, “Ridge regression learning algorithm in dual variables,” in Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 515–521.

[8] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

[9] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific Pub. Co., Singapore, 2002.

[10] R. Rifkin, G. Yeo, and T. Poggio, “Regularized least-squares classification,” in Advances in Learning Theory: Methods, Model and Applications, ser. NATO Science Series III: Computer and System Sciences, J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam: IOS Press, 2003, vol. 190, ch. 7, pp. 131–154.

[11] T. Poggio and S. Smale, “The mathematics of learning: Dealing with data,” Notices of the American Mathematical Society (AMS), vol. 50, no. 5, pp. 537–544, 2003.

[12] V. Vapnik, Estimation of Dependences Based on Empirical Data [in Russian]. Moscow, Soviet Union: Nauka, 1979, (English translation: Springer, New York, 1982).

[13] G. Wahba, Spline Models for Observational Data. Philadelphia, USA: Series in Applied Mathematics, Vol. 59, SIAM, 1990.

[14] P. Green and B. Silverman, Nonparametric Regression and Generalized Linear Models, A Roughness Penalty Approach. London, UK: Chapman and Hall, 1994.

[15] J. Shao, “Linear model selection by cross-validation,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.

[16] P. Zhang, “Model selection via multifold cross validation,” The Annals of Statistics, vol. 21, no. 1, pp. 299–313, 1993.

[17] E. K. Tang, P. N. Suganthan, and X. Yao, “Gene selection algorithms for microarray data based on least squares support vector machine,” BMC Bioinformatics, vol. 7, p. 95, 2006.

[18] Z. Ying and K. C. Keong, “Fast leave-one-out evaluation for dynamic gene selection,” Soft Computing, vol. 10, no. 4, pp. 346–350, 2006.

[19] F. Ojeda, J. A. Suykens, and B. D. Moor, “Low rank updated LS-SVM classifiers for fast variable selection,” Neural Networks, vol. 21, no. 2-3, pp. 437–449, 2008, Advances in Neural Networks Research: IJCNN '07.

[20] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley and Sons, Inc., 1982.

[21] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 12, pp. 181–201, 2001.

[22] B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[23] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press, 2004.

[24] R. Rifkin and R. Lippert, “Notes on regularized least squares,” Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2007-025, 2007.

[25] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[26] L. Bottou and C.-J. Lin, “Support vector machine solvers,” in Large-Scale Kernel Machines, ser. Neural Information Processing, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 1–28.

[27] T. Pahikkala, J. Boberg, and T. Salakoski, “Fast n-fold cross-validation for regularized least-squares,” in Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence (SCAI 2006), T. Honkela, T. Raiko, J. Kortela, and H. Valpola, Eds. Espoo, Finland: Otamedia, 2006, pp. 83–90.

[28] S. An, W. Liu, and S. Venkatesh, “Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression,” Pattern Recognition, vol. 40, no. 8, pp. 2154–2162, 2007.

[29] P. Pudil, J. Novovicova, and J. Kittler, “Floating search methods in feature selection,” Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.

[30] J. Li, M. T. Manry, P. L. Narasimha, and C. Yu, “Feature selection using a piecewise linear network,” IEEE Transactions on Neural Networks, vol. 17, no. 5, pp. 1101–1115, 2006.

[31] T. Zhang, “Adaptive forward-backward greedy algorithm for sparse learning with linear models,” in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 1921–1928.

[32] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and T. Salakoski, “Learning to rank with pairwise regularized least-squares,” in SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, T. Joachims, H. Li, T.-Y. Liu, and C. Zhai, Eds., 2007, pp. 27–33.

[33] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and J. Järvinen, “An efficient algorithm for learning to rank from preference graphs,” Machine Learning, vol. 75, no. 1, pp. 129–165, 2009.

[34] A. J. Smola and B. Schölkopf, “Sparse greedy matrix approximation for machine learning,” in Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), P. Langley, Ed. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2000, pp. 911–918.

[35] L. Jiao, L. Bo, and L. Wang, “Fast sparse approximation for least squares support vector machine,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 685–697, 2007.

[36] S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, 1991.
