
Screening oil spills by mid-IR spectroscopy and supervised pattern recognition techniques


Chemometrics and Intelligent Laboratory Systems 114 (2012) 132–142

Contents lists available at SciVerse ScienceDirect

Chemometrics and Intelligent Laboratory Systems

journal homepage: www.elsevier.com/locate/chemolab


M.P. Gómez-Carracedo a, R. Fernández-Varela a, D. Ballabio b, J.M. Andrade a,⁎
a Department of Analytical Chemistry, University of A Coruña, Campus da Zapateira s/n, E-15008, A Coruña, Spain
b Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, P.za della Scienza, 1-20126 Milano, Italy

⁎ Corresponding author. Fax: +34 981167065.
E-mail address: [email protected] (J.M. Andrade).

0169-7439/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2012.03.013

Article info

Article history:
Received 19 December 2011
Received in revised form 15 March 2012
Accepted 24 March 2012
Available online 29 March 2012

Keywords:
Oil spill fingerprint
IR spectroscopy
Partial Least Squares Discriminant Analysis
Kernel-PLS
Counterpropagation Artificial Neural Networks
Support Vector Machines

Abstract

Supervised pattern recognition methods have scarcely been applied to assess the origin of hydrocarbon lumps that arrive at the coastline. In this work, eight supervised multivariate methods based on quite different principles (Discriminant Analysis, Principal Components Analysis combined with Discriminant Analysis, Soft Independent Modelling of Class Analogy, K-Nearest Neighbours, Partial Least Squares Discriminant Analysis (PLS-DA), kernel-PLS (radial basis functions-PLS), Counterpropagation Artificial Neural Networks (CPANN) and Support Vector Machines with linear, radial basis function and polynomial kernels) and a ‘consensus’ approach were used to discriminate between the aliquots of six oil spillages monitored over time by mid-IR spectroscopy. Further, a set of 45 unknowns collected on Galician beaches after a major shipwreck was analyzed by both the IR-chemometric method and an international oil fingerprinting standard protocol (the European guideline CEN/TR 15522-2) to set their ‘true’ assignments. Classification of the controlled spillages yielded almost 100% successful classification ratios (precision, sensitivity and specificity), whereas less than 5% false positives and false negatives were obtained when the 45 samples were classified. SVM with polynomial kernels had only 1 misclassification and outperformed the other approaches, including the ‘consensus’ approach. CPANN, radial basis functions-PLS and the consensus approach were the second best models, with 93.3% agreement with the standard protocol. On the other hand, linear PLS-DA yielded the worst classification model.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Maritime navigation routes for tankers are of huge environmental concern because, besides occasional ship accidents, repeated oil spillages occur in their surroundings due to vessels' ballast release, cleanup of the cargo deposits and so forth. A number of physical, chemical and biochemical processes occur after oil reaches the sea. They are collectively known as ‘oil weathering’ and considerably change the physical properties and the chemical composition of the original oil, although the particular details depend on the nature of the oil and the climatological conditions during and after the spill (namely temperature, pH, wave movements, salinity, sun incidence, etc.) [1]. Thus, heavy crude oils and the heaviest distillates (mainly fuel oils) weather at a much slower pace than lighter crude oils or light–medium distillates.

Detailed characterization and understanding of oil weathering at a molecular level constitutes the basis of common tiered approaches for forensic oil spill identification [2]. Most of them rely on chromatographic analyses, a paradigmatic example being the EU guide [3]. Nevertheless, those methods are costly and time-consuming, require highly skilled staff and, what is worse, can lead to ‘no conclusive’ matches between the oil spill and one or several suspected sources. Hence, in this work a cheap and fast analytical methodology for screening the origin of oil is proposed, based on mid-IR spectral measurements combined with a suite of supervised chemometric techniques. It is expected that such a ‘hybridization’ yields fast analytical methods that can be deployed on-site by current staff and allow fast decision-making.

Most approaches to this problem considered only unsupervised pattern recognition methods [4–7]. Therefore, strictly speaking, no classifications could be made, only projections of the samples in the models. This may be a problem in liability circumstances because no probability values can be given to an assignment. The use of supervised methods in this field has been limited, likely because of the difficulty of obtaining a collection of training samples. Some interesting reports are summarized next. Density functions of IR data [4] and Soft Independent Modelling of Class Analogy (SIMCA) of chromatographic data [8] were used to measure the distance between different classes of samples and to evaluate how close they were to the source oils. The use of SIMCA and PLS-DA (Partial Least Squares Discriminant Analysis) was also compared [9]. Correlation analysis [10] and Genetic Algorithms (GA) [11] were employed to identify the source of an oil spill. Parallel Factor Analysis (PARAFAC2) was used [12] to classify oils using GC–MS datasets. A novel multivariate method based on Principal Components Analysis of pre-processed sections of chromatograms was proposed [13] to characterize the complex PAH pollution patterns in sediments from Guanabara Bay, Brazil. Finally, different unsupervised (Parallel Factor Analysis, PARAFAC) and supervised methods (Artificial Neural Networks, PARAFAC coupled to Linear Discriminant Analysis, Discriminant Unfolded Partial Least Squares, Discriminant Multidimensional Partial Least Squares and Discriminant Multidimensional Partial Least Squares with residual bilinearization) were considered [14] to classify oil spill sources.

The aim of this paper was twofold. First, to assess whether the combination of a fast analytical tool (mid-IR spectrometry) and a supervised classification method would allow a satisfactory identification of the origin of an oil spill. Second, to take into account the very complex evolution of the oil samples and the intrinsic difficulties in modeling them, parametric and non-parametric classification methods were considered to ascertain whether some of them would yield better classification models. Hence, eight supervised multivariate methods based on quite different principles were studied. Four were parametric techniques: Discriminant Analysis, DA (Linear and Quadratic Discriminant Analysis, LDA and QDA), Principal Components Analysis combined with Discriminant Analysis (PCA-DA), SIMCA and Partial Least Squares Discriminant Analysis (PLS-DA). Three were non-parametric models: K-Nearest Neighbours (KNN), kernel-PLS (in particular, radial basis functions-PLS) and Support Vector Machines (SVM). Finally, one technique was based on natural computation: Counterpropagation Artificial Neural Networks (CPANN). In total, 102 weathered samples were considered to develop the models, and a set of 45 unknowns collected on Galician beaches after a major shipwreck was classified.

2. Experimental

2.1. IR measurement and IR band ratios

A 16PC Perkin-Elmer mid-IR spectrometer (Ge-KBr beamsplitter, DTGS detector, 4 cm−1 nominal resolution, Beer–Norton apodization) with a horizontal, fixed-path ATR device (ZnSe, trapezoidal, 45°, 12 reflections) was used throughout (50 scans, 4000–600 cm−1 measuring range; a function was applied to correct for the wavelength-dependent penetration depth and spectra were baseline corrected). Weekly and monthly quality assurance tests were carried out to verify the S/N ratio, wavenumber accuracy, laser characteristics and transmittance accuracy. Special attention was paid to cleaning the ZnSe ATR plate [5].

In order to avoid unstructured and noisy information, Pieri et al. [15] and Permanyer et al. [16] defined ten indexes ratioing different mid-IR peak areas against the total area under the spectral peaks or against sums of several areas measured from valley to valley. The wavenumbers that form the quotients were intended to visualize how samples evolved over time and to interpret them chemically in a straightforward manner. They were the aromaticity (as a measure of the total amount of C=C bonds, $A_{1600}/A_{tot}$), aliphaticity ($A_{1450+1376}/A_{tot}$), branched chains ($A_{1376}/A_{1450+1376}$), long chains ($A_{724}/A_{1450+1376}$), aromatic condensation ($A_{864+814+743}/A_{tot}$), carbonyl ($A_{1700}/A_{tot}$), sulphoxide ($A_{1030}/A_{tot}$), substitution 1 (as a measure of the number of isolated CH groups in multisubstituted rings, $A_{864}/A_{864+814+743}$), substitution 2 (as a measure of the number of 2 or 3 adjacent CH groups in substituted rings, a good indication of tri-substituted aromatic rings, $A_{814}/A_{864+814+743}$) and substitution 3 (as a measure of the number of 3 or 4 adjacent CH groups in substituted rings, a good indication of di- and mono-substituted rings, $A_{743}/A_{864+814+743}$) indexes.
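The ten band ratios listed above can be sketched in a few lines of Python. This is only an illustration: the function name `band_indexes`, the dictionary layout and the numeric areas are hypothetical, and the sketch assumes that a subscript such as 1450+1376 denotes the sum of the two valley-to-valley band areas.

```python
def band_indexes(areas, total_area):
    """Compute the ten IR indexes from band areas keyed by band centre (cm^-1).

    Assumption: combined subscripts (e.g. 1450+1376) are sums of band areas.
    """
    aliphatic = areas[1450] + areas[1376]
    aromatic_ch = areas[864] + areas[814] + areas[743]
    return {
        "aromaticity": areas[1600] / total_area,
        "aliphaticity": aliphatic / total_area,
        "branched_chains": areas[1376] / aliphatic,
        "long_chains": areas[724] / aliphatic,
        "aromatic_condensation": aromatic_ch / total_area,
        "carbonyl": areas[1700] / total_area,
        "sulphoxide": areas[1030] / total_area,
        "substitution_1": areas[864] / aromatic_ch,
        "substitution_2": areas[814] / aromatic_ch,
        "substitution_3": areas[743] / aromatic_ch,
    }


# Made-up band areas (arbitrary units), for illustration only.
example_areas = {1700: 0.5, 1600: 2.0, 1450: 30.0, 1376: 10.0, 1030: 1.0,
                 864: 2.0, 814: 1.5, 743: 0.5, 724: 1.2}
indexes = band_indexes(example_areas, total_area=200.0)
```

By construction the three substitution indexes share the same denominator and therefore add up to 1 for any sample.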

2.2. Samples

Two main types of samples were considered throughout: (i) controlled spillages of six oils monitored over time, and (ii) samples obtained from oil slicks beached at the shoreline, which will be referred to as ‘beaches’.

Four crude oils (Maya, Ashtart, Brent and Sahara Blend), a ‘Marine Fuel Oil’ (IFO for short), which was quite similar to the Prestige's oil, and the original fuel oil from the Prestige tanker (sunk off the Galician coast, NW Spain, in November 2002) were studied. The Maya oil was heavy, Ashtart was intermediate, and Brent and Sahara Blend were very light crude oils. The Prestige and IFO were heavy residues of crude oil distillation processes. Controlled spillages were made by pouring around 500 mL of oil into metallic containers filled with sea water and weathered under atmospheric conditions. In total, 17 aliquots of the oil lumps were taken at different time intervals [17,18].

In addition, 45 samples were taken at several beaches located along the province of A Coruña in different sampling seasons (2002–2005) to ascertain whether they came from the Prestige wreck (see Fig. 1 for more details). Samples from beaches were collected systematically. In 2002 and 2003, composite samples were taken directly from oil lumps that had just arrived at the beaches (i.e., several aliquots were taken at different locations of the oil lump and mixed to yield the final sample). Note that in this case study many oil lumps that were drifting at sea beached at the coastline several months after the shipwreck. Small oil lumps (less than ca. 1 m²) were sampled at their thickest part or, if they were small enough (tar balls, ca. 5–15 cm ‘diameter’), collected totally. During 2004 and 2005, samples were obtained from areas where the fuel accumulated, typically ponds and protected areas in rocky sites at the beaches (most of them located in the intertidal area). In those cases, as much fuel as possible was sampled from either the surface or the inner regions of the ponds to get a composite sample. Whenever possible, the ‘same’ location was explored in different years (e.g. Langosteira 1, 2 and 3). None of the sampled areas had been cleaned chemically or mechanically, which limited the number of sites available. Debris and sand were avoided as far as possible. Samples were homogenized mechanically and kept at 4 °C until analysis. Aliquots of 10–15 mL were transferred to Pyrex centrifuge tubes, where ca. 10 mL of dichloromethane and 1 g of anhydrous sodium sulfate were added. The mixture was placed in a thermostatic bath at 50–60 °C (±2 °C) until the two phases separated. The organic phase was transferred to another tube, sodium sulfate was added again, and the dichloromethane was evaporated gently in a thermostatic bath.

2.3. Supervised pattern recognition techniques

Discriminant Analysis (DA) is a standard classification method based on determining multivariate linear discriminant functions that maximize the ratio of between-class variance to within-class variance [19]. The factors obtained in DA are termed discriminant functions or canonical variates. Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA) is used depending on whether the class separation is linear or non-linear and on the reliability of the class covariance matrices. QDA requires the number of variables not to exceed the number of objects in each class. If this condition is not met, an alternative consists of applying DA to the scores obtained from Principal Components Analysis (PCA).

The discrimination power of the variables was also tested by means of Wilks' lambda [20], which is defined as:

$$\Lambda = \frac{|W|}{|W + B|}$$

where W is the within-class sum of squares and cross-products matrix, which accounts for the average within-class variability, and B is the between-class sum of squares and cross-products matrix, which accounts for the average between-class variability. Wilks' lambda is related to the likelihood ratio criterion and ranges between 0 and 1, where values close to 0 indicate that the class means are different. Variables with the lowest Wilks' lambda values can be retained in the classification model as optimal variables for separating the considered classes. In order to decide which variables will be used to calculate the LDA model, two approaches are currently used. The forward variable selection technique starts with no variables and adds one variable at a time to the model; the inclusion of a variable is based on the error rate calculated by leave-one-out cross-validation (ERLOO), i.e. a variable is entered into the model if ERLOO is minimized [21]. The backward approach starts with all variables in the model and tests the statistical significance of each variable, deleting those that are not significant at the critical level. Here, both approaches yielded the same results.

Fig. 1. Location of the samples taken on beaches along the province of A Coruña (NW Spain).
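As an illustration of the Wilks' lambda criterion defined in this section, the following standard-library sketch builds W and B from a small labelled data set and evaluates |W| / |W + B|. The function names and the toy data are hypothetical; this is not the software used in the study.

```python
def scatter_matrices(X, y):
    """Return (W, B): within- and between-class sums of squares and cross-products."""
    n, p = len(X), len(X[0])
    grand = [sum(row[j] for row in X) / n for j in range(p)]
    W = [[0.0] * p for _ in range(p)]
    B = [[0.0] * p for _ in range(p)]
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for r in rows:
            d = [r[j] - mean[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    W[i][j] += d[i] * d[j]
        m = [mean[j] - grand[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                B[i][j] += len(rows) * m[i] * m[j]
    return W, B


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[piv][c] == 0.0:
            return 0.0
        if piv != c:
            A[c], A[piv] = A[piv], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d


def wilks_lambda(X, y):
    """Wilks' lambda = |W| / |W + B| for the variables (columns) in X."""
    W, B = scatter_matrices(X, y)
    T = [[W[i][j] + B[i][j] for j in range(len(W))] for i in range(len(W))]
    return det(W) / det(T)


# Toy data: variable 0 separates the two classes, variable 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
y = ["A", "A", "A", "B", "B", "B"]
per_variable = [wilks_lambda([[row[j]] for row in X], y) for j in range(2)]
```

In this toy case the first variable gives a lambda close to 0 (well-separated class means) and the second gives exactly 1 (identical class means), which is the ranking a selection step would exploit.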

SIMCA is probably the most widely used class-modeling method [9]. It is a supervised classification method based on disjoint PCA models obtained for each class in the training set [22]. Unknown samples are then compared to the class models and assigned to classes according to their analogy with the training samples. A new sample will be recognized as a member of a class if it is similar enough to the other members (according to a sort of ‘confidence boundary’); otherwise it will be rejected.

Partial least squares regression (PLS) is a supervised multivariate latent-variable method that relates one ‘dependent’ variable, or predictand (y), to a set of ‘independent’ variables, or predictors, X. Partial Least Squares Discriminant Analysis (PLS-DA) is the application of PLS to classification problems where y is a vector that codifies the class of each sample. The class of an unknown sample is assigned based on the y value predicted by the PLS-DA model [23]. Ideally, the predicted y should be close to the values that codify the class (here either 0 or 1). In practice, it is a real number, and different approaches [24–31] have been used to translate the predicted y into a class label.
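One simple way of turning real-valued PLS-DA predictions into a class label is sketched below. It is only one of the possible rules, and not necessarily the one adopted in this work: the sample goes to the class whose predicted y is largest, provided that value reaches a threshold. The function name `decode_plsda` and the 0.5 threshold are illustrative assumptions.

```python
def decode_plsda(y_pred, classes, threshold=0.5):
    """Map predicted 0/1-coded responses (one per class) to a class label.

    Returns None when no class reaches the threshold (sample left unassigned).
    """
    best = max(range(len(classes)), key=lambda g: y_pred[g])
    return classes[best] if y_pred[best] >= threshold else None


label = decode_plsda([0.10, 0.80, 0.05], ["Maya", "IFO", "Brent"])
```

Leaving a sample unassigned when every predicted value is low avoids forcing a label on an unknown that resembles none of the modelled classes.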

Further, a non-linear and non-parametric kernel method based on PLS and radial basis functions (RBF), proposed by Walczak and Massart [32–34], was applied to cope with possible non-linearities that are not simple to address with the linear PLS-DA model. RBF-PLS was also presented as a good alternative to Support Vector Machines (SVM) [35], which are presented below. The conceptual idea of RBF-PLS can be summarized as the replacement of the traditional X matrix of predictor variables by a square matrix (with dimensions equal to the number of samples in the training set), called the activation matrix, which contains the radial basis functions. Each RBF is initially centered at the location of one sample of the training set, and its width must be optimized along with the number of factors. Optimization of the models relies on the modeling and generalization capabilities of the RBF (a form of neural net); thus, the model is non-parametric, although its initialization is not stochastic and, consequently, randomness is avoided and a unique solution is obtained. Further technical details can be found in the literature [32,33].
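The activation matrix described above can be sketched as follows. The Gaussian form exp(−d²/(2·width²)) and the use of a single common width are assumptions made for illustration; the exact parameterization used by the cited RBF-PLS implementation may differ.

```python
import math


def activation_matrix(X_train, width):
    """Square matrix whose (i, j) entry is a Gaussian RBF centred on training
    sample j and evaluated at training sample i (assumed Gaussian form)."""
    def rbf(u, v):
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean distance
        return math.exp(-d2 / (2.0 * width ** 2))
    return [[rbf(xi, cj) for cj in X_train] for xi in X_train]


# Three toy training samples give a 3 x 3 activation matrix.
A = activation_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], width=1.0)
```

The resulting matrix is symmetric with ones on the diagonal; it would then take the place of X in an ordinary PLS step, with the width tuned together with the number of factors.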

The K-Nearest Neighbour classification method (KNN) is a non-parametric pattern recognition method that classifies a sample in the category to which the majority of its ‘K’ nearest neighbouring samples belong. It follows that KNN categorizes an unknown sample according to its proximity to (known) samples. A definition of ‘closeness’ is, therefore, required. Different options exist (see, e.g. [36]). The city-block distance was selected here after some preliminary assays. Note that KNN does not explicitly form a separate model from the training data set [37]. K is usually an odd number to ensure that a majority vote is obtained locally. The determination of the optimal value of K is the most important part of this process. We selected it after an optimization step in which cross-validation procedures were used to look for the K value that minimized the classification error of both the training and validation sets.
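A minimal sketch of KNN with the city-block distance, together with a leave-one-out search for K, is given below. It only illustrates the idea; the function names and the toy data are hypothetical, and the actual work relied on Matlab routines.

```python
from collections import Counter


def cityblock(u, v):
    """City-block (Manhattan) distance between two samples."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_predict(X_train, y_train, x, k):
    """Assign x to the majority class among its k nearest training samples."""
    order = sorted(range(len(X_train)), key=lambda i: cityblock(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error_rate(X, y, k):
    """Leave-one-out error rate: each sample is predicted from all the others."""
    errors = 0
    for i in range(len(X)):
        X_rest, y_rest = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if knn_predict(X_rest, y_rest, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


# Toy data with two well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["A", "A", "A", "B", "B", "B"]
best_k = min([1, 3, 5], key=lambda k: loo_error_rate(X, y, k))
```

With only three samples per class, K = 5 fails for every left-out sample because the opposite class always holds the majority, which shows why K has to be tuned against the class sizes.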

Support Vector Machines (SVM) are powerful supervised pattern recognition tools that model two-class problems with the aim of classifying future unknowns. The key idea of SVM is to represent the original (usually inseparable) classes of samples in a higher-dimensional space (the so-called ‘feature space’). The term ‘higher’ is relative to the original dimensionality of the problem, i.e. the number of variables that will be used to classify the samples. They can be either the measured analytical variables or a simplified view of them, like the principal component scores. The space where the samples are represented is extended by one or several extra dimensions so that, hopefully, the extended space will be useful to separate the classes (this is termed non-linear mapping). This can be done mathematically by using kernel functions, among which the most common ones are the linear and the radial basis (RBF) functions [38–40], although polynomial kernels are sometimes also used [38].

The appealing characteristic of SVM is that the a priori complex step of non-linearly mapping the variables to a feature space can be calculated in the original space by the kernel functions once some key parameters are optimized, notably a penalty parameter. This requires not only developing a model but also validating it with a set of samples not used at all to build the model. Overfitting must be avoided, as it has been demonstrated that SVM are prone to overfit the data [38–40].

Nowadays, efforts are being made to apply SVM, which were originally developed for two-class problems, to complex problems with more than two classes, although the two classical approaches, ‘one-vs-all’ and ‘one-vs-one’, are commonly applied [38]. The first was employed here; it consists of developing as many two-class models as there are classes. In each model one class is opposed to all the other ones. Finally, the membership of unknowns to the class(es) is most commonly defined by voting.
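The one-vs-all scheme can be sketched with any two-class scorer. In the hypothetical example below each class has its own linear decision function (weights and bias invented for illustration, not fitted SVM models), and the unknown is assigned to the class whose model returns the largest decision value, a common way of resolving the vote.

```python
def one_vs_all_predict(models, x):
    """models: {class_label: (weights, bias)} of hypothetical two-class linear models.

    Each model scores 'its class vs. all the others'; the highest score wins.
    """
    scores = {label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for label, (w, b) in models.items()}
    return max(scores, key=scores.get), scores


# Invented decision functions for three classes and two variables.
models = {
    "Maya": ([1.0, 0.0], -1.0),
    "Brent": ([0.0, 1.0], -1.0),
    "IFO": ([-1.0, -1.0], 1.0),
}
label, scores = one_vs_all_predict(models, [2.0, 0.5])
```

Only the ‘Maya’ model gives a positive decision value for this sample, so the assignment is unambiguous; when several models claim a sample, comparing decision values is one way to break the tie.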

Finally, Counterpropagation Artificial Neural Networks (CPANN) are natural computation algorithms based on the Kohonen neural network approach, in which similar input objects are linked to topologically close neurons in the network. This means that neurons that are located close to each other ‘react’ similarly to similar inputs. An additional output layer is needed to perform supervised classification. Thus, a new layer of neurons is added to the Kohonen map. Those neurons have as many weights as there are classes of samples, and the neuron of the output layer to be corrected is chosen on the basis of the neuron in the Kohonen layer that is most similar to the input vector. In this work a neuron of the additional layer is assigned to the class for which the highest weight is obtained. More technical details can be found in previous studies [41,42].

2.4. Software

Matlab© 6.5 R13 (The Mathworks Inc., Natick, MA, USA) built-in subroutines and in-house programs were used to perform KNN. CPANN routines were from the Milano Chemometrics and QSAR Research Group [42]. PLS-Toolbox 3.5 was employed to perform SIMCA and PLS-DA. PASW Statistics, v.18 (IBM SPSS, USA) was used to perform Discriminant Analysis. GenEx Pro (MultiD Analyses AB, Sweden, version Datan 5.0.4.4) was used to perform Support Vector Machines. TOMCAT [34] was used to perform RBF-PLS.

3. Results and discussions

The samples from the controlled spillages were divided into training and validation sets by random splitting. Training samples (79 samples, 13 samples per product plus 14 for the Prestige oil) were employed to calculate and cross-validate the supervised models. The remaining 25% of the samples were used as an external validation set, so that they never participated in model development. The class proportions were maintained, that is, the number of validation samples of each class was proportional to the number of training samples of that class. Table 1 presents some univariate statistics associated with each class (oil). Data were autoscaled before all the statistical studies, unless otherwise stated.
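Autoscaling means centring each variable on its mean and dividing by its standard deviation. A minimal sketch is given below; the function names are hypothetical, and the choice of estimating the parameters on the training set only and reusing them for the validation samples is an assumption about good practice rather than a detail reported here.

```python
from statistics import mean, stdev


def fit_autoscaler(X_train):
    """Column means and sample standard deviations of the training set."""
    columns = list(zip(*X_train))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def autoscale(X, means, sds):
    """Centre and scale every row of X with previously fitted parameters."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, sds = fit_autoscaler(X_train)
X_train_scaled = autoscale(X_train, means, sds)
X_valid_scaled = autoscale([[2.5, 15.0]], means, sds)  # reuses training parameters
```

A variable that is constant in the training set has zero standard deviation and would need to be dropped or handled separately before scaling.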

The models were evaluated and compared using several indices, calculated on the training and validation sets of samples [36]: Error Rate (ER), as a measure of the number (percentage) of incorrectly classified samples; Non-Error Rate (NER), as a measure of the number (percentage) of correctly classified objects; sensitivity ($\mathrm{Sen}_g$), which describes the ability of the model to correctly recognize objects belonging to the g-th class; specificity ($\mathrm{Spe}_g$), which characterizes the capability of the g-th class to reject objects of all other classes; and, finally, class precision ($\mathrm{Prec}_g$), which measures the ability of a classification model to not include objects of other classes in the considered class.

3.1. Discriminant Analysis

Preliminary assays were made to select whether the model should be linear (LDA) or quadratic (QDA). The best models were obtained using LDA, so they are presented here. LDA functions were calculated stepwise using the forward and backward approaches. They showed a clear differentiation between the groups in the space of the first two discriminant functions (figure not shown here). The parameters describing the total number of errors and the goodness of the model are shown in Tables 2 and 3, respectively.

The error rate was 1.27% in training (only sample Maya15 was misclassified) and 3.80% in leave-one-out cross-validation (CV-LOO) (Table 2). The worst classification ratios were obtained for the precision of the class ‘IFO’ and the sensitivity of the class ‘Maya’ (Table 3), due to the weathered Maya15 sample being recognized not as Maya but as IFO. This was coherent with the fact that the two groups could not be separated satisfactorily in the LDA subspace. Specificity was 100% except for IFO (98.48%), which means that, in the main, the six classes of the model can reject samples that do not belong to them.
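The training figures quoted above follow directly from the class sizes given earlier (13 samples per product plus 14 for the Prestige oil) and the single Maya sample assigned to IFO. The sketch below rebuilds that confusion matrix and computes the indices as they are described in the text; the function name and data layout are hypothetical, and only the 1.27% error rate and the 98.48% IFO specificity are values reported in the paper.

```python
def class_indices(confusion, labels):
    """confusion[i][j] = number of samples of true class i assigned to class j."""
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[g][g] for g in range(len(labels)))
    out = {"NER": correct / n, "ER": 1 - correct / n}
    for g, label in enumerate(labels):
        tp = confusion[g][g]
        fn = sum(confusion[g]) - tp                 # members of g assigned elsewhere
        fp = sum(row[g] for row in confusion) - tp  # non-members assigned to g
        tn = n - tp - fn - fp
        out[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
        }
    return out


labels = ["Ashtart", "Brent", "Maya", "Sahara", "IFO", "Prestige"]
sizes = [13, 13, 13, 13, 13, 14]
confusion = [[sizes[i] if i == j else 0 for j in range(6)] for i in range(6)]
confusion[2][2] -= 1  # one Maya sample (Maya15) ...
confusion[2][4] += 1  # ... is assigned to IFO
result = class_indices(confusion, labels)
```

Here the error rate is 1/79 and the IFO specificity is 65/66, i.e. 1.27% and 98.48% after rounding, while the Maya sensitivity (12/13) and the IFO precision (13/14) are the two indices lowered by the same misclassification.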

Table 4 presents the standardized coefficients of the discriminant functions (DF). Two indexes (aromaticity and substitution 3) were not considered in the DFs. The first discriminant function (83.7% explained variance) opposed, essentially, indexes related to highly substituted aromatic molecules (substitution 1 and 2) to carbonyl and the relative content of aromatic structures in the spectra (aromatic condensation index). This suggested that the first discriminant function (DF1) opposed the heaviest products to the lightest ones. The latter degraded faster, according to the carbonyl index, which increased quickly in those products (mainly Sahara Blend, Brent and Ashtart [5]) due to their lighter composition, which allowed fast weathering processes to occur.

The second discriminant function (11.2% explained variance) characterized mainly the Brent crude oil, as it had the lowest values of the substitution 1 index (i.e., it contained the least substituted aromatic compounds) and, also, low figures in aromatic condensation (Table 1). The third discriminant function (3.2% explained variance) was related to the substitution 2 and aliphaticity indexes, without a clear explanation. The 4th and 5th discriminant functions (DF4 and DF5) discriminated between heavy and light products. Table 1 revealed lower values of the long chains index for the former ones and, at the same time, higher values of total aliphaticity. All these results agreed with previous studies where Sahara, Brent and Ashtart showed the highest sums of n-alkanes from n-C10 to n-C40 [5].

Table 1. Classical univariate statistics for the oil spills training set (IR spectrometric indexes).

| Class | Statistics | Aromaticity | Aliphaticity | Branched chains | Long chains | Aromatic condensation | Carbonyl | Sulphoxide | Substitution 1 | Substitution 2 | Substitution 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ashtart | Mean | 1.0703 | 0.1497 | 0.1558 | 0.0392 | 0.0146 | 0.0020 | 0.0035 | 0.4547 | 0.3578 | 0.1875 |
| Ashtart | Median | 1.0243 | 0.1527 | 0.1548 | 0.0385 | 0.0155 | 0.0000 | 0.0038 | 0.4604 | 0.3543 | 0.1828 |
| Ashtart | SD | 0.1667 | 0.0056 | 0.0041 | 0.0061 | 0.0020 | 0.0029 | 0.0009 | 0.0169 | 0.0186 | 0.0103 |
| Ashtart | Minimum | 0.8396 | 0.1380 | 0.1512 | 0.0314 | 0.0104 | 0.0000 | 0.0020 | 0.4120 | 0.3268 | 0.1786 |
| Ashtart | Maximum | 1.4799 | 0.1564 | 0.1665 | 0.0479 | 0.0163 | 0.0088 | 0.0047 | 0.4671 | 0.3969 | 0.2170 |
| Brent | Mean | 1.1378 | 0.1490 | 0.1600 | 0.0303 | 0.0111 | 0.0039 | 0.0027 | 0.3661 | 0.3566 | 0.2773 |
| Brent | Median | 1.1160 | 0.1491 | 0.1601 | 0.0311 | 0.0120 | 0.0031 | 0.0026 | 0.3728 | 0.3558 | 0.2754 |
| Brent | SD | 0.2956 | 0.0075 | 0.0028 | 0.0047 | 0.0019 | 0.0041 | 0.0004 | 0.0373 | 0.0365 | 0.0151 |
| Brent | Minimum | 0.8408 | 0.1345 | 0.1558 | 0.0244 | 0.0063 | 0.0000 | 0.0019 | 0.2871 | 0.2554 | 0.2549 |
| Brent | Maximum | 1.7331 | 0.1659 | 0.1669 | 0.0364 | 0.0127 | 0.0115 | 0.0037 | 0.4458 | 0.4043 | 0.3086 |
| Maya | Mean | 0.9457 | 0.1580 | 0.1505 | 0.0285 | 0.0165 | 0.0039 | 0.0061 | 0.4028 | 0.3981 | 0.1991 |
| Maya | Median | 0.9357 | 0.1581 | 0.1496 | 0.0286 | 0.0163 | 0.0030 | 0.0054 | 0.4082 | 0.3974 | 0.1994 |
| Maya | SD | 0.1267 | 0.0040 | 0.0030 | 0.0017 | 0.0016 | 0.0044 | 0.0029 | 0.0112 | 0.0105 | 0.0048 |
| Maya | Minimum | 0.7759 | 0.1514 | 0.1473 | 0.0259 | 0.0138 | 0.0000 | 0.0019 | 0.3784 | 0.3830 | 0.1913 |
| Maya | Maximum | 1.1528 | 0.1640 | 0.1590 | 0.0323 | 0.0185 | 0.0119 | 0.0098 | 0.4178 | 0.4150 | 0.2072 |
| Sahara | Mean | 1.2025 | 0.1442 | 0.1692 | 0.0369 | 0.0083 | 0.0051 | 0.0013 | 0.5094 | 0.3767 | 0.1139 |
| Sahara | Median | 0.8743 | 0.1453 | 0.1687 | 0.0366 | 0.0091 | 0.0028 | 0.0013 | 0.5131 | 0.3690 | 0.1063 |
| Sahara | SD | 0.6516 | 0.0048 | 0.0039 | 0.0060 | 0.0016 | 0.0059 | 0.0001 | 0.0213 | 0.0308 | 0.0417 |
| Sahara | Minimum | 0.7018 | 0.1346 | 0.1653 | 0.0293 | 0.0047 | 0.0009 | 0.0010 | 0.4593 | 0.3111 | 0.0622 |
| Sahara | Maximum | 2.7229 | 0.1505 | 0.1806 | 0.0455 | 0.0099 | 0.0192 | 0.0015 | 0.5362 | 0.4334 | 0.2296 |
| IFO | Mean | 0.8036 | 0.1614 | 0.1483 | 0.0268 | 0.0212 | 0.0021 | 0.0061 | 0.4252 | 0.3608 | 0.2140 |
| IFO | Median | 0.7772 | 0.1623 | 0.1483 | 0.0269 | 0.0216 | 0.0006 | 0.0054 | 0.4290 | 0.3589 | 0.2122 |
| IFO | SD | 0.0654 | 0.0024 | 0.0012 | 0.0009 | 0.0015 | 0.0031 | 0.0024 | 0.0085 | 0.0055 | 0.0047 |
| IFO | Minimum | 0.7530 | 0.1571 | 0.1459 | 0.0251 | 0.0183 | 0.0000 | 0.0029 | 0.4067 | 0.3553 | 0.2088 |
| IFO | Maximum | 0.9560 | 0.1641 | 0.1500 | 0.0286 | 0.0228 | 0.0105 | 0.0105 | 0.4343 | 0.3743 | 0.2246 |
| Prestige | Mean | 0.9256 | 0.1629 | 0.1414 | 0.0252 | 0.0267 | 0.0030 | 0.0076 | 0.4198 | 0.3464 | 0.2339 |
| Prestige | Median | 0.8474 | 0.1650 | 0.1418 | 0.0242 | 0.0273 | 0.0007 | 0.0067 | 0.4228 | 0.3456 | 0.2314 |
| Prestige | SD | 0.1573 | 0.0045 | 0.0022 | 0.0027 | 0.0013 | 0.0045 | 0.0030 | 0.0080 | 0.0033 | 0.0062 |
| Prestige | Minimum | 0.8060 | 0.1512 | 0.1356 | 0.0232 | 0.0239 | 0.0001 | 0.0041 | 0.4056 | 0.3428 | 0.2253 |
| Prestige | Maximum | 1.3461 | 0.1664 | 0.1436 | 0.0334 | 0.0280 | 0.0139 | 0.0130 | 0.4295 | 0.3531 | 0.2468 |

With regard to validation, there were 3 errors when CV-LOO was applied. They corresponded to two samples from the lightest product (Brent0 and Brent12) and an aliquot from the heaviest crude oil (Maya15). They were classified as belonging to the Ashtart, Maya and IFO products, respectively. There was one error in the classification of the validation dataset (Prestige14 was classified as IFO).

3.2. PCA-LDA

LDA was applied on the first 5 PC-score vectors (which explained 95.1% of the variance in the training set). The number of PCs was selected using CV-LOO and studying the overall classification ability of the models for both training and validation. The forward selection approach was employed here because the PCs were independent. The error rates of this model were highly similar to those of LDA (Tables 2 and 3).

Table 2
Overall error rates (ER, given as %) obtained for selected models derived from each supervised technique. Two ERs are given for the training set, using the overall samples (column headed as ‘Training set’) and using cross-validation leave-one-out (whenever its application was possible, column headed as ‘CV-LOO’). SVM-linear denotes that linear kernels were used for Prestige and Sahara classes whereas RBF kernels were employed for the remaining ones, see text for details.

| Method | ER, training set | ER, CV-LOO | ER, validation set |
|---|---|---|---|
| LDA | 1.27 | 3.80 | 4 |
| PCA-LDA | 1.27 | 3.80 | 0 |
| SIMCA | 0 | – | 0 |
| PLS-DA | 15.19 | 18.99 | 12 |
| RBF-PLS | 0 | 2.53 | 0 |
| KNN | 0 | 0 | 0 |
| SVM-linear | 0 | – | 4 |
| SVM-Polynomial | 0 | – | 12 |
| CPANN | 0 | 3.80 (a) | 0 |

(a) Venetian blinds cross-validation.

The chemical interpretation of PCA-LDA models is difficult since they involve PCs which are associated with all IR indexes. Despite this, some comments can be given here because each DF was, roughly, associated with a single PC. Thus, the 1st discriminant function (DF1) was mostly associated with the variables defining PC1 (long chains, aliphaticity and aromatic condensation). Likewise, the 2nd discriminant function (DF2) was related to the carbonyl index; DF3 to the substitution 1 and 3 indexes; DF4 to the substitution 2 index and DF5 to the aromaticity and sulphoxide indexes. Therefore, the major trends discriminating the classes seemed to be highly similar to those of LDA and they will not be repeated here. The main difference with LDA was that no errors occurred in the classification of the validation data set.

3.3. SIMCA

SIMCA required the development of independent, individual PC-based models for each class. Three PCs were selected to model the Ashtart, Brent, IFO and Prestige products; four PCs were required for Maya, whereas only two for the Sahara crude oil. They explained 91%, 97%, 98%, 99%, 92% and 97% of the variances of each class, respectively. The statistics for training (Table 2) were excellent, with 100% precision, sensitivity and specificity for the six oils (Table 3). No errors were found when the models were applied to the validation set.

3.4. PLS

When the linear and parametric PLS-DA methodology was applied, seven latent variables (LV) (which explained 99.04% and 63.69% of the variance in the X-block and Y-block, respectively) were selected (autoscaled data). The classes were codified using strings of 0s and 1s (e.g., 1 0 0 0 0 0 and 0 1 0 0 0 0 for classes ‘one’ and ‘two’, respectively). Cross validation-LOO was used to select the number of latent variables (LV).
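The class coding described above can be illustrated with a minimal Python sketch. The class order follows the numbering used later in Table 5 (1: Ashtart to 6: Prestige). The function names and the decoding rule (assign the class with the largest predicted response) are assumptions made for illustration, not details taken from the PLS-DA software used in the paper.

```python
# Class order as numbered in Table 5 of the paper.
CLASSES = ["Ashtart", "Brent", "Maya", "Sahara", "IFO", "Prestige"]

def encode(label, classes=CLASSES):
    """Return the string of 0s and 1s (as a list) that codes `label`."""
    return [1 if c == label else 0 for c in classes]

def decode(responses, classes=CLASSES):
    """Assumed rule: pick the class whose predicted response is largest."""
    best = max(range(len(classes)), key=lambda i: responses[i])
    return classes[best]

print(encode("Ashtart"))   # class 'one'
print(encode("Brent"))     # class 'two'
print(decode([0.1, 0.2, 0.05, 0.0, 0.3, 0.9]))
```

A real PLS-DA model would produce the six responses by regression on the Y-block built from these codes; only the coding and decoding steps are shown here.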

Table 3
Summary of the quality of the selected models (values given as %). SVM-linear denotes that linear kernels were used for Prestige and Sahara classes whereas RBF kernels were employed for the remaining ones, see text for details. Prec/Sen/Spe refer to precision, sensitivity and specificity, respectively.

| Method | Class | Training (Prec/Sen/Spe) | Validation (Prec/Sen/Spe) |
|---|---|---|---|
| LDA (PCA-LDA) | Ashtart | 100/100/100 | 100/100/100 |
| LDA (PCA-LDA) | Brent | 100/100/100 | 100/100/100 |
| LDA (PCA-LDA) | Maya | 100/92/100 | 100/100/100 |
| LDA (PCA-LDA) | Sahara | 100/100/100 | 100/100/100 |
| LDA (PCA-LDA) | IFO | 93/100/98 | 100/80(100)/100 |
| LDA (PCA-LDA) | Prestige | 100/100/100 | 80(100)/95(100)/100 |
| SIMCA | Ashtart | 100/100/100 | 100/100/100 |
| SIMCA | Brent | 100/100/100 | 100/100/100 |
| SIMCA | Maya | 100/100/100 | 100/100/100 |
| SIMCA | Sahara | 100/100/100 | 100/100/100 |
| SIMCA | IFO | 100/100/100 | 100/100/100 |
| SIMCA | Prestige | 100/100/100 | 100/100/100 |
| KNN | Ashtart | 100/100/100 | 100/100/100 |
| KNN | Brent | 100/100/100 | 100/100/100 |
| KNN | Maya | 100/100/100 | 100/100/100 |
| KNN | Sahara | 100/100/100 | 100/100/100 |
| KNN | IFO | 100/100/100 | 100/100/100 |
| KNN | Prestige | 100/100/100 | 100/100/100 |
| PLS-DA | Ashtart | 100/92/100 | 100/100/100 |
| PLS-DA | Brent | 86/92/97 | 100/100/100 |
| PLS-DA | Maya | 93/100/98 | 100/100/100 |
| PLS-DA | Sahara | 100/92/100 | 100/100/100 |
| PLS-DA | IFO | 100/31/100 | 100/25/100 |
| PLS-DA | Prestige | 61/100/86 | 62/100/85 |
| RBF-PLS | Ashtart | 100/100/100 | 100/100/100 |
| RBF-PLS | Brent | 100/100/100 | 100/100/100 |
| RBF-PLS | Maya | 100/100/100 | 100/100/100 |
| RBF-PLS | Sahara | 100/100/100 | 100/100/100 |
| RBF-PLS | IFO | 100/100/100 | 100/100/100 |
| RBF-PLS | Prestige | 100/100/100 | 100/100/100 |
| CPANN | Ashtart | 100/100/100 | 100/100/100 |
| CPANN | Brent | 100/100/100 | 100/100/100 |
| CPANN | Maya | 100/100/100 | 100/100/100 |
| CPANN | Sahara | 100/100/100 | 100/100/100 |
| CPANN | IFO | 100/100/100 | 100/100/100 |
| CPANN | Prestige | 100/100/100 | 100/100/100 |
| SVM-linear | Ashtart vs all | 100/100/100 | 100/100/100 |
| SVM-linear | Brent vs all | 100/100/100 | 80/100/95 |
| SVM-linear | Maya vs all | 100/100/100 | 100/75/100 |
| SVM-linear | Sahara vs all | 100/100/100 | 100/100/100 |
| SVM-linear | IFO vs all | 100/100/100 | 100/100/100 |
| SVM-linear | Prestige vs all | 100/100/100 | 100/100/100 |
| SVM-Polynomial | Ashtart | 100/100/100 | 100/75/100 |
| SVM-Polynomial | Brent | 100/100/100 | 57/100/86 |
| SVM-Polynomial | Maya | 100/100/100 | 100/50/100 |
| SVM-Polynomial | Sahara | 100/100/100 | 100/100/100 |
| SVM-Polynomial | IFO | 100/100/100 | 100/100/100 |
| SVM-Polynomial | Prestige | 100/100/100 | 100/100/100 |


The model statistics (error rates) were unsatisfactory both in training and validation (Table 2), although precision, sensitivity and specificity were fair (Table 3). Further, the IFO and Prestige groups (which are similar) overlapped so that the majority of the IFO samples were classified as Prestige. Other models were attempted but results were similar, without a clear reason justifying this. As a consequence, the PLS-DA model will not be employed to predict the origin of the samples taken at the beaches.

Table 4
Standardized coefficients of the discriminant functions using LDA.

| Mid-IR index | DF1 (83.7%) | DF2 (11.2%) | DF3 (3.2%) | DF4 (1.9%) | DF5 (0%) |
|---|---|---|---|---|---|
| Aliphaticity | 0.129 | −0.477 | 1.054 | 0.165 | 1.090 |
| Branched chains | 0.975 | 0.624 | 0.000 | −0.140 | −0.008 |
| Long chains | 0.206 | −0.031 | −0.212 | 1.373 | 0.162 |
| Aromatic condensation | −1.126 | 1.058 | −0.524 | −0.116 | −0.383 |
| Carbonyl | −1.158 | 0.323 | −0.688 | −1.838 | 0.206 |
| Sulphoxide | 0.582 | 0.524 | 0.744 | 0.761 | 0.029 |
| Substitution 1 | 1.074 | 1.041 | 0.096 | −0.165 | 0.084 |
| Substitution 2 | 1.298 | 0.829 | 1.257 | 0.274 | −0.530 |

The non-linear and non-parametric RBF-PLS models yielded much better results than linear PLS-DA (Tables 2 and 3). Each class was modeled separately (using the same strategy as for SVM), thus yielding six models, one per class; samples were finally assigned to the class with the highest predicted response. The widths of the radial functions were 0.3, 0.7, 0.3, 0.9, 0.2, and 0.4 for classes 1 to 6, respectively, and in all models the optimal number of factors was 10. The original IR indexes were scaled to the 0–1 range before modeling. Model training and validation were excellent, showing an obvious difference with the linear and parametric PLS-DA option, likely reflecting the existence of non-linear relationships between the classes and the IR indexes.
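The three ingredients named in this paragraph (scaling to the 0–1 range, radial basis functions with a per-class width, and assignment to the class with the highest predicted response) can be sketched in Python as follows. The Gaussian form of the radial function and all function names are assumptions for illustration; the exact kernel definition used by the authors' software is not given in the text. The PLS step itself is not implemented.

```python
import math

# Radial function widths reported in the text for classes 1 to 6.
WIDTHS = [0.3, 0.7, 0.3, 0.9, 0.2, 0.4]

def range_scale(column):
    """Scale one variable (a list of values) to the 0-1 range."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def gaussian_rbf(x, centre, width):
    """Gaussian radial basis function (assumed form, not from the paper)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-d2 / (2.0 * width ** 2))

def assign(responses):
    """1-based index of the class model with the highest predicted response."""
    return max(range(len(responses)), key=lambda i: responses[i]) + 1

# Mean aromaticity of IFO, Prestige and Sahara (Table 1), used only as toy input.
scaled = range_scale([0.8036, 0.9256, 1.2025])
print(scaled)
print(gaussian_rbf([0.2, 0.4], [0.3, 0.4], WIDTHS[0]))
print(assign([0.10, 0.05, 0.00, 0.20, 0.10, 0.90]))
```

In an RBF-PLS model, each sample would be replaced by its vector of radial function values with respect to the training samples before the PLS regression is run.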

3.5. KNN

To develop a non-parametric KNN model, the distance metric and the number of neighbours were optimized. Variable autoscaling and the cityblock distance were selected after several trials. CV-LOO showed that the models with 4 neighbours predicted the validation set best, with the fewest errors in training and validation.

The model statistics (Table 2) were excellent both in training and validation, with 100% precision, sensitivity and specificity for each of the six classes of oils (Table 3).
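A minimal sketch of the KNN settings reported above (autoscaling, cityblock distance, 4 neighbours, leave-one-out cross-validation) is given below in plain Python. The toy data, class labels 'A' and 'B', function names and the tie-breaking rule are hypothetical. For brevity, autoscaling is done once on the whole toy set rather than inside each cross-validation fold.

```python
from collections import Counter
from statistics import mean, stdev

def autoscale(rows):
    """Centre each variable to zero mean and scale it to unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def cityblock(a, b):
    """Cityblock (Manhattan) distance between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=4):
    """Majority vote among the k nearest training samples.
    Ties go to the tied class met first in distance order (an assumption)."""
    order = sorted(range(len(train_x)), key=lambda i: cityblock(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_error_rate(x, y, k=4):
    """Leave-one-out error rate, in %."""
    errors = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k) != y[i]:
            errors += 1
    return 100.0 * errors / len(x)

# Hypothetical two-variable toy data, two well-separated classes.
data = [[1.00, 0.15], [1.10, 0.16], [0.90, 0.14], [1.05, 0.15], [0.95, 0.16],
        [2.00, 0.30], [2.10, 0.31], [1.90, 0.29], [2.05, 0.30], [1.95, 0.31]]
labels = ["A"] * 5 + ["B"] * 5
scaled = autoscale(data)
print(loo_error_rate(scaled, labels, k=4))
```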

3.6. SVM

To model the six classes of hydrocarbons, the ‘one-vs-all’ approach [38] involved the development of six models, each opposing a hydrocarbon product against all other samples. Both the validation and unknown samples are then projected onto all the models and their assignation is decided by majority vote; in case of tied scoring no assignation is given. The special characteristics of the Prestige and Sahara fuels (they occupied rather ‘separated’ positions, Fig. 2) suggested that SVM models with linear kernels would be enough to model them (Fig. 2(a) and (b)) whereas RBF (Radial Basis Function) kernels were employed for the other groups (Fig. 2(c) and (d) show two examples). It is worth noting that SVM models were developed without samples Brent0 and Maya0 in the training set to avoid overlapping between both classes. Linear kernel functions were optimized varying the penalty parameter (from 100 to 0.1) whereas the RBF ones required a sequential optimization of the penalty and sigma parameters (100 > C > 0.1 and 5 > σ > 0.25 standard deviations of the overall dataset). The parameter selected for the linear SVM models was C = 1, whereas the RBF-SVM ones required C = 1 and σ = 1 (except for the Maya model, with C = 0.5). The results for the training stage were excellent, 100% success, as were those for validation, since only one sample (Maya3) was wrongly classified as Brent (Tables 2 and 3).
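The assignment rule of the ‘one-vs-all’ scheme can be sketched as below. This is one possible reading of the text, stated here as an assumption: each of the six binary models either claims a sample or rejects it, a sample claimed by exactly one model is assigned to that class, and any other outcome is treated as a tie and left not classified (NC). The function name and input format are hypothetical.

```python
def one_vs_all_assign(claims):
    """claims maps a class name to True when that class's binary model
    accepts the sample. Exactly one acceptance gives an assignment;
    zero or several acceptances are treated as a tie (assumed rule)."""
    winners = [name for name, accepted in claims.items() if accepted]
    return winners[0] if len(winners) == 1 else "NC"

clear_case = {"Ashtart": False, "Brent": False, "Maya": False,
              "Sahara": False, "IFO": False, "Prestige": True}
tied_case = {"Ashtart": False, "Brent": True, "Maya": True,
             "Sahara": False, "IFO": False, "Prestige": False}

print(one_vs_all_assign(clear_case))
print(one_vs_all_assign(tied_case))
```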

Development of a single SVM model to take account of all classes simultaneously was attempted using RBF kernels for all groups, although results were unsatisfactory because one of the classes surrounded all the other ones, which is not a desirable situation [38]. Hence, an SVM model using polynomial kernels (second degree; the C penalty parameter was tested in the 100–0.1 range; shift = 1) was set. The model with C = 0.5 yielded good results (see Tables 2 and 3 and Fig. 3). Although a 12% error rate was observed in Table 2, this corresponded to only 3 samples (out of 25) that were wrongly included in the Brent class. Clearly, the precision for the Brent class (i.e., its capability to reject samples from other classes) is poor, which can be attributed to its difficult position in the PCA subspace (Fig. 3).
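The link between the 3 misplaced samples and the figures in Tables 2 and 3 can be checked with a short Python sketch using the usual definitions of precision, sensitivity and specificity. The count of 4 true Brent samples in the 25-sample validation set is an assumption inferred from the reported percentages; it is not stated in the text. Function names are illustrative.

```python
def class_metrics(tp, fp, fn, tn):
    """Precision, sensitivity and specificity (in %) for one class."""
    precision = 100.0 * tp / (tp + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return precision, sensitivity, specificity

def error_rate(n_wrong, n_total):
    """Overall error rate, in %."""
    return 100.0 * n_wrong / n_total

# Brent class, polynomial SVM, validation set (25 samples):
# 4 Brent samples correctly recognized (assumed count), 3 samples from
# other classes wrongly placed in Brent, 18 other samples correctly rejected.
brent = class_metrics(tp=4, fp=3, fn=0, tn=18)
print([round(v) for v in brent])
print(error_rate(3, 25))
```

Under this assumption the rounded values reproduce the 57/100/86 entry for Brent in Table 3 and the 12% validation error rate in Table 2.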

3.7. CPANN

Fig. 2. Linear SVM models for the Prestige (in cyan) (a) and Sahara (in green) (b) classes (‘one-vs-all’, C = 1). RBF SVM models for the Ashtart (in black) (c) and IFO (in dark blue) (d) classes (C = 1, σ = 1).

The CPANN was optimized varying the topology from 7×7 to 9×9 neurons and the number of training epochs from 100 to 500. The solution giving the best cross-validation results (training dataset divided into 6 blocks of samples) involved a square Kohonen map with 8×8 neurons trained for 200 epochs. A quite good separation among the six groups of samples was observed in the top map displayed in Fig. 4 (recall the toroidal topology of the Kohonen map).
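Two details of this setup, the division of the training samples into 6 cross-validation blocks and the toroidal topology of the 8×8 map, can be sketched in Python. The interleaved (Venetian blinds) assignment of samples to blocks matches the footnote of Table 2; the use of a square-ring (Chebyshev) neighbourhood distance and the function names are assumptions for illustration.

```python
def venetian_blinds(n_samples, n_blocks=6):
    """Interleaved split: sample i goes to block i mod n_blocks."""
    return [[i for i in range(n_samples) if i % n_blocks == b]
            for b in range(n_blocks)]

def toroidal_distance(a, b, size=8):
    """Grid distance between neurons a and b (row, column) on a map whose
    opposite edges are joined; square neighbourhood rings are assumed."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return max(min(dr, size - dr), min(dc, size - dc))

blocks = venetian_blinds(12)
print(blocks)
print(toroidal_distance((0, 0), (7, 7)))   # opposite corners are neighbours
print(toroidal_distance((0, 0), (4, 4)))   # farthest neuron on an 8x8 torus
```

The first distance shows why clusters that appear split across the borders of the top map may in fact be contiguous.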

Fig. 3. SVM model considering the six classes altogether (polynomial kernel, second degree). Prestige = cyan, IFO = dark blue, Maya = red, Brent = gray, Ashtart = black, Sahara = green and samples from beaches = white triangles.

The model statistics (Table 2) were excellent, with 100% precision, sensitivity and specificity for each of the six oils (Table 3). The most influential variables were assessed by inspecting the layers of weights. Fig. 5a shows the Kohonen weights associated with the aliphaticity layer, with light products (Ashtart, Brent and Sahara Blend) linked to lower weights. Two reasons explained this: the polymerization processes undergone during weathering, which may be faster and easier in light products, and the fact that smaller and more volatile structures (e.g., cycloalkanes, naphthalene, toluene, mercaptans) evaporate easily from light oils. Fig. 5b and c (associated with branched chains and aromatic condensation) opposed the heaviest to the lightest products. The former involved more molecules including aromatic rings and fewer aliphatic chains (from which the branched chains index was calculated). A similar situation occurred for the sulphoxide index, Fig. 5d, which clearly pointed out the largest contents of sulfur and related compounds in the heaviest crude oil (Maya) and distillates (IFO and Prestige). The lowest weights associated with the 8th and 10th levels (substitution 1 and 3, respectively; Fig. 5e and f) discriminated the two lightest products, Brent and Sahara, respectively. This seemed logical as they contained low total amounts of highly substituted/condensed aromatic structures (Table 1).

There were 3 errors in CV (Venetian blinds) as Ashtart0, Brent0 and Sahara0 were classified as Sahara, Sahara and Ashtart, respectively. No errors occurred for the validation set.

Fig. 4. Kohonen top map for the CPANN model derived for the oil spills dataset. Codes for neurons: 1 = yellow = Ashtart, 2 = green = Brent, 3 = blue = Maya, 4 = cyan = Sahara, 5 = magenta = IFO, 6 = gray = Prestige.

Fig. 5. Graphical display of the weights of the SOM of the CPANN. The darker the color is, the higher the weights are. (a) 1st level (aromaticity index), (b) 2nd level (aliphaticity index), (c) 3rd level (branched chains index), (d) 5th level (aromatic condensation index), (e) 6th level (carbonyl index), (f) 7th level (sulphoxide index), (g) 8th level (substitution 1), (h) 10th level (substitution 3). Numbers have the same meaning as in Fig. 4.

3.8. Classification of samples from Galician beaches

Table 5 shows the prediction of the samples collected at Galician beaches using the different supervised pattern recognition techniques.

In this case study the major interest was to assess whether the samples were classified as Prestige. The main reason was that, while the shipwreck was leaking fuel, it was observed that some other carriers navigating through the ‘International Corridor for Hazardous Goods’ (NW Spain, off the Galician coast) unduly released ballast and cleaned bilges. Therefore, although it was expected that the majority of the samples would come from the Prestige vessel, some others would not, as other authors have also reported [43].

To validate the predictions yielded by the supervised models, the samples were also analyzed by the EU reference guide based on a chromatographic standard procedure [3]. It is important to bear in mind that the standard method itself may yield inconclusive results and that it only classified the samples as Prestige or not Prestige (match vs. no match assignations). Table 5 contains the class to which each sample was assigned and, to simplify the comparisons with the standard protocol, whether the final assignment of the consensus prediction would be ‘Prestige’ or ‘not Prestige’. In general, it can be deduced that the behavior of the supervised models was excellent (recall that PLS-DA was not considered here).

Table 5
Classification of the beach samples into one of the six classes (1: Ashtart, 2: Brent, 3: Maya, 4: Sahara, 5: IFO, 6: Prestige). Standard protocol refers to the CEN/TR 15522-2 guideline. P, NP and NC stand for Prestige, Not Prestige or Not Classified, respectively. SVM linear denotes that linear kernels were used for Prestige and Sahara classes whereas RBF kernels were employed for the remaining ones.

| Sample | LDA | LDA-PCA | SIMCA | KNN | RBF-PLS | SVM (Linear) | SVM (Poly) | CPANN | Consensus (Class) | Consensus (P vs NP) | Standard protocol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lira 1 | 1 | 1 | 4 | 4 | 3 | 4 | 4 | 1 | 4 | NP | NP |
| Lira 2 | 1 | 1 | 1 | 4 | 3 | NC | 4 | 1 | 1 | NP | NP |
| Lira 3 | 1 | 4 | 2 | 4 | 3 | 4 | 4 | 1 | 4 | NP | NP |
| Lira 4 | 1 | 3 | 1 | 3 | 3 | 2 | 2 | 5 | 3 | NP | NP |
| Lira 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Cayon 1 | 6 | 6 | 3 | 6 | 3 | 6 | 6 | 6 | 6 | P | NP |
| Cayon 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Cayon 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Cayon 4 | 6 | 6 | 3 | 6 | 3 | NC | 6 | 6 | 6 | P | P |
| Cayon 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Lires 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Lires 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Lires 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Portiño 1 | 5 | 5 | 5 | 5 | 6 | NC | 6 | 5 | 5 | NP | P |
| Portiño 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Portiño 3 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | P | P |
| Portiño 4 | 6 | 6 | 2 | 2 | 2 | 6 | 6 | 6 | 6 | P | P |
| Portiño 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Louro 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Louro 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Louro 3 | 1 | 4 | 4 | 4 | 3 | 4 | 4 | 1 | 4 | NP | NP |
| Louro 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Louro 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Farolira 1 | 5 | 5 | 2 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Farolira 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Farolira 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Farolira 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Muxia 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Muxia 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Barco | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Touriñan | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Lago | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Langosteira 1 | 6 | 6 | 6 | 6 | 3 | NC | 4 | 6 | 6 | P | NP |
| Langosteira 2 | 5 | 5 | 2 | 1 | 3 | NC | 4 | 4 | NC | NP | NP |
| Langosteira 3 | 1 | 4 | 4 | 4 | 3 | 4 | 4 | 1 | 4 | NP | NP |
| Pindo | 1 | 1 | 1 | 1 | 3 | 4 | 4 | 1 | 1 | NP | NP |
| Nemiña | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| Rostro | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |
| M7 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | P | P |

SVM with polynomial kernels (SVM-Polynomial) offered the best results (97.8% agreement, only one wrong sample) when compared to the standard method. CPANN and RBF-PLS had 3 errors (93.3% agreement). LDA-PCA and KNN had 4 errors (8.9% differences with respect to the standard method); SIMCA had 5 errors (11.1% disagreement) and LDA had 6 errors (13.3% disagreement). SVM with linear and RBF kernels had 6 errors (13.3% disagreement) because 4 samples could not be assigned due to a tie in the ‘one-vs-all’ approach. In addition, a ‘consensus prediction’ was developed considering all the multivariate pattern recognition techniques in the following way: the ‘consensus prediction’ for a sample was calculated as the most frequent prediction out of the eight methods employed to classify it (considering the two SVM possibilities). If different predictions receive the same number of votes, a ‘not-classified’ result is obtained. The ‘consensus prediction’ performed very well as it had only 3 errors (6.7% differences with the standard method), and it gave the same predictions as CPANN.

Accordingly, SVM with polynomial kernels outperformed the predictions of the other methods, including the consensus one. The three common misclassifications were Cayon1, Portiño1 and Langosteira1. Thus, polynomial SVM yielded 2.2% of false positives. The false positives and negatives for the consensus approach were 4.4% and 2.2%, respectively. RBF-PLS had no false positives but 6.7% of false negatives.
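The consensus rule and the percentages quoted above can be reproduced with a short Python sketch. The votes are rows copied from Table 5 in the order LDA, LDA-PCA, SIMCA, KNN, RBF-PLS, SVM linear, SVM polynomial, CPANN. Ignoring ‘NC’ votes when counting is an assumption, as are the function names; the text only states that the most frequent prediction wins and that ties give a not-classified result.

```python
from collections import Counter

def consensus(votes):
    """Most frequent class among the methods; 'NC' votes are ignored
    (assumption) and a tie for first place returns 'NC'."""
    ranked = Counter(v for v in votes if v != "NC").most_common()
    if not ranked:
        return "NC"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "NC"
    return ranked[0][0]

def p_vs_np(label, prestige=6):
    """Collapse a class label to Prestige (P) or Not Prestige (NP)."""
    return "P" if label == prestige else "NP"

def percent(n, n_total=45):
    """Share of the 45 beach samples, in %, rounded to one decimal."""
    return round(100.0 * n / n_total, 1)

lira_1 = [1, 1, 4, 4, 3, 4, 4, 1]
portino_4 = [6, 6, 2, 2, 2, 6, 6, 6]
langosteira_2 = [5, 5, 2, 1, 3, "NC", 4, 4]

for row in (lira_1, portino_4, langosteira_2):
    label = consensus(row)
    print(label, p_vs_np(label))

print(percent(1), percent(3), percent(44))
```

For these three rows the sketch returns the same consensus entries as Table 5, and the last line gives the 2.2%, 6.7% and 97.8% figures quoted in the text.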

Among the other non-parametric methods, CPANN and RBF-PLS behaved slightly worse than polynomial SVM. Parametric methods performed worse than the non-parametric ones and, surprisingly, the worst overall method was PLS-DA, without a sound reason to explain this other than the existence of complex non-linear relationships between the membership of a class and the IR spectral indexes. Likely, these would be caused by the differences in the weathering processes undergone by the products after their spillages. The best parametric method was SIMCA, with 100% success in the training and validation samples, although 5 misclassifications were obtained when compared to the EU standard guideline.

4. Conclusions

The combination of a simple and fast analytical technique (mid-IR spectroscopy) with supervised pattern recognition models can be used to reliably predict the origin of oil lumps collected at polluted beaches. This is a relevant result because European directives rely on standard methods based on chromatography, which require a long time to evaluate just one sample (as much as around 2 h just to perform the analysis) and involve somewhat subjective decisions, since they are based on visual comparisons of n-alkane distributions, bar charts of PAH concentration and double plots of diagnostic ratios derived from the chromatograms. Eight supervised pattern recognition techniques (Discriminant Analysis, Principal Components Analysis combined with Discriminant Analysis, Soft Independent Modelling of Class Analogy, Partial Least Squares Discriminant Analysis, radial basis functions-PLS, K-Nearest Neighbours, Support Vector Machines and Counterpropagation Artificial Neural Networks) plus a consensus approach were employed to classify six oil spills and a set of 45 unknown samples taken on beaches. In general, both the non-parametric and natural computation algorithms yielded better results than the parametric supervised ones.

The very good performance of the non-parametric and natural computation methods, mainly when they are applied to unknown (complex) weathered samples, can be explained by their superior generalization abilities and, likely, the ability to model subtle non-linearities in the relationships between the classes and the IR spectral indexes. It seemed that the parametric models, which in essence are based on the use of factors that are linear combinations of variables, are affected more by unstructured and/or noisy information (variance not related to the classification itself). Thus, the (very complex) weathering processes undergone by the spilled fuel samples on the sea (impossible to model or to repeat in different climatological or sea conditions) affected the non-parametric methods least as, likely, they considered only the essential critical information to differentiate among the samples. In this sense, the development of different models for each class (instead of a single model for all classes) justified the superior performance of SIMCA among the parametric methods.

Acknowledgments

M.P.G-C acknowledges the Galician Government (Xunta de Galicia; ‘Ángeles Alvariño’ and ‘Estadías de Investigación’ grants, partially supported by the EU FEDER funds) and the Spanish Ministry of Education (‘José Castillejo’ grant) for supporting the stay at the Università degli Studi di Milano-Bicocca.

The authors thank Prof. Romà Tauler for his insightful comments, which greatly improved the original manuscript.

References

[1] Z. Wang, S.A. Stout, Oil Spill Environmental Forensics: Fingerprinting and Source Identification, Academic Press, Burlington, MA, 2007.

[2] S. Ezra, S. Feinstein, I. Pelly, D. Bauman, I. Miloslavsky, Weathering of fuel oil spill on the east Mediterranean coast, Ashdod, Israel, Organic Geochemistry 31 (2000) 1733–1741.

[3] CEN/TR 15522-2, Oil spill identification. Waterborne petroleum and petroleum products, part 2: analytical methodology and interpretation of results, Technical Report, 2006.

[4] G. Pérez-Caballero, J.M. Andrade, S. Muniategui, D. Prada, Comparison of single-reflection near-infrared and attenuated total reflection mid-infrared spectroscopies to identify and monitor hydrocarbons spilled in the marine environment, Analytical and Bioanalytical Chemistry 395 (2009) 2335–2347.

[5] P. Fresco-Rivera, R. Fernández-Varela, M.P. Gómez-Carracedo, F. Ramírez-Villalobos, D. Prada, S. Muniategui, J.M. Andrade, Development of a fast analytical tool to identify oil spillages employing infrared spectral indexes and pattern recognition techniques, Talanta 74 (2007) 163–175.

[6] J.M. Andrade, P. Fresco, S. Muniategui, D. Prada, Comparison of oil spillages using mid-IR indexes and 3-way procrustes rotation, matrix-augmented principal components analysis and parallel factor analysis, Talanta 77 (2008) 863–869.

[7] C. Borges, M.P. Gómez-Carracedo, J.M. Andrade, M.F. Duarte, J.L. Biscaya, J. Aires-de-Sousa, Geographical classification of weathered crude oil samples with unsupervised self-organizing maps and a consensus criterion, Chemometrics and Intelligent Laboratory Systems 101 (2010) 43–55.

[8] L. Bartolomé, M. Deusto, N. Etxebarria, P. Navarro, A. Usobiaga, O. Zuloaga, Chemical fingerprinting of petroleum biomarkers in biota samples using retention-time locking chromatography and multivariate analysis, Journal of Chromatography. A 1157 (2007) 369–375.

[9] O. Galtier, O. Abbas, Y. Le Dréau, C. Rebufa, J. Kister, J. Artaud, N. Dupuy, Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions, Vibrational Spectroscopy 55 (2011) 132–140.

[10] P. Sun, M. Bao, G. Li, X. Wang, Y. Zhao, Q. Zhou, L. Cao, Fingerprinting and source identification of an oil spill in China Bohai Sea by gas chromatography-flame ionization detection and gas chromatography–mass spectrometry coupled with multi-statistical analyses, Journal of Chromatography. A 1216 (2009) 830–836.

[11] B.K. Lavine, D. Brzozowski, A.J. Moores, C.E. Davidson, H.T. Mayfield, Genetic algorithm for fuel spill identification, Analytica Chimica Acta 437 (2001) 233–246.

[12] D. Ebrahimi, J. Li, D.B. Hibbert, Classification of weathered petroleum oils by multi-way analysis of gas chromatography–mass spectrometry data using PARAFAC2 parallel factor analysis, Journal of Chromatography. A 1166 (2007) 163–170.

[13] J.H. Christensen, G. Tomasi, A.L. Scofield, M.F.G. Meniconi, A novel approach for characterization of polycyclic aromatic hydrocarbon (PAH) pollution patterns in sediments from Guanabara Bay, Rio de Janeiro, Brazil, Environmental Pollution 158 (2010) 3290–3297.

[14] J.A. Arancibia, C.E. Boschetti, A.C. Olivieri, G.M. Escandar, Screening of oil samples on the basis of excitation–emission room-temperature phosphorescence data and multiway chemometric techniques. Introducing the second-order advantage in a classification study, Analytical Chemistry 80 (2008) 2789–2796.

[15] N. Pieri, J.P. Planche, J. Kister, Chemical characterization of road paving bitumen using FTIR and UV synchronous fluorescence, Analusis 24 (1996) 113–122.

[16] A. Permanyer, L. Douifi, A. Lahcini, J. Lamontagne, J. Kister, FTIR and SUVF spectroscopy applied to reservoir compartmentalization: a comparative study with gas chromatography fingerprints results, Fuel 81 (2002) 861–866.

[17] R. Fernández-Varela, D. Suárez-Rodríguez, M.P. Gómez-Carracedo, J.M. Andrade, E. Fernández, S. Muniategui, D. Prada, Screening the origin and weathering of oil slicks by attenuated total reflectance mid-IR spectrometry, Talanta 68 (2005) 116–125.

[18] R. Fernández-Varela, J.M. Andrade, S. Muniategui, D. Prada, F. Ramírez-Villalobos, The comparison of two heavy fuel oils in composition and weathering pattern, based on IR, GC-FID and GC–MS analyses: application to the Prestige wreckage, Water Research 43 (2009) 1015–1026.

[19] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Part A, Elsevier, Amsterdam, 1997.

[20] K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press, New York, 1979.

[21] R.J. Jennrich, Statistical Methods for Digital Computers, Stepwise Discriminant Analysis, Wiley & Sons, New York, 1977.

[22] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Gasoline classification using near infrared (NIR) spectroscopy data: comparison of multivariate techniques, Analytica Chimica Acta 671 (2010) 27–35.

[23] C. Wang, C. Chen, C. Chiang, S. Young, S. Chow, H.K. Chiang, A probability-based multivariate statistical algorithm for autofluorescence spectroscopic identification of oral carcinogenesis, Photochemistry and Photobiology 69 (1999) 471–477.

[24] A. Eriksson, K. Persson Waller, K. Svennersten-Sjaunja, J.-E. Haugen, F. Lundby, O. Lind, Detection of mastitic milk using a gas-sensor array system (electronic nose), International Dairy Journal 15 (2005) 1193–1201.

[25] D. Cozzolino, A. Chree, J.R. Scaife, I. Murray, Usefulness of Near-Infrared Reflectance (NIR) spectroscopy and chemometrics to discriminate fishmeal batches made with different fish species, Journal of Agricultural and Food Chemistry 53 (2005) 4459–4463.

[26] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, 2006.

[27] B.M. Wise, N.B. Gallagher, R. Bro, J.M. Shaver, W. Windig, R.S. Koch, PLS-Toolbox Version 3.5 for Use with MATLAB™, Eigenvector Research, Inc, Manson, WA, USA, 2005.


[28] H. Mauser, O. Roche, M. Stahl, S. Müller, Prediction of UV and ESI-MS signal intensities, Journal of Chemical Information and Modeling 45 (2005) 1039–1046.

[29] L. Afzelius, C.M. Masimirembwa, A. Karlén, T.B. Andersson, I. Zamora, Discriminant and quantitative PLS analysis of competitive CYP2C9 inhibitors versus non-inhibitors using alignment independent GRIND descriptors, Journal of Computer-Aided Molecular Design 16 (2002) 443–458.

[30] C. Eker, R. Rydell, K. Svanberg, S. Andersson-Engels, Multivariate analysis of laryngeal fluorescence spectra recorded in vivo, Lasers in Surgery and Medicine 28 (2001) 259–266.

[31] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, Inc., New York, NY, USA, 2000.

[32] B. Walczak, D.L. Massart, The radial basis functions-partial least squares approach as a flexible non-linear regression technique, Analytica Chimica Acta 331 (1996) 177–185.

[33] B. Walczak, D.L. Massart, Application of radial basis functions-partial least squares to non-linear pattern recognition problems: diagnosis of process faults, Analytica Chimica Acta 331 (1996) 187–193.

[34] M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak, TOMCAT: a Matlab toolbox for multivariate calibration techniques, Chemometrics and Intelligent Laboratory Systems 85 (2007) 269–277.

[35] T. Czejak, W. Wu, B. Walczak, About kernel latent variable approaches and SVM, Journal of Chemometrics 19 (2005) 341–354.

[36] I.E. Frank, R. Todeschini, The Data Analysis Handbook, Elsevier, Amsterdam, 1994.

[37] D. Coomans, D.L. Massart, Alternative k-nearest neighbour rules in supervised pattern recognition: part 2. Probabilistic classification on the basis of the kNN method modified for direct density estimation, Analytica Chimica Acta 138 (1982) 153–165.

[38] R.G. Brereton, G.R. Lloyd, Support Vector Machines for classification and regression, Analyst 135 (2010) 230–267.

[39] H. Li, Y. Liang, Q. Xu, Support vector machines and its applications in chemistry, Chemometrics and Intelligent Laboratory Systems 95 (2010) 188–198.

[40] J. Luts, F. Ojeda, R. van-de-Plas, B. de Moor, S. van-Huffel, J.A.K. Suykens, A tutorial on support vector machine-based methods for classification problems in chemometrics, Analytica Chimica Acta 665 (2010) 129–145.

[41] J. Zupan, M. Novic, I. Ruisánchez, Kohonen and counterpropagation artificial neural networks in analytical chemistry, Chemometrics and Intelligent Laboratory Systems 38 (1997) 1–23.

[42] D. Ballabio, V. Consonni, R. Todeschini, The Kohonen and CP-ANN toolbox: a collection of MATLAB modules for Self Organizing Maps and Counterpropagation Artificial Neural Networks, Chemometrics and Intelligent Laboratory Systems 98 (2009) 115–122.

[43] S. Díez, E. Jover, J.M. Bayona, J. Albaigés, Prestige oil spill. III. Fate of heavy oil in the marine environment, Environmental Science & Technology 41 (2007) 3075–3082.

