# principal components analysis question

#### Ayatollah So

##### the spoof'll set you free
Math and science whizzes, please help. I have a science problem very much like this one where they are trying to distinguish two chemicals by their Raman spectra. They used Principal Components Analysis (PCA) to determine which features of the spectra to use and how much to weight them. Here's a graph from their website (the arrows indicate major differences between the two):

I'm also trying to distinguish chemicals by their spectra - not using Raman, but that shouldn't matter - and I've been pondering the use of PCA, though I don't understand it very well. Here's what really bothers me: PCA is often described as a tool to zero in on the data that's really important, yet in this very example on this website, they give the lie to that.

Notice on the above graph, that two of the arrows (major difference points in the spectra) cluster on the left, and the others cluster on the right. They divide their spectrum into two regions, accordingly. They run their Principal Components Analysis and regression analysis on each region separately, then on both regions combined. Region I gives a good reliable prediction, i.e., if you take the spectrum of an unknown sample made of X% of the one chemical and (100-X)% of the other, your analysis will almost certainly be within a few % of the actual concentration. Region II gives a crappy prediction. And Regions I+II combined gives a crappy prediction (!,?).
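For anyone who wants to poke at this numerically, the PCA-plus-regression procedure they describe (principal components regression, PCR) can be sketched on made-up data. Everything below (peak positions, widths, noise level) is invented for illustration, not taken from the Otago example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 30 mixture spectra over 100 wavelength channels.
# Each spectrum is X% chemical A and (100-X)% chemical B, plus noise.
wavelengths = np.linspace(0, 1, 100)
peak_a = np.exp(-((wavelengths - 0.3) ** 2) / 0.002)   # A's peak
peak_b = np.exp(-((wavelengths - 0.35) ** 2) / 0.002)  # B's (overlapping) peak
conc = rng.uniform(0, 100, 30)                          # % of chemical A
spectra = (np.outer(conc, peak_a) + np.outer(100 - conc, peak_b)) / 100
spectra += rng.normal(0, 0.01, spectra.shape)           # measurement noise

# Principal components regression: project mean-centered spectra onto
# the top k eigenvectors of the covariance, then regress concentration
# on the scores by ordinary least squares.
def pcr_fit(X, y, k=2):
    mu = X.mean(axis=0)
    Xc = X - mu
    # right singular vectors = covariance eigenvectors, by decreasing variance
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = vt[:k].T                 # (channels, k)
    scores = Xc @ loadings              # (samples, k)
    A = np.column_stack([scores, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return mu, loadings, coef

def pcr_predict(X, mu, loadings, coef):
    scores = (X - mu) @ loadings
    return np.column_stack([scores, np.ones(len(X))]) @ coef

mu, loadings, coef = pcr_fit(spectra, conc, k=2)
pred = pcr_predict(spectra, mu, loadings, coef)
print(np.max(np.abs(pred - conc)))  # in-sample error, should be small
```

On data like this, where the composition really does drive the dominant variance, the prediction comes out within a percent or so; the thread's whole puzzle is what happens when it doesn't.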

So apparently, in order to use PCA to zero in on the "right" subset of data, you already have to have (to some extent?) picked the right subset of data. Gee, great. How useful.

I'm not so much interested in why this weirdness - more data gives you worse predictions - happens. I have a hunch why. But what I really want to know is what to do instead of principal components analysis, or in addition to it. What mathematical analysis tools, if any, can I use to pre-select the most "predictive" sub-section(s) of the spectra?


#### GoodGame

##### Red, White, & Blue, baby!
I'm not a math genius, but I did sleep at a Holiday Inn.

Looking at the Wikipedia article on it, there are assumptions that must hold for the method to work. Are you sure the data (collection) meets those assumptions?

From the Wikipedia article (though apparently there is a lot of disagreement on the talk page about whether the article is sound):

Assumption on Linearity
We assumed the observed data set to be linear combinations of certain basis. Non-linear methods such as kernel PCA have been developed without assuming linearity.

Assumption on the statistical importance of mean and covariance
PCA uses the eigenvectors of the covariance matrix and it only finds the independent axes of the data under the Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes. When PCA is used for clustering, its main limitation is that it does not account for class separability since it makes no use of the class label of the feature vector. There is no guarantee that the directions of maximum variance will contain good features for discrimination.

Assumption that large variances have important dynamics
PCA simply performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance. It is only when we believe that the observed data has a high signal-to-noise ratio that the principal components with larger variance correspond to interesting dynamics and lower ones correspond to noise.

Essentially, PCA involves only rotation and scaling. The above assumptions are made in order to simplify the algebraic computation on the data set. Some other methods have been developed without one or more of these assumptions; these are briefly described below.
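The "coordinate rotation" in that third assumption can be made concrete with a small numpy sketch (synthetic data): PCA is just a rotation into the eigenbasis of the covariance matrix, which de-correlates the axes and does nothing else:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated 2-D data: most variance along the (1, 1) direction.
x = rng.normal(0, 3, 500)
data = np.column_stack([x + rng.normal(0, 0.5, 500),
                        x + rng.normal(0, 0.5, 500)])

# PCA = eigendecomposition of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotating into the eigenbasis de-correlates the axes -- and that is all
# PCA does; it never looks at class labels or "interestingness".
rotated = centered @ eigvecs
print(np.cov(rotated, rowvar=False).round(3))  # ~diagonal matrix
```

Note that nothing in this computation knows which samples belong to which chemical, which is exactly the "no class separability" caveat.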

edit: this website seems like a proper tutorial on it. http://neon.otago.ac.nz/chemlect/chem306/pca/Theory_PCA/index.html EDIT: And oh, seems to be the one you used for your OP.

EDIT: Page 9 of the url seems to be where they give the breakdown on region 2. http://neon.otago.ac.nz/chemlect/chem306/pca/Spectroscopy_PCA/page9.html And I see now on the conclusion why you're pissed off. They say choose the best parts of the region, but don't say how.

It seems to me that the "PCs," or eigenvectors, are basically analogous to slices of an orange, with the orange being sliced in multiple different ways, and the question being how many slices are relevant to the problem. But what they're doing in the example is more like unraveling a knot of strings into separate strands. I suspect that in Region I the knot is pretty simple: there are only a few strings in it. In Region II, I think the knot is pretty messy, with lots of strands. (The reason is that they're looking at nearly similar chemicals, and in Region II there are more chemicals contributing, so more similar strands to isolate.)

Based on those analogies, my hunch is that the best regions to focus on are where you see actual differences between species in your mixture. E.g., if one chemical differs from the other by the addition of 2 hydrogens, and that causes a slightly different spectral peak at wavelength XYZ, then an area around XYZ is a good place to focus on.
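A crude way to act on that hunch, assuming you have (or can simulate) spectra of the two pure chemicals: rank the wavelength channels by the absolute difference between the pure spectra. The peak shapes below are made up:

```python
import numpy as np

# Hypothetical spectra of the two pure chemicals over 200 channels.
wl = np.linspace(400, 800, 200)                  # nm, made-up axis
spec_a = np.exp(-((wl - 520) ** 2) / 200)
spec_b = np.exp(-((wl - 530) ** 2) / 200) + 0.3 * np.exp(-((wl - 700) ** 2) / 50)

# Rank channels by the absolute difference between the two spectra;
# the top channels are candidate "regions to focus on".
diff = np.abs(spec_a - spec_b)
top = np.argsort(diff)[::-1][:5]
print(sorted(wl[top].round(0)))
```

On this toy data the top-ranked channels land on the shoulders of the shifted 520/530 peak and on the band near 700 that only one chemical has, which matches the "look where the species actually differ" intuition.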

The hands-on method for estimating whether a region is useful for PCA seems to be the scree plot, though I'm not comprehending this method fully. http://neon.otago.ac.nz/chemlect/chem306/pca/Theory_PCA/page6.html and page7. On the scree plot, it seems that when the y-axis drops to about 1 to 0.7, the corresponding x-axis value is the maximum number of PCs, or 'slices,' that you should use in your analysis. And that roughly corresponds to the number of different species muddling together to make the 'knot,' I'd assume. Once you've got a set of PCs, each with a number giving its relative contribution to the data, I think you transform that back into the raw spectra to infer where each contributing part actually is: basically deconvolving the knot by estimating what the component strings are, given that the PCs told you there were so many strings of such-and-such relative strength.
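Here's a sketch of what a scree plot is computing, on synthetic data built from 3 known components plus noise; the eigenvalue spectrum drops off a cliff right after the true number of components:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 3 underlying "strings" (components) plus noise,
# observed over 50 channels in 40 samples.
channels = np.linspace(0, 1, 50)
basis = np.stack([np.sin(2 * np.pi * k * channels) for k in (1, 2, 3)])
weights = rng.normal(0, 1, (40, 3))
data = weights @ basis + rng.normal(0, 0.05, (40, 50))

# The "scree" is the sorted eigenvalue spectrum of the covariance;
# a sharp drop marks where real components end and noise begins.
centered = data - data.mean(axis=0)
eigvals = np.linalg.svd(centered, compute_uv=False) ** 2 / (len(data) - 1)
print((eigvals[:5] / eigvals.sum()).round(3))
# the first 3 eigenvalues dominate; the rest form the noise floor
```

In practice the elbow is rarely this clean, which is presumably why the Otago tutorial resorts to rules of thumb about where the y-axis flattens out.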

#### Ayatollah So

##### the spoof'll set you free
Thanx game. I can pretty much guarantee that all 3 of those assumptions are wrong for my application, but the question is how wrong. Linearity is not much of a problem, I think. If a Principal Component based on the raw spectrum is related, but not linearly related, to chemical composition, we could make adjustments. For example, instead of predicting composition per se, we could predict a 2nd-order polynomial function of composition, then back-calculate composition.
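That back-calculation idea might look something like this in numpy; the quadratic response and its coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical nonlinear response: the PC score s depends quadratically
# on composition c (in %), plus a little noise.
c = rng.uniform(0, 100, 60)
score = 0.002 * c**2 + 0.1 * c + rng.normal(0, 0.1, 60)

# Fit the score as a 2nd-order polynomial of composition ...
a2, a1, a0 = np.polyfit(c, score, 2)

# ... then back-calculate composition from a new score by solving
# a2*c^2 + a1*c + a0 - s = 0 and keeping the root in [0, 100].
def back_calculate(s):
    roots = np.roots([a2, a1, a0 - s])
    real = roots[np.isreal(roots)].real
    return real[(real >= -1) & (real <= 101)][0]

true_c = 40.0
s_new = 0.002 * true_c**2 + 0.1 * true_c   # noiseless "measurement"
print(round(back_calculate(s_new), 1))      # close to 40
```

The physically sensible root has to be picked out by hand (here, the one inside the 0-100% range), which is the usual catch with inverting a polynomial calibration.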

Do you know what "class separability" means in Wiki's discussion of the second assumption? The last sentence of that paragraph sounds like it goes right to the heart of my problem: "There is no guarantee that the directions of maximum variance will contain good features for discrimination."

In our research, there are many "uninteresting" (to us) variables that influence our spectra, such as degree of oxidation of the sample, surface texture, and more.

Well, I'm going on a (no-internet) vacation for a week, but if you want to reply I'll be glad to read it when I get back.

#### GoodGame

##### Red, White, & Blue, baby!
So basically you have multiple 'experiment errors' in your system from the net influence of the 'uninteresting variables'? In other words, you want to purify the signal and eliminate the noise? I think you might want a deconvolution method instead of something like PCA.

I'm getting the feeling that PCA is perfect if you can control most of the variables in your system (or at least take them to extremes), and that it will tell you how the variables correlate. Like if you want to know how gender, race, age, and culture influence a shopping decision. I think it would actually be pretty good for something I do at work sometimes: circular dichroism on proteins followed by secondary structure prediction, where we try to estimate how the raw data fractions into secondary structure types (alpha helix, beta sheet, etc.). Though I usually use some kind of iterative, empirical prediction software (basically a least-squares fit of the raw data vs. previously solved protein structures) that is more of a brute-force, empirical solution.

> In our research, there are many "uninteresting" (to us) variables that influence our spectra, such as degree of oxidation of the sample, surface texture, and more.

I can't seem to google the term "class separability" in a tutorial, only in the literature. It seems to be a math/comp-sci term, and I'd guess it correlates with having strong control over the variables in an experimental setup.
Given that there seem to be multiple mathematical ways of defining it using means and sums, I'm guessing it's a measure of how well a mess of data actually resembles separate shapes/clumps that are distinct from each other.
One free article:
http://www.evolutionaria.com/publications/gecco04-fss.pdf
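For what it's worth, one common way class separability gets quantified is Fisher's criterion: the squared distance between class means divided by the within-class variances, per feature. A toy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two classes measured on two features: feature 0 separates the
# classes, feature 1 is identical noise for both.
class_a = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1, 100)])
class_b = np.column_stack([rng.normal(4, 1, 100), rng.normal(0, 1, 100)])

# Fisher ratio per feature: (difference of class means)^2 divided by
# the sum of within-class variances. A big ratio = separable on that axis.
def fisher_ratio(a, b):
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0))

print(fisher_ratio(class_a, class_b).round(2))
# feature 0 scores high, feature 1 scores near 0
```

Unlike PCA, this uses the class labels, so it directly measures "good features for discrimination" rather than raw variance; ranking wavelength channels by a score like this is one way to pre-select spectral regions.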

A relative to PCA, LDA, might be useful, except it's supposedly not good for regression, so probably useless to you.
http://en.wikipedia.org/wiki/Discriminant_analysis

I think if your problem is what I think it is (noise), then you should look for a deconvolution method. http://en.wikipedia.org/wiki/Deconvolution http://en.wikipedia.org/wiki/Independent_Component_Analysis#Linear_noisy_ICA Then, once you know the numerical contribution of your noisy variables, correct the data by subtracting out the contribution of the noise.
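A minimal sketch of that subtract-the-noise idea, assuming you know the spectral shape of a nuisance variable (say, an oxidation band) but not its amplitude: estimate the amplitude by least-squares projection, then subtract. All the numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical: the measured spectrum is signal plus a known nuisance
# signature of unknown amplitude, plus a little random noise.
wl = np.linspace(0, 1, 80)
signal = np.exp(-((wl - 0.5) ** 2) / 0.01)
nuisance = np.exp(-((wl - 0.8) ** 2) / 0.02)   # known shape, unknown amount
measured = signal + 0.7 * nuisance + rng.normal(0, 0.01, 80)

# Estimate the nuisance amplitude by least-squares projection onto
# its signature, then subtract its contribution.
amount = (measured @ nuisance) / (nuisance @ nuisance)
corrected = measured - amount * nuisance

print(round(float(amount), 2))   # recovers roughly the 0.7 that was added
```

The estimate is only unbiased to the extent that the nuisance signature doesn't overlap the real signal; where they overlap, some signal gets subtracted along with the noise.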

#### Mr. Blonde

##### Dr. techn.
As I gained some knowledge on this issue during my chemistry studies, I'll try to answer:

PCA is in principle nothing more than a reduction of a multidimensional space into fewer dimensions. What you do is choose the directions of your new axes according to the most variation in the space. So in principle: the x-axis has the most variance, the y-axis is the direction orthogonal to it with the most variation, and so on.
On one hand, PCA can lead to cluster analysis.

The second important thing is to gain knowledge of the directional vectors of these axes. They contain the information about where the most variance was hidden in your original dataset - in the most extreme case, the x-axis would be composed of only one dimension of the original dataset. In your case this would mean, for instance, that only one wavelength in the spectrum shows variation between the samples. The result of PCR, however, is poor if you have a lot of "base noise".
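To illustrate the point about directional vectors (loadings): in the extreme case described above, PC1's loading vector points almost entirely at the single informative wavelength. A synthetic example:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical: 25 samples over 60 channels where only channel 20
# varies between samples; everything else is flat plus tiny noise.
data = rng.normal(0, 0.01, (25, 60))
data[:, 20] += rng.normal(0, 1, 25)   # the one informative wavelength

# The loading vector (eigenvector) of PC1 shows where the variance lives.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]
print(int(np.argmax(np.abs(pc1))))    # channel 20 dominates the loading
```

Plotting `pc1` against wavelength is the usual way to read which spectral features a component is actually built from.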

As to your questions on alternatives:
Well, PCA / PCR is afaik considered rather robust to variations; in my company we use it for H2O content / API content quality control of pharmaceuticals. The alternative I know of is neural nets, but they require more pre-knowledge / "luck" in setting parameters so as not to overtrain your model while still getting predictive power out of a training set (which you have to choose really wisely).

Maybe you have a problem with your reference analytics / sample handling which results in a poor model?

#### Perfection

Completely uninformed idiot, but going to mouth off anyways:

Why divide? It seems to me that subtraction may be a better bet. That might explain why different regions do poorly, because division would skew results.

#### Ayatollah So

##### the spoof'll set you free
Thanks again folks. That evolutionaria article especially looks like a great place for me to start chasing down a chain of references.

#### Jan H

##### Prince
When reading the thread title, I thought "finally, a topic I know something about!" Unfortunately I have only used PCA in spectral analysis of noise & vibration (for example, to separate the vibrations in a car caused by a correlated force input at the four wheels). It clearly is the same mathematical principle, but I don't know anything about chemical spectra...
