I'm not a math genius, but I did sleep at a Holiday Inn.

Looking at the Wikipedia article on it, there are assumptions that must hold for the method to work. Are you sure your data (collection) meets those assumptions?

From the Wikipedia article (though apparently there is a lot of disagreement on the talk page about whether the article is sound):

**Assumption on Linearity**

We assumed the observed data set to be linear combinations of a certain basis. Non-linear methods such as kernel PCA have been developed without assuming linearity.

**Assumption on the statistical importance of mean and covariance**

PCA uses the eigenvectors of the covariance matrix and it only finds the independent axes of the data under the Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes. When PCA is used for clustering, its main limitation is that it does not account for class separability since it makes no use of the class label of the feature vector. There is no guarantee that the directions of maximum variance will contain good features for discrimination.

**Assumption that large variances have important dynamics**

PCA simply performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance. It is only when we believe that the observed data has a high signal-to-noise ratio that the principal components with larger variance correspond to interesting dynamics and lower ones correspond to noise.

Essentially, PCA involves only rotation and scaling. The above assumptions are made in order to simplify the algebraic computation on the data set. Some other methods have been developed without one or more of these assumptions; these are briefly described below.
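Since the whole procedure boils down to "rotate onto the directions of maximum variance," it can be sketched in a few lines of numpy. This is just an illustrative toy (the 2-variable data set below is made up), not the analysis from the tutorial:

```python
# Minimal PCA sketch: center the data, eigendecompose the covariance
# matrix, and rotate (project) onto the eigenvectors. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 2 correlated variables (made-up data)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]       # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # the rotated coordinates
# After the rotation the components are de-correlated: the covariance
# of `scores` is (numerically) diagonal, with the eigenvalues on it.
print(np.round(np.cov(scores, rowvar=False), 3))
```

Note that this is exactly the "rotation" the quoted text describes: no class labels, no notion of which direction is chemically interesting, just variance.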

edit: this website seems like a proper tutorial on it.

http://neon.otago.ac.nz/chemlect/chem306/pca/Theory_PCA/index.html

EDIT: And oh, it seems to be the one you used for your OP.

EDIT: Page 9 of the url seems to be where they give the breakdown on region 2.

http://neon.otago.ac.nz/chemlect/chem306/pca/Spectroscopy_PCA/page9.html

And I see now from the conclusion why you're pissed off. They say to choose the best parts of the region, but don't say how.

It seems to me that the "PCs" or eigenvectors are basically analogous to slices of an orange, with the orange being sliced in multiple, different ways, and you're trying to see how many slices are relevant to the problem. Except what they're doing in the example is more like unraveling a knot of strings into separate strands. I suspect that in region 1 the knot is pretty simple: there are only a few strings in it. In region 2, I think the knot is pretty messy, with lots of strands. (The reason is that they're looking at nearly similar chemicals, and in region 2 there are more chemicals, so more similar strands to isolate.)

Based on those analogies, my hunch is that the best regions to focus on are where you see actual differences between species in your mixture. E.g. if one chemical differs from another by the addition of 2 hydrogens, and that causes a slightly different spectral peak at wavelength XYZ, then an area around XYZ is a good place to focus on.

The hands-on method for estimating whether a region is useful for PCA seems to be the scree plot, though I don't fully understand this method yet.

http://neon.otago.ac.nz/chemlect/chem306/pca/Theory_PCA/page6.html and page 7.

On the scree plot, it seems that when the y-axis drops to about 1 to 0.7, the corresponding x-axis value is the maximum number of PCs or 'slices' you should use in your analysis. And that roughly corresponds to the number of different species muddling together to make the 'knot', I'd assume. Once you've got a set of PCs, each with a number giving its relative contribution to the data, I think you then transform those back into the raw spectra to infer where each contributing part actually is. Basically, you deconvolve the knot by estimating what the component strings are, based on the PCs telling you there were so many strings of so much relative strength to each other.