Comprehensive Statistical Spectrum Analysis Routine

System for analyzing data, particularly for detecting patterns that distinguish multiple sets of data

Background:

In many systems and phenomena, natural or otherwise, distinctive patterns are buried within the highly complex data sets created to characterize them. Such patterns have been observed in the study of diseases, environmental conditions, and financial conditions, to name a few. The patterns that characterize a given condition are often not apparent to existing classification methods and systems, which typically uncover a single known differentiating feature between sets of data or analyze only a subset of the data. A hidden pattern found in one dataset is also generally not applicable to another: these systems usually require training on each new dataset and cannot completely characterize it. In light of these limitations, improved analysis methods and systems are needed. It is desirable to have a comprehensive method and system that considers every data point of the dataset to discover hidden patterns or markers, so that it can detect subtle differences across multiple datasets without retraining on each one. For example, a system trained to detect a toxin in one environmental dataset could then be used to detect that toxin in another environmental dataset without retraining.

Technology Overview:

Researchers at Stony Brook University propose a comprehensive spectrum analysis routine for the thorough analysis and comparison of spectra (e.g., serum protein spectra) from several populations (e.g., diseased and disease-free subjects). Our system will identify the best subset of k bio-markers (for any positive integer k) that achieves maximum separation between the groups. These bio-markers are candidates for subsequent biological examination that could potentially reveal disease mechanisms. Our algorithm achieved perfect separation (100% sensitivity, 100% specificity) between patients with ovarian cancer and normal controls (including benign cases) for two ovarian cancer data sets. The data sets (Ovarian Dataset 8-7-02 and Ovarian Dataset 4-3-02) are provided at the NIH clinical proteomics program databank website.
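
To illustrate what a best-subset search over k markers can look like, the following Python sketch scores every k-subset of candidate spectrum positions and keeps the best one. The nearest-centroid rule, the sensitivity-plus-specificity score, and all function names are assumptions made for illustration; the source does not state which discriminator or search strategy the routine uses.

```python
from itertools import combinations
from statistics import mean


def centroid(rows, idx):
    """Mean intensity of the selected positions across a group of spectra."""
    return [mean(r[i] for r in rows) for i in idx]


def sq_dist(row, idx, center):
    """Squared Euclidean distance from one spectrum to a centroid."""
    return sum((row[i] - c) ** 2 for i, c in zip(idx, center))


def sens_spec(cases, controls, idx):
    """Sensitivity and specificity of a nearest-centroid rule (assumed
    discriminator) restricted to the positions in idx."""
    c_case = centroid(cases, idx)
    c_ctrl = centroid(controls, idx)
    tp = sum(sq_dist(r, idx, c_case) < sq_dist(r, idx, c_ctrl) for r in cases)
    tn = sum(sq_dist(r, idx, c_ctrl) <= sq_dist(r, idx, c_case) for r in controls)
    return tp / len(cases), tn / len(controls)


def best_subset(cases, controls, candidates, k):
    """Exhaustively search all k-subsets of candidate positions and return
    the one with the highest sensitivity + specificity (assumed score)."""
    def score(idx):
        sens, spec = sens_spec(cases, controls, idx)
        return sens + spec
    return max(combinations(candidates, k), key=score)
```

Calling `best_subset(cases, controls, range(n_positions), k)` on two lists of equal-length spectra returns a tuple of k position indices. The number of subsets grows combinatorially with the number of candidates, so a search like this would plausibly be restricted to positions that already passed a pointwise screening step.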

Advantages:

- A pointwise test is applied to ensure analysis of the entire spectrum
- Smoothing is applied to reduce the noise/signal ratio
- A relative spectrum is obtained to enable cross-subject comparisons
- Random field theory is applied to adjust for multiple comparisons
- Variance stability is checked upon request to select stable markers
- Best-subset discriminator selection is applied to find the best k markers
- A re-sampling method is applied to ensure consistent performance
- The best k bio-markers from each sampling are compared, and the persistent markers are chosen for the ensuing biological examinations and research
- The proposed algorithmic routine is applicable to the analysis of any spectrum data, including the serum protein spectrum or its variations
- The algorithm is applicable to the classification of 2 or more groups
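
Several of the steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the routine itself: the moving-average smoother, the normalization by total intensity, the Welch t statistic as the pointwise test, the bootstrap as the re-sampling method, and every function and parameter name are choices made here for the example. The random field theory adjustment and the variance stability check are not shown.

```python
import math
import random
from collections import Counter
from statistics import mean, variance


def smooth(spectrum, window=3):
    """Centered moving average (assumed smoother); the window shrinks at the edges."""
    half = window // 2
    return [mean(spectrum[max(0, i - half): i + half + 1])
            for i in range(len(spectrum))]


def relative(spectrum):
    """Scale intensities to sum to 1 so that subjects are comparable (assumed definition)."""
    total = sum(spectrum)
    return [x / total for x in spectrum]


def pointwise_t(group_a, group_b):
    """Welch t statistic at every position (assumed pointwise test).
    Each group is a list of equal-length spectra with at least two subjects."""
    stats = []
    for a, b in zip(zip(*group_a), zip(*group_b)):
        diff = mean(a) - mean(b)
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        if se > 0:
            stats.append(diff / se)
        else:
            # Zero variance in both groups: no evidence if the means agree,
            # otherwise an unbounded statistic with the sign of the difference.
            stats.append(0.0 if diff == 0 else math.copysign(math.inf, diff))
    return stats


def top_k(group_a, group_b, k):
    """Indices of the k positions with the largest absolute t statistic."""
    t = pointwise_t(group_a, group_b)
    return sorted(range(len(t)), key=lambda i: abs(t[i]), reverse=True)[:k]


def persistent_markers(group_a, group_b, k, rounds=200, keep=0.8, seed=0):
    """Positions ranked in the top k in at least a `keep` fraction of
    bootstrap resamples of the subjects (assumed re-sampling scheme)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        counts.update(top_k(a, b, k))
    return sorted(i for i, c in counts.items() if c >= keep * rounds)
```

In use, each raw spectrum would first pass through `smooth` and `relative`, and the processed groups would then be handed to `persistent_markers`. Keeping only positions that recur across resamples is one simple way to favor markers whose ranking does not depend on a particular sample of subjects.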

Applications:

- Applicable to a wide range of fields including, for example, biology, medicine, chemistry, and economics
- Environmental samples can be compared and analyzed to detect the presence of a particular substance or radiation in the environment, thereby providing a bio-hazard detector
- Analyze and compare tissues and body fluids (such as serum) of diseased and control subjects so as to draw conclusions regarding the existence, progression, or regression of a diseased state