Title: | Finding features - variable extraction strategies for dimensionality reduction and marker compounds identification in GC-IMS data |
Author(s): | Christmann J; Rohn S; Weller P; |
Address: | "Institute for Instrumental Analytics and Bioanalysis, Mannheim University of Applied Sciences, Paul-Wittsack-Strasse 10, 68163 Mannheim, Germany; Hamburg School of Food Science, University of Hamburg, Grindelallee 117, 20146 Hamburg, Germany. Hamburg School of Food Science, University of Hamburg, Grindelallee 117, 20146 Hamburg, Germany; Department of Food Chemistry and Analysis, Institute of Food, Technology and Food Chemistry, Technische Universität Berlin, TIB 4/3-1, Gustav-Meyer-Allee 25, 13355 Berlin, Germany. Institute for Instrumental Analytics and Bioanalysis, Mannheim University of Applied Sciences, Paul-Wittsack-Strasse 10, 68163 Mannheim, Germany. Electronic address: p.weller@hs-mannheim.de" |
DOI: | 10.1016/j.foodres.2022.111779 |
ISSN/ISBN: | 1873-7145 (Electronic) 0963-9969 (Linking) |
Abstract: | "Gas chromatography hyphenated to ion mobility spectrometry (GC-IMS) is a powerful, two-dimensional separation and detection technique for volatile organic compounds (VOCs). Low detection limits, high selectivity and robust operation make it an ideal tool for non-target screening (NTS) approaches. Combined with multivariate data analysis, it has been successfully applied to several areas in food science, such as authenticity control and flavor profiling. The recorded raw data contain a large number of variables because of the high scan speed of the instrument. Additionally, NTS approaches - by design - record more data than required. Reducing the number of variables is therefore a key step in any machine learning pipeline to limit overfitting, overlong training times and model complexity. This study compares the two most widely used dimensionality reduction techniques, PCA and PLS, with respect to interpretability, as a tool to find marker compounds, and performance as a preprocessing step for supervised learning. Both offer per-variable visualizations, which allow easy interpretation of results and retain a connection to the input data, which can lead to the discovery of marker compounds. A GC-IMS dataset on the botanical origin of honey is used, and all formatting steps necessary to apply PCA and PLS to higher-dimensional data and obtain intuitive figures are explained. To evaluate effectiveness as a preprocessing step in a supervised pipeline, four supervised algorithms were fitted with PCA or PLS variable reduction. PLS proved to be a more effective step in a supervised workflow in terms of accuracy, while PCA is highly effective for revealing preprocessing weaknesses such as misalignments." |
Keywords: | Gas Chromatography-Mass Spectrometry/methods; *Honey/analysis; Ion Mobility Spectrometry/methods; Principal Component Analysis; *Volatile Organic Compounds/analysis; Chemometrics; Food authenticity; Non-target screening; Python; VOC profiling; |
Notes: | "Medline. Christmann, Joscha; Rohn, Sascha; Weller, Philipp. eng. Research Support, Non-U.S. Gov't. Canada. 2022/10/05. Food Res Int. 2022 Nov; 161:111779. doi: 10.1016/j.foodres.2022.111779. Epub 2022 Aug 23" |
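
Illustration: the abstract mentions formatting steps that let PCA and PLS run on higher-dimensional GC-IMS data while keeping per-variable visualizations. The following is a minimal sketch of one way this could look, not the authors' implementation. It assumes NumPy is available and uses purely synthetic data. The array sizes, the two class labels, and the position of the artificial "marker peak" are all hypothetical. Each 2D spectrum (retention time x drift time) is unfolded into a row vector, PCA is computed by singular value decomposition, and the loadings are folded back into the 2D shape so they can be viewed like a spectrum. For the supervised case, only the first PLS weight vector is computed, which for a single response is proportional to X^T y on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic stand-in for GC-IMS spectra:
# each sample is a 2D matrix (retention time x drift time).
n_samples, n_rt, n_dt = 20, 30, 40
spectra = rng.normal(0.0, 0.1, size=(n_samples, n_rt, n_dt))
labels = np.array([0] * 10 + [1] * 10)  # two hypothetical classes

# Artificial "marker peak" present only in class 1 (assumed position).
spectra[labels == 1, 10:13, 20:23] += 1.0

# Unfold every 2D spectrum into one row: (samples, variables).
X = spectra.reshape(n_samples, n_rt * n_dt)
Xc = X - X.mean(axis=0)  # mean-center each variable

# PCA via singular value decomposition of the centered matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 2
scores = U[:, :n_components] * S[:n_components]  # sample coordinates
loadings = Vt[:n_components]                     # one weight per variable
explained_ratio = S**2 / np.sum(S**2)            # variance share per component

# Fold the loadings back so each component can be shown as a 2D map.
pca_loading_maps = loadings.reshape(n_components, n_rt, n_dt)

# First PLS weight vector for a single centered response y:
# the direction in variable space with maximal covariance with y.
yc = labels - labels.mean()
w = Xc.T @ yc
w /= np.linalg.norm(w)
pls_weight_map = w.reshape(n_rt, n_dt)

# Position of the variable with the largest absolute PLS weight.
peak_rt, peak_dt = np.unravel_index(
    np.argmax(np.abs(pls_weight_map)), pls_weight_map.shape
)
print("largest PLS weight at (retention index, drift index):", peak_rt, peak_dt)
```

In this sketch the folded maps (`pca_loading_maps`, `pls_weight_map`) have the same shape as a spectrum, so high-magnitude regions point back to specific retention and drift times, which is the link to candidate marker compounds described in the abstract. The reduced `scores` matrix would be the input to a downstream classifier.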