Intelligent Analysis of Small Data Sets for Food Design



This thesis compares the performance of machine learning techniques and statistics in the analysis of food design data. The goal of the analysis is to understand what makes people like (or dislike) a product, by building models relating sensory features (such as flavour or texture) to consumer preferences. One difficulty in analysing these data sets is that they are extremely small, due to "taste-fatigue" of consumer preference panels.

Feature selection is essential because food sensory data sets typically have many features and few records. Several feature selection algorithms are compared, and the results highlight the need to limit the number of features used. We therefore apply model order selection to feature selection. A semi-supervised feature selection method is introduced and compared with more traditional methods.
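To illustrate what limiting the number of features can look like in practice, here is a minimal sketch of a simple correlation filter. It is not the semi-supervised method introduced in the thesis, nor any of the algorithms compared there. The data, the feature names and the fixed cap `max_features` are all hypothetical; in particular, the thesis chooses the number of features by model order selection, which this sketch does not attempt.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(features, target, max_features):
    """Rank features by |correlation| with the target and keep at most max_features."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:max_features]


# Hypothetical panel data: six products, four sensory features,
# and the mean preference score given to each product.
features = {
    "sweetness": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "crunchiness": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "saltiness": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
    "bitterness": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
}
preference = [1.1, 2.0, 2.9, 4.2, 5.0, 5.8]

selected = select_features(features, preference, max_features=2)
print(selected)
```

With only six records, every extra feature makes a chance correlation more likely, which is why the cap matters more here than it would on a large data set.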

After the selection of a suitable set of features, the relationship between those features and consumers' preferences must be modelled. Two regression techniques are compared, focussing on their relative performance on very small data sets. A semi-supervised ensemble learning algorithm is introduced, and analysed.

Consumers have individual preferences, so rather than producing a single generic product, food designers must first discover homogeneous groups of consumers, and then target each group with a different product. Several clustering techniques are compared, and consideration of their inherent biases reveals further information regarding the structure of the data. A combination of regression and clustering is proposed, which allows evaluation of clustering results using the predictive power of the resultant models.
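The idea of judging a clustering by the predictive power of the models built on it can be sketched as follows. This is only an illustration under simplifying assumptions, not the procedure used in the thesis: each cluster is modelled by a one-feature least-squares line fitted to the cluster's mean preference, and predictive power is measured by leave-one-product-out squared error against each member's own scores. The function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def clustering_score(feature, preferences, labels):
    """Mean squared leave-one-product-out error of per-cluster models (lower is better)."""
    n_products = len(feature)
    total, count = 0.0, 0
    for cluster in set(labels):
        members = [p for p, lab in zip(preferences, labels) if lab == cluster]
        # Cluster profile: mean preference of the members for each product.
        profile = [sum(m[j] for m in members) / len(members) for j in range(n_products)]
        for held_out in range(n_products):
            train = [j for j in range(n_products) if j != held_out]
            a, b = fit_line([feature[j] for j in train], [profile[j] for j in train])
            prediction = a + b * feature[held_out]
            # Compare the cluster model's prediction with each member's real score.
            for m in members:
                total += (m[held_out] - prediction) ** 2
                count += 1
    return total / count


# Hypothetical data: five products described by one feature, four consumers.
sweetness = [1.0, 2.0, 3.0, 4.0, 5.0]
preferences = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # likes sweet products
    [1.2, 2.1, 2.9, 4.1, 4.8],  # likes sweet products
    [5.0, 4.0, 3.0, 2.0, 1.0],  # dislikes sweet products
    [4.9, 4.2, 3.1, 1.8, 1.1],  # dislikes sweet products
]

good = clustering_score(sweetness, preferences, [0, 0, 1, 1])
mixed = clustering_score(sweetness, preferences, [0, 1, 0, 1])
print(good < mixed)
```

A grouping that mixes consumers with opposite tastes averages their preferences into a nearly flat profile, so its models predict individual scores poorly and it receives a worse (higher) score.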

Preference data sets contain a significant number of misleading outliers owing to the way they are collected. An algorithm that combines clustering and outlier detection is introduced, which aims to produce an outlier-free cluster model, and also provides heuristic estimates of the number of outliers present.
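As a rough illustration of combining clustering with outlier detection, the sketch below uses a simple stand-in technique rather than the algorithm introduced in the thesis: a one-dimensional k-means variant in which any point farther than a fixed `threshold` from every centre is set aside and does not influence the centres. The number of points set aside then serves as a crude estimate of the number of outliers. The initial centres, the threshold and the data are all hypothetical.

```python
def cluster_with_outliers(points, centres, threshold, iterations=20):
    """1-D k-means variant that excludes points farther than `threshold` from every centre."""
    centres = list(centres)
    outliers = []
    for _ in range(iterations):
        groups = [[] for _ in centres]
        outliers = []
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            if abs(p - centres[nearest]) > threshold:
                outliers.append(p)  # set aside: does not influence any centre
            else:
                groups[nearest].append(p)
        # An empty group keeps its previous centre.
        new_centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, outliers


# Hypothetical preference scores: two groups of consumers and one stray score.
scores = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.5]
centres, outliers = cluster_with_outliers(scores, centres=[0.0, 6.0], threshold=2.0)
print(centres, outliers)
```

Because the stray score never enters the mean of either group, the resulting centres describe only the well-behaved data.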

Overall, machine learning techniques show performance similar to traditional statistical techniques, with small improvements in accuracy in some cases. Machine learning typically depends on fewer assumptions than traditional statistics: where the statistical assumptions are invalid, machine learning may improve the results. Furthermore, machine learning makes use of considerable computational power, which is now cheaply available, in the search for improved solutions. In this thesis, we examine the efficacy of machine learning techniques when analysing food design data sets.

In summary, the main contributions of this thesis are:

  • A semi-supervised feature selection algorithm
  • A semi-supervised ensemble for regression
  • A clustering evaluation technique
  • An outlier detection technique for clustering

You can download a zipped PostScript (500k) or PDF (950k) copy of the entire thesis, or click below to download a single (uncompressed) chapter.

Front matter: Abstract, acknowledgements, contents (PostScript, PDF)
Chapter 1: Introduction (PostScript, PDF)
Chapter 2: Feature Selection and Regression (PostScript, PDF)
Chapter 3: Cluster Analysis (PostScript, PDF)
Chapter 4: Outlier Detection (PostScript, PDF)
Chapter 5: Future Work (PostScript, PDF)
Chapter 6: Conclusions (PostScript, PDF)
End matter: Appendix, references, indices (PostScript, PDF)

All files are less than 800k.

My thesis was examined on September 4th 2002, by Prof. David Hand (Imperial College, London) and Dr. Peter Flach (University of Bristol). My thanks go to both!


All contents are licensed under this Creative Commons Licence unless explicitly stated otherwise.