1.3 Model Selection
One of the most commonly used techniques for model selection is cross-validation. In $S$-fold cross-validation, the training set is divided into $S$ equal subsets; one subset is used for model validation and the remaining $S-1$ subsets are used for model training. This means that a fraction $\frac{S-1}{S}$ of the dataset is used for training. For a scarce dataset, it is common practice to set $S=N$, which is called the leave-one-out cross-validation technique. A major drawback of cross-validation is that it requires $S$ separate training runs, one per subset. An ideal model validation technique would rely only on the training data and allow multiple hyperparameters and model types to be compared in a single training run.
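A minimal sketch of $S$-fold cross-validation is given below; the `train_and_score` callback is a hypothetical stand-in for whatever model is being selected, and the random shuffling and equal splitting are illustrative assumptions.

```python
import numpy as np

def s_fold_cross_validation(X, y, train_and_score, S=5, seed=0):
    """Estimate validation performance by S-fold cross-validation.

    train_and_score(X_train, y_train, X_val, y_val) is assumed to fit a
    model on the training split and return a validation score.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, S)          # S roughly equal subsets
    scores = []
    for i in range(S):
        val_idx = folds[i]                      # one subset held out for validation
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return np.mean(scores)                      # average score over the S runs
```

Setting `S = len(X)` gives leave-one-out cross-validation, which makes the computational drawback explicit: the model must be trained $N$ times.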
1.4 Curse of Dimensionality
One rudimentary way to design a classification or regression algorithm is to divide the whole region containing the data points into small cells and make decisions based on the distribution of training points in the individual cells. For example, in a classification problem, we can find the cell to which a new point to be classified belongs and assign it the most frequently occurring class in that cell, as in the sketch below. A simple $2D$ representation is shown in the middle figure below.
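A minimal sketch of this cell-based classifier follows, assuming features lie in $[0, 1]^D$ and a uniform grid of `n_cells` bins per dimension; the function and parameter names are illustrative.

```python
import numpy as np
from collections import Counter

def cell_classify(X_train, y_train, x_new, n_cells=10, low=0.0, high=1.0):
    """Assign x_new the most frequent class among training points in its cell."""
    width = (high - low) / n_cells

    def cell_of(x):
        # Index of the grid cell a point falls into, clipped to stay in range.
        return tuple(np.clip(((np.asarray(x) - low) // width).astype(int),
                             0, n_cells - 1))

    target_cell = cell_of(x_new)
    labels = [y for x, y in zip(X_train, y_train) if cell_of(x) == target_cell]
    if not labels:
        return None  # empty cell: the method has no basis for a decision
    return Counter(labels).most_common(1)[0][0]
```

Note that the grid contains `n_cells ** D` cells, which is the exponential growth discussed next.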
As we see in the above figure, if the dimensionality keeps increasing, the number of cells grows exponentially, given that the interval is kept constant across all dimensions. To ensure that the cells are not empty, we would need an exponentially large amount of training data. One of the major reasons we fail to appreciate the problems of higher dimensions is our limitation in visualizing any space beyond $3D$. As an example, for a $D$-dimensional space, the fraction of the volume of a sphere that lies between radius $r = 1 - \epsilon$ and $r = 1$ is $\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)}$, where $V_D(r) = K_D r^D$. This gives the fraction as $\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D$. For large $D$, even for small values of $\epsilon$, this fraction is close to $1$; for example, in a $30$-dimensional space, about $96\%$ of the volume is concentrated in the outermost $10\%$ (the outer shell) of the sphere. This means that not all intuitions developed in spaces of low dimensionality carry over to higher-dimensional spaces. Hence, to make low-dimensional pattern recognition techniques work in high-dimensional settings, certain transformations of the dataset are needed.
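The shell-volume fraction $1 - (1 - \epsilon)^D$ is easy to check numerically; the short sketch below (function name is illustrative) reproduces the $96\%$ figure for $D = 30$, $\epsilon = 0.1$.

```python
def outer_shell_fraction(D, eps):
    """Fraction of a unit D-sphere's volume within eps of its surface,
    using V_D(r) = K_D * r**D, i.e. 1 - (1 - eps)**D."""
    return 1 - (1 - eps) ** D

for D in (1, 2, 10, 30, 100):
    print(D, round(outer_shell_fraction(D, 0.10), 3))
# D = 30 with eps = 0.1 gives about 0.958, i.e. roughly 96% of the volume.
```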