# Improving Linear Models Using Explicit Kernel Methods

In this tutorial, we demonstrate how combining (explicit) kernel methods with
linear models can drastically improve the latter's predictive quality without
significantly increasing training and inference times. Unlike dual kernel
methods, explicit (primal) kernel methods scale well with the size of the
training dataset both in terms of training/inference times and in terms of
memory requirements.

Currently, explicit kernel mappings are supported for dense features. Support
for sparse features is in the works.

We will use [tf.contrib.learn](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn)
(TensorFlow's high-level Machine Learning API) Estimators for our ML models.
The tf.contrib.learn API reduces the boilerplate code one needs to write for
configuring, training and evaluating models and lets us focus on the core
ideas. If you are not familiar with this API, the
[tf.estimator Quickstart](https://www.tensorflow.org/get_started/estimator) is a
good place to start. We will use MNIST, a widely used dataset containing images
of handwritten digits (between 0 and 9). The tutorial consists of the following
steps:

* Load and prepare MNIST data for classification.
* Construct a simple linear model, train it and evaluate it on the eval data.
* Replace the linear model with a kernelized linear model, re-train and
re-evaluate.

## Load and prepare MNIST data for classification
The first step is to prepare the data to be fed to the ML models. The following
utility command from tf.contrib.learn loads the MNIST dataset:

```python
data = tf.contrib.learn.datasets.mnist.load_mnist()
```

This loads the entire MNIST dataset (containing 70K samples) and splits it into
train, validation and test data with 55K, 5K and 10K samples respectively. Each
split contains one numpy array for images (with shape [sample_size, 784]) and
one for labels (with shape [sample_size, 1]). In this tutorial, we only use the
train and validation splits (to train and evaluate our models respectively).
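Before building the input pipeline, you can sanity-check the splits described
above with a quick snippet like the following (a minimal sketch; the sizes in
the comments are the ones quoted in the text):

```python
import tensorflow as tf

# Minimal sanity check of the MNIST splits described above.
data = tf.contrib.learn.datasets.mnist.load_mnist()
print(data.train.images.shape)       # expected: (55000, 784)
print(data.validation.images.shape)  # expected: (5000, 784)
print(data.test.images.shape)        # expected: (10000, 784)
```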
In order to feed data to a tf.contrib.learn Estimator, it is helpful to convert
it to Tensors. For this, we will use an `input function` which adds Ops to the
TensorFlow graph that, when executed, create mini-batches of Tensors to be used
downstream. For more background on input functions, check
[Building Input Functions with tf.contrib.learn](https://www.tensorflow.org/get_started/input_fn).
In this example, we will use the `tf.train.shuffle_batch` Op which, besides
converting numpy arrays to Tensors, allows us to specify the batch_size and
whether to randomize the input every time the input_fn Ops are executed
(randomization typically expedites convergence during training). The full code
for loading and preparing the data is shown in the snippet below. In this
example, we use mini-batches of size 256 for training and the entire sample (5K
entries) for evaluation. Feel free to experiment with different batch sizes.

```python
import numpy as np
import tensorflow as tf

def get_input_fn(dataset_split, batch_size, capacity=10000, min_after_dequeue=3000):

  def _input_fn():
    images_batch, labels_batch = tf.train.shuffle_batch(
        tensors=[dataset_split.images, dataset_split.labels.astype(np.int32)],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True,
        num_threads=4)
    features_map = {'images': images_batch}
    return features_map, labels_batch

  return _input_fn

data = tf.contrib.learn.datasets.mnist.load_mnist()

train_input_fn = get_input_fn(data.train, batch_size=256)
eval_input_fn = get_input_fn(data.validation, batch_size=5000)
```

## Training a simple linear model
We can now train a linear model over the MNIST dataset. We will use the
[tf.contrib.learn.LinearClassifier](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn/estimators/linear.py)
estimator with 10 classes (representing the 10 digits). The input features form
a 784-dimensional (dense) vector which can be specified as follows:

```python
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
```

The full code for constructing, training and evaluating a LinearClassifier
estimator is shown below.

```python
import time

# Specify the feature(s) to be used by the estimator.
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
estimator = tf.contrib.learn.LinearClassifier(feature_columns=[image_column], n_classes=10)

# Train.
start = time.time()
estimator.fit(input_fn=train_input_fn, steps=2000)
end = time.time()
print('Elapsed time: {} seconds'.format(end - start))

# Evaluate and report metrics.
eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
print(eval_metrics)
```

On eval data, the loss (i.e., the value of the objective function being
minimized during training) lies between **0.25** and **0.30** (depending on the
parameters used), while the accuracy of the classifier is approximately
**92.5%** (training is randomized, so the exact loss and accuracy will vary).
The training time is around 25 seconds (this will also vary based on the
machine you run the code on).

In addition to experimenting with the (training) batch size and the number of
training steps, there are a couple of other parameters that can be tuned as
well. For instance, you can change the optimization method used to minimize the
loss by explicitly selecting another optimizer from the collection of
[available optimizers](https://www.tensorflow.org/code/tensorflow/python/training).
As an example, the following code constructs a LinearClassifier estimator that
uses the Follow-The-Regularized-Leader (FTRL) optimization strategy with a
specific learning rate and L2 regularization.

```python
optimizer = tf.train.FtrlOptimizer(learning_rate=5.0, l2_regularization_strength=1.0)
estimator = tf.contrib.learn.LinearClassifier(
    feature_columns=[image_column], n_classes=10, optimizer=optimizer)
```

Regardless of the values of these parameters, the maximum accuracy a linear
model can achieve on this dataset caps out at around **93%**.

## Using explicit kernel mappings with the linear model
The relatively high error (~7%) of the linear model over MNIST indicates that
the input data is not linearly separable. We will use explicit kernel mappings
to reduce the classification error.

**Intuition:** The high-level idea is to use a non-linear map to transform the
input space to another feature space (of possibly higher dimension) where the
(transformed) features are (almost) linearly separable, and then apply a linear
model on the mapped features. This is shown in the following figure:

![image](./kernel_mapping.png)

**Technical details overview:** In this example we will use **Random Fourier
Features** (introduced in the
["Random Features for Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf)
paper by Rahimi and Recht) to map the input data. Random Fourier Features map a
vector \\(\mathbf{x} \in \mathbb{R}^d\\) to \\(\mathbf{x'} \in \mathbb{R}^D\\)
via the following mapping:

$$
RFFM(\cdot): \mathbb{R}^d \to \mathbb{R}^D, \quad
RFFM(\mathbf{x}) = \cos(\mathbf{\Omega} \cdot \mathbf{x} + \mathbf{b})
$$

where \\(\mathbf{\Omega} \in \mathbb{R}^{D \times d}\\),
\\(\mathbf{x} \in \mathbb{R}^d,\\) \\(\mathbf{b} \in \mathbb{R}^D\\) and the
cosine is applied element-wise.

In this example, the entries of \\(\mathbf{\Omega}\\) and \\(\mathbf{b}\\) are
sampled from distributions such that the mapping satisfies the following
property:

$$
RFFM(\mathbf{x})^T \cdot RFFM(\mathbf{y}) \approx
e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2}}
$$

The right-hand-side quantity of the expression above is known as the RBF (or
Gaussian) kernel function. This function is one of the most widely used kernel
functions in Machine Learning and (implicitly) measures similarity in a
different (much higher-dimensional) space than the original one. See
[Radial basis function kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)
for more details.
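To build some intuition for this approximation, here is a small NumPy sketch
(not part of the TensorFlow API) that samples \\(\mathbf{\Omega}\\) and
\\(\mathbf{b}\\) according to the standard Random Fourier Features construction
and compares the inner product of two mapped vectors with the exact RBF kernel
value. A \\(\sqrt{2/D}\\) scaling factor, omitted from the display above for
simplicity, is included so that the inner product matches the kernel value:

```python
import numpy as np

def rffm(x, omega, b):
  # cos(Omega . x + b), applied element-wise; the sqrt(2/D) factor makes the
  # inner product of two mapped vectors an unbiased estimate of the RBF kernel.
  D = omega.shape[0]
  return np.sqrt(2.0 / D) * np.cos(omega.dot(x) + b)

d, D, sigma = 784, 2000, 5.0
rng = np.random.RandomState(0)
omega = rng.normal(scale=1.0 / sigma, size=(D, d))  # Omega_ij ~ N(0, 1/sigma^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)           # b_i ~ Uniform[0, 2*pi)

x, y = rng.rand(d), rng.rand(d)
approx = rffm(x, omega, b).dot(rffm(y, omega, b))
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
print(approx, exact)  # the two values should agree up to O(1/sqrt(D)) error
```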
**Kernel Classifier:** `tf.contrib.kernel_methods.KernelLinearClassifier` is a
pre-packaged `tf.contrib.learn` estimator that combines the power of explicit
kernel mappings with linear models. Its API is very similar to that of the
LinearClassifier, with the additional ability to specify a list of explicit
kernel mappings to be applied to each feature used by the classifier. The
following code snippet demonstrates how to replace LinearClassifier with
KernelLinearClassifier.

```python
# Specify the feature(s) to be used by the estimator. This is identical to the
# code used for the LinearClassifier.
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
optimizer = tf.train.FtrlOptimizer(
    learning_rate=50.0, l2_regularization_strength=0.001)

kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
kernel_mappers = {image_column: [kernel_mapper]}
estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)

# Train.
start = time.time()
estimator.fit(input_fn=train_input_fn, steps=2000)
end = time.time()
print('Elapsed time: {} seconds'.format(end - start))

# Evaluate and report metrics.
eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
print(eval_metrics)
```

The only additional parameter passed to `KernelLinearClassifier` is a dictionary
mapping feature columns to lists of kernel mappings to be applied to the
corresponding feature column. In this example, the lines

```python
kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
kernel_mappers = {image_column: [kernel_mapper]}
estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)
```

instruct the classifier to first map the initial 784-dimensional images to
2000-dimensional vectors using Random Fourier Features and then learn a linear
model on the transformed vectors. Note that, besides the output dimension, there
is one more parameter (stddev) involved. This parameter is the standard
deviation (\\(\sigma\\)) of the approximated RBF kernel and controls the
similarity measure used in classification. It is typically determined via
hyperparameter tuning.

Running the code above yields a loss of approximately **0.10** while the
accuracy increases to approximately **97%** on eval data (a gain of roughly 4
percentage points over the plain linear model). The training time hovers around
35 seconds. We can increase the accuracy further by increasing the output
dimension of the mapping and tuning the standard deviation.

**On the role of stddev:** The classification quality is very sensitive to the
value of the stddev parameter used to define the similarity measure between
pairs of input features. The following table shows the accuracy of the
classifier on the eval data for different values of stddev (for all experiments
the output dimension was fixed to 3000). The optimal value is stddev=5.0. Notice
how too small or too large stddev values can dramatically decrease the accuracy
of the classification.

stddev | eval accuracy
:----- | :------------
1.0    | 0.1362
2.0    | 0.4764
4.0    | 0.9654
5.0    | 0.9766
8.0    | 0.9714
16.0   | 0.8878

**On the role of the output dimension:** Intuitively, the larger the output
dimension of the mapping, the more closely the inner product of two mapped
vectors approximates the kernel, which typically translates to better
classification accuracy. Another way to think about this is that the output
dimension equals the number of weights of the linear model (the larger this
dimension, the larger the "degrees of freedom" of the model). However, beyond a
certain threshold, higher output dimensions increase the accuracy very little
(while still increasing the training time). This is shown in the following two
figures, which depict the eval accuracy as a function of the output dimension
and of the training time, respectively.

![image](./acc_vs_outdim.png) ![image](./acc-vs-trn_time.png)
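If you would like to run these kinds of experiments yourself, a simple sweep
that reuses the estimator construction shown earlier is one way to do it. The
sketch below loops over the stddev values from the table above with the output
dimension fixed to 3000 (exact metrics will vary from run to run, and each
iteration trains a model from scratch, so this takes a while):

```python
# Sketch of a stddev sweep (output dimension fixed to 3000), reusing
# image_column, train_input_fn and eval_input_fn defined earlier.
for stddev in [1.0, 2.0, 4.0, 5.0, 8.0, 16.0]:
  optimizer = tf.train.FtrlOptimizer(
      learning_rate=50.0, l2_regularization_strength=0.001)
  kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
      input_dim=784, output_dim=3000, stddev=stddev, name='rffm')
  estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
      n_classes=10, optimizer=optimizer,
      kernel_mappers={image_column: [kernel_mapper]})
  estimator.fit(input_fn=train_input_fn, steps=2000)
  eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
  print('stddev: {}, eval metrics: {}'.format(stddev, eval_metrics))
```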
## Explicit kernel mappings: summary and practical tips
* Explicit kernel mappings combine the predictive power of non-linear models
with the scalability of linear models.
* Unlike traditional dual kernel methods, they can scale to millions or hundreds
of millions of samples.
* Random Fourier Features can be particularly effective for datasets with dense
features.
* The parameters of the kernel mapping are often data-dependent, and model
quality can be very sensitive to these parameters. Use hyperparameter tuning to
find the optimal values.
* If you have multiple numerical features, concatenate them into a single
multi-dimensional feature and apply the kernel mapping to the concatenated
vector, as shown in the sketch below.
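For concreteness, here is a minimal sketch of the last tip. The two dense
features, their dimensions, and the mapper settings are hypothetical
placeholders (not part of the MNIST example above) and would need to be tuned
as discussed earlier:

```python
import numpy as np
import tensorflow as tf

# Two hypothetical dense numerical features (random placeholders for illustration).
num_samples = 1000
dense_feature_a = np.random.rand(num_samples, 3).astype(np.float32)
dense_feature_b = np.random.rand(num_samples, 7).astype(np.float32)

# Concatenate them into a single 10-dimensional feature...
combined = np.concatenate([dense_feature_a, dense_feature_b], axis=1)

# ...and apply a single kernel mapping to the combined vector. The `combined`
# array would be fed under the 'combined' key by an input_fn, as in
# get_input_fn above; output_dim and stddev here are placeholder values.
combined_column = tf.contrib.layers.real_valued_column('combined', dimension=10)
kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
    input_dim=10, output_dim=1000, stddev=1.0, name='rffm_combined')
estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
    n_classes=10, optimizer=tf.train.FtrlOptimizer(learning_rate=50.0),
    kernel_mappers={combined_column: [kernel_mapper]})
```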