# Improving Linear Models Using Explicit Kernel Methods

In this tutorial, we demonstrate how combining (explicit) kernel methods with
linear models can drastically improve the latter's prediction quality without
significantly increasing training and inference times. Unlike dual kernel
methods, explicit (primal) kernel methods scale well with the size of the
training dataset both in terms of training/inference times and in terms of
memory requirements.

Currently, explicit kernel mappings are supported for dense features. Support
for sparse features is in the works.

We will use [tf.contrib.learn](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn) (TensorFlow's high-level Machine Learning API) Estimators for our ML models. The
tf.contrib.learn API reduces the boilerplate code one needs to write for
configuring, training and evaluating models and lets us focus on the core
ideas. If you are not familiar with this API, the [tf.estimator Quickstart](https://www.tensorflow.org/get_started/estimator) is a good place to start. We
will use MNIST, a widely used dataset containing images of handwritten digits
(between 0 and 9). The tutorial consists of the following steps:

* Load and prepare MNIST data for classification.
* Construct a simple linear model, train it, and evaluate it on the eval data.
* Replace the linear model with a kernelized linear model, re-train, and
re-evaluate.

## Load and prepare MNIST data for classification
The first step is to prepare the data to be fed to the ML models. The following
utility function from tf.contrib.learn loads the MNIST dataset:

```python
data = tf.contrib.learn.datasets.mnist.load_mnist()
```
This loads the entire MNIST dataset (containing 70K samples) and splits it into
train, validation and test data with 55K, 5K and 10K samples respectively. Each
split contains one numpy array for images (with shape [sample_size, 784]) and
one for labels (with shape [sample_size, 1]). In this tutorial, we only use the
train and validation splits (to train and evaluate our models, respectively).
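
For example, you can quickly sanity-check the splits loaded above by printing
the shapes of their underlying numpy arrays:

```python
# Optional: inspect the shapes of the splits loaded above.
print(data.train.images.shape, data.train.labels.shape)
print(data.validation.images.shape, data.validation.labels.shape)
```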

In order to feed data to a tf.contrib.learn Estimator, it is helpful to convert
it to Tensors. For this, we will use an `input function` which adds Ops to the
TensorFlow graph that, when executed, create mini-batches of Tensors to be used
downstream. For more background on input functions, see
[Building Input Functions with tf.contrib.learn](https://www.tensorflow.org/get_started/input_fn).
In this example, we will use the `tf.train.shuffle_batch` Op which, besides
converting numpy arrays to Tensors, allows us to specify the batch_size and
whether to randomize the input every time the input_fn Ops are executed
(randomization typically expedites convergence during training). The full code
for loading and preparing the data is shown in the snippet below. In this
example, we use mini-batches of size 256 for training and the entire sample (5K
entries) for evaluation. Feel free to experiment with different batch sizes.

```python
import numpy as np
import tensorflow as tf

def get_input_fn(dataset_split, batch_size, capacity=10000, min_after_dequeue=3000):

  def _input_fn():
    # shuffle_batch converts the numpy arrays to Tensors and produces
    # randomized mini-batches of the requested size.
    images_batch, labels_batch = tf.train.shuffle_batch(
        tensors=[dataset_split.images, dataset_split.labels.astype(np.int32)],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True,
        num_threads=4)
    features_map = {'images': images_batch}
    return features_map, labels_batch

  return _input_fn

data = tf.contrib.learn.datasets.mnist.load_mnist()

train_input_fn = get_input_fn(data.train, batch_size=256)
eval_input_fn = get_input_fn(data.validation, batch_size=5000)
```

## Training a simple linear model
We can now train a linear model over the MNIST dataset. We will use the
[tf.contrib.learn.LinearClassifier](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn/estimators/linear.py) estimator with 10 classes (representing the 10 digits).
The input features form a 784-dimensional (dense) vector which can be specified
as follows:

```python
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
```

The full code for constructing, training and evaluating a LinearClassifier
estimator is shown below.

```python
import time

# Specify the feature(s) to be used by the estimator.
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
estimator = tf.contrib.learn.LinearClassifier(feature_columns=[image_column], n_classes=10)

# Train.
start = time.time()
estimator.fit(input_fn=train_input_fn, steps=2000)
end = time.time()
print('Elapsed time: {} seconds'.format(end - start))

# Evaluate and report metrics.
eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
print(eval_metrics)
```
On eval data, the loss (i.e., the value of the objective function being
minimized during training) lies between **0.25** and **0.30** (depending on the
parameters used) while the accuracy of the classifier is approximately **92.5%**
(training is randomized so the exact loss and accuracy will vary). Also, the
training time is around 25 seconds (this will also vary based on the machine you
run the code on).
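
The dict returned by `evaluate` can also be indexed directly to report
individual metrics; for this classifier it typically contains entries such as
`loss` and `accuracy` (the exact set of keys depends on the estimator):

```python
# Hedged example: pull individual metrics out of the returned dict.
print('Loss: {}, Accuracy: {}'.format(
    eval_metrics['loss'], eval_metrics['accuracy']))
```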

In addition to experimenting with the (training) batch size and the number of
training steps, there are a couple of other parameters that can be tuned as
well. For instance, you can change the optimization method used to minimize the
loss by explicitly selecting another optimizer from the collection of
[available optimizers](https://www.tensorflow.org/code/tensorflow/python/training).
As an example, the following code constructs a LinearClassifier estimator that
uses the Follow-The-Regularized-Leader (FTRL) optimization strategy with a
specific learning rate and L2 regularization.

```python
optimizer = tf.train.FtrlOptimizer(learning_rate=5.0, l2_regularization_strength=1.0)
estimator = tf.contrib.learn.LinearClassifier(
    feature_columns=[image_column], n_classes=10, optimizer=optimizer)
```

Regardless of the values of the parameters, the maximum accuracy a linear model
can achieve on this dataset caps at around **93%**.

## Using explicit kernel mappings with the linear model
The relatively high error (~7%) of the linear model over MNIST indicates that
the input data is not linearly separable. We will use explicit kernel mappings
to reduce the classification error.

**Intuition:** The high-level idea is to use a non-linear map to transform the
input space to another feature space (of possibly higher dimension) where the
(transformed) features are (almost) linearly separable and then apply a linear
model on the mapped features. This is shown in the following figure:

![image](./kernel_mapping.png)
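
As a toy illustration of this idea (separate from the mapping used later in this
tutorial), consider 1-D points labeled by whether they lie in the interval
[-1, 1]. The two classes are not linearly separable on the real line, but the
explicit map \\(x \mapsto (x, x^2)\\) makes them separable by the horizontal
line \\(x_2 = 1\\):

```python
import numpy as np

# Toy example: the classes {|x| <= 1} and {|x| > 1} are not linearly separable
# in 1-D, but become separable after the explicit map x -> (x, x^2).
x = np.linspace(-3, 3, 200)
labels = (np.abs(x) <= 1).astype(int)
mapped = np.stack([x, x ** 2], axis=1)          # non-linear feature map
predictions = (mapped[:, 1] <= 1).astype(int)   # linear rule in the mapped space
print((predictions == labels).all())            # True
```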

**Technical details overview:** In this example, we will use **Random Fourier
Features** (introduced in the
["Random Features for Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf) paper by
Rahimi and Recht) to map the input data. Random Fourier Features map a vector
\\(\mathbf{x} \in \mathbb{R}^d\\) to \\(\mathbf{x'} \in \mathbb{R}^D\\) via the
following mapping:

$$
RFFM(\cdot): \mathbb{R}^d \to \mathbb{R}^D, \quad
RFFM(\mathbf{x}) = \cos(\mathbf{\Omega} \cdot \mathbf{x} + \mathbf{b})
$$

where \\(\mathbf{\Omega} \in \mathbb{R}^{D \times d}\\),
\\(\mathbf{x} \in \mathbb{R}^d,\\) \\(\mathbf{b} \in \mathbb{R}^D\\) and the
cosine is applied element-wise.

In this example, the entries of \\(\mathbf{\Omega}\\) and \\(\mathbf{b}\\) are
sampled from distributions such that the mapping satisfies the following
property:

$$
RFFM(\mathbf{x})^T \cdot RFFM(\mathbf{y}) \approx
e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2}}
$$

The right-hand side of the expression above is known as the RBF (or Gaussian)
kernel function. It is one of the most widely used kernel functions in Machine
Learning and implicitly measures similarity in a different (much higher
dimensional) space than the original one. See
[Radial basis function kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)
for more details.
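
To make the approximation concrete, here is a minimal NumPy sketch of the
mapping (illustrative only, not the tf.contrib.kernel_methods implementation;
it also includes the \\(\sqrt{2/D}\\) normalization from the Rahimi-Recht
construction, which makes the inner product an unbiased estimate of the kernel):

```python
import numpy as np

# Minimal sketch of Random Fourier Features approximating the RBF kernel.
d, D, sigma = 784, 2000, 5.0
omega = np.random.normal(scale=1.0 / sigma, size=(D, d))  # Omega_ij ~ N(0, 1/sigma^2)
b = np.random.uniform(0.0, 2.0 * np.pi, size=D)           # b_i ~ Uniform[0, 2*pi]

def rffm(x):
  # The sqrt(2/D) factor makes RFFM(x).RFFM(y) an unbiased kernel estimate.
  return np.sqrt(2.0 / D) * np.cos(omega.dot(x) + b)

x, y = np.random.rand(d), np.random.rand(d)
approx = rffm(x).dot(rffm(y))
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))
print(approx, exact)  # The two values should be close for large D.
```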

**Kernel Classifier:** `tf.contrib.kernel_methods.KernelLinearClassifier` is a
pre-packaged `tf.contrib.learn` estimator that combines the power of explicit
kernel mappings with linear models. Its API is very similar to that of the
LinearClassifier, with the additional ability to specify a list of explicit
kernel mappings to be applied to each feature used by the classifier. The
following code snippet demonstrates how to replace LinearClassifier with
KernelLinearClassifier.

```python
# Specify the feature(s) to be used by the estimator. This is identical to the
# code used for the LinearClassifier.
image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
optimizer = tf.train.FtrlOptimizer(
    learning_rate=50.0, l2_regularization_strength=0.001)

kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
kernel_mappers = {image_column: [kernel_mapper]}
estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)

# Train.
start = time.time()
estimator.fit(input_fn=train_input_fn, steps=2000)
end = time.time()
print('Elapsed time: {} seconds'.format(end - start))

# Evaluate and report metrics.
eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
print(eval_metrics)
```
The only additional parameter passed to `KernelLinearClassifier` is a dictionary
from feature columns to lists of kernel mappings to be applied to the
corresponding feature column. In this example, the lines

```python
kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
kernel_mappers = {image_column: [kernel_mapper]}
estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)
```
instruct the classifier to first map the initial 784-dimensional images to
2000-dimensional vectors using random Fourier features and then learn a linear
model on the transformed vectors. Note that, besides the output dimension, there
is one more parameter (stddev) involved. This parameter is the standard
deviation (\\(\sigma\\)) of the approximated RBF kernel and controls the
similarity measure used in classification. It is typically determined via
hyperparameter tuning.

Running the code above yields a loss of approximately **0.10** while the
accuracy is increased to approximately **97%** on eval data (an increase of more
than 4 percentage points over the plain linear model). The training time hovers
around 35 seconds. We can increase the accuracy further by increasing the output
dimension of the mapping and tuning the standard deviation.

**On the role of stddev:** The classification quality is very sensitive to the
value of the stddev parameter used to define the similarity measure between
pairs of input features. The following table shows the accuracy of the
classifier on the eval data for different values of stddev (for all experiments
the output dimension was fixed to 3000). The optimal value is stddev=5.0. Notice
how stddev values that are too small or too large can dramatically decrease the
accuracy of the classification.

stddev | eval accuracy
:----- | :------------
1.0    | 0.1362
2.0    | 0.4764
4.0    | 0.9654
5.0    | 0.9766
8.0    | 0.9714
16.0   | 0.8878
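
A simple sweep along the following lines can be used to run an experiment like
the one above (a sketch, not the exact code used to produce the table; it reuses
`image_column` and the input functions defined earlier, fixes the output
dimension to 3000, and varies only stddev):

```python
# Hedged sketch of a grid search over stddev. Exact accuracies will vary
# from run to run because training is randomized.
for stddev in [1.0, 2.0, 4.0, 5.0, 8.0, 16.0]:
  optimizer = tf.train.FtrlOptimizer(
      learning_rate=50.0, l2_regularization_strength=0.001)
  kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
      input_dim=784, output_dim=3000, stddev=stddev, name='rffm')
  estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
      n_classes=10, optimizer=optimizer,
      kernel_mappers={image_column: [kernel_mapper]})
  estimator.fit(input_fn=train_input_fn, steps=2000)
  metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
  print('stddev={}: eval accuracy={}'.format(stddev, metrics['accuracy']))
```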

**On the role of the output dimension:** Intuitively, the larger the output
dimension of the mapping, the more closely the inner product of two mapped
vectors approximates the kernel, which typically translates to better
classification accuracy. Another way to think about this is that the output
dimension equals the number of weights (per class) of the linear model: the
larger this dimension, the larger the "degrees of freedom" of the model.
However, after a certain threshold, higher output dimensions increase the
accuracy by very little (while still increasing the training time). This is
shown in the following two figures, which depict the eval accuracy as a function
of the output dimension and of the training time, respectively.

![image](./acc_vs_outdim.png)  ![image](./acc-vs-trn_time.png)

## Explicit kernel mappings: summary and practical tips
* Explicit kernel mappings combine the predictive power of non-linear models
with the scalability of linear models.
* Unlike traditional dual kernel methods, they can scale to millions or hundreds
of millions of samples.
* Random Fourier Features can be particularly effective for datasets with dense
features.
* The parameters of the kernel mapping are often data-dependent. Model quality
can be very sensitive to these parameters. Use hyperparameter tuning to find the
optimal values.
* If you have multiple numerical features, concatenate them into a single
multi-dimensional feature and apply the kernel mapping to the concatenated
vector (see the sketch below).