Saturday, December 20, 2014

The Kernel Trick for the Layman

For those who have been engaged in machine learning, statistics, data mining and the like, "kernel methods" is a term that gets thrown around a lot, often without much understanding behind it. Since I have some experience dealing with these, I thought I should do a write-up.

Kernel methods in machine learning are built around a methodology dubbed the "kernel trick". This is just fancy talk for saying that we use a round-about method to exploit kernel-based similarity measures on non-Euclidean data manifolds, which leads to the obvious question...

What the hell did I just read? 

To answer that question, my good sir/madam, we need to look at three main things: what a non-Euclidean data manifold is, what a kernel is, and what the kernel trick is. Let's look at each of these very briefly.

Imagine you have collected some data on, say, the traffic behavior of your nearby (sub)urban area. You have collected two fields: the day of the week and the amount of traffic moving to and from the area (ingress or egress). Now, imagine trying to find a simple correlation. Assuming that the week starts on Sunday, you will see that on Sunday evenings a lot of traffic flows out of the city, and on Friday evenings a lot of traffic flows out of the city as well, while the weekends and the rest of the weekdays each have their own localized traffic behavior. Now, suppose you are asked to find a model that describes the correlation between the day of the week and the traffic flow.

You will start by encoding the days as Sunday = 1, ..., Saturday = 7. And suppose you take the 'Euclidean distance' between the pairs of days you consider (in this case, simply the numerical difference) to describe the 'similarity' between two days.

Now, you will notice that the traffic on days 1 and 7 is actually very similar, since both are weekend days. You will also notice that the distances for the pairs {1, 4} and {4, 7} are equal (both are 3), and that each is smaller than the distance between 1 and 7 (which is 6). If this data behaved in a Euclidean manner, or simply, if it could be described adequately by a Euclidean space, then adding the distances of {1, 4} and {4, 7} should give something close to the dissimilarity between days 1 and 7. It does not: the encoding says that 1 and 7 are the furthest apart, while the traffic says they are among the most alike. Therefore, we can see that this data is not behaving in a very Euclidean manner.
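
To make this concrete, here is a tiny sketch in Python (the day codes are from the example above; the 'cyclic' distance is just my own illustrative stand-in for how the days actually behave, not something derived from real traffic data):

def euclidean_day_distance(a, b):
    """Plain numerical difference between two day codes (Sunday = 1, ..., Saturday = 7)."""
    return abs(a - b)

def cyclic_day_distance(a, b):
    """A distance that wraps around the week, so Sunday (1) and Saturday (7) end up close."""
    d = abs(a - b)
    return min(d, 7 - d)

for a, b in [(1, 4), (4, 7), (1, 7)]:
    print(f"days {{{a}, {b}}}: euclidean = {euclidean_day_distance(a, b)}, "
          f"cyclic = {cyclic_day_distance(a, b)}")

# euclidean: {1,4} -> 3, {4,7} -> 3, {1,7} -> 6   (3 + 3 = 6, as a straight line would demand)
# cyclic:    {1,4} -> 3, {4,7} -> 3, {1,7} -> 1   (the two weekend days are almost neighbours)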

Next, let's talk about kernels. The kernels I am talking about here are kernel functions. For this, we first need to describe an 'embedding function'. Suppose you have a vector space, say X. Now, imagine a function φ which maps the data from X to some Hilbert space H,

                                                   φ : X → H 

 Note that this Hilbert space can have any dimensionality. Now suppose that you define a function K such that
                                                   K : X × X → ℝ

and K(x, y) = ⟨φ(x), φ(y)⟩, i.e. the inner product of the embeddings in H.

A function that satisfies these properties is called a kernel function. Now, those who are familiar with basic vectors will also know that the larger the inner product between two vectors, the more similar they are.
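
To see a kernel function in action, here is a small toy sketch (my own example, not tied to any particular library) using the degree-2 polynomial kernel K(x, y) = (x·y)², whose embedding φ can still be written out explicitly in low dimensions:

import numpy as np

def phi(x):
    """Explicit embedding for the degree-2 homogeneous polynomial kernel in 2-D.
    Maps (x1, x2) from X into a 3-dimensional Hilbert space H."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2, computed entirely in the original space X."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product taken in H
print(poly_kernel(x, y))        # the same number, without ever visiting H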

Keeping these in mind, let's move on to the kernel trick.

What we do in the kernel trick is this: if we cannot find a good similarity measure in the original data space, we use the embedding function to map the data into a Hilbert space where it may show Euclidean properties, and then find the similarities in that space. So how do we do this?


  1. We have a function that maps the data space into a Hilbert space (we have φ).
  2. We have a way to find the similarity in that space (the inner product).
  3. Best of all, we have a way to find the similarities in the mapped Hilbert space directly: the kernel function does both steps at the same time, without ever computing φ explicitly! (See the sketch just after this list.)
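
As a sketch of why that last point matters, take the popular Gaussian (RBF) kernel: its φ maps into an infinite-dimensional Hilbert space, so you could never compute the embedding explicitly, yet the similarity it defines is a one-liner (the bandwidth gamma below is an arbitrary value chosen for illustration):

import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2).
    The corresponding phi lives in an infinite-dimensional Hilbert space,
    yet the inner product <phi(x), phi(y)> is just this expression."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

# Similarity between two (made-up) data points, computed without ever constructing phi.
a = np.array([1.0, 7.0])
b = np.array([1.2, 6.5])
print(rbf_kernel(a, b))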

Now all that remains is to use this in a model that we can train for machine learning. Prepare for the simplest and most elegant algorithm you've ever seen in your lives!

Suppose you have two classes that you want to classify the data into. Let's encode these two classes as +1 and -1, you know, just because...

Now, all you need from the algorithm is a sign. So the predicted class will be,

           class = sign[ Σi wi si K(x, xi) ]

Let's see what this means. Here x is the data point (vector) whose class you want to predict. K is the kernel function defined for the model, and the term K(x, xi) gives the 'similarity' between a point xi from the training set and the point we need to check. si is the sign (+1 or -1) of that training point, i.e. its class label, and wi is the weight of that instance. This simply means that the more relevant and valid a training point is (by its weight), and the larger its similarity to our point (by the kernel), the more strongly its sign (by the label) pulls the prediction towards that class.
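
Here is a minimal sketch of that predictor in Python (the training points, labels and weights are made-up placeholders; in a real kernel method such as an SVM or a kernel perceptron, the weights would come out of a training procedure):

import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel used here as the similarity measure K."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

def predict(x, train_points, signs, weights, kernel=rbf_kernel):
    """class = sign( sum_i  w_i * s_i * K(x, x_i) )"""
    total = sum(w * s * kernel(x, xi)
                for xi, s, w in zip(train_points, signs, weights))
    return 1 if total >= 0 else -1

# Made-up training set: two points of class +1 and two of class -1, equal weights.
train_points = [np.array([0.0, 0.0]), np.array([0.5, 0.5]),
                np.array([5.0, 5.0]), np.array([5.5, 4.5])]
signs   = [+1, +1, -1, -1]
weights = [1.0, 1.0, 1.0, 1.0]

print(predict(np.array([0.3, 0.2]), train_points, signs, weights))   # expect +1
print(predict(np.array([5.2, 5.1]), train_points, signs, weights))   # expect -1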


This is the simplest idea behind kernel methods. There are many variations of this idea and of the ways it is used, but for the sake of simplicity, I will stop here.

Please feel free to leave a comment or a question if you need any clarifications... Adios!
