Saturday, December 20, 2014

The Kernel Trick for the Layman

Kernel methods, for those who have been engaged in machine learning, statistics, data mining and the like, are a term that gets thrown around without a lot of understanding attached to it. Since I have some experience dealing with these, I thought I should do a write-up.

Kernel methods in machine learning are built around a methodology dubbed the "kernel trick". This is just fancy talk for saying that we use a round-about method to exploit kernel-based similarity measures on non-Euclidean data manifolds, which leads to the obvious question... 

What the hell did I just read? 

To answer that question, my good sir/madam, we need to look at three main things: what is a non-Euclidean data manifold, what is a kernel, and what is the kernel trick? Let's look at each of these very briefly. 

Imagine you have collected some data on, say, the traffic behavior of your nearby (sub)urban area. You have collected two fields: the day of the week and the amount of traffic flowing to and from the city (ingress or egress). Now, imagine trying to find a simple correlation. Assuming that the week starts from Sunday, you will see that on the first day of the week (i.e. Sunday) in the evenings, a lot of traffic will flow into the city, and on a Friday, a lot of traffic will flow out of it. The weekends and the rest of the weekdays will each have their own localized traffic behavior. Now, suppose you are asked to find a model that describes the correlation between the day of the week and the traffic flow. 

You will start by encoding the days as Sunday = 1, ..., Saturday = 7. And suppose you take the 'Euclidean distance' between the pairs of days that you consider (in this case, simply the numerical difference) to describe how dissimilar the two days are.  

Now, you will notice that the traffic on the days {1, 7} is actually very similar: Saturday and Sunday sit right next to each other in the weekly cycle. You will also notice that the distances of the pairs {1, 4} and {4, 7} are roughly equal, and each is smaller than the distance between 1 and 7. If the data behaved in a Euclidean manner, or simply, if it could be described adequately by a Euclidean space, adding the distances of {1, 4} and {4, 7} should give the distance of {1, 7}, making days 1 and 7 the most dissimilar pair of the three. Instead, they are the most similar. Therefore, we can see that this data is not behaving in a very Euclidean manner. 
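To see this concretely, here is a tiny sketch (the helper names and the encoding are mine, just for illustration): a distance that respects the weekly wrap-around behaves very differently from the plain numerical difference.

```python
# Encode Sunday = 1 ... Saturday = 7, then compare the naive 'Euclidean'
# distance with a cyclic one that respects the wrap-around from 7 back to 1.

def linear_distance(d1, d2):
    """Naive distance: just the numerical difference of the encodings."""
    return abs(d1 - d2)

def cyclic_distance(d1, d2, period=7):
    """Distance on a 7-day circle: day 1 and day 7 are adjacent."""
    diff = abs(d1 - d2) % period
    return min(diff, period - diff)

print(linear_distance(1, 7))   # 6 -- looks far apart
print(cyclic_distance(1, 7))   # 1 -- actually adjacent days
print(cyclic_distance(1, 4))   # 3
print(cyclic_distance(4, 7))   # 3
```

The cyclic distance captures what the traffic data is telling us, while the straight numerical difference does not.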

Next, let's talk about kernels. The kernels I am talking about here are kernel functions. For this, we first need to describe an 'embedding function'. Suppose you have a vector space, say X. Now, imagine a function φ(x), which maps the data from X to some Hilbert space H.

                                                   φ : X → H 

Note that this Hilbert space can have any dimensionality. Now suppose that you define a function, K, such that,
                                                   K : X × X → ℝ 
         
and, K(x, y) = φ(x)·φ(y) {i.e. the inner product in H} 

A function that satisfies these properties is called a kernel function. Now, those who are familiar with basic vector algebra will also know that, for vectors of comparable magnitude, the larger the inner product between two vectors, the more similar (the more aligned) they are.
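To make the relationship between φ and K concrete, here is a small sketch (the function names are mine; the feature map is the standard one for the degree-2 homogeneous polynomial kernel): the kernel computed directly in X gives exactly the inner product computed after explicitly embedding into H.

```python
import math

def phi(x):
    # explicit embedding of a 2-D point into a 3-D feature space H
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    # the matching kernel function: K(x, y) = (x . y)^2
    dot = x[0] * y[0] + x[1] * y[1]
    return dot * dot

def inner(u, v):
    # plain inner product of two same-length vectors
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)
print(kernel(x, y))             # 121.0
print(inner(phi(x), phi(y)))    # 121.0 -- same value, without visiting H explicitly
```

That agreement is the whole point: K lets us work with inner products in H while only ever touching points in X.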

Keeping these in mind, let's move on to the kernel trick.

What we do in the kernel trick is this: if we cannot find a good similarity measure in the initial data space, we use the embedding function to map the data into a Hilbert space where it may show Euclidean properties, and then find the similarities in that space. So how do we do this?


  1. We have a function that maps the data space into a Hilbert space (we have φ).
  2. We have a way to find the similarity (the inner product).
  3. Best of all, we have a way to find the similarities in the mapped Hilbert space without ever computing φ explicitly (the kernel function does both at the same time!).

Now all that remains is to use this in a model that we can train for machine learning. Prepare for the simplest and most elegant algorithm you've ever seen in your life!

Suppose you have two classes that you want to classify the data into. Let's encode these two classes as +1 and -1, you know, just because...

Now, all you need is the sign from the algorithm. So the predicted class will be,

           class = sign [ Σ K(x, xi) si wi ]

Let's see what this means. Here x is the data point or entry (vector) whose class you want to predict. K is the kernel function defined on the model, so the kernel term gives the 'similarity' between a stored point xi from the dataset and the point we need to check. si is the sign (+1 or -1) of the class of that stored point, and wi is the weight of that instance. This simply means that the more relevant and valid the stored point (by the weight), and the larger its similarity to our point (by the kernel), the more strongly its sign (by si) pulls the prediction towards its own class.
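Here is a minimal sketch of that prediction rule in Python. The particular kernel (an RBF kernel), the toy points, and the weights are my own assumptions, just to make it runnable; any valid kernel and any trained weights slot into the same formula.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    # one common kernel choice; returns a similarity in (0, 1]
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def predict(x, stored_points, signs, weights, kernel=rbf_kernel):
    """class = sign [ sum_i K(x, xi) * si * wi ]"""
    total = sum(kernel(x, xi) * si * wi
                for xi, si, wi in zip(stored_points, signs, weights))
    return 1 if total >= 0 else -1

# toy model: one stored point per class, equal weights
points  = [(0.0, 0.0), (3.0, 3.0)]
signs   = [+1, -1]
weights = [1.0, 1.0]

print(predict((0.5, 0.5), points, signs, weights))   # +1, closer to the +1 point
print(predict((2.5, 2.5), points, signs, weights))   # -1, closer to the -1 point
```

Notice that the model never needs the embedding φ at all; the kernel carries all the similarity information.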


This is the simplest idea behind kernel methods. There are many variations of this and how this is used. But for the sake of simplicity, I will stop at this.

Please feel free to leave a comment or a question if you need any clarifications... Adios!

Thursday, December 18, 2014

Using IPython in your Python Codes

After a while, I thought I should jot this down for anyone who might want to write a certain kind of Python script: one where you want to keep a live Python runtime, with all its state cached, around for a significant duration. One of the best tools for this is IPython.

For those familiar with Python programming, IPython needs no introduction. I will forgo the niceties and just cut to the chase for the uninitiated. IPython is a Python framework which provides an interactive, real-time programming experience for its users. It comes complete with an interactive shell, a Qt graphical user interface and a web interface: the famous IPython Notebook. Installation guides are all around the web, so a simple Google search will do.

However, imagine a situation where you have a certain piece of code running, say, a server (take Tornado or Django for instance), and you want to run your Python code in a dynamic manner: you want to create a variable during breakfast, use it as a count of your website's visitors, and then, during lunch, check how many visitors have come to your site. One approach would be to go full caveman and keep a thread running in the background, but is there anything we hate more than implementing a long-running thread behind an interactive interface?

IPython provides a simple, elegant solution for this. You can initiate a 'kernel' of IPython, and the kernel will act as a single, independent and contiguous runtime instance for Python. Imagine the possibilities. However, harnessing the power of this extremely useful tool is not easy for a beginner, given the lack of detailed documentation (for understandable reasons).

Now for the fun part: there is only one class that you have to import in your code to do this (note that I assume you already have the IPython framework installed on your computer). That is the KernelManager class provided by IPython. I'll boil it down to the simple steps you have to follow.


  1. Create a Kernel Manager
  2. Use the Kernel Manager to create a kernel instance and run it
  3. Use the Kernel Manager to create a client which communicates with the said kernel
  4. Open a channel between the client and the kernel using the client (four kinds of channels, go through the documentation for detailed information on them)
  5. Feed your code through the ShellChannel of the client. 

The sample code looks something like this. 

from IPython.kernel.manager import KernelManager

km = KernelManager()
km.start_kernel()

cl = km.client()
cl.start_channels()   # the channels must be started before use

chan = cl.shell_channel

# the statements need real newlines ('\n') between them
code_str = ("f = open('/home/yourname/yourfile.txt', 'w')\n"
            "f.write('the sample works')\n"
            "f.close()")

chan.execute(code_str)

Et voila... you have your code running in a live kernel!

If you need any clarifications, please ask below. I have an interview. Gotta go...

Monday, June 23, 2014

A quick intro to neural networks

I first heard about Neural Networks, as many of those in our generation, from "The Terminator"...

The Terminators had a 'neural net processor' as their brains, that allowed them to learn and reason as humans do. I was just fourteen at the time, and I had barely heard of what a neuron was. And well, it wasn't really catchy back then. 

Oddly enough, the damaged and recovered chip from the first Terminator movie looked rather delicious


About five years later, I found out that it's actually a thing... 

After spending a couple of years trying to understand what a neural network is, following multiple online courses and reading a big book, I decided it would be better studied once I was better versed in statistics and intelligent systems. Now that that is at least partly true, I thought I should help anyone who might be looking for a simple explanation of what a neural network is to get a quick helicopter view of what all the fuss is about. 

The Human Brain

Probably one of the most remarkable outcomes of natural selection and evolution, the human brain is a system of co-operating cells called 'neurons' that together form the ultimate problem-solving system. The brain of an average human being comprises a huge number of neurons (somewhere around twenty billion of them; 20,000,000,000 in digits...). Various animals have various numbers of neurons in their central information-processing area, which is dubbed the brain. However, there are animals with no "brain" and yet a system that acts on and reacts to the environment (sponges, for instance; yes, like Spongebob). 

Probably a gift from Plankton


The brain, as I said, is made of basic cells known as neurons. Neurons are specialized in dealing in electrochemical signals, which are passed from one neuron to another through junctions called synapses. A neuron has two categories of... arm-like thingies that allow the reception and transmission of those signals: one long arm called the axon, which is analogous to an 'output', and a bunch of input tentacles known as dendrites. The following diagram, which I took from the internet, is a simple enough representation of this (thank the owner... :) )


The Synapses are like the meeting points of dendrites and axons
While it is still open for scientific debate, as I hear, the common belief of how a neuron functions, at least as far as computer scientists are concerned, is this: when input signals arrive at its dendrites, each input signal is multiplied by a 'weight' given to each dendrite (actually, I'm guessing some kind of natural amplification of the signal happens according to the size and composition of each dendrite) and the results are summed. Only if the resulting charge/current exceeds a certain threshold value is the axon triggered to give an output. We can see how this fits with what we have heard about brain cells creating and breaking connections when we were children and learning really, really fast. So basically, the 'information' in our brains is carried by these electrical signals, and 'learning' happens by making and breaking connections, which is basically a correction of connection strengths to fit the information against some standard/metric. 

An Artificial Neuron - Trying to mimic the brain

Given the advanced power of information processing shown by natural neural networks (even a dust mite is more 'intelligent' than our smartest cognitive systems), it was only a matter of time before we tried to exploit the idea in our artificial systems. For this, a basic abstraction of the neuron is of great importance. Here's how it goes. 

An artificial neuron (a.k.a. a perceptron), not much unlike its natural counterpart, consists of a set of inputs (dendrites) and an output (axon). Each input has a weight attached to it. We can define a vector for these weights (let's be a bit imaginative and call it the weight vector), w. Now suppose we gather all the inputs (let's say we have k inputs) and put them in another vector, i. I will try to use a diagram to make it a bit more clear... 


Here, i stands for the input vector of the form i = a u + b v + c x + d y, where u, v, x, y are unit vectors and each w stands for a component of the weight vector...
Basically, what happens in the given perceptron is that scalar inputs of size a, b, c and d come into the perceptron along each of the arms. We can form an input vector from these inputs, as captioned in the diagram. Likewise, we have the weight vector w, made from the weight attached to each input.

Now we see that the output is a function of the input. In neural-network terminology, this is called an activation function. In a linear perceptron (we'll only talk about this particular kind of simple perceptron for the moment...), it is typically defined as follows.


                    f( i ) =   1    if w · i > t, for some scalar threshold value t

                               0    otherwise


As you can see, we fire an output of a binary 1 if the scalar product of the input vector and the weight vector is larger than some threshold value; this is a step activation function. (Also note that there are continuous functions which can operate as activation functions, such as the sigmoid (logistic) function, which does away with the hard threshold.) So this is how we mimic a neuron in a program.
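A linear perceptron with a step activation takes only a few lines of code. Here is a minimal sketch (the AND example and the particular weights and threshold are my own choices, just for illustration):

```python
def perceptron(inputs, weights, threshold):
    """Step-activation perceptron: fire 1 only if w . i exceeds the threshold."""
    activation = sum(w, )
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

# a perceptron computing the logical AND of two binary inputs
w, t = (1.0, 1.0), 1.5
print(perceptron((0, 0), w, t))  # 0
print(perceptron((0, 1), w, t))  # 0
print(perceptron((1, 0), w, t))  # 0
print(perceptron((1, 1), w, t))  # 1
```

With those weights, only the input (1, 1) pushes the weighted sum (2.0) over the threshold of 1.5.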


The Network

Now it comes down to networking these functional neurons. As you can see, there are several parameters that we have to decide for a neuron, namely: the weight vector (that is, the weight of each input), the activation function, the threshold value and the dimensionality of the weight vector (the number of inputs per neuron). In fact, you can see that the output depends on these parameters, and by changing their values we can change how a neuron behaves for a given input. Next comes the point where these neurons are connected to form a network.

There are many intricate types of neural networks. The simplest of them is called a 'feed-forward network'. As the name suggests, it keeps pushing the signals forward through the network until they come out as an output. (There are much more complicated and dynamic neural networks, and some pretty interesting derivations of them, such as Self-Organizing Maps and Growing Self-Organizing Maps. Refer to Kohonen's neural networks and the work of Prof. Damminda Alahakoon if you are interested.)

Let's take a simple feed forward network. It is going to look something like this.

This simple feed-forward net has three neural layers. The output from each layer is fed as input into the layer in front. 

Suppose you have two separate inputs and you want to classify them as 0 or 1 according to the input. Then the simplest neural network for this learning purpose is the one shown above. From the left, the two inputs come to the green neurons. The outputs of that layer (actually a two-dimensional vector) are fed into each neuron in the magenta layer. Lastly, the outputs of the magenta layer are fed to the black layer, which produces the output. We usually use the same activation function and threshold in each of the neurons; the weights, however, are what get adjusted to produce the desired output.

First, we feed what we call a training data set into the network: the inputs are fed in, and the outputs are taken with the network in its initial configuration (random input-arm weights and a user-defined number of layers/inputs per neuron). The training data also contains the expected outputs, so, using some metric, we calculate the accuracy of the predictions. Typically, there will initially be a high error between the expected and observed values. We then use some kind of state-space search to find a configuration of the network (typically, we only change the weights of the input arms, not the activation function or the number of neurons) that lowers the error. Then we use this new configuration to get the new errors, and change the configuration again! This goes on for a while until we find some configuration that we're happy with.
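That training loop can be sketched as a crude state-space search: try random weight configurations and keep whichever one lowers the error. This is a toy illustration only (the 2-2-1 network, the OR task and the pure random search are my own simplifications; real training uses far smarter search strategies):

```python
import random

def step(z, t=0.0):
    # step activation: fire only above the threshold
    return 1 if z > t else 0

def forward(x, w_hidden, w_out):
    """One pass through a 2-2-1 feed-forward network with step activations."""
    hidden = [step(sum(w * v for w, v in zip(ws, x))) for ws in w_hidden]
    return step(sum(w * h for w, h in zip(w_out, hidden)))

def error(w_hidden, w_out, data):
    # number of misclassified training examples
    return sum(1 for x, y in data if forward(x, w_hidden, w_out) != y)

def random_config():
    w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    w_out = [random.uniform(-1, 1) for _ in range(2)]
    return w_hidden, w_out

# toy task: learn logical OR
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
random.seed(0)

best_h, best_o = random_config()
best_err = error(best_h, best_o, data)

for _ in range(2000):                 # the 'state-space search'
    cand_h, cand_o = random_config()
    err = error(cand_h, cand_o, data)
    if err < best_err:                # keep configurations that lower the error
        best_h, best_o, best_err = cand_h, cand_o, err

print(best_err)
```

Even for a problem this tiny, the search needs many trials, which hints at why training real networks takes so long.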

Here, there are some things that we ought to be mindful about. First of all, we have to figure out a way to get some kind of feel for how many neurons we need in the network, the number of layers, or even the number of inputs per neuron. I am honestly quite new to the field, so I am yet to find any rules of thumb for choosing these. Also, there's the omnipresent issue of overfitting the training data! We must also have some kind of idea about just how much data we feed it during training!

The state-space search that I mentioned earlier can take many forms. One of my favorites is a genetic algorithm, but we can use simulated annealing as well. Usually it's good not to get stuck at a local optimum, which is why I prefer those two. And if we use genetic algorithms, we can manipulate not only the input weights but the other parameters, like the number of neurons, as well. But one thing is for sure: it will take a very, very long time to train a neural network for a specific problem.

Usually, neural networks are resilient to noisy data, and outliers rarely screw them up. However, given the complexity of things, Skynet and the Terminators are, as of yet, a mere dream!

Please leave your comments below, and have a merry time trying to code a neural network.. :D