Continuing with deep learning, here I will briefly describe what I learned about convolutional neural networks (CNNs). If you understand the basics of a simple 2-layer fully connected network and can implement it yourself from scratch, you are all set to understand the mighty daddy (i.e. the CNN). Again, it is important to understand that a CNN, like any neural network, is yet another way to model a function from sample input-output pairs.
You may like to refer to my earlier post on implementing a toy-neural network.
Before we get into the technicalities of convolutional networks, let me introduce the famous work of Yann LeCun, who pioneered convolutional networks trained with back-propagation. Have a look at this 1993 video on YouTube and read the video description to get an idea. It is also worth noting that his MNIST dataset is an extremely popular dataset for playing with neural nets. Before you even read about convolutional nets, I would recommend downloading this dataset and displaying a few of the images with imshow. Another popular toy dataset is CIFAR-10. To get properly motivated, one could also try a 2-layer fully connected network (similar to toy-neural-net) on these datasets.
Moving on to convolutional networks. Basically, there are two fundamental building blocks to understand: a) the convolutional layer and b) the pooling layer.
A) Convolutional Layer
The convolutional layer takes a volume as input (a tensor of dimension 3) of size width x height x input_channels and produces an output volume of size width x height x output_channels. The numbers of input and output channels may or may not be the same; usually they are different. The trainable kernel K is a 4D tensor (k1 x k2 x input_channels x output_channels), where k1 and k2 are usually small numbers like 3, 5, or 7. For those who understand the convolution operation, the tensor K contains input_channels * output_channels kernels, each of size k1 x k2. To visualize this operation further, have a look at the convolution demo in the CS231n @ Stanford lecture notes.
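To make the shapes concrete, here is a minimal NumPy sketch of the forward pass of such a layer. It is a naive illustration, not how any toolkit implements it. I assume a (height, width, channels) memory layout, stride 1, no padding and no bias, so the output comes out slightly smaller than the input. Like most deep learning code, it does not flip the kernel (strictly speaking it is cross-correlation). The function name `conv2d_forward` is my own.

```python
import numpy as np

def conv2d_forward(x, K):
    """Naive convolution layer forward pass (stride 1, no padding, no bias).

    x: input volume, shape (height, width, in_channels)
    K: kernel tensor, shape (k1, k2, in_channels, out_channels)
    returns: output volume, shape (height - k1 + 1, width - k2 + 1, out_channels)
    """
    h, w, c_in = x.shape
    k1, k2, k_in, c_out = K.shape
    assert c_in == k_in, "kernel and input channel counts must match"
    out_h, out_w = h - k1 + 1, w - k2 + 1
    out = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k1, j:j + k2, :]  # shape (k1, k2, in_channels)
            # Multiply the patch with K and sum over the k1, k2 and
            # in_channels axes, leaving one number per output channel.
            out[i, j, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return out

# Example: a 5x5 RGB-like input and four 3x3 kernels per input channel.
x = np.ones((5, 5, 3))
K = np.ones((3, 3, 3, 4))
out = conv2d_forward(x, K)  # shape (3, 3, 4)
```

Every output value here sums over a k1 x k2 window across all input channels, which is why K needs input_channels * output_channels small kernels.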
Another important thing to grasp here is strides and padding. Practical implementations of convolutional networks use either overlapping or non-overlapping convolutions, and this is set with the stride. For example, stride=1 with a kernel larger than 1×1 gives overlapping convolutions, while a stride equal to the kernel size gives non-overlapping ones. Padding is often used to make sure the input and output have the same spatial dimensions. For example, setting padding=1 with a 3×3 kernel and stride 1 ensures that the output of the convolution has the same spatial size as the input. A nice animation of this concept is available here.
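The effect of stride and padding on the output size follows the usual formula, floor((n + 2*padding - k) / stride) + 1 along each spatial dimension. Here is a small sketch; the helper name `conv_output_size` and the example sizes are my own, and it assumes the same padding on both sides.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one spatial dimension.

    n: input size, k: kernel size, padding: zeros added on each side.
    """
    return (n + 2 * padding - k) // stride + 1

same = conv_output_size(32, 3, stride=1, padding=1)    # 32: size preserved
shrunk = conv_output_size(32, 3, stride=1, padding=0)  # 30: no padding
tiled = conv_output_size(32, 2, stride=2, padding=0)   # 16: non-overlapping
```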
B) Pooling Layer
Very often a max-pooling layer is used. This is basically down-sampling by retaining the maximum value in each neighbourhood. Max-pooling is popular because it is non-linear, quick to compute, and easy to back-propagate through (the gradient simply flows to the position of the maximum). This layer is often placed after one or several convolutional layers to reduce computation.
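As an illustration, here is a minimal NumPy sketch of 2×2 max-pooling with stride 2. I assume a (height, width, channels) layout and an even height and width; the function name `max_pool_2x2` is my own.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on a (height, width, channels) volume.

    Assumes height and width are even.
    """
    h, w, c = x.shape
    # Split each spatial axis into (number of blocks, 2), then take the
    # maximum inside every 2x2 block, independently for each channel.
    blocks = x.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4, 1)
pooled = max_pool_2x2(x)  # shape (2, 2, 1)
```

The spatial size halves along each axis while the number of channels stays the same, so every later layer has a quarter of the positions to process.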
This is just a basic overview. As a next step, I would highly recommend thoroughly reading the CS231n @ Stanford lecture notes on convolutional networks. After that, I would suggest reading chapters 6, 9, 7, and 8 (in this order) of the deep-learning-book (available for free online) for a more in-depth motivation and understanding of these concepts.
One can implement a basic LeNet (see the CS231n notes to learn what LeNet is) from scratch and try running it on the MNIST dataset. For reference, one may have a look at my implementation, available here.
Having done this, one might realize that this way of working is not flexible enough and will be very slow on larger networks and larger datasets. You might also want to use GPUs to speed up training. One way would be to use CUDA and implement the convolution layer, the max-pooling layer, and their derivative computations as CUDA functions. But this is exactly what the several available deep learning toolkits provide.
Some of the popular toolkits are Caffe, TensorFlow, Torch, Theano, etc. A comparison list is maintained on Wikipedia. These toolkits usually have some way to specify a network architecture. They have at least convolution and max-pooling layers implemented on GPUs (both forward and backward pass), along with several variants of stochastic gradient descent. Common ones are Adam, Adagrad, Adadelta, and RMSProp. There is an easy-to-understand and concise blog post which explains the workings of Adam, Adagrad, etc.
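To give a flavour of what these optimizers do, here is a rough NumPy sketch of a plain SGD step next to an Adam step. It follows the update rule as usually stated for Adam. The function names, default hyperparameters and the toy objective f(w) = w^2 are my own choices for illustration, not any toolkit's API.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Plain gradient descent: move against the gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running averages of the gradient and
    the squared gradient; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimise f(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.05)
```

The main difference from plain SGD is that Adam scales each parameter's step by a running estimate of its gradient magnitude, so the learning rate is adapted per parameter.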
My Take on Toolkits
Initially, I started with Berkeley's Caffe, in particular its Python bindings. It is fairly easy to learn: one can follow the examples on their website and possibly get it all working in, say, 2-3 days. I soon realized, though, that it was not easy to make changes to it. I must admit it is just perfect if all you want to do is object classification. However, let me warn you: it is not as professionally maintained as TensorFlow and its documentation is severely lacking. I reckon that, with the increasing popularity of TensorFlow, Caffe is going to die a slow death.
One thing that might be confusing at first with Caffe is that it uses two .prototxt files: one to define the SGD parameters such as the learning rate, and another to define the network architecture. It recommends storing the image data in the LMDB or HDF5 file format for bulk loading of images. These formats exist only to make loading fast. If you are just a beginner, they can create unnecessary hurdles. It is also possible to give it inputs from a file. You might want to have a look at my Caffe implementation (pycaffe) here.
Google's TensorFlow is fundamentally very different from Caffe. It lets you define a computational graph, can compute the derivatives of this graph, and does the gradient descent for you. It is extremely flexible with respect to the operations you want to perform and gives you a lot of control over your computations. Needless to say, it is extremely professionally maintained and is currently used inside and outside Google on some very cool projects.
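To get a feel for the idea of "define a graph, then differentiate it", here is a toy sketch in plain Python. It is a hand-rolled illustration of reverse-mode differentiation on scalars and has nothing to do with TensorFlow's actual API. The names `Node` and `backward` are my own.

```python
class Node:
    """A scalar value in a computational graph that remembers its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Set node.grad = d(output)/d(node) for every node feeding into output."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk so that inputs are listed before their consumers.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    for node in order:
        node.grad = 0.0
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# Toy usage: fit w so that w * x matches y, by gradient descent.
x, y = Node(3.0), Node(6.0)
w = Node(0.0)
for _ in range(50):
    err = w * x - y
    loss = err * err
    backward(loss)
    w = Node(w.value - 0.01 * w.grad)
```

Here you only ever write the forward computation (the loss); the derivatives come from walking the graph backwards, and w settles close to 2 on this toy problem.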
However, beware: it has a steep learning curve, possibly about 10-15 days, and you had better know the background theory, as it will come in handy when you build a computational graph yourself. It is only about 1 year old (launched in Nov 2015), so a lot of things change rapidly. I would recommend following their tutorial and having a look at their cifar10 example code. Did I mention that the best way to use TensorFlow is through its Python interface? Oops. It is a good idea to brush up on Python and NumPy.
In addition to TensorFlow's official tutorials and examples, there are more tutorials elsewhere. Awesome TensorFlow is a very resourceful list of things related to TensorFlow. You may also like to follow my example on cifar10 using TensorFlow.
That is it for this post. Further ahead, I plan to talk about my research work, after which I might explore recurrent networks and talk about them. Until then, happy deep learning!