Convolutional Neural Networks with a focus on Image Processing - Kelle Clark

A Mental Model for Convolutional Neural Networks with a focus on Image Processing

There is a need to process lakes and warehouses filled with data in order to make critical decisions about business, health, politics, and more. The solution requires systems that can seek out information effectively and quickly tell us what we need to know, not just what we want to know.

Neural network architectures and machine learning algorithms enable machines to sift through data, applying simple functions to identify patterns. The associations learned can then be interpreted and utilized. Here are the key definitions:

  1. A Neural Network refers to a sequence of instructions, programmed and carried out by processors, that execute step functions or sigmoid functions organized in layers, with each layer communicating its combined results to another layer either linearly or recursively, with the goal of recognizing patterns and building rules for categorizing/classifying information (a minimal sketch of a single neuron follows this list).

  2. Neural networks have an input layer and an output layer, but the architecture of a neural network can also include hidden layers, allowing for more in-depth, "deep learning" discovery. The information flow between these layers can be linear or cyclical. Feed-forward neural networks move information from a previous layer forward to the next layer, while recurrent neural networks orchestrate the receiving and sending of signals so that information can be passed back to a previous layer in a cyclical learning pattern.

  3. A Convolutional Neural Network (CNN) is a neural network with an added convolution layer placed before the input layer of the network proper. In general, the convolution improves the performance of the algorithm on large batches of information.
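
To make the first definition concrete, here is a minimal sketch of a single perceptron in Python. The weights and bias are hypothetical, chosen by hand purely for illustration; a real network would learn them from data.

```python
import numpy as np

def perceptron(x, w, b):
    # Step function: output 1 if the weighted sum plus bias is positive.
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias; training would adjust these.
w = np.array([0.6, -0.4, 0.9])
b = -0.5
print(perceptron(np.array([1.0, 0.0, 1.0]), w, b))  # 0.6 + 0.9 - 0.5 > 0 -> 1
```

A layer is just many of these evaluated in parallel on the same input, with each neuron's output forwarded to the next layer.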

Since our brains are remarkably good at making sense of large amounts of input, it stands to reason that computer scientists and mathematicians attempt to automate solutions to these tasks, taking inspiration from biological neural networks. In some use cases, humans already know the patterns they are interested in, or that exist within the data, and train a system to recognize those patterns. In the context of a system, there are known signatures of past attacks or bugs within similar systems, and as part of security it is essential that we efficiently monitor code and transactions for these signatures. However, attackers are creative and think of novel ways to breach the security of a system, including exploiting bugs in code. An ideal network security monitoring system would also be able to learn the difference between expected patterns and outlying ones, or even identify unintended behavior in code. A neural network could be used to identify unexpected behaviors of the system by taking as input random states, an action, and the resulting behavior to generate an association for classifying outlying behavior. This rule could then be used to monitor the states and behaviors of the system in real time.

Could we fuzz Pac-Man with a neural network to find the AI bug? https://strategywiki.org/wiki/Pac-Man/Tips

The need for a system to learn, or at least the ability to learn task-specific skills, involves narrow Artificial Intelligence (AI), or more specifically the methods of Machine Learning (ML). As with all AI technologies that allow processors to exhibit intelligence, machine learning is a collection of approaches that synthesize the learning process of humans in machines.

For now, it seems that one algorithm, the Convolutional Neural Network or CNN, is working overtime to extend the capabilities of machines to see and reason about the world as a human brain would.

This blog entry is part one of a non-technical overview of CNNs, recounting what I have gleaned so far from visiting the amazing resources I list at the end.

The basic idea of a neural network is to take an input stream of numeric data and generate a rule for cataloging the information that demonstrates an understanding gained from the learning process; i.e., when presented with new data, the rule will classify characteristics of the signal "close" to what a human would give. There are many different designs for neural networks intended to allow machines to see, hear, and communicate like humans; this entry focuses on CNNs applied to image processing because the pre-processing steps are a little more intuitive than those associated with, say, Natural Language Processing. The models take inspiration from modalities of human learning: learning information using a waterfall-like design, with the next concept building on the previous one after it has been mastered, or a cyclical approach that allows concepts to be revisited. As with every automated process, the design of a CNN must address the requirements of speed, effectiveness, and storage.

Ok, I said that this would be a high-level description, so here it goes. The discussion assumes that there is a large collection of data to be analyzed (thus the need to automate the process) and that there is an agreed-upon metric to measure performance.

To model the process, we start with the data itself (focusing on the image processing use case). The light making up an image at one point in time is sampled at regular intervals, from the top left of the image boundary to the bottom right, and the measurements are recorded in a 2D array (think an n x m matrix, where n and m are positive integers). This discrete representation of the image is an approximation of the real image and lies in our preset spatial domain. For black-and-white images, the entries are 0s and 1s; for grayscale images, the values are between 0 and 1 inclusive. Representing color images requires a much larger number of possible combinations of values and an increase in storage, from the one channel used for black-and-white or grayscale images to several channels. Specifically, using RGB notation (red, green, blue), we have three primary components, each using at minimum 8 bits. These values are typically scaled to [0, 1], and operations are applied to each of the three channels separately.
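
As a quick illustration in Python with NumPy (the pixel values below are made up, standing in for real samples), a grayscale image is a single n x m array, and a color image adds a third axis for the RGB channels:

```python
import numpy as np

# A stand-in 4x5 grayscale image: an n x m matrix with values in [0, 1].
gray = np.array([
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.8, 0.1],
    [0.1, 0.9, 1.0, 0.9, 0.1],
    [0.0, 0.1, 0.2, 0.1, 0.0],
])
print(gray.shape)  # (4, 5): n rows, m columns

# A color image adds a channel axis: n x m x 3 for (red, green, blue).
# 8 bits per component gives integers 0..255, typically scaled to [0, 1].
rgb_uint8 = np.random.default_rng(0).integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
rgb = rgb_uint8 / 255.0  # scale each channel to [0, 1]
print(rgb.shape, rgb.min() >= 0.0, rgb.max() <= 1.0)
```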

Now that we have an idea of how to visualize the representation of one type of data, we begin the overview of how to create a set of steps that could allow a machine to seemingly make decisions. There are three approaches in machine learning applied to collections of data: supervised, unsupervised, and semi-supervised methods. With supervised learning techniques, images need to be selected from the collection as a training set, labeled, and then fed into the algorithm to generate our association. The association can then be assessed with the remaining images. This approach comes with the upfront cost of curating a set of data that has already been classified/categorized in order to measure the recall and precision of the algorithm. Once trained, though, the algorithm can then be tweaked with less effort to learn on other data sets. There are certainly some use cases for language or image processing where this is appropriate. As mentioned in Unsupervised Image Classification: A Review, if the goal is for AI to consume data indiscriminately from the web and to discover patterns as a part of rule generation, an unsupervised method allows for the opportunity to discover these patterns and principal components with appropriate weights. With unsupervised methods, there is no need for a training set, but the requirements on time, space, and computing power are more restrictive. As expected, semi-supervised methods combine these two approaches in an attempt to strike a balance between the costs and benefits of the two.
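
Here is a small sketch of the supervised workflow's bookkeeping, with random stand-in labels and predictions in place of a real model, just to show how precision and recall are computed on the held-out images:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)   # hypothetical ground-truth labels

# Supervised learning: hold out part of the curated, labeled collection
# for training and assess the learned association on the rest.
train_idx, test_idx = np.arange(80), np.arange(80, 100)
train_labels = labels[train_idx]        # fed to the learner with the images

# Stand-in "predictions" on the held-out set (a real model would produce these).
predicted = rng.integers(0, 2, size=test_idx.size)
actual = labels[test_idx]

true_pos = np.sum((predicted == 1) & (actual == 1))
precision = true_pos / max(np.sum(predicted == 1), 1)  # of flagged, how many right
recall = true_pos / max(np.sum(actual == 1), 1)        # of real, how many found
print(f"precision={precision:.2f} recall={recall:.2f}")
```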

With any of the ML methods used on data, one piece of information usually translates to a huge array, and some pre-processing of these representations is necessary in order to improve efficiency and accuracy. Flashing back to our mental model: a presenter will begin a talk by introducing a few key concepts early in the slide deck to help the audience prepare for the later slides that develop those concepts. A convolution is such a pre-processing layer, with the purpose of identifying key components. A convolution is traditionally a binary function taking as input the data matrix and a helper matrix known as the kernel. The result of the operation between these two matrices is either a matrix of the same size as the image or a slightly smaller one (dimensionality reduction), depending on the choice of kernel. We can say that this function filters the image, with the idea of highlighting aspects of the image that we are interested in. Filters are used all the time in social media and advertising to soften or sharpen images.

Keeping this high level, I honestly think of this process in terms of a word search puzzle. My method is to start at the top left corner, look at a block of letters, and then slide my gaze to the right until the end of the puzzle. Then I pick back up with a block overlapping the very first one on the left side of the puzzle and continue to move left to right down the puzzle. In the case of the convolution operation, a kernel is a smaller matrix that operates on blocks of the image matrix (in the spatial domain), and the convolution layer is then finished off with a pooling operator. After the convolution layers are complete, the information can be taken from the spatial domain to the frequency domain by a (Fast) Fourier Transform, FFT, to make the computations easier; once the filtering in the frequency domain is done, the inverse FFT returns us to the spatial domain. Reference 5 below is a great blog on filtering with convolution in the spatial domain and on using FFTs to enable high- and low-pass filters on images.
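
Here is a minimal sketch of that sliding-window idea, followed by pooling, in Python. The 8x8 image and the edge-detection kernel are stand-ins for illustration; note that, like most CNN libraries, this slides the kernel without flipping it (technically cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (left to right, top to bottom),
    taking a weighted sum at each position: the word-search analogy."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Downsample by keeping the largest value in each size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8))              # stand-in 8x8 grayscale image in [0, 1]
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])  # a classic edge-detection kernel
features = conv2d_valid(image, edge_kernel)  # 6x6: slightly smaller, as noted above
pooled = max_pool(features)                  # 3x3 after 2x2 pooling
print(features.shape, pooled.shape)
```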

With or without a convolution layer, when applying a feed-forward neural network the ultimate goal is to create a rule for comprehending important features of the image, so we have to take a second to think about what we might consider relevant characteristics of an image. At first, it might seem like a great idea to look for specific shapes like a face or a chair. However, this is very limiting to the learning process, since we would like a way of looking at never-seen-before images that could be very different from any predetermined definition of a chair or a face; there are too many variations we would need to train on for that to be practical and effective. So instead, we will filter images by the patterns that are common to "similar" objects and that are invariant to the orientation of the object in an image. To this end, a neural network consists of one or more layers of perceptrons (or more complex sigmoid neurons). Each perceptron receives information as a vector, and the outputs of the perceptrons are then combined and sent to the next layer. More precisely, each layer consists of applying, in parallel, step functions using weights and a bias in the case of perceptrons, or sigmoid functions when using sigmoid neurons. Weights can be learned in this process by using a metric that measures correctness against the known answer, with sigmoid neurons offering more flexibility because small changes in input result in small changes in output. The metric can then be used to assign a cost when a decision is incorrect and to adjust the process with small tweaks so that it gradually gets better (lower cost). There can be hidden layers of perceptrons sandwiched between the input layer and the single perceptron in the output layer, depending on the desired complexity of the patterns to be discovered and the time available.
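
To tie these pieces together, here is a tiny, hypothetical feed-forward pass in Python: four inputs flow through three sigmoid neurons to a single output, and a quadratic cost measures how far the output is from the known label. The random weights are placeholders; training would repeatedly nudge them to lower the cost:

```python
import numpy as np

def sigmoid(z):
    # Smooth alternative to the perceptron's step function: small changes
    # in input produce small changes in output.
    return 1.0 / (1.0 + np.exp(-z))

# A hypothetical tiny network: 4 inputs -> 3 hidden sigmoid neurons -> 1 output.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)

x = np.array([0.2, 0.7, 0.1, 0.9])  # one input vector (e.g. four pixel values)
hidden = sigmoid(W1 @ x + b1)       # each hidden neuron: weights, bias, sigmoid
output = sigmoid(W2 @ hidden + b2)  # combined results feed the next layer

y = 1.0                             # known label for this training example
cost = 0.5 * (output[0] - y) ** 2   # quadratic cost: lower is better
# Training would adjust W1, b1, W2, b2 in small steps that lower this cost.
print(output, cost)
```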

References:

  1. A playground to visualize neurons and filters, https://playground.tensorflow.org/
  2. Another playground allowing you to see the effects of applying a kernel to an image as part of a convolution, https://playground.tensorflow.org/
  3. A guide to CNN use in Natural Language Processing (NLP) with instructions for building a Python program using the Keras API, https://towardsdatascience.com/nlp-with-cnns-a6aa743bdc1e
  4. A well-written e-book that carefully develops neural networks and walks through a small Python program using the random and numpy libraries, http://neuralnetworksanddeeplearning.com/index.html
  5. A good resource on low- and high-pass filters in the frequency domain, https://fairyonice.github.io/Low-and-High-pass-filtering-experiments.html
  6. A recent colloquium paper focusing on the effectiveness of deep learning neural networks that is on my "nightstand", https://www.pnas.org/content/117/48/30033
  7. An animation of convolutions and the canonical image of Lena Forsén, https://www.opto-e.com/basics/kernel