In Part I, I wrote that, in Rebel Cortex (RC), only the bottom level of the memory hierarchy handles concurrent inputs and that all sequences have a maximum capacity of seven nodes. I claimed that Numenta's HTM design is fundamentally flawed because, unlike the RC memory model, every level in Numenta's memory hierarchy handles both concurrent and sequential signals. I revealed the surprising source of my knowledge about the brain's memory architecture and I promised to answer two questions: a) why should only the bottom level of the memory hierarchy receive concurrent signals; and b) why should all sequences have a maximum capacity of seven nodes? So here are my answers, as promised.
Visual Recognition Is a Sequential and Predictive Process
Most visual recognition researchers have a more or less static view of vision. This is true even among those who profess a belief in the fundamental temporal nature of sensory learning and recognition. The usual assumption is that visual processors at the bottom level of the memory hierarchy recognize only small areas of the image and that, as one goes up the hierarchy, bigger and bigger areas are recognized in terms of lower level patches, and so on. At the top of the hierarchy, the entire scene is recognized all at once. Here is how Jeff Hawkins and Dileep George of Numenta describe (pdf) their hierarchical model:
The level 1 modules have small receptive fields compared to the size of the total image, i.e., these modules receive their inputs from a small patch of the visual field. Several such level 1 modules tile the visual field, possibly with overlap. A module at level 2 is connected to several adjoining level 1 modules below. Thus a level 2 module covers more of the visual field compared to a level 1 module. However, a level 2 module gets its information only through a level 1 module. This pattern is repeated in the hierarchy. Thus the receptive field sizes increase as one goes up the hierarchy. The module at the root of the tree covers the entire visual field, by pooling inputs from its child modules.

This seems to make perfect sense, and it would appear that this is the way it should work in the brain's visual cortex. But is this really what happens when we look at a scene? Do we really see an entire visual field in terms of smaller receptive fields? In my experience, it is simply not true. In fact, there is every reason to doubt the claim that the size of the visual receptive field changes at all as one goes up the hierarchy. Take the above paragraph as an example. Try as we may, we find it impossible to see a sentence all at once, let alone the entire paragraph or the entire computer screen. Even though our peripheral vision allows us to sense a big picture, our visual acuity is limited to a very small part of the visual field at a time, and the size of that small part is invariant. It is a fallacy that we see a big pattern in terms of dozens or hundreds of small patterns.
The above raises the question: how do we understand a visual scene if we can't combine small pieces into bigger pieces? I believe that visual understanding, like everything else, depends entirely on the temporal expectations that we have of our environment. In this light, visual recognition is not unlike speech or music recognition. A spoken sentence or a musical tune is a sequence of sequences. We never sense a whole sentence or a whole song at once, only snippets at a time. At the bottom level of the auditory hierarchy, we find many sequences of learned patterns, each of which consists of a set of concurrent signals generated by audio sensors tuned to different frequencies. All concurrent patterns reside at the bottom level. This is true of both the visual and auditory cortices.
So yes, of course, there is a hierarchy, but it is a hierarchy of sequences, not of patterns (see note below). The main difference between audio and visual recognition is that the latter requires us to move our gaze frequently from one spot of the visual field to another, i.e., from one pattern to another. If Hawkins et al. were correct, eye movements would rarely be necessary. Hawkins should know better, because the ability to make predictions is the cornerstone of his own theory of intelligence.
Note: There is a pattern hierarchy in the brain but it is used for learning and recognition purposes, not for enlarging the receptive field.
Sequences of Sequences
The two main functions of hierarchical memory are to classify sensory knowledge and to predict the future. Classification is needed in order to make predictions. It consists of grouping incoming sensory signals into various families and sub-families. There are only two kinds of signal families: concurrent and sequential. A concurrent family is what some call a pattern. Hawkins and others refer to it as a spatial pattern, but I think this is a seriously confusing misnomer since there is nothing spatial about it. From the point of view of making predictions, a pattern is a single, unique event in time.
In the Rebel Cortex model (see illustration below), a bottom level node (BLN) is just a pattern. Every upper level node (ULN) is a sequence of lower level nodes.
Rebel Cortex Memory Hierarchy
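To make the two node types concrete, here is a minimal sketch in Python of the structure just described. The class names PatternNode and SequenceNode, and the field names, are my own illustrative choices, not part of the RC design; this is a sketch of the idea, not an implementation of RC.

```python
# A minimal sketch of the Rebel Cortex hierarchy described above.
# PatternNode and SequenceNode are hypothetical names of my own choosing.

class PatternNode:
    """Bottom-level node (BLN): a set of concurrent sensor signals."""
    def __init__(self, signals):
        # Concurrent signals have no ordering, so a frozenset fits.
        self.signals = frozenset(signals)

class SequenceNode:
    """Upper-level node (ULN): an ordered sequence of lower-level nodes."""
    MAX_CAPACITY = 7  # the seven-node limit claimed in the post

    def __init__(self, children):
        assert len(children) <= self.MAX_CAPACITY, "sequence too long"
        self.children = list(children)  # order matters: a sequence in time

# Level 0 holds patterns only; every level above holds sequences.
p1 = PatternNode({"s1", "s2"})           # two sensors firing together
p2 = PatternNode({"s3"})
level1 = SequenceNode([p1, p2])          # a sequence of patterns
level2 = SequenceNode([level1, level1])  # a sequence of sequences
```

Note how only PatternNode ever touches raw sensor signals, which mirrors the claim that concurrent inputs are confined to the bottom level.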
It is easy to understand a sequence of patterns: each pattern can be seen as a single discrete signal, and the signals arrive one after another. But what is a sequence of sequences? Can two succeeding sequences overlap? I'll return to this important topic in an upcoming post.
Fundamental Building Blocks of Memory
In order to create as many combinations of sequences as possible, we should make sequences as short as possible, i.e., just two nodes each. The problem with short sequences, however, is that they require many levels in the hierarchy and thus slow down learning, recognition and prediction. At the other extreme, long sequences would be too coarse and would result in overlooked combinations. It is a good bet that the human cortex uses seven-node sequences. After all, psychology teaches us that human short-term memory has a capacity of about seven items. But why seven? I think seven is a compromise: fine-grained enough without being too taxing on the nervous system. However, I chose to use seven-node sequences in RC for a different reason altogether. My choice had to do with my being a Christian and, more specifically, with my ongoing research in decoding certain ancient Biblical metaphorical texts. There's more to come. Stay tuned.
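The trade-off between sequence length and hierarchy depth can be put in rough numbers. If each node at one level strings together k nodes from the level below, then a hierarchy of d levels spans about k^d bottom-level patterns, so the depth needed to cover n patterns grows like log base k of n. The helper below is my own back-of-the-envelope illustration, not anything from RC or HTM:

```python
import math

def levels_needed(total_patterns, seq_len):
    """Hypothetical estimate: levels required for sequences of length
    seq_len to span total_patterns bottom-level patterns, assuming each
    level multiplies coverage by seq_len."""
    return math.ceil(math.log(total_patterns, seq_len))

# Covering 10,000 bottom-level patterns:
two = levels_needed(10_000, 2)    # two-node sequences: 14 levels
seven = levels_needed(10_000, 7)  # seven-node sequences: 5 levels
```

Under these crude assumptions, seven-node sequences cut the hierarchy from fourteen levels to five, which illustrates why very short sequences would slow learning and prediction down.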
See also:
How Jeff Hawkins Reneged on his Own Principles
Invariant Visual Recognition, Patterns and Sequences
Missing Pieces in Numenta's Memory Model