Tuesday, October 29, 2013

Did Vicarious Achieve an AI Breakthrough?

Did Vicarious Solve the Cocktail Party Problem?

There is something a little weird about Vicarious's recent announcement, in which they claim to have developed a machine learning program that can solve CAPTCHAs at 90% accuracy. What is interesting, from my vantage point, is that some CAPTCHAs print irrelevant words behind the actual text to be recognized. PayPal, for example, displays copies of the word 'PayPal' behind the CAPTCHA text. Take a look at this CAPTCHA, which Vicarious claims to be able to solve.
Quick and dirty techniques such as thresholding can be used to reduce or eliminate background noise, but that would be cheating. Is this what Vicarious is using here? I don't know, but it would seem that thresholding would not work very well in this case unless you knew in advance what to look for. This means that, if we are to believe their claim, Vicarious's AI program is sophisticated enough to focus on certain things while ignoring others. Wow. Really? Are we to understand that Vicarious solved a visual analog of the cocktail party problem, which is essentially the ability to pay attention to one object within a group of many others? If the answer is yes, it would be a monumental breakthrough because this is one of the hardest unsolved problems in computer science. Even so, the question remains: how can the program tell which letters in the picture are relevant and which are not? There is something either fishy or missing in this story.
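For the curious, here is a minimal sketch of the kind of quick and dirty thresholding I have in mind, assuming the background words are printed in a lighter shade than the foreground letters. The file name and the threshold value are made up for illustration; I am not suggesting this is what Vicarious does.

# Rough sketch of grayscale thresholding, assuming the background clutter
# is lighter than the CAPTCHA letters. The threshold value is a guess.
from PIL import Image
import numpy as np

def strip_light_background(path, threshold=100):
    gray = np.array(Image.open(path).convert("L"))    # 8-bit grayscale image
    cleaned = np.where(gray < threshold, 0, 255)      # dark pixels -> black, the rest -> white
    return Image.fromarray(cleaned.astype(np.uint8))

# Hypothetical usage:
# strip_light_background("paypal_captcha.png").save("cleaned.png")

Notice that this trick only works if you already know that the clutter is lighter than the text, which is exactly the kind of advance knowledge I am talking about.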

See Also:

Vicarious Has a Winner

Vicarious Wakes up with a Bang

It has been a while since we last heard anything from Vicarious. I was beginning to wonder if the company had fallen asleep. Then, suddenly, out of nowhere, they announced that they have a machine learning program that can solve CAPTCHAs, the sometimes hard-to-read letter puzzles that are meant to ward off those pesky computer bots. That's a rather dramatic awakening, I would say. Although I do not agree with Vicarious's Bayesian or probabilistic approach to AI, I have to admit that this is very impressive.

A Few Observations

There are a few things about this new development that intrigue me. First of all, why didn't Vicarious host an online demo somewhere in the cloud and release a free app that others could use to test their claim? How hard would that be? It would have added some meat to the sauce, so to speak. Second, and this is more a question than a criticism, my understanding is that the recursive cortical network (RCN) works best with moving pictures. It is hard to imagine how it learns from static pictures. Third, Vicarious's CEO, D. Scott Phoenix, claimed that RCN needs fewer than 10 training examples per letter, whereas other visual recognition programs require thousands of examples. This is truly amazing and, if true, it tells me that they must have figured out an efficient way to do invariant pattern recognition.

Why I Still Don't Think Vicarious Is on the Right Track

Yes, I still think that the Bayesian approach to AI is a red herring. Vicarious's CTO and co-founder, Dileep George, is convinced that intelligence is based on probabilistic math. I believe that neither human nor animal intelligence uses probability for reasoning, prediction or planning. We are cause/effect thinkers, not probability thinkers. The brain has a fast and effective way of compensating for the uncertain or probabilistic nature of the sensory stream by filling in any missing information and filtering out the noise. I see essentially two competing models. The Bayesian model assumes that the world is inherently uncertain and that the job of an intelligent system is to calculate the probabilities. The Rebel Science model, by contrast, assumes that the world is perfect and that the job of the intelligent system is to discover this perfection.

In Secrets of the Holy Grail, I wrote, "nobody can rightfully claim to understand the brain’s perceptual learning mechanism without also knowing exactly what the brain does during sleep and why." I'll say it again. If the guys at Vicarious don't know why the brain's neural network needs sleep, then they are not doing it right.

See Also:

Did Vicarious Achieve an AI Breakthrough?
The Second Great Red Herring Chase
Vicarious Systems' Disappointing Singularity Summit Talk
The Myth of the Bayesian Brain
The Perfect Brain: Another Nail in the Coffin of the Bayesian Brain

Friday, October 11, 2013

The Problem with Speech Recognition Models

Abstract

I recently read a paper (by way of PhysOrg) about a new speech recognition model called Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition. The good news is that HCNF outperforms existing models. The bad news is that it does not come close to solving any of the pressing problems that plague automatic speech recognition. Problems like noise intolerance and the inability to focus on one speaker in a roomful of speakers continue to vex the best experts in the field. While reading the paper, I realized that the main problem with speech recognition models is that they exist in the first place. I will argue that the very fact that we have such models is at the root of the problem. Let me explain.

Not Good Enough

Wouldn't it be nice if we could talk directly to our television sets? We should be able to say things like, "turn the volume down a little" or "please record the next episode of Game of Thrones". Indeed, why is it that we can talk to our smartphones but not to our TVs? The reason is this: current speech recognizers are pretty much useless in the presence of noise or multiple voices speaking at the same time. The sounds coming from the TV alone would confuse any state-of-the-art recognizer. Sure, we could turn the remote control into a noise-reducing microphone and hold it close to the mouth when speaking, but that would defeat the purpose of having a hands-free and intuitive way of interacting with our TVs. What we need is a recognizer that can focus on one or two voices in the room while ignoring everything else, including other voices, e.g., from children, pets or guests, or from the TV. A good TV speech recognizer should respond only to the voices of those it was instructed to pay attention to. It should also ignore any conversation that does not concern its function. Unfortunately, these capabilities are way beyond what current speech recognition technology can achieve, and there are no solutions in sight.

Limited Domain

I am arguing that speech recognition models are not up to the task simply because they are limited-domain models, i.e., they only work with speech. But why shouldn't they, you ask? It is because the brain does not use a different representation or learning scheme for different types of knowledge. To the brain, knowledge is knowledge, regardless of whether its origin is auditory, tactile or visual. It does not matter whether it has to do with language, music, pictures, food, houses, trees, cats or what have you. The cortical mechanism that lets you recognize your grandmother's face is not structurally different from the one that lets you recognize your grandmother's name. A good speech recognition model should be able to learn to recognize any type of sensory data, not just speech. It should also be able to recognize multiple languages, not just one. And why not? If the human brain can do it, a computer program can do it too, right? After all, it is just a neural mechanism. But then the model would no longer be a speech recognition model; it would be a general perceptual learning model.

There Is a Pattern to the Madness

The brain learns by finding patterns in the stream of signals that it continually receives from its sensors. The origin of the signals does not matter because a signal arriving from an audio sensor is no different from a signal arriving from a light detector. It is just a transient pulse, a temporal marker that signifies that something just happened. This raises the question: how can the brain use the same model to learn different types of knowledge? In other words, how does the brain extract knowledge from a stream of unlabeled sensory pulses? The answer lies in the observation that sensory signals do not occur randomly. There is a pattern to the madness. In fact, there are millions of patterns in the brain's sensory stream. The key to learning them all has to do with timing. That is, sensory signals can be grouped and categorized according to their temporal relationships. It turns out that signals can have only two types of temporal relationships: they can be either concurrent or sequential. The learning mechanism of the brain is designed to discover those relationships and recognize them every time they occur. This is the basis of all learning and knowledge.
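To make the idea concrete, here is a toy sketch in Python of how a stream of unlabeled sensory pulses, each just a sensor id and a timestamp, could be sorted into concurrent and sequential relationships. The coincidence window and the sample pulses are illustrative assumptions on my part; this is not the actual code of the Rebel Speech recognizer.

# Toy sketch: classify pairs of sensory pulses as concurrent or sequential.
# The 10 ms coincidence window is an assumed tolerance, not a claim about the brain.
from itertools import combinations

COINCIDENCE_WINDOW = 0.010  # seconds

def relate(pulses):
    # Each pulse is a (sensor_id, timestamp) pair; order them by time first.
    ordered = sorted(pulses, key=lambda p: p[1])
    relations = []
    for (id_a, t_a), (id_b, t_b) in combinations(ordered, 2):
        kind = "concurrent" if abs(t_b - t_a) <= COINCIDENCE_WINDOW else "sequential"
        relations.append((id_a, id_b, kind))
    return relations

# Hypothetical pulses from two sensors:
print(relate([("sensor_3", 0.000), ("sensor_7", 0.002), ("sensor_3", 0.150)]))

The point is that nothing in this grouping step cares whether the pulses came from a microphone or a camera; the same two relationships cover every sensory modality.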

The Holy Grail of Perceptual Learning

Many in the business assume that the cocktail party problem is relevant only to speech recognition. In reality, it is a problem that must be solved for every type of sensory phenomenon, not just speech sounds. Humans and animals solve it continually when they shift their attention from one object to another. The brain's ability to pay attention to one thing at a time is the holy grail of perceptual learning. In conclusion, let me reiterate that we don't need different models for visual and speech recognition. We need only one perceptual learning model for everything.

PS. I am continuing to write code for the Rebel Speech recognizer, incorporating the principles of perceptual learning that I have written about on this blog. I am making steady progress and will post a demo executable as soon as it is ready. Hang in there.