How can Desirable Apps help your app recognise sound and images?
The good news is it is possible, but it is more difficult than most people realise.
A while ago another company asked me to quote, on their behalf, for a sound recognition app similar to the apps which give you the name and artist of a song after listening to a few seconds of music. (For confidentiality, I am not providing details of what the client requested.)
I provided a quote, but it was higher than the client expected. The client, naturally, solicited other quotes, which were also higher than they expected.
The reason: sound recognition and image recognition, activities which seem so simple and natural to us, are computationally incredibly difficult.
The best estimate I have seen for the amount of computing power required to create a silicon version of the human brain is 36 Petaflops, backed by 3 Petabytes of memory. Don’t worry, I also had to look up the meaning of the word Petaflop: one Petaflop is a thousand trillion floating-point operations per second, so 36 Petaflops works out at 36 thousand trillion operations per second.
About half of the brain is devoted to processing what your eyes see. Around a tenth of your brain is devoted to processing sound.
The best desktop computers perform at around 0.000001 Petaflops, or 1 billion operations per second. Mobile devices are a little slower: iPads, for example, perform around 100 million operations per second. To match the visual processing power of the human brain (half of 36 Petaflops) would require 180 million iPads, all linked together. Sound processing would be a little easier: you would only need 36 million iPads to do as good a job of understanding sound as your brain and ears can do (though of course someone would also have to write the software).
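The arithmetic above can be checked with a few lines of Python. This is only a sketch using the estimates quoted in this article (36 Petaflops for the brain, around 100 million operations per second for an iPad); the function name ipads_needed is my own, chosen for illustration.

```python
from fractions import Fraction

# Estimates quoted in the article, not measurements.
BRAIN_OPS_PER_SEC = 36 * 10**15   # 36 Petaflops
IPAD_OPS_PER_SEC = 10**8          # roughly 100 million operations per second


def ipads_needed(brain_share):
    """Number of iPads needed to match a given share of the brain's estimated power."""
    return int(brain_share * BRAIN_OPS_PER_SEC / IPAD_OPS_PER_SEC)


print(ipads_needed(Fraction(1, 2)))   # vision: 180000000
print(ipads_needed(Fraction(1, 10)))  # sound:  36000000
```

Exact fractions are used instead of floating-point numbers so the results come out as whole numbers of iPads.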
The human brain makes hearing and seeing seem easy, by throwing almost unimaginable amounts of computing power at the problem.
BUT there are apps which can read text and recognise sound – how can this fact be reconciled with what I just said about computation power?
The secret to solving the problem of image and sound recognition is to cut the problem down to size, by redefining the requirement in as narrow a way as possible.
For example, when recognising a song, instead of having to compare a sample of the song to billions of different sounds recorded in a human’s memory, the song is passed to a powerful server, which compares the music sample to at most a few thousand music tracks. By narrowing the range of sounds the computer is expected to be able to recognise, rather than expecting it to make sense of the full range of sounds we encounter in our daily lives, the computation problem is simplified to the point that the most powerful silicon computers can just about handle the task.
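To illustrate the idea of narrowing the search, here is a toy Python sketch. It is not how any real song recognition service works: I am assuming, purely for illustration, that each track has already been reduced to a small set of numeric "fingerprint" values, and the names best_match and CATALOGUE are hypothetical. The point is only that comparing a sample against a short, fixed catalogue is a small, well-defined job.

```python
def best_match(sample, catalogue):
    """Return (title, score) for the track sharing the most fingerprint values with the sample."""
    best_title, best_score = None, 0
    for title, fingerprint in catalogue.items():
        score = len(sample & fingerprint)  # number of fingerprint values in common
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score


# Hypothetical catalogue: made-up fingerprint values for three made-up tracks.
CATALOGUE = {
    "Track A": {101, 205, 309, 412, 518},
    "Track B": {101, 220, 333, 471, 590},
    "Track C": {150, 205, 360, 412, 777},
}

sample = {205, 412, 518, 999}          # values extracted from a short recording
print(best_match(sample, CATALOGUE))   # ('Track A', 3)
```

The work grows with the size of the catalogue, which is why limiting the list of candidate tracks keeps the task within reach of a server.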
Similarly, apps are not very good at interpreting images the way our eyes do, but they can recognise letters and symbols. By narrowing the problem down to 36 different symbols (26 letters and 10 digits), instead of expecting the app to make sense of any random image presented to it, apps and computers can handle reading text from images. They do it poorly and they make mistakes, but they can just about handle the job.
How long do we have to wait, before computers and mobile devices have similar computation abilities to humans? The answer, surprisingly, is not very long at all – decades rather than centuries. The reason – the power of computers and mobile devices is doubling every 18 months. Your iPhone 5, or your new Samsung Galaxy, is a far more powerful computer than last year’s model, let alone phones which were available a decade ago. Next year’s model will be more powerful still.
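The "decades rather than centuries" estimate can be sketched in Python. This assumes the figures quoted in this article (36 Petaflops for the brain, 100 million operations per second for an iPad) and assumes the 18-month doubling simply continues, which nobody can guarantee; the function name years_to_reach is my own.

```python
import math


def years_to_reach(target_ops, current_ops, doubling_years=1.5):
    """Years until current_ops grows to target_ops, doubling every doubling_years."""
    doublings = math.log2(target_ops / current_ops)
    return doublings * doubling_years


BRAIN_OPS_PER_SEC = 36e15   # article's estimate: 36 Petaflops
IPAD_OPS_PER_SEC = 1e8      # article's estimate: 100 million operations per second

years = years_to_reach(BRAIN_OPS_PER_SEC, IPAD_OPS_PER_SEC)
print(f"{years:.1f} years")  # between 42 and 43 years
```

A gap of 360 million times closes in about 28 doublings, which at 18 months each is a little over four decades.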
If you would like to discuss your image or sound recognition requirement, and how new computer capabilities might help you to solve your business requirement, please contact me at eworrall1@gmail.com.