You’re Probably Seeing Something Like This

Top: RAISR algorithm at run-time, applied to a cheap upscaler’s output. Bottom: Low-res original (left), bicubic upsampler 2x (middle), RAISR output (right).

A recent blog post from Google Research described an important new technique for increasing image sharpness; the tech press was abuzz about its similarity to the fictional software on TV shows like CSI, which can magically generate a crisp rendition of the numbers on a license plate photographed from miles away. But this simplification overlooks a more complicated and much more interesting story: how machine learning is leveraging probability to mimic the eye-brain system’s ability to create visual reality.

The software prototype, called RAISR (Rapid and Accurate Image Super-Resolution), applies machine learning to upsampling, a set of computing techniques developed over the past few decades for producing larger, higher-quality images from low-resolution originals.

Existing methods for upsampling typically fill in missing pixel values by applying simple, linear combinations of nearby existing pixel values. These methods are fast, but not especially effective at bringing out image detail.
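To make the contrast concrete, here is a bare-bones sketch of that kind of linear interpolation: a 2x bilinear upscaler, a simpler cousin of the bicubic upsampler shown in the comparison above. Every new pixel is just a weighted average of its four nearest low-res neighbors, which is why fine detail comes out soft.

```python
import numpy as np

def bilinear_upsample_2x(img):
    """Double an image's resolution by linearly blending nearby
    existing pixel values -- the fast, simple style of upsampling
    that RAISR improves on."""
    h, w = img.shape
    out = np.zeros((2 * h, 2 * w), dtype=float)
    for y in range(2 * h):
        for x in range(2 * w):
            # Map the output coordinate back onto the low-res grid.
            sy, sx = y / 2.0, x / 2.0
            y0, x0 = int(sy), int(sx)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            fy, fx = sy - y0, sx - x0
            # Each new pixel is a weighted average of 4 neighbors.
            out[y, x] = (img[y0, x0] * (1 - fy) * (1 - fx)
                         + img[y0, x1] * (1 - fy) * fx
                         + img[y1, x0] * fy * (1 - fx)
                         + img[y1, x1] * fy * fx)
    return out

lowres = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
hires = bilinear_upsample_2x(lowres)  # 4x4 result, edges blurred
```

The averaging is what makes these methods fast, and also what washes out sharp edges: a hard black-to-white transition becomes a smooth gray ramp.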

RAISR is trained on pairs of low-res and high-res images to find filters that, when applied selectively to each pixel of the low-res image, will recreate details of comparable quality to the original. The filters are trained to detect brightness and color gradients and other edge characteristics in small patches of an image. The learned filters – hundreds of them – can be generated in about an hour from a database of 10,000 high- and low-resolution image pairs.

What’s cool is that at runtime, RAISR selects and applies the most relevant filter from the list of learned filters to each pixel neighborhood in the low-resolution image. Machine learning also enables the RAISR code to get rid of aliasing artifacts such as moiré patterns and jagged edges. To be clear, RAISR isn’t “restoring” lost data to faithfully recreate an original image. The details it adds, known as hallucinations in image processing lingo, are the system’s best guesses, and nothing more.
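The runtime idea can be sketched in a few lines. To be clear about assumptions: real RAISR hashes each patch by gradient angle, strength, and coherence into hundreds of buckets of trained filters; the four angle-only buckets and the randomly initialized stand-in filters below are illustrative placeholders, not the actual learned values.

```python
import numpy as np

# Stand-ins for learned filters, one per gradient-orientation bucket.
# (Assumption for illustration: real RAISR learns hundreds of filters
# from low-res/high-res training pairs.)
N_BUCKETS = 4
rng = np.random.default_rng(0)
filters = rng.normal(size=(N_BUCKETS, 3, 3))

def bucket_for_patch(patch):
    """Quantize the patch's dominant gradient orientation into a bucket."""
    gy, gx = np.gradient(patch)
    angle = np.arctan2(gy.sum(), gx.sum()) % np.pi  # orientation in [0, pi)
    return int(angle / np.pi * N_BUCKETS) % N_BUCKETS

def apply_raisr_style(img):
    """For each interior pixel, select the filter that matches the
    local gradient bucket and apply it to the 3x3 neighborhood --
    the per-pixel filter selection described above."""
    h, w = img.shape
    out = img.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            f = filters[bucket_for_patch(patch)]
            out[y, x] = float((patch * f).sum())
    return out
```

The key design choice is that filter *selection* is a cheap hash lookup rather than a neural-network inference, which is what makes the method fast enough to run on a phone.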

For Google, the biggest gain comes from saving bandwidth, which is crucial when delivering images to mobile devices, where users may have bandwidth limits or simply prefer a snappier image refresh. By scaling down images before sending them out, then applying RAISR to scale them back up on the receiving end, Google can send significantly less image data without a visible degradation in the user experience. For instance, it’s currently applying the algorithm to more than a billion images per week in Android device streams, reducing their total bandwidth requirements by more than a third.

So what does this have to do with the way our eye-brain system creates reality?

The human brain consumes more than a fifth of the body’s energy, and about a third of the brain’s volume is devoted to vision, so this structure must confer huge evolutionary advantage. Human vision is a lot more complex than simply measuring the brightness and color of incoming light, the way a camera or photometer would. And as optical illusions show, it is easy to break the connection between our perceptions and the physical world.

To understand this complexity, neuroscientists have tried to model the rules by which the eye interprets incoming light, in particular given the survival value of real-time image processing – is that a snake in the grass or just a twig?

Among the first to synthesize these rules was Donald Hoffman, a cognitive scientist at UC Irvine, whose 1998 book, Visual Intelligence: How We Create What We See, provides a detailed description of how we construct our world from incomplete and constantly changing input. As Hoffman states, “We are finite creatures without the memory to store countless sentences or images, so learning a language or learning to see can’t just be a matter of storing sentences or images. It must instead be a matter of acquiring a finite set of rules that endow an infinite capacity.”

Hoffman’s 35 rules start with the simple “Always interpret a straight line in an image as a straight line in 3D” and grow increasingly complex, e.g.: “Interpret the highest luminance in the visual field as white, fluorent, or self-luminous.” Building from this set of rules, Hoffman shows that rather than being a passive recorder of a preexisting world, the eye actively constructs every aspect of our visual experience. The crucial concept here is computational efficiency – as we move through the natural world, the only way for our brains to make sense of the firehose of incoming sensory data is to apply rules that divide and conquer.

But here’s where it gets interesting. Even this rules-based system isn’t fast enough to deal with the real world of plants and animals, predators and prey. The eye-brain system doesn’t start with the photons that hit the retina, then build up from there to a picture of the surrounding world. In fact, it’s the other way around: our perceptual mechanisms are predictive. The brain constructs a model of what it expects to find from moment to moment in the real world, then subsequently validates (or invalidates) that model based on input from the eyes.

Naturally, the world your brain predicts depends on everything in your neurophysiology, experience, and memory. So, when a software algorithm uses probability to create a particular visual sensation, remember, the same thing is happening all the time in your brain.

Links:

Google Research Labs blog post on RAISR

Visual Intelligence: How We Create What We See, by Donald Hoffman