Capsule Theory – A Boon for Deep Learning or Distraction?

Written by Tarry Singh · 5 min read
Capsule Networks are throwing CNN out of the window

UPDATE: Webcast was a huge success!

Huadong and I cannot thank you enough for turning out in the hundreds to hear our view of, and our vision for, Capsule Networks.

You can watch the recording and download the PPT here or on the BrightTalk platform.

If you have any questions, please drop a message here.

Capsule Theory Webcast on Dec 7th, 15:00 Amsterdam Time

I will be talking about Capsule Theory, which Geoffrey Hinton has been discussing for quite a while and on which he and his co-authors published a paper on arXiv in October.


The talk will be primarily about the following three topics:

  1. What is right and wrong about CNN (Convolutional Neural Networks)
  2. How Capsule Network theory, or CapsNET (as it is already being called), intends to fix, replace or improve upon CNNs, and finally
  3. What we (Liao and I) are planning in terms of building and developing an advanced CapsLayer library for use beyond the MNIST dataset.

Here is the link to the webinar, don’t miss it!


Capsules — What are they and what can they do?

Hinton, as my previous webcast describes, is considered the go-to guy when it comes to Deep Learning. He has been constantly at it since 2006, hitting hard every time he strikes, and he doesn’t seem to stop!

A few years ago, he became extremely suspicious of the weaknesses of CNNs (Convolutional Neural Networks), especially when it comes to understanding spatial relationships between the components of, say, a face that we may want to identify.

The face has a set of components such as eyes, ears, nose, eyebrows and so on. Their geometric position is not something a CNN [1] seems to care much about.

Its job is to detect features in an image and tell you that it’s a human face, that it’s your face or someone else’s, and more, such as whether you’re smiling, sad, happy and so on.

BTW: I will list the sources at the end of the blog; personally, I find all those links that refer to each other irritating. I’ll update them accordingly.

So, what’s the problem with CNN?

CNNs are pretty bad at taking spatial hierarchies into account: for instance, in a face your two eyes sit side by side, the nose slightly below and in between, the ears on the sides and so on.
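To make the point concrete, here is a deliberately extreme toy sketch (not how a real CNN is built, and all names and grid positions are made up for illustration). It assumes one hypothetical "part detector" per face component and then applies global max pooling, which keeps how strongly each part was detected but throws away where it was detected.

```python
import numpy as np

def feature_maps(positions, size=4):
    """Build one response map per part from {part: (row, col)} on a size x size grid."""
    maps = {}
    for part, (row, col) in positions.items():
        m = np.zeros((size, size))
        m[row, col] = 1.0  # the hypothetical detector fires where the part is
        maps[part] = m
    return maps

def global_max_pool(maps):
    """Keep only the strongest response per part, discarding its location."""
    return {part: float(m.max()) for part, m in maps.items()}

# Illustrative layouts: a plausible face and the same parts jumbled around.
normal_face = {"left_eye": (1, 0), "right_eye": (1, 3), "nose": (2, 1), "mouth": (3, 2)}
scrambled_face = {"left_eye": (3, 3), "right_eye": (0, 1), "nose": (0, 0), "mouth": (1, 2)}

pooled_normal = global_max_pool(feature_maps(normal_face))
pooled_scrambled = global_max_pool(feature_maps(scrambled_face))

# Both layouts give identical pooled features, so anything built on top of
# them cannot tell a face from a jumble of face parts.
print(pooled_normal == pooled_scrambled)
```

Real CNNs pool locally and over several layers, so they keep some positional information, but the sketch shows the direction of the weakness: the more you pool, the less the network knows about how the parts are arranged.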

Things get even worse when we take objects out of context, change details on the face, or simply rearrange, add or remove objects in our homes, living rooms and other spaces, both indoors and outdoors.

Of course, in reality the symmetry of a face does not change much, so a CNN would still classify it properly, but recognition from 3D or from more complex geometric hierarchies might pose challenges.

Or, to make matters worse, what if we start doing things like hanging upside down (like kids do and learn — hint hint) or changing or manipulating the spatial as well as hierarchical properties of the face components?

What if we need to do other things that require a whole new class of facial hierarchies that we are not even aware we need?

How are Capsules set to solve this puzzle?

Hinton is not the only one thinking about this, but he is probably the only one who has made it mathematically possible for us to test it. This is a huge difference and we must give Geoff Hinton due credit for this effort!

There are enough talks out on the internet on how the brain thinks in multiple dimensions, sensorimotor multiple view theory or even Joshua’s theory of Deep Learning and whether machines can do TUFA learning.

Whatever the case, the problem we are still trying to grasp is the understanding and representation of the multi-dimensional world around us. You know, the stuff you see around you.

The whole field of computer vision is focused on building some form of visual constructs in much the same way as our brain constructs them in its visual cortex.

We can very quickly make sense of the geometric positions and hierarchies of everything we see around us.

In our brain, it is (as many neuroscientists would like to believe) the hippocampus that stores this image as an array of geometric objects and matrices representing relative positions such as height, width, depth and breadth ($b_h, b_w, b_d, b_b, …$), along with other associations such as texture, flavor, smell and event compositions ($b_t, b_f, b_s, b_e$). What we call déjà vu is basically such a memory block.

Get my drift?

Inverse Graphics
[Figure: Inverse Graphics]

As of today, we do rendering in computer vision basically to show us back what our eyes supposedly see. It (currently) restricts itself to storing this as an array of geometric objects and matrices, stacked with hierarchical representations and spatial positions.

Hinton argues that there is a deep difference, as well as a correlation, between representation and recognition in our brain. Besides recognizing a component, we also perceive how it is positioned in our field of view (= what we see).

This is called the pose.

So ideally an object could be in a funny position, by itself or in relation to others, and the system could still recognize what it is.

For instance, I am curious what a typical computer vision task would do (or not do!) in recognizing this?

So, I hope you are getting my point: Collecting huge datasets of women in complex positions may sound like the cool thing to do but it may also be the dumbest.

Capsules’ Fix

So the Capsules’ fix is to incorporate these relative positions between objects and represent them mathematically as a multi-dimensional matrix.

That’s it!
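A minimal sketch of what "relative positions as a matrix" can look like, assuming simple 2D poses written as 3x3 homogeneous matrices. The helper name `pose`, the part-to-whole relations and all numbers are my own illustrative assumptions, not taken from the paper. Each part "votes" for the pose of the whole face; if the votes agree, the parts are arranged like a face, even when the face hangs upside down.

```python
import numpy as np

def pose(angle_deg, tx, ty):
    """3x3 homogeneous 2D pose: rotation by angle_deg, then translation (tx, ty)."""
    a = np.deg2rad(angle_deg)
    return np.array([
        [np.cos(a), -np.sin(a), tx],
        [np.sin(a),  np.cos(a), ty],
        [0.0,        0.0,       1.0],
    ])

# Assumed fixed part-to-whole relations: where each part sits in the face's own frame.
eye_in_face = pose(0, -1.0, 1.0)
mouth_in_face = pose(0, 0.0, -1.0)

# The face appears upside down (rotated 180 degrees) and shifted in the image.
face_in_image = pose(180, 5.0, 5.0)

# What is observed: each part's pose in image coordinates.
eye_in_image = face_in_image @ eye_in_face
mouth_in_image = face_in_image @ mouth_in_face

# Each part votes for the face pose by undoing its known relation to the whole.
vote_from_eye = eye_in_image @ np.linalg.inv(eye_in_face)
vote_from_mouth = mouth_in_image @ np.linalg.inv(mouth_in_face)
print(np.allclose(vote_from_eye, vote_from_mouth))

# Put the mouth above the eyes: its vote no longer matches the eye's vote.
misplaced_mouth = face_in_image @ pose(0, 0.0, 3.0)
bad_vote = misplaced_mouth @ np.linalg.inv(mouth_in_face)
print(np.allclose(vote_from_eye, bad_vote))
```

The first check prints `True` and the second `False`: the viewpoint changed everything about where the parts are in the image, but not how they relate to each other, and that relation is what the matrices capture.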

For instance, for the picture above, a model should first be able to make a rough recognition based on relative pose relationships and attributes, and only then go after the accuracy monster.

For example: if everyone wears white clothes and caps and is equally spaced, then possibly prayers in Mecca; if there are colors, hand towels, bums in the air and grass, then people doing yoga; and so on.

I have refrained from explaining the capsule paper itself in detail here; I will cover how Capsules essentially work in the webcast.

For now, just remember that a capsule’s core idea is to represent one entity with a bunch of neurons rather than with a single neuron.
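As a rough sketch of that idea: in the dynamic routing paper by Sabour, Frosst and Hinton, a capsule outputs a vector whose length is read as the probability that the entity is present and whose direction encodes its properties. The "squash" non-linearity below follows the formula from that paper; the small `eps` term and the example input vectors are my own additions for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # short vectors -> ~0, long vectors -> ~1
    return scale * s / np.sqrt(sq_norm + eps)  # eps avoids division by zero

# Two made-up capsule inputs pointing the same way with different strength.
weak = squash(np.array([0.1, 0.0, 0.1, 0.0]))
strong = squash(np.array([3.0, 0.0, 3.0, 0.0]))

# Length acts like a presence probability; direction is left unchanged.
print(np.linalg.norm(weak), np.linalg.norm(strong))
```

A single neuron could only say "how much"; the vector says "how much" through its length and "in what configuration" through its direction.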

Let me just mention my favorite part, about remembering and recognizing (I will be starting my own research on minds & thoughts theory and liked Hinton’s intuitions and subtle insinuations there).

I also liked the “coincidence” concept, as Hinton calls it: a handful of representations or markers is enough to trigger a recognition event. For instance, “2020”, “Russia” and “Ronaldo” should help you make some associations about what to fill in next, right?

This sort of “coincidence filtering” is done by low-level capsules, which decide which high-level capsules to activate. This way one can judge the confidence / accuracy by clustering high-dimensional vectors.

This sort of “confidence clustering or agreement” is cool in terms of filtering out noise and has the potential to make results both robust and safe from hacks, tricks or malice.
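Here is a minimal NumPy sketch of this agreement mechanism, following the routing-by-agreement procedure described in the dynamic routing paper by Sabour, Frosst and Hinton. The prediction vectors in `u_hat` are made up for illustration; in a real network they would come from learned weight matrices, and the function name `route` is my own.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vectors to length in [0, 1) while keeping their direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, iterations=3):
    """u_hat[i, j] is low-level capsule i's prediction for high-level capsule j."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                   # routing logits, uniform at start
    for _ in range(iterations):
        c = softmax(b, axis=1)                      # each low capsule splits its vote
        s = np.einsum("ij,ijd->jd", c, u_hat)       # weighted sum of predictions
        v = squash(s)                               # high-level capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # reward predictions that agree
    return v, c

# Three low-level capsules, two high-level capsules, 2-D vectors (all illustrative).
# The predictions for capsule 0 cluster together; those for capsule 1 conflict.
u_hat = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
    [[0.9, 0.0], [0.0, 1.0]],
])

v, c = route(u_hat)
lengths = np.linalg.norm(v, axis=1)
print(lengths)  # capsule 0 ends up much longer (more "confident") than capsule 1
print(c)        # every low-level capsule sends most of its vote to capsule 0
```

The conflicting predictions for capsule 1 largely cancel out, while the clustered ones for capsule 0 reinforce each other, which is the "coincidence" doing the filtering.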

I will talk about it in more detail in the webcast including some cool intuitions.

CapsLayer Library — Why the world needs this?

While this is the brainchild of my colleague Huadong (who, as many know by now, was the first guy to make a cool TensorFlow implementation of CapsNET), we are having intense discussions about it on a daily basis.

The purpose of creating the CapsLayer library is to address issues that we are bound to encounter as we move forward to create advanced representation and recognition learning algorithms.

Our goal is also to make this library so advanced that it can move on to recognizing datasets that are considered novel today, or that are probably not even considered useful for deep learning.

For the same purpose I am taking Huadong wherever I can to continue to feed his and my intuitions with enough food for thought.

Together we will do the following and more in the coming months (you are free to join if you feel inclined to spend some insane hours with us making this happen):

  • Contributing to a special Capsule Network chapter for my book; I am pleased to have him on board as a contributor.
  • Participating in projects and workshops at enterprise clients to demonstrate the value and use cases of Capsules. This should help enterprises get a better grip and will surely enthuse their developers and R&D divisions!
  • (Fresh in today) We will also soon be writing our own paper on CapsLayer for ACM, to infuse more academic rigor into our efforts. We are already considering some improvements to the current implementations and to the fundamental model of CapsNET, and will need to test and benchmark these for our upcoming research.

So, lots of awesome stuff coming!


If you search enough and in the right direction, you’ll find that some good research has already been done on understanding pose, rotation and even other high-dimensional components such as texture.

These models have tried to do better where CNNs, in their current form, perform poorly; one might argue that CNNs have chugged along only because of the “hacks”.

Yet CapsNET, or Capsule Networks, are definitely moving in the right direction. All that effort is coming together to start building better deep learning models.

I am confident about it and will continue to explore CapsNET and even apply it to industrial applications someday soon.

[1] CNN: I will soon create several 30 to 45 minute videos on what CNNs, RNNs, LSTMs etc. are. I promise to take the time to explain the concepts simply and slowly.

Tarry loves to write about Technology, Innovation, Entrepreneurship and how they affect businesses and our daily lives.

6 Replies to “Capsule Theory – A Boon for Deep Learning or Distraction?”

  1. Hi,

    A wonderful overview. However, I wanted to ask you if you are aware of the research labs currently working on CapsNets, other than Hinton’s lab.

