How Siri understands what “2+2” is

Jayani Hewavitharana
6 min read · Dec 30, 2020

Let me ask a simple question: what is two plus two? It’s four (obviously!). The answer is straightforward, but there’s some impressive processing going on behind the scenes in our brains that lets us understand the question in order to answer it. Now ask Siri (or any other AI assistant you like) the same question, and chances are you’ll get the correct answer. But wait… Siri doesn’t have a brain to do all that behind-the-scenes processing. So how does it understand what we’re asking? There’s some AI at work processing the question, and it goes well beyond pushing a few buttons on a calculator to get the answer.

Ask Siri “what is 2+2”

Let’s assume (just for the sake of the explanation) that you have a very bad short-term memory and forget the first “two” by the time I get to the second “two” of the question. Then obviously you won’t be able to answer. So the AI assistant’s ability to give the correct answer suggests that it grasps the question as a whole, without suffering from short-term memory loss.

The underlying mechanisms that power these AI assistants are none other than neural networks. But these neural nets are a bit different from traditional feedforward neural networks, which have no memory of previous inputs. If we fed our question to a traditional neural network word by word, it would have no recollection of the first “two” by the time it came across the second “two” (just like when you had that imaginary short-term memory loss). A feedforward neural network can only predict an output from the given input, without taking previous inputs or context into account. For cases where context and order matter, such as our AI assistant, machine translation, or other scenarios involving sequential data, there is a special type of neural network called a Recurrent Neural Network (RNN for short).

The specialty of RNNs is that they can keep track of information from previous time steps and use it when making predictions. The output of an RNN is influenced by an entire sequence of data, not just the immediate input. RNNs are structured similarly to traditional neural networks, but with a feedback mechanism to preserve the influence of previous data.

RNN

It is kind of like a traditional neural network trained in a loop, feeding what has already been learned back into the network. To get a better intuition about how this feedback works, we can unroll the RNN to view each time step (a time step is basically one iteration of the RNN).

Many-to-Many RNN (unrolled)

The structure can be unrolled into a many-to-many RNN, where an output is produced at each time step (we’ll get to the different types of RNNs in a bit). The output at each time step is influenced by the immediate input and the memory gained from previous time steps.

Since each time step needs to preserve the memory from the previous time steps, the activations of the neurons work a bit differently than in a traditional neural network. First, an activation is calculated from a linear function of the current input and the previous activation.

Activation calculated using the current input and previous activations
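In the usual notation, this looks something like:

a⟨t⟩ = g1(Waa · a⟨t−1⟩ + Wax · x⟨t⟩ + ba)

where a⟨t⟩ is the activation at time step t, x⟨t⟩ is the input at that step, and the W matrices and ba are learned weights and a bias.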

This activation (where g1 is an activation function) is sent to the next time step, and it is also used to calculate the output (y) of the time step using another activation function (g2).

Output calculated using the current activation
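In the same notation:

y⟨t⟩ = g2(Wya · a⟨t⟩ + by)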

Now you can see how memory is preserved and propagated through the time steps to influence the outcome of the RNN. RNNs are trained to minimize a loss just like traditional neural networks, and backpropagation is applied across the time steps (backpropagation through time) to update the weights.
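To make the two formulas above concrete, here is a minimal NumPy sketch of one forward pass through an unrolled RNN. It assumes tanh for g1 and softmax for g2, with made-up dimensions, random weights, and a toy input sequence; it’s an illustration of the equations, not Siri’s actual implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up sizes: a 10-word vocabulary (one-hot inputs) and a hidden state of size 8.
input_size, hidden_size, output_size = 10, 8, 10

rng = np.random.default_rng(0)
W_ax = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input  -> hidden
W_aa = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the feedback)
W_ya = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b_a = np.zeros(hidden_size)
b_y = np.zeros(output_size)

# A toy input sequence: 4 time steps, each word encoded as a one-hot vector.
xs = [np.eye(input_size)[i] for i in [3, 1, 4, 1]]

a = np.zeros(hidden_size)  # a<0>: no memory yet
for t, x in enumerate(xs):
    a = np.tanh(W_aa @ a + W_ax @ x + b_a)   # a<t> = g1(Waa·a<t-1> + Wax·x<t> + ba)
    y = softmax(W_ya @ a + b_y)              # y<t> = g2(Wya·a<t> + by)
    print(f"time step {t}: most likely output = {y.argmax()}")
```

Notice that the same weights are reused at every time step; only the activation a carries information forward.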

Now it’s time to see the different types of RNNs.

First, there’s one-to-one. A one-to-one RNN is simply a traditional neural network that doesn’t have a feedback mechanism. It takes an input and gives an output based on the given input only.

One-to-One RNN (Traditional Neural Network)

Then there are one-to-many RNNs that take one input and produce a sequence of outputs, each dependent on the previous one. This is used, for example, to generate music: given a single starting note, the rest of the piece is generated, each note depending on the previous ones (as we all know, music is a beautiful flow of notes, not some random sounds).

One-to-Many RNN (unrolled)
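Here’s a minimal sketch of that one-to-many loop, continuing from the earlier NumPy example (same weights, sizes, and softmax): we feed in one starting “note” and then keep feeding each predicted note back in as the next input. The note encoding and step count are made up for illustration.

```python
# Continues from the earlier NumPy sketch (reuses numpy as np, softmax, and the weights).
note = np.eye(input_size)[0]     # the single starting note, one-hot encoded
a = np.zeros(hidden_size)
melody = [0]

for _ in range(8):               # generate 8 more notes
    a = np.tanh(W_aa @ a + W_ax @ note + b_a)
    y = softmax(W_ya @ a + b_y)
    next_note = int(y.argmax())  # pick the most likely next note
    melody.append(next_note)
    note = np.eye(input_size)[next_note]  # feed the output back in as the next input

print(melody)
```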

Then we have many-to-one RNNs, which are used for tasks like sentiment classification or audio classification. These take a sequence of inputs and give only one output at the very end; sort of like analyzing a whole text or audio clip and then saying whether it’s positive or negative.

Many-to-One RNN (unrolled)
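As an illustration (not from the article), this is roughly what a many-to-one sentiment classifier looks like in Keras: the SimpleRNN layer reads the whole word sequence and hands only its final state to a single output neuron. The vocabulary size and layer widths here are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10000  # arbitrary vocabulary size
model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64),       # word index -> dense vector
    layers.SimpleRNN(32),                   # reads the sequence, returns only the final state
    layers.Dense(1, activation="sigmoid"),  # single output: positive (1) or negative (0)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```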

Finally, there’s many-to-many, which is what we unrolled earlier to get a better understanding of the RNN structure. Just as the name suggests, these involve a sequence of inputs and a sequence of outputs. An example of this is machine translation: feed in a sentence in one language (a sequence of words) and get the translation in another language (again a sequence of words, hence many-to-many). The number of time steps with inputs may or may not equal the number of time steps with outputs, depending on the use case. (For machine translation they are not equal, because a translation from one language to another doesn’t always have the same number of words.)

Many-to-Many RNN (unrolled, number of inputs ≠ number of outputs)
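One common way to realize this unequal-length many-to-many setup (a sketch of the standard encoder-decoder arrangement, which the article doesn’t cover in detail) is to let one RNN read the whole source sentence and pass its final state to a second RNN that produces the translation. A rough Keras sketch, with arbitrary sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical vocabulary and hidden sizes, just for illustration.
src_vocab, tgt_vocab, hidden = 8000, 9000, 64

# Encoder: read the source sentence, keep only its final hidden state.
enc_in = keras.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(src_vocab, hidden)(enc_in)
_, enc_state = layers.SimpleRNN(hidden, return_state=True)(enc_emb)

# Decoder: generate the target sentence, starting from the encoder's state.
dec_in = keras.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(tgt_vocab, hidden)(dec_in)
dec_out = layers.SimpleRNN(hidden, return_sequences=True)(dec_emb, initial_state=enc_state)
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = keras.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```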

As you can see, RNNs are useful for many things in AI, especially those related to NLP. But like most things, they’re not perfect. Remember that RNNs use backpropagation to update weights? For that, we use gradients (partial derivatives of the loss with respect to the weights). These gradients can cause problems when training RNNs, making it hard to preserve memory over long sequences. First, there’s the exploding gradient problem, where the gradients become so large that training becomes unstable; this can be handled by clipping or truncating the gradients. Then there’s the vanishing gradient problem, where the gradients become too small for the early time steps to learn anything. To address this, there are Gated Recurrent Units (GRUs for short) and Long Short-Term Memory units (LSTMs for short) that use gates inside the units to control what is remembered. (There’s a lot going on in those, so more on that later.)
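In practice, both fixes are close to one-liners in a framework like Keras (again an illustration, not the article’s code): gradient clipping is an optimizer option, and an LSTM or GRU layer is a drop-in replacement for the plain RNN layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(10000, 64),
    layers.LSTM(32),                    # or layers.GRU(32); replaces the plain SimpleRNN
    layers.Dense(1, activation="sigmoid"),
])

# clipnorm rescales any gradient whose norm exceeds 1.0 -- a simple guard against exploding gradients.
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1.0),
              loss="binary_crossentropy")
```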

So that’s basically the beginning of the story behind how Siri understands that two plus two is four, or who you want to call, or pretty much anything else you ask.
