# The (simple) intuition of the Bernoulli distribution's variance

Hi again!

Today is going to be a short post that briefly discusses a simple intuition I stumbled across after plotting the variance of the Bernoulli distribution. I thought it was a neat observation, and I feel it’s worth sharing.

I’ll walk you through the steps we need to get to this intuition (it’s not bad, I promise). So, from the top, then.

Let’s start with a random variable, $X$, that can take on a binary value, i.e. $X \in \{0, 1\}$. The probabilities that $X$ takes on a value of 1 or 0 are:

$$P(X = 1) = \theta, \qquad P(X = 0) = 1 - \theta$$

where $\theta$ is a parameter we can control, in the range $\theta \in [0, 1]$. Put differently, we can think of $\theta$ as a measure of how biased a coin is. The larger the value of $\theta$, the greater the probability of observing a 1, and the smaller the probability of observing a 0.

We can encode this logic in a single expression with a little bit of creativity:

$$P(X = x) = \theta^x (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}$$

This is identical to our earlier expressions (feel free to plug in values to verify), but contained in a single, neat formula.
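If you want to verify it programmatically, here’s a throwaway Python sketch of the expression (the `bernoulli_pmf` name is just for illustration):

```python
def bernoulli_pmf(x, theta):
    """P(X = x) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}."""
    return theta**x * (1 - theta) ** (1 - x)

# A coin biased towards 1, with theta = 0.7:
print(bernoulli_pmf(1, 0.7))  # 0.7  -> P(X = 1) = theta
print(bernoulli_pmf(0, 0.7))  # ~0.3 -> P(X = 0) = 1 - theta
```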

## The Expected Value

The following is stock-standard fare, but we’ll go through the steps for the sake of thoroughness.

The expected value of a (discrete) random variable is defined as:

$$\mathbb{E}[X] = \sum_{x} x \, P(X = x)$$

(i.e. the sum of each outcome multiplied by the probability of observing that outcome¹)

For the Bernoulli distribution, that gives us the following:

$$\mathbb{E}[X] = 1 \cdot \theta + 0 \cdot (1 - \theta) = \theta$$

Ah-ha! The expected value of a random variable with a Bernoulli distribution is simply our bias parameter, $\theta$.²

## Variance

Same steps as before, but this time we’ll be calculating the variance. The variance of a (discrete) random variable is defined as:

$$\mathrm{Var}[X] = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \sum_{x} (x - \mathbb{E}[X])^2 \, P(X = x)$$

So, given our parameters, the variance of the Bernoulli distribution can be expressed as:

$$\mathrm{Var}[X] = (1 - \mathbb{E}[X])^2 \, \theta + (0 - \mathbb{E}[X])^2 \, (1 - \theta)$$

Next, we can plug in our previously-calculated mean, $\mathbb{E}[X] = \theta$:

$$\mathrm{Var}[X] = (1 - \theta)^2 \, \theta + (0 - \theta)^2 \, (1 - \theta)$$

and away we go:

$$\mathrm{Var}[X] = \theta (1 - \theta) \left[ (1 - \theta) + \theta \right] = \theta (1 - \theta)$$
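To double-check the algebra, we can compare it against an empirical estimate from simulated coin flips. Here’s a minimal NumPy sketch (the choice of $\theta$ and sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
theta = 0.3

# One million simulated flips of a theta-biased coin.
flips = rng.binomial(n=1, p=theta, size=1_000_000)

print(flips.mean())  # ~0.30 -> matches E[X] = theta
print(flips.var())   # ~0.21 -> matches Var[X] = theta * (1 - theta)
```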

## The magic

At this point, this may seem kind of arbitrary. But let’s plot our calculated variance of the Bernoulli distribution (remember: the domain of $\theta$ is $[0, 1]$).
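If you’d like to reproduce the plot yourself, a minimal matplotlib sketch does the trick:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 201)   # the full domain of the bias parameter
variance = theta * (1 - theta)   # Var[X] = theta * (1 - theta)

plt.plot(theta, variance)
plt.xlabel(r"$\theta$")
plt.ylabel(r"$\mathrm{Var}[X]$")
plt.title("Variance of a Bernoulli random variable")
plt.show()
```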

That’s rather interesting, right? No, not the fact that it’s a parabola (the equation for the variance should’ve given that away), but rather that the intuition regarding a biased coin is captured entirely by this single curve!

Let me explain. $\theta$ controls the bias of our coin, as we know from earlier. If the coin is entirely biased to one side (e.g. $\theta = 0$ or $\theta = 1$), then there is no uncertainty about the outcome, hence the variance is zero. Conversely, if we have an unbiased coin ($\theta = 0.5$), we know the least about the potential outcome, and so the variance is at its maximum.
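For completeness, a one-line calculus check confirms exactly where that maximum sits:

$$\frac{d}{d\theta}\,\theta(1 - \theta) = 1 - 2\theta = 0 \quad\Longrightarrow\quad \theta = \frac{1}{2}, \qquad \mathrm{Var}[X]\Big|_{\theta = 1/2} = \frac{1}{4}$$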

This might not be rocket science³ to people who’re very familiar with statistics, but the intuition encoded in that variance plot was an unexpectedly delightful observation, one that captures more about the behaviour of a Bernoulli distribution than simply staring at the maths. And that’s always a win in my book.

Till next time.

## Footnotes

1. This is a rather neat intuition in and of itself.

2. This isn’t something that’s immediately apparent upon first inspection. It certainly wasn’t for me, even though in hindsight (after computing the math), it appears rather obvious. Intuition in the probability domain remains a curious thing.

3. or brain surgery, for that matter.

# What is a Skipgram?

## A what?

Let’s start off by defining a sentence of words:

$\text{sentence} = w_1, \dots, w_m$.

Mathematically, we can describe the $k$-skip-$n$-grams of the above sentence as the following set¹:

$$\left\{\, (w_{i_1}, w_{i_2}, \dots, w_{i_n}) \;\middle|\; 1 \le i_1 < i_2 < \dots < i_n \le m \ \text{and} \ i_{j+1} - i_j - 1 \le k \ \text{for all } j \,\right\}$$

where parameter $k$ refers to the max skip-distance and $n$ to the “grams” or subsequence length.

So, what in heaven’s name does the above expression actually mean? I’ve tried to wrap my head around it multiple times and, in all honesty, it’s still rather difficult to interpret intuitively (which I suspect is why you’ll struggle to find a standalone expression for skip-grams online).

A simpler explanation, then: we’re looking to find the set of subsequences of length $n$ in which each pair of neighbouring words is at most a distance $k$ apart (FYI, words right next to each other in the sentence have a distance of zero).

As with most things, this becomes a bit clearer with an example. For starters, let’s set $k = 1$ and $n = 2$. Take a look at the following sentence:

*I like to eat cheeseburgers.*

The complete set of 1-skip-bi-grams is:

{I like, I to, like to, like eat, to eat, to cheeseburgers, eat cheeseburgers}

Not too bad, right?²

Here’s a slightly trickier example with $k = 2$ (so now we’re looking for any combination of two words that are up to a distance of 2 apart). The complete set of 2-skip-bi-grams is:

{I like, I to, I eat, like to, like eat, like cheeseburgers, to eat, to cheeseburgers, eat cheeseburgers}

Ok, the logic is still easy enough to follow, so let’s ramp it up yet again to check that we really understand. Let’s set $k = 1$ and $n = 3$. A few examples of the 1-skip-tri-grams are:

{I like to, I like eat, I to eat, like to eat, like eat cheeseburgers, …}

The key point, again, is that the distance between any pair of neighbouring words in a subsequence must be at most $k$. This is rarely pointed out or emphasized online, and it led to much head-scratching when I was trying to understand how people determine the skip-grams for certain examples. These details become important if you ever want to write your own skip-gram generator, for instance (something I plan on doing properly in the future for a bit of a challenge, but a minimal sketch follows below).
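To make the rule concrete, here’s a minimal Python sketch of such a generator (the `skipgrams` function name and interface are my own, not from any library); it reproduces the 1-skip-bi-gram example from earlier:

```python
from itertools import combinations

def skipgrams(words, n, k):
    """All k-skip-n-grams of `words`: length-n subsequences in which
    neighbouring picks are at most k positions apart (words right next
    to each other in the sentence have a distance of zero)."""
    grams = []
    for idx in combinations(range(len(words)), n):
        # Distance between neighbouring picks: idx[j + 1] - idx[j] - 1.
        if all(b - a - 1 <= k for a, b in zip(idx, idx[1:])):
            grams.append(" ".join(words[i] for i in idx))
    return grams

sentence = "I like to eat cheeseburgers".split()
print(skipgrams(sentence, n=2, k=1))
# ['I like', 'I to', 'like to', 'like eat', 'to eat',
#  'to cheeseburgers', 'eat cheeseburgers']
```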

## Why should I care?

Skip-grams are cool and all, but what are they actually good for? Put simply, skip-grams are a decent way of encoding the context (sometimes referred to as the co-occurrences) in which words occur, without creating datasets that are as sparse (another interesting topic) as those produced by vanilla n-grams.

The reason we’re talking about skip-grams is that they form the core of the training samples fed into the surprisingly effective Skip-Gram Model, which was later incorporated into the now extremely famous Word2Vec, whose word representations exhibited extraordinary syntactic and semantic understanding³. The original paper has so many citations by now that it’s not even worth trying to keep up.

Part of the reason Word2Vec set the world on fire was the Skip-Gram Model’s demonstration of semantic understanding of language: it is surprisingly good at understanding that “man” is to “king” as “woman” is to “queen”. The classic example you’ll find littered everywhere online is “queen” = “king” - “man” + “woman”, and Word2Vec’s vector representations of these words exhibit exactly the linearity that makes this analogical reasoning possible.
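For the curious, here’s roughly what that analogy query looks like using the gensim library and its pretrained Google News vectors (a sketch, assuming you have gensim installed; note that the model download is large):

```python
import gensim.downloader as api

# Downloads the pretrained Google News Word2Vec vectors on first use.
model = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" ~= ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # [('queen', ...)] with these vectors
```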

Well, that, and the fact that it’s super efficient to train thanks to its shallow log-linear (not recurrent) architecture plus some smart speed-ups (or at least it was efficient compared to other models at the time; things have gotten much better since), and so it could realistically be fed orders of magnitude more data. The empirical results were hilariously good compared to anything else at the time.

But more on that once we try and figure the Word2Vec model out properly.

## Conclusion

Today you’ve hopefully learned what skip-grams are, how to derive our own from a sentence, and why they’re important in the world of NLP. There are, of course, other ways of encoding text, but I’ll cover those topics as the need arises. For now, we’ll stop here.

Till next time!

## Footnotes

1. *A Closer Look at Skip-gram Modelling*, Guthrie et al. (2006).

2. Let’s pause for a moment before we continue. Surely I could also include subsequences such as “like I” and “to like”, and so on? They are also combinations of words at most a distance $k$ apart, after all. From what I’ve been able to gather, the jury is very much still out on whether to include these “negative direction” combinations in the final skip-gram set, particularly once we start looking at various Natural Language Processing (NLP) models. Word2Vec doesn’t care about word order, for example, and so includes the “negative direction” combinations, whilst others leave them out in a bid to garner additional information from word order. So which flavour of skip-gram to use depends very much on what you’re trying to accomplish. We’ll touch on this at a later point, once we start looking at what we’ll be using skip-grams for.

For simplicity’s sake, for now we’ll stick to the rule of “only unique combinations of words”.

3. More on this later, when we talk about Word2Vec in a separate post.