Not So Big Data Blog Ramblings of a data engineer

The (simple) intuition of the Bernoulli distribution's variance

Hi again!

Today is going to be a short post about a simple intuition that I stumbled across after plotting the variance of the Bernoulli distribution. I thought it neatly captured the behaviour of the distribution, and I feel it's worth sharing.

I’ll walk you through the steps we need to get to this intuition (it’s not bad, I promise). So, from the top, then.

Let’s start with a random variable $X$ that can take on a binary value, i.e. $X \in \{0, 1\}$. The probability that $X$ takes on a value of 1 or 0 is:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

where $p$ is a parameter we can control and is in the range $0 \le p \le 1$. Put differently, we can think of $p$ as a measure of how biased a coin is. The larger the value of $p$, the greater the probability of observing a 1, and the smaller the probability of observing a 0.

We can encode this logic in a single expression with a little bit of creativity:

$$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$

This is identical to our earlier expression above (feel free to plug in values to verify) but contained in a single, neat expression.
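If you'd like to do that plugging-in numerically, here's a tiny Python sketch (nothing assumed beyond the formula above; the function name is just for illustration):

```python
# Plugging values into the single-expression PMF, p^x * (1 - p)^(1 - x).
def bernoulli_pmf(x, p):
    """Probability of observing outcome x (0 or 1) for a coin with bias p."""
    return p**x * (1 - p)**(1 - x)

p = 0.7
print(bernoulli_pmf(1, p))  # 0.7   -> P(X = 1) = p
print(bernoulli_pmf(0, p))  # ~0.3  -> P(X = 0) = 1 - p
```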

The Expected Value

The following is stock-standard fare, but we’ll go through the steps for the sake of thoroughness.

The expected value of a (discrete) random variable is defined as:

$$E[X] = \sum_{x} x \, P(X = x)$$

(i.e. the sum of each outcome multiplied by the probability of observing that outcome 1)

For the Bernoulli distribution, that gives us the following:

$$E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

Ah-ha! The expected value of a random variable with a Bernoulli distribution is our bias parameter, $p$ 2.

Variance

Same steps as before, but this time we’ll be calculating the variance. The variance of a (discrete) random variable is defined as:

$$\operatorname{Var}[X] = E\big[(X - E[X])^2\big] = E[X^2] - \big(E[X]\big)^2$$

So, given our parameters, the variance for the Bernoulli distribution can be expressed as:

$$\operatorname{Var}[X] = \big(1^2 \cdot p + 0^2 \cdot (1 - p)\big) - \big(E[X]\big)^2 = p - \big(E[X]\big)^2$$

Next, we can plug in our previously calculated mean, $E[X] = p$:

$$\operatorname{Var}[X] = p - p^2 = p(1 - p)$$

and away we go…
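As a quick sanity check (not part of the derivation, just a hedged sketch using NumPy), we can simulate a biased coin and confirm that the sample mean and variance land on $p$ and $p(1 - p)$:

```python
# Simulate a biased coin and compare the sample mean and variance with p and p(1 - p).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # arbitrary bias, purely for illustration
samples = rng.binomial(n=1, p=p, size=1_000_000)  # a Bernoulli trial is a Binomial with n=1

print(samples.mean())  # ~0.30  (E[X] = p)
print(samples.var())   # ~0.21  (Var[X] = p(1 - p) = 0.3 * 0.7)
```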

The magic

At this point, this may seem kind of arbitrary. But let’s plot our calculated variance of the Bernoulli distribution (remember: the domain of $p$ is $[0, 1]$).

That’s rather interesting, right? No, not the fact that it’s a parabola (the equation for the variance should’ve given that away). But rather that the intuition regarding a biased coin is captured entirely by this single curve!

Let me explain. $p$ controls the bias of our coin, as we know from earlier. If we know the coin is entirely biased to one side (e.g. $p = 0$ or $p = 1$), then there is no uncertainty about the outcome, hence the variance is zero. Conversely, if we have an unbiased coin ($p = 0.5$), we know the least about the potential outcome, and so the variance is at its maximum.
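For reference, here's a minimal matplotlib sketch that reproduces the variance curve (it assumes nothing beyond $\operatorname{Var}[X] = p(1 - p)$):

```python
# Plot Var[X] = p(1 - p) over the full domain of p.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0, 1, 200)
plt.plot(p, p * (1 - p))
plt.xlabel("p (coin bias)")
plt.ylabel("Var[X] = p(1 - p)")
plt.title("Variance of the Bernoulli distribution")
plt.show()
```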

This might not be rocket science 3 to people who’re very familiar with statistics, but the intuition encoded in that variance plot was an unexpectedly delightful observation that captures more about the behaviour of a Bernoulli distribution than simply staring at the maths. And that’s always a win in my book.

Till next time.

Footnotes

  1. This is a rather neat intuition in and of itself. 

  2. This isn’t something that was immediately obvious to me upon first inspection, even though in hindsight (after working through the maths) it appears rather obvious. Intuition in the probability domain remains a curious thing. 

  3. or brain surgery, for that matter. 

What is a Skipgram?

A what?

Let’s start off by defining a sentence of $m$ words:

$$w_1 \, w_2 \, w_3 \ldots w_m.$$

Mathematically, we can describe the $k$-skip-$n$-gram of the above sentence as the following set 1:

$$\big\{\, w_{i_1} w_{i_2} \ldots w_{i_n} \;\big|\; 1 \le i_1 < i_2 < \cdots < i_n \le m \ \text{ and } \ i_{j+1} - i_j - 1 \le k \ \text{ for all } j \,\big\}$$

where the parameter $k$ refers to the max skip-distance and $n$ to the “grams”, or subsequence length.

So, what in heaven’s name does the above expression actually mean? I’ve tried to wrap my head around it multiple times and, in all honesty, it’s still rather difficult to interpret intuitively (which I suspect is why you’ll possibly struggle to find a standalone expression for skip-grams online).

A simpler explanation, then: we’re looking to find the set of subsequences of length $n$ where each pair of neighbouring words in a subsequence is less than or equal to a distance $k$ apart (FYI, words next to each other have a distance of zero).

As with most things, this becomes a bit clearer with an example. For starters, let’s set $k = 1$ and $n = 2$. Take a look at the following sentence:

I like to eat cheeseburgers.


The complete 1-skip-bi-gram is:

{I like, I to, like to, like eat, to eat, to cheeseburgers, eat cheeseburgers}

Not too bad, right? 2

Here’s a slightly trickier example for when $k = 2$ and $n = 2$ (so now we’re looking for any combination of any two words, but they can be up to a distance of 2 apart). The 2-skip-bi-gram is:

{I like, I to, I eat, like to, like eat, like cheeseburgers, to eat, to cheeseburgers, eat cheeseburgers}

Ok, the logic is still easy enough to follow. So let’s ramp it up yet again to check if we really understand. Let’s set $k = 1$ and $n = 3$.


So, a few examples of the 1-skip-tri-grams are: {I like to, I like eat, I to eat, like to eat, like eat cheeseburgers, …}

The key point, again, is that any pair of neighbouring words in a subsequence must be at most a distance $k$ apart. This is rarely pointed out or emphasized online, and it led to much head-scratching when I was trying to understand how people determine the skip-grams for certain examples. These details become important if you’d ever want to write your own skip-gram generator, for instance (which I plan on doing in the future for a bit of a challenge).
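In the meantime, here’s a minimal sketch of such a generator in Python, written against the “neighbouring words at most $k$ apart, forward order only” definition used above (the function name and structure are my own, not taken from the Guthrie et al. paper):

```python
# A k-skip-n-gram generator for the definition used in this post: order is
# preserved and neighbouring words in a subsequence may be at most k positions
# apart (adjacent words have a distance of zero).
from itertools import combinations

def skipgrams(sentence, n, k):
    """Return the set of n-word subsequences with at most k skips between neighbours."""
    words = sentence.split()
    grams = set()
    for idx in combinations(range(len(words)), n):
        # distance between neighbouring picks = gap between their indices minus one
        if all(j - i - 1 <= k for i, j in zip(idx, idx[1:])):
            grams.add(" ".join(words[i] for i in idx))
    return grams

print(skipgrams("I like to eat cheeseburgers", n=2, k=1))
# {'I like', 'I to', 'like to', 'like eat', 'to eat', 'to cheeseburgers', 'eat cheeseburgers'}
# (set ordering may differ when printed)
```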

Why should I care?

Skip-grams are cool and all, but what are they actually good for? Put simply, skip-grams are a decently good way of encoding the context (sometimes referred to as the co-occurrences) in which words occur, but don’t create datasets that are as sparse (another interesting topic) as vanilla n-grams.

The reason we’re talking about skip-grams is that they form the core of the training samples fed into the surprisingly effective Skip-Gram Model, which was later incorporated into the now extremely famous Word2Vec, whose word representations exhibited extraordinary syntactic and semantic structure 3. The original paper has so many citations by now that it’s not even worth trying to keep up.

The reason Word2Vec set the world on fire (at least in part) was the Skip-Gram Model’s demonstration of semantic understanding of language. It is surprisingly good at understanding that “man” is to “king” as “woman” is to “queen”. The classic example you’ll find littered everywhere online is “queen” = “king” - “man” + “woman”, and Word2Vec’s vector representations of these words exhibit exactly the linearity that makes this analogical reasoning possible.
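If you want to try the analogy yourself, a hedged sketch using gensim’s bundled pretrained Google News vectors looks something like this (treat it as an illustration, not the canonical way to reproduce the original paper’s results):

```python
# Illustration of the analogy property using gensim's pretrained Google News vectors.
# Note: the first call downloads a large (>1 GB) model.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected to appear at or near the top of the results
```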

Well, that, and the fact that it’s super efficient to train thanks to its shallow, log-linear architecture plus some smart speed-ups (or at least it was compared to other models at the time; things have gotten much better since), and so it could realistically be fed magnitudes more data. The empirical results were hilariously good compared to anything else at the time.

But more on that once we try and figure the Word2Vec model out properly.

Conclusion

Today you’ve hopefully learned what skip-grams are, how to derive your own from a sentence, and why they’re important in the world of NLP. There are, of course, other ways of encoding text, but I’ll cover those topics as the need arises. For now we’ll stop here.

Till next time!

Footnotes

1 A Closer Look at Skip-gram Modelling, Guthrie et al.

2 Let’s pause for a moment before we continue. Surely I could also include subsequences such as “like I” and “to like”, and so on? They are also combinations of words less than or equal to a distance $k$ apart, after all. From what I’ve been able to gather, the jury is very much still out on whether to include these “negative direction” combinations in the final skip-gram set, particularly once we start looking at various Natural Language Processing (NLP) models. Word2Vec doesn’t care about word order, for example, and so includes the “negative direction” combinations, whilst others do not in a bid to garner additional information from word order. So which flavour of skip-gram to use depends very much on what you’re trying to accomplish. We’ll touch on this at a later point once we start looking at what we’ll be using skip-grams for.

For simplicity’s sake, for now we’ll stick to the rule of “only unique combinations of words”.

3 More on this later when we talk about Word2Vec in a separate post.

Visualizing Moore's Law

You’re probably already familiar with Moore’s Law. If not, it’s the famous / infamous observation of Gordon Moore (who co-founded Fairchild Semiconductor and Intel), who in 1965 described, and in 1975 revised, the prediction that the number of components per integrated circuit doubles every 2 years.

This is a classic curve that’s been shown a million times before, and, while I particularly like the graph of Moore’s Law that’s on the Wikipedia page, it’s a bit outdated. I thought it would be fun to give it a go myself - going so far as to scrape the Wikipedia table detailing the transistor count of processors over the years.
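For the curious, the scraping step can be sketched in a few lines of pandas (the table position and column handling below are assumptions and will likely need adjusting to the live page layout):

```python
# A rough sketch of scraping the Wikipedia transistor-count table with pandas.
import pandas as pd

url = "https://en.wikipedia.org/wiki/Transistor_count"
tables = pd.read_html(url)   # requires lxml or html5lib to be installed
cpus = tables[0]             # the microprocessor table is typically near the top
print(cpus.columns)
print(cpus.head())
```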

So, without further ado, an update to the visualization of Moore’s Law:

Moore's Law

Yup - Moore’s empirical observation still holds, although we’re definitely starting to see a slow-down from around 2012 onwards. By the way, just for fun, the Moore’s Law curve in the plot above follows the familiar doubling form:

$$N(t) = N_0 \cdot 2^{(t - t_0)/2}$$

i.e. the transistor count $N$ doubles every two years from some reference count $N_0$ in year $t_0$.

(That was just an excuse to play around with MathJax in Markdown - something that I’ve enabled for the blog recently)
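As a toy illustration of that doubling form (anchored, for illustration only, to the commonly quoted figure of roughly 2,300 transistors for the Intel 4004 in 1971, not to a fit of my scraped data):

```python
# Expected transistor count if the count doubles every two years.
def moores_law(year, n0=2_300, year0=1971):
    return n0 * 2 ** ((year - year0) / 2)

for year in (1971, 1991, 2011, 2021):
    print(year, f"{moores_law(year):,.0f}")
# 2021 comes out around 7.7e10, at least in the ballpark of today's largest chips
```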

But how has this happened? Have processors also gotten physically larger? Or are transistors just more densely packed onto the CPU wafers? (You’ve probably correctly guessed that the answer is both, but I’ve never seen this broken down before, so I’d like to do it here.) Let’s take a look at the area (in mm²) of processors over time.

Area and date of introduction

This definitely tells us that processors have increased in physical size extremely rapidly over the years. But is that the entire reason?

The “process” of an integrated circuit, nowadays specified in nm (nanometers), is defined as “the average half-pitch of a memory cell”, where a memory cell is a device that can store 1 bit of memory. In other words, the smaller the “process” for a particular integrated circuit, the more miniaturized the components (and hence, the closer we can pack them together). Smaller process equals higher transistor density.

Let’s take a look at the process used over time:

Process and date of introduction

And that can mean only one thing - that transistor density has absolutely exploded since the 1970s:

Transistor density and date of introduction

Moore’s Law still holds for two reasons, then:

  1. Our ability to continue to miniaturize transistors.
  2. Our ability to reliably produce integrated circuits.

But there’s a problem. We’re already sitting with 14 nm processors - that’s only a handful of atoms across. And despite major technological advancements getting us this far, we’re eventually going to run into a physical limit, at least with the way transistors currently work. Ars Technica has, as usual, a fantastic article on the supposed death of Moore’s Law.

All doom and gloom, then? Not quite. The world isn’t going to grind to a standstill. The demand for increased processing power is at an all-time high (just look at present AI research), and it will continue to grow. And when there’s a collective will that strong, I can almost guarantee there’s going to be a way for Moore’s Law to continue to exist for a number of years to come. What’s got me excited is what happens after we reach the current physical limit - that would be interesting.

Wealthy countries grow more slowly than poor countries

Hey! So while we’re still experimenting and learning how to use Jekyll, I decided to write a quick post to become familiar with the post-writing workflow and to reacquaint myself with Markdown. For this brief post, we’ll be looking into something mildly interesting I found while playing around with an open database.

I happened upon the Total World Population open database published by The World Bank (feel free to go take a look). While my original idea was to look at the population growth of some countries over time, I discovered that the World Bank had added something interesting to the dataset - the aggregate population of the countries that belong to each of their declared Income Groups, among some other notable additions which I’m sure I’ll be diving into in future posts.

For now, I calculated the population growth rate for each of these income groups, and created the visualization below (extra info! As an experiment I used PlotNine, a ggplot2 port for Python that has really impressed me):

population growth per country
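For reference, the growth-rate calculation can be sketched roughly as follows (the CSV file name, the skiprows offset, and the income-group labels are assumptions based on the World Bank’s standard SP.POP.TOTL download format):

```python
# Year-on-year population growth rate per World Bank income group.
import pandas as pd

pop = pd.read_csv("API_SP.POP.TOTL_DS2_en_csv_v2.csv", skiprows=4)
groups = ["Low income", "Lower middle income", "Upper middle income", "High income"]
years = [str(y) for y in range(1960, 2018)]

# Rows are income groups, columns are years; transpose so years run down the index.
subset = pop[pop["Country Name"].isin(groups)].set_index("Country Name")[years].T
growth = subset.pct_change() * 100  # year-on-year growth, in percent

print(growth.tail())
```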

Before we ask some questions, something to remember:

  • These populations are aggregated by country economy, and do not track individuals or families. So we cannot infer anything more than what is happening within entire countries, grouped into income groups.
  • I am not accounting for countries whose economies have changed income groups over time. This happens fairly often - on the first of July each year the analytical classification is revised by the World Bank, and each time a small handful of countries (typically 10 or so) have their classification changed. So when you see ripples or spikes on the graph, it could’ve been an event or simply a country changing classification. I’d like to look at the general trend over a number of years, so this shouldn’t be too much of an issue, but I think it’s important to point out.

It’s rather clear that the general trend is downwards for all income groups besides the “Low income” countries. What’s really curious is how the decline in population growth rate across the “Lower middle” to “Upper middle” income groups has occurred at a similar rate, as is the fact that the “Lower middle” and “Upper middle” groups diverged from a near-identical point in 1965.

So why is this the case? I’m not sure - there could be a large number of interconnected reasons. So, some speculation:

  • Having children in wealthier countries is more expensive.
  • There’s been a cultural shift to smaller families since the 1960s.
  • There has been a shift towards having fewer, more highly qualified children rather than many less qualified ones (i.e. specialization vs strength in numbers).
  • Richer economies are becoming poorer (unlikely)
  • Increased prevalence of family planning over time.
  • Increased demand for women to enter the workplace since the 1960s.

All of this is, of course, pure speculation. I warrant you could write an entire thesis on why this graph looks the way it does if you had the motivation. Still, it is interesting.

Till next time,
Michael.