17 Feb 2019
Hi again!
Today is going to be a short post that briefly discusses a simple
intuition that I stumbled across after plotting the variance of the
Bernoulli distribution. I thought it neatly captured some intuition, and
I feel it’s worth sharing.
I’ll walk you through the steps we need to get to this intuition (it’s
not bad, I promise). So, from the top, then.
Let’s start with a random variable \(X\) that can take on a binary
value, i.e. \(X \in \{0, 1\}\). The probability that \(X\) takes on a
value of 0 or 1 is:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
where \(p\) is a parameter we can control and is in the range
\(0 \le p \le 1\). Put differently, we can think of \(p\) as a measure of
how biased a coin is. The larger the value of \(p\), the greater
the probability of observing a 1, and the smaller the probability of
observing a 0.
We can encode this logic in a single expression with a little bit of
creativity:

$$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$

This is identical to our earlier expression above (feel free to plug in
values to verify) but contained in a single, neat expression.
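In fact, that plugging-in is trivial to do in code (a minimal Python sketch; the function name is my own):

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) = p^x * (1 - p)^(1 - x), for x in {0, 1}."""
    return (p ** x) * ((1 - p) ** (1 - x))

# A coin biased towards 1 (p = 0.7):
print(bernoulli_pmf(1, 0.7))  # -> 0.7
print(bernoulli_pmf(0, 0.7))  # roughly 0.3 (up to float error)
```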
The Expected Value
The following is stock-standard fare, but we’ll go through the steps
for the sake of thoroughness.
The expected value of a (discrete) random variable is defined as:

$$\mathbb{E}[X] = \sum_{x} x \, P(X = x)$$

(i.e. the sum of each outcome multiplied by the probability of observing
that outcome ^{1})
For the Bernoulli distribution, that gives us the following:

$$\mathbb{E}[X] = 0 \cdot (1 - p) + 1 \cdot p = p$$

Ah-ha! Our expected value of a random variable with a Bernoulli
distribution is our bias parameter, \(p\) ^{2}.
Variance
Same steps as before, but this time we’ll be calculating the variance.
The variance of a (discrete) random variable is defined as:

$$\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

So, given our parameters, the variance for the Bernoulli distribution
can be expressed as:

$$\operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \big(0^2 \cdot (1 - p) + 1^2 \cdot p\big) - \mathbb{E}[X]^2 = p - \mathbb{E}[X]^2$$

Next, we can plug in our previously-calculated mean:

$$\operatorname{Var}(X) = p - p^2 = p(1 - p)$$

and away we go…
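If you’d rather convince yourself of the closed form \(p(1-p)\) without any algebra, a quick simulation works too (a rough sketch using only the standard library; the sample size and seed are arbitrary choices of mine):

```python
import random

def empirical_variance(p: float, n: int = 100_000, seed: int = 0) -> float:
    """Draw n Bernoulli(p) samples and return their sample variance."""
    rng = random.Random(seed)
    draws = [1 if rng.random() < p else 0 for _ in range(n)]
    mean = sum(draws) / n
    return sum((d - mean) ** 2 for d in draws) / n

p = 0.3
print(empirical_variance(p))  # should land close to p * (1 - p) = 0.21
```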
The magic
At this point, this may seem kind of arbitrary. But let’s plot our
calculated variance of the Bernoulli distribution (remember: the domain
of \(p\) is \([0, 1]\)).
That’s rather interesting, right? No, not the fact that it’s a
parabola (the equation for the variance should’ve given that away). But
rather that the intuition regarding a biased coin is captured entirely
by this single curve!
Let me explain. \(p\) controls the bias of our coin, as we know
from earlier. If we know the coin is entirely biased to one side (e.g.
\(p = 0\) or \(p = 1\)), then there is no uncertainty about
the outcome - hence the variance is zero. Conversely, if we have an
unbiased coin (when \(p = 1/2\)), we know the least about the potential
outcome, and so the variance is at its maximum.
This might not be rocket science ^{3} to people who’re very familiar
with statistics, but the intuition encoded in that variance plot was an
unexpectedly delightful observation that captures more intuition about
the behaviour of a Bernoulli distribution than simply looking at the
maths. And that’s always a win in my book.
Till next time.
02 Jan 2019
A what?
Let’s start off with defining a sentence of \(m\) words:

$$w_1, w_2, \ldots, w_m$$

Mathematically, we can describe the \(k\)-skip-\(n\)-gram of the above sentence as the
following set ^{1}:

$$\big\{\, (w_{i_1}, w_{i_2}, \ldots, w_{i_n}) \;\big|\; 1 \le i_1 < i_2 < \cdots < i_n \le m \ \text{and}\ i_{j+1} - i_j - 1 \le k \ \text{for all } j \,\big\}$$

where parameter \(k\) refers to the max skip-distance and \(n\) to the “grams” or
subsequence length.
So, what in heaven’s name does the above expression actually mean? I’ve tried to
wrap my head around it multiple times and, in all honesty, it’s still rather
difficult to interpret intuitively (which I suspect is why you’ll possibly struggle to find
a standalone expression for skip-grams online).
A simpler explanation, then: we’re looking to find the set of subsequences of
length \(n\) where each pair of neighbouring words in each subsequence is at
most a distance \(k\) apart (FYI, words next to each other have a distance of
zero).
As with most things, this becomes a bit clearer with an example. For starters,
let’s set \(k = 1\) and \(n = 2\). Take a look at the following sentence:
I like to eat cheeseburgers.
The complete 1-skip-bi-gram is:
{I like, I to, like to, like eat, to eat, to cheeseburgers, eat cheeseburgers}
Not too bad, right? ^{2}
Here’s a slightly trickier example for when \(k = 2\) (so now we’re looking for
any combination of any two words, but they can be up to a distance of 2 apart).
The 2-skip-bi-gram is:
{I like, I to, I eat, like to, like eat, like cheeseburgers, to eat, to
cheeseburgers, eat cheeseburgers}
Ok, the logic is still easy enough to follow. So let’s ramp it up yet again to
check if we really understand. Let’s set \(k = 1\) and \(n = 3\).
So, a few examples of the 1-skip-tri-grams are:
{I like to, I like eat, I to eat, like to eat, like eat cheeseburgers, …}
The key point, again, is that the distance between any two neighbouring words
in a subsequence must be less than or equal to \(k\). This is rarely ever pointed out or
emphasized online, and has led to much head-scratching when trying to understand
how people determine the skip-grams for certain examples. These details become
important if you’d ever want to write your own skip-gram generator, for
instance (which I plan on doing in the future for a bit of a challenge).
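To make those details concrete, here’s a rough sketch of such a generator (a toy implementation of my own, not a reference one; it follows the “only unique, left-to-right combinations” rule, with each pair of neighbouring words at most a distance \(k\) apart):

```python
from itertools import combinations

def skip_grams(sentence: str, k: int, n: int) -> list[tuple[str, ...]]:
    """All n-word subsequences (in left-to-right order) where every pair
    of neighbouring words is at most k apart; adjacent words have
    distance zero."""
    words = sentence.split()
    grams = []
    for idx in combinations(range(len(words)), n):
        # distance between neighbours = number of words skipped between them
        if all(idx[j + 1] - idx[j] - 1 <= k for j in range(n - 1)):
            grams.append(tuple(words[i] for i in idx))
    return grams

print(skip_grams("I like to eat cheeseburgers", k=1, n=2))
# [('I', 'like'), ('I', 'to'), ('like', 'to'), ('like', 'eat'),
#  ('to', 'eat'), ('to', 'cheeseburgers'), ('eat', 'cheeseburgers')]
```

Running it with \(k = 2, n = 2\) reproduces the nine-element 2-skip-bi-gram set above.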
Why should I care?
Skip-grams are cool and all, but what are they actually good for? Put simply,
skip-grams are a decently good way of encoding the context (sometimes referred
to as the co-occurrences) in which words occur, but don’t create datasets that
are as sparse (another interesting topic) as vanilla n-grams.
The reason we’re talking about skip-grams is that they form the core of the
data that is fed into the surprisingly-effective Skip-Gram Model, later
incorporated into the now extremely famous Word2Vec, which exhibited
extraordinary syntactic and semantic understanding in its word representations
^{3}. The original paper
has so many citations by now that it’s not even worth trying to keep up.
The reason Word2Vec set the world on fire (at least in part) was the Skip-Gram
Model’s demonstration of semantic understanding of language. It is surprisingly
good at understanding that “man” is to “king” as “woman” is to “queen”. The
classic example you’ll find littered everywhere online is “queen” = “king” -
“man” + “woman”, and Word2Vec’s vector representations of these words have
exactly the linearity property that makes this analogical reasoning possible.
Well, that, and the fact that it’s super efficient to train due to its shallow
neural network architecture plus some smart speed-ups (or at least it was
compared to other models at the time — things have gotten much better since)
and so could be realistically fed magnitudes more data.
realistically fed magnitudes more data. The empirical results were hilariously
good compared to anything else at the time.
But more on that once we try and figure the Word2Vec model out properly.
Conclusion
Today you’ve hopefully learned what skip-grams are, how to derive our own from a
sentence, and why they’re important in the world of NLP. There are, of course,
other ways of encoding text, but I’ll cover those topics as I feel the need
arises. For now we’ll stop here.
Till next time!
References
^{1} A Closer Look at Skip-gram Modelling, Guthrie et al.
^{2} Let’s pause for a moment before we
continue. Surely I could also include subsequences such as “like I” and “to
like”, and so on? They are also combinations of words at most a distance \(k\)
apart, after all. From what I’ve been able to gather, the jury is very much
still out on whether to include these “negative direction” combinations in the
final skip-gram set — particularly once we start looking at various
Natural Language Processing (NLP) models.
Word2Vec doesn’t care about word order, for
example, and so includes the “negative direction” combinations, whilst others
do not in a bid to garner additional information from word
order. So whichever flavour of skip-gram
to use depends very much on what you’re trying to accomplish. We’ll touch on
this at a later point once we start looking at what we’ll be using skip-grams
for.
For simplicity’s sake, for now we’ll stick to the rule of “only unique
combinations of words”.
^{3} More on this later when we talk about Word2Vec in a separate post.
29 Nov 2017
You’re probably already familiar with Moore’s Law. If not, it’s the famous / infamous observation of Gordon Moore (who co-founded Fairchild Semiconductor and Intel), made in 1965 and revised in 1975, that the number of components per integrated circuit doubles every 2 years.
This is a classic curve that’s been shown a million times before, and, while I particularly like the graph of Moore’s Law that’s on the Wikipedia page, it’s a bit outdated. I thought it would be fun to give it a go myself - going so far as to scrape the Wikipedia table detailing the transistor count of processors over the years.
So, without further ado, an update to the visualization of Moore’s Law:
Yup - Moore’s empirical observation still holds. Although we’re definitely starting to see a slow-down from around 2012 onwards. By the way, just for fun, the curve for Moore’s Law in the plot above has the following mathematical expression:

$$\text{transistors}(t) = \text{transistors}(t_0) \cdot 2^{(t - t_0)/2}$$
(That was just an excuse to play around with MathJax in Markdown - something that I’ve enabled for the blog recently)
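The doubling rule is just as easy to play with in code (a toy sketch; the anchor point, the Intel 4004’s roughly 2,300 transistors in 1971, is my own illustrative choice, not a fit to the scraped Wikipedia data):

```python
def moores_law(year: float, base_year: float = 1971,
               base_count: float = 2300) -> float:
    """Transistor count predicted by a doubling every two years,
    anchored (illustratively) at ~2,300 transistors in 1971."""
    return base_count * 2 ** ((year - base_year) / 2)

# Twenty years = ten doublings, i.e. a 1024x increase:
print(round(moores_law(1991)))  # -> 2355200
```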
But how has this happened? Have processors also gotten physically larger? Or are transistors just more densely packed onto the CPU wafers? (You’ve probably correctly guessed that the answer is both, but I’ve never seen this broken down before, so I’d like to do it here). Let’s take a look at the area (in mm^2) of processors over time.
This definitely tells us that processors have increased in physical size extremely rapidly over the years. But is that the entire reason?
The “process” of an integrated circuit, nowadays specified in nm (nanometers), is defined as “the average half-pitch of a memory cell”, where a memory cell is a device that can store 1 bit of memory. In other words, the smaller the “process” for a particular integrated circuit, the more miniaturized the components (and hence, the closer we can pack them together). Smaller process equals higher transistor density.
Let’s take a look at the process used over time:
And that can mean only one thing - that transistor density has absolutely exploded since the 1970s:
Moore’s Law still holds for two reasons, then:
- Our ability to continue to miniaturize transistors.
- Our ability to reliably produce integrated circuits.
But there’s a problem. We’re already sitting with 14nm processors - that’s a handful of atoms. And despite major technological advancements getting us this far, we’re eventually going to run into a physical limit. At least with the way transistors currently work. Ars Technica has, as usual, a fantastic article on the supposed death of Moore’s Law.
All doom and gloom, then? Not quite. The world isn’t going to grind to a standstill. The demand for increased processing power is at an all-time high (just look at present AI research), and will continue to grow. And when there’s a collective will that strong, I can almost guarantee there’s going to be a way for Moore’s Law to continue to exist for a number of years to come. What’s got me excited is what happens after we reach the current physical limit - that would be interesting.
21 Nov 2017
Hey! So while we’re still experimenting and learning how to use Jekyll, I decided to write a quick post to become familiar with the post-writing workflow and to reacquaint myself with Markdown. For this brief post, we’ll be looking into something mildly interesting I found while playing around with an open database.
I happened upon the Total World Population open database that is published by The World Bank (feel free to go take a look). While originally my idea was to look at the population growth of some countries over time, I discovered that the World Bank had added something interesting to the dataset - the aggregate population of countries that belong to each of their declared Income Groups, among some other notable additions which I’m sure I’ll be diving into for future posts.
For now, I calculated the population growth rate for each of these income groups, and created the visualization below (extra info! As an experiment I used PlotNine, a ggplot2 port for Python that has really impressed me):
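For the curious, the growth-rate calculation itself is simple: the year-on-year percentage change. A minimal sketch (plain Python rather than the pandas/PlotNine pipeline; the input numbers are made up purely for illustration):

```python
def growth_rates(populations: list[float]) -> list[float]:
    """Year-on-year percentage growth for a series of annual populations."""
    return [
        (curr - prev) / prev * 100
        for prev, curr in zip(populations, populations[1:])
    ]

# A population going 1000 -> 1020 -> 1030.2 grows at 2%, then 1%:
print(growth_rates([1000, 1020, 1030.2]))  # roughly [2.0, 1.0]
```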
Before we ask some questions, something to remember:
- These populations are according to country economy, and do not track individuals or families. So we cannot infer anything more than what is happening within entire countries, grouped into income groups.
- I am not accounting for countries whose economies have changed income groups over time. These do happen fairly often - on the first of July each year the analytical classification is revised by the World Bank. Each time a small handful of countries (typically 10 or so) have their classification changed. So when you see ruffles or spikes on the graph, it could’ve been an event or simply a country changing classification. I’d like to look at the general trend over a number of years, so this shouldn’t be too much of an issue, but I think it’s important to point out.
It’s rather clear that the general trend is downwards for all Income Groups besides “Low income” countries. What’s really curious is how the decline in population growth rate across the “Lower middle” to “Upper middle” income groups has occurred at a similar rate. As is the fact that the “Lower middle” to “Upper middle” countries diverged from a near-identical point in 1965.
So why is this the case? I’m not sure - there could be a large number of interconnected reasons. So, some speculation:
- Having children in wealthier countries is more expensive.
- There’s been a cultural shift to smaller families since the 1960s.
- There has been a shift to having fewer children who are more highly qualified, rather than many children who are not (i.e. specialization vs strength in numbers).
- Richer economies are becoming poorer (unlikely).
- Increased prevalence of family planning over time.
- Increased demand for women to enter the workplace since the 1960s.
All of this is, of course, pure speculation. I warrant you could write an entire thesis on why this graph looks the way it does if you had the motivation. Still, it is interesting.
Till next time,
Michael.