What is a Skipgram?

A what?

Let’s start off with defining a sentence of words:

$\text{sentence} = w_1, \dots, w_m$.

Mathematically, we can describe the $k$-skip-$n$-grams of the above sentence as the following set 1:

$\{w_{i_1}, w_{i_2}, \dots, w_{i_n} \mid \sum_{j=1}^{n} i_j - i_{j-1} < k\}$

where parameter $k$ refers to the max skip-distance and $n$ to the “grams” or subsequence length.

So, what in heaven’s name does the above expression actually mean? I’ve tried to wrap my head around it multiple times and, in all honesty, it’s still rather difficult to interpret intuitively (which I suspect is why you’ll possibly struggle to find a standalone expression for skip-grams online).

A simpler explanation, then: we’re looking to find the set of subsequences of length $n$ in which neighbouring words are at most a distance $k$ apart, where distance counts the words skipped in between (so words next to each other have a distance of zero).

As with most things, this becomes a bit clearer with an example. For starters let’s set $k=1$ and $n=2$. Take a look at the following sentence:

I like to eat cheeseburgers.

The complete 1-skip-bi-gram is:

{I like, I to, like to, like eat, to eat, to cheeseburgers, eat cheeseburgers}

Here’s a slightly trickier example for when $k=2$ (so now we’re looking for any combination of any two words, but they can be up to a distance of 2 apart). The 2-skip-bi-gram is:

{I like, I to, I eat, like to, like eat, like cheeseburgers, to eat, to cheeseburgers, eat cheeseburgers}
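The two bi-gram sets above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `skip_bigrams` and the whitespace tokenization are my own assumptions, and it returns a list in reading order rather than a true set (wrap the result in `set()` if the sentence repeats words).

```python
def skip_bigrams(tokens, k):
    """Return all ordered word pairs with at most k words skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # The partner sits 1 to k + 1 positions to the right (0 to k skipped words).
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


# Naive whitespace tokenization, with the trailing full stop left out.
sentence = "I like to eat cheeseburgers".split()
for k in (1, 2):
    print(k, [" ".join(pair) for pair in skip_bigrams(sentence, k)])
```

Setting `k=0` collapses this to ordinary bi-grams, which is a handy sanity check.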

Ok, the logic is still easy enough to follow. So let’s ramp it up yet again to check if we really understand. Let’s set $k=1$ and $n=3$.

So, a few examples of the 1-skip-tri-grams are: {I like to, I like eat, I to eat, like to eat, like eat cheeseburgers, …}

The key point, again, is that any two neighbouring words in a subsequence can be at most a distance $k$ apart. This is rarely ever pointed out or emphasized online, and has led to much head-scratching when trying to understand how people determine the skip-grams for certain examples. These details become important if you’d ever want to write your own skip-gram generator, for instance (which I plan on doing in the future for a bit of a challenge).
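To make the neighbouring-word rule concrete for any $n$, here is a rough brute-force sketch. The name `skip_grams` is hypothetical, and the approach (try every combination of positions, keep those whose neighbouring gaps are all at most $k$) is chosen for readability rather than speed, so it is only sensible for short sentences.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return length-n subsequences whose neighbouring words skip at most k words."""
    grams = []
    # combinations() yields index tuples in increasing order, so word order is kept.
    for idx in combinations(range(len(tokens)), n):
        # Number of words skipped between each pair of neighbouring picks.
        gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
        if all(gap <= k for gap in gaps):
            grams.append(tuple(tokens[i] for i in idx))
    return grams


sentence = "I like to eat cheeseburgers".split()
for gram in skip_grams(sentence, n=3, k=1):
    print(" ".join(gram))
```

Note that under this per-neighbour reading “I to cheeseburgers” also qualifies, since each of its two gaps skips exactly one word. A stricter reading that caps the total number of skipped words across the whole subsequence would leave it out, so it is worth checking which convention a given source uses.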

Why should I care?

Skip-grams are cool and all, but what are they actually good for? Put simply, skip-grams are a decently good way of encoding the context (sometimes referred to as the co-occurrences) in which words occur, but don’t create datasets that are as sparse (another interesting topic) as vanilla n-grams.

The reason we’re talking about skip-grams is that they form the core sample of the data that is fed into the surprisingly effective Skip-Gram Model, which later became part of the now extremely famous Word2Vec that exhibited extraordinary syntactic and semantic understanding in its word representation 3. The original paper has so many citations by now that it’s not even worth trying to keep up.

The reason Word2Vec set the world on fire (at least in part) was the Skip-Gram Model’s demonstration of semantic understanding of language. It is surprisingly good at capturing that “man” is to “king” as “woman” is to “queen”. The classic example you’ll find littered everywhere online is “queen” ≈ “king” - “man” + “woman”: Word2Vec’s vector representations of these words are close to linear in exactly this way, which is what makes this kind of analogical reasoning possible.
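The arithmetic behind that analogy is easy to try out by hand. The sketch below uses tiny two-dimensional vectors that I invented purely for illustration (one axis loosely meaning “royalty”, the other “gender”); they are not real Word2Vec embeddings, and `analogy` is a hypothetical helper, not part of any library.

```python
import math

# Hand-made 2-D vectors, invented for illustration: (royalty, gender).
# These are NOT real Word2Vec embeddings.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))
```

With real embeddings the result of the arithmetic rarely lands exactly on a word vector, which is why the nearest neighbour by cosine similarity is used rather than an equality check.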

Well, that, and the fact that it’s super efficient to train thanks to its simple, shallow network architecture (no deep or recurrent layers) plus some smart speed-ups, at least compared to other models at the time (things have gotten much better since), and so could realistically be fed orders of magnitude more data. The empirical results were hilariously good compared to anything else at the time.

But more on that once we try and figure the Word2Vec model out properly.

Conclusion

Today you’ve hopefully learned what skip-grams are, how to derive your own from a sentence, and why they’re important in the world of NLP. There are, of course, other ways of encoding text, but I’ll cover these topics as I feel the need arise. For now we’ll stop here.

Till next time!

Footnotes

1 A Closer Look at Skip-gram Modelling, Guthrie et al.

2 Let’s pause for a moment before we continue. Surely I could also include subsequences such as “like I” and “to like”, and so on? They are also combinations of words at most a distance $k$ apart, after all. From what I’ve been able to gather, the jury is very much still out on whether to include these “negative direction” combinations in the final skip-gram set, particularly once we start looking at various Natural Language Processing (NLP) models. Word2Vec doesn’t care about word order, for example, and so includes the “negative direction” combinations, whilst others do not in a bid to garner additional information from word order. So which flavour of skip-gram to use depends very much on what you’re trying to accomplish. We’ll touch on this at a later point once we start looking at what we’ll be using skip-grams for.

For simplicity’s sake, for now we’ll stick to the rule of “only unique combinations of words”.
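For comparison, the order-agnostic flavour can be sketched as a symmetric context window around each word. The function name `context_pairs` and its `window` parameter are my own labels for this illustration; a window of `window` words on each side corresponds to $k = \text{window} - 1$ in the notation used above.

```python
def context_pairs(tokens, window):
    """Return (centre, context) pairs, looking `window` words to both sides."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


sentence = "I like to eat cheeseburgers".split()
for centre, context in context_pairs(sentence, window=2):
    print(centre, context)
```

Every forward pair now shows up alongside its mirror image, so both “I like” and “like I” are produced.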

3 More on this later when we talk about Word2Vec in a separate post.