Hello, this is my first blog post. Any suggestions regarding improvements are very welcome.
Formally, this is called semi-supervised POS tagging.
What is POS tagging?
If you have some acquaintance with NLP, then you will most probably know what POS tagging is. Basically, given a sentence, you have to label each word with its part-of-speech tag. This can simply be put as a sequence labeling task. You can read more about POS tagging here.
HMM and POS tagging
The first approach for POS tagging is to use an HMM with Viterbi algorithm, which is basically a dynamic programming technique to speed up HMM for POS tagging. You can read more about HMM and Viterbi following the links given. But for giving a high-level view, we should find the sequence of tags which maximizes the probability P(t1, t2, … tn | w1, w2, … wn). Not going completely into derivation which involves Markov assumption and some probability manipulation, the above expression can be converted to finding tag sequence which maximises
Π P(wk|tk) P(tk|tk-1).
P(wk|tk) is called emission probability as it is the probability of wk occurring given that tk occurred. P(tk|tk-1) is called transition probability as this determines the probability of next tag, given the previous tag. This is a good tutorial for this technique.
The method discussed above is called supervised because the emission and transmission probabilities are usually calculated from training data by counting bigram frequencies:
P(wk|tk) = count(wk, tk)/count(tk)
P(tk|tk-1) = count(tk, tk-1)/count(tk-1)
But, what if we don’t have bigram counts of all possible bigrams, which is clearly the case in real world data. One basic approach is to use some smoothing technique, but we will be looking at a different approach here. It is clear that we don’t have exhaustively labeled data (all possible bigram counts). So, we make use of limited labeled data and word similarities to find the tags, which is why we call it semi-supervised.
Clustering and two-level HMM
Finally, we come to the actual solution. We first cluster the words in the train data into some k no of clusters. k can be fixed by experimenting. For this, we need vector representations of words. We can obtain vector representations by building a cooccurrence matrix and reducing the dimensionality by using SVD. A simpler way would be to use Word2Vec or GloVe vectors. But the problem with them is that they are general and may not be very relevant to the domain of our data. Anyway, it won’t make much of a difference. Now, we have n clusters – k1, k2 …, kn.
To visualize, our HMM looks like following:
As you can see, the first level of HMM is to get the cluster sequence from the word sequence. As discussed in the previous section, this will require two probabilities – emission and transition. There is no problem with transition probability as the count based approach earlier can still be used (all possible cluster bigram counts are mostly present). But, emission probability cannot be count-based as all possible cluster-word pair counts may not be available. This is the place the clustering and word vectors prove useful. The inverse of the euclidean distance between word vector and mean vector of a cluster can be considered as emission probability:
P(wi | kj) = ||vector(wi)-mean(kj)||-1
Now, we obtained cluster sequence. From this, we have to obtain the tag sequence. This part is the second level of HMM. Similar to the first level, transition probabilities can remain count-based as all tag bigram counts are usually available. Now comes the most tricky part of the entire approach – emission probabilities of cluster-tag. We already have tags for all the words in training data. We now calculate embedding for a tag as the mean of vectors of all words with that tag. Once we have tag embeddings, emission probabilities of tag-cluster can be obtained as the inverse of the euclidean distance between the mean of cluster and tag embedding.
P(ki | tj) = ||vector(tj)-mean(ki)||-1
vector(ti) = 1/n ∑ vector(wj) such that tag(wj) = ti
Points to note while implementing
- While doing Viterbi, during every iteration, you will have to multiply three probabilities – emission, transition, and probability from the previous word. Since these are very small values, multiplying them over and over makes them even smaller. So, instead of multiplying them, we have to add their log probabilities which yield the same result:
log ( P(wi | kj) * P(ki | ki-1) * dp(ki-1) ) = log P(wi | kj) + log P(ki | ki-1) + log dp(ki-1)
For every probability, we have to apply softmax before using as measures like inverse of euclidean distance are very arbitrary and does not obey rules or probability.
- Failing to do any of the above will result in putting all tags same or a repetition of a sequence of tags.