In traditional NNLM, we need to compute the conditional probabilities of all vocab words given a history: p(w|history) , finally perform a normalization (namely softmax). Obviously, such kind of operation requires huge computation, especially on the whole vocab.

To solve the problem, some people like the authors of “A Scalable Hierarchical Distributed Language Model” proposed a solution named “hierarchical softmax” with following process:

1) first, build a Huffman tree based on word frequencies. As a result, each word is a leaf of that tree, and enjoys a path from the root to itself;

2) Now, each step in the search process from root to the target word is a normalization. For example, Root chooses left subtree with probability 0.7 and chooses right subtree with probability 0.3. This is a normalization step in this tree level. Naturally, the final probability for finding the target word is the continuous multiplication of probabilities in each search step. As the classic sentence in the page 2 says: “**In probabilistic terms, one N-way normalization is replaced by a sequence of O(logN) local (binary) normalizations.**“

Apparently, the time complexity decreases from O(V) to O(logV).

Luckly, the famous paper “Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et al., gives concrete formula (i.e., Equ. 3)to compute the Hierarchical Softmax probability:

**Now, I give some comments to understand this formula:**

1）The whole process is a continuous multiplication. Product can be treated as a kind of “similarity” between history (also called “Input” or “context”) and the jth ancestor. Note that all the leaves and non-leaves (i.e., ancestors) have initialized embeddings which will be updated gradually. So, above “similarity” is computed using embeddings. **Its intuition is that if given context is more similar with the ancestors of certain word, then the word has higher probability to be the target word. **

2) Given “similarity” score, Hierarchical Softmax converts it to two probabilities to search left subtree and right subtree. That’s achieved by using . Note that the equation in the symbol “” is -1 or +1. It makes . This is a normalization step. The original sentence says “let ch(n) be an arbitrary fixed child of n”. Hence, we can always think that ch(n) means the left child. So, if the j+1 ancestor of target word w is the left child of its j ancestor, then the probability of j choosing j+1 is (let’s assume the similarity between context and ancestor j is 0.78234), otherwise, the probability is .

3) Now, we know how to compute the probability of ancestor j choosing j+1 (). Multiplying them one by one equals to the normalized probability of target word given context or history.

janson

said:貌似看你这篇文章，就有些理解Hierarchical Softmax了，一直很疑惑这个。

yinwenpeng

said:That’s my great honour ! I will modify it recently with more details.

siegfang

said:请问sigma是什么函数？

yinwenpeng

said:hello, you can find the answer in my another blog: https://yinwenpeng.wordpress.com/2013/09/27/sigmoidlogistic-function-vs-softmax/

Gonzalo

said:Thanks !!! You save my day !!!

yinwenpeng

said:haha, my great pleasure

Pingback: word2vec gradient calculation | Wenpeng Yin's Blog

tong

said:thx for the article!

how are the embeddings for the ancestors trained though?

yinwenpeng

said:in practical, the context words and target words have separate embedding matrices.

hadigayri

said:Thanks for this explanation.

You say that it is for reducing computation cost. But if the classification problem has a hierarchical structure, (like there are subclasses of classes), then isnt it more appropriate to use this hierarchical softmax instead of the usual one? This looks like an other advantage, in addition to computational advantage. What do you think?

yinwenpeng

said:yes, I agree you argument. But for general language modeling, words do not have hierarchical classes, or it is pretty expensive to build hierarchical class structure for vocab before training the language modeling. So, some work like word2vec used huffman tree based on word frequency (which costs nothing to build) to denote the hierachichy. Indeed for other tasks in which hierarchical structure is important and less expensive, hierarchical softmax is obsolutely preferred.