In a traditional NNLM, we need to compute the conditional probability of every vocabulary word given a history, p(w|history) for w\in Vocabulary, and then perform a normalization over all of them (namely, a softmax). Obviously, this kind of operation requires huge computation, because the normalization runs over the whole vocabulary.
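To make the cost concrete, here is a minimal Python/NumPy sketch of the full softmax; the vocabulary size, embedding dimension, and array names are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

V, d = 10_000, 128             # illustrative vocabulary size and embedding dimension
W = np.random.randn(V, d)      # output vectors, one row per vocabulary word
h = np.random.randn(d)         # hidden representation of the history/context

# Full softmax: one dot product per vocabulary word, then one V-way normalization.
scores = W @ h                      # O(V * d) work
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # p(w | history) for every w in the vocabulary
```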

To solve this problem, the authors of “A Scalable Hierarchical Distributed Language Model” (among others) proposed a solution named “hierarchical softmax”, with the following process:

1) First, build a Huffman tree based on word frequencies. As a result, each word is a leaf of that tree and has a unique path from the root to itself;

2) Now, each step in the search from the root to the target word is a local normalization. For example, the root chooses its left subtree with probability 0.7 and its right subtree with probability 0.3; this is a normalization step at that tree level. Naturally, the final probability of reaching the target word is the product of the probabilities of each search step. As the classic sentence on page 2 says: “In probabilistic terms, one N-way normalization is replaced by a sequence of O(log N) local (binary) normalizations.”

Apparently, the time complexity decreases from O(V) to O(log V); the sketch below illustrates the tree construction of step 1) and the resulting root-to-leaf paths.
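Here is a minimal Python sketch that builds a Huffman tree from a toy frequency table; the words, counts, and helper names are assumptions for illustration, not code from the cited paper. With V leaves, the printed paths have length on the order of log V:

```python
import heapq
import itertools

freqs = {"the": 50, "cat": 10, "sat": 8, "on": 20, "mat": 5}   # toy frequencies (assumed)

# Build the Huffman tree: repeatedly merge the two least frequent nodes.
counter = itertools.count()    # unique tie-breaker so heapq never compares node dicts
heap = [(f, next(counter), {"word": w}) for w, f in freqs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), {"left": left, "right": right}))
root = heap[0][2]

# Record each leaf's path from the root as a string of left/right decisions.
def paths(node, prefix=""):
    if "word" in node:
        yield node["word"], prefix
    else:
        yield from paths(node["left"], prefix + "L")
        yield from paths(node["right"], prefix + "R")

for word, path in paths(root):
    print(word, path)          # frequent words get shorter paths, rare words longer ones
```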

Luckily, the famous paper “Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et al. gives a concrete formula (i.e., Eq. 3) for computing the hierarchical softmax probability:

p(w|w_I)=\prod\limits_{j=1}^{L(w)-1}\sigma([n(w, j+1)=ch(n(w,j))]\cdot {v_{n(w,j)}^{'}}^Tv_{w_I})

where n(w,j) is the j-th node on the path from the root to w (so n(w,1) is the root and n(w,L(w))=w), L(w) is the length of that path, ch(n) is an arbitrary fixed child of inner node n, v_{n}^{'} is the vector of inner node n, and v_{w_I} is the vector of the input word w_I.

Now, let me give some comments to help understand this formula:
1) The whole formula is a product of factors. The inner product {v_{n(w,j)}^{'}}^Tv_{w_I} can be treated as a kind of “similarity” between the history (also called “input” or “context”) and the j-th ancestor of w. Note that all the leaves and non-leaf nodes (i.e., the ancestors) have initialized embeddings, which are updated gradually during training, so the above “similarity” is computed from embeddings. The intuition is that if the given context is more similar to the ancestors of a certain word, then that word has a higher probability of being the target word.


2) Given the “similarity” score, hierarchical softmax converts it into two probabilities, one for searching the left subtree and one for the right subtree. That is achieved by \sigma([n(w, j+1)=ch(n(w,j))]\cdot {v_{n(w,j)}^{'}}^Tv_{w_I}). Note that the expression inside the bracket “[\cdot]” evaluates to +1 if the condition holds and -1 otherwise, and \sigma(+x)+\sigma(-x)=1, so this is a normalization step. The original sentence says “let ch(n) be an arbitrary fixed child of n”; hence, we can simply think of ch(n) as the left child. So, if the (j+1)-th ancestor of the target word w is the left child of its j-th ancestor, then the probability of moving from ancestor j to ancestor j+1 is \sigma(+0.78234) (let’s assume the similarity between the context and ancestor j is 0.78234); otherwise, the probability is \sigma(-0.78234).

3) Now we know how to compute the probability of ancestor j choosing ancestor j+1, for j\in \{1,2,\cdots,L(w)-1\}. Multiplying these probabilities together gives the normalized probability of the target word given the context or history, as the numeric sketch below shows.
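Putting comments 1)–3) together, the following minimal Python sketch evaluates the formula numerically; the embedding dimension, the random vectors, the path length, and the sign pattern are all illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(0)

v_input = rng.normal(size=d)                # v_{w_I}: embedding of the context/input word
# v'_{n(w,j)}: embeddings of the inner (ancestor) nodes on the path from root to w
ancestors = [rng.normal(size=d) for _ in range(3)]   # L(w) - 1 = 3 binary decisions
# +1 if n(w, j+1) is the fixed (say, left) child of n(w, j), else -1
signs = [+1, -1, +1]                        # illustrative path

p = 1.0
for v_ancestor, sign in zip(ancestors, signs):
    similarity = v_ancestor @ v_input       # "similarity" between context and ancestor j
    p *= sigmoid(sign * similarity)         # one local binary normalization per tree level

print("p(w | w_I) =", p)
```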