Hierarchical Softmax in neural network language model

In traditional NNLM, we need to compute the conditional probabilities of all vocab words given a history: p(w|history) $w\in Vocabulary$ , finally perform a normalization (namely softmax). Obviously, such kind of operation requires huge computation, especially on the whole vocab.

To solve the problem, some people like the authors of “A Scalable Hierarchical Distributed Language Model” proposed a solution named “hierarchical softmax” with following process:

1) first, build a Huffman tree based on word frequencies. As a result, each word is a leaf of that tree, and enjoys a path from the root to itself;

2) Now, each step in the search process from root to the target word is a normalization. For example, Root chooses left subtree with probability 0.7 and chooses right subtree with probability 0.3. This is a normalization step in this tree level. Naturally, the final probability for finding the target word is the continuous multiplication of probabilities in each search step. As the classic sentence in the page 2 says: “In probabilistic terms, one N-way normalization is replaced by a sequence of O(logN) local (binary) normalizations.“

Apparently, the time complexity decreases from O(V) to O(logV).

Luckly, the famous paper “Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et al., gives concrete formula (i.e., Equ. 3)to compute the Hierarchical Softmax probability:

$p(w|w_I)=\prod\limits_{j=1}^{L(w)-1}\sigma([n(w, j+1)=ch(n(w,j))]\cdot {v_{n(w,j)}^{'}}^Tv_{w_I})$

Now, I give some comments to understand this formula:
1）The whole process is a continuous multiplication. Product ${v_{n(w,j)}^{'}}^Tv_{w_I}$ can be treated as a kind of “similarity” between history (also called “Input” or “context”) and the jth ancestor. Note that all the leaves and non-leaves (i.e., ancestors) have initialized embeddings which will be updated gradually. So, above “similarity” is computed using embeddings. Its intuition is that if given context is more similar with the ancestors of certain word, then the word has higher probability to be the target word.

2) Given “similarity” score, Hierarchical Softmax converts it to two probabilities to search left subtree and right subtree. That’s achieved by using $\sigma([n(w, j+1)=ch(n(w,j))]\cdot {v_{n(w,j)}^{'}}^Tv_{w_I})$ . Note that the equation in the symbol “ $[\cdot]$ ” is -1 or +1. It makes $\sigma(+x)+\sigma(-x)=1$ . This is a normalization step. The original sentence says “let ch(n) be an arbitrary fixed child of n”. Hence, we can always think that ch(n) means the left child. So, if the j+1 ancestor of target word w is the left child of its j ancestor, then the probability of j choosing j+1 is $\sigma(+0.78234)$ (let’s assume the similarity between context and ancestor j is 0.78234), otherwise, the probability is $\sigma(-0.78234)$ .

3) Now, we know how to compute the probability of ancestor j choosing j+1 ( $j\in \{1,2,\cdots,L(w)-1\}$ ). Multiplying them one by one equals to the normalized probability of target word given context or history.

11 thoughts on “Hierarchical Softmax in neural network language model”

janson said:

December 7, 2013 at 4:29 pm

貌似看你这篇文章，就有些理解Hierarchical Softmax了，一直很疑惑这个。

- yinwenpeng said:
  
  December 7, 2013 at 4:44 pm
  
  That’s my great honour ! I will modify it recently with more details.
  
siegfang said:

December 9, 2013 at 1:29 pm

请问sigma是什么函数？

- yinwenpeng said:
  
  December 9, 2013 at 1:56 pm
  
  hello, you can find the answer in my another blog: https://yinwenpeng.wordpress.com/2013/09/27/sigmoidlogistic-function-vs-softmax/
  
Gonzalo said:

December 10, 2013 at 4:46 pm

Thanks !!! You save my day !!!

- yinwenpeng said:
  
  December 10, 2013 at 6:49 pm
  
  haha, my great pleasure
  
Pingback: word2vec gradient calculation | Wenpeng Yin's Blog
tong said:

October 23, 2014 at 7:24 pm

thx for the article!
how are the embeddings for the ancestors trained though?

- yinwenpeng said:
  
  October 24, 2014 at 1:37 pm
  
  in practical, the context words and target words have separate embedding matrices.
  
hadigayri said:

April 27, 2016 at 9:24 am

Thanks for this explanation.
You say that it is for reducing computation cost. But if the classification problem has a hierarchical structure, (like there are subclasses of classes), then isnt it more appropriate to use this hierarchical softmax instead of the usual one? This looks like an other advantage, in addition to computational advantage. What do you think?

- yinwenpeng said:
  
  April 30, 2016 at 10:07 pm
  
  yes, I agree you argument. But for general language modeling, words do not have hierarchical classes, or it is pretty expensive to build hierarchical class structure for vocab before training the language modeling. So, some work like word2vec used huffman tree based on word frequency (which costs nothing to build) to denote the hierachichy. Indeed for other tasks in which hierarchical structure is important and less expensive, hierarchical softmax is obsolutely preferred.

	yinwenpeng on Hierarchical Softmax in neural…
	hadigayri on Hierarchical Softmax in neural…
	MEhomer on word2vec implementations of gr…
	yinwenpeng on Hierarchical Softmax in neural…
	tong on Hierarchical Softmax in neural…

Share this:

Related

11 thoughts on “Hierarchical Softmax in neural network language model”

Leave a comment Cancel reply