KL-divergence of Two Documents

A hands-on tutorial

Manos Tsagkias

University of Amsterdam
[email protected]
06 December 2010
Keywords: information theory

6 December 2024—Note: This post is a restoration of an older post, which is no longer available online. Content-wise this version is identical to the original. The original version can still be found in the WayBack Machine.

Let \(P\) and \(Q\) be two probability distributions of a discrete random variable. If the following two conditions hold:

  1. \(P\) and \(Q\) both sum to \(1\),
  2. \(P(i) > 0\) and \(Q(i) > 0\) for any \(i\),

then, we can define their KL-divergence as:

\[D_{KL}(P||Q) = \sum_{i}P(i)\log\frac{P(i)}{Q(i)},\]

and it has three properties:

  1. \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\) (asymmetry)
  2. it is additive for independent distributions
  3. \(D_{KL} \geq 0\) with \(D_{KL} = 0\) iff \(P=Q\)
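As a quick worked example of the definition (using natural logarithms and two made-up distributions over two outcomes), let \(P = (\tfrac{1}{2}, \tfrac{1}{2})\) and \(Q = (\tfrac{3}{4}, \tfrac{1}{4})\):

\[D_{KL}(P||Q) = \tfrac{1}{2}\log\frac{1/2}{3/4} + \tfrac{1}{2}\log\frac{1/2}{1/4} \approx -0.203 + 0.347 = 0.144,\]

\[D_{KL}(Q||P) = \tfrac{3}{4}\log\frac{3/4}{1/2} + \tfrac{1}{4}\log\frac{1/4}{1/2} \approx 0.304 - 0.173 = 0.131,\]

which also illustrates the asymmetry (property 1).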

Working with documents

We regard a document \(d\) as a discrete distribution over its \(|d|\) words, where \(|d|\) is the number of words in the document. Now, let \(d_{1}\) and \(d_{2}\) be two documents for which we want to calculate their KL-divergence. We run into two problems:

  1. due to the asymmetry, we need to compute the KL-divergence twice: \(D_{KL}(d_{1}||d_{2})\) and \(D_{KL}(d_{2}||d_{1})\);
  2. due to the 2nd condition for defining the KL-divergence, our calculations should only consider words occurring in both \(d_{1}\) and \(d_{2}\).

Symmetric KL-divergence

We address the asymmetry (property 1) by simply adding the two directions of the divergence and simplifying:

\[\begin{array}{rcl} D_{KL}(P||Q) + D_{KL}(Q||P) & = & \sum_{i}P(i)\log\frac{P(i)}{Q(i)} + \sum_{i}Q(i)\log\frac{Q(i)}{P(i)} \\ & = & \sum_{i}\left[P(i)\log\frac{P(i)}{Q(i)}+Q(i)\log\frac{Q(i)}{P(i)}\right]\\ & = & \sum_{i}\left[P(i)\log\frac{P(i)}{Q(i)}-Q(i)\log\frac{P(i)}{Q(i)}\right]\\ & = & \sum_{i}(P(i)-Q(i))\log\frac{P(i)}{Q(i)}\end{array}\]

OK, that looks good [1]! Now we only need to compute the KL-divergence once for every pair of documents.
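As a sanity check, plugging the made-up \(P\) and \(Q\) from the earlier example into the single-sum form gives

\[\left(\tfrac{1}{2}-\tfrac{3}{4}\right)\log\frac{1/2}{3/4} + \left(\tfrac{1}{2}-\tfrac{1}{4}\right)\log\frac{1/2}{1/4} \approx 0.275,\]

which is indeed \(D_{KL}(P||Q) + D_{KL}(Q||P) \approx 0.144 + 0.131\) from the example above.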

Over which random variables?

Now let’s turn to how to handle documents with little or no vocabulary overlap. To illustrate the problem, consider the following documents:

d1:

This is a document

d2:

This is a sentence

From each document we remove the stopwords (‘this’, ‘is’, ‘a’), so the documents become:

d1:

document

d2:

sentence

According to condition 2, we need to operate on the intersection of the documents’ vocabularies, but here \(d_{1}\cap d_{2}=\emptyset\). We end up with the empty set, and therefore we cannot compute the KL-divergence directly. In this case we can simply assign it a large number, such as \(1e33\).
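As a minimal sketch of this intersection-only approach (the helper name, the fallback argument and the count dictionaries below are mine, purely for illustration; the post's actual script comes later):

import math

def kl_sym_intersection(c1, c2, fallback=1e33):
    # Symmetric KL-divergence computed over the shared vocabulary only.
    # Counts are renormalised over the shared words so that each
    # distribution sums to 1 (condition 1); if there is no shared word
    # at all, fall back to a large constant as discussed above.
    shared = set(c1) & set(c2)
    if not shared:
        return fallback
    n1 = float(sum(c1[w] for w in shared))
    n2 = float(sum(c2[w] for w in shared))
    div = 0.
    for w in shared:
        p, q = c1[w] / n1, c2[w] / n2
        div += (p - q) * math.log(p / q)
    return div

# term counts of the first example after stopword removal
print(kl_sym_intersection({'document': 1}, {'sentence': 1}))  # prints 1e+33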

Let’s see what happens when we have larger documents.

d1:

Many research publications want you to use BibTeX which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file.

d2:

In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn’t have to be an actual ‘left delimiter’, that is you can use ‘\left)’ if there were some reason to do it.

After stopword removal, lowercasing, and discarding words shorter than \(2\) characters, the documents become:

d1:

many research publications want you use bibtex better organizes whole process suppose concreteness your source file tex basically you create file bib containing bibliography run bibtex file

d2:

case you must supply both left right because delimiter height made match whatever contained between two commands left doesn have actual left delimiter you use left some reason

The vocabulary intersection of the two documents consists of two terms: “use” and “you”. In \(d_{1}\), “use” occurs \(1\) time and “you” occurs \(2\) times. Surprisingly, in \(d_{2}\) “use” also occurs \(1\) time and “you” \(2\) times. The restricted distributions \(D_{1}\) and \(D_{2}\) are therefore equal, and \(D_{KLsym}(D_{1}||D_{2}) = 0\). So these documents are deemed identical! A better stopword list could have removed “use” and “you”, in which case the documents would have ended up with an infinite (in practice, a very large) KL-divergence, as in the first example. However, it is easy to think of similar examples where a stopword list wouldn’t have been of much help.
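To spell this example out: restricting each document to the shared terms and normalising (so that condition 1 holds) gives identical distributions,

\[P(\text{use}|d_{1}) = P(\text{use}|d_{2}) = \tfrac{1}{3}, \qquad P(\text{you}|d_{1}) = P(\text{you}|d_{2}) = \tfrac{2}{3},\]

so every term of the symmetric sum vanishes:

\[D_{KLsym} = \left(\tfrac{1}{3}-\tfrac{1}{3}\right)\log\frac{1/3}{1/3} + \left(\tfrac{2}{3}-\tfrac{2}{3}\right)\log\frac{2/3}{2/3} = 0.\]

(The intersection-only sketch above returns \(0.0\) for these counts as well.)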

So, how can we overcome this problem?

Simple back-off

Since operating on the vocabulary intersection is not an option, we need a trick that allows us to consider the entire vocabulary of both documents. Smoothing comes to mind. Dirichlet and Laplace smoothing are amongst the most popular smoothing techniques, but after smoothing the probability distribution no longer sums to \(1\), which violates the first condition for defining the KL-divergence.

Brigitte Bigi [2] suggested a back-off smoothing method which keeps each probability distribution summing to \(1\) and also allows operating on the entire vocabulary. According to her back-off method, the smoothed probability \(P'(t_{i}, d)\) of a term \(t_{i}\) in a document \(d\) is:

\[P'(t_{i},d) = \left\{ \begin{array}{ll} \gamma P(t_{i}|d) & \quad \text{if } t_{i} \text{ occurs in } d\\ \epsilon & \quad \text{otherwise}\\ \end{array} \right.\]

where

\[P(t_{i}|d) = \frac{tf(t_{i}, d)}{\sum_{x\in d}tf(t_{x},d)}\]

The interesting part is how \(\gamma\) and \(\epsilon\) are calculated. In order to keep the term probabilities of each document summing to \(1\), the following constraint should be met (written here for \(d_{1}\); the constraint for \(d_{2}\) is analogous, with the roles of the two documents swapped):

\[\sum_{t_{i} \in d_{1}}\gamma P(t_{i}|d_{1}) + \sum_{t_{i} \in d_{2},\, t_{i} \notin d_{1}}\epsilon = 1\]

where \(\gamma\) is a normalization coefficient:

\[\gamma = 1 - \sum_{t_{i} \in d_{2},\, t_{i} \notin d_{1}}\epsilon\]

and \(\epsilon\) is a positive number smaller than the minimum term probability occurring in either \(d_{1}\) or \(d_{2}\).
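Before looking at the full script, here is a minimal sketch of the back-off itself (the function name and the term-count dictionaries are made up for illustration). It builds the smoothed distribution of one document over the joint vocabulary and checks that it still sums to \(1\):

def backoff_smooth(counts, other):
    # Smoothed distribution of a document over the joint vocabulary:
    # terms that occur in the document get gamma * P(t|d), terms that
    # only occur in the other document get epsilon.
    total = float(sum(counts.values()))
    other_total = float(sum(other.values()))
    # epsilon: smaller than the smallest term probability in either document
    epsilon = min(min(counts.values()) / total,
                  min(other.values()) / other_total) * 0.001
    missing = [t for t in other if t not in counts]
    gamma = 1. - len(missing) * epsilon
    smoothed = dict((t, gamma * counts[t] / total) for t in counts)
    for t in missing:
        smoothed[t] = epsilon
    return smoothed

# made-up term counts for two tiny documents
c1 = {'use': 1, 'you': 2, 'file': 3}
c2 = {'use': 1, 'you': 2, 'left': 4}

print(sum(backoff_smooth(c1, c2).values()))  # 1.0 (up to floating point)
print(sum(backoff_smooth(c2, c1).values()))  # 1.0 (up to floating point)

With both smoothed distributions defined over the same joint vocabulary, the symmetric KL-divergence can be computed over the full vocabulary rather than only the intersection.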

The code

To illustrate the above, I wrote a small Python script:

import re, math, sys, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    # empty documents get the large fallback value
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    # terms that occur in _s but not in _t; these receive probability
    # epsilon in the smoothed distribution of _t
    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    # epsilon: a positive number smaller than the minimum term probability
    # occurring in either document
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    # gamma: normalization coefficient that keeps the smoothed distribution
    # summing to 1
    gamma = 1 - lenvocabdiff * epsilon

    # print "_s: %s" % _s
    # print "_t: %s" % _t

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])

    if sc < 9e-6:
        print "Sum P: %e, Sum Q: %e" % (sc, st)
        print "*** ERROR: sc does not sum up to 1. Bailing out .."
        sys.exit(2)
    if st < 9e-6:
        print "Sum P: %e, Sum Q: %e" % (sc, st)
        print "*** ERROR: st does not sum up to 1. Bailing out .."
        sys.exit(2)

    # Symmetric KL-divergence, summed over the terms of _s. pts is the
    # probability of a term in _s; ptt is its (smoothed) probability in _t,
    # backing off to epsilon when the term does not occur in _t.
    div = 0.
    for t, v in _s.iteritems():
        pts = v / ssum

        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)

        ckl = (pts - ptt) * math.log(pts / ptt)

        div +=  ckl

    return div
#end of kldiv

d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""

d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""

print "KL-divergence between d1 and d2:", kldiv(tokenize(d1), tokenize(d2))
print "KL-divergence between d2 and d1:", kldiv(tokenize(d2), tokenize(d1))

The output looks like this:

KL-divergence between d1 and d2: 6.52185430964
KL-divergence between d2 and d1: 6.51142363095

Now the KL-divergence is greater than zero, so the documents are no longer deemed identical as before! Good job. As for symmetry, the divergences of the two directions are not identical, but they are sufficiently close [1] (the small difference arises because the script sums only over the terms of its first argument, so the two calls do not iterate over exactly the same vocabulary).

Acknowledgments

The above is a compilation of knowledge found in Wikipedia and in the paper “Reducing the Plagiarism Detection Search Space on the basis of Kullback-Leibler Distance” by Alberto Barrón-Cedeño, Paolo Rosso, and José-Miguel Benedí [1]. The examples and code are mine and you are free to use them.

References

[1] Pinto, D., Benedí, J.M., Rosso, P. (2007). Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. In Computational Linguistics and Intelligent Text Processing (CICLing 2007). Lecture Notes in Computer Science, vol. 4394. Springer, Berlin, Heidelberg.

[2] Bigi, B. (2003). Using Kullback-Leibler Distance for Text Categorization. In Advances in Information Retrieval (ECIR 2003). Lecture Notes in Computer Science, vol. 2633. Springer, Berlin, Heidelberg.