# Mutual information

In classical information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. Intuitively, the mutual information I(X;Y) measures the information about X that is shared by Y.

If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical, then all information conveyed by X is shared with Y: knowing X determines Y completely and vice versa. The mutual information is therefore the same as the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.

For a pair of discrete random variables (X, Y), the mutual information is formally defined as

$$I(X;Y) := H(X) + H(Y) - H(X,Y),$$

where H(X) and H(Y) are the Shannon entropies of X and Y, and H(X, Y) is the Shannon entropy of the pair (X, Y). In terms of the probabilities, the mutual information can be written as

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{f(x)\,g(y)},$$

where p is the joint probability distribution function of X and Y, and f and g are the marginal probability distribution functions of X and Y respectively.
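The discrete formula can be evaluated directly from a table of joint probabilities. The following is a minimal sketch, not a reference implementation: the function name `mutual_information`, the nested-list layout `joint[x][y]`, and the two example tables are assumptions made for illustration. The natural logarithm is used, so the result is in nats.

```python
import math

def mutual_information(joint):
    """Mutual information in nats, where joint[x][y] = p(x, y)."""
    f = [sum(row) for row in joint]          # marginal distribution of X
    g = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:                      # convention: 0 log 0 = 0
                total += pxy * math.log(pxy / (f[x] * g[y]))
    return total

# Two illustrative joint distributions over binary X and Y.
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = f(x) g(y)
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y always equals X

print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # approximately 0.693, i.e. log 2
```

The two tables reproduce the limiting cases described earlier: independence gives zero, and identical variables give the entropy of X, which is log 2 for a fair binary variable.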

In the continuous case, we replace summation by a definite double integral:

$$I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{f(x)\,g(y)} \; dx \,dy, \!$$

where p is now the joint probability density function of X and Y, and f and g are the marginal probability density functions of X and Y respectively.
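As a rough numerical illustration of the continuous formula, the double integral can be approximated by a midpoint-rule sum on a grid. The sketch below assumes a standard bivariate normal distribution with correlation `rho`, chosen only because its mutual information has the known closed form −½ log(1 − ρ²), which serves as a sanity check. The function name, grid size, and integration bounds are illustrative assumptions.

```python
import math

def gaussian_mi_numeric(rho, half_width=6.0, n=200):
    """Midpoint-rule estimate (in nats) of I(X;Y) for a standard bivariate
    normal with correlation rho, integrated over [-half_width, half_width]^2."""
    step = 2 * half_width / n
    grid = [-half_width + (i + 0.5) * step for i in range(n)]
    # Standard normal marginal density, evaluated once per grid point.
    marginal = [math.exp(-t * t / 2) / math.sqrt(2 * math.pi) for t in grid]
    norm = 1.0 / (2 * math.pi * math.sqrt(1 - rho ** 2))
    total = 0.0
    for x, fx in zip(grid, marginal):
        for y, gy in zip(grid, marginal):
            quad = (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2))
            pxy = norm * math.exp(-quad)     # joint density p(x, y)
            if pxy > 0:                      # skip cells where the density underflows
                total += pxy * math.log(pxy / (fx * gy)) * step * step
    return total

rho = 0.5
estimate = gaussian_mi_numeric(rho)
closed_form = -0.5 * math.log(1 - rho ** 2)
print(estimate, closed_form)
```

For moderate correlations the two printed values should agree to several decimal places; as `rho` approaches ±1 the joint density becomes sharply peaked and a finer grid would be needed.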

Mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below), by subadditivity of the Shannon entropy, and symmetric (i.e. I(X;Y) = I(Y;X)).

### Relation to other quantities

Mutual information can be equivalently expressed as

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y),$$

where H(X|Y) = H(X, Y) − H(Y) and H(Y|X) = H(X, Y) − H(X) are the conditional entropies.
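The equivalence of these expressions can be checked numerically. In the sketch below the conditional entropy H(X|Y) is computed directly as the average over y of the entropy of X given Y = y, rather than from the identity itself, so the agreement is not built in. The helper names and the example joint table are assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) as sum over y of g(y) * H(X | Y = y), where joint[x][y] = p(x, y)."""
    total = 0.0
    for column in zip(*joint):               # one column per value y
        gy = sum(column)                     # marginal probability g(y)
        if gy > 0:
            total += gy * entropy([pxy / gy for pxy in column])
    return total

joint = [[0.3, 0.1], [0.2, 0.4]]                 # illustrative values only
transposed = [list(col) for col in zip(*joint)]  # swaps the roles of X and Y

h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])
h_x_given_y = conditional_entropy(joint)
h_y_given_x = conditional_entropy(transposed)

# The three expressions for I(X;Y) should agree up to rounding error.
forms = [h_x - h_x_given_y, h_y - h_y_given_x, h_x + h_y - h_xy]
print(forms)
```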

Mutual information can also be expressed in terms of the Kullback-Leibler divergence between the joint distribution of two random variables X and Y and the product of their marginal distributions. Let q(x, y) = f(x) × g(y); then

I(X; Y) = KL(p, q).
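A small numerical check of this identity, as a sketch under stated assumptions: the helper `kl`, the helper `entropy`, and the example joint table are hypothetical names and values chosen for illustration. The joint distribution p and the product q are flattened into lists in the same order before the divergence is computed.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) in nats for equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(probs):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]             # illustrative values only
f = [sum(row) for row in joint]              # marginal distribution of X
g = [sum(col) for col in zip(*joint)]        # marginal distribution of Y

p_flat = [pxy for row in joint for pxy in row]   # p(x, y), x outer, y inner
q_flat = [fx * gy for fx in f for gy in g]       # q(x, y) = f(x) g(y), same order

mi_kl = kl(p_flat, q_flat)
mi_entropy = entropy(f) + entropy(g) - entropy(p_flat)
print(mi_kl, mi_entropy)
```

Because the Kullback-Leibler divergence is never negative, this form also makes the nonnegativity of mutual information visible.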

Furthermore, let $h_y(x) = p(x,y) / g(y)$. Then

$$\begin{aligned}
I(X;Y) &= \sum_y g(y) \sum_x h_y(x) \log \frac{h_y(x)}{f(x)} \\
&= \sum_y g(y) \, \mathrm{KL}(h_y, f) \\
&= E_Y[\mathrm{KL}(h_y, f)].
\end{aligned}$$

Thus mutual information can also be understood as the expectation, over Y, of the Kullback-Leibler divergence between the conditional distribution of X given Y = y and the marginal distribution f of X: the more the conditional distributions differ from f, the greater the information gain.
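The expectation form can be illustrated in the same way. The sketch below builds the conditional distribution of X for each value y, averages its divergence from the marginal f with weights g(y), and compares the result with the direct double sum. All names and the example joint table are assumptions for illustration.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) in nats for equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]             # joint[x][y] = p(x, y), illustrative
f = [sum(row) for row in joint]              # marginal distribution of X
g = [sum(col) for col in zip(*joint)]        # marginal distribution of Y

# Expectation over Y of the divergence between the conditional and f.
expected_kl = 0.0
for y, gy in enumerate(g):
    if gy > 0:
        conditional = [row[y] / gy for row in joint]   # distribution of X given Y = y
        expected_kl += gy * kl(conditional, f)

# Direct evaluation of the double sum for comparison.
direct = sum(
    pxy * math.log(pxy / (f[x] * g[y]))
    for x, row in enumerate(joint)
    for y, pxy in enumerate(row)
    if pxy > 0
)
print(expected_kl, direct)
```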