An information theory perspective on probability


This article is an excerpt from the 44-page article “a Probability Perspective” by Eric C.R. Hehner. You can find the full article here. In that article, Hehner gives the information theory perspective presented below, then the Bayesian perspective, and then models real-world events using a new probabilistic formalism based on programming, which he calls the programmer’s perspective. His formalism allows classical paradoxes to be solved in an easy and unambiguous way.

In 1948, Claude Shannon invented information theory based on probability theory. The basic definition is entropy. Given a set of messages $m_i$, each one occurring with probability $p_i$, their entropy is defined as $-\sum_i p_i × \log(p_i)$ where $\log$ is the logarithm base 2. The messages could be letters in an alphabet, or words in a language, and the idea is that a long sequence of messages is sent from a sender to a receiver. The probability $p_i$ is the relative frequency of message $m_i$ in the sequence. Shannon referred to entropy as a measure of “uncertainty” on the part of the receiver, before receiving a message, about what message would be received next. Entropy is independent of how the messages are represented.

The word “entropy” comes from statistical mechanics, where it originally represented the amount of “disorder” in a large collection of molecules. Currently it is explained as the average energy carried by a molecule, which is related by the Boltzmann constant $k ≈ 1.38×10^{-23}$ J/K to the temperature. Although temperature is considered a macro property, and one may be reluctant to talk about the average value in a set that contains only one value, there is no harm in relating energy to temperature even for a single molecule.

$E = k×T/2$

Similarly, Shannon was reluctant to talk about the information content of each message individually, but there is no harm in doing so. If we define the information content $I_i$ of message $m_i$ as:

$I_i = -\log(p_i)$

then, the entropy:

$\sum_i p_i × I_i$

is the average information content of a message measured in bits.
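As a small numerical illustration of the two definitions above, the following Python sketch computes the information content of each message and the entropy as the probability-weighted average. The four-message distribution and the function names are invented for this example and do not come from the article.

```python
import math

def information_content(p):
    """Information content, in bits, of a message that occurs with probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Average information content, in bits, over a distribution of messages."""
    # A message with probability 0 never occurs, so it adds nothing to the average.
    return sum(p * information_content(p) for p in probabilities if p > 0)

# Hypothetical distribution over four messages (an assumption for illustration).
probs = [0.5, 0.25, 0.125, 0.125]

for p in probs:
    print(f"p = {p}: {information_content(p)} bits")

print("entropy:", entropy(probs), "bits")  # 1.75 bits for this distribution
```

The rarer a message, the more bits it carries; the entropy weights each message's bits by how often it occurs.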

In 1948 it made good sense to explain information in terms of probability; information (as a mathematical theory) was unknown, and probability (as a mathematical theory) was already well developed. But today it might make better sense to explain probability in terms of information. Most people today have a quantitative idea of what information and memory are; they talk about bits and bytes; they buy an amount of memory, and hold it in their hand; they wait for a download, and complain about the bandwidth. Many people already understand the important difference between information and memory; they compress files before sending them, and they decompress files upon receiving them.

Information theory talks about messages, but it could just as well talk about events, or outcomes of an experiment. (Perhaps a message is just a special case of event, or perhaps an event is just a special case of message.) Let us be more abstract, and dispense with events and messages. The information $I$ (in bits) associated with probability $p$ is:

$I = -\log p$

which is easily inverted:

$p = 2^{-I}$

to allow us to define probability in terms of information. The suggestion to define probability in terms of information is intended as a pedagogical technique: define the less familiar in terms of the more familiar, or perhaps I mean define the less understood in terms of the more understood. Henceforth I will be neutral on this point, making use of the relationship between them, without taking either one of them to be more basic.
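The inversion can be checked numerically. This is a minimal Python sketch; the function names and the sample probabilities are illustrative choices, not part of the original text.

```python
import math

def bits_from_probability(p):
    """I = -log2(p): information in bits associated with probability p (p > 0)."""
    return -math.log2(p)

def probability_from_bits(i):
    """p = 2^(-I): probability associated with I bits of information."""
    return 2.0 ** (-i)

# Round trip: probability -> bits -> probability.
for p in (0.5, 0.25, 0.1):
    i = bits_from_probability(p)
    print(f"p = {p} -> I = {i:.4f} bits -> p = {probability_from_bits(i):.4f}")
```

Either function can serve as the definition of the other, which is the sense in which neither quantity needs to be taken as more basic.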

Shannon explained the amount of information carried by a message as a measure of how surprised one is to learn the message. Probability is also a measure of surprise, or inversely, expectation. If there are two possibilities A and B, and I say it will probably be A, I mean that I expect to see A, and I will be surprised to see B. A numeric probability expresses the strength of my expectation to see A, and the amount of my surprise if I see B. One’s expectation and surprise may be shaped by past frequencies, or they could be shaped by considerations that apply to one-time-only events.

Scale

There are two temperature scales in common use: Fahrenheit (in the USA) and Celsius (in the rest of the world). There are formulas to convert each to the other:

$c = (f-32)×\frac{5}{9}$

and

$f = c×\frac{9}{5} + 32$

Whenever two physical quantities can be converted, each to the other, they measure the same thing on different scales. (More generally, every physical law says that there are fewer things to measure than there are variables in the law.) So energy and mass measure the same thing on different scales: $E=m×c^2$ and $m=E/c^2$.

More to the point, information and probability measure the same thing on different scales.

$I=-\log p$

and

$p=2^{-I}$

I am not sure what to call the “thing” measured on these two scales; rather than introduce a new word I shall just call it “information”.

There is another scale in common use for measuring information: the number of possible states. (This same scale applies to energy-temperature-mass too.) This is the scale preferred by people who build “model checkers” to verify the correctness of computer hardware or software. They like to say they can handle up to $10^{60}$ states, which is something like the number of atoms in our galaxy. That is a truly impressive number, until we realize that $10^{60}$ is about $2^{200}$, which is the state space of 200 bits, or about six 32-bit variables; we rapidly descend from $10^{60}$ states to 6 program variables!
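The arithmetic behind this descent from states to variables is easy to reproduce. A short Python sketch follows; the variable names are illustrative.

```python
import math

states = 10 ** 60                 # state count quoted for model checkers
bits = math.log2(states)          # the same quantity on the bit scale
variables_32bit = bits / 32       # how many 32-bit variables hold that many bits

print(f"{bits:.1f} bits")                      # a little under 200
print(f"{variables_32bit:.2f} 32-bit variables")  # a little over 6
```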

In order to write the conversion formulas among the three scales neatly, I need unit names for each of them. We already have the “bit” and the “state”; I am missing a unit for the probability scale, so let me invent the “chance”. (All three of these units are non-physical; they are alternative names for unity (pure numbers).) Here are the conversions.

$b$ bit	= $2^b$ state	= $2^{–b}$ chance
$s$ state	= $1/s$ chance	= $\log s$ bit
$c$ chance	= $-\log c$ bit	= $1/c$ state

Let’s look at three example points on these scales.

0 bit	= 1 state	= 1 chance
1 bit	= 2 state	= 0.5 chance
$\infty$ bit	= $\infty$ state	= 0 chance

On the middle line, 1 bit is the amount of information needed to tell us which of 2 states we are in, or has occurred, or will occur, and that corresponds to probability 1/2 chance for each state. On the top line, 0 bits is the amount of information needed to tell us which state if there is only 1 state, and that corresponds to 1 chance (certainty). On the bottom line, it takes $\infty$ bits to tell us that something impossible is occurring (Shannon would say that we are infinitely surprised). (I say “certain” for probability 1 and “impossible” for probability 0 and I don’t care about any measure-theoretic difference.) […]
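The conversions among the three scales, and the three example points, can be sketched in Python as follows. The function names are invented for illustration, and representing the impossible case with floating-point infinity (`math.inf`) is an assumption of this sketch.

```python
import math

def bits_to_states(b):
    return 2.0 ** b

def bits_to_chance(b):
    return 2.0 ** (-b)

def states_to_bits(s):
    return math.log2(s)

def states_to_chance(s):
    return 1.0 / s

def chance_to_bits(c):
    # Probability 0 (impossible) corresponds to infinitely many bits.
    return math.inf if c == 0 else math.log2(1.0 / c)

def chance_to_states(c):
    # Probability 0 corresponds to infinitely many states.
    return math.inf if c == 0 else 1.0 / c

# The three example points: 0 bit, 1 bit, and infinitely many bits.
for b in (0, 1, math.inf):
    print(f"{b} bit = {bits_to_states(b)} state = {bits_to_chance(b)} chance")
```

Each pair of functions is mutually inverse, so any of the three scales can be used as the starting point.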

The “problem of prior probabilities” is the problem of how Bayesians justify the assumption that the initial probability distribution is uniform across all states. I suggest that there is no “assumption” being made, and so no need for “justification”.

Saying that there are 4 states is saying, on another scale, that the probability is 1/4, and on yet another scale that 2 bits are required to specify the situation. If we then learn that one of the states never occurs, we adjust: there are 3 states (that occur); each of the (occurring) states has probability 1/3 (and any nonoccurring state has probability 0); it takes about 1.585 bits to identify a state that occurs (and infinitely many bits to identify any nonoccurring state).

To be less extreme, if we learn that one of the four states rarely occurs, then we adjust: as a measure of information, there are fewer than 4 but more than 3 states; each commonly occurring state has a probability between 1/4 and 1/3, and the rarely occurring state has a probability between 0 and 1/4; it takes somewhere between 1.585 and 2 bits to identify any of the commonly occurring states, and somewhere between 2 and ∞ bits to identify the rarely occurring state. In general, having no prior information about which of n states occurs is probability 1/n for each state, not by assumption, but by a change of scale. […]
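These ranges can be checked with concrete numbers. In the Python sketch below, the probabilities 0.3, 0.3, 0.3, 0.1 are an invented example of one rarely occurring state. Reading "number of states, as a measure of information" as 2 raised to the entropy is one possible interpretation assumed here; the text above does not give a formula for it.

```python
import math

def bits(p):
    """Bits needed to identify a state of probability p (infinite if p is 0)."""
    return math.inf if p == 0 else -math.log2(p)

def effective_states(probs):
    """Assumed interpretation: 2 ** entropy, the state count on the bit scale."""
    h = sum(p * bits(p) for p in probs if p > 0)
    return 2.0 ** h

uniform = [0.25, 0.25, 0.25, 0.25]   # four states, no other information
never = [1 / 3, 1 / 3, 1 / 3, 0.0]   # one state never occurs
rare = [0.3, 0.3, 0.3, 0.1]          # hypothetical: one state rarely occurs

for name, probs in (("uniform", uniform), ("never", never), ("rare", rare)):
    per_state = [round(bits(p), 3) for p in probs]
    print(name, "bits per state:", per_state,
          "effective states:", round(effective_states(probs), 3))
```

With these numbers each common state needs between 1.585 and 2 bits, the rare state needs more than 2 bits, and the effective state count falls between 3 and 4.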

What is the point of having several scales on which to measure the same quantity? If they are Fahrenheit and Celsius for measuring temperature, there is no point at all; they are linear translations of each other, and the duplication is just annoying. A slide rule multiplies two numbers by transforming them to a logarithmic scale, where the multiplication is transformed into the simpler operation of addition, and then transforms the result back. Fourier transforms are used for the same reason. Similarly, perhaps some information calculations are easier on the chance (probability) scale, others on the bit scale, and still others on the state scale. Thus they might all be useful.
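The slide-rule analogy carries over directly to the chance and bit scales. As a hedged illustration, assume several independent events (independence is an assumption added here, as are the sample probabilities): their joint probability is a product on the chance scale and a sum on the bit scale.

```python
import math

# Hypothetical probabilities of three independent events.
chances = [0.5, 0.25, 0.125]

# Chance scale: probabilities of independent events multiply.
joint_chance = math.prod(chances)

# Bit scale: the corresponding information contents add.
total_bits = sum(-math.log2(c) for c in chances)

# Transforming the sum back gives the same joint probability.
back_on_chance_scale = 2.0 ** (-total_bits)

print(joint_chance, total_bits, back_on_chance_scale)
```

Which scale is more convenient depends on the calculation: sums of bits are easy to accumulate, while the chance scale is the natural one when probabilities of alternatives have to be added.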
