Extending logic to deal with uncertainty

Mar 12, 2018

This article sketches a construction of probability calculus as an extension of classical logic to account for uncertainty, so that, by construction, it can be used to automate or bulletproof our everyday decisions. This has applications both in artificial intelligence and decision theory.

The true logic of this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind. – James Clerk Maxwell (1850) from [1]

I have never been satisfied by the usual presentation of probability theory using Kolmogorov’s axioms or any other formalism that uses sets. I don’t care about sets or measures. What I care about is deduction and inference; I care about logic and using a mathematical theory to bulletproof my decision making, or to automate my decisions using artificial intelligence. Classical logic is not sufficient for that because it can’t deal with uncertainty like humans do. So we need an extension of it that is able to deal with various degrees of certainty.

The main point of this article is to emancipate probability calculus from the usual construction of probability theory. So this article is not about events that follow a distribution when an experiment is repeated multiple times. Nor is it about sigma-algebras. It is about constructing a new formal calculus from scratch that is based on classical logic and extends it to account for uncertainty. It turns out that this calculus has the same rules as probability calculus and thus provides an alternative view of probability theory. We use the same notations to ease the comparison.

If you want to gain an alternative view on probability, or convince yourself that probability calculus is a valid theory upon which to base your decisions, this article is for you. You will need a basic understanding of formal logic (proposition, and, or, not, imply) to follow the exposition. I won’t explain here how to use the rules of the newly constructed calculus to automate or improve your decisions. Rather, I sketch how such a calculus can be constructed from classical logic. The construction is not complete because my purpose is to show that it can be done and roughly illustrate how.

To be clear: the purpose of this construction is not to provide a mathematically stronger theory of probability. The goal is pedagogical: our construction explains the intuition behind the theory and why it can be used successfully in the real world. In a way, it’s about constructing a physical theory rather than a mathematical one, even though the construction is completely rigorous and mathematically justified (by Cox’s theorem).

The following exposition is a summary of the first few chapters of the book Probability Theory: The Logic of Science by E. T. Jaynes. In the book, the concepts are introduced slowly and accompanied by many conceptual remarks and insights that make Jaynes’ writing delightful. I strongly encourage you to read this book, or at least its preface and introduction.

In his book, Jaynes explains how his construction of probabilities differs from the traditional one:

Our system of probability, however, differs conceptually from that of Kolmogorov in that we do not interpret propositions in terms of sets, but we do interpret probability distributions as carriers of incomplete information. Partly as a result, our system has analytical resources not present at all in the Kolmogorov system.

What does “probability” mean?

Throughout this article, I will use the term probability. Here is how to interpret it: the probability of an hypothesis is a measure of our confidence in it. It is subjective and depends primarily on our information about that hypothesis.

For instance, your confidence that it will rain soon might increase if you see dark clouds in the sky. Using the formalism, the confidence in a hypothesis $A$ might increase if we learn evidence $B$, which we will note: $P(A|B) > P(A)$.

To say it again: probabilities quantify our confidence based on the amount of information we have.

This view has a concrete impact on how to use the results of probability theory. Suppose for instance that you draw balls from an urn.

You know that the urn contains $N$ balls, of which $M$ are red. You draw a ball blindfolded, without knowing its color, and only then look at it. The probability that the ball you have drawn is red is $M/N$.

This means that your confidence that the ball will be red is $M/N$. But this probability assignment is not an assertion of any physical property of the urn or its content; it is a description of your state of knowledge prior to the drawing. Therefore, it is illogical to speak of verifying this probability by performing experiments with the urn. The probability is not an expected frequency.

Probabilities are not properties of the real world, but we can use them to infer physical predictions. For instance, we can compute the most probable fraction of red balls in a sample of $n$ draws and confront this prediction with the observed frequency over multiple samples. If we do the math, the most probable fraction is almost equal to $M/N$, which can lead to confusion since it is close to our probability estimate. But the two are conceptually different, and in practice their values differ slightly.
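To make the distinction concrete, here is a minimal Python sketch with made-up values (an urn of $N = 10$ balls with $M = 3$ red ones and a sample of $n = 7$ draws; none of these numbers come from the article): it finds the most probable number of red balls in the sample and compares the corresponding fraction with $M/N$.

```python
from math import comb

# Made-up illustrative values: an urn with N balls, M of them red,
# from which we draw a sample of n balls without replacement.
N, M, n = 10, 3, 7

def pmf(k):
    """Probability of getting exactly k red balls in the sample (hypergeometric)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# Most probable number of red balls in the sample.
k_star = max(range(min(M, n) + 1), key=pmf)

print("probability that a single draw is red:", M / N)              # 0.3
print("most probable fraction of red in the sample:", k_star / n)   # 2/7 ~ 0.286
```

The two numbers are close but not equal: the first describes our state of knowledge about a single draw, the second is a prediction about a sample that we could confront with experiments.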

See also: Probability from the Information theory vantage point

Classical logic is not enough

Classical logic is concerned with certainty: a proposition is either true or false. The rule for deducing true propositions from others is:

given $A \implies B$: if $A$ is true, then $B$ is true

For instance, consider the two propositions:

$A$ ≡ “it will start to rain by 10 a.m. at the latest”
$B$ ≡ “the sky will be cloudy before 10 a.m.”

Then it is true that $A \implies B$: if $A$ (it will start to rain by 10 a.m. at the latest), then $B$ (the sky will be cloudy before 10 a.m.).

So, if we know $A$, then the rule says that we know $B$. But what if we know $B$ and ask about $A$? Classical logic doesn’t tell us anything about $A$.

But in our everyday lives, we constantly use knowledge about $B$ to quantify our certainty about $A$. For instance, if $B$ is true (the sky is cloudy before 10 a.m.), then $A$ becomes more plausible (I’m more confident that it will start raining by 10 a.m. at the latest). Classical logic can’t help us with this kind of inference, and that’s why we are looking to extend it with a formalism that can.

In particular, here are some inferences that classical logic can’t deal with but that we constantly use:

3) given $A \implies B$: if $B$ is true, then $A$ is more plausible
4) given $A \implies B$: if $A$ is false, then $B$ is less plausible
5) given “if $A$ is true, then $B$ is more plausible”: if $B$ is true, then $A$ is more plausible

For instance, it’s rule 3 that we used previously: if $B$ is true (the sky is cloudy before 10 a.m.), then $A$ is more plausible (I’m more confident that it will rain by 10 a.m. at the latest).

We will see at the end of this article how probability theory justifies these points.

The object of this article is to develop the mathematical theory which answers questions such as: What determines whether the probability for $A$ increases by a large amount, raising it almost to certainty; or only by a negligibly small amount, making the data $B$ almost irrelevant?

Before we dive into the construction of the theory, recall the following about formal logic:

the conjunction $AB$ (“$A$ and $B$”) is true if and only if both $A$ and $B$ are true
the negation $\bar{A}$ (“not $A$”) is true if and only if $A$ is false
the disjunction $A + B$ (“$A$ or $B$”) is true as soon as at least one of $A$ or $B$ is true

And in particular:

$A + B$ is equivalent to $\overline{\bar{A}\,\bar{B}}$ (“not (not $A$ and not $B$)”)
$A \implies B$ is equivalent to $\overline{A\bar{B}}$ (“not ($A$ and not $B$)”)

Which means that we only need a suitable definition of “and” and “not” to construct both “or” and “implies”.
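As a quick sanity check of these two equivalences, here is a minimal Python sketch that enumerates the four possible truth assignments:

```python
from itertools import product

# Verify, over all truth assignments, that "or" and "implies" can be
# rewritten using only "and" and "not".
for a, b in product([True, False], repeat=2):
    assert (a or b) == (not ((not a) and (not b)))   # A + B  ==  not(not A and not B)
    assert ((not a) or b) == (not (a and (not b)))   # A => B ==  not(A and not B)

print("both equivalences hold for every truth assignment")
```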

An extension of logic to deal with confidence levels

We want to construct a formalism that will allow us to reason under uncertainty like we did in points (3, 4 and 5) of the previous section. Our approach is to establish a list of reasonable desiderata and then to obtain the formalism as a consequence of this list. This type of mathematical reasoning is called analysis-synthesis: we use the desiderata to derive necessary conditions on the theory being constructed, and these conditions turn out to be strong enough to determine the construction uniquely.

The desiderata are chosen so that a rational person, on discovering that they were violating one of them, would wish to revise their thinking. They are broadly stated below:

  1. Degrees of probability are represented by real numbers;
  2. Qualitative correspondence with common sense;
  3. Consistency 1: if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result;
  4. Consistency 2: the theory always takes into account all the evidence it has and does not arbitrarily ignore some of the information;
  5. Consistency 3: the theory always represents equivalent states of knowledge by equivalent probability assignments.

You can think of the theory being constructed as the rules a robot will follow to make decisions and compute its own confidence in hypotheses.

I will not reconstruct the whole theory here and you should refer to the book to see the completed construction. Instead, I will only give an example of how the desiderata can be used to establish the rules of the new theory.

Since this theory is about computing confidence levels based on the information we have about a hypothesis, we need a notation to make explicit what information is taken into account. So we will use the notation $P(H|E)$: the probability of hypothesis $H$ given that evidence $E$ is true. I will often use $E$ for “evidence” or $X$ for “background information”.

Note: a consequence of the desiderata for our theory is that it can’t selectively ignore some established evidence, so we must take every piece of established information into account when computing our probability estimates.

Estimate $P(AB|C)$

If you recall from the previous section, classical logic can be built using only “and” and “not”. So our first objective will be to construct a probability equivalent of “and”.

Our goal is thus to relate the probability $P(AB|C)$ that both $A$ and $B$ are true given evidence $C$ with probabilities that involve $A$ and $B$ separately, such as $P(A|C)$, $P(B|C)$ or $P(A|BC)$.

In classical logic, to establish that $AB$ is true, we must first establish that $B$ is true and then establish that $A$ is true. Or the other way around, with $A$ first and $B$ second. To estimate our confidence in the fact that $AB$ is true, we can estimate our confidence in each of the two steps.

| step | confidence in that step |
| --- | --- |
| decide that $AB$ is true | $P(AB \mid C)$ |
| - decide that $B$ is true | $P(B \mid C)$ |
| - having accepted $B$ as true, decide that $A$ is true | $P(A \mid BC)$ |
  • In order for $AB$ to be a true proposition, it is necessary that $B$ is true. So the probability $P(B|C)$ should be involved.
  • If $B$ is true, then it is further necessary that $A$ be true, so the probability $P(A|BC)$ is also needed.
  • But if $B$ is false, then $AB$ will be false independently of whatever one knows about $A$. So our estimate will not depend on $P(A|C)$.
  • Using similar arguments, we can establish that our estimate only depends on $P(B|C)$ and $P(A|BC)$.

So we can state the following rule, where $F$ is a function that indicates the dependence:

$$P(AB|C) = F\big(P(B|C),\ P(A|BC)\big)$$

Do we really need to take the evidence $B$ into account?

We could be tempted to use the dependence:

$$P(AB|C) = F\big(P(A|C),\ P(B|C)\big)$$

where the estimate for $A$ does not use the evidence $B$. But this dependence is flawed, as shown by taking:

  • $A$: the next person you meet has a blue left eye
  • $B$: the next person you meet has a green left eye
  • $C$: you will meet someone soon

In which case, both $P(A|C)$ and $P(B|C)$ are quite plausible, but $P(AB|C)$ is not: nobody has a left eye that is both blue and green.
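A toy numerical illustration of why no such $F$ can exist (the numbers are made up for the sake of the example):

```python
# Made-up illustrative numbers: given C ("you will meet someone soon"),
# each eye colour is separately quite plausible.
p_A_given_C = 0.3   # P(A|C): blue left eye
p_B_given_C = 0.3   # P(B|C): green left eye

# If P(AB|C) were a function F of these two numbers only, then F(0.3, 0.3)
# would have to equal P(AB|C) in every situation with these marginals. But:
#   - for A = "blue left eye", B = "green left eye": P(AB|C) = 0 (impossible),
#   - for A = B = "blue left eye":                   P(AB|C) = P(A|C) = 0.3.
# The same inputs would need two different outputs, so no such F exists.
for required_output in (0.0, p_A_given_C):
    print(f"F({p_A_given_C}, {p_B_given_C}) would have to equal {required_output}")
```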

How do we determine $F$?

Using the list of desiderata for our theory, which we confront with this form of $F$, we can find the properties of $F$. For instance, using the requirement of structural consistency and the fact that the “and” of Boolean algebra is associative, we find:

$$F\big(F(x, y),\ z\big) = F\big(x,\ F(y, z)\big)$$

From which we can deduce that:

$$w\big(F(x, y)\big) = w(x)\, w(y)$$

for some function $w$. Another requirement will then allow us to find the properties of $w$; and so on, until we establish the properties of the whole theory and finally set $P = w^m$ for a fixed positive value of $m$.

The above derivation is only intended as an illustration of the reasoning needed to construct the theory. I’m skipping details on purpose. If you want to know more about this construction, check out the Wikipedia article about Cox’s theorem.
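To make the shape of the solution concrete, here is a small numerical sketch. The particular $w(x) = x/(1-x)$ below is an arbitrary choice for illustration; the point is that any $F$ of the form $F(x, y) = w^{-1}\big(w(x)\,w(y)\big)$ automatically satisfies the associativity equation above.

```python
import math
import random

# An arbitrary strictly increasing w on (0, 1), chosen only for illustration,
# and the F it generates: F(x, y) = w^{-1}(w(x) * w(y)).
def w(x):
    return x / (1.0 - x)

def w_inv(t):
    return t / (1.0 + t)

def F(x, y):
    return w_inv(w(x) * w(y))

# The structural consistency desideratum requires F(F(x, y), z) == F(x, F(y, z));
# an F built from a w in this way satisfies it automatically.
random.seed(0)
for _ in range(1000):
    x, y, z = (random.uniform(0.01, 0.99) for _ in range(3))
    assert math.isclose(F(F(x, y), z), F(x, F(y, z)), rel_tol=1e-9)

print("associativity equation satisfied on all sampled triples")
```

Cox’s theorem goes in the other direction: it shows that, up to a change of scale $w$, this multiplicative form is essentially the only way to satisfy the desiderata.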

Rules of the new logic

Once every requirement for the theory has been used to establish necessary conditions, we find the founding rules of the theory:

The “and” rule expresses our confidence in $AB$ given evidence $C$:

$$P(AB|C) = P(A|BC)\,P(B|C) = P(B|AC)\,P(A|C)$$

The “not” rule expresses our confidence in $\bar{A}$ given evidence $C$:

$$P(A|C) + P(\bar{A}|C) = 1$$

Using these two founding rules, we can estimate our confidence in $A + B$:

$$P(A + B|C) = P(A|C) + P(B|C) - P(AB|C)$$
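As an illustration that these rules fit together consistently, here is a minimal Python sketch over a made-up joint distribution for two binary propositions (the numbers are arbitrary; they only need to sum to 1):

```python
# A made-up joint distribution over two binary propositions A and B, given C.
joint = {
    (True, True): 0.20,
    (True, False): 0.30,
    (False, True): 0.10,
    (False, False): 0.40,
}

def p(event):
    """Probability, given C, of the event described by event(a, b)."""
    return sum(pr for (a, b), pr in joint.items() if event(a, b))

p_a = p(lambda a, b: a)                 # P(A|C)
p_b = p(lambda a, b: b)                 # P(B|C)
p_ab = p(lambda a, b: a and b)          # P(AB|C)
p_a_given_b = p_ab / p_b                # P(A|BC)

assert abs(p_ab - p_a_given_b * p_b) < 1e-12                       # "and" rule
assert abs(p(lambda a, b: not a) - (1 - p_a)) < 1e-12              # "not" rule
assert abs(p(lambda a, b: a or b) - (p_a + p_b - p_ab)) < 1e-12    # "or" rule

print(p_a, p_b, p_ab)   # 0.5 0.3 0.2
```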

The principle of indifference

A nice consequence of the theory developed in the book is the proof of the principle of indifference. This principle states that if, given background information $B$, the hypotheses $(H_1, \dots, H_N)$ are mutually exclusive and exhaustive, and $B$ does not favor any one of them over any other, then:

$$P(H_i|B) = \frac{1}{N} \quad \text{for } 1 \le i \le N$$

In the book, the principle is derived using permutations and the requirement “consistency 3” only. In other words, we haven’t made any hypothesis about the (uniform) distribution of the $H_i$. Actually we haven’t even talked about distributions at all.

So it’s the information fed into the theory that determines the definite numerical values of the $P(H_i|B)$.

I really like the way the principle was derived because it appeals to my way of thinking in everyday life. When confronted with two alternatives, if I don’t have any information indicating that I should have more confidence in one than the other, I always assume probabilities 1/2 for each.

If we relate the principle to information theory, then each hypothesis carries the same amount of information $I$. Using the conversion formula between information and probabilities we find that $I = \log_2 N$, so $P(H_i|B) = 2^{-I} = 1/N$. You can read more about the link between probabilities and information in this article.

The principle can be further applied, using the “or rule”, to show that when we draw a ball from an urn containing 3 black balls and 7 white balls, we have confidence $P(\text{black}|B) = 3/10$ that we will draw a black ball (where $B$ is some background information). This problem is called the Bernoulli urn problem.
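A minimal sketch of that computation (labelling the balls 1 to 10 and calling balls 1 to 3 black is an arbitrary choice): indifference assigns probability $1/10$ to each hypothesis “the drawn ball is ball $i$”, and since these hypotheses are mutually exclusive, the “or” rule reduces to a sum.

```python
from fractions import Fraction

n_balls = 10
black_balls = {1, 2, 3}   # which labels are black is an arbitrary choice

# Principle of indifference: the background information B does not favour any
# ball over any other, so each "the drawn ball is ball i" gets probability 1/10.
p_ball = {i: Fraction(1, n_balls) for i in range(1, n_balls + 1)}

# The hypotheses are mutually exclusive, so the "or" rule reduces to a sum.
p_black = sum(p_ball[i] for i in black_balls)

print(p_black)  # 3/10
```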

See also: Key ideas in probability and statistics illustrated with the Bernoulli Urn

The beauty of this approach is that we obtain this classical result of probability theory (the Bernoulli urn) without any arbitrary assumption, and without defining a formula for $P(\text{black}|B)$. Instead, it’s our previously established calculus rules, along with a requirement for consistency, that dictate the numerical value. Had we chosen any other numerical value, we would get a contradiction with one of our previous rules.

Contrast this result that we obtained as a consequence of our theory with the original mathematical definition of probability: “the probability for an event is the ratio of the number of cases favorable to it, to the number of all cases possible when nothing leads us to expect that any one of these cases should occur more than any other, which renders them, for us, equally possible.” (Laplace in Theorie Analytique des Probabilites, 1812).

The definition given by Laplace seems arbitrary to me. It looks sensible, but what tells us that Laplace (or Bernoulli before him) wasn’t mistaken? Our new theory confirms Bernoulli and Laplace’s intuition in a formal setting.

Inference in the new logic

Our extension of propositional logic is of course compatible with the rules we already had. I extracted this section into its own article: Propositional logic from probability calculus

Recall the inference rules we wished to model:

3) given $A \implies B$: if $B$ is true, then $A$ is more plausible
4) given $A \implies B$: if $A$ is false, then $B$ is less plausible
5) given “if $A$ is true, then $B$ is more plausible”: if $B$ is true, then $A$ is more plausible

Using probability theory as constructed above, we can prove these inference rules are valid. For instance, given the rule $A \implies B$, the information that $B$ is true increases the probability for $A$. I extracted this section into its own article: Why bayesian inference is more powerful than logic
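As a one-line sketch of that first rule (writing $X$ for the background information): if $A \implies B$, then $P(B|AX) = 1$, and the product rule gives

$$P(A|BX) = \frac{P(AB|X)}{P(B|X)} = \frac{P(B|AX)\,P(A|X)}{P(B|X)} = \frac{P(A|X)}{P(B|X)} \ge P(A|X),$$

since $P(B|X) \le 1$: learning $B$ can only raise (or at worst leave unchanged) our confidence in $A$.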

See how to use this construction of probability calculus in my next article: Key ideas in probability and statistics illustrated on a simple problem

Confront our extension with propositional logic: Propositional logic from probability calculus

Or read more about the bayesian philosophy of probability theory: A Bayesian perspective

References

[1] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.