In a previous article I showed that the inference rules of propositional logic can be obtained from probability calculus. But actually, we can obtain much more, and even explain why most people don’t reason the way propositional logic prescribes.
In this article, we will see that there is more to the traditional implication than the framework of propositional logic suggests. Using probability calculus and Bayesian inference, we will show why most people mistakenly read the A⇒B implication backwards… and why they are not completely mistaken after all.
Let’s take an example. Yesterday evening, my friend Bob was on his way to a party. He told me this: “If I can kiss Alice during the party, I will go to the cinema with her tomorrow evening”. I haven’t met Bob since the party, but a friend saw Alice and him at the cinema for this evening’s showing.
Did you assume that Bob managed to get a kiss from Alice during the party? Logicians know this shortcut too well. In propositional logic, nothing tells us that Bob kissed Alice. Maybe they didn’t kiss and still went to the cinema today.
But using probability calculus, we can show that our probability estimate for their kiss increased when we learned they went to the cinema. This is what I will prove now.
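To get a feel for the result before the proof, here is a minimal numerical sketch of the Bayesian update for Bob’s story. The prior of 0.5 for the kiss and the 0.3 probability that they go to the cinema even without a kiss are made-up numbers, chosen purely for illustration; only the direction of the change matters.

```python
# Made-up numbers for the Bob/Alice story; only the direction of the update matters.
p_kiss = 0.5                # prior belief that Bob got his kiss at the party
p_cinema_if_kiss = 1.0      # Bob keeps his word: "if kiss, then cinema"
p_cinema_if_no_kiss = 0.3   # they might go to the cinema anyway

# Bayes' rule: p(kiss | cinema) = p(cinema | kiss) * p(kiss) / p(cinema)
p_cinema = p_cinema_if_kiss * p_kiss + p_cinema_if_no_kiss * (1 - p_kiss)
p_kiss_given_cinema = p_cinema_if_kiss * p_kiss / p_cinema

print(p_kiss, p_kiss_given_cinema)  # 0.5 -> ~0.77: the kiss became more plausible
```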
Without further ado, let’s dive in.
A and B are propositions and I use the convention $p_e(\cdot) = p(\cdot \mid e)$. If you need a cheatsheet about probability calculus or the notations I use, check this out.
If we let e = “$A \Rightarrow B$”, then $p_e(B \mid A) = 1$ by definition of e. I already showed in a previous article that e is enough to derive the usual equivalent forms $A \Rightarrow B \equiv \bar{B} \Rightarrow \bar{A} \equiv \bar{A} + B$ using probability calculus. That article established the following rules:
- If A is true, then B is true.
- If B is false, then A is false.
But actually, given this rule A⇒B, we can show that:
- If B is true, then A is more likely.
- If A is false, then B is less likely.
Part 1: when A is more likely
I will now show that we can rewind the arrow: given a rule such as A⇒B, the probability of A increases when we gain information about B. From the propositional logic vantage point, this is surprising, because information about B doesn’t tell us anything about A. As we will see, in probability calculus it’s a completely different story. This could explain why most people mistakenly use A⇒B as B⇒A, even though the two are completely different in propositional logic.
Let e be evidence such that $p_e(B) < p_e(B \mid A)$. For instance, the rule A⇒B is such evidence, provided that $p_e(B) \neq 1$. But so is the weaker rule: A ⇒ more_plausible(B).
We will show that given evidence e, we also have $p_e(A) < p_e(A \mid B)$, which means that evidence for B increases our belief in A.
We have:

$$p_e(AB) = p_e(A \mid B)\, p_e(B) \quad\text{and}\quad p_e(AB) = p_e(B \mid A)\, p_e(A) \;\Rightarrow\; \frac{p_e(A \mid B)}{p_e(A)} = \frac{p_e(B \mid A)}{p_e(B)}$$
Hence:

$$p_e(B) < p_e(B \mid A) \;\Rightarrow\; 1 < \frac{p_e(B \mid A)}{p_e(B)} \;\Rightarrow\; 1 < \frac{p_e(A \mid B)}{p_e(A)} \;\Rightarrow\; p_e(A) < p_e(A \mid B)$$
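As a quick sanity check of this derivation, here is a small numerical sketch. The joint distribution over A and B is one I made up; the only constraint it has to satisfy is $p_e(B \mid A) = 1$, i.e. the rule A⇒B.

```python
# An assumed joint distribution p_e(A, B) in which the rule A => B holds
# (the cell "A and not B" has probability 0). The numbers are illustrative only.
p_A_and_B, p_A_and_notB = 0.30, 0.00
p_notA_and_B, p_notA_and_notB = 0.20, 0.50

p_A = p_A_and_B + p_A_and_notB      # 0.3
p_B = p_A_and_B + p_notA_and_B      # 0.5
p_B_given_A = p_A_and_B / p_A       # 1.0, as required by A => B
p_A_given_B = p_A_and_B / p_B       # 0.6, by Bayes' rule

assert p_B < p_B_given_A            # the premise:    p_e(B) < p_e(B | A)
assert p_A < p_A_given_B            # the conclusion: p_e(A) < p_e(A | B)
```

The same check goes through with any numbers for which $p_e(B \mid A)$ merely exceeds $p_e(B)$, which is the weaker rule mentioned above.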
And that’s why, given the rule A⇒B, the probability of A increases when we learn that B is true, even though propositional logic doesn’t tell us anything about A.
Part 2: when B is less likely
Actually, we can even show that when A⇒B, the probability estimate for B decreases when we know that A is false!
I will now show that if $p_e(B) < p_e(B \mid A)$, then $p_e(B \mid \bar{A}) < p_e(B)$:
$$p_e(B \mid \bar{A}) = \frac{p_e(\bar{A} \mid B)}{p_e(\bar{A})} \cdot p_e(B) = \frac{1 - p_e(A \mid B)}{1 - p_e(A)} \cdot p_e(B)$$

But we showed that $p_e(A) < p_e(A \mid B)$, so the fraction is less than 1. This proves that:
$$p_e(B \mid \bar{A}) < p_e(B)$$
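The same assumed joint distribution used above also illustrates Part 2 numerically: once we learn that A is false, the probability of B drops.

```python
# Same illustrative joint distribution as in Part 1 (A => B holds, numbers assumed).
p_A_and_B, p_A_and_notB = 0.30, 0.00
p_notA_and_B, p_notA_and_notB = 0.20, 0.50

p_B = p_A_and_B + p_notA_and_B              # 0.5
p_notA = p_notA_and_B + p_notA_and_notB     # 0.7
p_B_given_notA = p_notA_and_B / p_notA      # ~0.286

assert p_B_given_notA < p_B                 # knowing "not A" makes B less likely
```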