I basically know of two principles for treating complicated systems in simple ways: the first is the principle of modularity and the second is the principle of abstraction. I am an apologist for computational probability in machine learning because I believe that probability theory implements these two principles in deep and intriguing ways — namely through factorization and through averaging. Exploiting these two mechanisms as fully as possible seems to me to be the way forward in machine learning. — Michael Jordan, 1997
- Murphy 10
- Murphy 19
- McElreath Statistical Rethinking
- Joint Distributions
- Chain Rule
- Conditional Independence Assumptions
- Directed Graphical Models
- Conditional Independence in DGMs
- Bayes Ball Algorithm
- Undirected Graphical Models
- Conditional Independence in UGMs
Machine learning tasks can be re-expressed as manipulations of a joint probability distribution.
If we model a joint distribution over these variables, then common tasks reduce to conditioning:
- Regression: $p(Y | X) = \frac{p(X,Y)}{p(X)} = \frac{p(X,Y)}{\int p(X,Y)\, dY}$
- Classification: $p(C | X) = \frac{p(X,C)}{p(X)} = \frac{p(X,C)}{\sum_C p(X,C)}$
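A minimal sketch of classification as conditioning on the joint, assuming discrete $X$ and $C$; the joint table below is entirely made up:

```python
# Classification as conditioning: p(C | X) = p(X, C) / sum_C p(X, C).
import numpy as np

# Hypothetical joint distribution p(X, C) over 4 input states and 3 classes.
joint = np.array([
    [0.10, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.02, 0.08, 0.10],
    [0.10, 0.05, 0.15],
])
assert np.isclose(joint.sum(), 1.0)

x = 2                                   # observed input state
p_x = joint[x].sum()                    # marginal p(X = x), the normalizer
p_c_given_x = joint[x] / p_x            # p(C | X = x)
print(p_c_given_x, p_c_given_x.sum())   # sums to 1
```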
The central object of our interest is the joint distribution
$$ p_\theta(X) = p(X_1, X_2, X_3, \dots | \theta) $$
over multiple, possibly correlated, variables.
Some of the variables may be observed, such as words in a document, or unobserved, such as missing characters in a damaged document fragment. Some variables, which we will call latent, represent unobserved abstractions, such as the topic of the document.
- Probabilistic: How do we compactly represent the joint distribution?
- Reasoning: How do we efficiently infer one set of variables given information about another set?
- Learning: How can we learn the parameters of the joint representation from data?
The Chain Rule of Probability: We can always represent a joint distribution, given any ordering of the variables, by $$p(X_{1:N}) = p(X_1)\,p(X_2 | X_1)\,p(X_3 | X_2, X_1)\cdots p(X_N | X_{1:N-1})$$
New problem: representing the conditional distributions compactly, e.g. $p(X_N | X_{1:N-1})$.
Suppose all variables are discrete with $K$ possible states each.
We can represent $p(X_1)$ as a table of $K$ entries ($K-1$ free parameters).
Similarly, $p(X_2 | X_1)$ is a table of $K^2$ entries, $p(X_3 | X_2, X_1)$ a table of $K^3$ entries, and so on.
The problem with appealing only to the chain rule is that the final conditional distribution $p(X_N | X_{1:N-1})$ requires a table of $K^N$ entries, which is exponential in the number of variables.
The values in these tables are the parameters $\theta$ of the model.
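A quick sketch of the blow-up, assuming binary variables ($K = 2$) and $N = 20$ (both numbers chosen only for illustration):

```python
# Parameter counts for discrete tables: the full joint and even the single
# final factor of the chain rule are exponential in N.
K, N = 2, 20
full_joint_params = K**N - 1                     # unrestricted joint table
last_conditional_params = K**(N - 1) * (K - 1)   # just p(X_N | X_{1:N-1})
print(full_joint_params)        # 1048575
print(last_conditional_params)  # 524288
```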
Our primary tool is to introduce conditional independence (CI) assumptions into the model.
The definition of CI is that the conditional joint distribution factorizes: $$ X \perp Y | Z \iff p(X, Y | Z) = p(X | Z)\,p(Y | Z)$$
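A small numeric check of this factorization, with made-up probability tables built so that the CI holds by construction:

```python
# Verify X ⊥ Y | Z numerically: p(X, Y | Z=z) equals p(X | Z=z) p(Y | Z=z)
# for every value of z when the joint is built as p(Z) p(X | Z) p(Y | Z).
import numpy as np

p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.9, 0.2],   # p(X | Z), columns indexed by Z
                        [0.1, 0.8]])
p_y_given_z = np.array([[0.6, 0.5],
                        [0.4, 0.5]])
joint = np.einsum('k,ik,jk->ijk', p_z, p_x_given_z, p_y_given_z)  # p(x, y, z)

for z in range(2):
    p_xy_z = joint[:, :, z] / joint[:, :, z].sum()   # p(X, Y | Z=z)
    p_x_z = p_xy_z.sum(axis=1, keepdims=True)        # p(X | Z=z)
    p_y_z = p_xy_z.sum(axis=0, keepdims=True)        # p(Y | Z=z)
    assert np.allclose(p_xy_z, p_x_z * p_y_z)
```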
PGMs are tools to represent joint distributions by introducing CI assumptions. Nodes in the graph represent the variables. (Lack of) edges in the graph represent CI assumptions.
So important that this was the name of CSC412 back in my day.
DGMs are PGMs whose graph is a Directed Acyclic Graph (DAG).
AKA
- Belief Networks.
- Bayesian Networks or Bayes Nets (but this is easily confused with Bayesian Neural Networks).
- Causal Networks, because the directed edges are sometimes (mis?)interpreted as causal relations. As discussed previously, DGMs are not inherently causal.
Given a DGM, the directed edges pointing into a node come from its parents; each node is assumed to be conditionally independent of its non-descendants given its parents.
This allows us to represent the joint distribution, with those CI assumptions, as $$ p(X_1, X_2, \dots, X_N) = \prod_{i=1}^{N} p(X_i | \text{parents}(X_i)) $$
E.g. the DAG for $p(X_{1:6})$ with no CI assumptions, i.e. the fully connected DAG given directly by the chain rule.
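A small sketch of this factorization for the fork DAG $X_2 \leftarrow X_1 \rightarrow X_3$, with made-up binary CPTs:

```python
# Build the joint from the DGM factorization
# p(X1, X2, X3) = p(X1) p(X2 | X1) p(X3 | X1) and check it normalizes.
import numpy as np

p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.1],    # rows: X2 states, cols: X1 states
                          [0.3, 0.9]])
p_x3_given_x1 = np.array([[0.5, 0.2],
                          [0.5, 0.8]])
joint = np.einsum('a,ba,ca->abc', p_x1, p_x2_given_x1, p_x3_given_x1)
print(joint.sum())   # 1.0
# Parameter count: 1 + 2 + 2 = 5 free parameters instead of 2**3 - 1 = 7.
```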
DGMs are a powerful tool to represent CI assumptions between sets of variables.
The CI assumptions encoded into the DAG have some unintuitive consequences. To see this, we can consider all possible relationships between 3 variables in a DAG. Why is this sufficient to describe the entire DAG?
X -> Z -> Y
Z mediates the dependence between X and Y.
Conditioning on Z removes dependence between X and Y.
Gender -> Field -> Funding
p(Funding | Gender): massive imbalance. p(Funding | Gender, Field): no imbalance.
Gender influences field, field influences funding.
If we want to do something about this, i.e. intervene to make funding gender-fair, it is important to understand the dependence structure in our model so we can make effective changes. E.g. the fix should happen upstream of grant review, since intervening at the funding decision already conditions on field.
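A numeric sketch of this chain, with entirely made-up probabilities and neutral labels for the two gender groups:

```python
# Chain Gender -> Field -> Funding: funding depends on gender only through
# field, so the marginal imbalance disappears once we condition on field.
import numpy as np

p_gender = np.array([0.5, 0.5])                 # two groups, A and B
p_field_given_gender = np.array([[0.8, 0.3],    # rows: fields F1, F2; cols: gender
                                 [0.2, 0.7]])
p_fund_given_field = np.array([0.1, 0.4])       # p(funded | field); no gender term

# p(funded | gender) = sum_field p(funded | field) p(field | gender)
p_fund_given_gender = p_fund_given_field @ p_field_given_gender
print(p_fund_given_gender)   # [0.16 0.31]: large marginal imbalance
# Within a field, p(funded | gender, field) = p(funded | field): no imbalance.
```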
X <- Z -> Y
Z is the “common cause” of X and Y.
Conditioning on Z removes the dependence between X and Y.
Data do not distinguish Forks from Chains!
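A small sketch of why: any joint built as a chain $p(X)p(Z|X)p(Y|Z)$ can be re-factorized exactly as a fork $p(Z)p(X|Z)p(Y|Z)$, so the two graphs are observationally equivalent. All numbers are made up:

```python
# Chain and fork encode the same CI (X ⊥ Y | Z), hence the same set of joints.
import numpy as np

p_x = np.array([0.3, 0.7])
p_z_given_x = np.array([[0.9, 0.4],
                        [0.1, 0.6]])
p_y_given_z = np.array([[0.2, 0.8],
                        [0.8, 0.2]])
chain = np.einsum('i,ki,jk->ijk', p_x, p_z_given_x, p_y_given_z)  # p(x, y, z)

p_z = chain.sum(axis=(0, 1))                       # marginal p(Z)
p_x_given_z = chain.sum(axis=1) / p_z              # p(X | Z)
fork = np.einsum('k,ik,jk->ijk', p_z, p_x_given_z, p_y_given_z)
print(np.allclose(chain, fork))                    # True
```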
X -> Z <- Y
Z depends on both X and Y; X and Y are marginally independent.
Conditioning on Z introduces a dependence between X and Y.
Switch -> Light <- Electricity
Hype -> Published <- Trustworthy
Height -> BasketballRank <- Skill
Taller players aren’t better according to NBA stats?
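A simulated sketch of this selection effect (all quantities invented): height and skill are independent in the population, but negatively correlated among the selected (high-rank) players:

```python
# Collider Height -> BasketballRank <- Skill: conditioning on the collider
# (selection on rank) induces a spurious negative correlation.
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(size=100_000)
skill = rng.normal(size=100_000)                 # independent of height
rank_score = height + skill + 0.5 * rng.normal(size=100_000)
selected = rank_score > 2.0                      # condition on the collider

print(np.corrcoef(height, skill)[0, 1])                      # ~0: independent
print(np.corrcoef(height[selected], skill[selected])[0, 1])  # clearly negative
```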
The d-separation definition gives a way of determining the conditional independence properties of a DGM from the graphical representation.
Unfortunately, the definition itself is not a practical algorithm.
Bayes ball is an efficient algorithm for computing d-separation by passing simple messages between nodes of the graph.
“Bayes Ball”
- sounds like “baseball”
- balls bouncing around a directed graph with specific rules
- if a ball cannot bounce between two nodes then they are [conditionally] independent.
Why boundary conditions? See Murphy 10.10.
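A sketch of the reachability form of Bayes ball (the standard d-separation procedure); the graph representation, node names, and example DAG are all made up for illustration:

```python
# Bayes-ball style d-separation check. `parents` maps each node to a list of
# its parents; z is the set of observed (conditioned) nodes.
def d_separated(parents, x, y, z):
    """True if x and y are d-separated given the set z in the DAG `parents`."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Phase I: ancestors of the conditioning set (including z itself).
    ancestors, frontier = set(), list(z)
    while frontier:
        n = frontier.pop()
        if n not in ancestors:
            ancestors.add(n)
            frontier.extend(parents[n])

    # Phase II: bounce the ball; states are (node, direction of arrival).
    visited, reachable = set(), set()
    frontier = [(x, "up")]              # start as if arriving from a child
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z:
            reachable.add(node)
        if direction == "up" and node not in z:
            frontier += [(p, "up") for p in parents[node]]
            frontier += [(c, "down") for c in children[node]]
        elif direction == "down":
            if node not in z:
                frontier += [(c, "down") for c in children[node]]
            if node in ancestors:       # v-structure activated by evidence
                frontier += [(p, "up") for p in parents[node]]
    return y not in reachable

# Example: collider A -> C <- B, with a child C -> D.
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(d_separated(dag, "A", "B", set()))   # True: the collider blocks the path
print(d_separated(dag, "A", "B", {"D"}))   # False: conditioning on a descendant opens it
```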
UGMs are PGMs whose graph has undirected edges.
Historically very important in physics and computer vision, less so in ML.
They are symmetric, which is more natural for certain domains.
However, the parameters are less interpretable and less modular. Parameter estimation is considerably more expensive.
Recent advances in “energy-based models” are addressing these issues.
UGMs represent CI assumptions much more simply than DAGs.
Instead of d-separation, we simply remove the conditioned nodes from the graph and assess path connectedness.
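A sketch of this check: delete the conditioned nodes and test connectivity with a simple graph search. The adjacency structure and node names are made up:

```python
# UGM separation test: x ⊥ y | z iff no path connects x and y after the
# nodes in z are removed from the graph.
def separated(adj, x, y, z):
    """True if x and y are disconnected once the nodes in z are removed."""
    frontier, seen = [x], {x}
    while frontier:
        node = frontier.pop()
        for nbr in adj[node]:
            if nbr in z or nbr in seen:
                continue
            if nbr == y:
                return False
            seen.add(nbr)
            frontier.append(nbr)
    return True

# Example: a 4-cycle A - B - C - D - A.
ugm = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
print(separated(ugm, "A", "C", set()))        # False: two paths connect A and C
print(separated(ugm, "A", "C", {"B", "D"}))   # True: removing B and D separates them
```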
E.g. vision model
To parameterize a UGM, we cannot associate conditional probabilities with each node.
Instead, we associate “potential functions” or “factors” with the cliques (ideally the maximal cliques) of the graph.
E.g. a UGM over nodes A, B, C, D with discrete states, where each clique potential is stored as an array.
Potentials are not probabilities. Instead they are relative affinities for the states.
We will see this connection later.
Potentials are related to energy in physical models, e.g. of chemical structures: high-energy configurations are unlikely, and nature seeks low-energy states. Energy is inversely related to the probability given by the UGM, so low energy corresponds to high probability.
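A small sketch of this, using a chain UGM $A - B - C$ with cliques $\{A,B\}$ and $\{B,C\}$; the potential values and the notation $\psi$, $Z$ are made up for illustration:

```python
# Clique potentials to probabilities: p(a, b, c) = psi_AB(a, b) psi_BC(b, c) / Z,
# where Z (the partition function) normalizes the product of potentials.
import numpy as np

psi_ab = np.array([[30.0, 5.0],    # affinities, NOT probabilities
                   [1.0, 10.0]])
psi_bc = np.array([[100.0, 1.0],
                   [1.0, 100.0]])
unnorm = np.einsum('ab,bc->abc', psi_ab, psi_bc)   # unnormalized joint
Z = unnorm.sum()                                   # partition function
joint = unnorm / Z
print(joint.sum())                                 # 1.0

# Energy view: E(x) = -log of the product of potentials, so the lowest-energy
# (highest-affinity) configuration is the most probable one.
energy = -np.log(unnorm)
print(joint.argmax() == energy.argmin())           # True
```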