Connecting Typicality with Conditional Probability
We start from eq. 8 of [Li et al. 2023]:
\begin{equation}
p_\theta\left(c_i|x\right) = \frac{1}{\sum_j \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c_i) - L_t(x, \epsilon, c_j)\right]\right\}},
\tag{1}
\label{li}
\end{equation}
where $p_\theta\left(c_i \mid x\right)$ is the probability of a label $c_i$ conditioned on an input image $x$, across the set of all available labels $c_j$.
$L_t$ is the diffusion-model loss defined in eq. 2 of the main paper, computed for a given noise $\epsilon$ and timestep $t$.
Instead of summing over all classes in the denominator, we only contrast the target class $c$ against the null label $\varnothing$. This is motivated by classifier-free guidance [Ho and Salimans, 2021]: to generate an output that respects $c$, instead of contrasting $c$ with every other condition $c' \neq c$, the authors contrast it with a separate label $\varnothing$ learned from all the data.
Restricting the summation in the denominator of \eqref{li} to $c_j \in \{\varnothing, c\}$ and setting $c_i = c$, we have:
\begin{equation}
p_\theta\left(c|x\right) = \frac{1}{1 + \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, \varnothing)\right]\right\}}.
\tag{2}
\label{l}
\end{equation}
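For completeness, the $c_j = c$ term in the restricted denominator is $\exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, c)\right]\right\} = \exp(0) = 1$, which gives the "$1$" in \eqref{l}.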
In eq. 3 of our main paper, we define the typicality $\mathbf{T}(x|c)$ between an image $x$ and a label $c$ as:
\begin{equation}
\mathbf{T}(x|c) = \mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c)\right].
\tag{3}
\label{typicality}
\end{equation}
Substituting \eqref{typicality} into \eqref{l}, noting that $\mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, \varnothing)\right] = -\mathbf{T}(x|c)$, we have:
\begin{equation}
p_\theta\left(c|x\right) = \frac{1}{1 + \exp \left(- \mathbf{T} (x|c) \right)}. \tag{4}
\label{logistic_loss}
\end{equation}
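That is, $p_\theta(c|x) = \sigma\left(\mathbf{T}(x|c)\right)$ is the logistic sigmoid of typicality, and is therefore strictly increasing in $\mathbf{T}(x|c)$.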
Given two images $x, x'$:
\begin{align}
\mathbf{T}(x|c) > \mathbf{T}(x'|c)
&\Longleftrightarrow 1 + \exp\left(-\mathbf{T}(x|c)\right) < 1 + \exp\left(-\mathbf{T}(x'|c)\right) \\
&\Longleftrightarrow p_\theta\left(c|x\right) > p_\theta\left(c|x'\right).
\end{align}
Thus, ranking images by typicality is equivalent to ranking them by the conditional probability of the class $c$. ∎
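As an illustration of how this ranking could be computed, here is a minimal Monte Carlo sketch. It assumes a trained noise-prediction network with the hypothetical signature `eps_theta(x_t, t, cond)`, where `cond=None` plays the role of the null label $\varnothing$, and a DDPM schedule `alphas_cumprod` holding $\bar{\alpha}_t$; these names are purely illustrative, not our released implementation.

```python
import torch

@torch.no_grad()
def typicality(x, c, eps_theta, alphas_cumprod, n_samples=64):
    """Monte Carlo estimate of T(x|c) = E_{eps,t}[ L_t(x, eps, null) - L_t(x, eps, c) ].

    Assumed (hypothetical) interface: eps_theta(x_t, t, cond) predicts the noise,
    with cond=None acting as the null label; alphas_cumprod[t] is \bar{alpha}_t.
    """
    total = 0.0
    num_timesteps = alphas_cumprod.shape[0]
    for _ in range(n_samples):
        t = torch.randint(0, num_timesteps, (1,), device=x.device)
        eps = torch.randn_like(x)
        # Forward diffusion: noise(x, eps, t) = sqrt(a_bar) * x + sqrt(1 - a_bar) * eps
        a_bar = alphas_cumprod[t].view(-1, *([1] * (x.dim() - 1)))
        x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps
        loss_null = ((eps_theta(x_t, t, None) - eps) ** 2).mean()  # L_t(x, eps, null)
        loss_cond = ((eps_theta(x_t, t, c) - eps) ** 2).mean()     # L_t(x, eps, c)
        total += (loss_null - loss_cond).item()
    return total / n_samples

def rank_by_typicality(images, c, eps_theta, alphas_cumprod):
    """Return image indices sorted by decreasing typicality for label c."""
    scores = [typicality(x, c, eps_theta, alphas_cumprod) for x in images]
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)
```

Since $p_\theta(c|x) = \sigma(\mathbf{T}(x|c))$, sorting by the estimated typicality yields the same order as sorting by the conditional probability.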
Connection to Mutual Information
Note: Section added post-publication.
Interestingly, a recent paper [1] reveals that what we call "typicality" is the mutual information $I(x;c)$ between an image $x$ and a label $c$ (see the quantity $i^{s}$ in their eq. 5).
Even more interestingly, it reveals that if the denoiser converges to the optimal MMSE estimator, typicality:
\begin{equation}
\mathbf{T}(x|c) = \mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c)\right] = \mathbb{E}_{\epsilon,t}\left[\left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, \varnothing)-\epsilon\right\|^2 - \left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, c)-\epsilon\right\|^2\right], \tag{5}
\label{a}
\end{equation}
becomes equivalent to:
\begin{equation}
\mathbb{E}_{\epsilon,t}\left[\left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, \varnothing)-\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, c) \right\|^2\right].\tag{6}
\label{b}
\end{equation}
The latter is, interestingly, the squared relevance map $\mathcal{R}_{x, I, T}$ from eq. 3 of Mirzaei et al. [2].
Note that the proof in [1] (see their Appendix B) is carried out for the integral definition of the expectations in \eqref{a} and \eqref{b} (not for sample estimates), and that the original relevance map of [2] is evaluated on a single sample only.
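A rough way to see the key step under that integral definition (this is only a sketch; the precise argument is in [1]): writing $a = \epsilon_\theta(\operatorname{noise}(x,\epsilon,t), t, \varnothing)$ and $b = \epsilon_\theta(\operatorname{noise}(x,\epsilon,t), t, c)$, we always have
\begin{equation*}
\left\|a-\epsilon\right\|^2 - \left\|b-\epsilon\right\|^2 = \left\|a-b\right\|^2 + 2\left\langle a-b,\, b-\epsilon\right\rangle,
\end{equation*}
so \eqref{a} and \eqref{b} differ only by the expected cross term. When $b$ is the exact MMSE estimator $\mathbb{E}\left[\epsilon \mid \operatorname{noise}(x,\epsilon,t), c\right]$ and the expectation is taken over the full data distribution, that cross term vanishes by the orthogonality principle, since $a-b$ is a function of the noisy input and the condition.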
Also note that, by definition, typicality is a signed measure whose sign in practice indicates positive association with the label $c$, while the relevance map is unsigned.
As I write in my upcoming dissertation, mutual information is a very old idea [4] for discriminative clustering (the main clustering technique behind seminal mining works such as "What makes Paris look like Paris?" [3]); however, it did not work well in practice for optimization reasons, as Ohl et al. [5] showed.
References
[1] Kong, Xianghao, et al. "Interpretable Diffusion via Information Decomposition." ICLR 2024.
[2] Mirzaei, Ashkan, et al. "Watch Your Steps: Local Image and Scene Editing by Text Instructions." ECCV 2024.
[3] Doersch, Carl, et al. "What Makes Paris Look Like Paris?" SIGGRAPH 2012.
[4] Bridle, John, Anthony Heading, and David MacKay. "Unsupervised Classifiers, Mutual Information and 'Phantom Targets'." NIPS 1991.
[5] Ohl, Louis, et al. "Generalised Mutual Information: a Framework for Discriminative Clustering." NeurIPS 2022.