Connecting Typicality with Conditional Probability

We start from eq. 8 of [Li et al. 2023]: \begin{equation} p_\theta\left(c_i|x\right) = \frac{1}{\sum_j \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c_i) - L_t(x, \epsilon, c_j)\right]\right\}}, \tag{1} \label{li} \end{equation} where $p_\theta\left(c_i \mid x\right)$ is the probability of a label $c_i$ conditioned on an input image $x$, with the sum in the denominator running over the set of all available labels $c_j$, and $L_t(x, \epsilon, c)$ is the loss of the diffusion model, defined in eq. 2 of the main paper, computed for a given noise $\epsilon$, timestep $t$, and condition $c$.
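Dividing the numerator and denominator of \eqref{li} by $\exp\left\{\mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, c_i)\right]\right\}$ shows that \eqref{li} is simply a softmax over the negative expected per-class losses. Below is a minimal PyTorch sketch of this observation; the function name and the example numbers are ours, purely for illustration.

```python
import torch

def class_probs_from_losses(expected_losses: torch.Tensor) -> torch.Tensor:
    # expected_losses[j] approximates E_{eps,t}[L_t(x, eps, c_j)] for each label c_j.
    # Eq. (1) then reduces to a softmax over the negative expected losses.
    return torch.softmax(-expected_losses, dim=-1)

# Hypothetical expected losses for three candidate labels: the label with the
# lowest loss receives the highest conditional probability.
losses = torch.tensor([0.31, 0.27, 0.45])
print(class_probs_from_losses(losses))
```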

Instead of computing the probability across all classes in the denominator, we only compute it across the target class $c$ and the null label $\varnothing$. This is motivated by classifier-free guidance [Ho and Salimans, 2021], where, to generate an output that respects $c$, the authors contrast $c$ not with all other conditions $c' \neq c$ but with a single separate label $\varnothing$ learned from all the data.

Restricting the summation in the denominator of \eqref{li} to $c_j \in \{\varnothing, c\}$ and setting $c_i = c$, we have: \begin{equation} p_\theta\left(c|x\right) = \frac{1}{1 + \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, \varnothing)\right]\right\}}. \tag{2} \label{l} \end{equation}
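To see this explicitly, the two-term sum in the denominator reads \[ \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, c)\right]\right\} + \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, \varnothing)\right]\right\} = 1 + \exp \left\{\mathbb{E}_{\epsilon, t}\left[L_t(x, \epsilon, c) - L_t(x, \epsilon, \varnothing)\right]\right\}, \] since the $c_j = c$ term contributes $\exp\{0\} = 1$.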

In eq. 3 of our main paper, we define the typicality $\mathbf{T}(x|c)$ between an image $x$ and a label $c$ as: \begin{equation} \mathbf{T}(x|c) = \mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c)\right]. \tag{3} \label{typicality} \end{equation}
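In practice the expectation in \eqref{typicality} can be approximated by Monte Carlo sampling over $(\epsilon, t)$. The sketch below is an illustration only, assuming a hypothetical $\epsilon$-prediction network `eps_model(x_t, t, cond)`, a DDPM-style cumulative schedule tensor `alphas_cumprod`, and a null condition `null_cond`; it is not the exact implementation used in the paper.

```python
import torch

@torch.no_grad()
def estimate_typicality(x, c, eps_model, alphas_cumprod, null_cond=None, n_samples=64):
    # Monte Carlo estimate of T(x|c) = E_{eps,t}[L_t(x, eps, null) - L_t(x, eps, c)], eq. (3).
    total = 0.0
    num_steps = len(alphas_cumprod)
    for _ in range(n_samples):
        t = torch.randint(0, num_steps, (1,)).item()
        eps = torch.randn_like(x)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps  # forward noising of x
        loss_null = (eps_model(x_t, t, null_cond) - eps).pow(2).sum()
        loss_cond = (eps_model(x_t, t, c) - eps).pow(2).sum()
        total += (loss_null - loss_cond).item()
    return total / n_samples
```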

Substituting \eqref{typicality} into \eqref{l}, and noting that the exponent in \eqref{l} is exactly $-\mathbf{T}(x|c)$, we have: \begin{equation} p_\theta\left(c|x\right) = \frac{1}{1 + \exp \left(- \mathbf{T} (x|c) \right)}. \tag{4} \label{logistic_loss} \end{equation}

Given two images $x, x'$, since $u \mapsto \exp(-u)$ is strictly decreasing: \begin{align} \mathbf{T}(x|c) &> \mathbf{T}(x'|c) &\Longleftrightarrow\\ 1 + \exp( -\mathbf{T}(x|c) ) &< 1 + \exp( -\mathbf{T}(x'|c)) &\Longleftrightarrow\\ p_\theta\left(c|x\right) &> p_\theta\left(c|x'\right). \end{align} Thus, ranking images by typicality is equivalent to ranking them by the conditional probability of the class $c$.
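As a quick numerical illustration of this equivalence, with made-up typicality scores, sorting by typicality and by the probability from \eqref{logistic_loss} yields the same order:

```python
import torch

typicality_scores = torch.tensor([0.8, -0.2, 1.5, 0.1, -1.0])  # hypothetical T(x|c) values
probs = torch.sigmoid(typicality_scores)                        # eq. (4)

order_by_typicality = torch.argsort(typicality_scores, descending=True)
order_by_probability = torch.argsort(probs, descending=True)
assert torch.equal(order_by_typicality, order_by_probability)   # identical rankings
```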

Connection to Mutual Information

Note: Section added post-publication.

Interestingly, a recent paper [1] reveals that what we call "typicality" is the mutual information $I(x;c)$ between an image $x$ and a label $c$ (see $i^{s}$ in eq. 5 of [1]). Even more interestingly, it shows that if the denoiser converges to the optimal MMSE estimator, typicality, \begin{equation} \mathbf{T}(x|c) = \mathbb{E}_{\epsilon,t}\left[L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c)\right] = \mathbb{E}_{\epsilon,t}\left[\left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, \varnothing)-\epsilon\right\|^2 - \left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, c)-\epsilon\right\|^2\right], \tag{5} \label{a} \end{equation} becomes equivalent to \begin{equation} \mathbb{E}_{\epsilon,t}\left[\left\|\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, \varnothing)-\epsilon_\theta(\operatorname{noise}(x, \epsilon, t), t, c) \right\|^2\right].\tag{6} \label{b} \end{equation} The latter is, interestingly, the squared relevance map $\mathcal{R}_{x, I, T}$ from eq. 3 of Mirzaei et al. [2].

Note that the proof in [1] (see its Appendix B) is carried out for the integral definition of the expectations in \eqref{a} and \eqref{b} (not for sample estimates), and that the original relevance map of [2] is evaluated on a single sample only. Also note that, by definition, typicality is a signed measure whose ordering in practice reflects positive correlation with the label $c$, while the relevance map is unsigned.

As I write in my upcoming dissertation, using mutual information for discriminative clustering (the main clustering technique behind seminal mining works such as "What makes Paris look like Paris?" [3]) is a very old idea [4]; however, it did not work well for optimization reasons, as Ohl et al. [5] showed.
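To make the distinction concrete, the sketch below evaluates both estimators on the same $(\epsilon, t)$ samples; the two coincide only in the limit of the optimal MMSE denoiser and exact expectations, so finite-sample estimates will generally differ. As in the earlier sketch, `eps_model`, `alphas_cumprod`, and `null_cond` are hypothetical placeholders, not the paper's implementation.

```python
import torch

@torch.no_grad()
def compare_estimators(x, c, eps_model, alphas_cumprod, null_cond=None, n_samples=64):
    # est_a follows eq. (5): difference of the two denoising losses.
    # est_b follows eq. (6): squared difference of unconditional and conditional predictions.
    est_a, est_b = 0.0, 0.0
    num_steps = len(alphas_cumprod)
    for _ in range(n_samples):
        t = torch.randint(0, num_steps, (1,)).item()
        eps = torch.randn_like(x)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps
        e_null = eps_model(x_t, t, null_cond)
        e_cond = eps_model(x_t, t, c)
        est_a += ((e_null - eps).pow(2).sum() - (e_cond - eps).pow(2).sum()).item()
        est_b += (e_null - e_cond).pow(2).sum().item()
    return est_a / n_samples, est_b / n_samples
```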

References

[1] Kong, Xianghao, et al. "Interpretable diffusion via information decomposition." ICLR 2024.
[2] Mirzaei, Ashkan, et al. "Watch your steps: Local image and scene editing by text instructions." ECCV 2024.
[3] Doersch, Carl, et al. "What makes Paris look like Paris?" SIGGRAPH 2012.
[4] Bridle, John, Anthony Heading, and David MacKay. "Unsupervised classifiers, mutual information and 'phantom targets'." NIPS 1991.
[5] Ohl, Louis, et al. "Generalised Mutual Information: a Framework for Discriminative Clustering." NeurIPS 2022.