Research interests

A first line of my work studies structures that training selects, beyond just fitting the data. Recent examples include diffusion language models learning support before frequencies, and match-and-copy mechanisms learned by Transformers.

Another line of my work studies structure in the parameterization and implementation of neural networks. This includes complexity measures, perturbation bounds, and pruning rules that respect rescaling symmetries, as well as GPU implementations of matrix multiplication with structured sparse layers.

Below are short informal notes on a few projects.

Expand all Collapse all

Support before frequency in discrete diffusion

Related paper:

[1] Adrian Müller, Antoine Gonon, Zebang Shen, Ya-Ping Hsieh, Niao He. Support Before Frequency in Discrete Diffusion, arXiv 2026 · PDF

In [1], we give theoretical and experimental evidence that diffusion language models recover which sequences are valid before they fully recover which valid sequences should be more common than others. The explanation we put forward is that the ideal object that training aims to recover, namely the exact denoising transition probabilities, naturally separates support information from frequency information, making the former easier to detect than the latter.

Indeed, when the noise goes to zero, i.e., during the last denoising steps where the model is supposed to recover the full data distribution, the entries of the ideal denoising matrix split into different scales. Entries corresponding to moves toward the support appear at a higher scale than those that do not. So the support becomes easy to identify: one can simply look at the entries that get relatively amplified in the final denoising steps.

Frequency information behaves differently. Rare and common sequences still appear at the same scale, and their ratios stay constant as the noise goes to zero, instead of blowing up or vanishing as happens for support-related quantities.

This suggests that support should be easier to identify than frequencies, and that during training, models may fully recover the support earlier than the exact distribution on this support.

We tested these predictions on regular languages generated by automata and on diffusion language models trained on web text [1]. The observations support this picture.

Whether this behavior is desirable depends on the application. Once identified, one might want to modify training to either promote or penalize it, keeping in mind that this phenomenon likely comes from the particular structure of the optimal denoising matrix targeted by current diffusion training objectives.

Discrete diffusion models for text are themselves interesting because they are a promising alternative to autoregressive language models. They can generate many tokens in parallel rather than strictly left-to-right, potentially reducing inference cost by requiring fewer model calls than the sequence length. Their performance has kept improving, and efforts such as Google’s Gemini Diffusion show that this direction is now being explored seriously at scale.

A small task for match-and-copy

Related paper:

[1] Antoine Gonon, Alexandre Cordonnier, Nicolas Boumal. Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction, arXiv 2026 · PDF

In [1], we introduce a small task that explicitly asks models to implement match-and-copy: find a match in the context, then copy what came next. Knowing whether an architecture can learn this mechanism matters for interpretability, because mechanisms explain how trained models operate, and for architecture design, because cheaper architectures are only useful if they preserve the behaviors that made Transformers effective.

Match-and-copy is the abstract sequence routine implemented by induction heads. These are attention heads that emerge in Transformers trained for next-token prediction on web text, even though the training objective never asks for them explicitly.

For instance, match-and-copy is useful to complete:

Granny Susie comes tomorrow. I love Granny ____.

The answer “Susie” is already in the prompt. The model does not need to have memorized an association between “Granny” and “Susie” during training. It can solve the task at inference time by finding the previous occurrence of “Granny”, then copying the next token. This is the kind of routine induction heads implement in large language models.

Gaussian Match-and-Copy asks for the same routine, with text tokens replaced by random Gaussian vectors. A training example contains random vectors \(e_1,\ldots,e_T\) and a query \(e\). One hidden position \(t_0\) is correlated with \(e\), and the target is the next vector:

\[\operatorname{Cov}(e_t,e) \begin{cases} \neq 0, & t=t_0,\\ =0, & t\neq t_0, \end{cases} \qquad \text{target}=e_{t_0+1}.\]

So the model must find the correlated vector and copy its successor. The matching position changes across examples, so memorizing fixed associations is not enough.

In our experiments, Transformers develop circuits of the same type as those observed in large language models: previous-token heads and induction-style heads [1]. However, we also empirically show that several state-space or attention-free alternatives struggle to implement this routine under the same budget. This suggests that this low-level task can be a useful diagnostic for architectures that aim to be cheaper than Transformers while retaining their capabilities.

The paper also tells a small theory story. We consider a one-layer attention model with weights \(W\), whose output for a context \((e_1,\ldots,e_T)\) and query \(e\) is

\[\hat y = \sum_{t=1}^T \frac{\exp(\langle W e_t,e\rangle)} {\sum_s \exp(\langle W e_s,e\rangle)} e_{t+1}.\]

This model can solve the task by learning the hidden correlation structure, using it to detect the match at inference time, and copying the next vector.

What is interesting is that, despite the loss not being a standard max-margin loss, gradient descent often drives \(W\) toward a max-margin matching rule [1]. Among many directions that could drive the training loss to zero, it selects one that separates the matching position from the others with maximal margin. This makes the learned rule robust to perturbations of the training data.

This is not a standard max-margin setting because the least-squares loss does not belong to the usual family of losses that are monotone in the margins. In particular, one of the key properties used to prove max-margin convergence, namely that gradients keep increasing the margins along the trajectory, does not hold here without additional structure.

We identify structural properties of the data and of the dynamics under which this max-margin behavior can still be proved, and we observe these properties in the runs that converge to max-margin solutions in practice [1]. This may be useful beyond this task, for other cases where max-margin behavior appears in non-monotone settings, for example in other regression in-context learning tasks.

What helps theory for Muon on quadratics

Related paper:

[1] Antoine Gonon, Andreea-Alexandra Mușat, Nicolas Boumal. Insights on Muon from Simple Quadratics, arXiv 2026 · PDF · Code

In [1], we look for structure on quadratics that could support positive theory for Muon. Two tempting structures fail or are insufficient, and one spectral structure seems more promising.

The goal is not to explain full language-model training directly. It is to find which simplified properties are actually useful if one wants to build theory. Quadratics are common simplified models in optimization, which is why we start there.

Muon is an optimizer that emerged in Keller Jordan’s modded-nanoGPT benchmark as an alternative to AdamW for training language models. Roughly, it is SGD with momentum, except that the momentum matrix is replaced by its polar factor before applying the step. If the momentum matrix at step \(t\) is

\[M_t=\beta M_{t-1}+\nabla L(W_t),\]

then a simplified Muon update is

\[W_{t+1}=W_t-\eta\,\operatorname{Polar}_k(M_t).\]

For an exact SVD \(M=U\Sigma V^\top\), the exact polar factor is \(UV^\top\): it keeps the singular vectors and sends all singular values to one. In practice, this polar factor is only approximated to balance computational cost.

The first empirical observation in [1] is that taking an exact polar step does not improve the iteration count needed to reach a given loss value. This challenges the most direct way of accounting for polar approximation error: start from bounds for exact polar steps, add an approximation error term, and propagate it through the analysis. Such bounds usually worsen monotonically as the approximation error increases. This monotonic behavior is not what we see on quadratics, nor in practical benchmarks where replacing the polar approximation by an exact SVD preserves the training curves in terms of iteration count (but not in wall-clock time, of course, which is the reason to use the approximation in the first place).

The second empirical finding is that one-step comparisons are not enough. Looking for conditions under which Muon gives a larger loss decrease than gradient descent at the current iterate will not, without additional structure, guarantee a global advantage over the whole trajectory.

The third finding is the most suggestive one for building positive theory. Conditioning alone does not predict when Muon wins: even at fixed endpoints of the Hessian spectrum, Muon can win or lose against gradient descent. However, the full shape of the spectrum between these endpoints seems helpful for predicting finite-time performance. In particular, spectra with more mass near the small-eigenvalue end often favor Muon in our experiments [1].

But which quadratic spectra are meaningful to study? Should they mimic Hessians observed during real non-quadratic training? Or should they be controlled spectra used to isolate one feature at a time?

In the latter direction, work on optimizer gaps in language models shows that heavy-tailed data can induce, already for simple losses, Hessian and gradient structures where different optimizers behave very differently. For example, gradient descent can be slow on rare classes, while adaptive or normalized methods are less affected. Related work on Muon also studies tail-heavy associative-memory settings.

Rescaling symmetries in ReLU networks

Related papers:

[1] Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval. A Path-Norm Toolkit for Modern Networks: Consequences, Promises and Challenges, ICLR 2024 (Spotlight) · PDF · Code
[2] Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval. A Rescaling-Invariant Lipschitz Bound Based on Path-Metrics for Modern ReLU Network Parameterizations, ICML 2025 · PDF
[3] Damien Rouchouse*, Antoine Gonon*, Rémi Gribonval, Benjamin Guedj. Non-Vacuous Generalization Bounds: Can Rescaling Invariances Help?, arXiv 2025 · PDF

Many neural networks have redundant parameterizations: the weights can change while the function stays the same. Some theoretical bounds or algorithmic decisions still react to the weight change. This line of work tries to build bounds and decisions that do not, when this reaction is undesirable.

The redundancy I studied is the rescaling symmetry of ReLU networks. For one ReLU neuron,

\[v\,\operatorname{ReLU}(\langle u,x\rangle) = \frac{v}{\lambda}\,\operatorname{ReLU}(\langle \lambda u,x\rangle), \qquad \lambda>0.\]

The weights changed, but the function did not. This is harmless if one only evaluates the network. It becomes an issue when a bound, a distance, or a pruning score depends on the size of \(u\) and \(v\). Then two identical functions can receive different guarantees or different pruning decisions. This symmetry has specific properties that make it important to consider: it creates continuous surfaces of local minima (unlike discrete permutation symmetries, for example), and it modifies the size of the weights (unlike permutations).

Note that invariance is often not a standalone property that is useful by itself. For instance, for generalization, it must be combined with other ideas, such as flatness, compressibility, or other structural properties, to build theory that says something meaningful in practice. It is more of a principle that cuts across several settings: whatever structure a bound or method tries to exploit, quite often it is better if it does not depend on arbitrary rescalings of the weights.

The path-lifting is a way to write quantities that stay stable under these rescalings. The basic object is attached to a path \(p\) in the network’s computational graph: \(p\) follows neuron-to-neuron connections, \(e\) denotes one edge along that path, and \(w_e\) is the weight carried by edge \(e\). The path coordinate is the product of all weights encountered along \(p\):

\[\prod_{e\in p} w_e.\]

Under a node rescaling, some weights are multiplied by \(\lambda\), others by \(1/\lambda\), but complete path products remain unchanged. These products are therefore closer to the represented function than individual weights are.

The reason this is useful is not only that the quantities are invariant. The collection of all these path products locally linearizes the network: on a fixed activation pattern of the neurons, the function behaves like a linear function of these lifted coordinates. This gives a natural way to reformulate usual questions about the network in a way that does not depend on rescalings, for example to build complexity measures for generalization, pruning rules, or perturbation bounds.

In my PhD, I extended this path-based view beyond plain multilayer ReLU networks to architectures closer to practice: general DAGs, biases, skip connections, pooling, and frozen batch normalization [1]. This made it possible to compute path-norm quantities for models such as ResNets, rather than only for fully-connected networks.

I then used these coordinates in three directions. First, for Rademacher generalization bounds, where the complexity measure becomes invariant to rescaling [1]. Second, for Lipschitz bounds, both with respect to the input and with respect to the weights, with applications to pruning [2]. Third, for PAC-Bayes bounds, where the same invariance can remove one avoidable source of looseness in the bound [3]. I also released a Python package to help compute related invariant quantities. For example, the path-norm can be computed as pathnorm(model) on a compatible PyTorch model.

Kronecker-sparse matrix multiplication on GPU

Related paper:

[1] Antoine Gonon*, Léon Zheng*, Pascal Carrivain*, Quoc-Tung Le. Fast Inference with Kronecker-Sparse Matrices, ICML 2025 · PDF · Code

Matrix multiplication is where neural networks spend a large fraction of their inference time and energy. In [1], we study whether a particular kind of structure, called Kronecker sparsity, can make these multiplications faster on GPU.

In this work, we propose a CUDA kernel that reduces memory movements for matrix multiplication between a dense and a Kronecker-sparse matrix. The targeted use case is inference in Transformers, with Kronecker-sparse weight matrices in the linear layers, and dense inputs and outputs. We give a heuristic to predict when this new kernel helps, based on benchmarks over hundreds of sparsity patterns compatible with usual Transformer dimensions.

A Kronecker-sparse matrix is sparse, but not with arbitrary locations for the nonzero entries. Its support is constrained by a Kronecker product, specified by four integers \(a,b,c,d\):

\[S = I_a \otimes \mathbf{1}_{b\times c} \otimes I_d, \qquad W_{\mathrm{KS}} = S \odot W.\]

Only the support has this Kronecker structure: the nonzero weights themselves are free. This includes structured patterns related to Butterfly-like and Monarch-like matrices. On paper, these matrices can reduce the number of multiply-adds.

On GPU, fewer multiply-adds are not enough. Efficient inference also depends on how data is accessed and moved around (e.g., are memory accesses made to neighboring memory locations?). Existing Kronecker-sparse implementations often reduce the multiplication to a block-diagonal one by explicitly permuting the input and output. In our benchmarks, these memory-rewriting operations could take up to half of the runtime. The proposed kernel uses a different storage and tiling strategy tailored to the Kronecker-sparse support, which avoids those permutations.

The kernel is particularly effective in float32, in batch-size-last memory layout, and for sparsity patterns where memory rewriting is large compared to arithmetic. For a Kronecker pattern \((a,b,c,d)\), a simple proxy for this is

\[\frac{\text{memory rewrites}}{\text{scalar multiplications}} \approx \frac{b+c}{bc}.\]

The larger this ratio is, the more there is to gain by avoiding unnecessary memory movements. In favorable regimes, we observe a median speedup around \(1.4\times\) and about \(15\%\) lower energy use. We also released a PyTorch package that lets users replace torch.nn.Linear with ksmm.KSLinear and specify the desired sparsity pattern.