An error indicator-based adaptive reduced order model for nonlinear structural mechanics -- application to high-pressure turbine blades
The industrial application motivating this work is the fatigue computation of aircraft engines' high-pressure turbine blades. The material model involves nonlinear elastoviscoplastic behavior laws, for which the parameters depend on the temperature. For this application, the temperature loading is not accurately known and can reach values relatively close to the creep temperature: important nonlinear effects occur and the solution strongly depends on the used thermal loading. We consider a nonlinear reduced order model able to compute, in the exploitation phase, the behavior of the blade for a new temperature field loading. The sensitivity of the solution to the temperature makes the classical unenriched proper orthogonal decomposition method fail. In this work, we propose a new error indicator, quantifying the error made by the reduced order model in computational complexity independent of the size of the high-fidelity reference model. In our framework, when the error indicator becomes larger than a given tolerance, the reduced order model is updated using one time step solution of the high-fidelity reference model. The approach is illustrated on a series of academic test cases and applied on a setting of industrial complexity involving 5 million degrees of freedom, where the whole procedure is computed in parallel with distributed memory.
Fast Updating Truncated SVD for Representation Learning with Sparse Matrices
Updating a truncated Singular Value Decomposition (SVD) is crucial in representation learning, especially when dealing with large-scale data matrices that continuously evolve in practical scenarios. Aligning SVD-based models with fast-paced updates becomes increasingly important. Existing methods for updating truncated SVDs employ Rayleigh-Ritz projection procedures, where projection matrices are augmented based on original singular vectors. However, these methods suffer from inefficiency due to the densification of the update matrix and the application of the projection to all singular vectors. To address these limitations, we introduce a novel method for dynamically approximating the truncated SVD of a sparse and temporally evolving matrix. Our approach leverages sparsity in the orthogonalization process of augmented matrices and utilizes an extended decomposition to independently store projections in the column space of singular vectors. Numerical experiments demonstrate a remarkable efficiency improvement of an order of magnitude compared to previous methods. Remarkably, this improvement is achieved while maintaining a comparable precision to existing approaches.
A Unified Perspective on Orthogonalization and Diagonalization
This paper makes a formal connection between two families of widely used matrix factorization algorithms in numerical linear algebra. One family consists of the Jacobi eigenvalue algorithm and its variants for computing the Hermitian eigendecomposition and singular value decomposition. The other consists of Gaussian elimination and the Gram-Schmidt procedure with various pivoting rules for computing the Cholesky decomposition and QR decomposition respectively. Both families are cast as special cases of a more general class of factorization algorithms. We provide a randomized pivoting rule that applies to this general class (which differs substantially from the usual pivoting rules for Gaussian elimination / Gram-Schmidt) which results in the same linear rate of convergence for each algorithm, irrespective of which factorization it computes. A second important consequence of this randomized pivoting rule is a provable, effective bound on the numerical stability of the Jacobi eigenvalue algorithm, which addresses a longstanding open problem of Demmel and Veselić '92.
Are Random Decompositions all we need in High Dimensional Bayesian Optimisation?
Learning decompositions of expensive-to-evaluate black-box functions promises to scale Bayesian optimisation (BO) to high-dimensional problems. However, the success of these techniques depends on finding proper decompositions that accurately represent the black-box. While previous works learn those decompositions based on data, we investigate data-independent decomposition sampling rules in this paper. We find that data-driven learners of decompositions can be easily misled towards local decompositions that do not hold globally across the search space. Then, we formally show that a random tree-based decomposition sampler exhibits favourable theoretical guarantees that effectively trade off maximal information gain and functional mismatch between the actual black-box and its surrogate as provided by the decomposition. Those results motivate the development of the random decomposition upper-confidence bound algorithm (RDUCB) that is straightforward to implement - (almost) plug-and-play - and, surprisingly, yields significant empirical gains compared to the previous state-of-the-art on a comprehensive set of benchmarks. We also confirm the plug-and-play nature of our modelling component by integrating our method with HEBO, showing improved practical gains in the highest dimensional tasks from Bayesmark.
SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
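The two-step procedure in the abstract above (a fixed DCT basis, then column selection by alignment with the gradient) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the names `dct_basis` and `select_columns` are hypothetical, and scoring each column by the squared norm of its column in the product of gradient and basis is an assumed alignment measure.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dct_basis(n):
    """Orthonormal DCT-II matrix; the basis vectors are stored as columns."""
    basis = [[0.0] * n for _ in range(n)]
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        for t in range(n):
            basis[t][k] = scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
    return basis

def select_columns(grad, basis, rank):
    """Return the indices of the `rank` basis columns best aligned with grad."""
    aligned = matmul(grad, basis)  # one matrix multiplication
    # Assumed alignment score: squared column norm of the product.
    scores = [sum(row[j] ** 2 for row in aligned) for j in range(len(basis))]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:rank])  # only these indices need to be stored

n = 8
basis = dct_basis(n)
# Toy 2 x 8 "gradient" built from DCT columns 3 and 1 (illustrative data only).
grad = [[a * basis[t][3] + b * basis[t][1] for t in range(n)]
        for a, b in ((1.0, 0.5), (-2.0, 0.25))]
indices = select_columns(grad, basis, rank=2)
# Low-rank projection of the gradient onto the selected columns (2 x 2 here).
projected = matmul(grad, [[row[j] for j in indices] for row in basis])
```

Because the basis is fixed, only `indices` has to be kept per layer; the projection matrix can be rebuilt from the shared basis whenever it is needed.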
A theory of meta-factorization
We introduce meta-factorization, a theory that describes matrix decompositions as solutions of linear matrix equations: the projector and the reconstruction equation. Meta-factorization reconstructs known factorizations, reveals their internal structures, and allows for introducing modifications, as illustrated with SVD, QR, and UTV factorizations. The prospect of meta-factorization also provides insights into computational aspects of generalized matrix inverses and randomized linear algebra algorithms. The relations between the Moore-Penrose pseudoinverse, generalized Nyström method, and the CUR decomposition are revealed here as an illustration. Finally, meta-factorization offers hints on the structure of new factorizations and provides the potential of creating them.
Existence, Stability and Scalability of Orthogonal Convolutional Neural Networks
Imposing orthogonality on the layers of neural networks is known to facilitate learning by limiting exploding/vanishing gradients, decorrelating the features, and improving robustness. This paper studies the theoretical properties of orthogonal convolutional layers. We establish necessary and sufficient conditions on the layer architecture guaranteeing the existence of an orthogonal convolutional transform. The conditions prove that orthogonal convolutional transforms exist for almost all architectures used in practice for 'circular' padding. We also exhibit limitations with 'valid' boundary conditions and 'same' boundary conditions with zero-padding. Recently, a regularization term imposing the orthogonality of convolutional layers has been proposed, and impressive empirical results have been obtained in different applications (Wang et al. 2020). The second motivation of the present paper is to specify the theory behind this. We make the link between this regularization term and orthogonality measures. In doing so, we show that this regularization strategy is stable with respect to numerical and optimization errors and that, in the presence of small errors and when the size of the signal/image is large, the convolutional layers remain close to isometric. The theoretical results are confirmed with experiments and the landscape of the regularization term is studied. Experiments on real data sets show that when orthogonality is used to enforce robustness, the parameter multiplying the regularization term can be used to tune a tradeoff between accuracy and orthogonality, for the benefit of both accuracy and robustness. Altogether, the study guarantees that the regularization proposed in Wang et al. (2020) is an efficient, flexible and stable numerical strategy to learn orthogonal convolutional layers.
Implicit Multiple Tensor Decomposition
Recently, triple decomposition has attracted increasing attention for decomposing third-order tensors into three factor tensors. However, this approach is limited to third-order tensors and enforces uniformity in the lower dimensions across all factor tensors, which restricts its flexibility and applicability. To address these issues, we propose the Multiple decomposition, a novel framework that generalizes triple decomposition to arbitrary order tensors and allows the short dimensions of the factor tensors to differ. We establish its connections with other classical tensor decompositions. Furthermore, implicit neural representation (INR) is employed to continuously represent the factor tensors in Multiple decomposition, enabling the method to generalize to non-grid data. We refer to this INR-based Multiple decomposition as Implicit Multiple Tensor Decomposition (IMTD). Then, the Proximal Alternating Least Squares (PALS) algorithm is utilized to solve the IMTD-based tensor reconstruction models. Since the objective function in IMTD-based models often lacks the Kurdyka-Lojasiewicz (KL) property, we establish a KL-free convergence analysis for the algorithm. Finally, extensive numerical experiments further validate the effectiveness of the proposed method.
Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization
Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust-region is defined in terms of the matrix spectral norm. Motivated by this observation, we develop the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case, along with normalized SGD and signSGD with momentum (Cutkosky and Mehta, 2020; Sun et al., 2023). In addition, we prove state-of-the-art convergence results for the proposed algorithm in a range of scenarios, which involve arbitrary non-Euclidean norms, constrained and composite problems, and non-convex, star-convex, first- and second-order smooth functions. Finally, our theoretical findings provide an explanation for several practical observations, including the practical superiority of Muon compared to the Orthogonal-SGDM algorithm of Tuddenham et al. (2022) and the importance of weight decay in the training of large-scale language models.
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models
Adapting large-scale pre-trained generative models in a parameter-efficient manner is gaining traction. Traditional methods like low rank adaptation achieve parameter efficiency by imposing constraints but may not be optimal for tasks requiring high representation capacity. We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. Using the Kronecker product and efficient Stiefel optimizers, we achieve parameter-efficient adaptation of orthogonal matrices. We introduce Spectral Orthogonal Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity. Extensive evaluations on text-to-image diffusion models demonstrate SODA's effectiveness, offering a spectrum-aware alternative to existing fine-tuning methods.
Orthogonal Finetuning Made Scalable
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
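The Cayley-Neumann idea mentioned above can be illustrated with a small sketch. It assumes the standard Cayley transform R = (I + Q)(I - Q)^{-1} for a skew-symmetric Q and replaces the inverse with the truncated series I + Q + ... + Q^K, which is only accurate when Q has small norm. The function name `cayley_neumann` and the choice of eight terms are illustrative assumptions, not taken from OFTv2.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def cayley_neumann(q, num_terms):
    """Approximate (I + Q)(I - Q)^{-1} with a truncated Neumann series."""
    n = len(q)
    inv_approx = identity(n)  # running sum I + Q + Q^2 + ...
    power = identity(n)       # current power Q^k
    for _ in range(num_terms):
        power = matmul(power, q)
        inv_approx = add(inv_approx, power)
    return matmul(add(identity(n), q), inv_approx)

# Small skew-symmetric matrix (Q^T = -Q); the entries are illustrative.
q = [[0.0, 0.1, -0.05],
     [-0.1, 0.0, 0.02],
     [0.05, -0.02, 0.0]]
r = cayley_neumann(q, num_terms=8)

# R^T R should be close to the identity if R is nearly orthogonal.
rtr = matmul(transpose(r), r)
ortho_error = max(abs(rtr[i][j] - (1.0 if i == j else 0.0))
                  for i in range(3) for j in range(3))
```

The truncation avoids an explicit matrix inverse; the price is that orthogonality holds only approximately, with an error that shrinks as more terms are kept or as Q gets smaller.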
Language model compression with weighted low-rank factorization
Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error for those who affect the task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of parameters affecting the model prediction. This idea leads to our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that our resulting task accuracy is much closer to the original model's performance. We perform analysis with the transformer-based language models, showing our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance with a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies requiring expensive model pre-training. Moreover, the evaluation of compressing an already compact model shows our method can further reduce 9% to 30% parameters with an insignificant impact on task accuracy.
Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher effective rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.
Subspace power method for symmetric tensor decomposition
We introduce the Subspace Power Method (SPM) for calculating the CP decomposition of low-rank real symmetric tensors. This algorithm calculates one new CP component at a time, alternating between applying the shifted symmetric higher-order power method (SS-HOPM) to a certain modified tensor, constructed from a matrix flattening of the original tensor; and using appropriate deflation steps. We obtain rigorous guarantees for SPM regarding convergence and global optima for input tensors of dimension d and order m of CP rank up to O(d^{\lfloor m/2 \rfloor}), via results in classical algebraic geometry and optimization theory. As a by-product of our analysis we prove that SS-HOPM converges unconditionally, settling a conjecture in [Kolda, T.G., Mayo, J.R.: Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications 32(4), 1095-1124 (2011)]. We present numerical experiments which demonstrate that SPM is efficient and robust to noise, being up to one order of magnitude faster than state-of-the-art CP decomposition algorithms in certain experiments. Furthermore, prior knowledge of the CP rank is not required by SPM.
Conditionally Strongly Log-Concave Generative Models
There is a growing gap between the impressive results of deep image generative models and classical algorithms that offer theoretical guarantees. The former suffer from mode collapse or memorization issues, limiting their application to scientific data. The latter require restrictive assumptions such as log-concavity to escape the curse of dimensionality. We partially bridge this gap by introducing conditionally strongly log-concave (CSLC) models, which factorize the data distribution into a product of conditional probability distributions that are strongly log-concave. This factorization is obtained with orthogonal projectors adapted to the data distribution. It leads to efficient parameter estimation and sampling algorithms, with theoretical guarantees, although the data distribution is not globally log-concave. We show that several challenging multiscale processes are conditionally log-concave using wavelet packet orthogonal projectors. Numerical results are shown for physical fields such as the φ^4 model and weak lensing convergence maps with higher resolution than in previous works.
Convolution Aware Initialization
Initialization of parameters in deep neural networks has been shown to have a big impact on the performance of the networks (Mishkin & Matas, 2015). The initialization scheme devised by He et al. allowed convolution activations to carry a constrained mean which allowed deep networks to be trained effectively (He et al., 2015a). Orthogonal initializations and more generally orthogonal matrices in standard recurrent networks have been proved to eradicate the vanishing and exploding gradient problem (Pascanu et al., 2012). The majority of current initialization schemes do not take fully into account the intrinsic structure of the convolution operator. Using the duality of the Fourier transform and the convolution operator, Convolution Aware Initialization builds orthogonal filters in the Fourier space, and using the inverse Fourier transform represents them in the standard space. With Convolution Aware Initialization we noticed not only higher accuracy and lower loss, but faster convergence. We achieve new state of the art on the CIFAR10 dataset, and achieve close to state of the art on various other tasks.
Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data
Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, real-world data is often not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions. Then, we convert each GP into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications. We release the code of FunBaT at https://github.com/xuangu-fang/Functional-Bayesian-Tucker-Decomposition.
AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates
Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of O(n^3) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton-Schulz iterations, reducing complexity to O(n^2). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton-Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton-Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to strong baselines such as AdamW and Muon. Code is available at: https://github.com/ryyzn9/AuON
Online Orthogonal Dictionary Learning Based on Frank-Wolfe Method
Dictionary learning is a widely used unsupervised learning method in signal processing and machine learning. Most existing dictionary learning methods operate offline. There are mainly two offline approaches to dictionary learning. One is to alternately optimize the dictionary and the sparse code; the other is to optimize the dictionary by restricting it to the orthogonal group. The latter is called orthogonal dictionary learning; it has a lower-complexity implementation and is therefore more suitable for low-cost devices. However, existing schemes for orthogonal dictionary learning only work with batch data and cannot be implemented online, which makes them unsuitable for real-time applications. This paper proposes a novel online orthogonal dictionary scheme to dynamically learn the dictionary from streaming data without storing the historical data. The proposed scheme includes a novel problem formulation and an efficient online algorithm design with convergence analysis. In the problem formulation, we relax the orthogonal constraint to enable an efficient online algorithm. In the algorithm design, we propose a new Frank-Wolfe-based online algorithm with a convergence rate of O(ln t/t^(1/4)). The convergence rate in terms of key system parameters is also derived. Experiments with synthetic data and real-world sensor readings demonstrate the effectiveness and efficiency of the proposed online orthogonal dictionary learning scheme.
Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module's output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module's capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module's output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +4.3%p top-1 accuracy gain for ViT-B on ImageNet-1k.
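The decomposition described above amounts to a vector projection, which a short sketch can make concrete. It works on a single feature vector in plain Python; the function name `orthogonal_residual_update` and the small stabilizing constant `eps` are assumptions for illustration, not details taken from the paper.

```python
def orthogonal_residual_update(x, f_x, eps=1e-12):
    """Add to x only the component of f_x that is orthogonal to x.

    x   : input stream (list of floats)
    f_x : module output for that input (same length)
    eps : assumed small constant guarding against division by zero
    """
    dot_fx = sum(f * a for f, a in zip(f_x, x))
    dot_xx = sum(a * a for a in x)
    scale = dot_fx / (dot_xx + eps)  # projection coefficient of f_x onto x
    f_perp = [f - scale * a for f, a in zip(f_x, x)]  # orthogonal component
    updated = [a + p for a, p in zip(x, f_perp)]
    return updated, f_perp

x = [1.0, 2.0, 2.0]
f_x = [3.0, 0.0, 1.0]
updated, f_perp = orthogonal_residual_update(x, f_x)
# Inner product between the added component and the stream; close to zero.
residual_alignment = sum(p * a for p, a in zip(f_perp, x))
```

A standard residual update would return `x + f_x` instead; the only difference here is that the part of `f_x` parallel to `x` is removed before the addition.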
CLOVER: Constrained Learning with Orthonormal Vectors for Eliminating Redundancy
To adapt a well-trained large model to downstream tasks, we propose constraining learning within its original latent space by leveraging linear combinations of its basis vectors. This approach ensures stable training without compromising the model's capabilities. Traditionally, constructing orthonormal bases from a matrix requires a transfer matrix, which significantly increases storage and computational overhead for parameters and feature maps. In this paper, we introduce Absorb and Decompose for Q, K, V, and O matrices, enabling their orthogonalization without the need for transfer matrices. Furthermore, the Absorb-Decompose operation eliminates redundant vectors, reducing the encoder attention parameters of Whisper-large-v3 by 46.42% without requiring additional training. For parameter-efficient and stable fine-tuning, we orthonormalized Q, K, V, and O and fine-tuned only the singular values, allowing efficient adaptation while constraining changes to the original latent space. When fine-tuning LLaMA-2-7B on eight commonsense reasoning datasets, our method outperforms LoRA by 5.4% and DoRA by 4.4%.
Solving High Frequency and Multi-Scale PDEs with Gaussian Processes
Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a Student's t mixture or Gaussian mixture. We apply the inverse Fourier transform to obtain the covariance function (by the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to promote computational efficiency and scalability, without low-rank approximations. We show the advantage of our method in systematic experiments. The code is released at https://github.com/xuangu-fang/Gaussian-Process-Slover-for-High-Freq-PDE.
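For the Gaussian mixture case mentioned above, the inverse Fourier transform of the spectrum has a well-known closed form, the one-dimensional spectral mixture kernel k(tau) = sum_q w_q exp(-2 pi^2 tau^2 s_q^2) cos(2 pi mu_q tau). The sketch below evaluates that formula only; the parameter names and the toy values are assumptions, and the Student's t variant, the log-domain weight estimation and the grid/Kronecker structure are not shown.

```python
import math

def spectral_mixture_kernel(tau, weights, means, stds):
    """1-D spectral mixture kernel for input difference tau.

    weights : mixture weights w_q
    means   : frequency means mu_q of the Gaussian spectral components
    stds    : frequency standard deviations s_q of those components
    """
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * s ** 2)
        * math.cos(2.0 * math.pi * mu * tau)
        for w, mu, s in zip(weights, means, stds)
    )

# Toy mixture: one low-frequency and one high-frequency component.
weights = [0.7, 0.3]
means = [1.0, 5.0]
stds = [0.1, 0.5]

k_zero = spectral_mixture_kernel(0.0, weights, means, stds)  # sum of weights
k_pos = spectral_mixture_kernel(0.25, weights, means, stds)
k_neg = spectral_mixture_kernel(-0.25, weights, means, stds)
```

For multi-dimensional inputs, one such kernel per input dimension can be multiplied together, which is the product structure the abstract relies on for its Kronecker factorization.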
Neural Spectral Methods: Self-supervised learning in the spectral domain
We present Neural Spectral Methods, a technique to solve parametric Partial Differential Equations (PDEs), grounded in classical spectral methods. Our method uses orthogonal bases to learn PDE solutions as mappings between spectral coefficients. In contrast to current machine learning approaches which enforce PDE constraints by minimizing the numerical quadrature of the residuals in the spatiotemporal domain, we leverage Parseval's identity and introduce a new training strategy through a spectral loss. Our spectral loss enables more efficient differentiation through the neural network, and substantially reduces training complexity. At inference time, the computational cost of our method remains constant, regardless of the spatiotemporal resolution of the domain. Our experimental results demonstrate that our method significantly outperforms previous machine learning approaches in terms of speed and accuracy by one to two orders of magnitude on multiple different problems. When compared to numerical solvers of the same accuracy, our method demonstrates a 10× increase in performance speed.
Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end training runtime with 5-10% improvement in realistic training scenarios across two efficiency-focused tasks. On challenging language or vision tasks, we validate that our method maintains equal or superior model performance while improving runtime. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Our code is publicly available on github.
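For readers unfamiliar with the Newton-Schulz step discussed above, here is a minimal sketch of the classical cubic iteration X <- 1.5 X - 0.5 X X^T X applied to a Frobenius-normalized matrix. This is the textbook variant, not the tuned polynomial used in Muon and not the preconditioned scheme of Turbo-Muon; it typically needs more steps than the five mentioned in the abstract, and the step count used here is an illustrative assumption.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_orthogonalize(g, num_steps):
    """Approximate the orthogonal polar factor of a full-rank matrix g."""
    # Normalizing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(num_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)  # two matrix products
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)]
             for rx, rc in zip(x, xxt_x)]
    return x

g = [[2.0, 1.0],
     [0.0, 1.0]]  # toy full-rank "gradient"
q = newton_schulz_orthogonalize(g, num_steps=12)

# Q Q^T should be close to the identity after enough steps.
gram = matmul(q, transpose(q))
ortho_error = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                  for i in range(2) for j in range(2))
```

Each step costs two matrix multiplications, which is why reducing the number of steps translates directly into runtime savings.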
Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation
With the increasingly powerful performances and enormous scales of Pretrained Language Models (PLMs), promoting parameter efficiency in fine-tuning has become a crucial need for effective and efficient adaptation to various downstream tasks. One representative line of fine-tuning methods is Orthogonal Fine-tuning (OFT), which rigorously preserves the angular distances within the parameter space to preserve the pretrained knowledge. Despite its empirical effectiveness, OFT still suffers from low parameter efficiency at O(d^2) and limited capability of downstream adaptation. Inspired by Givens rotation, in this paper, we propose quasi-Givens Orthogonal Fine-Tuning (qGOFT) to address these problems. We first use O(d) Givens rotations to accomplish an arbitrary orthogonal transformation in SO(d) with provable equivalence, reducing parameter complexity from O(d^2) to O(d). Then we introduce flexible norm and relative angular adjustments under soft orthogonality regularization to enhance the adaptation capability of downstream semantic deviations. Extensive experiments on various tasks and PLMs validate the effectiveness of our methods.
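To make the building block concrete, here is a minimal sketch of how a product of Givens rotations yields a matrix in SO(d) with one angle per rotation. The rotation ordering, the angle values, and the helper name `givens` are assumptions for illustration; this is not the qGOFT construction itself and does not reproduce its equivalence result.

```python
import numpy as np

def givens(d, i, j, theta):
    """d x d rotation by angle theta in the (i, j) coordinate plane."""
    g = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i], g[j, j] = c, c
    g[i, j], g[j, i] = -s, s
    return g

d = 4
thetas = [0.3, -1.1, 0.7]  # one scalar parameter per rotation (illustrative values)
r = np.eye(d)
for k, theta in enumerate(thetas):
    # Rotate adjacent coordinate planes (k, k + 1) in sequence.
    r = givens(d, k, k + 1, theta) @ r
```

Because every factor is orthogonal with determinant 1, the product `r` is as well, regardless of the angle values.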
Bayesian Algorithms for Kronecker-structured Sparse Vector Recovery With Application to IRS-MIMO Channel Estimation
We study the sparse recovery problem with an underdetermined linear system characterized by a Kronecker-structured dictionary and a Kronecker-supported sparse vector. We cast this problem into the sparse Bayesian learning (SBL) framework and rely on the expectation-maximization method for a solution. To this end, we model the Kronecker-structured support with a hierarchical Gaussian prior distribution parameterized by a Kronecker-structured hyperparameter, leading to a non-convex optimization problem. The optimization problem is solved using the alternating minimization (AM) method and a singular value decomposition (SVD)-based method, resulting in two algorithms. Further, we analytically guarantee that the AM-based method converges to the stationary point of the SBL cost function. The SVD-based method, though it adopts approximations, is empirically shown to be more efficient and accurate. We then apply our algorithm to estimate the uplink wireless channel in an intelligent reflecting surface-aided MIMO system and extend the AM-based algorithm to address block sparsity in the channel. We also study the SBL cost to show that the minima of the cost function are achieved at sparse solutions and that incorporating the Kronecker structure reduces the number of local minima of the SBL cost function. Our numerical results demonstrate the effectiveness of our algorithms compared to the state-of-the-art.
Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields
In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by a decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property of the decomposed low-rank tensor, our method achieves an effect equivalent to brute-force 3D convolution while incurring only little computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization.
Physics-informed cluster analysis and a priori efficiency criterion for the construction of local reduced-order bases
Nonlinear model order reduction has opened the door to parameter optimization and uncertainty quantification in complex physics problems governed by nonlinear equations. In particular, the computational cost of solving these equations can be reduced by means of local reduced-order bases. This article examines the benefits of a physics-informed cluster analysis for the construction of cluster-specific reduced-order bases. We illustrate that the choice of the dissimilarity measure for clustering is fundamental and highly affects the performances of the local reduced-order bases. It is shown that clustering with an angle-based dissimilarity on simulation data efficiently decreases the intra-cluster Kolmogorov N-width. Additionally, an a priori efficiency criterion is introduced to assess the relevance of a ROM-net, a methodology for the reduction of nonlinear physics problems introduced in our previous work in [T. Daniel, F. Casenave, N. Akkari, D. Ryckelynck, Model order reduction assisted by deep neural networks (ROM-net), Advanced Modeling and Simulation in Engineering Sciences 7 (16), 2020]. This criterion also provides engineers with a very practical method for ROM-nets' hyperparameters calibration under constrained computational costs for the training phase. On five different physics problems, our physics-informed clustering strategy significantly outperforms classic strategies for the construction of local reduced-order bases in terms of projection errors.
Finding Manifolds With Bilinear Autoencoders
Sparse autoencoders are a standard tool for uncovering interpretable latent representations in neural networks. Yet, their interpretation depends on the inputs, making their isolated study incomplete. Polynomials offer a solution; they serve as algebraic primitives that can be analysed without reference to input and can describe structures ranging from linear concepts to complicated manifolds. This work uses bilinear autoencoders to efficiently decompose representations into quadratic polynomials. We discuss improvements that induce importance ordering, clustering, and activation sparsity. This is an initial step toward nonlinear yet analysable latents through their algebraic properties.
Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation
While following different technical routes, both low-rank and orthogonal adaptation techniques can efficiently adapt large-scale pre-training models in specific tasks or domains based on a small piece of trainable parameters. In this study, we bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity. The analysis motivates us to regularize the orthogonality of the HRs, leading to different implementations of the proposed Householder reflection adaptation (HRA) method. Compared with state-of-the-art methods, HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators. The code is available at https://github.com/DaShenZi721/HRA
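The link between a chain of Householder reflections and a low-rank update can be checked numerically. The sketch below is an illustration under assumed shapes and random data, not the HRA implementation: it multiplies a stand-in frozen weight by a product of r reflections and inspects the rank of the resulting change.

```python
import numpy as np

def householder_chain(us):
    """Product of reflections H_i = I - 2 u_i u_i^T / ||u_i||^2."""
    d = us.shape[1]
    q = np.eye(d)
    for u in us:
        q = q @ (np.eye(d) - 2.0 * np.outer(u, u) / (u @ u))
    return q

rng = np.random.default_rng(0)
d, r = 6, 2
w = rng.standard_normal((d, d))   # stands in for a frozen weight matrix
us = rng.standard_normal((r, d))  # stands in for learnable reflection vectors
q = householder_chain(us)
delta = w @ q - w                 # change induced by the orthogonal factor
```

Here `q` is orthogonal, while `delta` has rank at most `r`, which is the sense in which multiplying by r reflections behaves like a rank-r adapter.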
Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes
Despite their many desirable properties, Gaussian processes (GPs) are often compared unfavorably to deep neural networks (NNs) for lacking the ability to learn representations. Recent efforts to bridge the gap between GPs and deep NNs have yielded a new class of inter-domain variational GPs in which the inducing variables correspond to hidden units of a feedforward NN. In this work, we examine some practical issues associated with this approach and propose an extension that leverages the orthogonal decomposition of GPs to mitigate these limitations. In particular, we introduce spherical inter-domain features to construct more flexible data-dependent basis functions for both the principal and orthogonal components of the GP approximation and show that incorporating NN activation features under this framework not only alleviates these shortcomings but is more scalable than alternative strategies. Experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
D-PAD: Deep-Shallow Multi-Frequency Patterns Disentangling for Time Series Forecasting
In time series forecasting, effectively disentangling intricate temporal patterns is crucial. While recent works endeavor to combine decomposition techniques with deep learning, multiple frequencies may still be mixed in the decomposed components, e.g., trend and seasonal. Furthermore, frequency domain analysis methods, e.g., Fourier and wavelet transforms, have limitations in resolution in the time domain and adaptability. In this paper, we propose D-PAD, a deep-shallow multi-frequency patterns disentangling neural network for time series forecasting. Specifically, a multi-component decomposing (MCD) block is introduced to decompose the series into components with different frequency ranges, corresponding to the "shallow" aspect. A decomposition-reconstruction-decomposition (D-R-D) module is proposed to progressively extract the information of frequencies mixed in the components, corresponding to the "deep" aspect. After that, an interaction and fusion (IF) module is used to further analyze the components. Extensive experiments on seven real-world datasets demonstrate that D-PAD achieves the state-of-the-art performance, outperforming the best baseline by an average of 9.48% and 7.15% in MSE and MAE, respectively.
MoDeGPT: Modular Decomposition for Large Language Model Compression
Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, substantial computational requirements make their deployment challenging on devices with limited resources. Recently, compression methods using low-rank matrix techniques have shown promise, yet these often lead to degraded accuracy or introduce significant overhead in parameters and inference latency. This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework that does not need recovery fine-tuning while resolving the above drawbacks. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions via reconstructing the module-level outputs. MoDeGPT is developed based on a theoretical framework that utilizes three well-established matrix decomposition algorithms -- Nyström approximation, CR decomposition, and SVD -- and applies them to our redefined transformer modules. Our comprehensive experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods that rely on gradient information, and saves 98% of compute costs on compressing a 13B model. On Llama-2/3 and OPT models, MoDeGPT maintains 90-95% zero-shot performance with 25-30% compression rates. Moreover, the compression can be done on a single GPU within a few hours and increases the inference throughput by up to 46%.
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Low Rank Decomposition of a matrix, i.e., splitting a large matrix into a product of two smaller matrices, offers a means for compression that reduces the parameters of a model without sparsification, hence delivering more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all the parameters trainable, while being able to leverage the existing highly efficient kernels over floating point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD) and observe that ranks for the linear layers in these models can be reduced by up to 39.58% with less than 1% increase in perplexity. We then use Low Rank Decomposition (LoRD) to compress StarCoder 16B to 13.2B parameters with no drop and to 12.3B with minimal drop in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of change in code over Hugging Face's implementation with the PyTorch backend. Low Rank Decomposition (LoRD) models remain compatible with state-of-the-art near-lossless quantization methods such as SpQR, which allows leveraging further compression gains of quantization. Lastly, QLoRA over a Low Rank Decomposition (LoRD) model further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows Low Rank Decomposition (LoRD) as a promising new paradigm for LLM compression.
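As a rough illustration of splitting one weight matrix into two smaller factors, the sketch below uses a truncated SVD. Whether LoRD uses exactly this factorization is not stated in the abstract, so treat the method, the shapes, and the helper name `low_rank_factors` as assumptions.

```python
import numpy as np

def low_rank_factors(w, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    b = vt[:rank, :]            # shape (rank, in_features)
    return a, b

rng = np.random.default_rng(0)
# A 64 x 48 matrix that is exactly rank 8, standing in for a linear layer.
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
a, b = low_rank_factors(w, rank=8)
params_before = w.size          # 64 * 48 = 3072
params_after = a.size + b.size  # 64 * 8 + 8 * 48 = 896
```

The two factors remain ordinary dense matrices, so a layer `x @ w.T` can be replaced by two dense matrix multiplications, `(x @ b.T) @ a.T`, without special kernels.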
A nonintrusive Reduced Basis Method applied to aeroacoustic simulations
The Reduced Basis Method can be exploited in an efficient way only if the so-called affine dependence assumption on the operator and right-hand side of the considered problem with respect to the parameters is satisfied. When it is not, the Empirical Interpolation Method is usually used to recover this assumption approximately. In both cases, the Reduced Basis Method requires to access and modify the assembly routines of the corresponding computational code, leading to an intrusive procedure. In this work, we derive variants of the EIM algorithm and explain how they can be used to turn the Reduced Basis Method into a nonintrusive procedure. We present examples of aeroacoustic problems solved by integral equations and show how our algorithms can benefit from the linear algebra tools available in the considered code.
O-MMGP: Optimal Mesh Morphing Gaussian Process Regression for Solving PDEs with non-Parametric Geometric Variations
We address the computational challenges of solving parametric PDEs with non parametrized geometric variations and non-reducible problems, such as those involving shocks and discontinuities of variable positions. Traditional dimensionality reduction methods like POD struggle with these scenarios due to slowly decaying Kolmogorov widths. To overcome this, we propose a novel non-linear dimensionality reduction technique to reduce the required modes for representation. The non-linear reduction is obtained through a POD after applying a transformation on the fields, which we call optimal mappings, and is a solution to an optimization problem in infinite dimension. The proposed learning framework combines morphing techniques, non-linear dimensionality reduction, and Gaussian Process Regression (GPR). The problem is reformulated on a reference geometry before applying the dimensionality reduction. Our method learns both the optimal mapping, and the solution fields, using a series of GPR models, enabling efficient and accurate modeling of complex parametric PDEs with geometrical variability. The results obtained concur with current state-of-the-art models. We mainly compare our method with the winning solution of the ML4CFD NeurIPS 2024 competition.
ROOT: Robust Orthogonalized Optimizer for Neural Network Training
The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
Projections onto Spectral Matrix Cones
Semidefinite programming is a fundamental problem class in convex optimization, but despite recent advances in solvers, solving large-scale semidefinite programs remains challenging. Generally the matrix functions involved are spectral or unitarily invariant, i.e., they depend only on the eigenvalues or singular values of the matrix. This paper investigates how spectral matrix cones -- cones defined from epigraphs and perspectives of spectral or unitarily invariant functions -- can be used to enhance first-order conic solvers for semidefinite programs. Our main result shows that projecting a matrix can be reduced to projecting its eigenvalues or singular values, which we demonstrate can be done at a negligible cost compared to the eigenvalue or singular value decomposition itself. We have integrated support for spectral matrix cone projections into the Splitting Conic Solver (SCS). Numerical experiments show that SCS with this enhancement can achieve speedups of up to an order of magnitude for solving semidefinite programs arising in experimental design, robust principal component analysis, and graph partitioning.
Generative Principal Component Analysis
In this paper, we study the problem of principal component analysis with generative modeling assumptions, adopting a general model for the observed matrix that encompasses notable special cases, including spiked matrix recovery and phase retrieval. The key assumption is that the underlying signal lies near the range of an L-Lipschitz continuous generative model with bounded k-dimensional inputs. We propose a quadratic estimator, and show that it enjoys a statistical rate of order $\sqrt{\frac{k \log L}{m}}$, where m is the number of samples. We also provide a near-matching algorithm-independent lower bound. Moreover, we provide a variant of the classic power method, which projects the calculated data onto the range of the generative model during each iteration. We show that under suitable conditions, this method converges exponentially fast to a point achieving the above-mentioned statistical rate. We perform experiments on various image datasets for spiked matrix and phase retrieval models, and illustrate performance gains of our method over the classic power method and the truncated power method devised for sparse principal component analysis.
4-bit Shampoo for Memory-Efficient Network Training
Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification demonstrates that our 4-bit Shampoo achieves comparable test accuracy to its 32-bit counterpart while being more memory-efficient. The source code will be made available.
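The inverse 4-th root mentioned above can be computed from the eigendecomposition of a symmetric positive-definite matrix. The following is a minimal full-precision sketch with hypothetical data; it does not include the quantization or orthogonality rectification that the abstract describes.

```python
import numpy as np

def inverse_fourth_root(a, eps=1e-12):
    """A^(-1/4) for a symmetric positive-definite matrix via eigendecomposition."""
    lam, v = np.linalg.eigh(a)   # a = v @ diag(lam) @ v.T
    lam = np.maximum(lam, eps)   # guard against tiny or negative round-off values
    return (v * lam ** -0.25) @ v.T

rng = np.random.default_rng(0)
m = rng.standard_normal((5, 5))
a = m @ m.T + 5.0 * np.eye(5)    # SPD stand-in for a preconditioner
root = inverse_fourth_root(a)
```

Since the result depends on the eigenvector matrix `v` being orthogonal, errors that break orthogonality in `v` propagate directly into the computed root.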
An Algorithm for Computing with Brauer's Group Equivariant Neural Network Layers
The learnable, linear neural network layers between tensor power spaces of R^{n} that are equivariant to the orthogonal group, O(n), the special orthogonal group, SO(n), and the symplectic group, Sp(n), were characterised in arXiv:2212.08630. We present an algorithm for multiplying a vector by any weight matrix for each of these groups, using category theoretic constructions to implement the procedure. We achieve a significant reduction in computational cost compared with a naive implementation by making use of Kronecker product matrices to perform the multiplication. We show that our approach extends to the symmetric group, S_n, recovering the algorithm of arXiv:2303.06208 in the process.
WUSH: Near-Optimal Adaptive Transforms for LLM Quantization
Quantization to low bitwidth is a standard approach for deploying large language models, however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation approach is to apply some fixed orthogonal transforms, such as Hadamard matrices, before quantization, which typically reduces the dynamic range. Yet, these transforms ignore the statistics of the data, and their optimality is currently not understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization using standard data-free quantizers for common numerical formats. Specifically, we provide derivations of the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers for both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions and remains structured for efficient implementation. Preliminary experimental results show that our approach consistently improves upon the Hadamard transform for common formats.
Intrinsic Image Decomposition via Ordinal Shading
Intrinsic decomposition is a fundamental mid-level vision problem that plays a crucial role in various inverse rendering and computational photography pipelines. Generating highly accurate intrinsic decompositions is an inherently under-constrained task that requires precisely estimating continuous-valued shading and albedo. In this work, we achieve high-resolution intrinsic decomposition by breaking the problem into two parts. First, we present a dense ordinal shading formulation using a shift- and scale-invariant loss in order to estimate ordinal shading cues without restricting the predictions to obey the intrinsic model. We then combine low- and high-resolution ordinal estimations using a second network to generate a shading estimate with both global coherency and local details. We encourage the model to learn an accurate decomposition by computing losses on the estimated shading as well as the albedo implied by the intrinsic model. We develop a straightforward method for generating dense pseudo ground truth using our model's predictions and multi-illumination data, enabling generalization to in-the-wild imagery. We present an exhaustive qualitative and quantitative analysis of our predicted intrinsic components against state-of-the-art methods. Finally, we demonstrate the real-world applicability of our estimations by performing otherwise difficult editing tasks such as recoloring and relighting.
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Deep Neural Networks (DNNs) have been a large driver and enabler for AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, the training process of such large models is a costly and time-consuming process, which typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification or quantization of the model weights and updates. While able to achieve high compression rates, they often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression in the training process. Similarly, such techniques (e.g., SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of linear mapping on uniformly distributed data and PCA for linear autoencoders. We further apply our technique on DNNs and empirically illustrate that Maestro enables the extraction of lower footprint models that preserve model performance while allowing for graceful accuracy-latency tradeoff for the deployment to devices of different capabilities.
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We observe the issue with PCA projection where significant performance degradation is observed at low compression rates. To bridge the gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates for various layers and heads given varying compression budgets. Compared to previous works, our method can easily embrace pre-trained LLMs and hold a smooth tradeoff between performance and compression rate. We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.
SPRIGHT: A Fast and Robust Framework for Sparse Walsh-Hadamard Transform
We consider the problem of computing the Walsh-Hadamard Transform (WHT) of some N-length input vector in the presence of noise, where the N-point Walsh spectrum is K-sparse with K = O(N^δ) scaling sub-linearly in the input dimension N for some 0 < δ < 1. Over the past decade, there has been a resurgence in research related to the computation of Discrete Fourier Transform (DFT) for some length-N input signal that has a K-sparse Fourier spectrum. In particular, through a sparse-graph code design, our earlier work on the Fast Fourier Aliasing-based Sparse Transform (FFAST) algorithm computes the K-sparse DFT in time O(K log K) by taking O(K) noiseless samples. Inspired by the coding-theoretic design framework, Scheibler et al. proposed the Sparse Fast Hadamard Transform (SparseFHT) algorithm that elegantly computes the K-sparse WHT in the absence of noise using O(K log N) samples in time O(K log^2 N). However, the SparseFHT algorithm explicitly exploits the noiseless nature of the problem, and is not equipped to deal with scenarios where the observations are corrupted by noise. Therefore, a question of critical interest is whether this coding-theoretic framework can be made robust to noise. Further, if the answer is yes, what is the extra price that needs to be paid for being robust to noise? In this paper, we show, quite interestingly, that there is no extra price that needs to be paid for being robust to noise other than a constant factor. In other words, we can maintain the same sample complexity O(K log N) and the computational complexity O(K log^2 N) as those of the noiseless case, using our SParse Robust Iterative Graph-based Hadamard Transform (SPRIGHT) algorithm.
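For reference, the dense baseline that sparse methods improve on is the standard fast Walsh-Hadamard transform, which reads all N samples and runs in O(N log N) time. The sketch below implements that baseline only (natural Hadamard ordering, unnormalized); it is not the SPRIGHT or SparseFHT algorithm, and the example signal is made up.

```python
def fwht(x):
    """Unnormalized Walsh-Hadamard transform of a length-2^n sequence."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine entries that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a[i], a[i + h] = a[i] + a[i + h], a[i] - a[i + h]
        h *= 2
    return a

signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)  # [4, 2, 0, -2, 0, 2, 0, 2]
# Applying the transform twice scales the input by N, so dividing inverts it.
recovered = [v // len(signal) for v in fwht(spectrum)]
```

A sparse algorithm aims to recover the K nonzero spectrum entries from far fewer than N samples, whereas this baseline touches every sample at each of its log N stages.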
Stochastic Parameter Decomposition
A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition -- a framework that has been proposed to resolve several issues with current decomposition methods -- decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce Stochastic Parameter Decomposition (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at https://github.com/goodfire-ai/spd.
Unified Low-rank Compression Framework for Click-through Rate Prediction
Deep Click-Through Rate (CTR) prediction models play an important role in modern industrial recommendation scenarios. However, high memory overhead and computational costs limit their deployment in resource-constrained environments. Low-rank approximation is an effective method for computer vision and natural language processing models, but its application in compressing CTR prediction models has been less explored. Due to the limited memory and computing resources, compression of CTR prediction models often confronts three fundamental challenges, i.e., (1). How to reduce the model sizes to adapt to edge devices? (2). How to speed up CTR prediction model inference? (3). How to retain the capabilities of original models after compression? Previous low-rank compression research mostly uses tensor decomposition, which can achieve a high parameter compression ratio, but brings in AUC degradation and additional computing overhead. To address these challenges, we propose a unified low-rank decomposition framework for compressing CTR prediction models. We find that even with the most classic matrix decomposition SVD method, our framework can achieve better performance than the original model. To further improve the effectiveness of our framework, we locally compress the output features instead of compressing the model weights. Our unified low-rank compression framework can be applied to embedding tables and MLP layers in various CTR prediction models. Extensive experiments on two academic datasets and one real industrial benchmark demonstrate that, with 3-5x model size reduction, our compressed models can achieve both faster inference and higher AUC than the uncompressed original models. Our code is at https://github.com/yuhao318/Atomic_Feature_Mimicking.
Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout
We propose Orthogonal Monte Carlo Dropout, a mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity. Low-Rank Adaptation (LoRA), a popular fine-tuning method for large models, typically trains a module to represent a specific concept such as an object or a style. When multiple LoRA modules are merged, for example to generate an object in a particular style, their outputs (semantic vectors) may interfere with each other. Our method guarantees that merged LoRA modules remain orthogonal and thus free from direct interference. However, empirical analysis reveals that such orthogonality does not lead to the semantic disentanglement highlighted in prior work on compositional adaptation. This finding suggests that inter-LoRA orthogonality alone may be insufficient for achieving true semantic compositionality, prompting a re-examination of its role in adapter merging.
Approximately Optimal Core Shapes for Tensor Decompositions
This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.
POME: Post Optimization Model Edit via Muon-style Projection
We introduce Post-Optimization Model Edit (POME), a new algorithm that enhances the performance of fine-tuned large language models using only their pretrained and fine-tuned checkpoints, without requiring extra data or further optimization. The core idea is to apply a muon-style projection to ΔW, the difference between the fine-tuned and pretrained weights. This projection uses truncated singular value decomposition (SVD) to equalize the influence of dominant update directions and prune small singular values, which often represent noise. As a simple post-processing step, POME is completely decoupled from the training pipeline. It requires zero modifications and imposes no overhead, making it universally compatible with any optimizer or distributed framework. POME delivers consistent gains, boosting average performance by +2.5% on GSM8K and +1.0% on code generation. Its broad applicability -- from 7B foundation models to 72B RLHF-instructed models -- establishes it as a practical, zero-cost enhancement for any fine-tuning pipeline. Code is available at https://github.com/NUS-HPC-AI-Lab/POME.
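The idea of equalizing dominant directions and pruning small singular values of ΔW can be sketched roughly as below. This is only a reading of the abstract, not the released implementation: the function name `muon_style_edit`, the `rank` argument, and in particular the default rescaling by the mean of the kept singular values are assumptions.

```python
import numpy as np

def muon_style_edit(w_pre, w_ft, rank, scale=None):
    """Replace the top-`rank` singular values of the update by one common value."""
    delta = w_ft - w_pre
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the leading directions and give each of them unit weight.
    ortho = u[:, :rank] @ vt[:rank]
    if scale is None:
        # Assumption: rescale by the mean of the kept singular values.
        scale = s[:rank].mean()
    return w_pre + scale * ortho

rng = np.random.default_rng(0)
w_pre = rng.normal(size=(6, 5))                   # stand-in pretrained weights
w_ft = w_pre + 0.1 * rng.normal(size=(6, 5))      # stand-in fine-tuned weights
w_new = muon_style_edit(w_pre, w_ft, rank=2)
edited_singular_values = np.linalg.svd(w_new - w_pre, compute_uv=False)
```

After the edit, the update has `rank` equal singular values and the remaining ones are zero, which is one way to read "equalize the influence of dominant update directions and prune small singular values".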
Generalized Kernel Thinning
The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a probability distribution more effectively than independent sampling by targeting a reproducing kernel Hilbert space (RKHS) and leveraging a less smooth square-root kernel. Here we provide four improvements. First, we show that KT applied directly to the target RKHS yields tighter, dimension-free guarantees for any kernel, any distribution, and any fixed function in the RKHS. Second, we show that, for analytic kernels like Gaussian, inverse multiquadric, and sinc, target KT admits maximum mean discrepancy (MMD) guarantees comparable to or better than those of square-root KT without making explicit use of a square-root kernel. Third, we prove that KT with a fractional power kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth kernels, like Laplace and Matérn, that do not have square-roots. Fourth, we establish that KT applied to a sum of the target and power kernels (a procedure we call KT+) simultaneously inherits the improved MMD guarantees of power KT and the tighter individual function guarantees of target KT. In our experiments with target KT and KT+, we witness significant improvements in integration error even in 100 dimensions and when compressing challenging differential equation posteriors.
Robustifying State-space Models for Long Sequences via Approximate Diagonalization
State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. However, the complicated structure of the S4 layer poses challenges; and, in an effort to address these challenges, models such as S4D and S5 have considered a purely diagonal structure. This choice simplifies the implementation, improves computational efficiency, and allows channel communication. However, diagonalizing the HiPPO framework is itself an ill-posed problem. In this paper, we propose a general solution for this and related ill-posed diagonalization problems in machine learning. We introduce a generic, backward-stable "perturb-then-diagonalize" (PTD) methodology, which is based on the pseudospectral theory of non-normal operators, and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization only achieves weak convergences. As a result, our new models show resilience to Fourier-mode noise-perturbed inputs, a crucial property not achieved by the S4D/S5 models. In addition to improved robustness, our S5-PTD model averages 87.6% accuracy on the Long-Range Arena benchmark, demonstrating that the PTD methodology helps to improve the accuracy of deep learning models.
Weighted least-squares approximation with determinantal point processes and generalized volume sampling
We consider the problem of approximating a function from L^2 by an element of a given m-dimensional space V_m, associated with some feature map φ, using evaluations of the function at random points x_1, …, x_n. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features φ(x_i). We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples n = O(m log(m)), that means that the expected L^2 error is bounded by a constant times the best approximation error in L^2. Also, further assuming that the function is in some normed vector space H continuously embedded in L^2, we further prove that the approximation is almost surely bounded by the best approximation error measured in the H-norm. This includes the cases of functions from L^∞ or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy consisting in using independent repetitions of projection DPP (or volume sampling), yielding similar error bounds as with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.
Group Orthogonalization Regularization For Vision Models Adaptation and Robustness
As neural networks become deeper, the redundancy within their parameters increases. This phenomenon has led to several methods that attempt to reduce the correlation between convolutional filters. We propose a computationally efficient regularization technique that encourages orthonormality between groups of filters within the same layer. Our experiments show that when incorporated into recent adaptation methods for diffusion models and vision transformers (ViTs), this regularization improves performance on downstream tasks. We further show improved robustness when group orthogonality is enforced during adversarial training. Our code is available at https://github.com/YoavKurtz/GOR.
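A group-wise orthonormality penalty of the kind described can be sketched as follows. The abstract does not state the exact loss, so the squared Frobenius form below (a common soft-orthogonality penalty), the function name, and the contiguous row grouping are all assumptions.

```python
import numpy as np

def group_orthogonality_penalty(weight, num_groups):
    """Sum over groups of ||W_g W_g^T - I||_F^2, with filters flattened to rows."""
    w = weight.reshape(weight.shape[0], -1)      # (out_channels, fan_in)
    if w.shape[0] % num_groups != 0:
        raise ValueError("out_channels must be divisible by num_groups")
    penalty = 0.0
    for group in np.split(w, num_groups, axis=0):
        gram = group @ group.T                   # small (g, g) Gram matrix per group
        penalty += np.sum((gram - np.eye(group.shape[0])) ** 2)
    return penalty

rng = np.random.default_rng(0)
random_filters = rng.normal(size=(8, 2, 3, 3))   # hypothetical conv layer: 8 filters
q, _ = np.linalg.qr(rng.normal(size=(18, 8)))    # orthonormal columns
ortho_filters = q.T.reshape(8, 2, 3, 3)          # 8 mutually orthonormal filters

penalty_random = group_orthogonality_penalty(random_filters, num_groups=2)
penalty_ortho = group_orthogonality_penalty(ortho_filters, num_groups=2)
```

Restricting the Gram matrix to each group keeps it at size g × g rather than out_channels × out_channels, which is a plausible source of the computational efficiency the abstract claims.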
Low-Rank Approximation, Adaptation, and Other Tales
Low-rank approximation is a fundamental technique in modern data analysis, widely utilized across various fields such as signal processing, machine learning, and natural language processing. Despite its ubiquity, the mechanics of low-rank approximation and its application in adaptation can sometimes be obscure, leaving practitioners and researchers with questions about its true capabilities and limitations. This paper seeks to clarify low-rank approximation and adaptation by offering a comprehensive guide that reveals their inner workings and explains their utility in a clear and accessible way. Our focus here is to develop a solid intuition for how low-rank approximation and adaptation operate, and why they are so effective. We begin with basic concepts and gradually build up to the mathematical underpinnings, ensuring that readers of all backgrounds can gain a deeper understanding of low-rank approximation and adaptation. We strive to strike a balance between informal explanations and rigorous mathematics, ensuring that both newcomers and experienced experts can benefit from this survey. Additionally, we introduce new low-rank decomposition and adaptation algorithms that have not yet been explored in the field, hoping that future researchers will investigate their potential applicability.
Physics-informed Reduced Order Modeling of Time-dependent PDEs via Differentiable Solvers
Reduced-order modeling (ROM) of time-dependent and parameterized differential equations aims to accelerate the simulation of complex high-dimensional systems by learning a compact latent manifold representation that captures the characteristics of the solution fields and their time-dependent dynamics. Although high-fidelity numerical solvers generate the training datasets, they have thus far been excluded from the training process, causing the learned latent dynamics to drift away from the discretized governing physics. This mismatch often limits generalization and forecasting capabilities. In this work, we propose Physics-informed ROM (Φ-ROM) by incorporating differentiable PDE solvers into the training procedure. Specifically, the latent space dynamics and its dependence on PDE parameters are shaped directly by the governing physics encoded in the solver, ensuring a strong correspondence between the full and reduced systems. Our model outperforms state-of-the-art data-driven ROMs and other physics-informed strategies by accurately generalizing to new dynamics arising from unseen parameters, enabling long-term forecasting beyond the training horizon, maintaining continuity in both time and space, and reducing the data cost. Furthermore, Φ-ROM learns to recover and forecast the solution fields even when trained or evaluated with sparse and irregular observations of the fields, providing a flexible framework for field reconstruction and data assimilation. We demonstrate the framework's robustness across various PDE solvers and highlight its broad applicability by providing an open-source JAX implementation that is readily extensible to other PDE systems and differentiable solvers, available at https://phi-rom.github.io.
Differential Privacy of Quantum and Quantum-Inspired-Classical Recommendation Algorithms
We analyze the DP (differential privacy) properties of the quantum recommendation algorithm and the quantum-inspired-classical recommendation algorithm. We discover that the quantum recommendation algorithm is a privacy curating mechanism on its own, requiring no external noise, which is different from traditional differential privacy mechanisms. In our analysis, a novel perturbation method tailored for SVD (singular value decomposition) and low-rank matrix approximation problems is introduced. Using the perturbation method and random matrix theory, we are able to derive that both the quantum and quantum-inspired-classical algorithms are (O(1/n), O(1/min{m,n}))-DP under some reasonable restrictions, where m and n are numbers of users and products in the input preference database respectively. Nevertheless, a comparison shows that the quantum algorithm has better privacy preserving potential than the classical one.
Single Image Reflection Separation via Component Synergy
The reflection superposition phenomenon is complex and widely distributed in the real world, which derives various simplified linear and nonlinear formulations of the problem. In this paper, based on the investigation of the weaknesses of existing models, we propose a more general form of the superposition model by introducing a learnable residue term, which can effectively capture residual information during decomposition, guiding the separated layers to be complete. In order to fully capitalize on its advantages, we further design the network structure elaborately, including a novel dual-stream interaction mechanism and a powerful decomposition network with a semantic pyramid encoder. Extensive experiments and ablation studies are conducted to verify our superiority over state-of-the-art approaches on multiple real-world benchmark datasets. Our code is publicly available at https://github.com/mingcv/DSRNet.
A nonintrusive method to approximate linear systems with nonlinear parameter dependence
We consider a family of linear systems A_μ α = C with system matrix A_μ depending on a parameter μ and for simplicity parameter-independent right-hand side C. These linear systems typically result from the finite-dimensional approximation of a parameter-dependent boundary-value problem. We derive a procedure based on the Empirical Interpolation Method to obtain a separated representation of the system matrix in the form A_μ ≈ Σ_m β_m(μ) A_{μ_m} for some selected values of the parameter. Such a separated representation is in particular useful in the Reduced Basis Method. The procedure is called nonintrusive since it only requires to access the matrices A_{μ_m}. As such, it offers a crucial advantage over existing approaches that instead derive separated representations requiring to enter the code at the level of assembly. Numerical examples illustrate the performance of our new procedure on a simple one-dimensional boundary-value problem and on three-dimensional acoustic scattering problems solved by a boundary element method.
ReALLM: A general framework for LLM compression and fine-tuning
We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of <4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on b bits and a neural decoder model D_φ with its weights on b_φ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of 3 bits without any training. With a budget of 2 bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
Fast Differentiable Matrix Square Root
Computing the matrix square root or its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or in the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration. Experimental results on the de-correlated batch normalization and second-order vision transformer demonstrate that our methods can also achieve competitive and even slightly better performances. The code is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt.
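The Newton-Schulz baseline named in this abstract can be sketched in a few lines. This is the classical coupled iteration, not the MTP or MPA variants the paper proposes; the function name, the Frobenius-norm pre-scaling, and the iteration count are illustrative assumptions.

```python
import numpy as np

def newton_schulz_sqrt(a, num_iters=30):
    """Approximate the principal square root of a symmetric positive definite matrix."""
    n = a.shape[0]
    norm = np.linalg.norm(a)          # Frobenius norm; scaling ensures convergence
    y = a / norm                      # converges to sqrt(a / norm)
    z = np.eye(n)                     # converges to the inverse of that square root
    for _ in range(num_iters):
        t = 0.5 * (3.0 * np.eye(n) - z @ y)
        y = y @ t
        z = t @ z
    return y * np.sqrt(norm)          # undo the scaling

rng = np.random.default_rng(0)
b = rng.normal(size=(5, 5))
a = b @ b.T / 5 + np.eye(5)           # well-conditioned SPD test matrix
root = newton_schulz_sqrt(a)
```

Only matrix products appear in the loop, which is why the iteration is GPU-friendly and differentiable by ordinary backpropagation. The cost of running many such iterations forward and backward is the inefficiency the abstract targets.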
On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models
Diffusion models are generative models that have recently demonstrated impressive performances in terms of sampling quality and density estimation in high dimensions. They rely on a forward continuous diffusion process and a backward continuous denoising process, which can be described by a time-dependent vector field and is used as a generative model. In the original formulation of the diffusion model, this vector field is assumed to be the score function (i.e. it is the gradient of the log-probability at a given time in the diffusion process). Curiously, on the practical side, most studies on diffusion models implement this vector field as a neural network function and do not constrain it to be the gradient of some energy function (that is, most studies do not constrain the vector field to be conservative). Even though some studies investigated empirically whether such a constraint will lead to a performance gain, they led to contradictory results and failed to provide analytical results. Here, we provide three analytical results regarding the extent of the modeling freedom of this vector field. Firstly, we propose a novel decomposition of vector fields into a conservative component and an orthogonal component which satisfies a given (gauge) freedom. Secondly, from this orthogonal decomposition, we show that exact density estimation and exact sampling are achieved when the conservative component exactly equals the true score, and therefore conservativity is neither necessary nor sufficient to obtain exact density estimation and exact sampling. Finally, we show that when it comes to inferring local information of the data manifold, constraining the vector field to be conservative is desirable.
LoRA^2: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models
Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks. In this paper, we extend LoRA to multiple scales, dubbed LoRA^2. We first combine orthogonal projection theory to train a set of LoRAs in two mutually orthogonal planes. Then, we improve the importance score algorithm, which reduces parameter sensitivity score calculations by approximately 98.5%, and prune singular values with lower importance scores, thereby enhancing adaptability to various downstream tasks. Extensive experiments are conducted on two widely used pre-trained models to validate the effectiveness of LoRA^2. Results show that it significantly reduces the number of trainable parameters to just 0.72% compared to full fine-tuning, while still delivering highly impressive performance. Even when the parameters are further reduced to 0.17M, it still achieves comparable results to the baseline with 8 times more parameters. Our code is available here: https://anonymous.4open.science/r/LoRA-2-5B4C
Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization
To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate for stationary kernels than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.
Self-Calibration and Bilinear Inverse Problems via Linear Least Squares
Whenever we use devices to take measurements, calibration is indispensable. While the purpose of calibration is to reduce bias and uncertainty in the measurements, it can be quite difficult, expensive, and sometimes even impossible to implement. We study a challenging problem called self-calibration, i.e., the task of designing an algorithm for devices so that the algorithm is able to perform calibration automatically. More precisely, we consider the setup y = A(d) x + ε where only partial information about the sensing matrix A(d) is known and where A(d) linearly depends on d. The goal is to estimate the calibration parameter d (resolve the uncertainty in the sensing process) and the signal/object of interest x simultaneously. For three different models of practical relevance, we show how such a bilinear inverse problem, including blind deconvolution as an important example, can be solved via a simple linear least squares approach. As a consequence, the proposed algorithms are numerically extremely efficient, thus potentially allowing for real-time deployment. We also present a variation of the least squares approach, which leads to a spectral method, where the solution to the bilinear inverse problem can be found by computing the singular vector associated with the smallest singular value of a certain matrix derived from the bilinear system. Explicit theoretical guarantees and stability theory are derived for both techniques; and the sampling complexity is nearly optimal (up to a poly-log factor). Applications in imaging sciences and signal processing are discussed and numerical simulations are presented to demonstrate the effectiveness and efficiency of our approach.
A Fast and Provable Algorithm for Sparse Phase Retrieval
We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations of first-order methods while preserving their hallmark per-iteration computational efficiency. We provide theoretical guarantees that our algorithm converges to the s-sparse ground truth signal x^♮ ∈ R^n (up to a global sign) at a quadratic convergence rate after at most O(log(‖x^♮‖ / x^♮_min)) iterations, using Ω(s^2 log n) Gaussian random samples. Numerical experiments show that our algorithm achieves a significantly faster convergence rate than state-of-the-art methods.
Feature Learning and Generalization in Deep Networks with Orthogonal Weights
Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of ~20, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
The greedy side of the LASSO: New algorithms for weighted sparse recovery via loss function-based orthogonal matching pursuit
We propose a class of greedy algorithms for weighted sparse recovery by considering new loss function-based generalizations of Orthogonal Matching Pursuit (OMP). Given a (regularized) loss function, the proposed algorithms alternate the iterative construction of the signal support via greedy index selection and a signal update based on solving a local data-fitting problem restricted to the current support. We show that greedy selection rules associated with popular weighted sparsity-promoting loss functions admit explicitly computable and simple formulas. Specifically, we consider ℓ^0- and ℓ^1-based versions of the weighted LASSO (Least Absolute Shrinkage and Selection Operator), the Square-Root LASSO (SR-LASSO) and the Least Absolute Deviations LASSO (LAD-LASSO). Through numerical experiments on Gaussian compressive sensing and high-dimensional function approximation, we demonstrate the effectiveness of the proposed algorithms and empirically show that they inherit desirable characteristics from the corresponding loss functions, such as SR-LASSO's noise-blind optimal parameter tuning and LAD-LASSO's fault tolerance. In doing so, our study sheds new light on the connection between greedy sparse recovery and convex relaxation.
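The alternation described above (greedy index selection, then a local fit on the current support) is easiest to see in classical OMP with a plain least-squares loss. The sketch below is that textbook baseline, not the weighted loss-based variants proposed in the paper; the function name `omp` and the identity-plus-Hadamard test dictionary are illustrative assumptions.

```python
import numpy as np

def omp(A, y, sparsity):
    """Classical Orthogonal Matching Pursuit with a least-squares loss."""
    support = []
    residual = y.copy()
    coef = np.zeros(0)
    for _ in range(sparsity):
        # Greedy selection: column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0   # never pick the same index twice
        support.append(int(np.argmax(correlations)))
        # Signal update: least-squares fit restricted to the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x, sorted(support)

# Test dictionary: identity next to a normalized 64 x 64 Hadamard matrix.
H = np.array([[1.0]])
for _ in range(6):
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
A = np.hstack([np.eye(64), H / 8.0])      # 64 equations, 128 unknowns

x_true = np.zeros(128)
x_true[[5, 70, 100]] = [1.5, -2.0, 1.0]
x_hat, found = omp(A, A @ x_true, sparsity=3)
```

The test dictionary has unit-norm columns with mutual coherence 1/8, a regime in which OMP is known to recover any 3-sparse vector exactly from noiseless data. The paper's variants change the selection rule and local fit to match a chosen weighted loss.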
Dimensionality Reduction for General KDE Mode Finding
Finding the mode of a high dimensional probability distribution D is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when D is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of Lee, Li, and Musco (2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.'s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy (1 - ε) for any ε > 0. Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial time algorithm for finding the mode of a kernel density estimate, unless P = NP. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction.
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models
Current parameter-efficient fine-tuning (PEFT) methods build adapters without considering the context of downstream task to learn, or the context of important knowledge to maintain. As a result, there is often a performance gap compared to full-parameter finetuning, and meanwhile the finetuned model suffers from catastrophic forgetting of the pre-trained world knowledge. In this paper, we propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable adapters from weight decomposition oriented by the context of downstream task or world knowledge. Concretely, we collect a few data samples, and perform singular value decomposition for each linear layer of a pre-trained LLM multiplied by the covariance matrix of the input activation using these samples. By doing so, the context of the representative samples is captured through deciding the factorizing orientation. Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation. For the former, we use question-answering samples to obtain the covariance matrices, and use the decomposed components with the smallest r singular values to initialize a learnable adapter, with the others frozen such that the world knowledge is better preserved. For the latter, we use the instruction data from the finetuning task, such as math or coding, to orientate the decomposition and train the largest r components that capture the main characteristics of the task to learn. We conduct extensive experiments on Math, Code, and Instruction Following tasks. Our knowledge-preserved adaptation not only achieves better performance than LoRA on finetuning tasks, but also mitigates the forgetting of world knowledge. Our instruction-previewed adaptation is able to further enhance the finetuning performance, surpassing full-parameter finetuning and the state-of-the-art PEFT methods.
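The covariance-oriented decomposition can be sketched as follows, for the knowledge-preserved option. It is a reading of the abstract rather than the authors' code: the random stand-ins for weights and activations, the covariance normalization, the use of an explicit inverse, and the way the adapter is split into two factors are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, n_samples, r = 6, 8, 64, 2

W = rng.normal(size=(out_dim, in_dim))       # stand-in for a pre-trained linear layer
X = rng.normal(size=(in_dim, n_samples))     # stand-in for sampled input activations
C = X @ X.T / n_samples                      # input covariance, shape (in_dim, in_dim)

# Decompose the context-weighted matrix, then undo the weighting:
# W C = U diag(S) Vt, so W = U diag(S) Vt C^{-1}.
U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
C_inv = np.linalg.inv(C)

# Knowledge-preserved option: the r smallest components become the trainable
# adapter, and the remaining components stay frozen.
adapter_B = U[:, -r:] * S[-r:]               # (out_dim, r)
adapter_A = Vt[-r:] @ C_inv                  # (r, in_dim)
frozen = (U[:, :-r] * S[:-r]) @ Vt[:-r] @ C_inv
```

At initialization `frozen + adapter_B @ adapter_A` reproduces `W`, so the layer's output is unchanged before training. For the instruction-previewed option in the abstract, the largest r components would be taken instead.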
Nearly Optimal Algorithms with Sublinear Computational Complexity for Online Kernel Regression
The trade-off between regret and computational cost is a fundamental problem for online kernel regression, and previous algorithms addressing this trade-off cannot keep optimal regret bounds at a sublinear computational complexity. In this paper, we propose two new algorithms, AOGD-ALD and NONS-ALD, which can keep nearly optimal regret bounds at a sublinear computational complexity, and give sufficient conditions under which our algorithms work. Both algorithms dynamically maintain a group of nearly orthogonal basis functions used to approximate the kernel mapping, and keep nearly optimal regret bounds by controlling the approximation error. The number of basis functions depends on the approximation error and the decay rate of the eigenvalues of the kernel matrix. If the eigenvalues decay exponentially, then AOGD-ALD and NONS-ALD achieve regrets of O(L(f)) and O(d_{eff}(\mu)T), respectively, at a computational complexity in O(\ln^2{T}). If the eigenvalues decay polynomially with degree p \geq 1, then our algorithms keep the same regret bounds at a computational complexity in o(T) in the case of p > 4 and p \geq 10, respectively. Here L(f) is the cumulative loss of f and d_{eff}(\mu) is the effective dimension of the problem. The two regret bounds are nearly optimal and are not comparable.
A New PHO-rmula for Improved Performance of Semi-Structured Networks
Recent advances to combine structured regression models and deep neural networks for better interpretability, more expressiveness, and statistically valid uncertainty quantification demonstrate the versatility of semi-structured neural networks (SSNs). We show that techniques to properly identify the contributions of the different model components in SSNs, however, lead to suboptimal network estimation, slower convergence, and degenerated or erroneous predictions. In order to solve these problems while preserving favorable model properties, we propose a non-invasive post-hoc orthogonalization (PHO) that guarantees identifiability of model components and provides better estimation and prediction quality. Our theoretical findings are supported by numerical experiments, a benchmark comparison as well as a real-world application to COVID-19 infections.
Effective Spectral Unmixing via Robust Representation and Learning-based Sparsity
Hyperspectral unmixing (HU) plays a fundamental role in a wide range of hyperspectral applications. It is still challenging due to the common presence of outlier channels and the large solution space. To address these two issues, we propose a novel model that emphasizes both robust representation and learning-based sparsity. Specifically, we apply the \ell_{2,1}-norm to measure the representation error, preventing outlier channels from dominating our objective. In this way, the side effects of outlier channels are greatly relieved. Besides, we observe that the mixed level of each pixel varies over image grids. Based on this observation, we exploit a learning-based sparsity method to simultaneously learn the HU results and a sparse guidance map. Via this guidance map, the sparsity constraint in the \ell_{p} (0 < p \leq 1)-norm is adaptively imposed according to the learnt mixed level of each pixel. Compared with state-of-the-art methods, our model is better suited to the real situation and is thus expected to achieve better HU results. The resulting objective is highly non-convex and non-smooth, and so it is hard to optimize. As a theoretical contribution, we propose an efficient algorithm to solve it. Meanwhile, the convergence proof and the computational complexity analysis are systematically provided. Extensive evaluations verify that our method is highly promising for the HU task: it achieves very accurate guidance maps and much better HU results compared with state-of-the-art methods.
On the Posterior Distribution in Denoising: Application to Uncertainty Quantification
Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (i.e., the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory-efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project webpage at https://hilamanor.github.io/GaussianDenoisingPosterior/ .
Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models
The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20% compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent.
Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs
Memory complexity and data scarcity have so far prohibited learning solution operators of partial differential equations (PDEs) at high resolutions. We address these limitations by introducing a new data efficient and highly parallelizable operator learning approach with reduced memory requirement and better generalization, called multi-grid tensorized neural operator (MG-TFNO). MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena, through a decomposition of both the input domain and the operator's parameter space. Our contributions are threefold: i) we enable parallelization over input samples with a novel multi-grid-based domain decomposition, ii) we represent the parameters of the model in a high-order latent subspace of the Fourier domain, through a global tensor factorization, resulting in an extreme reduction in the number of parameters and improved generalization, and iii) we propose architectural improvements to the backbone FNO. Our approach can be used in any operator learning setting. We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression. The tensorization combined with the domain decomposition, yields over 150x reduction in the number of parameters and 7x reduction in the domain size without losses in accuracy, while slightly enabling parallelism.
Semi-Parametric Neural Image Synthesis
Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. Our work questions the underlying paradigm of compressing large training data into ever growing parametric representations. We rather present an orthogonal, semi-parametric approach. We complement comparably small diffusion or autoregressive models with a separate image database and a retrieval strategy. During training we retrieve a set of nearest neighbors from this external database for each training instance and condition the generative model on these informative samples. While the retrieval approach is providing the (local) content, the model is focusing on learning the composition of scenes based on this content. As demonstrated by our experiments, simply swapping the database for one with different contents transfers a trained model post-hoc to a novel domain. The evaluation shows competitive performance on tasks which the generative model has not been trained on, such as class-conditional synthesis, zero-shot stylization or text-to-image synthesis without requiring paired text-image data. With negligible memory and computational overhead for the external database and retrieval we can significantly reduce the parameter count of the generative model and still outperform the state-of-the-art.
Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning
Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free second-order optimizers for deep learning in low precision settings. Code: https://github.com/yorkerlin/StructuredNGD-DL
Polynomial Preconditioning for Gradient Methods
We study first-order methods with preconditioning for solving structured nonlinear convex optimization problems. We propose a new family of preconditioners generated by symmetric polynomials. They provide first-order optimization methods with a provable improvement of the condition number, cutting the gaps between highest eigenvalues, without explicit knowledge of the actual spectrum. We give a stochastic interpretation of this preconditioning in terms of coordinate volume sampling and compare it with other classical approaches, including the Chebyshev polynomials. We show how to incorporate a polynomial preconditioning into the Gradient and Fast Gradient Methods and establish the corresponding global complexity bounds. Finally, we propose a simple adaptive search procedure that automatically chooses the best possible polynomial preconditioning for the Gradient Method, minimizing the objective along a low-dimensional Krylov subspace. Numerical experiments confirm the efficiency of our preconditioning strategies for solving various machine learning problems.
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization
Efficient k-nearest neighbor search is a fundamental task, foundational for many problems in NLP. When the similarity is measured by dot-product between dual-encoder vectors or \ell_2-distance, there already exist many scalable and efficient search methods. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders' high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as dual encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for k > 10, our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods that re-rank items retrieved using a dual-encoder or TF-IDF.
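The CUR idea mentioned in this abstract, approximating a full score matrix from a few of its rows and columns, can be illustrated as follows. This is a generic sketch of CUR with a pseudo-inverse linking matrix, not the paper's retrieval pipeline; the function name `cur_approximation` and the way the row and column indices are chosen are assumptions.

```python
import numpy as np

def cur_approximation(M, row_idx, col_idx):
    """Approximate M from selected rows and columns: M ~ C @ U @ R."""
    C = M[:, col_idx]                    # selected columns (all rows)
    R = M[row_idx, :]                    # selected rows (all columns)
    W = M[np.ix_(row_idx, col_idx)]      # intersection of the selected rows and columns
    U = np.linalg.pinv(W)                # linking matrix
    return C @ U @ R
```

In a retrieval setting one would presumably never materialize `M`: only the selected rows and columns require expensive cross-encoder calls, and the remaining scores are read off the product. When the intersection block has the same rank as `M`, the reconstruction is exact.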
Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
Machine learning for differential equations paves the way for computationally efficient alternatives to numerical solvers, with potentially broad impacts in science and engineering. Though current algorithms typically require simulated training data tailored to a given setting, one may instead wish to learn useful information from heterogeneous sources, or from real dynamical systems observations that are messy or incomplete. In this work, we learn general-purpose representations of PDEs from heterogeneous data by implementing joint embedding methods for self-supervised learning (SSL), a framework for unsupervised representation learning that has had notable success in computer vision. Our representation outperforms baseline approaches to invariant tasks, such as regressing the coefficients of a PDE, while also improving the time-stepping performance of neural solvers. We hope that our proposed methodology will prove useful in the eventual development of general-purpose foundation models for PDEs.
Accurate and efficient evaluation of the a posteriori error estimator in the reduced basis method
The reduced basis method is a model reduction technique yielding substantial savings of computational time when a solution to a parametrized equation has to be computed for many values of the parameter. Certification of the approximation is possible by means of an a posteriori error bound. Under appropriate assumptions, this error bound is computed with an algorithm of complexity independent of the size of the full problem. In practice, the evaluation of the error bound can become very sensitive to round-off errors. We propose herein an explanation of this fact. A first remedy has been proposed in [F. Casenave, Accurate a posteriori error evaluation in the reduced basis method. C. R. Math. Acad. Sci. Paris 350 (2012) 539--542.]. Herein, we improve this remedy by proposing a new approximation of the error bound using the Empirical Interpolation Method (EIM). This method achieves higher levels of accuracy and potentially requires fewer precomputations than the usual formula. A version of the EIM stabilized with respect to round-off errors is also derived. The method is illustrated on a simple one-dimensional diffusion problem and a three-dimensional acoustic scattering problem solved by a boundary element method.
Householder Projector for Unsupervised Latent Semantics Discovery
Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only 1% of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity.
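To make the building block concrete, the sketch below accumulates Householder reflections into an orthogonal matrix and combines two such matrices into a low-rank factorization. It is only an illustration under assumptions: the names `householder_orthogonal` and `low_rank_projection`, and the specific form U diag(s) V^T with r retained directions, are chosen here for the example and are not taken from the paper's code.

```python
import numpy as np

def householder_orthogonal(vectors):
    """Multiply reflections H_i = I - 2 v_i v_i^T / (v_i^T v_i) into one orthogonal matrix."""
    d = vectors.shape[1]
    Q = np.eye(d)
    for v in vectors:
        v = v / np.linalg.norm(v)
        Q = Q - 2.0 * np.outer(Q @ v, v)   # same as Q @ (I - 2 v v^T)
    return Q

def low_rank_projection(vec_u, vec_v, s):
    """Form U[:, :r] diag(s) V[:, :r]^T with orthogonal U, V built from Householder vectors."""
    r = len(s)
    U = householder_orthogonal(vec_u)
    V = householder_orthogonal(vec_v)
    return (U[:, :r] * s) @ V[:, :r].T
```

Because each reflection is orthogonal for any nonzero vector, the Householder vectors can be trained without any explicit orthogonality constraint.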
Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models
We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-r LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence X, pretrained weights W^\star, and adapter matrices \alpha B A / r. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only W_V and W_Q) and full adaptations (e.g., W_Q, W_V, and W_K) of weights in attention heads.
The Edge of Orthogonality: A Simple View of What Makes BYOL Tick
Self-predictive unsupervised learning methods such as BYOL or SimSiam have shown impressive results, and counter-intuitively, do not collapse to trivial representations. In this work, we aim at exploring the simplest possible mathematical arguments towards explaining the underlying mechanisms behind self-predictive unsupervised learning. We start with the observation that those methods crucially rely on the presence of a predictor network (and stop-gradient). With simple linear algebra, we show that when using a linear predictor, the optimal predictor is close to an orthogonal projection, and propose a general framework based on orthonormalization that enables to interpret and give intuition on why BYOL works. In addition, this framework demonstrates the crucial role of the exponential moving average and stop-gradient operator in BYOL as an efficient orthonormalization mechanism. We use these insights to propose four new closed-form predictor variants of BYOL to support our analysis. Our closed-form predictors outperform standard linear trainable predictor BYOL at 100 and 300 epochs (top-1 linear accuracy on ImageNet).
Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared U-matrices while enabling effective merging of the expert-specific V components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared U-matrix across experts in the same group, then pursues frequency-based merging for individual V-matrices, and finalizes expert reconstruction using the merged V-matrix. In this way, we align and fuse experts in a shared subspace, and the framework can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5|3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%|86% of original performance with 25%|50% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at https://github.com/lliai/MoERazor.
Dimensionality Reduction in Sentence Transformer Vector Databases with Fast Fourier Transform
Dimensionality reduction in vector databases is pivotal for streamlining AI data management, enabling efficient storage, faster computation, and improved model performance. This paper explores the benefits of reducing vector database dimensions, with a focus on computational efficiency and overcoming the curse of dimensionality. We introduce a novel application of Fast Fourier Transform (FFT) to dimensionality reduction, a method previously underexploited in this context. By demonstrating its utility across various AI domains, including Retrieval-Augmented Generation (RAG) models and image processing, this FFT-based approach promises to improve data retrieval processes and enhance the efficiency and scalability of AI solutions. The incorporation of FFT may not only optimize operations in real-time processing and recommendation systems but also extend to advanced image processing techniques, where dimensionality reduction can significantly improve performance and analysis efficiency. This paper advocates for the broader adoption of FFT in vector database management, marking a significant stride towards addressing the challenges of data volume and complexity in AI research and applications. Unlike many existing approaches, we directly handle the embedding vectors produced by the model after processing a test input.
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
A Fast Incremental Gaussian Mixture Model
This work builds upon previous efforts in online incremental learning, namely the Incremental Gaussian Mixture Network (IGMN). The IGMN is capable of learning from data streams in a single pass by improving its model after analyzing each data point and discarding it thereafter. Nevertheless, it scales poorly, due to its asymptotic time complexity of O(NKD^3) for N data points, K Gaussian components and D dimensions, rendering it inadequate for high-dimensional data. In this paper, we manage to reduce this complexity to O(NKD^2) by deriving formulas for working directly with precision matrices instead of covariance matrices. The final result is a much faster and more scalable algorithm which can be applied to high-dimensional tasks. This is confirmed by applying the modified algorithm to high-dimensional classification datasets.
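The reason precision matrices can bring the per-point cost from cubic to quadratic in D can be seen from the Sherman-Morrison identity. The sketch below is a generic illustration of that identity for an update of the form a * Sigma + b * u u^T; the abstract does not give the paper's exact update formulas, so the function name `precision_rank_one_update` and the scalars `a` and `b` are assumptions.

```python
import numpy as np

def precision_rank_one_update(Lam, u, a, b):
    """Return inv(a * Sigma + b * u u^T) given Lam = inv(Sigma), in O(D^2) time.

    Uses Sherman-Morrison; Lam is assumed symmetric and the denominator nonzero.
    """
    Lu = Lam @ u                                   # O(D^2) matrix-vector product
    denom = 1.0 + (b / a) * (u @ Lu)
    return Lam / a - (b / a**2) * np.outer(Lu, Lu) / denom
```

No matrix inversion or factorization appears in the update, which is what removes the D^3 term; the result agrees with directly inverting the updated covariance.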
Strivec: Sparse Tri-Vector Radiance Fields
We propose Strivec, a novel neural representation that models a 3D scene as a radiance field with sparsely distributed and compactly factorized local tensor feature grids. Our approach leverages tensor decomposition, following the recent work TensoRF, to model the tensor grids. In contrast to TensoRF which uses a global tensor and focuses on their vector-matrix decomposition, we propose to utilize a cloud of local tensors and apply the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple vectors that express local feature distributions along spatial axes and compactly encode a local neural field. We also apply multi-scale tensor grids to discover the geometry and appearance commonalities and exploit spatial coherence with the tri-vector factorization at multiple local scales. The final radiance field properties are regressed by aggregating neural features from multiple local tensors across all scales. Our tri-vector tensors are sparsely distributed around the actual scene surface, discovered by a fast coarse reconstruction, leveraging the sparsity of a 3D scene. We demonstrate that our model can achieve better rendering quality while using significantly fewer parameters than previous methods, including TensoRF and Instant-NGP.
Linearly-Recurrent Autoencoder Networks for Learning Dynamics
This paper describes a method for learning low-dimensional approximations of nonlinear dynamical systems, based on neural-network approximations of the underlying Koopman operator. Extended Dynamic Mode Decomposition (EDMD) provides a useful data-driven approximation of the Koopman operator for analyzing dynamical systems. This paper addresses a fundamental problem associated with EDMD: a trade-off between representational capacity of the dictionary and over-fitting due to insufficient data. A new neural network architecture combining an autoencoder with linear recurrent dynamics in the encoded state is used to learn a low-dimensional and highly informative Koopman-invariant subspace of observables. A method is also presented for balanced model reduction of over-specified EDMD systems in feature space. Nonlinear reconstruction using partially linear multi-kernel regression aims to improve reconstruction accuracy from the low-dimensional state when the data has complex but intrinsically low-dimensional structure. The techniques demonstrate the ability to identify Koopman eigenfunctions of the unforced Duffing equation, create accurate low-dimensional models of an unstable cylinder wake flow, and make short-time predictions of the chaotic Kuramoto-Sivashinsky equation.
Variational Quantum algorithm for Poisson equation
The Poisson equation has wide applications in many areas of science and engineering. Although there are some quantum algorithms that can efficiently solve the Poisson equation, they generally require a fault-tolerant quantum computer which is beyond the current technology. In this paper, we propose a Variational Quantum Algorithm (VQA) to solve the Poisson equation, which can be executed on Noisy Intermediate-Scale Quantum (NISQ) devices. In detail, we first adopt the finite difference method to transform the Poisson equation into a linear system. Then, according to the special structure of the linear system, we find an explicit tensor product decomposition, with only 2 log n + 1 terms, of its coefficient matrix under a specific set of simple operators, where n is the dimension of the coefficient matrix. This implies that the proposed VQA only needs O(log n) measurements, which dramatically reduces the required quantum resources. Additionally, we perform quantum Bell measurements to efficiently evaluate the expectation values of simple operators. Numerical experiments demonstrate that our algorithm can effectively solve the Poisson equation.
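The first step named in this abstract, turning the Poisson equation into a linear system by finite differences, can be sketched classically. The example assumes the one-dimensional problem -u'' = f on (0, 1) with zero boundary values and solves the system with a dense classical solver; the variational quantum part is not reproduced, and the function name `poisson_1d_system` is hypothetical.

```python
import numpy as np

def poisson_1d_system(n):
    """Second-order finite-difference matrix for -u'' = f on (0, 1), u(0) = u(1) = 0.

    Returns the n x n tridiagonal matrix and the n interior grid points.
    """
    h = 1.0 / (n + 1)
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    x = np.linspace(h, 1.0 - h, n)
    return A, x

# Manufactured example: f = pi^2 sin(pi x) has the exact solution u = sin(pi x).
A, x = poisson_1d_system(63)
u_h = np.linalg.solve(A, np.pi**2 * np.sin(np.pi * x))
max_error = np.max(np.abs(u_h - np.sin(np.pi * x)))
```

The tridiagonal coefficient matrix `A` is the object whose structure the abstract says is decomposed into a small number of tensor-product terms.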
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture
In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to a deeper model depth. Unlike prior work that shares all parameters or uses extra blocks, we design a more capable parameter-sharing architecture based on matrix product operator (MPO). MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts: the major part that contains the major information (central tensor) and the supplementary part that only has a small proportion of parameters (auxiliary tensors). Based on such a decomposition, our architecture shares the central tensor across all layers for reducing the model size and meanwhile keeps layer-specific auxiliary tensors (also using adapters) for enhancing the adaptation flexibility. To improve the model training, we further propose a stable initialization algorithm tailored for the MPO-based architecture. Extensive experiments have demonstrated the effectiveness of our proposed model in reducing the model size and achieving highly competitive performance.
A New Perspective on Shampoo's Preconditioner
Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the optimal Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the square of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of Hessian approximation.
PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication
Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is converted into lower dimensionality without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.
Pre-training with Random Orthogonal Projection Image Modeling
Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels. MIM applies random crops to input images, processes them with an encoder, and then recovers the masked inputs with a decoder, which encourages the network to capture and learn structural information about objects and scenes. The intermediate feature representations obtained from MIM are suitable for fine-tuning on downstream tasks. In this paper, we propose an Image Modeling framework based on random orthogonal projection instead of binary masking as in MIM. Our proposed Random Orthogonal Projection Image Modeling (ROPIM) reduces spatial token information under a guaranteed bound on the noise variance and can be considered as masking the entire spatial image area under locally varying masking degrees. Since ROPIM uses a random subspace for the projection that realizes the masking step, the readily available complement of the subspace can be used during unmasking to promote recovery of removed information. In this paper, we show that using random orthogonal projection leads to superior performance compared to crop-based masking. We demonstrate state-of-the-art results on several popular benchmarks.
This is SPIRAL-TAP: Sparse Poisson Intensity Reconstruction ALgorithms - Theory and Practice
The observations in many applications consist of counts of discrete events, such as photons hitting a detector, which cannot be effectively modeled using an additive bounded or Gaussian noise model, and instead require a Poisson noise model. As a result, accurate reconstruction of a spatially or temporally distributed phenomenon (f*) from Poisson data (y) cannot be effectively accomplished by minimizing a conventional penalized least-squares objective function. The problem addressed in this paper is the estimation of f* from y in an inverse problem setting, where (a) the number of unknowns may potentially be larger than the number of observations and (b) f* admits a sparse approximation. The optimization formulation considered in this paper uses a penalized negative Poisson log-likelihood objective function with nonnegativity constraints (since Poisson intensities are naturally nonnegative). In particular, the proposed approach incorporates key ideas of using separable quadratic approximations to the objective function at each iteration and penalization terms related to l1 norms of coefficient vectors, total variation seminorms, and partition-based multiscale estimation methods.
Implicit Regularization Effects of the Sobolev Norms in Image Processing
In this paper, we propose to use the general L^2-based Sobolev norms, i.e., H^s norms where s ∈ ℝ, to measure the data discrepancy due to noise in image processing tasks that are formulated as optimization problems. As opposed to a popular trend of developing regularization methods, we emphasize that an implicit regularization effect can be achieved through the class of Sobolev norms as the data-fitting term. Specifically, we analyze that the implicit regularization comes from the weights that the H^s norm imposes on different frequency contents of an underlying image. We further analyze the underlying noise assumption of using the Sobolev norm as the data-fitting term from a Bayesian perspective, build the connections with the Sobolev gradient-based methods and discuss the preconditioning effects on the convergence rate of the gradient descent algorithm, leading to a better understanding of functional spaces/metrics and the optimization process involved in image processing. Numerical results in full waveform inversion, image denoising and deblurring demonstrate the implicit regularization effects.
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relatively) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only 1%-2% accuracy drop (about 2% drops when measured relatively).
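The factorization described in the abstract above, W = AB with B written as a linear combination of shared basis matrices, can be sketched in NumPy as follows. This is only an illustration of the parameter bookkeeping under assumed values: the dimensions, the number of bases, and the names `bases`, `alpha`, and `expert_weight` are hypothetical, the factors are random rather than learned by minimizing reconstruction error, and nothing here reproduces the reported compression rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions, not taken from the abstract).
num_experts, d, r, p, num_bases = 8, 16, 16, 64, 2

A = rng.standard_normal((num_experts, d, r))           # A_i: unique to each expert
bases = rng.standard_normal((num_bases, r, p))         # {B_j}: shared within the layer
alpha = rng.standard_normal((num_experts, num_bases))  # per-expert mixing coefficients


def expert_weight(i):
    """Rebuild expert i's matrix as A_i @ (sum_j alpha[i, j] * bases[j])."""
    B_i = np.tensordot(alpha[i], bases, axes=1)  # shape (r, p)
    return A[i] @ B_i                            # shape (d, p)


# Parameter counts: dense experts versus the shared-basis form.
original_params = num_experts * d * p
mobe_params = A.size + alpha.size + bases.size
print(original_params, mobe_params)
```

With these assumed sizes the saving comes entirely from storing the larger factor once per basis instead of once per expert; the per-expert cost is A_i plus a handful of mixing coefficients.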
Unearthing InSights into Mars: Unsupervised Source Separation with Limited Data
Source separation involves the ill-posed problem of retrieving a set of source signals that have been observed through a mixing operator. Solving this problem requires prior knowledge, which is commonly incorporated by imposing regularity conditions on the source signals, or implicitly learned through supervised or unsupervised methods from existing data. While data-driven methods have shown great promise in source separation, they often require large amounts of data, which rarely exists in planetary space missions. To address this challenge, we propose an unsupervised source separation scheme for domains with limited data access that involves solving an optimization problem in the wavelet scattering covariance representation space, an interpretable, low-dimensional representation of stationary processes. We present a real-data example in which we remove transient, thermally-induced microtilts (known as glitches) from data recorded by a seismometer during NASA's InSight mission on Mars. Thanks to the wavelet scattering covariances' ability to capture non-Gaussian properties of stochastic processes, we are able to separate glitches using only a few glitch-free data snippets.
LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning
Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the hypothesis space. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (iDCT) with selective locations of learnable components. We begin with a comprehensive theoretical comparison between frequency-domain and low-rank decompositions for fine-tuning pre-trained large models. Our analysis reveals that frequency-domain decomposition with carefully selected frequency components can surpass the expressivity of traditional low-rank-based methods. Furthermore, we demonstrate that iDCT offers a more efficient implementation compared to inverse Discrete Fourier Transform (iDFT), allowing for better selection and tuning of frequency components while maintaining equivalent expressivity to the optimal iDFT-based adaptation. By employing finite-difference approximation to estimate gradients for discrete locations of learnable coefficients on the DCT spectrum, LoCA dynamically selects the most informative frequency components during training. Experiments on diverse language and vision fine-tuning tasks demonstrate that LoCA offers enhanced parameter efficiency while maintaining computational feasibility comparable to low-rank-based methods.
Quantum algorithm for solving linear systems of equations
Solving linear systems of equations is a common problem that arises both on its own and as a subroutine in more complex problems: given a matrix A and a vector b, find a vector x such that Ax=b. We consider the case where one doesn't need to know the solution x itself, but rather an approximation of the expectation value of some operator associated with x, e.g., x'Mx for some matrix M. In this case, when A is sparse, N by N and has condition number kappa, classical algorithms can find x and estimate x'Mx in O(N sqrt(kappa)) time. Here, we exhibit a quantum algorithm for this task that runs in poly(log N, kappa) time, an exponential improvement over the best classical algorithm.
Approximating the Top Eigenvector in Random Order Streams
When rows of an n × d matrix A are given in a stream, we study algorithms for approximating the top eigenvector of the matrix A^T A (equivalently, the top right singular vector of A). We consider worst case inputs A but assume that the rows are presented to the streaming algorithm in a uniformly random order. We show that when the gap parameter R = σ_1(A)^2/σ_2(A)^2 = Ω(1), then there is a randomized algorithm that uses O(h · d · polylog(d)) bits of space and outputs a unit vector v that has a correlation 1 - O(1/R) with the top eigenvector v_1. Here h denotes the number of heavy rows in the matrix, defined as the rows with Euclidean norm at least ‖A‖_F/d · polylog(d). We also provide a lower bound showing that any algorithm using O(hd/R) bits of space can obtain at most 1 - Ω(1/R^2) correlation with the top eigenvector. Thus, parameterizing the space complexity in terms of the number of heavy rows is necessary for high accuracy solutions. Our results improve upon the R = Ω(log n · log d) requirement in a recent work of Price and Xun (FOCS 2024). We note that the algorithm of Price and Xun works for arbitrary order streams whereas our algorithm requires a stronger assumption that the rows are presented in a uniformly random order. We additionally show that the gap requirements in their analysis can be brought down to R = Ω(log^2 d) for arbitrary order streams and R = Ω(log d) for random order streams. The requirement of R = Ω(log d) for random order streams is nearly tight for their analysis as we obtain a simple instance with R = Ω(log d/loglog d) for which their algorithm, with any fixed learning rate, cannot output a vector approximating the top eigenvector v_1.
Proper losses for discrete generative models
We initiate the study of proper losses for evaluating generative models in the discrete setting. Unlike traditional proper losses, we treat both the generative model and the target distribution as black-boxes, only assuming ability to draw i.i.d. samples. We define a loss to be black-box proper if the generative distribution that minimizes expected loss is equal to the target distribution. Using techniques from statistical estimation theory, we give a general construction and characterization of black-box proper losses: they must take a polynomial form, and the number of draws from the model and target distribution must exceed the degree of the polynomial. The characterization rules out a loss whose expectation is the cross-entropy between the target distribution and the model. By extending the construction to arbitrary sampling schemes such as Poisson sampling, however, we show that one can construct such a loss.
Spectral Adapter: Fine-Tuning in Spectral Space
Recent developments in Parameter-Efficient Fine-Tuning (PEFT) methods for pretrained deep neural networks have captured widespread interest. In this work, we study the enhancement of current PEFT methods by incorporating the spectral information of pretrained weight matrices into the fine-tuning procedure. We investigate two spectral adaptation mechanisms, namely additive tuning and orthogonal rotation of the top singular vectors, both of which are carried out by first computing the Singular Value Decomposition (SVD) of the pretrained weights and then fine-tuning the top spectral space. We provide a theoretical analysis of spectral fine-tuning and show that our approach improves the rank capacity of low-rank adapters given a fixed trainable parameter budget. We show through extensive experiments that the proposed fine-tuning model enables better parameter efficiency and tuning performance, and also benefits multi-adapter fusion. The code will be open-sourced for reproducibility.
Solving High-Dimensional PDEs with Latent Spectral Models
Deep models have achieved impressive progress in solving partial differential equations (PDEs). A burgeoning paradigm is learning neural operators to approximate the input-output mappings of PDEs. While previous deep models have explored the multiscale architectures and various operator designs, they are limited to learning the operators as a whole in the coordinate space. In real physical science problems, PDEs are complex coupled equations with numerical solvers relying on discretization into high-dimensional coordinate space, which cannot be precisely approximated by a single operator nor efficiently learned due to the curse of dimensionality. We present Latent Spectral Models (LSM) toward an efficient and precise solver for high-dimensional PDEs. Going beyond the coordinate space, LSM enables an attention-based hierarchical projection network to reduce the high-dimensional data into a compact latent space in linear time. Inspired by classical spectral methods in numerical analysis, we design a neural spectral block to solve PDEs in the latent space that approximates complex input-output mappings via learning multiple basis operators, enjoying nice theoretical guarantees for convergence and approximation. Experimentally, LSM achieves consistent state-of-the-art and yields a relative gain of 11.5% averaged on seven benchmarks covering both solid and fluid physics. Code is available at https://github.com/thuml/Latent-Spectral-Models.
Nuclear Norm Regularization for Deep Learning
Penalizing the nuclear norm of a function's Jacobian encourages it to locally behave like a low-rank linear map. Such functions vary locally along only a handful of directions, making the Jacobian nuclear norm a natural regularizer for machine learning problems. However, this regularizer is intractable for high-dimensional problems, as it requires computing a large Jacobian matrix and taking its singular value decomposition. We show how to efficiently penalize the Jacobian nuclear norm using techniques tailor-made for deep learning. We prove that for functions parametrized as compositions f = g ∘ h, one may equivalently penalize the average squared Frobenius norm of Jg and Jh. We then propose a denoising-style approximation that avoids the Jacobian computations altogether. Our method is simple, efficient, and accurate, enabling Jacobian nuclear norm regularization to scale to high-dimensional deep learning problems. We complement our theory with an empirical study of our regularizer's performance and investigate applications to denoising and representation learning.
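The equivalence stated above for compositions has a well-known counterpart for plain matrices: the nuclear norm of A equals the minimum of (‖P‖_F^2 + ‖Q‖_F^2)/2 over all factorizations A = PQ. The NumPy sketch below checks this numerically on a random matrix; it illustrates only that matrix-level identity, not the paper's Jacobian regularizer or its denoising-style approximation, and the names `P`, `Q`, and `c` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Nuclear norm: the sum of the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
nuclear_norm = s.sum()

# Balanced factorization A = P @ Q with P = U sqrt(S) and Q = sqrt(S) V^T.
P = U * np.sqrt(s)             # scales column j of U by sqrt(s[j])
Q = np.sqrt(s)[:, None] * Vt   # scales row j of V^T by sqrt(s[j])
balanced_cost = 0.5 * (np.linalg.norm(P, "fro") ** 2 + np.linalg.norm(Q, "fro") ** 2)

# An unbalanced factorization of the same matrix costs strictly more.
c = 3.0
rescaled_cost = 0.5 * (np.linalg.norm(c * P, "fro") ** 2 + np.linalg.norm(Q / c, "fro") ** 2)

print(nuclear_norm, balanced_cost, rescaled_cost)
```

The balanced split attains the nuclear norm exactly, while rescaling one factor up and the other down keeps the product unchanged but raises the Frobenius cost.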
Enabling Efficient Equivariant Operations in the Fourier Basis via Gaunt Tensor Products
Developing equivariant neural networks for the E(3) group plays an important role in modeling 3D data across real-world applications. Enforcing this equivariance primarily involves the tensor products of irreducible representations (irreps). However, the computational complexity of such operations increases significantly as higher-order tensors are used. In this work, we propose a systematic approach to substantially accelerate the computation of the tensor products of irreps. We mathematically connect the commonly used Clebsch-Gordan coefficients to the Gaunt coefficients, which are integrals of products of three spherical harmonics. Through Gaunt coefficients, the tensor product of irreps becomes equivalent to the multiplication between spherical functions represented by spherical harmonics. This perspective further allows us to change the basis for the equivariant operations from spherical harmonics to a 2D Fourier basis. Consequently, the multiplication between spherical functions represented by a 2D Fourier basis can be efficiently computed via the convolution theorem and Fast Fourier Transforms. This transformation reduces the complexity of full tensor products of irreps from O(L^6) to O(L^3), where L is the max degree of irreps. Leveraging this approach, we introduce the Gaunt Tensor Product, which serves as a new method to construct efficient equivariant operations across different model architectures. Our experiments on the Open Catalyst Project and 3BPA datasets demonstrate both the increased efficiency and improved performance of our approach.
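The step that makes the Fourier basis attractive is the convolution theorem: multiplying two functions expressed in a Fourier basis amounts to convolving their coefficient vectors, which FFTs compute quickly. The sketch below shows a one-dimensional analogue in NumPy, assuming functions of the form f(t) = Σ_k a_k e^{ikt}. It is not the paper's 2D construction on spherical functions, and the helper names are hypothetical.

```python
import numpy as np


def multiply_coeffs_fft(a, b):
    """Coefficients of the product f*g, computed with zero-padded FFTs."""
    n = len(a) + len(b) - 1
    return np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n))


def evaluate(coeffs, t):
    """Evaluate sum_k coeffs[k] * exp(i k t) at a scalar t."""
    k = np.arange(len(coeffs))
    return np.sum(coeffs * np.exp(1j * k * t))


rng = np.random.default_rng(0)
a = rng.standard_normal(8) + 1j * rng.standard_normal(8)
b = rng.standard_normal(5) + 1j * rng.standard_normal(5)

direct = np.convolve(a, b)         # O(n^2) coefficient convolution
fast = multiply_coeffs_fft(a, b)   # O(n log n) via the convolution theorem

t = 0.7
pointwise_product = evaluate(a, t) * evaluate(b, t)
print(np.allclose(direct, fast), np.isclose(evaluate(fast, t), pointwise_product))
```

Both routes give the same coefficients, and evaluating the resulting series at a sample point matches the pointwise product of the two original functions.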
Predictability-Aware Compression and Decompression Framework for Multichannel Time Series Data
Real-world multichannel time series prediction faces growing demands for efficiency across edge and cloud environments, making channel compression a timely and essential problem. Motivated by the success of Multiple-Input Multiple-Output (MIMO) methods, we propose a predictability-aware compression-decompression framework to reduce runtime, lower communication cost, and maintain prediction accuracy across diverse predictors. The core idea involves using a circular periodicity key matrix with orthogonality to capture underlying time series predictability during compression and to mitigate reconstruction errors during decompression by relaxing oversimplified data assumptions. Theoretical and empirical analyses show that the proposed framework is both time-efficient and scalable under a large number of channels. Extensive experiments on six datasets across various predictors demonstrate that the proposed method achieves superior overall performance by jointly considering prediction accuracy and runtime, while maintaining strong compatibility with diverse predictors.
Faster Algorithms for Structured Matrix Multiplication via Flip Graph Search
We give explicit low-rank bilinear non-commutative schemes for multiplying structured n × n matrices with 2 ≤ n ≤ 5, which serve as building blocks for recursive algorithms with improved multiplicative factors in asymptotic complexity. Our schemes are discovered over F_2 or F_3 and lifted to Z or Q. Using a flip graph search over tensor decompositions, we derive schemes for general, upper-triangular, lower-triangular, symmetric, and skew-symmetric inputs, as well as products of a structured matrix with its transpose. In particular, we obtain 4 × 4 rank-34 schemes: (i) multiplying a general matrix by its transpose using 10 recursive calls, improving the factor from 26/41 (0.634) to 8/13 (0.615); and (ii) multiplying an upper-triangular matrix by a general matrix using 12 recursive calls, improving the factor from 8/13 (0.615) to 22/37 (0.595). Additionally, using F_3 flip graphs, we discover schemes over Q that fundamentally require the inverse of 2, including a 2 × 2 symmetric-symmetric multiplication of rank 5 and a 3 × 3 skew-symmetric-general multiplication of rank 14 (improving upon AlphaTensor's 15).
Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes
The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training. Moreover, they enable the posterior distribution of Bayesian Neural Networks to be tractable across these various initialization schemes.
SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values
Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, mitigate this issue by adjusting only a small subset of parameters. Nevertheless, these methods typically employ random initialization for low-rank matrices, which can lead to inefficiencies in gradient descent and diminished generalizability due to suboptimal starting points. To address these limitations, we propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices using critical singular values as trainable parameters. Specifically, SVFit performs SVD on the pre-trained weight matrix to obtain the best rank-r approximation matrix, emphasizing the most critical singular values that capture over 99% of the matrix's information. These top-r singular values are then used as trainable parameters to scale the fundamental subspaces of the matrix, facilitating rapid domain adaptation. Extensive experiments across various pre-trained models in natural language understanding, text-to-image generation, and image classification tasks reveal that SVFit outperforms LoRA while requiring 16 times fewer trainable parameters.
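The initialization described above, taking the best rank-r approximation from an SVD and treating only the top-r singular values as trainable, can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the matrix is random, the sizes and the function name `adapted_weight` are hypothetical, and the abstract does not say how the discarded part of the spectrum is handled, so the sketch simply drops it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # stand-in for a pre-trained weight matrix
r = 8                              # assumed rank

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# Best rank-r approximation in Frobenius norm (Eckart-Young theorem).
W_r = (U_r * s_r) @ Vt_r
approx_error = np.linalg.norm(W - W_r, "fro")
tail_energy = np.sqrt(np.sum(s[r:] ** 2))  # the error equals this quantity


def adapted_weight(scales):
    """Rebuild the weight from frozen singular vectors and r trainable scales."""
    return (U_r * scales) @ Vt_r


num_trainable = s_r.size  # only r scalars would be updated during fine-tuning
print(approx_error, tail_energy, num_trainable)
```

Starting from `scales = s_r` reproduces the rank-r approximation exactly, so training begins at the truncated pre-trained weights rather than at a random point.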
Bilinear Subspace Variational Bayesian Inference for Joint Scattering Environment Sensing and Data Recovery in ISAC Systems
This paper considers a joint scattering environment sensing and data recovery problem in an uplink integrated sensing and communication (ISAC) system. To facilitate joint scatterer localization and multi-user (MU) channel estimation, we introduce a three-dimensional (3D) location-domain sparse channel model to capture the joint sparsity of the MU channel (i.e., different user channels share partially overlapped scatterers). Then the joint problem is formulated as a bilinear structured sparse recovery problem with a dynamic position grid and imperfect parameters (such as time offset and user position errors). We propose an expectation maximization based turbo bilinear subspace variational Bayesian inference (EM-Turbo-BiSVBI) algorithm to solve the problem effectively, where the E-step performs Bayesian estimation of the location-domain sparse MU channel by exploiting the joint sparsity, and the M-step refines the dynamic position grid and learns the imperfect factors via gradient update. Two methods are introduced to greatly reduce the complexity with almost no sacrifice in performance and convergence speed: 1) a subspace constrained bilinear variational Bayesian inference (VBI) method is proposed to avoid any high-dimensional matrix inverse; 2) the multiple signal classification (MUSIC) and subspace constrained VBI methods are combined to obtain a coarse estimation result to reduce the search range. Simulations verify the advantages of the proposed scheme over baseline schemes.
Efficient Online Processing with Deep Neural Networks
The capabilities and adoption of deep neural networks (DNNs) grow at an exhilarating pace: Vision models accurately classify human actions in videos and identify cancerous tissue in medical scans as precisely as human experts; large language models answer wide-ranging questions, generate code, and write prose, becoming the topic of everyday dinner-table conversations. Even though their uses are exhilarating, the continually increasing model sizes and computational complexities have a dark side. The economic cost and negative environmental externalities of training and serving models are in evident disharmony with financial viability and climate action goals. Instead of pursuing yet another increase in predictive performance, this dissertation is dedicated to the improvement of neural network efficiency. Specifically, a core contribution addresses the efficiency aspects during online inference. Here, the concept of Continual Inference Networks (CINs) is proposed and explored across four publications. CINs extend prior state-of-the-art methods developed for offline processing of spatio-temporal data and reuse their pre-trained weights, improving their online processing efficiency by an order of magnitude. These advances are attained through a bottom-up computational reorganization and judicious architectural modifications. The benefit to online inference is demonstrated by reformulating several widely used network architectures into CINs, including 3D CNNs, ST-GCNs, and Transformer Encoders. An orthogonal contribution tackles the concurrent adaptation and computational acceleration of a large source model into multiple lightweight derived models. Drawing on fusible adapter networks and structured pruning, Structured Pruning Adapters achieve superior predictive accuracy under aggressive pruning using significantly fewer learned weights compared to fine-tuning with pruning.
CURing Large Models: Compression via CUR Decomposition
Large deep learning models have achieved remarkable success but are resource-intensive, posing challenges such as memory usage. We introduce CURing, a novel model compression method based on CUR matrix decomposition, which approximates weight matrices as the product of selected columns (C) and rows (R), and a small linking matrix (U). We apply this decomposition to weights chosen based on the combined influence of their magnitudes and activations. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
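A CUR decomposition as described above approximates a matrix by selected columns C, selected rows R, and a small linking matrix U. The NumPy sketch below uses the standard choice U = C⁺ W R⁺ (pseudo-inverses) on a synthetic low-rank matrix. The row and column indices are fixed by hand purely for illustration; the paper's selection rule based on magnitudes and activations is not reproduced, and `cur_decompose` is a hypothetical helper name.

```python
import numpy as np


def cur_decompose(W, col_idx, row_idx):
    """Return C, U, R with W ~= C @ U @ R, using U = pinv(C) @ W @ pinv(R)."""
    C = W[:, col_idx]   # selected columns of W
    R = W[row_idx, :]   # selected rows of W
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)
    return C, U, R


rng = np.random.default_rng(0)
k = 3
# Synthetic 20 x 15 matrix of exact rank k (an assumption for this demo).
W = rng.standard_normal((20, k)) @ rng.standard_normal((k, 15))

C, U, R = cur_decompose(W, col_idx=[0, 1, 2], row_idx=[0, 1, 2])
W_hat = C @ U @ R

stored = C.size + U.size + R.size  # 60 + 9 + 45 values instead of 300
print(np.allclose(W_hat, W), stored, W.size)
```

Because the synthetic matrix has exact rank 3 and the chosen columns and rows span its column and row spaces, the reconstruction is exact here; real weight matrices are only approximately low rank, so some error is expected in practice.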
DecompX: Explaining Transformers Decisions by Propagating Token Decomposition
An emerging solution for explaining Transformer-based models is to use vector-based analysis on how the representations are formed. However, providing a faithful vector-based explanation for a multi-layer model could be challenging in three aspects: (1) Incorporating all components into the analysis, (2) Aggregating the layer dynamics to determine the information flow and mixture throughout the entire model, and (3) Identifying the connection between the vector-based analysis and the model's predictions. In this paper, we present DecompX to tackle these challenges. DecompX is based on the construction of decomposed token representations and their successive propagation throughout the model without mixing them in between layers. Additionally, our proposal provides multiple advantages over existing solutions for its inclusion of all encoder components (especially nonlinear feed-forward networks) and the classification head. The former allows acquiring precise vectors while the latter transforms the decomposition into meaningful prediction-based values, eliminating the need for norm- or summation-based vector aggregation. According to the standard faithfulness evaluations, DecompX consistently outperforms existing gradient-based and vector-based approaches on various datasets. Our code is available at https://github.com/mohsenfayyaz/DecompX.
