DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Accurately computing influence functions involves solving inverse Hessian problems, a challenging task as the parameter count increases, such as with large language or diffusion models. A recent paper proposes a closed-form approximation of the inverse Gauss-Newton matrix as a solution to this scalability issue.

The Influence Function (IF), originally a concept from robust statistics [Ham05R], has gained popularity in machine learning as a way to quantify the impact of individual training data points on a model’s output. Popularized in machine learning by [Koh17U], the IF is used as a substitute for leave-one-out (LOO) retraining to reduce computational cost. This substitution is typically justified by the fact that the IF is a first-order approximation of the LOO effect; however, [Bae22I] show that for non-convex problems the IF is a better approximation of a different object, the Proximal Bregman Response Function.

To estimate the influence of a training point $z=(x, y)$ on a test point $z_{t} = (x_{t}, y_{t})$, consider a model $f_{\theta}$ trained on data points $\{z_i=(x_i, y_i)\}_{i=1}^n$. The following expression measures the influence of $z$ on $z_t$: \begin{equation*} \mathcal{I}(z_{t}, z) = \nabla_{\theta} \ell(z_{t}, \theta)^T(H(\theta) + \lambda I_d)^{-1} \nabla_{\theta} \ell(z, \theta) \end{equation*} with $\nabla_{\theta} \ell(z, \theta)=\nabla_{\theta} \ell(y, f_{\theta}(x))$ and where \begin{equation*} H(\theta)=n^{-1} \sum_{i=1}^n\nabla^2_{\theta} \ell(y_i, f_{\theta}(x_i)) =n^{-1} \sum_{i=1}^n\nabla^2_{\theta} \ell_i \end{equation*} is the Hessian of the model’s loss over the training data set with respect to the model’s parameters. For a comprehensive derivation, see section $2$ of [Koh17U]. Our series description Data valuation places this topic within the broader context of data valuation.
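
To make the formula concrete, here is a minimal numpy sketch that computes exact influence values for a small logistic-regression model, for which gradient and Hessian have closed forms. The data, the parameters `theta`, and the damping `lam` are illustrative assumptions, not part of the paper.

```python
# Minimal sketch: exact influence values for a small logistic regression,
# where gradient and Hessian of the cross-entropy loss have closed forms.
# All data and parameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)                      # stands in for fitted parameters
y = (X @ theta + 0.1 * rng.normal(size=n) > 0).astype(float)

def grad_loss(x, y_, theta):
    # Per-sample gradient: (sigmoid(x . theta) - y) * x
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return (p - y_) * x

# Hessian of the mean loss: n^{-1} sum_i p_i (1 - p_i) x_i x_i^T
p = 1.0 / (1.0 + np.exp(-X @ theta))
H = (X.T * (p * (1.0 - p))) @ X / n

# Influence of training point z_0 on an (arbitrarily chosen) test point z_1
g_test, g_train = grad_loss(X[1], y[1], theta), grad_loss(X[0], y[0], theta)
influence = g_test @ np.linalg.solve(H + lam * np.eye(d), g_train)
print(influence)
```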

The most computationally demanding part of this expression is solving the linear system $(H(\theta) + \lambda I_d)^{-1} \nabla_{\theta} \ell(z, \theta)$. Directly inverting the matrix is usually infeasible, so iterative solvers are used instead. For large-scale models, however, even these iterative solvers become a bottleneck. To address this issue, the authors of [Kwo23D] propose a computational scheme that yields an explicit formula, thereby circumventing the need for expensive iterative procedures.

Approximate Hessian inversion

To describe the proposed approximation method, assume that the model $f_{\theta}$ is structured as a composition of layers: $$ f_{\theta} = f_{\theta_L} \circ \dots \circ f_{\theta_1}, $$ i.e. it consists of $L$ layers, and $\theta_l \in \mathbb{R}^{d_l}, l \in [L]$, denotes the parameters of the $l$-th layer. The approximation method comprises the following steps:

  1. Replace the Hessian $H(\theta)$ with the Gauss-Newton matrix $G(\theta)$:$$ G(\theta)=n^{-1} \sum_{i=1}^n \nabla_{\theta}\ell_i\nabla_{\theta}\ell_i^T $$ which results in $$ \mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T (G(\theta) + \lambda I_d)^{-1} \nabla_{\theta} \ell(z, \theta) $$
  2. Simplify the problem by breaking it down into a block-diagonal structure, where each block $G_l(\theta)$ corresponds to the $l$-th layer:$$ G_{l}(\theta) = n^{-1} \sum_{i=1}^n \nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T} + \lambda_l I_{d_l}, $$ which leads to $$ \mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T \operatorname{diag}(G_1(\theta)^{-1}, \dots, G_L(\theta)^{-1}) \nabla_{\theta} \ell(z, \theta) $$
  3. Swap the order of averaging and inversion: replace $G_l(\theta)^{-1}$, the inverse of the arithmetic mean of rank-$1$ updates, with $R_l(\theta)$, the mean of their inverses (the inverse of their harmonic mean):\begin{align*} G_l(\theta)^{-1} &= \left( n^{-1} \sum_{i=1}^n \nabla_{\theta_l} \ell(z_i, \theta) \nabla_{\theta_l} \ell(z_i, \theta)^{T} + \lambda_l I_{d_l}\right)^{-1} \\ R_{l}(\theta)&= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l} \ell(z_i, \theta) \nabla_{\theta_l} \ell(z_i, \theta)^{T} + \lambda_l I_{d_l} \right)^{-1} \end{align*}
  4. Use the Sherman–Morrison formula to get an explicit representation of the inverses in the definition of $R_l(\theta):$

    \begin{align*} R_l(\theta) &= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T} + \lambda_l I_{d_l}\right)^{-1} \\ &= n^{-1} \sum_{i=1}^n \lambda_l^{-1} \underbrace{\left(I_{d_l} - \frac{\nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T}}{\lambda_l + \|\nabla_{\theta_l} \ell_i\|_2^2}\right)} _{=\mathrel{\mathop:} U_{l,i}}, \end{align*} which means application of $R_l(\theta)$ boils down to computing $n$ rank-$1$ updates.
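
As a quick sanity check of step 4, the following numpy snippet verifies the Sherman–Morrison identity for a single rank-$1$ update $\lambda_l I_{d_l} + g g^T$; the gradient $g$ and damping value are made up for illustration.

```python
# Numerical check of the Sherman-Morrison identity used in step 4:
# (lam * I + g g^T)^{-1} = lam^{-1} * (I - g g^T / (lam + ||g||^2))
import numpy as np

rng = np.random.default_rng(0)
d_l, lam = 8, 0.05
g = rng.normal(size=d_l)                 # stands in for a per-sample layer gradient

direct = np.linalg.inv(lam * np.eye(d_l) + np.outer(g, g))
sherman_morrison = (np.eye(d_l) - np.outer(g, g) / (lam + g @ g)) / lam
print(np.allclose(direct, sherman_morrison))  # True
```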

The final formula is expressed as follows: \begin{equation*} \mathcal{I}_{\text{DataInf}}(z_{t}, z) = - \sum_{l=1}^L\nabla_{\theta_l} \ell(z_{t}, \theta)^T \left(\frac{1}{n \lambda_l} \sum_{i=1}^n U_{l,i} \nabla_{\theta_l}\ell(z, \theta)\right) \end{equation*} with \begin{equation*} U_{l,i} = I_{d_l} - \frac{\nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^T}{\lambda_l + \|\nabla_{\theta_l} \ell_i\|_2^2}. \end{equation*}
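
Below is a minimal numpy sketch of this formula for a single layer $l$; the per-sample layer gradients are random placeholders, and the total influence is the sum of such terms over all layers. This is an illustration of the formula under these assumptions, not the authors' implementation.

```python
# Sketch of the DataInf formula for one layer l, with synthetic layer gradients.
import numpy as np

rng = np.random.default_rng(0)
n, d_l, lam_l = 200, 16, 0.1
grads = rng.normal(size=(n, d_l))        # rows: grad_{theta_l} loss(z_i, theta)
g_test = rng.normal(size=d_l)            # grad_{theta_l} loss(z_t, theta)
g_train = grads[0]                       # influence of z = z_0 on z_t

# U_{l,i} v = v - g_i (g_i^T v) / (lam_l + ||g_i||^2), applied to v = g_train
sq_norms = np.einsum("ij,ij->i", grads, grads)
coeffs = (grads @ g_train) / (lam_l + sq_norms)
applied = g_train - coeffs[:, None] * grads       # row i holds U_{l,i} g_train

# Contribution of layer l to I_DataInf(z_t, z); summing over layers gives the total
layer_influence = -g_test @ applied.mean(axis=0) / lam_l
print(layer_influence)
```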

This can be calculated for each layer separately. The total computation requires $\mathcal{O}(\sum_{l=1}^Lnd_l)$ operations and $\mathcal{O}(\max_{l} d_l)$ memory. While steps $1$ and $2$ are standard techniques for reducing computational complexity, steps $3$ and $4$ are crucial in deriving an explicit formula.

Swapping the order of averaging and inversion in step $3$ may introduce a significant error, so an upper bound on this error is needed. Under a boundedness condition on the gradients, the authors show that the error grows at most quadratically with the number of parameters per layer, i.e. it is $\mathcal{O}(d_l^2)$.

While the proposed approach is applicable to generic neural networks, the derived bound indicates that the induced error becomes more tolerable as the number of parameters per layer decreases.

Experiments for LoRA-tuned LLMs

The Low-Rank Adaptation technique (LoRA, [Hu21L]) reduces the number of trainable parameters during the fine-tuning of large models. Instead of updating the full parameter matrix, the pre-trained matrix is frozen and a trainable low-rank update is added to it. For instance, if the $l$-th layer of a model contains a weight matrix of size $m\times m$, the original number of trainable parameters of this layer would be $d_l=m^2$. With a LoRA update of rank $r$, the number of trainable parameters drops to the order of $rm$ ($2rm$ for the two low-rank factors), where $r$ is significantly smaller than $m$.
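
To illustrate the reduction, here is a tiny numpy sketch of a LoRA-style layer with made-up sizes; the factor shapes and the zero initialization of one factor follow common LoRA practice, and the numbers are purely illustrative.

```python
# Illustrative LoRA-style layer: frozen m x m weight plus a rank-r update B @ A.
import numpy as np

m, r = 64, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(m, m))          # frozen pre-trained weight (m^2 params, not trained)
A = rng.normal(size=(r, m))          # trainable low-rank factor
B = np.zeros((m, r))                 # trainable low-rank factor, zero-initialized

x = rng.normal(size=m)
h = W @ x + B @ (A @ x)              # forward pass: frozen path + low-rank update

print(W.size, A.size + B.size)       # 4096 full params vs. 512 = 2*r*m trainable
```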

With this in mind, the authors test the proposed approximation method on LoRA-tuned large language models (LLMs).

  • Approximation error analysis: The authors evaluate the proposed approximation method against two other approaches:

    • LiSSA: Solve the equations $G_l(\theta)v=\nabla_{\theta_l}\ell(z, \theta)$ iteratively, using the approach of [Aga17S].
    • Hessian-free: Skip solving the equations altogether and set the influence values to the gradient inner product $\nabla_{\theta}\ell(z_t, \theta)^T\nabla_{\theta} \ell(z, \theta)$, see also [Pru20E].

    To mimic a noisy data scenario, labels for $20$% of randomly selected data points are flipped. Figure 1 depicts the correlation with the exact solutions obtained by directly solving equations involving $G_l(\theta)$.

    Figure 1 of [Kwo23D]: Correlation with the exact block-inversion for different datasets and different ranks used in LoRA fine-tuning of the RoBERTa model.

  • Mislabeled data detection: For the noisy datasets used in the approximation error analysis, the authors conduct mislabeled data detection experiments. As a metric they use area under the curve (AUC) values, comparing the binary annotations for randomly flipped labels ($0$ for clean, $1$ for flipped) against the computed influence values; a minimal sketch of this evaluation follows Figure 2 below. Figure 2 shows the results for the various methods, where Exact refers to solving the equations involving $G_l(\theta)$ with a direct solver.

    Figure 2 of [Kwo23D]: Area under the curve (AUC) values for identifying mislabeled data points, for a LoRA (rank$=4$) fine-tuned RoBERTa model.
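
The sketch below mimics this evaluation with synthetic scores, assuming AUC refers to the ROC AUC and using scikit-learn's `roc_auc_score`; in practice the scores would be the influence values computed by the respective method.

```python
# Sketch of the mislabeled-data detection metric: ROC AUC of (placeholder)
# influence scores against binary flip annotations.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
is_flipped = (rng.random(n) < 0.2).astype(int)   # 20% of labels flipped
scores = rng.normal(size=n) + 2.0 * is_flipped   # placeholder for influence values

print(roc_auc_score(is_flipped, scores))
```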

In addition, the authors apply the proposed method to text-to-image generation tasks with diffusion models. For more details, see section $4.3$ of [Kwo23D].

Discussion

The proposed method is notably efficient in computational terms. This efficiency primarily stems from the modified inversion in step 3, which can be viewed as a very coarse substitute for the inverse of the Gauss-Newton block matrix. Even though the correlation with the exact inverse computation is debatable, the performance in downstream tasks such as the detection of mislabeled or most influential data points is convincing. It would be very interesting to see a comparison with a recently proposed approximation method based on Kronecker product factorizations (see Studying Large Language Model Generalization with Influence Functions for a discussion of that method).

We intend to integrate this method into pyDVL, the Python Data Valuation Library.
