DP-FTRL and matrix factorization mechanism (III)

January 1, 2024

\[\newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\Real}{\mathbb{R}} \newcommand{\cN}{\mathcal{N}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\ex}[1]{\mathbb{E}\left[ #1 \right] } \newcommand{\cL}{\mathcal{L}}\]

In this series of blog posts, I plan to write down some of my personal understanding (of course shaped by many wonderful papers) of DP-FTRL and matrix factorization mechanism.

As motivated at the end of the previous post, in this post, our goal is to give a general way to compute the total noise (error) in prefix sum for any factorization \(\bA = \bB\bC\). As a result, one can then optimize this error (objective) under privacy constraints to search for an even better mechanism than the tree-based algorithm (i.e., better pair \((\bB, \bC)\)). From this, we will see that one needs to be careful in choosing \(\bC\) and \(\bB\) to balance the (i) sensitivity (hence the variance for each noise, determined by \(\bC\)) and (ii) the total number of accumulated noise (determined by \(\bB\)). The key behind tree-based algorithm is that its pair of \((\bC, \bB)\) guarantees that both (i) and (ii) are on the log order, which is in sharp contrast to Simple I and II.

The calculation for the error in the matrix factorization mechanism is standard ( e.g.,[LMHMR15]). Here, we first slightly deviate from existing works by explicitly focusing on the task of prefix sum and aiming to bound the maximal error with high probability rather than expectation. Another difference is that for clarity, we explicitly consider the fact that the input data \(\bG\) is a \(T \times d\) matrix rather than \(T \times 1\) vector in standard previous works. Later, we will turn to the general way to bound the expected error.

To start with, we can easily see that the max error across \(T\) prefix outputs (here \(\bA = \bB \bC\) is the all-ones lower triangular matrix) is given by

\[\text{Max-Error}(\bB,\bC) := \max_{i \in [T]} \norm{\bB_{[i,:]} \bZ}_2,\]

where \(\bB_{[i,:]}\) is the \(i\)-th row of \(\bB\) and each element of \(\bZ\) is \(i.i.d\) sampled from \(\cN(0,\sigma^2)\), which is used to provide \((\epsilon,\delta)\)-DP. To this end, \(\sigma = \sigma_{\epsilon,\delta} \cdot \mathrm{sens}(\bC)\) (assume \(\ell_2\) norm of \(g_t\) is bounded by \(1\) again) with \(\sigma_{\epsilon,\delta} := O\left(\frac{\sqrt{\log(1/\delta)}}{\epsilon}\right)\) and \(\mathrm{sens}(\bC)\) is the \(\ell_2\)-sensitivity of \(\bC \bG\) (here it is Frobenious norm since \(\bC \bG\) is a matrix) ¹. Let us first compute the high-probability max-error for a general \(\sigma\) and then plug in the proper value of it for privacy. For each \(i \in [T]\), we have each element of the \(d\)-dimensional row vector \(\bB_{[i,:]} \bZ\) is distributed according to \(\cN(0, \norm{\bB_{[i,:]}}_2^2)\). Hence, by the standard concentration of Gaussian vector and union bound over all \(i\), we have with high probability

\(\text{Max-Error}(\bB,\bC) \le \max_{i \in [T]} O\left(\sigma \cdot \sqrt{d} \cdot \norm{\bB_{[i,:]}}_2\right).\tag{3.1}\)

It remains to determine the value of \(\sigma\), i.e., \(\mathrm{sens}(\bC)\). To this end, we can write \(\bC \bG\) as a linear combination of the columns in \(\bC\)

\[\bC \bG = \bC_{[:,1]} g_1^{\top} + \cdots +\bC_{[:,j]} g_j^{\top} + \cdots + \bC_{[:,T]} g_T^{\top},\]

where \(\bC_{[:,j]}\) is the \(j\)-th column of \(\bC\) and \(g_j^{\top} \in \Real^{1 \times d}\). Thus, considering an arbitrary neighboring sequence of \(\{g_j'\}_j\), the maximal Frobenious norm change is given by \(\max_{j \in [T]} \norm{\bC_{[:j]} (g_j^{\top} - g_j^{\prime \top})}_{F}\). By the fact that the Frobenious norm is sub-multiplicative, we have

\(\mathrm{sens}(\bC) \le \max_{j \in [T]} \norm{\bC_{[:j]}}_F \norm{g_j^{\top} - g_j^{\prime \top}}_{F} = \max_{j \in [T]} \norm{\bC_{[:j]}}_2 \norm{g_j^{\top} - g_j^{\prime \top}}_{2} \le 2 \max_{j \in [T]} \norm{\bC_{[:j]}}_2,\)

where in the last step we use the bounded norm of gradients. Combining the above with (3.1) and the fact \(\sigma = \sigma_{\epsilon,\delta} \cdot \mathrm{sens}(\bC)\) together, we conclude that for \((\epsilon,\delta)\)-DP under \((\bB,\bC)\) for the task of prefix sum, the max-error across all \(T\) noisy prefix sum is upper bounded by the following with high probability

\(\text{Max-Error}(\bB,\bC) \le \max_{i \in [T]} O\left(\sigma_{\epsilon,\delta} \cdot \max_{j \in [T]} \norm{\bC_{[:j]}}_2 \cdot \sqrt{d} \cdot \norm{\bB_{[i,:]}}_2\right).\tag{3.2}\)

Remark 3.1: From the above bound, one can now see that a good choice of \((\bB,\bC)\) such that \(\bA = \bB \bC\) is the one that minimizes the norm of columns in \(\bC\) and the norm of rows in \(\bB\), simultaneously. Moreover, for any factorization, one can simply plug the bound in (3.2) into (1.2) to obtain the performance of DP-FTRL with the corresponding mechanism instead of the tree-based algorithm.

We now turn to a general task \(\bA\) (instead of prefix sum task above) and discuss the standard optimization objective in the matrix factorization mechanism. In particular, the standard goal in the literature is to minimize the expected error given by

\(\cL_{\text{normalize}}(\bB,\bC) := \frac{1}{\sqrt{T}}\cdot \ex{\norm{\bB \bZ}_F},\tag{3.3}\)

where one can think of \(\ex{\norm{\bB \bZ}}_F\) as the total expected error in all \(T\) outputs and hence normalized by \(1/\sqrt{T}\) to have the same unit as the maximal expected error. In most existing works, the bound for (3.3) is used directly in the regret of DP-FTRL (i.e., (1.2)). However, in this case, one can at most get an expected regret and hence an expected population excess risk. Let us first put this aside and focus on how to bound the term in (3.3).

For \((\epsilon,\delta)\)-DP, the value of \(\sigma\) for \(\bZ\) is the same, i.e., \(\sigma \approx \sigma_{\epsilon,\delta} \cdot \mathrm{sens}(\bC) = \sigma_{\epsilon,\delta} \cdot \max_{j \in [T]} \norm{\bC_{[:j]}}_2\). To bound the error, we have

\[\ex{\norm{\bB \bZ}_F} \le \sqrt{\ex{\norm{\bB \bZ}_F^2}} = \sqrt{d \cdot \sigma^2 \cdot \norm{\bB}_F^2}.\]

With this, we have that under \((\epsilon,\delta)\)-DP

\[\label{eq:ex-error-private} \cL_{\text{normalize,DP}}(\bB,\bC) \le O\left(\frac{\sqrt{d}\sigma_{\epsilon,\delta}}{\sqrt{T}}\cdot \mathrm{sens}(\bC) \cdot \norm{\bB}_F \right).\]

Hence, in most works, the objective is to optimize the following objective

\[\label{eq:opt} \min_{\bB, \bC} \norm{\bB}_F^2 \text{ such that } \bB \bC = \bA \text{ and } \mathrm{sens}(\bC) = 1\]

This problem is well-studied and can be solved using numerical optimization algorithms (e.g.,[YYZH16], [MMHM18]).

Remark 3.2: One may wonder if we can get a high probability bound on \(\norm{\bB\bZ}_F\). It turns out that with some use of concentration inequality, one can get it done. One straightforward way is to apply the (high-dimensional) Hanson-Wright inequality (since \(\bZ\) is a matrix rather than vector), see Exercise 6.2.7 in [Ver18].

Summary. Let us summarize what we have covered so far in this series: In (I), we introduce DP-FTRL, tree-based algorithm and general DP-FTRL regret bound as a function of the maximal error in prefix sum (i.e., (1.2)). Then, in (II), we view existing private mechanisms for prefix sum as special cases of matrix factorization mechanism, which motivates us to find better pair \((\bB, \bC)\). Hence, in this post, we derive optimization problems over \((\bB, \bC)\) to minimize the total error under privacy guarantee. As a result, we can now substitute the best error into (1.2) to obtain better regret upper bound.

But, as hinted by Q2 in the end of (I), it is still unclear whether minimizing the maximal error in prefix sum leads to tight final actual regret guarantee. This leads us to the topic of the next post:)

THE END

Now, it’s time to take a break by appreciating the masterpiece of Monet.

Monet

Water Lilies

courtesy of https://www.wikiart.org/

One subtlety here is that gradients in \(\bG\) in model training is adaptive rather than fixed. However, as shown in [DMRSG22], for the Gaussian mechanism, it suffices to consider the non-adaptive one. ↩︎