Singular Value Decomposition (SVD)

SVD is one of those tools that shows up everywhere once you know to look for it-image compression, recommendation systems, noise reduction, natural language processing. It's arguably the most important factorization in all of linear algebra. This post will show you what it actually means, geometrically and practically, and walk through computing it by hand. I'm assuming you're comfortable with matrix multiplication, eigenvalues, and eigenvectors.

The Big Picture

Here's the core insight: every matrix-no matter its shape-represents some kind of transformation. Multiply a vector by a matrix, and the vector gets stretched, rotated, squished, or projected somewhere new.


SVD says something remarkable: any matrix transformation, however complicated it looks, is secretly just three simple operations stacked together:


  1. Rotate (align with the "natural axes" of the transformation)
  2. Scale (stretch or shrink along each axis)
  3. Rotate again (orient into the output space)

That's it. Rotation, scaling, rotation. Every matrix. The power of SVD is that it tells you exactly what those rotations and scalings are.

[Animation: a unit circle with perpendicular basis vectors v₁ and v₂, transformed by each factor of the SVD in turn.]

We begin with a unit circle and two perpendicular basis vectors, then apply each step of the SVD. The unit circle becomes an ellipse: the singular values are the lengths of the ellipse's axes, and the singular vectors are the directions of those axes. SVD extracts this hidden geometry from any matrix.
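If you'd rather see this numerically, here's a quick NumPy check (peeking ahead to `np.linalg.svd` and the factors defined in the next section, and using the same 2×2 matrix we'll decompose by hand later): the matrix sends each right singular vector to an ellipse axis of length σᵢ pointing along the corresponding left singular vector.

```python
import numpy as np

# The same 2x2 matrix used in the worked example below
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, S, Vt = np.linalg.svd(A)

# Each right singular vector v_i is mapped onto an ellipse axis:
# A @ v_i == sigma_i * u_i
for i in range(2):
    v_i = Vt[i]      # i-th right singular vector (row of V^T)
    u_i = U[:, i]    # i-th left singular vector (column of U)
    print(A @ v_i, "vs", S[i] * u_i)   # the two printed vectors match
```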

The Decomposition

For any $m \times n$ matrix $A$, we can write:

$$A = U \Sigma V^T$$

Three matrices, each with a specific job:


  • $V^T$ ($n \times n$) - Rotates the input to align with the transformation's natural axes. Its rows are called the right singular vectors.
  • $\Sigma$ ($m \times n$) - A diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$. These are the scaling factors-how much the matrix stretches along each axis.
  • $U$ ($m \times m$) - Rotates the scaled result into the output space. Its columns are the left singular vectors.

$$\underbrace{A}_{m \times n} = \underbrace{U}_{\substack{m \times m \\ \text{rotation}}} \times \underbrace{\Sigma}_{\substack{m \times n \\ \text{scaling}}} \times \underbrace{V^T}_{\substack{n \times n \\ \text{rotation}}}$$

Both $U$ and $V$ are orthogonal matrices-their columns are perpendicular unit vectors. They preserve lengths and angles (pure rotations, possibly with a reflection), so they introduce no distortion. All the stretching happens in $\Sigma$.
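Here's a minimal NumPy sketch of those claims on an arbitrary random matrix: the three factors reassemble the original, and $U$ and $V$ pass the orthogonality check.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))            # arbitrary 4x3 example

# full_matrices=True: U is 4x4, S holds the 3 singular values, Vt is 3x3
U, S, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal Sigma and check A = U @ Sigma @ V^T
Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, S)
print(np.allclose(A, U @ Sigma @ Vt))      # True

# Orthogonality: the columns of U and V are perpendicular unit vectors
print(np.allclose(U.T @ U, np.eye(4)))     # True
print(np.allclose(Vt @ Vt.T, np.eye(3)))   # True
```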

The Magic of Singular Values

Here's where it gets interesting. In most real-world matrices, the singular values drop off fast. The first few are large; the rest are tiny. This means a handful of components carry almost all the "information."


Think about it: if σ₃ is 100 times smaller than σ₁, the third component contributes almost nothing to the final result. We could throw it away and barely notice the difference.

[Interactive demo: choose how many singular values to keep. With $\sigma_1 = 10$, $\sigma_2 = 4$, $\sigma_3 = 1.5$, $\sigma_4 = 0.5$, $\sigma_5 = 0.1$, keeping the top $k = 3$ captures 99.8% of the information.]

Three components and we're capturing almost everything important.

This is the foundation of low-rank approximation. Keep only the top $k$ singular values, set the rest to zero, and you get the best possible rank-$k$ approximation of the original matrix (best in both the Frobenius and spectral norms). This isn't just "a good" approximation-it's mathematically optimal. That's the Eckart–Young–Mirsky theorem.
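Here's a sketch of a rank-$k$ approximation in NumPy (on a random matrix standing in for real data). It also checks the Eckart–Young–Mirsky error formula: the Frobenius error of the truncation equals the root-sum-square of the discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Truncated SVD: keep the top-k singular triplets of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k, :]   # scale columns of U by sigma_i

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 80))
A_k = rank_k_approx(A, k=10)

# Eckart-Young-Mirsky: error of the best rank-k approximation
S = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A_k, "fro"))       # these two numbers agree
print(np.sqrt(np.sum(S[10:] ** 2)))
```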

Application: Image Compression

A grayscale image is just a matrix of pixel values. A 1000×1000 image? That's a million numbers. But most of those pixels are redundant-neighboring pixels tend to be similar, patterns repeat.


SVD exploits this redundancy. Instead of storing all million values, we compute the SVD and keep only the top $k$ components. A rank-$k$ approximation needs just $k(m + n + 1)$ numbers instead of $mn$.


For a 1000×1000 image with k = 50: that's about 100,000 numbers instead of 1,000,000. A 10× reduction, often with barely visible quality loss.

[Demo: the same image reconstructed at k = original (100% of the data), k = 50 (50%), k = 20 (20%), k = 5 (5%), and k = 1 (1%).]

Each image shows what you'd get keeping only the top k singular values. The first few capture the overall structure; later ones add fine detail.
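In code, this kind of compression is only a few lines. A sketch, using a random array as a stand-in for real pixel data (swap in any 2-D grayscale image):

```python
import numpy as np

def compress(image, k):
    """Keep the top-k singular triplets of a 2-D grayscale image."""
    U, S, Vt = np.linalg.svd(image, full_matrices=False)
    approx = (U[:, :k] * S[:k]) @ Vt[:k, :]
    stored = k * (image.shape[0] + image.shape[1] + 1)   # k(m + n + 1) numbers
    return approx, stored

# Stand-in for a 1000x1000 grayscale image
image = np.random.default_rng(2).random((1000, 1000))
approx, stored = compress(image, k=50)
print(stored, "numbers instead of", image.size)          # 100050 vs 1000000
```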

Computing SVD

Here's the procedure; it's not too bad for small matrices, and a code sketch follows the steps:

  1. Compute $A^T A$ (this gives an $n \times n$ symmetric matrix)
  2. Find the eigenvalues and eigenvectors of $A^T A$. The eigenvectors become the columns of $V$.
  3. The singular values are the square roots: $\sigma_i = \sqrt{\lambda_i}$
  4. Compute each column of $U$: $\vec{u}_i = \frac{1}{\sigma_i} A \vec{v}_i$
  5. Extend $U$ to a full orthonormal basis if needed (when $m > \operatorname{rank}(A)$)
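Here's that procedure in NumPy-a teaching sketch that assumes full column rank and skips step 5 (real libraries use more numerically stable algorithms than forming $A^T A$):

```python
import numpy as np

def svd_by_hand(A):
    """SVD via the eigendecomposition of A^T A (assumes full column rank)."""
    # Steps 1-2: eigenvalues/eigenvectors of the symmetric matrix A^T A
    eigvals, V = np.linalg.eigh(A.T @ A)
    # eigh returns ascending order; flip so sigma_1 >= sigma_2 >= ...
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    # Step 3: singular values are square roots of the eigenvalues
    S = np.sqrt(np.clip(eigvals, 0.0, None))
    # Step 4: u_i = (1 / sigma_i) * A v_i, done via column-wise broadcasting
    U = (A @ V) / S
    return U, S, V.T

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
U, S, Vt = svd_by_hand(A)
print(S)                                     # [6.7082, 2.2360] = [3*sqrt(5), sqrt(5)]
print(np.allclose(U @ np.diag(S) @ Vt, A))   # True
```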

Worked Example

Let's work through a concrete example. Take:

$$A = \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix}$$

Step 1: Compute $A^T A$

$$A^T A = \begin{bmatrix} 3 & 4 \\ 0 & 5 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix} = \begin{bmatrix} 25 & 20 \\ 20 & 25 \end{bmatrix}$$

Notice this is symmetric (it always will be).

Step 2: Find eigenvalues

Solve $\det(A^T A - \lambda I) = 0$:

$$(25-\lambda)^2 - 400 = 0$$

$$25 - \lambda = \pm 20$$

So $\lambda_1 = 45$ and $\lambda_2 = 5$.

Step 3: Singular values

$$\sigma_1 = \sqrt{45} = 3\sqrt{5} \approx 6.71, \quad \sigma_2 = \sqrt{5} \approx 2.24$$

Step 4: Find $V$ (eigenvectors of $A^T A$)

For $\lambda_1 = 45$, solve $(A^T A - 45I)\vec{v} = 0$:

$$\begin{bmatrix} -20 & 20 \\ 20 & -20 \end{bmatrix} \vec{v}_1 = 0 \implies \vec{v}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

For $\lambda_2 = 5$:

$$\begin{bmatrix} 20 & 20 \\ 20 & 20 \end{bmatrix} \vec{v}_2 = 0 \implies \vec{v}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

Step 5: Find U

Use $\vec{u}_i = \frac{1}{\sigma_i} A \vec{v}_i$:

$$\vec{u}_1 = \frac{1}{3\sqrt{5}} \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix} \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1/\sqrt{10} \\ 3/\sqrt{10} \end{bmatrix}$$

$$\vec{u}_2 = \frac{1}{\sqrt{5}} \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix} \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3/\sqrt{10} \\ -1/\sqrt{10} \end{bmatrix}$$

The Result

Putting it together (approximately):

$$A = \underbrace{\begin{bmatrix} 0.32 & 0.95 \\ 0.95 & -0.32 \end{bmatrix}}_{U} \underbrace{\begin{bmatrix} 6.71 & 0 \\ 0 & 2.24 \end{bmatrix}}_{\Sigma} \underbrace{\begin{bmatrix} 0.71 & 0.71 \\ 0.71 & -0.71 \end{bmatrix}}_{V^T}$$

You can verify: multiply $U \Sigma V^T$ and you'll get back our original matrix A = [[3, 0], [4, 5]].
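A quick numerical check of the hand computation (note that an SVD routine may flip the sign of a $u_i$, $v_i$ pair-the decomposition is only unique up to those paired signs):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

s10 = np.sqrt(10)
U  = np.array([[1/s10,  3/s10],
               [3/s10, -1/s10]])
S  = np.diag([3*np.sqrt(5), np.sqrt(5)])
Vt = np.array([[1.0,  1.0],
               [1.0, -1.0]]) / np.sqrt(2)

print(np.allclose(U @ S @ Vt, A))   # True: the hand-computed factors rebuild A
U_np, S_np, Vt_np = np.linalg.svd(A)
print(S_np)                         # [6.7082, 2.2360], matching 3*sqrt(5) and sqrt(5)
```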

What This Tells Us

Look at the singular values: σ₁ ≈ 6.71 is exactly three times larger than σ₂ ≈ 2.24 (it's 3√5 versus √5). The matrix stretches space much more in one direction than the other.


How much does the first component dominate? We can measure "energy" as the sum of squared singular values:

$$\frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} = \frac{45}{50} = 90\%$$

The first singular value alone captures 90% of what this matrix "does." If we only needed a rough approximation, we could keep just that one component and drop the second entirely.
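Concretely, here's that one-component (rank-1) approximation $\sigma_1 \vec{u}_1 \vec{v}_1^T$ for our matrix, checked in NumPy:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
U, S, Vt = np.linalg.svd(A)

# Rank-1 approximation: sigma_1 * u_1 * v_1^T
A1 = S[0] * np.outer(U[:, 0], Vt[0])
print(A1)                      # [[1.5 1.5]
                               #  [4.5 4.5]] - the gross structure of A
print(np.linalg.norm(A - A1))  # 2.236... = sigma_2, the part we threw away
```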

Connection to PCA

If you've read my PCA post, you might notice this feels familiar. That's because PCA is secretly just SVD in disguise.


When you do PCA, you center your data matrix $X$ and find the eigenvalues and eigenvectors of the covariance matrix (which is $X^T X$ up to a $1/(n-1)$ factor). Those eigenvalues are, up to that factor, exactly the squared singular values of $X$, and the eigenvectors are exactly the right singular vectors.


So PCA = SVD of centered data. In practice, computing SVD directly is often more numerically stable than forming the covariance matrix explicitly.
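A small sketch of that equivalence on random data (the $1/(n-1)$ factor is the usual sample-covariance normalization):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
Xc = X - X.mean(axis=0)                  # center each column

# Route 1: eigenvalues of the sample covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Route 2: singular values of the centered data matrix
S = np.linalg.svd(Xc, compute_uv=False)

# Same numbers, up to the 1/(n-1) normalization
print(np.allclose(eigvals, S**2 / (len(Xc) - 1)))   # True
```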

The Takeaway

SVD reveals the hidden skeleton of any matrix. By decomposing into rotations and scalings, we see exactly what a matrix transformation does, and more importantly, which parts matter most.


The key ideas:


  • Any matrix = rotate × scale × rotate
  • Singular values measure importance of each component
  • Most real matrices have fast-decaying singular values
  • Keeping top-k gives the optimal rank-k approximation
  • PCA is just SVD on centered data

Whether you're compressing images, building recommender systems, or doing dimensionality reduction, SVD is probably involved somewhere. It's one of those tools that, once you understand it, you start seeing everywhere.