# Minimal Reproduction: k-means++

Paper: David Arthur and Sergei Vassilvitskii (2007), **k-means++: The Advantages of Careful Seeding**.

Source checked: Semantic Scholar / SODA 2007 metadata and search results.

## Goal

Reproduce the paper's core empirical/theoretical intuition: choosing initial centers with distance-squared (`D^2`) sampling produces a lower k-means objective than uniform random initialization.

This is a lightweight reproduction, not a re-run of the original paper's full experimental suite.

## Setup

- Language: Python 3
- Dependencies: none beyond Python standard library
- Dataset: synthetic 2D Gaussian mixture
- `n = 905`, `k = 8`
- Trials: 160 paired random seeds
- Optimizer after initialization: Lloyd iterations, up to 60 iterations

## Results

From `summary.json`:

| Metric | Random seeding | k-means++ seeding |
|---|---:|---:|
| Initial SSE mean | 12,310.08 | 4,303.35 |
| Final SSE mean | 3,123.94 | 2,190.16 |
| Final SSE median | 2,666.35 | 1,948.70 |

Additional result:

- Mean final objective reduction vs random: **29.89%**
- k-means++ wins paired trials: **77.5%**
- Mean final ratio `kmeans++ / random`: **0.7875**

## Interpretation

The reproduction supports the paper's central claim: careful `D^2` seeding gives substantially better initial cluster centers and, after Lloyd refinement, usually reaches a lower k-means objective than uniform random seeding.

## Files

- `reproduce_kmeanspp.py` — full dependency-free reproduction script
- `summary.json` — aggregate metrics
- `trials.csv` — per-trial raw results
- `kmeanspp_results.svg` — visual summary

Run again:

```bash
python3 reproductions/kmeanspp_arthur2007/reproduce_kmeanspp.py
```