Uniqcount: estimating unique elements with a tiny amount of memory

Edoardo 🇮🇹 Feb 19, 2026 7 min read
I built uniqcount after reading the Quanta piece and the revised paper. This post explains the algorithm, the Swift implementation choices, and what I could reproduce from the article’s claims.

Introduction #

The problem: given a stream of values, estimate how many distinct values appeared.

For small inputs, exact counting is easy: insert everything into a set and read count. For large streams, exact counting can use too much memory, so approximate counting is useful.

Problem #

Write the stream as:

$$A = \langle a_1, a_2, \dots, a_m \rangle$$

The target quantity is:

$$F_0(A) = |{a_1, a_2, \dots, a_m}|$$

So $F_0(A)$ is the number of distinct values seen at least once.

Estimator #

In this post I use the estimator from the paper; people call it CVM after the authors (Chakraborty, Vinodchandran, and Meel). It keeps:

  • a sample set $X$
  • a sampling probability $p$

Interpret $p$ as a coin bias:

  • if $p = 1$, keep every eligible value
  • if $p = 1/2$, keep on heads
  • if $p = 1/4$, keep with probability $25%$

For each new stream element $a_i$:

  1. remove $a_i$ from $X$ (if present)
  2. add $a_i$ back with probability $p$
  3. if $|X|$ reaches a threshold, keep each item in $X$ with probability $1/2$, then set $p \leftarrow p/2$

Final estimate:

$$\hat{F_0} = \frac{|X|}{p}$$

The remove-then-add step is important. It gives each distinct token one fresh coin flip at the current rate. For example, with cat, cat, dog and $p = 1/2$, the second cat replaces the first cat sample decision instead of accumulating extra weight.

A useful intuition is Quanta’s whiteboard example with capacity 100:

  • fill the board at $p = 1$
  • when full, flip coins and keep about half, then set $p = 1/2$
  • repeat each time the board fills

Their sample run ends with $k = 6$ (so $p = 1/2^6$) and $|X| = 61$:

$$\hat{F_0} = 61 \cdot 2^6 = 3904$$

The exact distinct count reported there is 3967, so the estimate is close.

Choosing the threshold #

The paper sets threshold from $\varepsilon$, $\delta$, and stream length $m$:

$$\text{threshold} = \left\lceil \frac{12}{\varepsilon^2} \log_2\left(\frac{8m}{\delta}\right) \right\rceil$$

  • $\varepsilon$: target relative error
  • $\delta$: allowed failure probability
  • $m$: stream length

Theorem 2 states that with this threshold, output is within $(1 \pm \varepsilon)$ of $F_0$ with probability at least $1 - \delta$.

For Hamlet in this post ($m = 36107$), $\varepsilon = 0.1$, $\delta = 0.05$ gives threshold 26955. That is much larger than the exact distinct count (4962), so this setting behaves like exact counting on this dataset.

Swift implementation #

I implemented this in Swift as a small CLI tool. The repo is n3d1117/uniqcount.

High-level codebase layout:

  • Core.swift: estimator logic, tokenizer, threshold calculation, and trial runner
  • UniqCount.swift: CLI entry point (parse args, load input, run experiment)
  • Reporting.swift: aggregates trial stats and prints run/trial/summary tables

High-level execution flow:

  1. parse CLI options (--memory or --epsilon --delta, trials, seed)
  2. load the file once and tokenize it once
  3. resolve threshold
  4. run independent trials in parallel with deterministic per-trial seeds
  5. aggregate results and print either one estimate or a full report

CLI modes:

  • fixed threshold via --memory
  • formula-based threshold via --epsilon --delta

Default run:

1swift run uniqcount --path /tmp/hamlet.txt

Optimizations #

  • Tokenize once, then map to integer IDs Each unique token gets a UInt32 ID, and all trials run on IDs rather than strings.

  • Bit-mask sampling in the hot loop Sampling uses (rng.next() & mask) == 0 for probabilities of the form (1/2^k).

  • Cached scale factor Estimate uses (|X| \cdot 2^k) with a cached multiplier instead of repeated pow calls.

  • Deterministic downsampling order Before thinning, the set is sorted so fixed-seed runs stay reproducible across launches.

  • Parallel trials with stable output order Trials run concurrently, but each trial writes to a fixed result index.

  • ASCII fast path + Unicode fallback tokenizer ASCII input uses a byte-level path; non-ASCII uses a Unicode-safe path.

Results #

Hamlet #

I used hamlet.txt from Project Gutenberg, with seed 42 and 30 trials per config.

The tokenization rule I used is:

  • keep letters, digits, and apostrophe (')
  • lowercase everything
  • split on all other characters

Examples:

  • KING, -> king
  • can't -> can't
  • to-be -> to, be
  • Act 3 -> act, 3

Dataset stats:

  • tokens: 36107
  • exact distinct: 4962

I used 30 trials because this is a randomized estimator. A single run can be lucky or unlucky; 30 runs gives stable mean/max error while keeping runtime short.

ConfigThresholdMean relative errorMax relative errorMean trial time
--memory 1001008.279%36.719%13.329 ms
--memory 2502506.355%21.241%15.203 ms
--memory 5005003.929%15.760%18.847 ms
--memory 100010002.631%5.764%28.637 ms
--memory 200020001.760%5.119%32.285 ms

As memory increases, both average and worst-case error decrease. In practical terms, with exact distinct count 4962, a mean relative error of 2.631% means the estimate is off by about 131 tokens on average.
At 8.279%, the average gap is about 411 tokens.

Paper-parameter runs:

ConfigThresholdMean relative errorMax relative errorMean trial time
--epsilon 0.2 --delta 0.164390.000%0.000%17.520 ms
--epsilon 0.15 --delta 0.05119800.000%0.000%15.178 ms
--epsilon 0.1 --delta 0.05269550.000%0.000%15.568 ms

Each threshold is larger than 4962, so downsampling does not trigger and output is exact on this dataset.

Quanta reports averages of 3955 (memory = 100) and 3964 (memory = 1000) for true value 3967 over five runs. I reproduced the same trend (higher memory gives tighter estimates), but not identical absolute numbers, which depends on tokenization and dataset details.

Reproducing #

1curl -L 'https://www.gutenberg.org/cache/epub/1524/pg1524.txt' -o /tmp/hamlet.txt
2git clone https://github.com/n3d1117/uniqcount.git
3cd uniqcount
1swift run uniqcount --path /tmp/hamlet.txt --memory 1000 --trials 30 --seed 42 --report
1swift run uniqcount --path /tmp/hamlet.txt --epsilon 0.1 --delta 0.05 --trials 30 --seed 42 --report

Output #

This is a real run with --memory 1000, --trials 10, --seed 42.

1swift run --quiet uniqcount --path /tmp/hamlet.txt --memory 1000 --trials 10 --seed 42 --report
1Run:
2+----------------+-----------------+
3| Field | Value |
4+----------------+-----------------+
5| path | /tmp/hamlet.txt |
6| tokens | 36107 |
7| exact_distinct | 4962 |
8| trials | 10 |
9| threshold | 1000 |
10| seed | 42 |
11+----------------+-----------------+
12
13Trials:
14+-------+-------+----------+----------+--------+--------+
15| Trial | Exact | Estimate | RelError | TimeMs | Status |
16+-------+-------+----------+----------+--------+--------+
17| 1 | 4962 | 4952 | 0.202% | 24.383 | ok |
18| 2 | 4962 | 4936 | 0.524% | 26.288 | ok |
19| 3 | 4962 | 5192 | 4.635% | 25.172 | ok |
20| 4 | 4962 | 4936 | 0.524% | 25.575 | ok |
21| 5 | 4962 | 4840 | 2.459% | 24.973 | ok |
22| 6 | 4962 | 5192 | 4.635% | 25.656 | ok |
23| 7 | 4962 | 4840 | 2.459% | 27.580 | ok |
24| 8 | 4962 | 5208 | 4.958% | 23.779 | ok |
25| 9 | 4962 | 4800 | 3.265% | 25.830 | ok |
26| 10 | 4962 | 4776 | 3.748% | 24.304 | ok |
27+-------+-------+----------+----------+--------+--------+
28
29Summary:
30+---------------------+-----------+
31| Metric | Value |
32+---------------------+-----------+
33| trials | 10 |
34| failures(bottom) | 0 |
35| failure_rate | 0.000% |
36| mean_relative_error | 2.741% |
37| max_relative_error | 4.958% |
38| mean_trial_time | 25.354 ms |
39+---------------------+-----------+

Conclusion #

For this use case, the algorithm is valid.

  • The paper provides a formal $(\varepsilon, \delta)$ guarantee.
  • The implementation reproduces the expected memory/error tradeoff.
  • The Quanta-style claim is reproducible at the behavior level: small memory is noisier, larger memory is steadier.

In practice, use small fixed memory when bounded space matters most. Use a formula-based threshold when you want stronger guarantees and can spend more memory.

References #