Uniqcount: estimating unique elements with a tiny amount of memory
Table of Contents
uniqcount after reading the Quanta piece and the revised paper.
This post explains the algorithm, the Swift implementation choices, and what I could reproduce from the article’s claims.Introduction #
The problem: given a stream of values, estimate how many distinct values appeared.
For small inputs, exact counting is easy: insert everything into a set and read count.
For large streams, exact counting can use too much memory, so approximate counting is useful.
Problem #
Write the stream as:
$$A = \langle a_1, a_2, \dots, a_m \rangle$$
The target quantity is:
$$F_0(A) = |{a_1, a_2, \dots, a_m}|$$
So $F_0(A)$ is the number of distinct values seen at least once.
Estimator #
In this post I use the estimator from the paper; people call it CVM after the authors (Chakraborty, Vinodchandran, and Meel). It keeps:
- a sample set $X$
- a sampling probability $p$
Interpret $p$ as a coin bias:
- if $p = 1$, keep every eligible value
- if $p = 1/2$, keep on heads
- if $p = 1/4$, keep with probability $25%$
For each new stream element $a_i$:
- remove $a_i$ from $X$ (if present)
- add $a_i$ back with probability $p$
- if $|X|$ reaches a threshold, keep each item in $X$ with probability $1/2$, then set $p \leftarrow p/2$
Final estimate:
$$\hat{F_0} = \frac{|X|}{p}$$
The remove-then-add step is important. It gives each distinct token one fresh coin flip at the current rate.
For example, with cat, cat, dog and $p = 1/2$, the second cat replaces the first cat sample decision instead of accumulating extra weight.
A useful intuition is Quanta’s whiteboard example with capacity 100:
- fill the board at $p = 1$
- when full, flip coins and keep about half, then set $p = 1/2$
- repeat each time the board fills
Their sample run ends with $k = 6$ (so $p = 1/2^6$) and $|X| = 61$:
$$\hat{F_0} = 61 \cdot 2^6 = 3904$$
The exact distinct count reported there is 3967, so the estimate is close.
Choosing the threshold #
The paper sets threshold from $\varepsilon$, $\delta$, and stream length $m$:
$$\text{threshold} = \left\lceil \frac{12}{\varepsilon^2} \log_2\left(\frac{8m}{\delta}\right) \right\rceil$$
- $\varepsilon$: target relative error
- $\delta$: allowed failure probability
- $m$: stream length
Theorem 2 states that with this threshold, output is within $(1 \pm \varepsilon)$ of $F_0$ with probability at least $1 - \delta$.
For Hamlet in this post ($m = 36107$), $\varepsilon = 0.1$, $\delta = 0.05$ gives threshold 26955. That is much larger than the exact distinct count (4962), so this setting behaves like exact counting on this dataset.
Swift implementation #
I implemented this in Swift as a small CLI tool. The repo is n3d1117/uniqcount.
High-level codebase layout:
Core.swift: estimator logic, tokenizer, threshold calculation, and trial runnerUniqCount.swift: CLI entry point (parse args, load input, run experiment)Reporting.swift: aggregates trial stats and prints run/trial/summary tables
High-level execution flow:
- parse CLI options (
--memoryor--epsilon --delta, trials, seed) - load the file once and tokenize it once
- resolve threshold
- run independent trials in parallel with deterministic per-trial seeds
- aggregate results and print either one estimate or a full report
CLI modes:
- fixed threshold via
--memory - formula-based threshold via
--epsilon --delta
Default run:
1swift run uniqcount --path /tmp/hamlet.txtOptimizations #
Tokenize once, then map to integer IDs Each unique token gets a
UInt32ID, and all trials run on IDs rather than strings.Bit-mask sampling in the hot loop Sampling uses
(rng.next() & mask) == 0for probabilities of the form (1/2^k).Cached scale factor Estimate uses (|X| \cdot 2^k) with a cached multiplier instead of repeated
powcalls.Deterministic downsampling order Before thinning, the set is sorted so fixed-seed runs stay reproducible across launches.
Parallel trials with stable output order Trials run concurrently, but each trial writes to a fixed result index.
ASCII fast path + Unicode fallback tokenizer ASCII input uses a byte-level path; non-ASCII uses a Unicode-safe path.
Results #
Hamlet #
I used hamlet.txt from Project Gutenberg, with seed 42 and 30 trials per config.
The tokenization rule I used is:
- keep letters, digits, and apostrophe (
') - lowercase everything
- split on all other characters
Examples:
KING,->kingcan't->can'tto-be->to,beAct 3->act,3
Dataset stats:
- tokens: 36107
- exact distinct: 4962
I used 30 trials because this is a randomized estimator. A single run can be lucky or unlucky; 30 runs gives stable mean/max error while keeping runtime short.
| Config | Threshold | Mean relative error | Max relative error | Mean trial time |
|---|---|---|---|---|
--memory 100 | 100 | 8.279% | 36.719% | 13.329 ms |
--memory 250 | 250 | 6.355% | 21.241% | 15.203 ms |
--memory 500 | 500 | 3.929% | 15.760% | 18.847 ms |
--memory 1000 | 1000 | 2.631% | 5.764% | 28.637 ms |
--memory 2000 | 2000 | 1.760% | 5.119% | 32.285 ms |
As memory increases, both average and worst-case error decrease.
In practical terms, with exact distinct count 4962, a mean relative error of 2.631% means the estimate is off by about 131 tokens on average.
At 8.279%, the average gap is about 411 tokens.
Paper-parameter runs:
| Config | Threshold | Mean relative error | Max relative error | Mean trial time |
|---|---|---|---|---|
--epsilon 0.2 --delta 0.1 | 6439 | 0.000% | 0.000% | 17.520 ms |
--epsilon 0.15 --delta 0.05 | 11980 | 0.000% | 0.000% | 15.178 ms |
--epsilon 0.1 --delta 0.05 | 26955 | 0.000% | 0.000% | 15.568 ms |
Each threshold is larger than 4962, so downsampling does not trigger and output is exact on this dataset.
Quanta reports averages of 3955 (memory = 100) and 3964 (memory = 1000) for true value 3967 over five runs.
I reproduced the same trend (higher memory gives tighter estimates), but not identical absolute numbers, which depends on tokenization and dataset details.
Reproducing #
1curl -L 'https://www.gutenberg.org/cache/epub/1524/pg1524.txt' -o /tmp/hamlet.txt2git clone https://github.com/n3d1117/uniqcount.git3cd uniqcount1swift run uniqcount --path /tmp/hamlet.txt --memory 1000 --trials 30 --seed 42 --report1swift run uniqcount --path /tmp/hamlet.txt --epsilon 0.1 --delta 0.05 --trials 30 --seed 42 --reportOutput #
This is a real run with --memory 1000, --trials 10, --seed 42.
1swift run --quiet uniqcount --path /tmp/hamlet.txt --memory 1000 --trials 10 --seed 42 --report 1Run: 2+----------------+-----------------+ 3| Field | Value | 4+----------------+-----------------+ 5| path | /tmp/hamlet.txt | 6| tokens | 36107 | 7| exact_distinct | 4962 | 8| trials | 10 | 9| threshold | 1000 |10| seed | 42 |11+----------------+-----------------+1213Trials:14+-------+-------+----------+----------+--------+--------+15| Trial | Exact | Estimate | RelError | TimeMs | Status |16+-------+-------+----------+----------+--------+--------+17| 1 | 4962 | 4952 | 0.202% | 24.383 | ok |18| 2 | 4962 | 4936 | 0.524% | 26.288 | ok |19| 3 | 4962 | 5192 | 4.635% | 25.172 | ok |20| 4 | 4962 | 4936 | 0.524% | 25.575 | ok |21| 5 | 4962 | 4840 | 2.459% | 24.973 | ok |22| 6 | 4962 | 5192 | 4.635% | 25.656 | ok |23| 7 | 4962 | 4840 | 2.459% | 27.580 | ok |24| 8 | 4962 | 5208 | 4.958% | 23.779 | ok |25| 9 | 4962 | 4800 | 3.265% | 25.830 | ok |26| 10 | 4962 | 4776 | 3.748% | 24.304 | ok |27+-------+-------+----------+----------+--------+--------+2829Summary:30+---------------------+-----------+31| Metric | Value |32+---------------------+-----------+33| trials | 10 |34| failures(bottom) | 0 |35| failure_rate | 0.000% |36| mean_relative_error | 2.741% |37| max_relative_error | 4.958% |38| mean_trial_time | 25.354 ms |39+---------------------+-----------+Conclusion #
For this use case, the algorithm is valid.
- The paper provides a formal $(\varepsilon, \delta)$ guarantee.
- The implementation reproduces the expected memory/error tradeoff.
- The Quanta-style claim is reproducible at the behavior level: small memory is noisier, larger memory is steadier.
In practice, use small fixed memory when bounded space matters most. Use a formula-based threshold when you want stronger guarantees and can spend more memory.