US Small/Mid-Cap Universe Build — 2026-06-19

Why this was done (the lead being pursued)

The surviving idea from the flag-rider work (shelved June 10) is cross-sectional momentum: each week, rank a basket of names, go long the strongest / short the weakest, gated by market regime. That kind of strategy lives or dies on the breadth of the universe — ranking only the ~1,400 US tickers already in big_bots.db is too thin and skews large-cap. Before downloading any price data, this session scoped how many US small/mid-cap names are missing and whether they fit on disk. No candles have been downloaded yet — this is the universe definition step only.

Method

Pulled the whole-US-market list once from the NASDAQ bulk screener (cached to nasdaq_screener_raw.json, 2.2 MB — no repeat downloads).
Dropped non-common-equity by descriptor (warrants, rights, units, preferred, debentures, notes, funds, ETFs, when-issued).
Applied a market-cap floor of > $250M.
Subtracted names already present in big_bots.db (read-only query on the shares table, exchange = NYSE/NASDAQ).
Second-pass clean (added this session) to remove debt/preferred instruments that slipped the name filter.
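The first three steps above can be sketched as a single pass over the screener rows. This is a minimal illustration, not the code in build_us_universe.py: the row keys ('symbol', 'name', 'marketCap'), the regex, and the sample rows are all assumptions made for the example.

```python
import re

# Descriptor keywords come from the method notes; the exact matching rules in
# build_us_universe.py are not shown, so this regex is an assumption.
NON_COMMON = re.compile(
    r"\b(warrants?|rights?|units?|preferred|debentures?|notes?|funds?|etfs?"
    r"|when[- ]issued)\b",
    re.IGNORECASE,
)
CAP_FLOOR = 250_000_000  # USD, strict ">" as in the notes


def build_universe(rows, known_tickers):
    """Return symbols that are common equity, above the floor, and not in the DB.

    rows: iterable of dicts with hypothetical keys 'symbol', 'name', 'marketCap'.
    known_tickers: set of symbols already present in the database.
    """
    new_names = []
    for row in rows:
        if NON_COMMON.search(row["name"]):
            continue  # descriptor says it is not common equity
        if row["marketCap"] <= CAP_FLOOR:
            continue  # at or below the cap floor
        if row["symbol"] in known_tickers:
            continue  # already in the database
        new_names.append(row["symbol"])
    return new_names


# Made-up rows, for illustration only.
sample = [
    {"symbol": "AAA", "name": "Alpha Corp Common Stock", "marketCap": 900_000_000},
    {"symbol": "BBBW", "name": "Beta Inc Warrant", "marketCap": 400_000_000},
    {"symbol": "CCC", "name": "Gamma Ltd Common Stock", "marketCap": 100_000_000},
    {"symbol": "DDD", "name": "Delta Co Common Stock", "marketCap": 3_000_000_000},
]
new_symbols = build_universe(sample, known_tickers={"DDD"})
print(new_symbols)
```

Only AAA survives here: BBBW is dropped by descriptor, CCC by the cap floor, and DDD because it is already known.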

big_bots.db was treated strictly read-only throughout. All outputs live in /root/research/.

Result — universe funnel

Stage	Count
Master screener rows	7,141
Common-equity universe (after type filter)	5,592
…with market cap > $250M	3,558
Already in DB	1,427
NEW names (first pass)	2,173
NEW names after second-pass clean	2,164

Second-pass clean — 9 rows removed (all genuinely non-equity)

Ticker	Reason	Name
CCZ	ZONES (exchangeable debt)	Comcast Holdings ZONES
EMP, EAI, ENJ, ENO	mortgage bonds	Entergy LLC First Mortgage Bonds
ELC	mortgage bonds	Entergy Louisiana Collateral Trust Mortgage Bonds
LILAP	preferred	Liberty Latin America Fixed Rate Cumulative Perpetual Preference Shares
GJS	structured debt	Goldman Sachs STRATS Trust
GBAB	bond CEF	Guggenheim Taxable Municipal Bond & IG Debt Trust

The clean filter was deliberately narrow: it targets only rate-bearing debt and preferred wording (%, "bonds", "debentures", "STRATS", "ZONES", "preferred/preference/perpetual"). ADRs (e.g. GGAL, PONY, MNSO), REITs (COLD, MPT, BXMT) and LPs/partnerships (Green Brick, Artisan Partners) were kept on purpose: they trade like equities and belong in a momentum universe. Full audit: us_universe_dropped.csv.
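A narrow filter of this kind could look like the sketch below. The regex is built from the wording listed above, but it is an assumption about how clean_universe.py matches; the kept example name is made up. Recording the matched term next to each dropped row gives the audit trail that us_universe_dropped.csv provides.

```python
import re

# Terms taken from the notes above; the real pattern in clean_universe.py
# may differ (assumption).
DEBT_OR_PREFERRED = re.compile(
    r"%|\b(bonds?|debentures?|strats|zones|preferred|preference|perpetual)\b",
    re.IGNORECASE,
)


def second_pass(rows):
    """Split (symbol, name) pairs into kept rows and dropped rows with a reason."""
    kept, dropped = [], []
    for symbol, name in rows:
        match = DEBT_OR_PREFERRED.search(name)
        if match:
            dropped.append((symbol, match.group(0).lower(), name))
        else:
            kept.append((symbol, name))
    return kept, dropped


demo_rows = [
    ("CCZ", "Comcast Holdings ZONES"),
    ("GJS", "Goldman Sachs STRATS Trust"),
    ("GBAB", "Guggenheim Taxable Municipal Bond & IG Debt Trust"),
    ("XYZ", "Example Realty Trust Common Stock"),  # hypothetical kept name
]
kept, dropped = second_pass(demo_rows)
for symbol, reason, name in dropped:
    print(symbol, reason, name, sep="\t")
```

A word-level match like this would also catch an operating company whose name happens to contain "Bond" or "Zones", which is one reason to review the dropped list by hand.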

Cap-band distribution of the 2,164 new names

Band	Names
> $5B	36
$2–5B	621
$1–2B	526
$500M–1B	493
$250–500M	488

So the bulk is genuinely small/mid-cap (~1,500 names under $2B), with a tail of larger names that simply weren't in the DB yet (IMO, CBOE, WULF, RIOT…).
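The banding and the "~1,500 under $2B" figure can be cross-checked with a few lines. The counts are the ones in the table above; the boundary handling (strict ">" at each edge) is an assumption, since the notes do not say which side a cap exactly on a boundary falls.

```python
# Band floors in USD, checked from the top down.
BANDS = [
    (5_000_000_000, "> $5B"),
    (2_000_000_000, "$2–5B"),
    (1_000_000_000, "$1–2B"),
    (500_000_000, "$500M–1B"),
    (250_000_000, "$250–500M"),
]


def cap_band(market_cap):
    """Return the band label for a market cap, or None at or below the floor."""
    for floor, label in BANDS:
        if market_cap > floor:
            return label
    return None


# Counts copied from the distribution table.
table_counts = {
    "> $5B": 36,
    "$2–5B": 621,
    "$1–2B": 526,
    "$500M–1B": 493,
    "$250–500M": 488,
}
total = sum(table_counts.values())
under_2b = sum(table_counts[b] for b in ("$1–2B", "$500M–1B", "$250–500M"))
print(total, under_2b)
```

The bands sum to 2,164, matching the final list, and 1,507 names sit under $2B.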

Disk projection (the gate before downloading)

Backfilling 2,164 names at ~3,000 daily rows each:
- Lean schema (OHLCV only, ~100 B/row): ~0.61 GB
- Full schema (~504 B/row): ~3.06 GB

Free disk at build time: ~26 GB. Either fits, but lean is the obvious choice given big_bots.db is already 16 GB on a 47 GB disk.
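The projection is just names × rows × bytes per row. A quick check, assuming the quoted sizes are binary gigabytes (GiB): the ~0.61 and ~3.06 figures appear to be reproduced with the first-pass count of 2,173 names, while the final 2,164 gives roughly 0.60 and 3.05. The difference is immaterial to the disk gate.

```python
GIB = 1024 ** 3  # assumption: the quoted "GB" figures are binary gigabytes


def projected_gib(names, rows_per_name, bytes_per_row):
    """Projected table size in GiB, ignoring index and page overhead."""
    return names * rows_per_name * bytes_per_row / GIB


ROWS_PER_NAME = 3_000
LEAN_BYTES, FULL_BYTES = 100, 504

projection = {
    names: (
        round(projected_gib(names, ROWS_PER_NAME, LEAN_BYTES), 2),
        round(projected_gib(names, ROWS_PER_NAME, FULL_BYTES), 2),
    )
    for names in (2_173, 2_164)
}
for names, (lean, full) in projection.items():
    print(f"{names} names: lean ~{lean} GiB, full ~{full} GiB")
```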

Honest conclusion / caveats

This is infrastructure, not a result — no strategy has been tested yet and no prices downloaded. The deliverable is a clean, deduped, cap-filtered ticker list.
Survivorship bias warning (per CLAUDE.md): the NASDAQ screener lists currently listed companies only. Backfilling their history is survivorship-clean only for recent data — consistent with the earlier decision against a 5-yr backfill. Restrict any backtest on this universe to the recent ~2 yr window.
The cap figures are a single-day snapshot from the screener; they define membership, not point-in-time ranking. Cross-sectional ranking must use price/return data, not this static cap.
Two borderline names remain in the kept list worth a glance later: CET (Central Securities Corp) and QNT are closed-end-fund-ish; they passed because their names say "Common Stock". Low priority — won't distort a 2,164-name ranking.

Files produced this session

build_us_universe.py / build_us_universe.log — universe build + funnel report
us_universe_over_250m.csv — first-pass list (2,173, kept for audit)
clean_universe.py — second-pass debt/preferred filter
us_universe_clean.csv — final list, 2,164 names (download target)
us_universe_dropped.csv — the 9 removed rows + reasons

Backfill — measured size & decision (added later 2026-06-19)

A 30-ticker stratified sample was run through the existing backfill.py logic to measure real size before committing. The earlier "~3.06 GB" projection assumed 3,000 daily rows/ticker; reality is ~7,300 rows/ticker because backfill.py pulls period="max" (some names go back to the 1960s–70s) across daily + weekly + monthly. Measured options (504 B/row, extrapolated to 2,164 names):

Option	History	Projected rows	~Size
A — max history, all timeframes (as-is)	1962→now	15.8 M	~8.0 GB
B — max history, daily only	1962→now	12.6 M	~6.4 GB
C — full schema, since 2014 (~12 yr)	12 yr	6.5 M	~3.3 GB
D — full schema, since 2024-06 (~2 yr)	2 yr	1.3 M	~0.6 GB
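The sizes in this table follow from projected rows × 504 B, here read as decimal gigabytes (an assumption that fits options A to C). A short check using the rounded row counts from the table:

```python
BYTES_PER_ROW = 504
NAMES = 2_164

# Projected rows per option, copied from the table (rounded to 0.1 M).
projected_rows = {
    "A": 15_800_000,
    "B": 12_600_000,
    "C": 6_500_000,
    "D": 1_300_000,
}

size_gb = {
    option: round(rows * BYTES_PER_ROW / 1e9, 1)
    for option, rows in projected_rows.items()
}
rows_per_ticker_a = round(projected_rows["A"] / NAMES)

print(size_gb)
print(rows_per_ticker_a)
```

Options A, B and C come out at 8.0, 6.4 and 3.3 GB, and option A works out to about 7,300 rows per ticker. Option D computes to about 0.66 GB from the rounded 1.3 M rows; the table's ~0.6 presumably reflects the unrounded row count.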

Decision (Jacques): Option A, full max history (~8 GB), which overrides the 5 GB ceiling. Disk: 26 GB free → ~18 GB free after. Reminder for any backtest on this data: the pre-~2024 history is survivorship-biased (currently listed names only), so keep tuning/validation on the recent window per the standing decision.

What was run

backfill_research_universe.py — research-only wrapper that reuses backfill.py's download/indicator/insert logic but drives off us_universe_clean.csv and tracks progress in its own research_backfill_progress table. It deliberately does NOT touch the shares table, so the live bots' watchlist is unchanged.
Writes go to ohlcv_snapshots with event_type='historical', using INSERT OR IGNORE (safe alongside live services: existing rows are never overwritten). Daily + weekly + monthly, full 34-col schema.
Status: launched 2026-06-19 ~11:49 UTC, ~2.8 h ETA, resumable. Check with python3 backfill_research_universe.py status.
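The resumable, duplicate-safe pattern described above can be sketched with an in-memory database. The column layout and the uniqueness key below are assumptions for illustration (the real ohlcv_snapshots table has 34 columns, and its constraints are not shown here). INSERT OR IGNORE only skips a row when it would violate a uniqueness constraint, so the pattern depends on one being defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ohlcv_snapshots (
        ticker TEXT, timeframe TEXT, ts TEXT, close REAL, event_type TEXT,
        UNIQUE (ticker, timeframe, ts, event_type)  -- assumed uniqueness key
    );
    CREATE TABLE research_backfill_progress (
        ticker TEXT PRIMARY KEY, status TEXT
    );
    """
)


def backfill_ticker(conn, ticker, rows):
    """Insert rows for one ticker; return the number of rows actually written."""
    done = conn.execute(
        "SELECT 1 FROM research_backfill_progress "
        "WHERE ticker = ? AND status = 'done'",
        (ticker,),
    ).fetchone()
    if done:
        return 0  # finished in an earlier run, so skip on resume
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO ohlcv_snapshots "
        "(ticker, timeframe, ts, close, event_type) "
        "VALUES (?, ?, ?, ?, 'historical')",
        rows,
    )
    inserted = conn.total_changes - before  # ignored rows are not counted
    conn.execute(
        "INSERT OR REPLACE INTO research_backfill_progress VALUES (?, 'done')",
        (ticker,),
    )
    conn.commit()
    return inserted


# Made-up rows; the last one duplicates the second and is ignored.
rows = [
    ("AAA", "1d", "2026-06-17", 10.0),
    ("AAA", "1d", "2026-06-18", 10.5),
    ("AAA", "1d", "2026-06-18", 10.5),
]
first_run = backfill_ticker(conn, "AAA", rows)
second_run = backfill_ticker(conn, "AAA", rows)
stored = conn.execute("SELECT COUNT(*) FROM ohlcv_snapshots").fetchone()[0]
print(first_run, second_run, stored)
```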

Next step (after backfill completes)

Build the cross-sectional momentum prototype on this universe: weekly return ranking, long strongest / short weakest, BTC/regime gate, applying the mandatory validation rules (costs, 8-position cap, pre-2026 tuning vs 2026 OOS).
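As a starting point for that prototype, the weekly ranking step might look like the sketch below. It is only a shape for the idea: the one-week lookback, the way the 8-position cap is split between longs and shorts, and the boolean regime flag are all assumptions, and costs and validation are not modelled.

```python
MAX_POSITIONS = 8  # position cap from the validation rules


def rank_basket(weekly_closes, n_long, n_short, regime_ok):
    """Pick the strongest and weakest names by last-week return.

    weekly_closes: dict of ticker -> list of weekly closes, oldest first.
    regime_ok: result of the market-regime gate; no positions when False.
    """
    if n_long + n_short > MAX_POSITIONS:
        raise ValueError("position cap exceeded")
    if not regime_ok:
        return [], []
    returns = {
        ticker: closes[-1] / closes[-2] - 1
        for ticker, closes in weekly_closes.items()
        if len(closes) >= 2  # need two closes to form a return
    }
    ranked = sorted(returns, key=returns.get, reverse=True)
    longs = ranked[:n_long]
    # Walk up from the bottom, never shorting a name that is already long.
    shorts = [t for t in reversed(ranked) if t not in longs][:n_short]
    return longs, shorts


# Made-up prices, for illustration only.
closes = {
    "AAA": [10.0, 11.0],   # +10%
    "BBB": [20.0, 19.0],   # -5%
    "CCC": [5.0, 5.25],    # +5%
    "DDD": [8.0, 8.08],    # +1%
    "EEE": [30.0, 27.0],   # -10%
}
longs, shorts = rank_basket(closes, n_long=2, n_short=2, regime_ok=True)
flat = rank_basket(closes, n_long=2, n_short=2, regime_ok=False)
print(longs, shorts, flat)
```

Ranking on returns computed from prices, rather than on the static screener caps, keeps this consistent with the point-in-time caveat noted earlier.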