US Small/Mid-Cap Universe Build — 2026-06-19
Why this was done (the lead being pursued)
The surviving idea from the flag-rider work (shelved June 10) is cross-sectional
momentum: each week, rank a basket of names, go long the strongest / short the
weakest, gated by market regime. That kind of strategy lives or dies on the
breadth of the universe — ranking only the ~1,400 US tickers already in
big_bots.db is too thin and skews large-cap. Before downloading any price data,
this session scoped how many US small/mid-cap names are missing and whether they
fit on disk. No candles have been downloaded yet — this is the universe definition
step only.
Method
- Pulled the whole-US-market list once from the NASDAQ bulk screener (cached to nasdaq_screener_raw.json, 2.2 MB, so no repeat downloads).
- Dropped non-common-equity by descriptor (warrants, rights, units, preferred, debentures, notes, funds, ETFs, when-issued).
- Applied a market-cap floor of > $250M.
- Subtracted names already present in big_bots.db (read-only on the shares table, exchange = NYSE/NASDAQ).
- Second-pass clean (added this session) to remove debt/preferred instruments that slipped the name filter.
big_bots.db was treated strictly read-only throughout. All outputs live in
/root/research/.
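The funnel steps in this section can be sketched roughly as follows. This is a minimal illustration, not the contents of build_us_universe.py: the row keys (symbol, name, market_cap), the keyword regex and the function name are all assumptions made for the example.

```python
import re

# Hypothetical descriptor keywords; the real filter's wording may differ.
NON_EQUITY = re.compile(
    r"\b(warrants?|rights?|units?|preferred|debentures?|notes?|funds?|etfs?|when[- ]issued)\b",
    re.IGNORECASE,
)
CAP_FLOOR = 250_000_000  # USD, strict ">" as in the method above


def build_universe(rows, existing_symbols):
    """Apply type filter, cap floor and DB subtraction; return (new_rows, funnel).

    rows: list of dicts with assumed keys 'symbol', 'name', 'market_cap'.
    existing_symbols: set of tickers already present in the database.
    """
    equity = [r for r in rows if not NON_EQUITY.search(r["name"])]
    over_floor = [r for r in equity if r["market_cap"] > CAP_FLOOR]
    new = [r for r in over_floor if r["symbol"] not in existing_symbols]
    funnel = {
        "master": len(rows),
        "common_equity": len(equity),
        "over_floor": len(over_floor),
        "new": len(new),
    }
    return new, funnel


# Made-up rows purely for illustration.
sample = [
    {"symbol": "AAA", "name": "Alpha Corp Common Stock", "market_cap": 900e6},
    {"symbol": "BBBW", "name": "Beta Inc Warrants", "market_cap": 400e6},
    {"symbol": "CCC", "name": "Gamma Ltd Common Stock", "market_cap": 100e6},
    {"symbol": "DDD", "name": "Delta Co Common Stock", "market_cap": 2e9},
]
new_rows, funnel = build_universe(sample, existing_symbols={"DDD"})
print(funnel, [r["symbol"] for r in new_rows])
```

Returning the funnel counts alongside the list keeps each stage auditable, which is the same idea as the funnel table reported for the real run.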
Result — universe funnel
| Stage | Count |
|---|---|
| Master screener rows | 7,141 |
| Common-equity universe (after type filter) | 5,592 |
| …with market cap > $250M | 3,558 |
| Already in DB | 1,427 |
| NEW names (first pass) | 2,173 |
| NEW names after second-pass clean | 2,164 |
Second-pass clean — 9 rows removed (all genuinely non-equity)
| Ticker | Reason | Name |
|---|---|---|
| CCZ | ZONES (exchangeable debt) | Comcast Holdings ZONES |
| EMP, EAI, ENJ, ENO | mortgage bonds | Entergy LLC First Mortgage Bonds |
| ELC | mortgage bonds | Entergy Louisiana Collateral Trust Mortgage Bonds |
| LILAP | preferred | Liberty Latin America Fixed Rate Cumulative Perpetual Preference Shares |
| GJS | structured debt | Goldman Sachs STRATS Trust |
| GBAB | bond CEF | Guggenheim Taxable Municipal Bond & IG Debt Trust |
The clean filter was deliberately narrow: it targets only rate-bearing debt and
preferred wording (%, "bonds", "debentures", "STRATS", "ZONES", "preferred/
preference/perpetual"). ADRs (e.g. GGAL, PONY, MNSO), REITs (COLD, MPT, BXMT) and
LPs/partnerships (Green Brick, Artisan Partners) were kept on purpose — they
trade like equities and belong in a momentum universe. Full audit:
us_universe_dropped.csv.
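A narrow filter of this kind might look like the sketch below. It is an assumption-laden illustration rather than the actual clean_universe.py: the regex is reconstructed from the wording listed above, and the row keys and function name are hypothetical.

```python
import re

# Rate-bearing debt and preferred wording only; ADRs, REITs and LPs are not targeted.
DEBT_OR_PREFERRED = re.compile(
    r"%|\b(bonds?|debentures?|strats|zones|preferred|preference|perpetual)\b",
    re.IGNORECASE,
)


def second_pass(rows):
    """Split rows into (kept, dropped); dropped rows carry the matched term as 'reason'."""
    kept, dropped = [], []
    for row in rows:
        match = DEBT_OR_PREFERRED.search(row["name"])
        if match:
            dropped.append({**row, "reason": match.group(0).lower()})
        else:
            kept.append(row)
    return kept, dropped


rows = [
    {"symbol": "CCZ", "name": "Comcast Holdings ZONES"},
    {"symbol": "GJS", "name": "Goldman Sachs STRATS Trust"},
    # Hypothetical names below, used only to show what passes the filter.
    {"symbol": "XREIT", "name": "Example Realty Trust Common Stock"},
    {"symbol": "XADR", "name": "Example Holdings American Depositary Shares"},
]
kept, dropped = second_pass(rows)
print([r["symbol"] for r in kept], [(r["symbol"], r["reason"]) for r in dropped])
```

Recording the matched term with each dropped row gives the same kind of audit trail as the reasons column in us_universe_dropped.csv.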
Cap-band distribution of the 2,164 new names
| Band | Names |
|---|---|
| > $5B | 36 |
| $2–5B | 621 |
| $1–2B | 526 |
| $500M–1B | 493 |
| $250–500M | 488 |
So the bulk is genuinely small/mid-cap (~1,500 names under $2B), with a tail of larger names that simply weren't in the DB yet (IMO, CBOE, WULF, RIOT…).
Disk projection (the gate before downloading)
Backfilling 2,164 names at ~3,000 daily rows each:
- Lean schema (OHLCV only, ~100 B/row): ~0.61 GB
- Full schema (~504 B/row): ~3.06 GB
Free disk at build time: ~26 GB. Either fits, but lean is the obvious choice given
big_bots.db is already 16 GB on a 47 GB disk.
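The projection is simple arithmetic (names × rows per name × bytes per row), sketched below with the inputs quoted in this section. The function name is hypothetical, and the script prints both decimal GB and binary GiB because the notes do not say which convention was used.

```python
FREE_GB = 26  # free disk at build time, from the notes above


def projected_bytes(names, rows_per_name, bytes_per_row):
    """Raw size estimate in bytes; ignores index and page overhead."""
    return names * rows_per_name * bytes_per_row


for label, bytes_per_row in [("lean", 100), ("full", 504)]:
    size = projected_bytes(2164, 3000, bytes_per_row)
    fits = size / 1e9 < FREE_GB
    print(f"{label}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB), fits={fits}")
```

The exact figure shifts a little with the unit convention and with which name count is plugged in (2,164 vs the first-pass 2,173), so small differences from the rounded values above are expected.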
Honest conclusion / caveats
- This is infrastructure, not a result — no strategy has been tested yet and no prices downloaded. The deliverable is a clean, deduped, cap-filtered ticker list.
- Survivorship bias warning (per CLAUDE.md): the NASDAQ screener lists currently listed companies only. Backfilling their history is survivorship-clean only for recent data — consistent with the earlier decision against a 5-yr backfill. Restrict any backtest on this universe to the recent ~2 yr window.
- The cap figures are a single-day snapshot from the screener; they define membership, not point-in-time ranking. Cross-sectional ranking must use price/return data, not this static cap.
- Two borderline names remain in the kept list worth a glance later: CET (Central Securities Corp) and QNT are closed-end-fund-ish; they passed because their names say "Common Stock". Low priority — won't distort a 2,164-name ranking.
Files produced this session
- build_us_universe.py / build_us_universe.log: universe build + funnel report
- us_universe_over_250m.csv: first-pass list (2,173, kept for audit)
- clean_universe.py: second-pass debt/preferred filter
- us_universe_clean.csv: final list, 2,164 names (download target)
- us_universe_dropped.csv: the 9 removed rows + reasons
Backfill — measured size & decision (added later 2026-06-19)
A 30-ticker stratified sample was run through the existing backfill.py logic to
measure real size before committing. The earlier "~3.06 GB" projection assumed 3,000
daily rows/ticker; reality is ~7,300 rows/ticker because backfill.py pulls
period="max" (some names go back to the 1960s–70s) across daily + weekly +
monthly. Measured options (504 B/row, extrapolated to 2,164 names):
| Option | History | Projected rows | ~Size |
|---|---|---|---|
| A — max history, all timeframes (as-is) | 1962→now | 15.8 M | ~8.0 GB |
| B — max history, daily only | 1962→now | 12.6 M | ~6.4 GB |
| C — full schema, since 2014 (~12 yr) | 12 yr | 6.5 M | ~3.3 GB |
| D — full schema, since 2024-06 (~2 yr) | 2 yr | 1.3 M | ~0.6 GB |
Decision (Jacques): Option A (full max history, ~8 GB), knowingly overriding the 5 GB ceiling. Disk: 26 GB free → ~18 GB free after. Reminder for any backtest on this data: the pre-~2024 history is survivorship-biased (currently-listed names only), so keep tuning/validation on the recent window per the standing decision.
What was run
- backfill_research_universe.py: research-only wrapper that reuses backfill.py's download/indicator/insert logic but drives off us_universe_clean.csv and tracks progress in its own research_backfill_progress table. It deliberately does NOT touch the shares table, so the live bots' watchlist is unchanged.
- Writes go to ohlcv_snapshots with event_type='historical', INSERT OR IGNORE (safe alongside live services). Daily + weekly + monthly, full 34-col schema.
- Status: launched 2026-06-19 ~11:49 UTC, ~2.8 h ETA, resumable. Check with python3 backfill_research_universe.py status.
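The idempotent-write and resume behaviour described above can be illustrated with the standard sqlite3 module. This is a sketch under assumptions, not the real wrapper: the table columns are a made-up subset of the 34-column schema, the unique key is guessed, and the function names and candle values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not big_bots.db
conn.execute(
    """CREATE TABLE ohlcv_snapshots (
           ticker TEXT, timeframe TEXT, ts TEXT, close REAL, event_type TEXT,
           UNIQUE (ticker, timeframe, ts, event_type))"""
)
conn.execute(
    "CREATE TABLE research_backfill_progress (ticker TEXT PRIMARY KEY, done INTEGER NOT NULL)"
)


def backfill_ticker(conn, ticker, candles):
    """Insert candles as 'historical' rows and mark the ticker done in one transaction."""
    rows = [(ticker, tf, ts, close, "historical") for tf, ts, close in candles]
    with conn:  # commits on success, rolls back on error
        # OR IGNORE skips rows that would violate the UNIQUE key, so reruns are harmless.
        conn.executemany("INSERT OR IGNORE INTO ohlcv_snapshots VALUES (?, ?, ?, ?, ?)", rows)
        conn.execute(
            "INSERT OR REPLACE INTO research_backfill_progress VALUES (?, 1)", (ticker,)
        )


def pending(conn, universe):
    """Tickers from the universe that have not been marked done yet (resume point)."""
    done = {t for (t,) in conn.execute(
        "SELECT ticker FROM research_backfill_progress WHERE done = 1")}
    return [t for t in universe if t not in done]


dummy = [("1d", "2026-06-17", 10.0), ("1d", "2026-06-18", 10.5)]  # made-up candles
backfill_ticker(conn, "WULF", dummy)
backfill_ticker(conn, "WULF", dummy)  # rerun: no duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM ohlcv_snapshots").fetchone()[0]
print(row_count, pending(conn, ["WULF", "RIOT"]))
```

Keeping progress in a separate table is what lets the wrapper resume after an interruption without reading or writing the shares table.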
Next step (after backfill completes)
Build the cross-sectional momentum prototype on this universe: weekly return ranking, long strongest / short weakest, BTC/regime gate, applying the mandatory validation rules (costs, 8-position cap, pre-2026 tuning vs 2026 OOS).
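As a starting point, the ranking step alone could look like the sketch below. Everything here is an assumption for illustration: the function name, the use of simple lookback returns, an all-flat book when the regime gate is off, and splitting the position cap evenly between longs and shorts. Costs and the tuning/OOS split are not modelled.

```python
def rank_momentum(prices, lookback, per_side, regime_on=True):
    """Return (longs, shorts) from weekly closes.

    prices: dict of ticker -> list of weekly closes, oldest first.
    lookback: number of weeks over which the return is measured.
    per_side: max names on each side (e.g. 4 + 4 under an 8-position cap, assumed split).
    regime_on: when False, hold nothing (assumed gate behaviour).
    """
    if not regime_on:
        return [], []
    returns = {
        ticker: closes[-1] / closes[-1 - lookback] - 1.0
        for ticker, closes in prices.items()
        if len(closes) > lookback and closes[-1 - lookback] > 0
    }
    ranked = sorted(returns, key=returns.get, reverse=True)
    n = min(per_side, len(ranked) // 2)  # never let longs and shorts overlap
    longs = ranked[:n]
    shorts = ranked[-n:] if n else []
    return longs, shorts


# Made-up closes for four placeholder tickers.
weekly = {"A": [10, 12], "B": [10, 9], "C": [10, 10.5], "D": [10, 11]}
longs, shorts = rank_momentum(weekly, lookback=1, per_side=1)
print(longs, shorts)
```

The ranking uses price returns only, in line with the caveat that the static screener cap defines membership and must not feed the cross-sectional rank.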