
US Small/Mid-Cap Universe Build — 2026-06-19

Why this was done (the lead being pursued)

The surviving idea from the flag-rider work (shelved June 10) is cross-sectional momentum: each week, rank a basket of names, go long the strongest / short the weakest, gated by market regime. That kind of strategy lives or dies on the breadth of the universe — ranking only the ~1,400 US tickers already in big_bots.db is too thin and skews large-cap. Before downloading any price data, this session scoped how many US small/mid-cap names are missing and whether they fit on disk. No candles have been downloaded yet — this is the universe definition step only.

Method

  1. Pulled the whole-US-market list once from the NASDAQ bulk screener (cached to nasdaq_screener_raw.json, 2.2 MB — no repeat downloads).
  2. Dropped non-common-equity by descriptor (warrants, rights, units, preferred, debentures, notes, funds, ETFs, when-issued).
  3. Applied a market-cap floor of > $250M.
  4. Subtracted names already present in big_bots.db (read-only on shares, exchange = NYSE/NASDAQ).
  5. Second-pass clean (added this session) to remove debt/preferred instruments that slipped the name filter.
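The funnel in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the script used in the session: the record keys (symbol, name, market_cap), the sample rows, and the exact regular expression are assumptions, and the real screener JSON may be laid out differently.

```python
import re

# Hypothetical record layout and made-up rows, for illustration only.
SAMPLE_ROWS = [
    {"symbol": "AAA", "name": "Alpha Corp Common Stock", "market_cap": 900e6},
    {"symbol": "BBB", "name": "Beta Inc Warrant", "market_cap": 400e6},
    {"symbol": "CCC", "name": "Gamma Ltd Common Stock", "market_cap": 120e6},
    {"symbol": "DDD", "name": "Delta Holdings Common Stock", "market_cap": 3.1e9},
    {"symbol": "EEE", "name": "Epsilon Bond ETF", "market_cap": 2.0e9},
]

# Step 2 descriptors, matched as whole words, case-insensitively.
NON_EQUITY = re.compile(
    r"\b(warrants?|rights?|units?|preferred|debentures?|notes?|funds?|etfs?"
    r"|when[- ]issued)\b",
    re.IGNORECASE,
)
CAP_FLOOR = 250e6  # step 3: strictly greater than $250M


def build_universe(rows, already_in_db):
    """Apply the type filter, the cap floor, then subtract names already held."""
    common = [r for r in rows if not NON_EQUITY.search(r["name"])]
    above_floor = [r for r in common if r["market_cap"] > CAP_FLOOR]
    new = [r["symbol"] for r in above_floor if r["symbol"] not in already_in_db]
    return {
        "master": len(rows),
        "common": len(common),
        "above_floor": len(above_floor),
        "new": new,
    }


funnel = build_universe(SAMPLE_ROWS, already_in_db={"DDD"})
print(funnel)
```

Keeping the per-stage counts in the return value makes it easy to print a funnel table and to spot a stage that drops far more rows than expected.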

big_bots.db was treated strictly read-only throughout. All outputs live in /root/research/.

Result — universe funnel

Stage                                         Count
Master screener rows                          7,141
Common-equity universe (after type filter)    5,592
...with market cap > $250M                    3,558
Already in DB                                 1,427
NEW names (first pass)                        2,173
NEW names after second-pass clean             2,164

Second-pass clean — 9 rows removed (all genuinely non-equity)

Ticker                Reason                      Name
CCZ                   ZONES (exchangeable debt)   Comcast Holdings ZONES
EMP, EAI, ENJ, ENO    mortgage bonds              Entergy LLC First Mortgage Bonds
ELC                   mortgage bonds              Entergy Louisiana Collateral Trust Mortgage Bonds
LILAP                 preferred                   Liberty Latin America Fixed Rate Cumulative Perpetual Preference Shares
GJS                   structured debt             Goldman Sachs STRATS Trust
GBAB                  bond CEF                    Guggenheim Taxable Municipal Bond & IG Debt Trust

The clean filter was deliberately narrow: it targets only rate-bearing debt and preferred wording (%, "bonds", "debentures", "STRATS", "ZONES", "preferred/preference/perpetual"). ADRs (e.g. GGAL, PONY, MNSO), REITs (COLD, MPT, BXMT) and LPs/partnerships (Green Brick, Artisan Partners) were kept on purpose because they trade like equities and belong in a momentum universe. Full audit: us_universe_dropped.csv.
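A narrow name filter of this kind might look like the sketch below. The term list comes from the paragraph above; the regular expressions themselves, the case-sensitive handling of the acronyms, and the two "Example" names are assumptions made for illustration.

```python
import re

# Rate-bearing debt and preferred wording, case-insensitive.
DEBT_WORDS = re.compile(
    r"%|\b(bonds|debentures|preferred|preference|perpetual)\b",
    re.IGNORECASE,
)
# Product acronyms matched case-sensitively (an assumption), so that an
# ordinary lowercase word such as "zones" is not caught.
DEBT_ACRONYMS = re.compile(r"\b(STRATS|ZONES)\b")


def is_debt_or_preferred(name):
    """True when the security name carries debt or preferred wording."""
    return bool(DEBT_WORDS.search(name) or DEBT_ACRONYMS.search(name))


names = {
    "CCZ": "Comcast Holdings ZONES",
    "EMP": "Entergy LLC First Mortgage Bonds",
    "LILAP": "Liberty Latin America Fixed Rate Cumulative Perpetual Preference Shares",
    "GJS": "Goldman Sachs STRATS Trust",
    "XADR": "Example Holdings American Depositary Shares",  # hypothetical ADR
    "XREIT": "Example Realty Trust Common Stock",           # hypothetical REIT
}

dropped = sorted(t for t, n in names.items() if is_debt_or_preferred(n))
kept = sorted(t for t in names if t not in dropped)
print("dropped:", dropped)
print("kept:", kept)
```

Because the pattern keys on instrument wording rather than on structure words like "Trust" or "Depositary", ADRs and REITs pass through untouched, which is the behaviour the paragraph above describes.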

Cap-band distribution of the 2,164 new names

Band          Names
> $5B            36
$2–5B           621
$1–2B           526
$500M–1B        493
$250–500M       488

So the bulk is genuinely small/mid-cap (~1,500 names under $2B), with a tail of larger names that simply weren't in the DB yet (IMO, CBOE, WULF, RIOT…).
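The banding and the "~1,500 names under $2B" figure can be checked with a short sketch. The band edges follow the table; how a cap sitting exactly on an edge is assigned is an assumption, as are the sample values passed to cap_band.

```python
# (label, lower bound in USD), checked from the largest band down.
BANDS = [
    ("> $5B", 5e9),
    ("$2–5B", 2e9),
    ("$1–2B", 1e9),
    ("$500M–1B", 500e6),
    ("$250–500M", 250e6),
]


def cap_band(market_cap):
    """Return the band label, or None at or below the $250M floor."""
    for label, lower in BANDS:
        if market_cap > lower:  # edge values fall into the lower band (assumption)
            return label
    return None


# Counts as reported in the table above.
REPORTED = {
    "> $5B": 36,
    "$2–5B": 621,
    "$1–2B": 526,
    "$500M–1B": 493,
    "$250–500M": 488,
}

total = sum(REPORTED.values())
under_2b = REPORTED["$1–2B"] + REPORTED["$500M–1B"] + REPORTED["$250–500M"]
print(cap_band(7.2e9), cap_band(600e6), cap_band(100e6))
print("total:", total, "under $2B:", under_2b)
```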

Disk projection (the gate before downloading)

Backfilling 2,164 names at ~3,000 daily rows each:

  - Lean schema (OHLCV only, ~100 B/row): ~0.61 GB
  - Full schema (~504 B/row): ~3.06 GB

Free disk at build time: ~26 GB. Either fits, but lean is the obvious choice given big_bots.db is already 16 GB on a 47 GB disk.
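The projection is plain arithmetic: names × rows per name × bytes per row. The sketch below reproduces it with the inputs quoted above. The function name is hypothetical, and the result is printed in both decimal GB (10^9 bytes) and binary GiB (2^30 bytes) because the report does not say which convention its figures use; the output lands in the same range as the quoted sizes, with small differences depending on that choice.

```python
def projected_bytes(names, rows_per_name, bytes_per_row):
    """Storage estimate: one fixed-size row per name per trading day."""
    return names * rows_per_name * bytes_per_row


NAMES = 2_164
ROWS_PER_NAME = 3_000
FREE_BYTES = 26e9  # ~26 GB free at build time

for label, bytes_per_row in (("lean", 100), ("full", 504)):
    size = projected_bytes(NAMES, ROWS_PER_NAME, bytes_per_row)
    fits = size < FREE_BYTES
    print(
        f"{label}: {size / 1e9:.2f} GB (10^9) "
        f"/ {size / 2**30:.2f} GiB (2^30), fits={fits}"
    )
```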

Honest conclusion / caveats

Files produced this session

Backfill — measured size & decision (added later 2026-06-19)

A 30-ticker stratified sample was run through the existing backfill.py logic to measure real size before committing. The earlier "~3.06 GB" projection assumed 3,000 daily rows/ticker; reality is ~7,300 rows/ticker because backfill.py pulls period="max" (some names go back to the 1960s–70s) across daily + weekly + monthly. Measured options (504 B/row, extrapolated to 2,164 names):

Option                                        History     Projected rows   ~Size
A: max history, all timeframes (as-is)        1962→now    15.8 M           ~8.0 GB
B: max history, daily only                    1962→now    12.6 M           ~6.4 GB
C: full schema, since 2014 (~12 yr)           12 yr       6.5 M            ~3.3 GB
D: full schema, since 2024-06 (~2 yr)         2 yr        1.3 M            ~0.6 GB

Decision (Jacques): Option A, full max history (~8 GB), knowingly overriding the 5 GB ceiling. Disk: 26 GB free → ~18 GB free after. Reminder for any backtest on this data: the pre-~2024 history is survivorship-biased because it contains only currently listed names, so keep tuning and validation on the recent window per the standing decision.
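The option sizes and the disk headroom can be cross-checked from the projected row counts. This sketch assumes decimal GB (10^9 bytes) and copies the row counts, the 504 B/row figure, the 5 GB ceiling and the 26 GB free-disk figure from the text above; the variable names are illustrative.

```python
BYTES_PER_ROW = 504   # full schema
NAMES = 2_164
CEILING_GB = 5.0      # ceiling referred to in the decision
FREE_GB = 26.0        # free disk before the backfill

# Projected rows per option, in millions, as listed in the options table.
PROJECTED_ROWS_M = {"A": 15.8, "B": 12.6, "C": 6.5, "D": 1.3}


def size_gb(rows_millions):
    """Convert a row count in millions to decimal gigabytes."""
    return rows_millions * 1e6 * BYTES_PER_ROW / 1e9


sizes = {opt: size_gb(rows) for opt, rows in PROJECTED_ROWS_M.items()}
over_ceiling = [opt for opt, gb in sizes.items() if gb > CEILING_GB]
free_after = {opt: FREE_GB - gb for opt, gb in sizes.items()}
rows_per_ticker = PROJECTED_ROWS_M["A"] * 1e6 / NAMES

for opt, gb in sizes.items():
    print(f"Option {opt}: {gb:.2f} GB, free after: {free_after[opt]:.1f} GB")
print("over ceiling:", over_ceiling)
print(f"rows per ticker (option A): {rows_per_ticker:.0f}")
```

Options A and B both exceed the ceiling under these inputs, which is why choosing A required an explicit override.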

What was run

Next step (after backfill completes)

Build the cross-sectional momentum prototype on this universe: weekly return ranking, long strongest / short weakest, BTC/regime gate, applying the mandatory validation rules (costs, 8-position cap, pre-2026 tuning vs 2026 OOS).
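As a starting point for that prototype, the ranking step might be sketched as below. Everything specific here is an assumption rather than the agreed design: the one-week lookback, the even split of the 8-position cap between longs and shorts, the boolean regime flag computed elsewhere, and the made-up tickers and prices. Costs and the tuning/OOS split are not modelled.

```python
def weekly_return(closes):
    """Simple return between the last two weekly closes."""
    return closes[-1] / closes[-2] - 1.0


def pick_positions(weekly_closes, regime_ok, max_positions=8):
    """Rank names by weekly return; long the top, short the bottom.

    weekly_closes: dict of ticker -> list of weekly closing prices.
    regime_ok: result of the market-regime gate; no trades when False.
    max_positions: total cap, split evenly between sides (assumption).
    """
    if not regime_ok:
        return [], []
    ranked = sorted(
        weekly_closes,
        key=lambda t: weekly_return(weekly_closes[t]),
        reverse=True,
    )
    # Never let the long and short books overlap on a small universe.
    per_side = min(max_positions // 2, len(ranked) // 2)
    longs = ranked[:per_side]
    shorts = ranked[-per_side:] if per_side else []
    return longs, shorts


# Hypothetical tickers and prices, for illustration only.
closes = {
    "AAA": [10.0, 11.0],
    "BBB": [20.0, 19.0],
    "CCC": [5.0, 5.1],
    "DDD": [8.0, 7.0],
}
print(pick_positions(closes, regime_ok=True, max_positions=2))
print(pick_positions(closes, regime_ok=False))
```

Returning empty books when the gate is closed keeps the regime logic separate from the ranking, so either piece can be tuned on the pre-2026 window without touching the other.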