Building a Survivorship-Free Universe in Python
Alphanume Team · June 9, 2026
Keeping delisted names in your backtest.
The fastest way to inflate a backtest is also the most invisible: build your universe from today's ticker list and pretend you held it since 2010. Every company that went bankrupt, got acquired, or simply faded away is silently excluded, and your strategy never suffers those losses. This is survivorship bias in backtesting, and it can add 2–5 percentage points of phantom annual alpha to an otherwise mediocre strategy. This tutorial shows how to build a survivorship-free backtest workflow in Python: a point-in-time universe that keeps delisted names in scope, applies size and liquidity filters as of each rebalance date, and assigns terminal returns instead of silently dropping the losers.
Why today's constituents silently inflate backtests
Imagine you screen the Russell 1000 today for stocks trading above $5 with a market cap over $2 billion, then run a backtest back to 2010. You never see Sears or any of the dozens of mid-caps that imploded over that window, because they aren't in today's index. The screen looks like it found a terrific universe. It didn't. It found the survivors.
The bias compounds in two ways. First, the universe itself is cleaner than it ever was in real time — the companies that failed, merged, or drifted below your size threshold are gone. Second, any momentum or quality signal you layer on top is also evaluated only on survivors, which are structurally biased toward positive outcomes over long windows. The fix is not complicated — it just requires point-in-time market data and the discipline to never consult a field that wasn't knowable on the rebalance date.
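To make the first effect concrete, here is a toy illustration. The tickers and returns are invented for the example and are not market data: five hypothetical names, two of which delist with heavy losses. Averaging over the survivors alone overstates the return of the universe that was actually investable.

```python
import pandas as pd

# Synthetic, hypothetical holding-period returns; not real market data.
returns = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "total_return": [0.40, 0.10, -0.90, 0.25, -1.00],
    "still_listed": [True, True, False, True, False],
})

# Mean over everything that was investable at the start of the window.
full_universe_mean = returns["total_return"].mean()

# Mean over the names that would appear in today's ticker list.
survivors_only_mean = returns.loc[returns["still_listed"], "total_return"].mean()

# The gap is return that no real portfolio could have earned.
phantom_alpha = survivors_only_mean - full_universe_mean
print(f"full: {full_universe_mean:.2f}  survivors: {survivors_only_mean:.2f}")
```

The size of the gap here is an artifact of the made-up numbers. The point is only the direction: removing the failures can only raise the average.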
The data requirement: dead tickers must stay in scope
A survivorship-free universe starts with a data source that retains history for names that no longer trade. Many providers quietly purge delisted tickers or stop updating their historical records the moment a company exits. You need a source that stores size and price as of each historical date — including for names that subsequently delisted — so that a 2015 rebalance sees the 2015 universe, not the 2025 one.
The Historical Market Cap dataset is point-in-time and retains dead tickers, which makes it suitable for the pattern described here. Every row records what was knowable on that date — shares outstanding as reported, not restated — so a size filter built on it reflects the actual investable universe at that moment.
The setup below reuses the same get_data() wrapper from the historical market cap tutorial. One function covers every endpoint, and failing loudly on a bad status code prevents a silent 401 from producing a plausible-looking DataFrame of garbage.
import os

import requests
import pandas as pd

BASE_URL = "https://api.alphanume.com/v1"
API_KEY = os.environ["ALPHANUME_API_KEY"]


def get_data(endpoint, **params):
    params["api_key"] = API_KEY
    resp = requests.get(f"{BASE_URL}/{endpoint}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]
Building a point-in-time universe for one rebalance date
For each rebalance date you need to ask: which names were eligible as of this date, using only information available then? That means querying market cap as of the rebalance date — not today — and applying your size threshold to that historical figure. Names that later delisted may well pass this filter in earlier periods; they should be included.
def universe_on_date(candidate_tickers, rebalance_date, min_market_cap=2e9):
    """
    Return eligible tickers as of rebalance_date.
    candidate_tickers must include dead tickers — not just today's list.
    """
    records = []
    for ticker in candidate_tickers:
        rows = get_data(
            "historical-market-cap",
            ticker=ticker,
            date=rebalance_date,
        )
        if rows:
            records.append({
                "ticker": ticker,
                "market_cap": float(rows[0]["market_cap"]),
                "date": rows[0]["date"],
            })

    df = pd.DataFrame(records)
    if df.empty:
        return df

    eligible = df[df["market_cap"] >= min_market_cap].copy()
    eligible = eligible.sort_values("market_cap", ascending=False)
    return eligible.reset_index(drop=True)
The key discipline here is candidate_tickers: this list must include names that were trading on the rebalance date even if they no longer trade today. Sourcing it from today's index is precisely the mistake we are trying to avoid. Keep a separate master list of all tickers — live and dead — that were in scope at any point during your backtest window.
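One way to assemble that master list, assuming you have dated snapshots of index or exchange membership, is to take the union across every snapshot in the backtest window. The `snapshots` dict and the `build_master_list` helper below are hypothetical stand-ins, and the tickers are placeholders.

```python
# Hypothetical membership snapshots keyed by ISO date; tickers are placeholders.
snapshots = {
    "2015-03-31": ["AAA", "BBB", "CCC"],
    "2018-03-31": ["AAA", "CCC", "DDD"],
    "2024-03-31": ["AAA", "DDD", "EEE"],
}


def build_master_list(snapshots):
    """Return every ticker that appeared in any snapshot, sorted."""
    seen = set()
    for members in snapshots.values():
        seen.update(members)
    return sorted(seen)


all_tickers = build_master_list(snapshots)

# ISO date strings sort chronologically, so max() picks the latest snapshot.
latest_members = set(snapshots[max(snapshots)])

# Names a today-only list would have dropped from the backtest.
dropped_by_today_list = [t for t in all_tickers if t not in latest_members]
print(all_tickers, dropped_by_today_list)
```

The second list is a quick sanity check: if it comes back empty over a multi-year window, the candidate list was probably built from current membership.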
The as-of join with pd.merge_asof
When you have a panel of market-cap observations at irregular dates and a schedule of rebalance dates, you need to match each rebalance date to the most recent observation available on or before that date — never after. pd.merge_asof with direction="backward" does exactly this. It looks back in time to find the last known value, which is the no-look-ahead rule in a single argument.
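# cap_panel: DataFrame with columns ["ticker", "obs_date", "market_cap"]
# rebalance_dates: DataFrame with columns ["ticker", "rebalance_date"]

# The date keys must share a datetime dtype; string dates are rejected.
cap_panel["obs_date"] = pd.to_datetime(cap_panel["obs_date"])
rebalance_dates["rebalance_date"] = pd.to_datetime(rebalance_dates["rebalance_date"])

# merge_asof needs each frame sorted by its date key across the whole frame.
# Sorting by ticker first leaves the dates out of order and raises an error.
cap_panel = cap_panel.sort_values("obs_date")
rebalance_dates = rebalance_dates.sort_values("rebalance_date")

# As-of join: for each (ticker, rebalance_date) find the latest obs_date
# that is <= rebalance_date. Never pulls a future observation.
merged = pd.merge_asof(
    rebalance_dates,
    cap_panel,
    left_on="rebalance_date",
    right_on="obs_date",
    by="ticker",
    direction="backward",
)

# Drop rows where no historical observation precedes the rebalance date
merged = merged.dropna(subset=["market_cap"])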
Both DataFrames must be sorted by the date key itself (not by ticker and then date) before calling merge_asof; pandas will raise if they aren't. The by="ticker" argument ensures the join is scoped within each name, so AAPL's last known cap never bleeds into MSFT's row. direction="backward" is the pandas default, but spelling it out documents the intent and guards against an accidental change. Using direction="forward" would pull the next future observation, and direction="nearest" can do the same; both are look-ahead violations.
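A small self-contained comparison, using made-up tickers and market caps, shows what the direction argument changes. With a rebalance on 2015-03-31, the backward join sees only observations already published, while the forward join reaches into April and May.

```python
import pandas as pd

# Hypothetical observations; tickers and values are placeholders.
cap_panel = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "obs_date": pd.to_datetime(
        ["2015-01-15", "2015-04-15", "2015-02-10", "2015-05-10"]
    ),
    "market_cap": [1.0e9, 3.0e9, 5.0e9, 6.0e9],
})
rebalance_dates = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "rebalance_date": pd.to_datetime(["2015-03-31", "2015-03-31"]),
})

# merge_asof requires both frames to be sorted by the date key.
cap_panel = cap_panel.sort_values("obs_date")
rebalance_dates = rebalance_dates.sort_values("rebalance_date")


def asof(direction):
    """Run the same as-of join with the given direction, indexed by ticker."""
    return pd.merge_asof(
        rebalance_dates,
        cap_panel,
        left_on="rebalance_date",
        right_on="obs_date",
        by="ticker",
        direction=direction,
    ).set_index("ticker")


backward = asof("backward")  # last observation on or before the rebalance
forward = asof("forward")    # first observation on or after: look-ahead

print(backward[["obs_date", "market_cap"]])
print(forward[["obs_date", "market_cap"]])
```

In this toy data AAA's last known cap on the rebalance date is $1 billion, so a $2 billion size filter excludes it. The forward join reports $3 billion and would let it in using information that did not exist yet.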
Handling terminal returns for delisted names
Dropping a ticker when it delists is the most common residual form of survivorship bias. A company that files for bankruptcy in month six of a twelve-month holding period still generated a return — a very bad one — and excluding it biases your average upward. The correct treatment is to assign the terminal return explicitly and then stop tracking the name.
In practice this means keeping a separate table of delisting events with the date and the final return (often sourced from the delisting distribution or the last observable price relative to entry). At each rebalance, check whether any held name has a delisting event within the holding window; if so, use the terminal return and zero out the position for subsequent periods. Do not let the position simply vanish from the returns panel, which is what happens when you inner-join prices and the ticker disappears.
Forward-fill prices until the delisting date, then assign the terminal return on that date, and mark the position as closed. Never back-fill — filling a gap backward means you knew a future price on an earlier date, which violates the no-look-ahead rule just as surely as using today's index membership.
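The treatment above can be sketched on a small monthly returns panel. Everything here is hypothetical: the tickers, the numbers, the `delistings` table layout, and the `apply_terminal_returns` helper. The sketch also assumes each delisting date matches a row in the returns index and that proceeds sit in cash (a 0% return) after the position closes.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31", "2015-04-30"])

# Hypothetical monthly returns; BBB stops reporting after February.
returns = pd.DataFrame(
    {"AAA": [0.02, 0.01, -0.03, 0.04], "BBB": [-0.10, -0.20, np.nan, np.nan]},
    index=dates,
)

# Hypothetical delisting events table: one row per exit.
delistings = pd.DataFrame({
    "ticker": ["BBB"],
    "delist_date": pd.to_datetime(["2015-03-31"]),
    "terminal_return": [-0.85],
})


def apply_terminal_returns(returns, delistings):
    """Write each terminal return on its delisting date, then close the position."""
    out = returns.copy()
    for row in delistings.itertuples(index=False):
        if row.ticker not in out.columns:
            continue
        # Assumes delist_date is already a row in the index.
        out.loc[row.delist_date, row.ticker] = row.terminal_return
        # Position is closed afterwards; assume proceeds are held in cash.
        out.loc[out.index > row.delist_date, row.ticker] = 0.0
    return out


adjusted = apply_terminal_returns(returns, delistings)

# Naive treatment: drop the missing months and compound what is left.
naive_bbb = (1.0 + returns["BBB"].dropna()).prod() - 1.0
# Explicit treatment: the terminal loss is part of the holding-period return.
adjusted_bbb = (1.0 + adjusted["BBB"]).prod() - 1.0
print(f"naive: {naive_bbb:.3f}  with terminal return: {adjusted_bbb:.3f}")
```

The naive figure stops at the last observed price, so it misses most of the loss. Holding proceeds in cash is only one convention; reinvesting them across the remaining holdings is another, and the choice should be stated alongside the results.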
Rebalance loop skeleton
The following skeleton ties the pieces together: iterate over rebalance dates, build the point-in-time universe for each, apply size and any additional filters using only as-of data, compute weights, and record the selected names. The loop is deliberately minimal — slot your own signal and weighting logic inside.
import numpy as np

REBALANCE_DATES = pd.date_range("2015-01-31", "2024-12-31", freq="QE")
ALL_TICKERS = [...]  # full historical universe, including dead tickers
MIN_CAP = 2e9  # $2 billion minimum market cap

portfolio_history = []

for rd in REBALANCE_DATES:
    iso_date = rd.strftime("%Y-%m-%d")

    # Step 1: build point-in-time eligible universe
    eligible = universe_on_date(ALL_TICKERS, iso_date, min_market_cap=MIN_CAP)
    if eligible.empty:
        continue

    # Step 2: apply additional filters on point-in-time fields
    # (e.g. liquidity, sector) — always use as-of data, never today's values
    tickers_selected = eligible["ticker"].tolist()

    # Step 3: assign equal weights (replace with your own signal)
    n = len(tickers_selected)
    weights = {t: 1.0 / n for t in tickers_selected}

    portfolio_history.append({
        "rebalance_date": iso_date,
        "holdings": weights,
    })

# portfolio_history now contains point-in-time snapshots.
# Next step: fetch returns for each holding window, including terminal
# returns for any names that delisted before the next rebalance.
The comment on the ALL_TICKERS line is the most important one in the skeleton: ALL_TICKERS must be the full historical universe. If it comes from a live index, the survivorship problem has already been reintroduced before the first iteration runs.
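For the step that follows the loop, a minimal sketch of turning one snapshot's weights into a holding-window return might look like the following. The `window_return` helper and the sample numbers are hypothetical. The design choice worth noting is that a holding with no return raises an error instead of being skipped, so a delisted name cannot quietly fall out of the average.

```python
def window_return(weights, period_returns):
    """Weighted return for one holding window.

    weights: {ticker: weight} from a portfolio_history snapshot.
    period_returns: {ticker: return over the window}, where delisted
    names carry their terminal return.
    """
    missing = sorted(t for t in weights if t not in period_returns)
    if missing:
        # Fail loudly: a missing return usually means a dropped delisting.
        raise KeyError(f"no return for held tickers: {missing}")
    return sum(w * period_returns[t] for t, w in weights.items())


# Hypothetical snapshot: two equally weighted names, one of which went to zero.
weights = {"AAA": 0.5, "BBB": 0.5}
period_returns = {"AAA": 0.10, "BBB": -1.00}
result = window_return(weights, period_returns)
print(f"{result:.2f}")
```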
Checklist before you trust the output
Point-in-time backtesting requires sustained discipline across every data join in the pipeline. Before interpreting results, verify each of the following.
Universe construction: every rebalance date queries market cap as of that date; the candidate list includes tickers that subsequently delisted; no field used in eligibility filtering is sourced from a date after the rebalance.
As-of joins: all merge_asof calls use direction="backward"; both input frames are sorted by the join key before the call; the join is scoped by ticker so one name's history cannot contaminate another's.
Return accounting: delisted names have explicit terminal returns in the panel; positions do not silently disappear from the returns DataFrame; gaps are forward-filled, never back-filled.
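Parts of this checklist can be turned into assertions that run before any results are read. The helpers below are a hypothetical sketch: the function names and the column names (`obs_date`, `rebalance_date`) are assumptions that mirror the examples in this tutorial, and they cover only the checks that can be tested mechanically.

```python
import pandas as pd


def check_no_lookahead(merged, obs_col="obs_date", rebalance_col="rebalance_date"):
    """Raise if any row uses an observation dated after its rebalance date."""
    late = merged[merged[obs_col] > merged[rebalance_col]]
    if not late.empty:
        raise ValueError(f"{len(late)} rows look ahead of the rebalance date")


def check_dead_tickers_present(candidates, known_delisted):
    """Raise if tickers known to have delisted are absent from the candidates."""
    missing = sorted(set(known_delisted) - set(candidates))
    if missing:
        raise ValueError(f"candidate list is missing delisted tickers: {missing}")


def check_returns_complete(returns):
    """Raise if the returns panel still has gaps where a position vanished."""
    gaps = int(returns.isna().sum().sum())
    if gaps:
        raise ValueError(f"{gaps} missing returns; assign terminal returns first")


# Hypothetical inputs that pass all three checks.
merged = pd.DataFrame({
    "obs_date": pd.to_datetime(["2015-01-15", "2015-02-10"]),
    "rebalance_date": pd.to_datetime(["2015-03-31", "2015-03-31"]),
})
check_no_lookahead(merged)
check_dead_tickers_present(["AAA", "BBB", "CCC"], known_delisted=["CCC"])
check_returns_complete(pd.DataFrame({"AAA": [0.02, 0.01], "CCC": [-0.20, -0.85]}))
```

Passing these checks does not prove the pipeline is free of look-ahead; they only catch the failure modes that leave a visible trace in the data.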
A backtest that passes all three checks will typically show lower returns than a naive survivors-only run on the same signal — and that lower number is the one you should believe.