
Parsing 424B5 Offering Filings in Python

Alphanume Team · June 8, 2026

Extracting deal terms from prospectus supplements.

Learning to parse SEC filings in Python opens up a rich vein of primary-source deal data, but the 424B5 is one of the messiest documents the SEC hosts. A 424B5 is a priced takedown prospectus supplement: the filing a public company submits when it actually prices a follow-on offering, telling the world exactly how many shares sold and at what price. Unlike the XBRL-tagged financial statements in a 10-K or 10-Q, a 424B5 is freeform narrative HTML (or, for older filings, plain text). Every underwriter shop formats them differently. Some bury the offering price in a cover-page table; others spell it out mid-sentence in the third paragraph. Some label gross proceeds explicitly; others leave you to multiply shares by price yourself. This tutorial walks through a practical pipeline: fetch the raw document, strip its HTML scaffolding, and extract the numbers you care about with anchored regular expressions and a few contextual heuristics. We will also be honest about where this approach breaks down. For a deeper background on the form itself, see what a 424B5 filing is before diving into code.

Fetching the filing from EDGAR

The SEC's EDGAR archive exposes every filing via a predictable URL structure. Before you can parse anything you need the raw document. The SEC asks for a descriptive User-Agent header (a name and a contact email) on every request; requests without one are likely to be blocked, and the fair-access policy caps traffic at 10 requests per second. A lightweight fetch_filing helper is all you need for this step.

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Alphanume Research contact@alphanume.com",
    "Accept-Encoding": "gzip, deflate",
}

def fetch_filing(url: str) -> str:
    """Return plain text extracted from a 424B5 filing URL."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop script/style noise before extracting text
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n")

Passing separator="\n" to get_text keeps table cells and paragraph breaks on separate lines rather than collapsing everything into one long run-on string, a small detail that makes the regular expressions in the next step significantly more reliable. If you fetch the complete-submission .txt file instead (the SGML envelope, which is the only form in which many older filings are available), BeautifulSoup will still strip the tags, but the text it returns includes the SGML header and every document in the submission, exhibits included. Isolate the 424B5 document first, or the patterns may match text from an exhibit. For the mechanics of navigating EDGAR's index JSON to find the right document URL in the first place, see the guide on querying SEC EDGAR with Python.
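
When the complete-submission .txt file is the only copy available, one option is to cut out the 424B5 block before handing it to BeautifulSoup. The sketch below assumes the usual envelope layout, in which each document sits between <DOCUMENT> and </DOCUMENT> with a <TYPE> line and a <TEXT> body. The helper name extract_document is hypothetical.

```python
import re
from typing import Optional

# Assumed envelope layout: <DOCUMENT> ... <TYPE>424B5 ... <TEXT> body </TEXT> ... </DOCUMENT>
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


def extract_document(submission: str, form_type: str = "424B5") -> Optional[str]:
    """Return the body of the first document whose <TYPE> equals form_type."""
    for block in DOCUMENT_RE.findall(submission):
        type_match = TYPE_RE.search(block)
        if type_match is None or type_match.group(1).upper() != form_type.upper():
            continue
        text_match = TEXT_RE.search(block)
        # Fall back to the whole block if the <TEXT> wrapper is missing.
        return text_match.group(1) if text_match else block
    return None
```

The returned string is ordinary HTML (or plain text for older filings), so it can go through the same tag-stripping step as a directly fetched .htm document.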

Understanding what you are looking for

A 424B5 must disclose a handful of deal-critical figures: the number of shares offered, the public offering price per share, gross proceeds (shares times price), and net proceeds to the company after underwriting discounts. The lead underwriter or joint bookrunners are listed in a section titled something like "Underwriting" or "Plan of Distribution." In practice the exact phrasing varies — "Number of Shares" or "Shares Offered" or simply a bare integer followed by "shares of common stock" — which is why a single rigid regex will not hold across issuers. The strategy here is to write several candidate patterns for each field, try them in priority order, and return the first match. Failing loudly on a None is better than silently returning a wrong number.
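
A minimal sketch of that try-in-order strategy, with an optional loud failure. The names first_match and SHARES_LABELS are illustrative, and the three label patterns are assumptions chosen to mirror the phrasings mentioned above; real filings use many more variants.

```python
import re
from typing import Optional, Sequence


def first_match(patterns: Sequence[str], text: str, required: bool = False) -> Optional[str]:
    """Return group 1 of the first pattern that matches, trying patterns in order."""
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1)
    if required:
        # Fail loudly instead of letting a missing field pass silently.
        raise LookupError(f"none of {len(patterns)} patterns matched")
    return None


# Illustrative label variants, most specific first.
SHARES_LABELS = [
    r"number\s+of\s+shares\D{0,40}?(\d[\d,]*)",
    r"shares\s+offered\D{0,40}?(\d[\d,]*)",
    r"(\d[\d,]*)\s+shares\s+of\s+common\s+stock",
]

print(first_match(SHARES_LABELS, "Shares Offered: 4,000,000"))  # 4,000,000
```

Ordering the list from most to least specific means a loose fallback only runs when the labeled forms are absent.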

Normalizing raw number strings

Before building the extraction function it helps to isolate the normalization logic. Prospectus numbers come in forms like $14.50, 25,000,000, $362.5 million, and $1.2 billion. A small helper converts all of them to plain Python floats.

import re

def parse_number(raw: str) -> float | None:
    """
    Convert a raw number string from a prospectus to a float.
    Handles commas, dollar signs, and 'million'/'billion' suffixes.
    Returns None if the string cannot be parsed.
    """
    if raw is None:
        return None
    text = raw.strip().lower().replace(",", "").replace("$", "").replace(" ", "")
    multiplier = 1.0
    if text.endswith("billion"):
        multiplier = 1_000_000_000.0
        text = text[:-7]
    elif text.endswith("million"):
        multiplier = 1_000_000.0
        text = text[:-7]
    try:
        return float(text) * multiplier
    except ValueError:
        return None

The function is intentionally conservative — it returns None rather than guessing when the input does not parse. Every downstream field that calls it should check for None and decide whether to raise, skip, or fall back to a secondary pattern.

The core extract_terms function

This is the heart of the pipeline. Each field uses two or three candidate patterns tried in order. The shares-offered pattern anchors on the phrase "shares of common stock" and looks left for the integer. The price pattern anchors on "per share" and looks left for the dollar amount. Proceeds patterns look for "gross proceeds" or "net proceeds" followed by a dollar figure. The underwriter search finds the first occurrence of the word "underwriting" (ideally the section header) and grabs the first capitalized multi-word name in the 2,000 characters that follow.

def extract_terms(text: str) -> dict:
    """
    Extract key deal terms from the plain-text body of a 424B5 filing.
    Returns a dict with keys: shares_offered, price_per_share,
    gross_proceeds, net_proceeds, lead_underwriter.
    Values are floats (or str for underwriter); None where not found.
    """
    result = {
        "shares_offered": None,
        "price_per_share": None,
        "gross_proceeds": None,
        "net_proceeds": None,
        "lead_underwriter": None,
    }

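    # --- shares offered ---
    # Each capture must start with a digit so that a stray comma
    # (", shares of common stock") cannot match and yield None.
    shares_patterns = [
        r"(\d[\d,]*)\s+shares\s+of\s+(?:our\s+)?common\s+stock",
        r"offering\s+of\s+(\d[\d,]*)\s+shares",
        r"sale\s+of\s+(\d[\d,]*)\s+shares",
    ]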
    for pat in shares_patterns:
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            result["shares_offered"] = parse_number(m.group(1))
            break

    # --- price per share ---
    price_patterns = [
        r"\$([\d,]+(?:\.\d+)?)\s+per\s+share",
        r"price\s+of\s+\$([\d,]+(?:\.\d+)?)",
        r"public\s+offering\s+price[^\d$]*\$([\d,]+(?:\.\d+)?)",
    ]
    for pat in price_patterns:
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            result["price_per_share"] = parse_number(m.group(1))
            break

    # --- gross proceeds ---
    gross_patterns = [
        r"gross\s+proceeds[^\d$]*\$([\d,]+(?:\.\d+)?\s*(?:million|billion)?)",
        r"aggregate\s+gross\s+proceeds[^\d$]*\$([\d,]+(?:\.\d+)?\s*(?:million|billion)?)",
    ]
    for pat in gross_patterns:
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            result["gross_proceeds"] = parse_number(m.group(1))
            break

    # --- net proceeds ---
    net_patterns = [
        r"net\s+proceeds[^\d$]*(?:to\s+us[^\d$]*)?\$([\d,]+(?:\.\d+)?\s*(?:million|billion)?)",
        r"net\s+proceeds\s+to\s+the\s+company[^\d$]*\$([\d,]+(?:\.\d+)?\s*(?:million|billion)?)",
    ]
    for pat in net_patterns:
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            result["net_proceeds"] = parse_number(m.group(1))
            break

    # --- lead underwriter ---
    # Find the Underwriting section, then grab first named firm
    uw_section = re.search(
        r"underwriting\b.{0,2000}",
        text,
        re.IGNORECASE | re.DOTALL,
    )
    if uw_section:
        # Look for a capitalized multi-word name (e.g. "Goldman Sachs")
        firm_m = re.search(
            r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z&]+){1,4})\b",
            uw_section.group(0)[len("underwriting"):],
        )
        if firm_m:
            result["lead_underwriter"] = firm_m.group(1)

    return result
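
One caveat with a generic "$X per share" pattern: cover pages commonly describe the security as "common stock, par value $0.001 per share", and that phrase can appear before the offering price. A possible guard, sketched here with a hypothetical helper named find_offering_price, walks every match and skips any that the preceding words identify as par value. The 30-character lookback is an arbitrary assumption.

```python
import re
from typing import Optional

PER_SHARE_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)\s+per\s+share", re.IGNORECASE)
# Matches only when "par value" (optionally "par value of") ends the lookback window.
PAR_VALUE_RE = re.compile(r"par\s+value(?:\s+of)?\s*$", re.IGNORECASE)


def find_offering_price(text: str, lookback: int = 30) -> Optional[float]:
    """Return the first '$X per share' amount that is not introduced by 'par value'."""
    for match in PER_SHARE_RE.finditer(text):
        # Inspect a short window of text immediately before the dollar sign.
        prefix = text[max(0, match.start() - lookback):match.start()]
        if PAR_VALUE_RE.search(prefix):
            continue
        return float(match.group(1).replace(",", ""))
    return None
```

This is still a heuristic: it will accept other per-share amounts, such as an underwriting discount, if they come first.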

Running extract_terms(fetch_filing(url)) on a live filing URL produces a dict you can append to a list and convert to a pandas DataFrame with pd.DataFrame(rows) — no further transformation needed before storing or analyzing.
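
Because gross proceeds should equal shares times price, the extracted dict can be checked against itself, and a missing gross figure can be derived. The sketch below is one way to do that. The function name fill_gross_proceeds, the extra keys gross_proceeds_derived and gross_mismatch, and the 1% tolerance are all assumptions made for illustration.

```python
def fill_gross_proceeds(terms: dict, tolerance: float = 0.01) -> dict:
    """Return a copy of terms with gross proceeds derived or cross-checked."""
    out = dict(terms)
    shares = out.get("shares_offered")
    price = out.get("price_per_share")
    if shares is None or price is None:
        return out  # nothing to derive from
    implied = shares * price
    stated = out.get("gross_proceeds")
    if stated is None:
        # The filing left the multiplication to the reader.
        out["gross_proceeds"] = implied
        out["gross_proceeds_derived"] = True
    else:
        # Relative check, so rounding such as "$362.5 million" stays within tolerance.
        out["gross_mismatch"] = abs(stated - implied) > tolerance * implied
    return out
```

A mismatch flag is a prompt to open the source document, not proof that either number is wrong.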

Caveats and production reality

Regex against freeform HTML has a ceiling. A few failure modes come up constantly in practice. First, some issuers present shares and price in an HTML table rather than prose. BeautifulSoup flattens the table into newline-separated tokens, which can leave a dollar sign in one cell and its number in the next, so a pattern that expects "$14.50" never matches. Adding a secondary pass over the raw HTML using soup.find_all("table") and inspecting cell text is worth the extra code for any issuer you plan to track repeatedly. Second, the underwriter heuristic is crude. The search starts at the first occurrence of the word "underwriting" anywhere in the text, which is often "underwriting discounts" on the cover page rather than the section header, and the name pattern accepts any run of two to five capitalized words. It will happily return "Goldman Sachs" when the section opens with "Goldman Sachs & Co. LLC acts as", but it will just as happily return "Common Stock" or "Securities Act" if one of those phrases comes first. You need spot-checks. Third, a later filing for the same deal sometimes restates proceeds after an over-allotment option is exercised; if you are building a proceeds time series, you need to decide whether to keep the original or the updated figure.

For any use case where correctness matters more than build time (a live monitor, a systematic dilution screener, anything going into a model), the Stock Dilution dataset is the more defensible choice. It is parsed, normalized, and maintained across the long tail of filers whose layouts would break a one-off scraper within weeks.
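
A rough sketch of a table pass, assuming the raw HTML is still at hand. It collects each table row as a list of cell strings, so a label such as "Public offering price" can be matched against the first cell and the numbers read from the cells after it. The helper names table_rows and find_row are hypothetical.

```python
from bs4 import BeautifulSoup


def table_rows(html: str) -> list:
    """Return every table row as a list of non-empty cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Note: rows inside nested tables are visited once per enclosing table.
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["td", "th"])]
            cells = [c for c in cells if c]  # drop spacer cells
            if cells:
                rows.append(cells)
    return rows


def find_row(rows: list, label: str):
    """Return the first row whose leading cell contains label, ignoring case."""
    for row in rows:
        if label.lower() in row[0].lower():
            return row
    return None
```

Working row by row keeps a lone "$" cell next to the number it belongs to, which the flattened text does not guarantee.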

Putting it together

Here is the end-to-end call: fetch a filing from a known EDGAR document URL, extract the terms, and print the result. The URL you pass is the direct link to the .htm document inside an EDGAR filing index — not the index page itself.

if __name__ == "__main__":
    # Replace with a real 424B5 document URL under www.sec.gov/Archives
    filing_url = (
        "https://www.sec.gov/Archives/edgar/data/XXXXXXXXX/"
        "000000000000000000/example-424b5.htm"
    )
    raw_text = fetch_filing(filing_url)
    terms = extract_terms(raw_text)
    for field, value in terms.items():
        print(f"{field}: {value}")
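
If you already know a filing's CIK, accession number, and primary document filename, the direct document URL can be assembled rather than copied by hand. This sketch assumes the usual archive layout, where the folder name is the accession number with its dashes removed. The helper name build_document_url is hypothetical, and the values in the example are placeholders, not a real filing.

```python
from typing import Union

ARCHIVES_BASE = "https://www.sec.gov/Archives/edgar/data"


def build_document_url(cik: Union[int, str], accession_number: str, primary_document: str) -> str:
    """Build the direct URL of one document inside an EDGAR filing folder."""
    cik_part = str(int(cik))                    # drop leading zeros from a padded CIK
    folder = accession_number.replace("-", "")  # folder name is the accession number without dashes
    return f"{ARCHIVES_BASE}/{cik_part}/{folder}/{primary_document}"


# Placeholder values for illustration only.
print(build_document_url("0001234567", "0001234567-26-000123", "example-424b5.htm"))
```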

The pipeline is intentionally modular. Swap fetch_filing for a local file read if you have already cached the raw HTML. Replace extract_terms with a stricter issuer-specific version for names you track closely. And always cross-check a sample of parsed rows against the source document before treating the output as ground truth — a number that looks plausible is not the same as a number that is correct.