How to Query SEC EDGAR Filings With Python

Alphanume Team · June 9, 2026

Full-text search and document retrieval, programmatically.

The SEC publishes every filing it receives (10-Ks, 8-Ks, prospectuses, the lot) through a set of free, well-documented REST APIs. Querying SEC EDGAR from Python is straightforward once you understand two things: what the endpoints are, and what the mandatory User-Agent header means in practice. This tutorial walks through the full workflow: resolving a ticker to a CIK, pulling a company's filing history, filtering by form type, constructing document URLs, and running keyword searches across the entire EDGAR corpus.

Rate limits and the User-Agent requirement

Before writing a single line of code, read the SEC's fair-access policy. EDGAR enforces a rate limit of roughly ten requests per second per IP. Exceed it and your requests will be rejected (typically with a 429 or 403 status), and repeated abuse can result in a block. More importantly, the SEC requires every request to carry a descriptive User-Agent header that identifies your application and includes a contact email address. The default User-Agent that the requests library sends is not acceptable. Set your own once at the session level so every call in your script inherits it automatically.

import time
import requests
import pandas as pd

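HEADERS = {
    # Identify your application and a contact address, per the fair-access policy
    "User-Agent": "MyResearchApp admin@example.com",
    "Accept-Encoding": "gzip, deflate",
    # No hard-coded "Host": requests derives it from each URL, and this session
    # is used for both data.sec.gov and www.sec.gov.
}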

session = requests.Session()
session.headers.update(HEADERS)

def get_json(url):
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    time.sleep(0.12)   # ~8 req/s — stay safely under the 10 req/s cap
    return resp.json()

The time.sleep(0.12) call is not optional politeness — it is the simplest way to stay under the cap without building an explicit token-bucket. Use raise_for_status() so a 429 or 403 fails loudly rather than returning a JSON error body that looks like data.
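If a fixed sleep after every call feels too blunt, a minimum-interval throttle is one middle ground that is still simpler than a token bucket. The sketch below is only an illustration: `Throttle` is a hypothetical helper, not part of requests or any SEC tooling, and the injectable `clock` and `sleep` arguments are there purely so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (hypothetical helper)."""

    def __init__(self, max_per_second=8.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        """Sleep only for whatever part of min_interval has not yet elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `session.get(...)` would replace the fixed sleep; time already spent parsing the previous response counts toward the gap, so slow iterations are not delayed further.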

Mapping a ticker to a CIK

EDGAR's native identifier is the CIK — a numeric company code that must be zero-padded to exactly ten digits whenever it appears in a URL. The SEC publishes a flat JSON file that maps every ticker symbol to its CIK. Download it once, cache it locally, and look up tickers offline rather than hitting the network for every resolution.

TICKERS_URL = "https://www.sec.gov/files/company_tickers.json"

def load_ticker_map():
    data = get_json(TICKERS_URL)
    # data is a dict of {"0": {"cik_str": 320193, "ticker": "AAPL", ...}, ...}
    # Each value is one company record; the outer keys are just row numbers
    df = pd.DataFrame(list(data.values()))
    df["cik_padded"] = df["cik_str"].astype(str).str.zfill(10)
    return df.set_index("ticker")

def ticker_to_cik(ticker_map, ticker):
    ticker = ticker.upper()
    if ticker not in ticker_map.index:
        raise KeyError(f"Ticker {ticker!r} not found in EDGAR mapping")
    return ticker_map.loc[ticker, "cik_padded"]

ticker_map = load_ticker_map()
cik = ticker_to_cik(ticker_map, "AAPL")
print(cik)   # "0000320193"

The zero-padding step trips up a lot of people — the raw cik_str field is an integer, so leading zeros are stripped. Always convert to a string and call zfill(10) before interpolating the CIK into any EDGAR URL.

Pulling a company's filing history

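Once you have a padded CIK, the submissions endpoint returns the company's filing history as a single JSON object. The most recent filings (per the SEC's documentation, at least a year's worth or the latest 1,000, whichever is more) live under filings.recent as a set of parallel arrays: one array each for form, filingDate, accessionNumber, and primaryDocument, among others, all aligned by index. For prolific filers, older filings are split into additional JSON files listed under filings.files. Zip the arrays into a DataFrame for easy filtering.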

def get_submissions(cik_padded):
    url = f"https://data.sec.gov/submissions/CIK{cik_padded}.json"
    data = get_json(url)
    recent = data["filings"]["recent"]
    df = pd.DataFrame({
        "form":            recent["form"],
        "filing_date":     pd.to_datetime(recent["filingDate"]),
        "accession_raw":   recent["accessionNumber"],
        "primary_doc":     recent["primaryDocument"],
    })
    df["accession"] = df["accession_raw"].str.replace("-", "", regex=False)
    return df

filings = get_submissions(cik)
print(filings.head())

The accessionNumber field arrives in hyphenated form (e.g. 0001193125-24-050000). Stripping the hyphens gives you the version used in document URLs — keep both columns so you have whichever format you need later.
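Converting between the two accession formats can be sketched as a pair of small helpers. The function names are illustrative, not part of any SEC library; the only fact relied on is the 10-2-6 digit layout of an accession number (filer ID, two-digit year, sequence).

```python
import re


def strip_accession(accession):
    """Hyphenated form -> 18-digit form used in archive paths."""
    return accession.replace("-", "")


def hyphenate_accession(accession):
    """18-digit form (or already hyphenated) -> 10-2-6 hyphenated form."""
    digits = accession.replace("-", "")
    if not re.fullmatch(r"\d{18}", digits):
        raise ValueError(f"Not an 18-digit accession number: {accession!r}")
    return f"{digits[:10]}-{digits[10:12]}-{digits[12:]}"


print(strip_accession("0001193125-24-050000"))
print(hyphenate_accession("000119312524050000"))
```

Validating the digit count up front makes a truncated or mis-pasted accession number fail at the conversion step rather than as a confusing 404 later.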

Filtering by form type and constructing document URLs

With the full filing list in a DataFrame, filtering to a single form type is a one-liner. The more interesting step is constructing the URL that actually serves the document. EDGAR stores every filing under a path built from the CIK and the de-hyphenated accession number; the primary document filename is the last component. The same pattern works for 424B5 prospectus supplements or any other form type; only the filter string changes.

EDGAR_ARCHIVE = "https://www.sec.gov/Archives/edgar/data"

def build_doc_url(cik_padded, accession, primary_doc):
    # Path: /Archives/edgar/data/{CIK}/{accession}/{primary_doc}
    cik_int = str(int(cik_padded))  # drop leading zeros for this path segment
    return f"{EDGAR_ARCHIVE}/{cik_int}/{accession}/{primary_doc}"

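def filter_filings(filings_df, form_type, cik_padded):
    # Take the CIK as an argument rather than reading a module-level global,
    # so the function works for any company's DataFrame.
    subset = filings_df[filings_df["form"] == form_type].copy()
    # A list comprehension also behaves sensibly when the subset is empty
    subset["doc_url"] = [
        build_doc_url(cik_padded, acc, doc)
        for acc, doc in zip(subset["accession"], subset["primary_doc"])
    ]
    return subset.reset_index(drop=True)

eightk = filter_filings(filings, "8-K", cik)
print(eightk[["filing_date", "form", "doc_url"]].head())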

For tracking 8-K filing frequency across a portfolio, collect the filing_date column from each company's submissions DataFrame and concatenate them — no full-text fetching required. The SEC Filing Intensity dataset provides a pre-computed, market-wide version of exactly this aggregation if you need it at scale without hitting EDGAR yourself.
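As a rough sketch of that frequency count, the tally below works directly on the filings.recent parallel arrays rather than on the DataFrame, so it needs nothing beyond the standard library. The function name is hypothetical and the sample data is made up purely for illustration; it is not real filing history.

```python
from collections import Counter


def count_forms_by_month(recent_by_ticker, form_type="8-K"):
    """Tally filings of one form type per (ticker, 'YYYY-MM')."""
    counts = Counter()
    for ticker, recent in recent_by_ticker.items():
        # 'form' and 'filingDate' are parallel arrays aligned by index
        for form, filed in zip(recent["form"], recent["filingDate"]):
            if form == form_type:
                counts[(ticker, filed[:7])] += 1  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return counts


# Made-up sample shaped like data["filings"]["recent"] for two tickers
sample = {
    "AAPL": {"form": ["8-K", "10-Q", "8-K", "8-K"],
             "filingDate": ["2024-05-02", "2024-05-03", "2024-05-20", "2024-06-11"]},
    "MSFT": {"form": ["10-K", "8-K"],
             "filingDate": ["2024-07-30", "2024-05-15"]},
}
for key, n in sorted(count_forms_by_month(sample).items()):
    print(key, n)
```

Keying on the month string keeps the aggregation cheap; switching to weekly or quarterly buckets only means changing how the date is truncated.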

Full-text search with EDGAR's search API

The submissions endpoint tells you about a specific company's filings. To search by keyword across all filers — say, every filing that mentions "going concern" or a specific bond CUSIP — use EDGAR's full-text search API hosted at efts.sec.gov. The endpoint accepts a q parameter for the query string and supports filtering by form type and date range. Results include the accession number and CIK, so you can chain directly into the document-retrieval pattern above.

def edgar_full_text_search(query, form_type=None, start_date=None, end_date=None,
                            max_hits=20):
    """
    Search EDGAR full text via efts.sec.gov.
    Returns a list of hit dicts (the '_source' of each hit). The endpoint is not
    officially documented, so treat the parameter and field names used here as
    assumptions and inspect a live response before relying on them.
    """
    params = {
        "q":         query,
        "dateRange": "custom" if (start_date or end_date) else None,
        "startdt":   start_date,
        "enddt":     end_date,
        "forms":     form_type,
    }
    # Drop unset filters so they are not sent at all
    params = {k: v for k, v in params.items() if v is not None}

    ft_session = requests.Session()
    # requests sets Host from the URL, so only the User-Agent is needed
    ft_session.headers.update({"User-Agent": "MyResearchApp admin@example.com"})
    url = "https://efts.sec.gov/LATEST/search-index"
    # Pass the full params dict so the date filters are actually applied
    resp = ft_session.get(url, params=params, timeout=30)
    resp.raise_for_status()
    time.sleep(0.12)
    hits = resp.json().get("hits", {}).get("hits", [])
    return [h["_source"] for h in hits[:max_hits]]  # truncated client-side

# Inner double quotes request the exact phrase rather than the two words separately
results = edgar_full_text_search('"going concern"', form_type="10-K",
                                  start_date="2024-01-01", end_date="2024-12-31")
for r in results[:5]:
    # Assumed field names (display_names, file_date, adsh); check r.keys() first
    print(r.get("display_names"), r.get("file_date"), r.get("adsh"))

Full-text search is slower than the submissions endpoint and returns fewer fields, so use it for discovery — finding which companies filed something relevant — then follow up with the submissions or archive endpoints for the actual documents.

Caching and backoff etiquette

A few practices will keep your scripts running reliably and keep you on the SEC's good side. Cache the company_tickers.json file to disk and refresh it no more than once a day — it changes rarely and fetching it on every run is wasteful. Cache submissions JSON for companies you visit repeatedly; a filing history does not change retroactively for past entries. For any loop that touches many tickers or many dates, add exponential backoff on 429 responses rather than letting a burst of retries make the situation worse.
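One way to sketch the disk cache is a small wrapper that reuses a JSON file while it is fresh and refetches otherwise. The name `cached_json`, the cache path, and the one-day default are all assumptions for illustration; `fetch` is any zero-argument callable that returns JSON-serialisable data.

```python
import json
import time
from pathlib import Path


def cached_json(path, fetch, max_age_seconds=86400):
    """Return JSON from `path` if it is fresh enough, else call fetch() and store it."""
    path = Path(path)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the cache dir on first use
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With the helpers defined earlier, a call might look like `cached_json("cache/company_tickers.json", lambda: get_json(TICKERS_URL))`. Using the file's modification time as the freshness check avoids storing any extra metadata alongside the cached payload.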

A minimal backoff wrapper is worth adding to any production script. Check the Retry-After header if it is present; otherwise double a base delay up to a reasonable ceiling. Logging the ticker and URL on every 429 makes it easy to spot if one problematic query is causing most of your throttling. Finally, run bulk fetches overnight or on weekends when EDGAR traffic is lighter — the SEC explicitly thanks developers who do this in their documentation.
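A backoff wrapper along those lines might look like the sketch below. It is a minimal illustration, not a hardened implementation: `get_with_backoff` is a hypothetical name, the delay defaults are arbitrary, and only the numeric form of Retry-After is handled (an HTTP-date value falls back to the doubling delay). `get` is any callable that takes a URL and returns a requests-style response.

```python
import logging
import time


def get_with_backoff(get, url, max_retries=5, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call get(url), retrying on HTTP 429 with capped exponential backoff."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        resp = get(url)
        if resp.status_code != 429:
            resp.raise_for_status()  # other errors still fail loudly
            return resp
        if attempt < max_retries:
            retry_after = resp.headers.get("Retry-After")
            # Prefer the server's hint when it is a plain number of seconds
            if retry_after and retry_after.isdigit():
                wait = float(retry_after)
            else:
                wait = delay
            wait = min(wait, max_delay)
            logging.warning("429 from %s; sleeping %.1fs (attempt %d)",
                            url, wait, attempt + 1)
            sleep(wait)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"Still throttled after {max_retries} retries: {url}")
```

With the session from earlier, a call could be written as `get_with_backoff(lambda u: session.get(u, timeout=30), url)`. Passing the getter in as a callable keeps the retry logic independent of how the request itself is configured.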