How to Classify Stocks by Sector and Industry

Alphanume Team · June 4, 2026

Mapping tickers to a consistent taxonomy is foundational infrastructure — without it, peer comparisons, risk models, and sector strategies collapse into noise.

Before you can run a sector-neutral long-short book, build a peer-relative valuation model, or backtest a sector rotation strategy, every ticker in your universe needs a sector and industry label. That label is not a decoration. It is load-bearing infrastructure. When your risk model neutralizes sector exposure, it relies on those labels being accurate, stable, and consistently assigned. When you compute a stock's valuation percentile relative to peers, the label defines the peer group. When you investigate whether a factor performs differently in defensive versus cyclical sectors, the sector assignment is the independent variable. Get the classification wrong, or use one that is inconsistent across time, and the error propagates silently into every downstream output. Sector and industry classification is one of those inputs that is easy to underestimate until it breaks something important.

Alphanume's Ticker Classification dataset is built specifically to address the data discipline problems that generic label sources ignore: point-in-time membership, consistent hierarchy levels, and coverage of the long tail of small and micro-cap names where vendor data is thinnest.

The major standard schemes

Three classification systems dominate institutional usage, and they make meaningfully different design choices.

GICS (Global Industry Classification Standard) is the joint product of MSCI and S&P Global, and it is the most widely used system among index providers, risk models, and equity research. It operates on a four-level hierarchy: sector (11) → industry group (25) → industry (74) → sub-industry (163). These counts reflect the 2023 structure and change whenever the structure is revised. Assignments are made primarily on the basis of a company's principal business activity, with revenue as the main determinant, and MSCI and S&P review them periodically. GICS is licensed: free access is limited, and commercial redistribution requires an agreement.
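The four GICS levels are nested inside the published eight-digit code: the first two digits identify the sector, the first four the industry group, the first six the industry, and all eight the sub-industry. A minimal sketch of splitting a code into its levels follows; `split_gics` and `GicsLevels` are hypothetical names chosen for illustration, not part of any vendor API.

```python
from typing import NamedTuple


class GicsLevels(NamedTuple):
    sector: str
    industry_group: str
    industry: str
    sub_industry: str


def split_gics(code: str) -> GicsLevels:
    """Split an 8-digit GICS code into its four nested level codes."""
    code = code.strip()
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected an 8-digit GICS code, got {code!r}")
    # Each level is a prefix of the next, so slicing recovers the hierarchy.
    return GicsLevels(code[:2], code[:4], code[:6], code)


# 45 is the Information Technology sector.
levels = split_gics("45103010")
print(levels.sector, levels.industry_group, levels.industry, levels.sub_industry)
```

Storing the full code and deriving the coarser levels by prefix keeps the four levels consistent with each other by construction.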

ICB (Industry Classification Benchmark) is the FTSE Russell equivalent. It uses a parallel four-level structure: industry → supersector → sector → subsector. ICB has stronger coverage of European and global developed-market names, and it makes somewhat different judgment calls on companies that straddle multiple activities. For practitioners building global universes, the choice between GICS and ICB is often dictated by which index family their benchmark tracks.

NAICS and SIC are U.S. government standards. NAICS is maintained under the Office of Management and Budget and published by the Census Bureau. SIC is the older federal scheme that NAICS superseded, and the SEC still assigns SIC codes to registrants. Those codes appear on SEC filings and EDGAR records, making them universally available for U.S.-listed companies without licensing friction. NAICS replaced SIC for official statistical use and is more granular in several technology and services categories. The trade-off: both are designed for economic and regulatory classification, not for market-structure analysis. A company can carry a SIC code based on its founding-era activity that no longer reflects its current revenue mix at all. For systematic equity work, SIC is a useful fallback for coverage of obscure names, but it should not be a primary system for anything requiring market-perception alignment.

The revenue-based versus market-perception distinction matters in practice. GICS attempts to classify companies by what they actually do as a business. But markets sometimes price companies on narrative rather than revenue: a company that derives 60% of revenue from legacy hardware but is widely perceived as a software business will behave in equity markets more like software peers. Neither approach is universally correct; the right choice depends on whether you are modeling fundamental earnings dynamics or equity return co-movement.

The hard problems in classification

The difficult cases are more common than the textbook examples suggest, and they have real consequences for any model that relies on peer groups or sector exposures.

Conglomerates and multi-segment firms. A company that operates in insurance, asset management, and commercial real estate simultaneously does not fit cleanly into any single sub-industry. Most classification systems resolve this by assigning the company to its largest revenue segment. That resolves the categorization problem but creates a different one: the assigned category may not explain the stock's return behavior, because the market prices the conglomerate discount or the optionality of the diversified structure, not just the largest segment. For factor models, this means that conglomerates will often appear as outliers within their assigned peer group — their fundamentals and price dynamics differ systematically from pure-play peers.

Reclassifications over time. Companies get reclassified. A retailer that pivots to e-commerce and advertising revenue may move from Consumer Discretionary to Communication Services. An energy company that builds out data-center infrastructure may eventually move to Industrials or Technology. These reclassifications happen at specific dates, and the date matters enormously for historical analysis. If you use today's GICS assignment for a company that was reclassified three years ago, say from Consumer Discretionary to Technology, you introduce a look-ahead error: your backtest treats the company as a Technology name in periods when the market, index providers, and risk models all classified it as Consumer Discretionary. The magnitude of that error depends on how differently the two sectors behave: sometimes it is small, sometimes it is the difference between correct and incorrect sector neutralization.

ADRs, ETFs, and holding companies. American depositary receipts represent foreign-listed companies and should inherit the classification of the underlying operating company, not default to a financial holding company category. ETFs should typically be excluded from any fundamental peer-group analysis entirely — they have no revenue, no earnings, and no balance sheet. Holding companies present a similar problem to conglomerates: the legal entity may have a different SIC code than the operating subsidiaries it owns. Handling these cases requires explicit rules, not just a pass-through of vendor data.
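The explicit rules described above could be written down roughly as follows. This is a sketch under stated assumptions: the `Security` fields and the security-type codes ("ADR", "ETF", "HOLDCO") are hypothetical, and a real process would take them from whatever security master is in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Security:
    ticker: str
    security_type: str                       # hypothetical codes: "COMMON", "ADR", "ETF", "HOLDCO"
    vendor_label: Optional[str]              # label passed through by the vendor
    underlying_label: Optional[str] = None   # label of an ADR's underlying operating company
    subsidiary_label: Optional[str] = None   # label of a holding company's main operating subsidiary


def resolve_label(sec: Security) -> Optional[str]:
    """Return the label to use for peer grouping, or None to exclude the security."""
    if sec.security_type == "ETF":
        return None  # no revenue, earnings, or balance sheet: exclude from peer groups
    if sec.security_type == "ADR" and sec.underlying_label:
        return sec.underlying_label  # inherit from the operating company
    if sec.security_type == "HOLDCO" and sec.subsidiary_label:
        return sec.subsidiary_label  # prefer the operating business over the legal shell
    return sec.vendor_label  # default: pass the vendor label through


adr = Security("FOREIGN_ADR", "ADR", vendor_label="Financials", underlying_label="Health Care")
etf = Security("SECTOR_ETF", "ETF", vendor_label="Financials")
print(resolve_label(adr))  # Health Care
print(resolve_label(etf))  # None
```

Keeping the rules in one function makes the overrides auditable: every departure from the vendor label traces back to a named rule.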

Microcaps and thinly-described names. Vendor classification databases have patchy coverage below a certain market cap threshold. Small and micro-cap companies may be assigned a default or placeholder code rather than a carefully researched one. For strategies that include the small-cap tail of the market — where a large fraction of stocks live — this is not a marginal problem. It requires either accepting higher label noise in that segment or applying supplemental classification logic using business description text, filing data, or revenue segment disclosures.

Point-in-time membership and survivorship

This is the most consequential data discipline issue in classification, and it is frequently handled incorrectly. A sector membership database that shows only current assignments has limited value for historical strategy research. For any date in the past, you need to know what sector a stock was in on that date — not what sector it is in today, and not what sector it was assigned to after a retroactive reclassification.

The survivorship problem compounds this. Stocks that have been delisted — through acquisition, bankruptcy, or voluntary delisting — no longer appear in most current vendor feeds at all. A backtest that only includes currently listed companies is biased toward survivors, and the bias is not uniform across sectors. Sectors with high historical M&A activity or elevated failure rates (biotech, energy exploration) will be systematically underrepresented in a survivorship-biased universe. The correct approach requires a database that retains historical records for delisted names, with dated membership records that reflect what was known at each point in time.
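One way to keep delisted names in the historical universe is to store listing and delisting dates for every security and filter by date. The sketch below assumes a hypothetical `Listing` record and made-up tickers, and treats a delisting date as the first day the name is no longer tradable.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    ticker: str
    list_date: date
    delist_date: Optional[date] = None  # None means still listed


def universe_as_of(listings: list[Listing], as_of: date) -> list[str]:
    """Tickers that were listed on `as_of`, including names delisted afterwards."""
    return sorted(
        item.ticker
        for item in listings
        if item.list_date <= as_of
        and (item.delist_date is None or as_of < item.delist_date)
    )


listings = [
    Listing("AAA", date(2005, 1, 3)),
    Listing("BBB", date(2008, 5, 1), delist_date=date(2016, 9, 30)),  # later acquired
    Listing("CCC", date(2019, 3, 1)),
]
print(universe_as_of(listings, date(2012, 6, 29)))  # ['AAA', 'BBB']
print(universe_as_of(listings, date(2020, 1, 2)))   # ['AAA', 'CCC']
```

A current-only feed would return just AAA and CCC for every historical date, silently dropping BBB from the 2012 universe.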

The practical implication: a classification database for backtest use is structurally different from a reference database for current portfolio construction. The backtest database needs a dated record of every classification change, including the effective date and the source of the change. The classification methodology documentation describes how these point-in-time records should be structured and what fields are required to avoid look-ahead contamination.
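A dated change record and an as-of lookup might look like the sketch below. The record layout (`ClassificationChange` with `effective_date` and `source`) follows the fields named above, but the exact names, the ticker, and the dates are illustrative assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ClassificationChange:
    ticker: str
    effective_date: date  # date the new label took effect
    sector: str
    source: str           # where the change came from


def build_history(changes):
    """Group changes by ticker, each list sorted by effective date."""
    history = {}
    for ch in sorted(changes, key=lambda c: (c.ticker, c.effective_date)):
        history.setdefault(ch.ticker, []).append(ch)
    return history


def sector_as_of(history, ticker: str, as_of: date) -> Optional[str]:
    """Latest sector whose effective date is on or before `as_of`, else None."""
    records = history.get(ticker, [])
    dates = [r.effective_date for r in records]
    i = bisect_right(dates, as_of)  # number of changes effective by `as_of`
    return records[i - 1].sector if i else None


history = build_history([
    ClassificationChange("XYZ", date(2022, 6, 1), "Technology", "provider notice"),
    ClassificationChange("XYZ", date(2015, 1, 2), "Consumer Discretionary", "initial assignment"),
])
print(sector_as_of(history, "XYZ", date(2020, 3, 31)))  # Consumer Discretionary
print(sector_as_of(history, "XYZ", date(2023, 1, 3)))   # Technology
print(sector_as_of(history, "XYZ", date(2010, 1, 4)))   # None
```

Returning None before the first record, rather than backfilling the earliest label, avoids inventing a classification for dates when none was known.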

Building or adopting a classification mapping

For most practitioners, the decision is not whether to build a classification system from scratch but how to adopt and supplement an existing one. A few design choices matter.

Single source versus blended. Using one classification system throughout a process is simpler and avoids internal inconsistency — a stock cannot be in Technology under one scheme and Consumer Discretionary under another within the same model. The cost of a single-source approach is that any gaps or errors in that source propagate. A blended approach uses a primary source (typically GICS for U.S. large caps) with a fallback hierarchy for names not covered: ICB for foreign names, SIC for microcaps. This requires explicit precedence rules and reconciliation logic.
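The precedence rules and reconciliation step for a blended approach can be sketched as below. The `TO_COMMON` table is a small illustrative stand-in for a full mapping from each scheme's native labels onto one common sector vocabulary; its entries and the function name are assumptions.

```python
from typing import Optional

# Precedence order from the text: GICS first, then ICB, then SIC.
PRECEDENCE = ("GICS", "ICB", "SIC")

# Illustrative reconciliation table, not a complete mapping.
TO_COMMON = {
    ("GICS", "45"): "Information Technology",
    ("ICB", "Technology"): "Information Technology",
    ("SIC", "7372"): "Information Technology",  # prepackaged software
}


def blended_label(labels_by_source: dict[str, Optional[str]]) -> Optional[tuple[str, str]]:
    """Return (common_sector, source_used) from the highest-precedence usable source."""
    for source in PRECEDENCE:
        native = labels_by_source.get(source)
        common = TO_COMMON.get((source, native)) if native else None
        if common:
            return common, source
    return None  # uncovered: route to manual review or a coverage-completion step


print(blended_label({"GICS": "45", "SIC": "7372"}))  # ('Information Technology', 'GICS')
print(blended_label({"GICS": None, "SIC": "7372"}))  # ('Information Technology', 'SIC')
print(blended_label({"SIC": "0000"}))                # None (code not in the table)
```

Returning the source alongside the label preserves provenance, so label noise can later be measured separately for names that fell through to a fallback.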

Handling the long tail. The bottom quintile of the market by capitalization contains a disproportionate share of classification problems: companies with vague business descriptions, recent IPOs with limited filing history, shell companies in transition, and names where the primary vendor has not assigned a sub-industry. Building a coverage-completion step — using 10-K business description text or revenue segment data to assign or validate labels for uncovered names — is labor-intensive but necessary for any strategy that includes smaller names.

Revenue-segment-based assignment. For multi-segment firms where the top-level assignment is ambiguous, going directly to segment revenue disclosures in annual filings provides a more granular and defensible basis for classification. A company that reports 55% of revenue from software subscriptions and 45% from professional services can be labeled at the segment level rather than forced into a single category. This increases the complexity of the mapping but reduces the peer-group pollution that comes from lumping fundamentally different businesses together.
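Segment-level labeling reduces to computing revenue weights and deciding when one segment is dominant enough to stand alone. In this sketch the 60% dominance threshold is an arbitrary assumption, and the segment names are assumed to have been mapped to taxonomy labels already.

```python
def segment_weights(segment_revenue: dict[str, float]) -> dict[str, float]:
    """Convert reported segment revenue into weights that sum to 1."""
    total = sum(segment_revenue.values())
    if total <= 0:
        raise ValueError("total segment revenue must be positive")
    return {seg: rev / total for seg, rev in segment_revenue.items()}


def classify_by_segments(segment_revenue: dict[str, float], dominance: float = 0.6) -> dict[str, float]:
    """Single label if one segment dominates, otherwise fractional labels."""
    weights = segment_weights(segment_revenue)
    top_segment, top_weight = max(weights.items(), key=lambda kv: kv[1])
    if top_weight >= dominance:
        return {top_segment: 1.0}
    return weights  # keep the split instead of forcing a single category


# The 55/45 company from the text stays split under a 60% threshold.
print(classify_by_segments({"Software": 55.0, "Professional Services": 45.0}))
# A clearly dominant segment collapses to one label.
print(classify_by_segments({"Software": 80.0, "Professional Services": 20.0}))
```

Fractional labels let a risk model spread the company's exposure across both peer groups rather than assigning all of it to the larger segment.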

Validation and spot checks

Any classification database should be validated before being used in production, and the validation should be ongoing rather than one-time. Two checks that catch the most common errors:

Peer-group sanity. For a sample of names across sectors, pull the 20 closest peers by assigned sub-industry and inspect them manually. Do the business descriptions, revenue profiles, and price return correlations of the group make sense? Outliers within a peer group — names whose fundamentals or return behavior are systematically different from every other member — are often classification errors. A retail pharmacy chain assigned to a pharmaceutical manufacturer sub-industry will appear as a permanent outlier in that peer group.
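The return-correlation part of this check can be roughly automated to shortlist names for manual inspection. The sketch below flags members whose average return correlation with the rest of the group falls below a threshold; the 0.2 cutoff, the tickers, and the return series are illustrative assumptions, and the output is a list of candidates to review, not a verdict.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = sqrt(vx * vy)
    return cov / denom if denom else 0.0


def flag_peer_outliers(returns: dict[str, list[float]], min_avg_corr: float = 0.2) -> list[str]:
    """Names whose average correlation with the other group members is below the cutoff."""
    flagged = []
    for name, series in returns.items():
        corrs = [pearson(series, other) for peer, other in returns.items() if peer != name]
        if corrs and sum(corrs) / len(corrs) < min_avg_corr:
            flagged.append(name)
    return sorted(flagged)


# Made-up periodic returns for one assigned sub-industry.
group = {
    "PHARMA_A": [0.010, -0.020, 0.015, 0.030, -0.010],
    "PHARMA_B": [0.012, -0.018, 0.010, 0.025, -0.008],
    "PHARMA_C": [0.008, -0.025, 0.020, 0.028, -0.012],
    "RETAIL_RX": [-0.010, 0.020, -0.015, -0.030, 0.010],  # moves against the group
}
print(flag_peer_outliers(group))  # ['RETAIL_RX']
```

With real data the series would need far more observations than five, and a flagged name still requires a look at its business description before it is relabeled.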

Temporal consistency. For names known to have undergone reclassification, verify that the effective date in the database matches the actual date the index provider or standard body made the change, not the date a vendor updated their feed. Vendors sometimes lag reclassifications by weeks or months, and a lagged effective date leaves the stock in its old sector on dates when it had already moved, which misstates sector exposure over that window in any historical analysis.
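A date comparison of this kind is straightforward to script once a hand-verified list of official effective dates exists. In the sketch below both inputs are hypothetical dictionaries keyed by (ticker, new sector); the ticker and dates are made up.

```python
from datetime import date
from typing import Optional


def reclassification_lags(
    official: dict[tuple[str, str], date],
    vendor: dict[tuple[str, str], date],
    tolerance_days: int = 0,
) -> dict[tuple[str, str], Optional[int]]:
    """Map each mismatched change to its vendor lag in days (None if the vendor lacks it)."""
    issues: dict[tuple[str, str], Optional[int]] = {}
    for key, official_date in official.items():
        vendor_date = vendor.get(key)
        if vendor_date is None:
            issues[key] = None  # change missing from the vendor feed
            continue
        lag = (vendor_date - official_date).days  # positive: vendor is late
        if abs(lag) > tolerance_days:
            issues[key] = lag
    return issues


official = {("XYZ", "Technology"): date(2022, 6, 1)}
vendor = {("XYZ", "Technology"): date(2022, 7, 15)}
print(reclassification_lags(official, vendor))  # {('XYZ', 'Technology'): 44}
```

A negative lag is worth flagging as well, since a vendor date earlier than the official one usually indicates a backfilled or restated record.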

Spot checks are not a substitute for systematic validation, but they surface problems that automated checks miss. A classification that passes automated completeness and format checks can still be economically wrong in ways that only domain knowledge reveals.

Why internal consistency beats current-snapshot data

The temptation for practitioners building a new strategy is to pull a current snapshot of sector assignments from a free or low-cost source and treat it as sufficient. For live trading on the current universe, that may be adequate. For any backtest that spans more than a few years, it is almost certainly not.

An internally consistent, point-in-time classification database produces backtests where sector exposures are measured correctly at each historical date, peer groups reflect what was actually comparable at the time, sector-neutral portfolios are neutralized against the sectors the market actually organized stocks into on the relevant dates, and reclassified names contribute to the correct sector's performance record in each period. A current-snapshot database has none of those properties. It systematically misassigns every name that has been reclassified and omits every name that has been delisted, creating biases whose direction and magnitude are often unknown and difficult to diagnose after the fact.
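The effect on sector neutralization is easy to see in a toy example. The sketch below uses sector demeaning as a simple stand-in for neutralization and compares the result under point-in-time labels with the result under a current snapshot; the tickers, scores, and labels are made up, and "CCC" is assumed to have moved to Technology after the backtest date.

```python
from collections import defaultdict


def sector_demean(scores: dict[str, float], sector_on_date: dict[str, str]) -> dict[str, float]:
    """Subtract each stock's sector-average score, using the labels supplied."""
    by_sector = defaultdict(list)
    for ticker, score in scores.items():
        by_sector[sector_on_date[ticker]].append(score)
    means = {sector: sum(vals) / len(vals) for sector, vals in by_sector.items()}
    return {t: score - means[sector_on_date[t]] for t, score in scores.items()}


scores = {"AAA": 1.0, "BBB": 3.0, "CCC": 2.0, "DDD": 6.0}

# Labels as they stood on the backtest date.
point_in_time = {"AAA": "Technology", "BBB": "Technology",
                 "CCC": "Consumer Discretionary", "DDD": "Consumer Discretionary"}
# Today's labels: CCC has since been reclassified.
snapshot = dict(point_in_time, CCC="Technology")

print(sector_demean(scores, point_in_time))
# {'AAA': -1.0, 'BBB': 1.0, 'CCC': -2.0, 'DDD': 2.0}
print(sector_demean(scores, snapshot))
# {'AAA': -1.0, 'BBB': 1.0, 'CCC': 0.0, 'DDD': 0.0}
```

One stale label changes the neutralized signal for two names: CCC loses its negative tilt and DDD, left alone in its sector, is demeaned to zero.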

For cross-sectional strategies such as factor models, sector rotation, and peer-relative valuation, classification infrastructure is not a commodity input. It is a source of edge when done correctly and a source of hidden error when done carelessly. The time to build it correctly is before a careless version has contaminated a year of backtest results.