Using Search and Wikipedia Trends as Alt-Data

Alphanume Team · June 10, 2026

Building attention features without leakage — how Google Trends and Wikipedia page views become tradeable signals when treated with the same rigor as any other alternative data.

Crowds move prices. Before a retail earnings surprise, search volume for the brand spikes. Before a biotech binary event, traffic to the drug's Wikipedia page accelerates. These behavioral traces, a form of alternative data, are one of the few information channels that update daily, cover nearly every publicly traded company, and often move ahead of the formal news cycle. The challenge is that the raw feeds are noisier and more leakage-prone than they appear, and the difference between a well-constructed attention feature and a contaminated one is largely invisible without careful inspection of how the data is built.

This post works through the two flagship sources — Google Trends search interest and Wikipedia page views — from data mechanics to feature engineering to backtesting discipline. The goal is to construct features that could have been known at each historical point in time, without information that only becomes available later. That constraint rules out most naive implementations.

What alternative data attention signals actually measure

Attention in financial markets refers to the allocation of human awareness toward a security or its underlying business. When that awareness increases — because of news, rumors, product launches, or simply social propagation — it tends to precede a specific pattern of price behavior: elevated volume, wider bid-ask spreads, and a short-term return effect that reverses as the marginal investor is exhausted. The signal is not uniformly directional. It is better understood as a measure of crowd engagement whose value depends heavily on the context in which it appears.

Retail attention has been studied in the academic literature using Google search data that goes back to 2004, the first year Google Trends covers. Early studies found that search volume for individual ticker symbols predicted abnormal volume and short-term price momentum. The results were striking enough that practitioners quickly began building around the idea. Roughly two decades on, the same data sources are widely used, which means the naive version of the signal has been largely arbitraged away. What remains is the incremental value from careful engineering, timely data, and combination with other features.

Google Trends: what the data actually is

Google Trends provides a relative search interest index, not raw query counts. Each value reflects a term's share of searches for the chosen region and period, rescaled to 0–100 so that the peak within the query window you specify equals 100. This rescaling has a critical implication: the same historical date can return a different index value depending on when you pull the data and what date range the query includes. A query window ending in 2020 and a query window ending today will often return different values for a date in 2018, not because search behavior changed but because the peak used for scaling changed.
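The rescaling effect is easy to reproduce on toy numbers. The sketch below is a simplified illustration, not Google's actual procedure: it ignores sampling and share-of-search normalization and only rescales to the window peak. The function name `trends_style_index` and the raw values are hypothetical.

```python
from datetime import date

def trends_style_index(raw: dict[date, float], start: date, end: date) -> dict[date, int]:
    """Rescale raw interest to 0-100 relative to the peak inside [start, end]."""
    window = {d: v for d, v in raw.items() if start <= d <= end}
    peak = max(window.values())
    return {d: round(100 * v / peak) for d, v in window.items()}

# Hypothetical raw interest for one term; the 2024 value is a new all-time peak.
raw = {
    date(2018, 6, 3): 20.0,
    date(2019, 6, 2): 50.0,
    date(2020, 6, 7): 40.0,
    date(2024, 6, 2): 200.0,
}

pulled_2020 = trends_style_index(raw, date(2018, 1, 1), date(2020, 12, 31))
pulled_2024 = trends_style_index(raw, date(2018, 1, 1), date(2024, 12, 31))

# Same historical date, two different index values.
print(pulled_2020[date(2018, 6, 3)])  # 40
print(pulled_2024[date(2018, 6, 3)])  # 10
```

Nothing about 2018 changed between the two pulls; only the window peak did.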

The second mechanical issue is sampling. Google Trends is based on a sample of queries, not the full corpus. For lower-volume terms, the sample introduces noise; the same query run on consecutive days will sometimes return different values for the same historical week. Granularity also depends on the window: daily values are returned only for windows of roughly 270 days or fewer, weekly values for windows up to about five years, and monthly values beyond that. Mixing series of different granularity for the same term is a common source of inconsistency.

Practical consequence: any backtest that pulls Google Trends data today and treats it as historically fixed is using point-in-time incorrect data. The correct approach is to snapshot each query at the time the data would have been available, using a fixed and consistent query window. Many practitioners sidestep this by using a fixed rolling window — always querying the same number of trailing weeks — and appending incrementally rather than re-pulling the full history. Even then, the sample noise requires treating the series as an estimate with measurement error, not a precise count.
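One way to approximate the append-only approach is a store that records each observation once, together with the date it was pulled, and never overwrites it on later pulls. This is a minimal sketch under assumed names (`append_snapshot`, the `store` layout); a production version would persist to disk or a database.

```python
from datetime import date

def append_snapshot(store: dict, pulled_on: date, observations: dict) -> dict:
    """Add observations not yet in the store; never overwrite earlier values."""
    for obs_date, value in sorted(observations.items()):
        if obs_date not in store:
            store[obs_date] = {"value": value, "pulled_on": pulled_on}
    return store

store: dict = {}

# First pull (hypothetical index values for two weeks).
append_snapshot(store, date(2024, 1, 8),
                {date(2023, 12, 31): 40, date(2024, 1, 7): 55})

# A later pull returns a different value for 2024-01-07 plus one new week.
append_snapshot(store, date(2024, 1, 15),
                {date(2024, 1, 7): 61, date(2024, 1, 14): 70})

print(store[date(2024, 1, 7)])  # the value from the first pull is kept
```

The `pulled_on` stamp makes it possible to audit later which vintage each value came from. Values from different pulls are still scaled to their own window peaks, so the stored series remains an estimate rather than a consistent index.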

Wikipedia page views: coverage, granularity, and revision risk

The Wikimedia Foundation publishes hourly and daily page-view counts for every article across all language editions of Wikipedia. The series under the current page-view definition begins in 2015; older counts exist but were collected under a different definition and are not directly comparable. Unlike Google Trends, the counts are not rescaled: you receive the observed view count for each page on each day, which makes the series point-in-time stable once published. That stability is a significant advantage over Trends for systematic use.
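For readers working with the raw feed, the sketch below builds a request URL for the public Wikimedia Pageviews REST API (per-article endpoint) and parses a response body into a date-to-views mapping. The helper names are hypothetical, the sample payload is made up for illustration, and the network call itself is left out so the example runs offline.

```python
import json
from datetime import date, datetime
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(article: str, start: date, end: date,
                  project: str = "en.wikipedia",
                  access: str = "all-access", agent: str = "user") -> str:
    """Build a daily per-article URL; agent='user' excludes flagged bot traffic."""
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{BASE}/{project}/{access}/{agent}/{title}/daily/"
            f"{start:%Y%m%d}/{end:%Y%m%d}")

def parse_pageviews(payload: str) -> dict[date, int]:
    """Map each item's YYYYMMDDHH timestamp to its view count."""
    items = json.loads(payload)["items"]
    return {datetime.strptime(i["timestamp"], "%Y%m%d%H").date(): i["views"]
            for i in items}

url = pageviews_url("Nvidia", date(2024, 1, 1), date(2024, 1, 31))

# Illustrative payload with invented counts, shaped like the API response.
sample = '{"items": [{"timestamp": "2024010100", "views": 1234},' \
         ' {"timestamp": "2024010200", "views": 1500}]}'
views = parse_pageviews(sample)
print(url)
print(views[date(2024, 1, 2)])  # 1500
```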

The Wikipedia Views dataset provides clean, entity-resolved daily view series mapped to company identifiers, which resolves the most time-consuming part of working with raw Wikimedia data: figuring out which article corresponds to which ticker, and handling changes to that correspondence over time. The data latency is typically one to two days. In the usual case the previous day's view count is available the following morning, which makes it usable as a feature for strategies that trade at the open or later.

The leakage risk unique to Wikipedia is retroactive page editing and renaming. An article's title, content, and categorical structure can change at any time. If your entity resolution relies on the current state of Wikipedia to map tickers to articles, and then you apply that mapping to historical data, you may be assigning views from the pre-existing article to the current company — or missing views entirely if the company was covered under a different title in the past. Corporate actions compound this: spin-offs, mergers, and rebrands all change Wikipedia coverage. Entity resolution must therefore be versioned alongside the view data, with the mapping reflecting what article existed and what it was named at each historical date.

Engineering features without look-ahead

The standard feature construction approach for both series is a rolling normalization: compare today's raw observation with its trailing N-day baseline, either as a ratio to the trailing average or as a z-score (the observation minus the trailing mean, divided by the trailing standard deviation). Either form controls for the company's typical attention level. The window length N is a hyperparameter; 20 to 60 trading days is a common range. Shorter windows are more reactive but noisier; longer windows give a smoother baseline but adapt more slowly when a company's normal attention level shifts.
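A minimal pandas sketch of the trailing z-score follows. One design choice here is an assumption rather than something the text prescribes: the baseline is shifted by one observation so that today's value never enters its own mean or standard deviation. The function name `attention_zscore` is hypothetical.

```python
import pandas as pd

def attention_zscore(views: pd.Series, window: int = 20) -> pd.Series:
    """Z-score of each observation against the prior `window` observations."""
    baseline = views.shift(1)  # exclude today from its own baseline
    mean = baseline.rolling(window, min_periods=window).mean()
    std = baseline.rolling(window, min_periods=window).std()
    # Where the baseline has zero variance the z-score is undefined (NaN).
    return (views - mean) / std.where(std > 0)

# Hypothetical daily views: a quiet baseline followed by a spike.
views = pd.Series([10.0, 12.0, 11.0, 13.0, 50.0])
z = attention_zscore(views, window=4)
print(z)
```

The first `window` rows are NaN because no full baseline exists yet; the spike on the last row is measured against the four quiet days before it.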

Three specific rules reduce leakage exposure:

Use fixed query windows for Trends. Each historical snapshot should be pulled using the same trailing window length. Mixing query windows contaminates the index values. Re-pulling Trends history with today's full date range as the window is the most common implementation error.

Lag features to true availability time. Wikipedia views for Tuesday are typically published Wednesday morning. Google Trends daily data has a similar one-to-two-day lag before it stabilizes. Features should be lagged by at least one full day beyond the observation date to ensure they could have been known when the trade decision was made. A one-day lag is the minimum; two days provides a buffer against data pipeline variability.

Normalize within a fixed universe. Cross-sectional normalization, meaning ranking a company's attention z-score relative to its peers on the same day, removes market-wide attention effects (news cycles that drive views broadly) and leaves the idiosyncratic component. It also puts companies of different market-cap tiers on a common scale, since absolute view counts are mechanically higher for large-cap companies.
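The second and third rules can be sketched together: lag the per-company z-scores to their availability date, then rank across the universe on each day. The sketch assumes one row per trading day with no gaps and one column per ticker in a fixed universe; `lag_then_rank`, the tickers, and the values are hypothetical.

```python
import pandas as pd

def lag_then_rank(z: pd.DataFrame, lag_days: int = 2) -> pd.DataFrame:
    """Shift observations to the day they are usable, then rank across tickers."""
    available = z.shift(lag_days)              # row t now holds data observed at t - lag
    return available.rank(axis=1, pct=True)    # percentile rank within each day

dates = pd.bdate_range("2024-01-01", periods=4)
z = pd.DataFrame(
    {"AAA": [0.5, 0.1, 0.0, 0.3],
     "BBB": [2.0, -0.4, 1.0, 0.2],
     "CCC": [-1.0, 0.9, 0.5, 0.1]},
    index=dates,
)

feature = lag_then_rank(z, lag_days=2)
print(feature)
```

The first `lag_days` rows are empty by construction. The third row ranks the values observed on the first day, which is the earliest information a two-day lag allows.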

Entity resolution and ambiguous names

Mapping a ticker to the correct Google Trends query and Wikipedia article is not trivial. Several failure modes appear consistently in practice. The company's legal name may differ from its operating name. Common words in a company name (Apple, Target, Oracle) generate search and Wikipedia traffic that has nothing to do with the stock. ETFs that hold a sector are often confused with the underlying companies in that sector.

For Google Trends, querying by ticker symbol generally produces cleaner results than querying by company name for large-cap stocks, where the ticker is commonly typed by retail participants. For smaller companies where the ticker is not widely known, company-name queries are necessary but must be validated for contamination: a name that is also a common word draws search volume unrelated to the stock, and such a query should be flagged as unreliable. Trends also allows querying by topic rather than by raw keyword, which can improve precision for ambiguous names, though topic matching has its own inconsistencies across the historical series.

For Wikipedia, the entity-resolution problem is best handled by maintaining a versioned mapping that tracks article title changes, redirects, and disambiguation page history. A company that was known as one name in 2016 and rebranded in 2020 will have different Wikipedia article titles across that period. The mapping must reflect this — assigning views from the 2016 article to the 2016 dates and views from the 2020 article to the 2020 dates, with a break at the transition.
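A versioned mapping can be as simple as a list of validity intervals per ticker, looked up by date. The sketch below is one possible layout, not a prescribed schema; the class, the function, the ticker `XYZ`, and the article titles are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ArticleVersion:
    ticker: str
    article: str
    valid_from: date             # inclusive
    valid_to: Optional[date]     # exclusive; None means still current

def article_as_of(mapping: list[ArticleVersion], ticker: str, on: date) -> Optional[str]:
    """Return the article title that covered `ticker` on date `on`, if any."""
    for v in mapping:
        if (v.ticker == ticker and v.valid_from <= on
                and (v.valid_to is None or on < v.valid_to)):
            return v.article
    return None

# Hypothetical company that rebranded on 2020-03-01.
mapping = [
    ArticleVersion("XYZ", "Oldco_Inc.", date(2012, 5, 1), date(2020, 3, 1)),
    ArticleVersion("XYZ", "Newco", date(2020, 3, 1), None),
]

print(article_as_of(mapping, "XYZ", date(2016, 6, 30)))  # Oldco_Inc.
print(article_as_of(mapping, "XYZ", date(2021, 6, 30)))  # Newco
print(article_as_of(mapping, "XYZ", date(2010, 1, 1)))   # None
```

Returning `None` before the first interval is deliberate: a missing mapping should produce a missing feature, not views borrowed from whatever article holds the title today.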

Combining attention features with price and volume

Attention alone is not a return predictor; it is an input that interacts with the information environment. The combination that appears repeatedly in practitioner research is elevated attention during or immediately before a corporate event window: earnings announcements, FDA decisions, activist disclosures, or index reconstitution events. Outside these windows, the mean-reversion pattern dominates: sharp attention spikes in quiet periods tend to reverse once the marginal buyers drawn in by the attention are exhausted.

One structurally sound feature interaction is attention paired with unusual options activity or short-interest changes. When retail attention rises and simultaneously short interest declines, the crowd is moving in the same direction as institutional positioning, which has historically produced stronger returns than attention alone. When attention rises into elevated short interest, the structure is different — potential squeeze dynamics with binary outcomes. These interactions require point-in-time short interest data, which itself has publication lags that must be respected.
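Respecting the short-interest publication lag amounts to an as-of join on publication date rather than on the reference date of the report. A minimal sketch with `pandas.merge_asof` follows; the column names, the ticker, and all values are hypothetical, and it assumes a `published` timestamp is available for each short-interest record.

```python
import pandas as pd

attention = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-26"]),
    "ticker": ["XYZ", "XYZ", "XYZ"],
    "attention_z": [0.4, 2.1, 1.8],
})

short_interest = pd.DataFrame({
    "published": pd.to_datetime(["2024-01-09", "2024-01-25"]),
    "ticker": ["XYZ", "XYZ"],
    "short_pct_float": [12.0, 9.5],
})

# For each attention row, take the latest short-interest figure published
# strictly before that date; same-day publications are excluded as a buffer.
joined = pd.merge_asof(
    attention.sort_values("date"),
    short_interest.sort_values("published"),
    left_on="date", right_on="published",
    by="ticker", direction="backward", allow_exact_matches=False,
)
print(joined[["date", "attention_z", "published", "short_pct_float"]])
```

The first row has no short-interest value because nothing had been published yet, which is the behavior a point-in-time backtest needs.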

Small-cap and micro-cap universes show the strongest marginal contribution from attention signals. In these names, institutional research coverage is sparse, market makers are less informed, and retail flows are proportionally larger relative to total volume. Attention signals add less incremental information to large-cap names that are already covered by hundreds of analysts and priced continuously by algorithmic market-makers.

Backtesting discipline and overfitting risk

Attention signals are particularly vulnerable to overfitting because they are high-dimensional — every combination of window length, normalization method, lag, and event filter is a separate model — and because the historical returns associated with them are noisy enough that spurious patterns survive in-sample. The standard remedies apply with extra urgency here.

Strict out-of-sample testing means selecting model parameters on a training window and evaluating performance on a subsequent, non-overlapping test window. Walk-forward validation, where the model is re-estimated at each point in time using only data available up to that point, is more realistic than a single train-test split and better approximates live trading conditions. If the out-of-sample Sharpe ratio is materially lower than the in-sample figure, the model is probably overfit, and the size of the gap indicates how severely.
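A walk-forward schedule can be generated with a few lines of index arithmetic. The sketch below uses an expanding training window, which is one common variant and an assumption here; the function name and parameters are hypothetical.

```python
from typing import Iterator

def walk_forward_splits(n_obs: int, min_train: int,
                        test_size: int) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges; each test block follows its training data."""
    train_end = min_train
    while train_end + test_size <= n_obs:
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size

splits = list(walk_forward_splits(n_obs=10, min_train=4, test_size=2))
for train, test in splits:
    print(f"fit on {train.start}..{train.stop - 1}, "
          f"evaluate on {test.start}..{test.stop - 1}")
# fit on 0..3, evaluate on 4..5
# fit on 0..5, evaluate on 6..7
# fit on 0..7, evaluate on 8..9
```

Parameters are chosen on each training range only, and the concatenated test blocks form the out-of-sample track record.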

Multiple-testing correction is essential when comparing across many signal variants. Running 50 feature combinations and selecting the best-performing one produces a best in-sample Sharpe that substantially overstates true expected performance. A conservative approach is to apply a Bonferroni correction or another procedure that controls the family-wise error rate; a practical approach is to pre-register the exact feature specification before looking at results on the test window. Either way, the number of variants tested must be tracked and reported honestly.
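As a concrete illustration of the Bonferroni adjustment, the sketch below converts each variant's t-statistic into a two-sided p-value under a normal approximation and compares it with the significance level divided by the number of variants tried. The variant names and t-statistics are invented; the normal approximation is an assumption that suits long samples better than short ones.

```python
import math

def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

def bonferroni_survivors(t_stats: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Keep variants whose p-value beats alpha divided by the number tested."""
    threshold = alpha / len(t_stats)
    return [name for name, t in t_stats.items() if two_sided_p(t) < threshold]

# 50 hypothetical variants: 48 unremarkable, one "good", one very strong.
t_stats = {f"variant_{i}": 0.5 for i in range(48)}
t_stats["window_20_lag_2"] = 2.5
t_stats["window_40_lag_2"] = 3.5

print(two_sided_p(2.5) < 0.05)        # True: significant if tested alone
print(bonferroni_survivors(t_stats))  # ['window_40_lag_2']
```

A t-statistic of 2.5 looks convincing in isolation but does not clear the bar once all 50 attempts are counted.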

Transaction costs are highest in exactly the names where attention signals are most powerful: small caps with wide spreads. A signal that appears profitable at zero cost may be entirely consumed by realized bid-ask spread, market impact, and borrowing costs for short legs. Costs must be modeled at realistic executed-price levels, not mid-quote, and must scale with position size rather than being applied as a flat percentage.
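One simple way to make costs scale with position size is to add an impact term that grows with the order's share of average daily volume. The sketch below uses a square-root impact form, which is a common modeling convention swapped in here as an assumption, not something the text specifies. The coefficient and all inputs are hypothetical, and borrowing costs for short legs are left out.

```python
import math

def round_trip_cost_bps(half_spread_bps: float, order_value: float,
                        adv_value: float, impact_coef_bps: float = 10.0) -> float:
    """Round-trip cost in basis points: half-spread plus square-root impact, both ways."""
    participation = order_value / adv_value          # fraction of daily volume traded
    one_way = half_spread_bps + impact_coef_bps * math.sqrt(participation)
    return 2 * one_way

# Hypothetical small cap: 25 bps half-spread, $5m average daily traded value.
small_order = round_trip_cost_bps(25.0, order_value=50_000, adv_value=5_000_000)
large_order = round_trip_cost_bps(25.0, order_value=500_000, adv_value=5_000_000)

print(round(small_order, 2))
print(round(large_order, 2))
```

Under these assumptions the larger order pays more per dollar traded, which a flat percentage cost would miss entirely.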

The combination of genuine data discipline, entity resolution, and out-of-sample validation narrows the universe of attention-based strategies considerably. What remains is still a real and usable signal — particularly in event windows and small-cap universes — but it requires treating the data with the same rigor applied to any other factor, without the shortcut of assuming that behavioral intuition translates directly into alpha.