How to Schedule a Daily Data Pull (Cron + Python)
Alphanume Team · June 2, 2026
Automating a nightly dataset refresh.
A one-off script is fine for exploration, but production work demands a reliable, scheduled Python data pull that runs unattended, writes clean output, and wakes someone up when it breaks. This tutorial builds exactly that: a pull script that is idempotent and append-safe, a crontab line that fires after the data is actually available, and enough retry and alerting logic that a transient network hiccup does not silently kill your pipeline. If you are still getting comfortable with the basics, our guide to getting stock data with a REST API in Python covers the request pattern before we layer scheduling on top.
The base pull script
Every Alphanume endpoint shares the same shape: a GET against the base URL with an api_key query parameter, returning a JSON envelope with a data array. A thin helper handles that once so the rest of the script stays readable. We will use the next-day-movers dataset as a running example — it refreshes around 3:30 pm ET each trading day.
import os
import logging
import requests
import pandas as pd
from datetime import date, timedelta
from pathlib import Path

BASE_URL = "https://api.alphanume.com/v1"
API_KEY = os.environ["ALPHANUME_API_KEY"]  # fails fast if the key is missing

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)


def get_data(endpoint, **params):
    """GET one endpoint and return the rows from the envelope's data array."""
    params["api_key"] = API_KEY
    resp = requests.get(
        f"{BASE_URL}/{endpoint}", params=params, timeout=30
    )
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    payload = resp.json()
    log.info("fetched %d rows from %s", payload["count"], endpoint)
    return payload["data"]
The raise_for_status() call is not optional — a 401 or 429 returns an error body that pandas will cheerfully parse into a DataFrame of nonsense. Fail loudly instead. The full set of available datasets and their field lists live in the API documentation.
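The same fail-loudly idea can be extended to the response body. The sketch below checks the envelope before any rows reach pandas. It assumes only what is described above, a JSON object with a data array; the helper name extract_rows is hypothetical and not part of any Alphanume client.

```python
def extract_rows(payload):
    """Return the rows from a JSON envelope, or raise if the shape is wrong.

    Assumes the envelope is a dict with a "data" list, as described above.
    extract_rows is a hypothetical helper name used for illustration.
    """
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    rows = payload.get("data")
    if not isinstance(rows, list):
        raise ValueError("response envelope has no 'data' array")
    return rows
```

Calling extract_rows(resp.json()) in place of payload["data"] turns an unexpected body into a clear ValueError rather than a KeyError or a malformed DataFrame.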
Making the pull idempotent
The most common pipeline bug is running twice and appending duplicate rows. The fix is to check whether today's partition already exists before touching the network. Writing one file per date into a directory tree gives you that check for free and makes the output trivially queryable with pandas or any columnar reader.
OUTPUT_DIR = Path(os.environ.get("DATA_DIR", "/data/next-day-movers"))


def output_path(run_date: date) -> Path:
    # One file per date, laid out as <year>/<month>/<date>.parquet
    p = OUTPUT_DIR / str(run_date.year) / f"{run_date.month:02d}"
    p.mkdir(parents=True, exist_ok=True)
    return p / f"{run_date.isoformat()}.parquet"


def pull_next_day_movers(run_date: date) -> None:
    path = output_path(run_date)
    if path.exists():
        # Idempotency check: a finished partition means there is nothing to do
        log.info("already pulled %s, skipping", run_date)
        return
    rows = get_data("next-day-movers", date=run_date.isoformat())
    df = pd.DataFrame(rows)
    # tz-aware UTC timestamp (Timestamp.utcnow() is deprecated in recent pandas)
    df["pulled_at"] = pd.Timestamp.now(tz="UTC").isoformat()
    df.to_parquet(path, index=False)
    log.info("wrote %d rows to %s", len(df), path)


if __name__ == "__main__":
    pull_next_day_movers(date.today())
Writing to a temporary file and renaming it atomically is worth the extra line on Linux: a crash partway through a write then never leaves a corrupt file that the idempotency check treats as complete. Replace the df.to_parquet(path, index=False) line with df.to_parquet(path.with_suffix(".tmp"), index=False) followed by path.with_suffix(".tmp").rename(path) if that matters in your environment. The rename is atomic because the temporary file sits in the same directory, and therefore on the same filesystem, as the final path.
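The write-then-rename pattern can also be wrapped in a small helper so a failed write cleans up after itself. This is a minimal sketch, not a definitive implementation: atomic_write and its write_fn argument are hypothetical names, and it uses os.replace, which is an atomic rename on POSIX when source and destination are on the same filesystem.

```python
import os
from pathlib import Path


def atomic_write(path, write_fn):
    """Write via a temporary sibling file, then rename it into place.

    write_fn is any callable that takes a Path and writes the full output
    there. atomic_write is a hypothetical helper name for illustration.
    """
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")  # same directory, same filesystem
    try:
        write_fn(tmp)
        os.replace(tmp, path)  # atomic rename on POSIX; overwrites if present
    finally:
        # Remove the temp file if write_fn or the rename failed.
        if tmp.exists():
            tmp.unlink()
```

In the pull script this would be called as atomic_write(path, lambda tmp: df.to_parquet(tmp, index=False)), so the final path only ever appears once the file is complete.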
Retries and backoff
Networks are unreliable. A single transient 503 should not kill the night's pull. A small exponential backoff loop around the API call is enough for most pipelines — no external library required.
import time


def get_data_with_retry(endpoint, retries=4, backoff=5, **params):
    params["api_key"] = API_KEY
    url = f"{BASE_URL}/{endpoint}"
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            payload = resp.json()
            log.info(
                "fetched %d rows from %s (attempt %d)",
                payload["count"], endpoint, attempt,
            )
            return payload["data"]
        except requests.RequestException as exc:
            # Note: this also retries 4xx responses such as 401
            if attempt == retries:
                raise  # out of attempts: let the script exit non-zero
            wait = backoff * (2 ** (attempt - 1))  # 5, 10, 20, ...
            log.warning(
                "attempt %d failed (%s), retrying in %ds",
                attempt, exc, wait,
            )
            time.sleep(wait)
Four attempts with a 5-second base gives you waits of 5, 10, and 20 seconds before the final raise — long enough to ride out a brief hiccup, short enough that cron is not still waiting at the next scheduled run.
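The wait arithmetic is easy to check in isolation. The sketch below reproduces the schedule from the formula used in the retry loop; backoff_waits is a hypothetical helper name introduced only for this illustration.

```python
def backoff_waits(retries=4, backoff=5):
    """Seconds slept after each failed attempt except the last one."""
    return [backoff * (2 ** (attempt - 1)) for attempt in range(1, retries)]


waits = backoff_waits()
print(waits, sum(waits))  # [5, 10, 20] 35
```

With the defaults, the sleeps add up to 35 seconds. If each of the four attempts also ran into the 30-second timeout once, the whole call would take roughly two and a half minutes, still far short of the next day's run.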
Scheduling with cron — the timezone gotcha
Cron reads the server's local timezone, not ET. If your server runs UTC — as most cloud VMs do by default — you need to convert the dataset's ET publish time to UTC before writing the crontab line. The next-day-movers dataset publishes around 3:30 pm ET, which is 8:30 pm UTC in winter (EST) and 7:30 pm UTC in summer (EDT). Pick the later offset (8:30 pm UTC) and add a buffer of at least 15 minutes so you are pulling after the data is live even on the slow end.
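The two offsets can be confirmed rather than memorized. This sketch uses the standard-library zoneinfo module (Python 3.9+, with the system timezone database available) to convert the 3:30 pm ET publish time to UTC on one winter and one summer date. The function name publish_time_utc and the sample dates are illustrative assumptions.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9

ET = ZoneInfo("America/New_York")


def publish_time_utc(day, publish=time(15, 30)):
    """Convert a wall-clock ET publish time on a given date to UTC."""
    local = datetime.combine(day, publish, tzinfo=ET)
    return local.astimezone(timezone.utc)


print(publish_time_utc(date(2026, 1, 15)).strftime("%H:%M"))  # 20:30 (EST)
print(publish_time_utc(date(2026, 7, 15)).strftime("%H:%M"))  # 19:30 (EDT)
```

A fixed 20:45 UTC cron entry therefore runs 15 minutes after publication in winter and 75 minutes after it in summer.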
# crontab -e
# Pull next-day-movers daily at 20:45 UTC (buffer after 20:30 UTC / 3:30 pm ET)
# Append stdout and stderr to a log file (rotate it with logrotate or similar)
45 20 * * 1-5 /usr/bin/python3 /opt/alphanume/pull.py >> /var/log/alphanume/pull.log 2>&1
A few things worth noting in that line: 1-5 restricts the job to weekdays; the full path to the Python interpreter avoids PATH confusion in cron's minimal environment; and 2>&1 merges stderr into the log file so nothing is silently swallowed. Create the log directory before the first run — cron will not create it for you.
Environment and secret handling
Cron does not inherit your shell's environment, so os.environ["ALPHANUME_API_KEY"] will raise a KeyError unless you explicitly pass the variable. The cleanest approach is a small .env file owned by the service user with permissions 600, loaded by the script at startup.
from dotenv import load_dotenv
load_dotenv("/opt/alphanume/.env") # file contains: ALPHANUME_API_KEY=sk-...
API_KEY = os.environ["ALPHANUME_API_KEY"]
Never put the key directly in the crontab or in a file that sits inside a git repository. If you are running on a cloud provider, a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) is preferable to a flat file — the call to retrieve the secret replaces the load_dotenv line and gives you rotation and audit logs for free.
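If you do stay with a flat file, the script can refuse to start when the file is readable by anyone other than its owner. The following is a minimal sketch using only the standard library on a POSIX system; is_private is a hypothetical helper name, and the check is an extra safeguard suggested here rather than something python-dotenv does for you.

```python
import os
import stat


def is_private(path):
    """True if the file grants no permissions to group or others (e.g. 0o600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

Calling is_private("/opt/alphanume/.env") before load_dotenv, and raising if it returns False, catches a file that was accidentally left at 644.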
Alerting on failure
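A silent failure is worse than a loud one. The simplest production alert is an email from cron itself. With the MAILTO variable set at the top of the crontab, cron mails whatever output a job produces, regardless of its exit status, so a job whose output is entirely redirected to a log file never triggers a message. Appending an || echo after the redirect prints one line only when the script exits non-zero, and that line is what gets mailed. This relies on the server having a working local mail setup.
MAILTO=ops@yourcompany.com
45 20 * * 1-5 /usr/bin/python3 /opt/alphanume/pull.py >> /var/log/alphanume/pull.log 2>&1 || echo "alphanume pull failed, see /var/log/alphanume/pull.log"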
For richer alerting — Slack, PagerDuty, or a dead man's switch that pages if the job never completes — wrap the script in a thin shell wrapper that posts to your webhook on failure and calls a heartbeat URL on success. That pattern scales to any number of jobs without changing the Python code.
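The text above describes a shell wrapper. The same success-heartbeat and failure-alert pattern can be sketched in Python instead, which is what the block below does. All names here (ping, run_with_alerts, and any URLs you pass in) are hypothetical; a Slack-style webhook expects a POST with a JSON body, so the plain GET in ping only suits heartbeat-style endpoints.

```python
import logging
import urllib.request

log = logging.getLogger(__name__)


def ping(url, timeout=10):
    """Best-effort GET to a heartbeat URL; logs and swallows any error."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except Exception as exc:  # alerting must not mask the job's own result
        log.warning("could not reach %s: %s", url, exc)


def run_with_alerts(job, on_success, on_failure):
    """Run job(); call on_success() if it returns, on_failure(exc) if it raises.

    Returns a process exit code: 0 on success, 1 on failure.
    """
    try:
        job()
    except Exception as exc:
        log.exception("job failed")
        on_failure(exc)
        return 1
    on_success()
    return 0
```

At the bottom of the pull script this could be wired up as sys.exit(run_with_alerts(job, on_success, on_failure)), where job calls pull_next_day_movers and the two callbacks call ping with your own heartbeat and alert URLs. Because the job and the callbacks are passed in, the wrapper can be reused for other pulls unchanged.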
Verifying the pull and alternatives for heavier needs
After the first run, confirm the file landed and the row count looks right before relying on it downstream. A short check is enough:
import pandas as pd
from pathlib import Path
from datetime import date

today = date.today()
path = (
    Path("/data/next-day-movers")
    / str(today.year)
    / f"{today.month:02d}"
    / f"{today.isoformat()}.parquet"
)
df = pd.read_parquet(path)
print(df.shape, df.dtypes)
print(df.head(3))
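Because the output is one file per date, a gap in the schedule shows up as a missing file. The sketch below lists weekdays in a date range that have no partition, assuming the year/month/date.parquet layout used in this tutorial. The name missing_partitions is hypothetical, and market holidays are not modeled, so they may be reported as missing.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_partitions(root, start, end):
    """Return weekdays in [start, end] with no <year>/<month>/<date>.parquet file."""
    root = Path(root)
    missing = []
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 .. Friday=4, matching cron's 1-5
            path = (
                root / str(day.year) / f"{day.month:02d}"
                / f"{day.isoformat()}.parquet"
            )
            if not path.exists():
                missing.append(day)
        day += timedelta(days=1)
    return missing
```

Since the pull is idempotent, any date this returns can be passed straight back to pull_next_day_movers to backfill it, provided the endpoint still serves that date.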
Cron is the right tool for simple scheduled pulls on a single server. If your needs grow — fan-out across dozens of datasets, dependency ordering, retry dashboards, or distributed execution — reach for a heavier orchestrator. systemd timers are a modern cron replacement on Linux with better logging and dependency management. GitHub Actions with a schedule: trigger works well for lightweight pulls that run in CI and commit results to a repo. Apache Airflow or Prefect make sense when you have a DAG of tasks with complex dependencies and want a UI to inspect every run. Start with cron — it is the lowest-friction path from working script to reliable schedule — and migrate when the complexity warrants it.