5. What happened to your name?

A small interactive that walks 144 years of US baby-name data (1880–2023, courtesy of the Social Security Administration) and lets you watch any name rise, peak, and fade. Type your name. The chart sweeps from 1880 to today; you’ll see when your name was its most popular, and which other names rose and fell with it.

The Python here is the boring serious part — load CSVs, normalize, find nearest-trajectory neighbors. The fun part is in the JS widget at the bottom.

import sys, pathlib
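# Pick the first ancestor that actually holds data/names. Testing only
# Path("..").exists() is always true, so the "../.." fallback never ran.
ROOT = next(
    (p.resolve() for p in (pathlib.Path(".."), pathlib.Path("../..")) if (p / "data" / "names").exists()),
    pathlib.Path("..").resolve(),
)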
sys.path.insert(0, str(ROOT))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from widgets.name_explorer.widget import NameExplorer

DATA = ROOT / "data" / "names"
print("data dir:", DATA)
print("yob files:", len(list(DATA.glob("yob*.txt"))))
data dir: /Users/sanjay/seed/anywidget-experiments/data/names
yob files: 144

Step 1 — load 144 years of CSVs

Each yob{year}.txt is a three-column CSV with no header row: name, sex, count. We concatenate them all into one tidy DataFrame and add a year column parsed from the filename.

frames = []
for path in sorted(DATA.glob("yob*.txt")):
    year = int(path.stem.replace("yob", ""))
    df = pd.read_csv(path, names=["name", "sex", "count"])
    df["year"] = year
    frames.append(df)

raw = pd.concat(frames, ignore_index=True)
print(f"{len(raw):,} rows across {raw.year.nunique()} years")
raw.head()
2,117,219 rows across 144 years
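
As a small illustration of why the loader passes names=, here is a sketch on an in-memory sample. The three records are hypothetical stand-ins in the name,sex,count layout, not a claim about the real file contents.

```python
import io

import pandas as pd

# Hypothetical sample in the headerless name,sex,count layout.
sample = io.StringIO("Mary,F,7065\nAnna,F,2604\nJohn,M,9655\n")

# names= supplies the column labels. Without it, pandas would treat the
# first record as the header row and that record would not appear as data.
df = pd.read_csv(sample, names=["name", "sex", "count"])
df["year"] = 1880  # the year comes from the filename, not the file body

print(df)
```

The year is attached after parsing because nothing inside the file records it.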

Step 2 — pivot to a (name|sex) × year matrix, top 5,000 by lifetime count

We cap the matrix at the 5,000 name|sex keys with the highest lifetime counts, which keeps the page light and the cosine-similarity matrix small. That covers the names that roughly 99% of US-born people would search for; rarer names are silently absent from the widget.

raw["key"] = raw["name"] + "|" + raw["sex"]

YEAR_START, YEAR_END = 1880, int(raw["year"].max())
years = list(range(YEAR_START, YEAR_END + 1))

totals = raw.groupby("key")["count"].sum().sort_values(ascending=False)
TOP_N = 5000
top_keys = totals.head(TOP_N).index.tolist()

mat = (
    raw[raw["key"].isin(top_keys)]
    .pivot_table(index="key", columns="year", values="count", fill_value=0, aggfunc="sum")
    .reindex(index=top_keys, columns=years, fill_value=0)
    .astype(np.int32)
)
print("matrix shape:", mat.shape)
mat.head()
matrix shape: (5000, 144)
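
One way to sanity-check a cutoff like this is to measure the share of recorded births that the kept keys account for. The sketch below does that on a made-up frame; raw_toy, totals_toy and top_n are illustrative names, and share of births is only a proxy for what people would search for.

```python
import pandas as pd

# Toy stand-in for the notebook's `raw` frame (all values are made up).
raw_toy = pd.DataFrame({
    "key":   ["Mary|F", "Mary|F", "John|M", "Zelda|F", "Quill|M"],
    "year":  [1880, 1881, 1880, 1880, 1881],
    "count": [700, 690, 960, 30, 5],
})

# Lifetime total per key, largest first (same idea as `totals` above).
totals_toy = raw_toy.groupby("key")["count"].sum().sort_values(ascending=False)

top_n = 2  # hypothetical cutoff; the notebook uses TOP_N = 5000
coverage = totals_toy.head(top_n).sum() / totals_toy.sum()
print(f"top {top_n} keys cover {coverage:.1%} of recorded births")
```

On the real data, the same two lines with totals and TOP_N would give the births-weighted coverage of the 5,000-key cap.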

Step 3 — what does a trajectory look like?

Just one static plot for context, then we hand off to the interactive widget.

fig, ax = plt.subplots(figsize=(8, 3.2))
for k in ["Mary|F", "Jennifer|F", "Olivia|F"]:
    if k in mat.index:
        ax.plot(years, mat.loc[k].values, label=k.split("|")[0], linewidth=2)
ax.set_yscale("log")
ax.set_xlabel("year")
ax.set_ylabel("count (log)")
ax.legend()
ax.set_title("Three names you can probably picture")
plt.tight_layout()
plt.show()
<Figure size 800x320 with 1 Axes>

Step 4 — find each name’s peak year and rank

The widget shows a “peaked in 1921 · #1” pill. For each name we compute the year with the highest count, then its rank in that year among names of the same sex. A rank of 0 means no rank was found.

# Year-of-max per name
peak_year = mat.values.argmax(axis=1)
peak_count = mat.values.max(axis=1)
peak_year_actual = np.array(years)[peak_year]

# Within-year rank by sex (1 = most popular for that sex that year).
# A stable sort makes the order of tied counts reproducible.
rank_lookup = {}
for (yr, sex), sub in raw.groupby(["year", "sex"]):
    order = sub.sort_values("count", ascending=False, kind="stable")
    rank_lookup[(yr, sex)] = {
        name: idx + 1 for idx, name in enumerate(order["name"])
    }

peaks = {}
for i, key in enumerate(mat.index):
    name, sex = key.split("|")
    yr = int(peak_year_actual[i])
    rank = rank_lookup.get((yr, sex), {}).get(name, None)
    peaks[key] = {
        "year": yr,
        "rank": int(rank) if rank else 0,
        "count": int(peak_count[i]),
    }
list(peaks.items())[:3]
[('James|M', {'year': 1947, 'rank': 1, 'count': 94761}), ('John|M', {'year': 1947, 'rank': 3, 'count': 88320}), ('Robert|M', {'year': 1947, 'rank': 2, 'count': 91654})]
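
The same ranks can also be computed without Python-level loops. This is a sketch of an alternative, not the notebook's method: it uses pandas' grouped rank on a made-up frame, and with method="min" tied names share a rank, which differs from the positional ranks above.

```python
import pandas as pd

# Toy frame in the shape of `raw`; the counts are illustrative only.
raw_toy = pd.DataFrame({
    "name":  ["Mary", "Anna", "Emma", "John", "James"],
    "sex":   ["F", "F", "F", "M", "M"],
    "year":  [1880] * 5,
    "count": [7065, 2604, 2604, 9655, 5927],
})

# Rank within each (year, sex) group, 1 = highest count.
ranked = raw_toy.assign(
    rank=raw_toy.groupby(["year", "sex"])["count"]
    .rank(method="min", ascending=False)
    .astype(int)
)

# Series indexed by (year, sex, name) for direct lookups.
rank_of = ranked.set_index(["year", "sex", "name"])["rank"].sort_index()
print(rank_of)
```

Switching to method="first" would break ties by row order instead, which is closer to ranking by position in a sorted frame.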

Step 5 — trajectory twins via cosine similarity on z-scored shapes

We z-score each name’s series (subtract its mean, divide by its standard deviation) so that the similarity between, say, Mary and Margaret captures the shape of the curve, not its magnitude. We then compute cosine similarity on the normalized matrix and keep the 4 nearest neighbors per name.
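
A toy check of what z-scoring buys, using made-up series. It shows two things: rescaling a series leaves its similarity unchanged, and cosine similarity between z-scored rows equals the Pearson correlation of the raw rows. The helper names zscore_rows and cosine are illustrative.

```python
import numpy as np


def zscore_rows(X):
    """Center each row and scale it to unit standard deviation."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0  # guard against constant rows
    return (X - mu) / sd


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


small = np.array([1.0, 3.0, 9.0, 4.0, 2.0])
big = 1000.0 * small + 50.0  # same shape, far larger magnitude
other = np.array([9.0, 4.0, 2.0, 1.0, 3.0])

Z = zscore_rows(np.vstack([small, big, other]))
print("small vs big  :", cosine(Z[0], Z[1]))
print("small vs other:", cosine(Z[0], Z[2]))
print("pearson       :", np.corrcoef(small, other)[0, 1])
```

The first value comes out at 1 up to floating-point error, and the last two agree, so the twins are in effect the most strongly correlated trajectories.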

X = mat.values.astype(np.float32)
mu = X.mean(axis=1, keepdims=True)
sd = X.std(axis=1, keepdims=True)
sd[sd == 0] = 1.0
Z = (X - mu) / sd

# unit-normalize rows for cosine
norms = np.linalg.norm(Z, axis=1, keepdims=True)
norms[norms == 0] = 1.0
Zn = (Z / norms).astype(np.float32)

# Cosine matrix and top-K, computed in chunks to be memory-friendly. We
# also drop pairs where the bare name (sans sex) matches the source,
# so "Jennifer F" doesn't list "Jennifer M" as its twin.
K = 4
n = Zn.shape[0]
keys = mat.index.tolist()
bare = np.array([k.split("|")[0].lower() for k in keys])
twins = {}
chunk = 1000
for start in range(0, n, chunk):
    sim = Zn[start : start + chunk] @ Zn.T  # (chunk, n)
    # Mask every column whose bare name matches the row's: this removes
    # the self-match and the other-sex entry in one vectorized step.
    same_name = bare[start : start + chunk, None] == bare[None, :]
    sim[same_name] = -np.inf
    # Pull a few extra candidates so we have headroom after masking
    cand = np.argpartition(-sim, K + 2, axis=1)[:, : K + 2]
    for i in range(sim.shape[0]):
        row_idx = cand[i]
        order = np.argsort(-sim[i, row_idx])
        twins[keys[start + i]] = [keys[j] for j in row_idx[order][:K]]
print("twin lookup built for", len(twins), "names")
twins["Mary|F"]
twin lookup built for 5000 names
['Martha|F', 'Ralph|M', 'Willie|M', 'Howard|M']
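
The top-K selection relies on np.argpartition, which finds the K best columns without sorting the whole row. Here is a minimal sketch on random scores, checked against a full sort. The array is synthetic and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sim = rng.random((3, 10))  # hypothetical similarity rows
K = 4

# After partitioning on index K - 1, the first K columns hold the K
# largest scores of each row, in no particular order.
cand = np.argpartition(-sim, K - 1, axis=1)[:, :K]

# Sort only those K candidates by descending score.
cand_scores = np.take_along_axis(sim, cand, axis=1)
order = np.argsort(-cand_scores, axis=1)
topk = np.take_along_axis(cand, order, axis=1)

# Reference answer from a full sort of every row.
ref = np.argsort(-sim, axis=1)[:, :K]
print(np.array_equal(topk, ref))
```

Because the notebook masks before partitioning, a -inf entry can only reach the candidates when a row has fewer valid neighbors than requested.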

Step 6 — package up for the widget

Trajectories ship as plain int arrays keyed by Name|Sex. Era captions come from a hand-curated CSV in data/names/. Widget state ends up around 2–3 MB.

trajectories = {key: mat.loc[key].astype(int).tolist() for key in mat.index}

era_df = pd.read_csv(DATA / "era_captions.csv")
era_captions = {str(int(r.decade)): r.caption for r in era_df.itertuples()}

name_index = sorted(mat.index.tolist(), key=lambda k: (k.split("|")[0].lower(), k.split("|")[1]))

print(f"trajectories: {len(trajectories):,}  ·  twins: {len(twins):,}  ·  peaks: {len(peaks):,}")
print(f"era captions: {list(era_captions.items())[:2]} ...")
trajectories: 5,000  ·  twins: 5,000  ·  peaks: 5,000
era captions: [('1880', 'the gilded age — Mary, John, William ruled'), ('1890', 'close of the frontier; Helen and Charles rose')] ...
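
A rough way to estimate the payload size is to serialize a stand-in of the same dimensions. The sketch assumes the widget state travels as JSON, which the notebook does not state, and it uses random counts, so the result is indicative only.

```python
import json

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the notebook's dimensions: 5,000 keys x 144 years.
fake = {
    f"Name{i}|F": rng.integers(0, 5000, size=144).tolist()
    for i in range(5000)
}

# Compact separators avoid spending bytes on whitespace.
payload = json.dumps(fake, separators=(",", ":"))
size_mb = len(payload.encode("utf-8")) / 1e6
print(f"{size_mb:.2f} MB")
```

Running the same measurement on the real trajectories dict would show how close the actual state is to the figure quoted above.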

Step 7 — render

Type a name and press Enter (or pick from the dropdown). Try Claude for a small surprise. Optionally type a birth year to anchor the chart.

widget = NameExplorer(
    trajectories=trajectories,
    peaks=peaks,
    twins=twins,
    era_captions=era_captions,
    name_index=name_index,
    selected="Mary|F",
    year_start=YEAR_START,
    year_end=YEAR_END,
)
widget

Data: Social Security Administration, public domain. The widget code lives in widgets/name_explorer/. The interactive page you’re looking at is built statically, with no Python kernel running underneath.