# Engineering Audit — media.aykhan.net

_Audited as a senior/staff engineering review. Scope: the whole repository, with
emphasis on the only real source file, `generate_index.py`._

## 1. Architecture summary

`media.aykhan.net` is a **static media host** served by GitHub Pages (images,
video, audio, icons referenced across aykhan.net). No server, database, auth, or
runtime — every file is a committed static asset.

A single Python script, `generate_index.py`, has two jobs that run on every push
(via `.github/workflows/generate_index.yml`, which commits the result back):

1. **`generate_index_html('.')`** — walks the entire repo and writes a browsable
   `index.html` directory listing into every folder. Hand-written listings are
   protected via an `<!-- Auto-generated by Python script -->` marker check
   (`is_auto_generated`).
2. **`generate_media_index()`** — produces the two public, machine-readable files
   consumed by the [Terminal Gateway](https://aykhan.net/terminal):
   `media-index.json` (metadata for every indexed media file) and
   `build-report.json`. This half is **whitelist-based**: only top-level folders
   in `WHITELIST_DIRS = ["assets", "thumbnails", "notion-pages", "achievements",
   "books"]` are scanned, only extensions in `MEDIA_TYPES` (images/video/audio)
   are indexed, and `DENY_DIR_NAMES` (`private`, `drafts`, `secrets`, …) are
   pruned. PDFs, HTML, and scripts are excluded by virtue of not being in
   `MEDIA_TYPES`.
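
The whitelist, extension, and deny-list rules in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in `generate_index.py`: `iter_media_files` is a hypothetical name, the `MEDIA_TYPES` entries are assumed (the audit names only the categories), and `DENY_DIR_NAMES` holds only the three names quoted above.

```python
import os

# Quoted from the audit.
WHITELIST_DIRS = ["assets", "thumbnails", "notion-pages", "achievements", "books"]
# Only the three names the audit spells out; the elided rest is not guessed.
DENY_DIR_NAMES = {"private", "drafts", "secrets"}
# Assumed extension -> type map; the real MEDIA_TYPES members are not listed.
MEDIA_TYPES = {".png": "image", ".jpg": "image", ".mp4": "video", ".mp3": "audio"}


def iter_media_files(root="."):
    """Yield (relative_path, media_type) for indexable files (hypothetical helper)."""
    for top in WHITELIST_DIRS:
        top_path = os.path.join(root, top)
        if not os.path.isdir(top_path):
            continue  # a whitelisted folder that does not exist is skipped
        for dirpath, dirs, files in os.walk(top_path):
            # Assigning to dirs[:] prunes denied folders so os.walk never enters them.
            dirs[:] = sorted(d for d in dirs if d not in DENY_DIR_NAMES)
            for name in sorted(files):
                ext = os.path.splitext(name)[1].lower()
                if ext in MEDIA_TYPES:  # PDFs, HTML, scripts fall out here
                    rel = os.path.relpath(os.path.join(dirpath, name), root)
                    yield rel.replace(os.sep, "/"), MEDIA_TYPES[ext]
```

Because the walk starts from each whitelisted folder rather than from `.`, anything outside those folders is never visited at all.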

**Data flow:** files on disk → `generate_index.py` → `media-index.json` /
`build-report.json` → fetched read-only by the terminal. Only
`path/name/extension/type/sizeBytes/url` metadata is published; file contents are
never inlined.
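
As an illustration of the published metadata, one entry could be assembled like this. The field names come from the list above; everything else is an assumption made for the sketch: the `build_entry` helper, the `BASE_URL` value, whether `extension` keeps its leading dot, and the percent-encoding of the URL.

```python
import os
from urllib.parse import quote

# Assumed public base URL; the audit does not state how `url` is built.
BASE_URL = "https://media.aykhan.net/"


def build_entry(root, rel_path, media_type):
    """Return metadata for one file; the file's contents are never read."""
    name = os.path.basename(rel_path)
    full_path = os.path.join(root, *rel_path.split("/"))
    return {
        "path": rel_path,
        "name": name,
        "extension": os.path.splitext(name)[1].lower(),  # assumed: dot included
        "type": media_type,
        "sizeBytes": os.path.getsize(full_path),  # stat only, no file read
        "url": BASE_URL + quote(rel_path),  # quote() keeps "/" by default
    }
```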

The design intent (read-only, public-metadata-only, whitelist-based) is sound.
The issues below are about **determinism, the listing/whitelist mismatch, and
output-injection hardening**, not the core security model.

## 2. Findings

### Critical
_None._ The security model (read-only, whitelist media indexing by extension,
deny-list) holds; no secret-exposure or injection path is reachable with the
current repo contents.

### High

| ID | File | Issue | Why it matters |
|----|------|-------|----------------|
| H1 | `generate_index.py` `generate_index_html` | **Non-deterministic row order.** `os.walk` yields `dirs`/`files` in arbitrary filesystem order and they are emitted unsorted. | Every CI run can reshuffle table rows, producing large meaningless diffs and an auto-commit on every push even when nothing changed (this repo has 1,400+ files, so churn is large). |
| H2 | `generate_index.py` `format_date` | **Machine-local timezone.** `dt_utc.astimezone()` formats "Last Modified" in the runner's local time with no TZ label. | Output depends on _where_ the script runs (dev vs GitHub Actions UTC). Combined with H1 this guarantees churn, and the displayed time is ambiguous (no zone shown). |
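
A minimal sketch of what deterministic output for H1 and H2 could look like. The helper names `walk_sorted` and `format_date_utc` are hypothetical, and the date format string is an assumption, since the audit does not quote the existing one.

```python
import os
from datetime import datetime, timezone


def walk_sorted(root="."):
    """os.walk with stable ordering: same tree in, same row order out."""
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # in-place, so the traversal order is stable too
        yield dirpath, list(dirs), sorted(files)


def format_date_utc(timestamp):
    """Render an mtime in UTC with an explicit zone label."""
    dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # No astimezone() call: the result no longer depends on the runner's zone.
    return dt_utc.strftime("%Y-%m-%d %H:%M:%S") + " UTC"
```

Sorting `dirs` in place matters: `os.walk` reads that same list when deciding where to descend next, so the order of generated folders becomes stable as well as the order of rows within each one.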

### Medium

| ID | File | Issue | Why it matters |
|----|------|-------|----------------|
| M1 | `generate_index.py` `EXCLUDED_DIRS` vs `DENY_DIR_NAMES` | **The HTML listing is _not_ whitelist-based.** `generate_index_html` walks everything from `.`, excluding only `.git`, `.github`, and `__pycache__`. The documented "nothing sensitive published by accident" guarantee covers only the JSON indexer. | If a `private/`, `drafts/`, or `secrets/` folder is ever added, the JSON index would correctly skip it, but the browsable `index.html` tree would still list every file in it. This is a defense-in-depth gap against the stated security posture. |
| M2 | `generate_index.py` `generate_index_html` | **No HTML escaping** of file/dir names interpolated into rows, `href`s, `<title>`, and `<h1>`. | A filename containing `<`, `>`, `&`, or `"` would break the markup or inject arbitrary HTML into the page. Currently low-risk (no such names exist; verified), but the output construction is unsafe. |
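
A hedged sketch of the two Medium-severity hardenings. `prune_listing_dirs` and `render_row` are hypothetical helpers, the row markup is invented for illustration, and `DENY_DIR_NAMES` again contains only the three names the audit quotes. Percent-encoding the `href` with `urllib.parse.quote` goes one step beyond the `html.escape` fix this audit proposes and is included here as an assumption.

```python
import html
from urllib.parse import quote

EXCLUDED_DIRS = {".git", ".github", "__pycache__"}
DENY_DIR_NAMES = {"private", "drafts", "secrets"}  # partial: names quoted in the audit
# M1: the listing skips everything the JSON indexer's deny-list skips.
LISTING_EXCLUDED = EXCLUDED_DIRS | DENY_DIR_NAMES


def prune_listing_dirs(dirs):
    """Return the directory names the HTML listing may show, sorted."""
    return sorted(d for d in dirs if d not in LISTING_EXCLUDED)


def render_row(name):
    """M2: build one table row with the name escaped for each context."""
    href = html.escape(quote(name), quote=True)  # URL-encode, then attribute-escape
    label = html.escape(name)  # text node: &, <, > (and quotes) become entities
    return f'<tr><td><a href="{href}">{label}</a></td></tr>'
```

Inside an `os.walk` loop the result would be assigned back with `dirs[:] = prune_listing_dirs(dirs)` so that excluded folders are neither listed nor descended into.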

### Low

| ID | File | Issue | Why it matters / recommendation |
|----|------|-------|---------------------------------|
| L1 | `generate_index.py` | `import time` is unused (dead import). | Remove. Verified no `time.*` usage. |
| L2 | `generate_index.py` `to_header_case` / `to_title_case` | Docstrings say "title case" but the functions only lowercase + insert separators. | Misleading; minor. Left as-is to stay surgical (cosmetic). |
| L3 | `generate_index.py` `format_size` | Falls through to `None` for sizes ≥ 1024 TB. | Impossible at this scale; guarding would be speculative. Noted only. |
| L4 | `generate_index.py` | `generate_index_html` and `generate_media_index` each walk the tree separately. | Negligible; not worth coupling the two passes. |
| L5 | both repos | `generate_index.py` shares ~250 identical lines with `data.aykhan.net`. | Real duplication, but the repos deploy independently to separate domains. A shared module would add cross-repo coupling/deploy complexity — **left intentionally.** |
| L6 | `generate_index.py` | No tests. | Addressed: added `test_generate_index.py`. |

## 3. Recommended fixes (safe implementation order)

1. **L1** — remove `import time` (zero-risk cleanup). _verify:_ script still runs.
2. **H1** — sort `dirs` and `files` before emitting rows. _verify:_ rows appear in lexicographic order; re-running no longer reshuffles.
3. **H2** — make `format_date` emit UTC with an explicit `UTC` suffix. _verify:_ output is identical regardless of machine TZ.
4. **M2** — `html.escape` names and derived title/header text. _verify:_ a `<`-containing name renders as `&lt;`.
5. **M1** — extend the listing's excluded-dirs to mirror `DENY_DIR_NAMES`. _verify:_ a `secrets/` folder is not listed.
6. **L6** — add `test_generate_index.py` (stdlib `unittest`, no new deps) covering ordering, UTC formatting, whitelist/deny exclusion, and escaping.
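
The H2 verification step ("identical regardless of machine TZ") could be expressed as a stdlib `unittest` case along these lines. This is not the content of `test_generate_index.py`: the `format_date` below is a stand-in with an assumed format string, and the test relies on `time.tzset`, which exists only on Unix.

```python
import os
import time
import unittest
from datetime import datetime, timezone


def format_date(timestamp):
    """Stand-in for the fixed helper (assumed format): always renders UTC."""
    dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return dt_utc.strftime("%Y-%m-%d %H:%M:%S") + " UTC"


@unittest.skipUnless(hasattr(time, "tzset"), "time.tzset is Unix-only")
class FormatDateTimezoneTest(unittest.TestCase):
    def render_in_zone(self, tz_name, timestamp):
        """Call format_date while the process pretends to be in tz_name."""
        previous = os.environ.get("TZ")
        os.environ["TZ"] = tz_name
        time.tzset()
        try:
            return format_date(timestamp)
        finally:
            # Restore the original zone so other tests are unaffected.
            if previous is None:
                os.environ.pop("TZ", None)
            else:
                os.environ["TZ"] = previous
            time.tzset()

    def test_output_ignores_machine_timezone(self):
        in_utc = self.render_in_zone("UTC", 0)
        in_tokyo = self.render_in_zone("Asia/Tokyo", 0)
        self.assertEqual(in_utc, in_tokyo)
        self.assertEqual(in_utc, "1970-01-01 00:00:00 UTC")
```

A file containing this case can be run with `python -m unittest`, which needs no dependencies beyond the standard library.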

**Behavior change to note:** the "Last Modified" column switches from machine-local
time to UTC (now suffixed `UTC`). This affects human-facing display only: no JSON API
field, schema, URL, or deployment changes. It is the H2 fix; together with the H1
sort, it removes the churn those two findings describe.

**Not changing** (would exceed scope / violate "minimal, surgical"): cross-repo
deduplication (L5), the public JSON schema, the `MEDIA_TYPES` whitelist, the
workflow, or the deployment model.
