Local PDF Research Workflow

Purpose

This workflow keeps Zotero as the library of record while making local PDF attachments usable for Codex research runs without writing back to Zotero.

Step 1: Resolve Zotero attachments from the local .bib

Use the resolver in the bounded-graph skill draft:

.venv/bin/python .skill_drafts/bounded-graph-literature-research/scripts/resolve_zotero_attachments.py \
  --bib docs/project_QEM-QEC/10_conjectures.bib \
  --storage-root '/Users/trainerblade/Library/CloudStorage/GoogleDrive-ctchu@uchicago.edu/My Drive/02_Apps_Backups/Zotero/storage' \
  --format md

This uses:

  • the BibTeX snapshot for citation keys and titles,
  • the Zotero desktop Local API when available,
  • the Zotero Web API as a fallback,
  • the local storage root only when path reconstruction is still needed.

When the desktop Local API is enabled, attachment items expose a direct local file URL through Zotero metadata, so the resolver can usually skip storage-path reconstruction entirely.
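Turning that local file URL into a usable filesystem path is a one-liner with the standard library. This is a minimal sketch, not code from the resolver script itself; the function name is hypothetical:

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def file_url_to_path(file_url: str) -> Path:
    """Convert a file:// URL from attachment metadata into a local path.

    Strips the scheme and percent-decodes the path component, so URLs
    with encoded spaces resolve to the real on-disk filename.
    """
    return Path(unquote(urlparse(file_url).path))
```

With a URL such as `file:///…/storage/ABCD1234/My%20Paper.pdf`, this yields the decoded path `…/storage/ABCD1234/My Paper.pdf`.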

For imported attachments without a direct local file URL, the reconstructed path is:

\[ \text{storage-root} / \text{attachment-key} / \text{filename} \]
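The reconstruction step above can be sketched as follows. This is an illustrative helper, not the resolver's actual implementation; it assumes the standard Zotero layout of one directory per attachment key and picks the first PDF found there:

```python
from pathlib import Path
from typing import Optional


def resolve_attachment_path(storage_root: str, attachment_key: str) -> Optional[Path]:
    """Reconstruct the local path of an imported Zotero attachment.

    Zotero stores each imported attachment in a directory named after
    its attachment key, keeping the file's original name. Returns None
    when the key directory or a PDF inside it is missing.
    """
    attachment_dir = Path(storage_root) / attachment_key
    if not attachment_dir.is_dir():
        return None
    pdfs = sorted(attachment_dir.glob("*.pdf"))
    return pdfs[0] if pdfs else None
```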

Step 2: Build a paper bundle before theorem extraction

For a resolved local PDF, create a bundle with per-page text and page images:

.venv/bin/python .skill_drafts/bounded-graph-literature-research/scripts/prepare_paper_bundle.py \
  '/absolute/path/to/paper.pdf' \
  --output-dir tmp/paper-bundles/paper-name

The bundle contains:

  • text.txt: full extracted text with page separators,
  • pages/page-0001.txt, etc.: per-page extracted text,
  • pages/page-0001.png, etc. when a renderer is available,
  • manifest.json: page counts, parser choice, and next-step guidance.

Dependencies:

  • minimum: pypdf for text extraction,
  • recommended: PyMuPDF for page-image rendering,
  • fallback renderer: pdftoppm from Poppler,
  • current workspace recommendation: use /Users/trainerblade/Documents/02_myDocs/.venv/bin/python.
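A downstream research run consumes the bundle layout described above. Here is a minimal stdlib-only sketch of loading one; the manifest field names in the example are illustrative, since only the layout (manifest.json plus pages/page-NNNN.txt) is assumed:

```python
import json
from pathlib import Path


def load_bundle(bundle_dir: str) -> dict:
    """Load a paper bundle: the manifest plus per-page extracted text.

    Assumes manifest.json at the bundle root and pages/page-NNNN.txt
    for each page; page numbers are parsed from the filenames.
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    pages = {
        int(p.stem.split("-")[1]): p.read_text()
        for p in sorted((root / "pages").glob("page-*.txt"))
    }
    return {"manifest": manifest, "pages": pages}
```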

Best parsing strategy for math-heavy papers

Use this order of preference:

  1. arXiv HTML or TeX source when available.
  2. Publisher HTML when math rendering is preserved cleanly.
  3. Local PDF bundle with page images plus extracted text.

Use extracted text for:

  • search,
  • rough navigation,
  • candidate theorem discovery,
  • keyword and notation lookup.

Use page images or source-formatted HTML as the source of truth for:

  • displayed equations,
  • theorem and lemma statements,
  • notation-heavy definitions,
  • any passage where a missing superscript, subscript, or symbol would change meaning.
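The two roles combine naturally: search the extracted text to find candidate pages, then read those pages' PNG images for the exact statements. A rough sketch of the discovery half, assuming the per-page text layout from Step 2 (the keyword list is a heuristic, not exhaustive):

```python
import re
from pathlib import Path

# Heuristic pattern for numbered theorem-like statements, e.g. "Theorem 3.1".
THEOREM_RE = re.compile(
    r"\b(Theorem|Lemma|Proposition|Corollary|Definition)\s+\d",
    re.IGNORECASE,
)


def candidate_theorem_pages(pages_dir: str) -> list:
    """Return page numbers whose extracted text likely contains a theorem.

    The matching pages' page-NNNN.png images should then be read as the
    source of truth, since extraction can drop symbols and subscripts.
    """
    hits = []
    for page_file in sorted(Path(pages_dir).glob("page-*.txt")):
        if THEOREM_RE.search(page_file.read_text()):
            hits.append(int(page_file.stem.split("-")[1]))
    return hits
```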

Skill metadata

No installed skill update is strictly required for the workflow to function. The scripts above are sufficient.

Skill-description updates are still useful for discoverability:

  • bounded-graph-literature-research should mention local .bib snapshots and local PDF attachment resolution.
  • pdf should mention theorem-heavy and equation-heavy scientific papers, not only layout review.

The draft bounded-graph skill in .skill_drafts/ has been updated accordingly. Mirroring those changes into the globally installed skill is a separate step.