Local PDF Research Workflow

Purpose

This workflow keeps Zotero as the library of record while making local PDF attachments usable for Codex research runs without writing back to Zotero.

Step 1: Resolve Zotero attachments from the local .bib

Use the resolver in the bounded-graph skill draft:

.venv/bin/python .skill_drafts/bounded-graph-literature-research/scripts/resolve_zotero_attachments.py \
  --bib docs/project_QEM-QEC/10_conjectures.bib \
  --storage-root '/Users/trainerblade/Library/CloudStorage/GoogleDrive-ctchu@uchicago.edu/My Drive/02_Apps_Backups/Zotero/storage' \
  --format md

This uses:

  • the BibTeX snapshot for citation keys and titles,
  • the Zotero desktop Local API when available,
  • the Zotero Web API as a fallback,
  • the local storage root only when path reconstruction is still needed.

When the desktop Local API is enabled, attachment items expose a direct local file URL through Zotero metadata, so the resolver can usually skip storage-path reconstruction entirely.
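Turning that local file URL into a usable filesystem path is a one-liner with the standard library. This is a minimal sketch, not code from the resolver script itself; the function name is hypothetical:

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def file_url_to_path(file_url: str) -> Path:
    """Convert a file:// URL from attachment metadata into a local path.

    Strips the scheme and percent-decodes the path component, so URLs
    with encoded spaces resolve to the real on-disk filename.
    """
    return Path(unquote(urlparse(file_url).path))
```

With a URL such as `file:///…/storage/ABCD1234/My%20Paper.pdf`, this yields the decoded path `…/storage/ABCD1234/My Paper.pdf`.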

For imported attachments without a direct local file URL, the reconstructed path is:

\[ \text{storage-root} / \text{attachment-key} / \text{filename} \]
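The reconstruction step above can be sketched as follows. This is an illustrative helper, not the resolver's actual implementation; it assumes the standard Zotero layout of one directory per attachment key and picks the first PDF found there:

```python
from pathlib import Path
from typing import Optional


def resolve_attachment_path(storage_root: str, attachment_key: str) -> Optional[Path]:
    """Reconstruct the local path of an imported Zotero attachment.

    Zotero stores each imported attachment in a directory named after
    its attachment key, keeping the file's original name. Returns None
    when the key directory or a PDF inside it is missing.
    """
    attachment_dir = Path(storage_root) / attachment_key
    if not attachment_dir.is_dir():
        return None
    pdfs = sorted(attachment_dir.glob("*.pdf"))
    return pdfs[0] if pdfs else None
```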

Step 2: Build a paper bundle before theorem extraction

For a resolved local PDF, create a bundle with per-page text and page images:

.venv/bin/python .skill_drafts/bounded-graph-literature-research/scripts/prepare_paper_bundle.py \
  '/absolute/path/to/paper.pdf' \
  --output-dir tmp/paper-bundles/paper-name

The bundle contains:

  • text.txt: full extracted text with page separators,
  • pages/page-0001.txt, etc.: per-page extracted text,
  • pages/page-0001.png, etc. when a renderer is available,
  • manifest.json: page counts, parser choice, and next-step guidance.

Dependencies:

  • minimum: pypdf for text extraction,
  • recommended: PyMuPDF for page-image rendering,
  • fallback renderer: pdftoppm from Poppler,
  • current workspace recommendation: use /Users/trainerblade/Documents/02_myDocs/.venv/bin/python.
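A downstream research run consumes the bundle layout described above. Here is a minimal stdlib-only sketch of loading one; the manifest field names in the example are illustrative, since only the layout (manifest.json plus pages/page-NNNN.txt) is assumed:

```python
import json
from pathlib import Path


def load_bundle(bundle_dir: str) -> dict:
    """Load a paper bundle: the manifest plus per-page extracted text.

    Assumes manifest.json at the bundle root and pages/page-NNNN.txt
    for each page; page numbers are parsed from the filenames.
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    pages = {
        int(p.stem.split("-")[1]): p.read_text()
        for p in sorted((root / "pages").glob("page-*.txt"))
    }
    return {"manifest": manifest, "pages": pages}
```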

Best parsing strategy for math-heavy papers

Use this order of preference:

  1. arXiv HTML or TeX source when available.
  2. Publisher HTML when math rendering is preserved cleanly.
  3. Local PDF bundle with page images plus extracted text.

Use extracted text for:

  • search,
  • rough navigation,
  • candidate theorem discovery,
  • keyword and notation lookup.

Use page images or source-formatted HTML as the source of truth for:

  • displayed equations,
  • theorem and lemma statements,
  • notation-heavy definitions,
  • any passage where a missing superscript, subscript, or symbol would change meaning.
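The two roles combine naturally: search the extracted text to find candidate pages, then read those pages' PNG images for the exact statements. A rough sketch of the discovery half, assuming the per-page text layout from Step 2 (the keyword list is a heuristic, not exhaustive):

```python
import re
from pathlib import Path

# Heuristic pattern for numbered theorem-like statements, e.g. "Theorem 3.1".
THEOREM_RE = re.compile(
    r"\b(Theorem|Lemma|Proposition|Corollary|Definition)\s+\d",
    re.IGNORECASE,
)


def candidate_theorem_pages(pages_dir: str) -> list:
    """Return page numbers whose extracted text likely contains a theorem.

    The matching pages' page-NNNN.png images should then be read as the
    source of truth, since extraction can drop symbols and subscripts.
    """
    hits = []
    for page_file in sorted(Path(pages_dir).glob("page-*.txt")):
        if THEOREM_RE.search(page_file.read_text()):
            hits.append(int(page_file.stem.split("-")[1]))
    return hits
```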

Skill metadata

No installed skill update is strictly required for the workflow to function. The scripts above are sufficient.

Skill-description updates are still useful for discoverability:

  • bounded-graph-literature-research should mention local .bib snapshots and local PDF attachment resolution.
  • pdf should mention theorem-heavy and equation-heavy scientific papers, not only layout review.

The draft bounded-graph skill in .skill_drafts/ has been updated accordingly. Mirroring those changes into the globally installed skill is a separate step.