This repository contains a prototype for captioning images, embedding text, and serving semantic image search results via an AWS Lambda function. It includes helper scripts for data ingestion, model training, and a minimal Chrome extension.
- `backend/` – Data preparation and training scripts.
  - `alttext-embed.py` – Generates captions for images with GPT-4-Vision and writes image embeddings.
  - `ingest.py` – Fetches ArcXP story revisions to build story–image pairs.
  - `copy_to_rds.py` – Bulk loads embeddings into a PostgreSQL database using pgvector.
  - `train.py` – Trains a LightGBM ranking model from the collected data.
- `lambda/` – AWS Lambda code used by the search API.
  - `handler.py` – Entry point that scores candidate images against a query.
  - `ranker.py` – Downloads the trained model and manages database connections.
- `extension/` – Simple Chrome extension that queries `/suggest` for lead-art recommendations.
- `tests/` – Unit tests for the handler and ranking helper.
- `data/` – Placeholder directory for generated embeddings and the trained model.
- Install dependencies

  ```bash
  pip install -r backend/requirements.txt
  ```
- Ingest story–image pairs

  ```bash
  scripts/run_ingest.sh --since 2024-01-01T00:00:00Z
  ```

  Set `ARC_API_URL` and `ARC_TOKEN` to connect to the Arc API.
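  As a rough sketch, `ingest.py` presumably issues authenticated requests along these lines; the endpoint path, query syntax, and response fields below are assumptions, not the script's actual values.

  ```python
  # Illustrative only: the real ingest.py may use different endpoints and
  # pagination. ARC_API_URL and ARC_TOKEN are the two environment
  # variables the step above requires.
  import os
  import requests

  def fetch_revisions(since: str) -> list[dict]:
      """Fetch story revisions updated after `since` (ISO-8601 timestamp)."""
      resp = requests.get(
          f"{os.environ['ARC_API_URL']}/content/v4/search/published",  # hypothetical path
          headers={"Authorization": f"Bearer {os.environ['ARC_TOKEN']}"},
          params={"q": f"last_updated_date:[{since} TO *]"},
          timeout=30,
      )
      resp.raise_for_status()
      return resp.json().get("content_elements", [])
  ```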
- Caption and embed images

  ```bash
  scripts/run_alttext.sh <image-id> [...]
  ```

  Requires `PHOTO_URL` and `PHOTO_TOKEN` for fetching image bytes. Results are written to `artifacts/images.csv`.
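  The caption-then-embed flow in `alttext-embed.py` is roughly the following, assuming the OpenAI Python SDK; the model names and prompt are assumptions, not the script's actual values.

  ```python
  # Sketch of the caption-then-embed flow. Model names and the prompt are
  # assumptions; see backend/alttext-embed.py for the real implementation.
  from openai import OpenAI

  client = OpenAI()

  def caption_and_embed(image_url: str) -> tuple[str, list[float]]:
      """Caption one image with GPT-4-Vision, then embed the caption text."""
      caption = client.chat.completions.create(
          model="gpt-4-vision-preview",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "Write one sentence of alt text for this image."},
                  {"type": "image_url", "image_url": {"url": image_url}},
              ],
          }],
      ).choices[0].message.content
      embedding = client.embeddings.create(
          model="text-embedding-3-small", input=caption
      ).data[0].embedding
      return caption, embedding
  ```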
- Load embeddings into PostgreSQL

  ```bash
  PG_DSN=postgresql://user:pass@host/db python backend/copy_to_rds.py
  ```
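  pgvector accepts vectors written as bracketed literals, so a minimal load loop might look like the sketch below; the table name and columns are assumptions, and the real script likely uses a bulk `COPY` rather than row-by-row inserts.

  ```python
  # Sketch of a pgvector load, assuming psycopg2 and a hypothetical
  # image_embeddings table; copy_to_rds.py may differ in schema and method.
  import os
  import psycopg2

  rows = [("img-1", "A city skyline at dusk", [0.12, -0.03, 0.88])]  # placeholder data

  with psycopg2.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
      for image_id, caption, vec in rows:
          cur.execute(
              "INSERT INTO image_embeddings (image_id, caption, embedding) "
              "VALUES (%s, %s, %s::vector)",
              (image_id, caption, str(vec)),  # pgvector parses '[0.12, -0.03, 0.88]'
          )
  ```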
- Train the ranking model

  ```bash
  scripts/run_train.sh
  ```

  Produces `artifacts/leadart_ranker.lgb`.
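  The core of `train.py` is presumably a LightGBM ranking objective; here is a sketch with placeholder data (the feature layout, labels, and parameters are assumptions).

  ```python
  # Sketch of LambdaMART-style training with LightGBM. Features, labels,
  # and per-story group sizes are placeholders, not the real pipeline's.
  import numpy as np
  import lightgbm as lgb

  X = np.random.rand(100, 8)             # candidate-image feature vectors
  y = np.random.randint(0, 2, size=100)  # 1 = image was chosen as lead art
  group = [10] * 10                      # ten candidate images per story (query)

  train_set = lgb.Dataset(X, label=y, group=group)
  booster = lgb.train(
      {"objective": "lambdarank", "metric": "ndcg"},
      train_set,
      num_boost_round=100,
  )
  booster.save_model("artifacts/leadart_ranker.lgb")
  ```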
- Run tests

  ```bash
  pytest
  ```
The Lambda function (`lambda/handler.py`) uses the trained model to rank candidates stored in the database. The Chrome extension demonstrates how search suggestions can be integrated into Arc Composer.
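In outline, the request path through the Lambda looks something like the sketch below; the event shape and the candidate lookup are assumptions standing in for the real `ranker.py` helpers.

```python
# Sketch of the Lambda scoring path. Event shape and helper behaviour are
# assumptions; see lambda/handler.py and lambda/ranker.py for the real code.
import json
import lightgbm as lgb

booster = lgb.Booster(model_file="/tmp/leadart_ranker.lgb")  # fetched at cold start

def fetch_candidates(query: str) -> list[dict]:
    """Stand-in for ranker.py's pgvector nearest-neighbour lookup."""
    return [{"image_id": "img-1", "features": [0.0] * 8}]

def handler(event, context):
    query = event["queryStringParameters"]["q"]
    candidates = fetch_candidates(query)
    scores = booster.predict([c["features"] for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return {
        "statusCode": 200,
        "body": json.dumps([c["image_id"] for c, _ in ranked[:10]]),
    }
```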