AutoPopulate 2.0: Per-table job management with enhanced status tracking #1303

dimitri-yatsenko · 2026-01-04T04:06:48Z

Summary

This PR implements AutoPopulate 2.0, a complete redesign of the job handling system for distributed computing workflows. It addresses the scalability limitations identified in #1258 and resolves the confusing limit vs max_calls behavior reported in #1203.

Key Changes

Per-table Job Management

Each dj.Computed/dj.Imported table gets its own hidden jobs table (~~table_name)
Jobs tables use native primary keys (no more hash-based addressing)
Rich status tracking: pending, reserved, success, error, ignore
Access via MyTable.jobs instead of schema-level schema.jobs

Job Class API

jobs.refresh() - Sync job queue with key_source
jobs.reserve() / jobs.complete() / jobs.error() - Status transitions
jobs.pending / jobs.reserved / jobs.errors / jobs.completed - Query properties
jobs.progress() - Status breakdown for dashboards
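
The status transitions listed above can be illustrated with a small in-memory model. This is a sketch only: ToyJobQueue is a hypothetical stand-in, not the Job class added in this PR, and it assumes that reserve() only succeeds on pending jobs and that complete()/error() only apply to reserved jobs.

```python
from collections import Counter


class ToyJobQueue:
    """Hypothetical in-memory stand-in for a per-table job queue."""

    def __init__(self):
        self._status = {}  # primary key tuple -> status string

    def refresh(self, key_source, done=()):
        """Add a pending job for each key that has neither a job nor a result."""
        for key in key_source:
            if key not in self._status and key not in done:
                self._status[key] = "pending"

    def reserve(self, key):
        """Claim a pending job; return False if it is not available."""
        if self._status.get(key) != "pending":
            return False
        self._status[key] = "reserved"
        return True

    def complete(self, key):
        self._transition(key, "success")

    def error(self, key):
        self._transition(key, "error")

    def _transition(self, key, new_status):
        # Assumption of this sketch: only reserved jobs can finish or fail.
        if self._status.get(key) != "reserved":
            raise ValueError(f"job {key!r} is not reserved")
        self._status[key] = new_status

    def progress(self):
        """Return a status -> count breakdown."""
        return dict(Counter(self._status.values()))


queue = ToyJobQueue()
queue.refresh([(1,), (2,), (3,)])
first_reserve = queue.reserve((1,))
second_reserve = queue.reserve((1,))  # already reserved by the first call
queue.complete((1,))
queue.reserve((2,))
queue.error((2,))
summary = queue.progress()
```

In a distributed setting the reservation step would need to be atomic on the database side; the toy model ignores concurrency entirely.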

FK-only Primary Key Validation

New Computed/Imported tables must have PKs composed entirely of FK references
Ensures 1:1 correspondence between jobs and target rows
Legacy tables with additional PK attributes continue to work with degraded granularity
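
A minimal sketch of the FK-only rule, assuming the primary key and the foreign keys declared in it are available as plain lists of attribute names. The function name and the attribute names are hypothetical; the PR itself identifies FK-derived attributes through the dependency graph.

```python
def non_fk_primary_attributes(primary_key, foreign_keys):
    """Return primary key attributes not contributed by any foreign key.

    primary_key: ordered list of attribute names.
    foreign_keys: list of attribute-name lists, one per foreign key
    declared in the primary key (hypothetical representation).
    """
    fk_attrs = {attr for fk in foreign_keys for attr in fk}
    return [attr for attr in primary_key if attr not in fk_attrs]


# A new-style table: every PK attribute comes from a foreign key.
ok = non_fk_primary_attributes(
    ["subject_id", "session"], [["subject_id", "session"]]
)

# A legacy-style table: "trial" is an extra PK attribute, so one job
# (keyed by subject_id, session) may correspond to many target rows.
legacy = non_fk_primary_attributes(
    ["subject_id", "session", "trial"], [["subject_id", "session"]]
)
```

An empty result means jobs and target rows correspond 1:1; a non-empty result is the "degraded granularity" case described for legacy tables.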

Hidden Job Metadata

Optional _job_start_time, _job_duration, _job_version columns for computed tables
Controlled via config.jobs.add_job_metadata
Provides computation provenance without cluttering the visible schema

NATURAL JOIN → USING Clause

Replaced NATURAL JOIN with explicit USING clauses throughout
Hidden attributes (prefixed with _) are excluded from join matching
Enables safe use of hidden attributes without join collisions
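
As a rough illustration of the idea (not the code in this PR), a USING clause can be built from the attribute names shared by both operands while skipping names that start with an underscore. The helper name and the choice to return an empty string when nothing is shared are assumptions of this sketch.

```python
def using_clause(left_attrs, right_attrs):
    """Build a USING clause from shared, non-hidden attribute names."""
    right = set(right_attrs)
    common = [
        name
        for name in left_attrs
        if name in right and not name.startswith("_")  # skip hidden attributes
    ]
    if not common:
        # No shared visible attributes; the caller decides what to do
        # (this sketch simply returns an empty string).
        return ""
    return "USING (" + ", ".join(f"`{name}`" for name in common) + ")"


clause = using_clause(
    ["subject_id", "session", "_job_start_time"],
    ["subject_id", "session", "result", "_job_start_time"],
)
```

With NATURAL JOIN, the shared _job_start_time column would have been matched as well, which is the collision the explicit clause avoids.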

Semantic Matching for Joins

Attribute lineage tracking via ~lineage table
Warns when joining tables with same-named attributes from different origins
schema.lineage property for viewing all lineages
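
The lineage warning can be sketched as follows, assuming each operand exposes a mapping from attribute name to an origin string. The mapping format, the origin strings, and the function name are all hypothetical; the layout of the ~lineage table is defined in the Semantic Matching Spec, not here.

```python
import warnings


def lineage_conflicts(left, right):
    """Warn about same-named attributes whose origins differ.

    left, right: dicts mapping attribute name -> origin string
    (hypothetical representation). Returns the conflicting names.
    """
    conflicts = []
    for name, origin in left.items():
        if name in right and right[name] != origin:
            warnings.warn(
                f"attribute {name!r} has different origins: "
                f"{origin} vs {right[name]}"
            )
            conflicts.append(name)
    return conflicts


left = {"subject_id": "lab.subject.subject_id", "name": "lab.subject.name"}
right = {"subject_id": "lab.subject.subject_id", "name": "lab.experimenter.name"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    conflicts = lineage_conflicts(left, right)
```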

Optimized progress() Method

Single aggregation query instead of two separate queries
Handles 1:many relationships correctly with COUNT(DISTINCT)
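
The single-query approach can be demonstrated with SQLite from the standard library. The table and column names below are made up for the example, and the SQL is a sketch of the technique (LEFT JOIN plus COUNT(DISTINCT)), not the query generated by AutoPopulate.progress().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE key_source (subject_id INTEGER PRIMARY KEY);
    CREATE TABLE target (
        subject_id INTEGER,
        trial INTEGER,
        PRIMARY KEY (subject_id, trial)
    );
    INSERT INTO key_source VALUES (1), (2), (3);
    -- subject 1 has two target rows (a 1:many relationship)
    INSERT INTO target VALUES (1, 1), (1, 2), (2, 1);
    """
)

# One aggregation query: total keys and keys that already have results.
# COUNT(DISTINCT ...) ignores NULLs from unmatched rows and collapses
# the duplicate join rows produced by the 1:many relationship.
total, done = conn.execute(
    """
    SELECT COUNT(DISTINCT k.subject_id),
           COUNT(DISTINCT t.subject_id)
    FROM key_source AS k
    LEFT JOIN target AS t ON k.subject_id = t.subject_id
    """
).fetchone()
remaining = total - done
conn.close()
```

A plain COUNT(*) over the same join would count four rows for three keys, which is why the DISTINCT is needed.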

Migration Utility

add_job_metadata_columns() function to retrofit existing tables

Breaking Changes

Removed config.add_hidden_timestamp setting (see rationale below)
Removed target property from AutoPopulate (always uses self)
Deprecated limit and order parameters in populate() (use max_calls and priority)

Rationale: Removing `add_hidden_timestamp`

The old add_hidden_timestamp feature has been removed for several reasons:

Hash-based naming is obsolete: The feature named its columns _<sha1_hash>_timestamp to avoid NATURAL JOIN collisions. With the switch to USING clauses, hidden attributes are automatically excluded from joins, so hash-based naming is no longer needed.
Not modern best practice: General insert/update timestamps on all tables should be handled by server-side database auditing features (MySQL Enterprise Audit, MariaDB Audit Plugin, binary logs) rather than application-level hidden columns. Server-side auditing:
- Catches ALL changes including direct SQL
- Cannot be bypassed by application code
- Provides tamper-evident audit trails
- Doesn't pollute schema with audit columns
Job metadata is the right use case: Hidden attributes are appropriate for computation provenance (_job_start_time, _job_duration, _job_version) which is tightly coupled to computed data and useful for reproducibility.

Specification Documents

This PR includes comprehensive design specifications:

AutoPopulate 2.0 Spec - Complete system design
Hidden Job Metadata Spec - Hidden attribute implementation
PK Rules Spec - Primary key handling in joins (in base branch)
Semantic Matching Spec - Attribute lineage system (in base branch)

Configuration

New settings under config.jobs:

dj.config.jobs.auto_refresh = True       # Auto-refresh on populate
dj.config.jobs.keep_completed = False    # Keep success records
dj.config.jobs.stale_timeout = 3600      # Seconds before stale cleanup
dj.config.jobs.default_priority = 5      # Default job priority
dj.config.jobs.add_job_metadata = False  # Add hidden metadata columns

Related Issues & PRs

Closes #1258 - FEAT: Autopopulate 2.0
Addresses #1203 - Confusing limit vs max_calls behavior (deprecated limit)

Base branch PRs (must be merged first):

Implement primary key rules for joins #1302 - Primary key rules for join operations
Implement Semantic Joins #1301 - Semantic matching for joins

Related PR:

Implement Codec Infrastructure #1300 - Codec terminology renaming

Test Plan

All existing tests pass (515 passed, 2 skipped)
Job table creation and status transitions
Hidden job metadata population
FK-only PK validation for new tables
Legacy table support with non-FK PK attributes
USING clause join behavior
Migration utility for existing tables

🤖 Generated with Claude Code

AutoPopulate 1.0 spec (specs/autopopulate-1.0.md):
- Documents legacy system for reference
- Key source generation and jobs_to_do computation
- Schema-level ~jobs table with hash-based keys
- Job reservation flow (reserve/complete/error/ignore)
- Make method invocation (regular and generator patterns)
- Transaction management and error handling
- Limitations addressed by 2.0 (linked to GitHub issues)
AutoPopulate 2.0 spec (docs/src/compute/autopopulate2.0-spec.md):
- Per-table ~table__jobs with FK-derived primary keys
- FK-only PK constraint for new tables (legacy supported)
- Extended status: pending, reserved, success, error, ignore
- Priority (uint8) and scheduled_time for job ordering
- Duration tracking (float64) and version field
- refresh() with stale_timeout and orphan_timeout
- Deprecated: order, limit, keys parameters
- reserve_jobs=False falls back to 1.0 behavior
- Config sets defaults, parameters override
Design decisions documented:
- No target property (populate always populates self)
- max_calls total across all processes
- Ignore jobs permanent until manual delete
- Success jobs re-pended if key in key_source but not in table
Related: #1258, #1203, #749, #873, #665
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implementation plan (specs/autopopulate-2.0-implementation.md):
- JobsTable class with ~~ prefix naming convention
- FK-only PK constraint for new tables, legacy support
- Two execution modes: direct (default) and distributed
- AutoPopulate mixin updates (jobs property, populate paths)
- Schema.jobs returning list of JobsTable objects
- Configuration options and testing strategy
Spec updates (docs/src/compute/autopopulate2.0-spec.md):
- Changed table naming from ~table__jobs to ~~table
- Removed default value from priority (set by refresh())
- Priority default controlled by config['jobs.default_priority']
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Job class (src/datajoint/jobs.py):
- Per-table job queue with ~~ prefix naming convention
- Removed legacy JobTable class
- FK-derived primary key extraction from target table
- Status filter properties: pending, reserved, errors, ignored, completed
- Core methods: refresh(), reserve(), complete(), error(), ignore(), progress()
- Uses update1() for status transitions
- Timestamps with millisecond precision (timestamp(3))
schema.jobs property (src/datajoint/schemas.py):
- Now returns list of Job objects instead of single JobTable
- Only returns Job for tables where both target and ~~job table exist
- Job tables created lazily on first access to table.jobs
Jobs configuration (src/datajoint/settings.py):
- JobsSettings class with auto_refresh, keep_completed, stale_timeout, default_priority, and version_method options
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add jobs property to AutoPopulate for per-table Job access
- Add _declare_check hook pattern for table validation
- Implement FK-only PK constraint for Computed/Imported tables
- Split populate() into _populate_direct() and _populate_distributed()
- Update _populate1() to use new Job API with duration tracking
- Remove deprecated parameters (keys, order, limit)
- Add new parameters (priority, refresh) for distributed mode
- Remove leftover swap file
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Job management changes:
- Use dependency graph to identify FK-derived PK attributes (not lineage)
- Fix semantic_check conflict in Job.refresh() operations
- Fix SQL escaping for LIKE '~~%%' pattern
- Add allow_new_pk_fields_in_computed_tables config option
Schema changes:
- Update schema.jobs to return list of Job objects for existing tables
Test updates:
- Update conftest.py fixtures to use new schema.jobs list API
- Enable allow_new_pk_fields_in_computed_tables for legacy test tables
- Rewrite test_autopopulate.py for new populate() signature
- Rewrite test_jobs.py for per-table Job API
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace int/smallint/tinyint with int32/int16/uint8
- Replace float/double with float32/float64
- Replace timestamp with datetime
- Convert Auto from autoincrement to explicit Lookup values
- Update fixture teardown to check schema.exists
- Update test_alter_part regex to handle type comments
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add fractional seconds precision support to datetime core type
- Replace timestamp(3) with datetime(3) in Job table definition
- Eliminates native type warnings for job table timestamps
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add config.jobs.add_job_metadata setting for hidden metadata columns
- Add _job_start_time, _job_duration, _job_version to Computed/Imported tables
- Replace NATURAL JOIN with explicit USING clause to exclude hidden attributes
- Hidden attributes (prefixed with _) excluded from all binary operators
- Add subquery requirement when joining multi-table expressions
- Update jobs.py version field from varchar(255) to varchar(64)
- Add tests for hidden job metadata feature
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove add_hidden_timestamp config setting (replaced by job metadata)
- Remove hash-based timestamp column generation from declare.py
- Remove unused sha1 import
- Remove unused test fixtures: monkeysession, monkeymodule, enable_adapted_types
- Clean up empty Utility Fixtures section in conftest.py
Job metadata feature (config.jobs.add_job_metadata) remains for computed table provenance tracking.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add add_job_metadata_columns() migration utility to migrate.py
  - Adds hidden columns to existing Computed/Imported tables
  - Supports single tables or entire schemas
  - Dry-run mode for previewing changes
- Optimize AutoPopulate.progress() with single aggregation query
  - Uses LEFT JOIN with COUNT(DISTINCT) for efficiency
  - Handles 1:many relationships correctly
  - Falls back to two-query method when no common attributes
- Remove target property from AutoPopulate (always uses self)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Allow passing transaction, safemode, and force_masters kwargs to Part.delete() so users can nest Part deletions within larger transactions. Fixes #1276 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dimitri-yatsenko · 2026-01-04T04:26:13Z

This PR also fixes #1276 - Part.delete now passes through kwargs (transaction, safemode, force_masters) to Table.delete, allowing Part deletions to be nested within larger transactions.

dimitri-yatsenko and others added 10 commits January 3, 2026 12:41

github-actions bot added enhancement Indicates new improvements documentation Issues related to documentation labels Jan 4, 2026

dimitri-yatsenko changed the base branch from master to claude/semantic-match January 4, 2026 04:15

dimitri-yatsenko changed the base branch from claude/semantic-match to claude/pk-rules January 4, 2026 04:16

dimitri-yatsenko self-assigned this Jan 4, 2026

dimitri-yatsenko requested a review from ttngu207 January 4, 2026 04:16

dimitri-yatsenko added breaking Not backward compatible changes feature Indicates new features labels Jan 4, 2026

dimitri-yatsenko added this to the DataJoint 2.0 milestone Jan 4, 2026

github-actions bot removed breaking Not backward compatible changes feature Indicates new features labels Jan 4, 2026

Base automatically changed from claude/pk-rules to pre/v2.0 January 7, 2026 14:59

dimitri-yatsenko merged commit 83b380f into pre/v2.0 Jan 7, 2026
6 of 7 checks passed

dimitri-yatsenko deleted the claude/autopopulate-2.0 branch January 7, 2026 14:59

dimitri-yatsenko added breaking Not backward compatible changes feature Indicates new features labels Jan 8, 2026
