@dimitri-yatsenko dimitri-yatsenko commented Jan 4, 2026

Summary

This PR implements AutoPopulate 2.0, a complete redesign of the job handling system for distributed computing workflows. It addresses the scalability limitations identified in #1258 and resolves the confusing limit vs max_calls behavior reported in #1203.

Key Changes

Per-table Job Management

  • Each dj.Computed/dj.Imported table gets its own hidden jobs table (~~table_name)
  • Jobs tables use native primary keys (no more hash-based addressing)
  • Rich status tracking: pending, reserved, success, error, ignore
  • Access via MyTable.jobs instead of schema-level schema.jobs

Job Class API

  • jobs.refresh() - Sync job queue with key_source
  • jobs.reserve() / jobs.complete() / jobs.error() - Status transitions
  • jobs.pending / jobs.reserved / jobs.errors / jobs.completed - Query properties
  • jobs.progress() - Status breakdown for dashboards
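
The status transitions listed above can be sketched with a small in-memory model. This is an illustration only, not DataJoint's Job class: `InMemoryJobs`, its method signatures, and the tuple keys are hypothetical stand-ins. The real implementation keeps this state in the hidden `~~` table and may take different arguments or return different values.

```python
from collections import Counter


class InMemoryJobs:
    """Toy stand-in for a per-table job queue (hypothetical, not DataJoint's Job class)."""

    def __init__(self):
        self._status = {}  # primary key (tuple) -> status string
        self._errors = {}  # primary key (tuple) -> error message

    def refresh(self, key_source, done=()):
        """Queue a pending job for each key that is neither queued nor already done."""
        added = 0
        for key in key_source:
            if key not in self._status and key not in done:
                self._status[key] = "pending"
                added += 1
        return added

    def reserve(self, key):
        """Move pending -> reserved; return False if the job is not pending."""
        if self._status.get(key) != "pending":
            return False
        self._status[key] = "reserved"
        return True

    def complete(self, key):
        """Mark a job as finished successfully."""
        self._status[key] = "success"

    def error(self, key, message):
        """Mark a job as failed and remember why."""
        self._status[key] = "error"
        self._errors[key] = message

    def progress(self):
        """Status breakdown, for example for a dashboard."""
        return dict(Counter(self._status.values()))


jobs = InMemoryJobs()
jobs.refresh([(1,), (2,), (3,)])
first = jobs.reserve((1,))    # succeeds
second = jobs.reserve((1,))   # a second worker cannot reserve the same key
jobs.complete((1,))
jobs.reserve((2,))
jobs.error((2,), "boom")
print(jobs.progress())
```

The point of the reservation step is that only one worker can move a given key out of `pending`, which is what makes the queue usable from several processes.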

FK-only Primary Key Validation

  • New Computed/Imported tables must have PKs composed entirely of FK references
  • Ensures 1:1 correspondence between jobs and target rows
  • Legacy tables with additional PK attributes continue to work, but with coarser granularity: one job may correspond to several target rows
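
A minimal sketch of the FK-only check, assuming the primary key and the attributes contributed by each foreign key are available as plain lists. The function names and the `allow_legacy` flag are hypothetical; in the PR the check is wired in through a declaration hook and uses the dependency graph.

```python
def non_fk_primary_key_attrs(primary_key, foreign_keys):
    """Return primary-key attributes that are not inherited through any foreign key."""
    fk_attrs = {attr for fk in foreign_keys for attr in fk}
    return [attr for attr in primary_key if attr not in fk_attrs]


def check_fk_only_primary_key(table_name, primary_key, foreign_keys, allow_legacy=False):
    """Raise for new tables whose PK adds its own attributes; tolerate legacy tables."""
    extra = non_fk_primary_key_attrs(primary_key, foreign_keys)
    if extra and not allow_legacy:
        raise ValueError(
            f"{table_name}: primary key attributes {extra} are not derived from foreign keys"
        )
    return extra


# A table whose PK comes entirely from its parents passes the check.
ok = check_fk_only_primary_key(
    "SpikeSorting", ["subject_id", "session"], [["subject_id", "session"]]
)

# A legacy table with an extra PK attribute is reported but not rejected.
legacy = check_fk_only_primary_key(
    "Unit", ["subject_id", "session", "unit"], [["subject_id", "session"]], allow_legacy=True
)
print(ok, legacy)
```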

Hidden Job Metadata

  • Optional _job_start_time, _job_duration, _job_version columns for computed tables
  • Controlled via config.jobs.add_job_metadata
  • Provides computation provenance without cluttering the visible schema
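
As a rough illustration of what the hidden columns record, the sketch below wraps a computation and attaches the three metadata values to its result. `run_with_job_metadata`, `visible`, and the row-as-dict representation are assumptions made for this example; only the column names come from the PR.

```python
import time
from datetime import datetime, timezone


def run_with_job_metadata(make, key, version=""):
    """Call make(key) and attach hidden provenance fields to the resulting row."""
    start = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    row = dict(make(key))
    row.update(
        _job_start_time=start,
        _job_duration=time.perf_counter() - t0,  # seconds
        _job_version=version,
    )
    return row


def visible(row):
    """Drop hidden attributes (names starting with an underscore)."""
    return {name: value for name, value in row.items() if not name.startswith("_")}


row = run_with_job_metadata(lambda key: {**key, "result": 42}, {"id": 1}, version="abc123")
print(visible(row))
```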

NATURAL JOIN → USING Clause

  • Replaced NATURAL JOIN with explicit USING clauses throughout
  • Hidden attributes (prefixed with _) are excluded from join matching
  • Enables safe use of hidden attributes without join collisions
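
The effect of the switch can be sketched as a helper that builds the USING clause from two attribute lists and skips hidden names. `using_clause` is a hypothetical function for illustration; how the actual query builder handles the no-common-attribute case is not shown in this PR description, so the sketch simply returns an empty string there.

```python
def using_clause(left_attrs, right_attrs):
    """Build a USING clause from shared attribute names, ignoring hidden ones."""
    right = set(right_attrs)
    common = [
        name for name in left_attrs
        if name in right and not name.startswith("_")
    ]
    if not common:
        return ""  # nothing to match on; left to the caller in this sketch
    return "USING (" + ", ".join(f"`{name}`" for name in common) + ")"


left = ["subject_id", "session", "_job_start_time"]
right = ["subject_id", "session", "score", "_job_start_time"]
clause = using_clause(left, right)
print(clause)
```

With NATURAL JOIN, the shared `_job_start_time` column would have been part of the match and rows computed at different times would fail to join; listing the columns explicitly avoids that.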

Semantic Matching for Joins

  • Attribute lineage tracking via ~lineage table
  • Warns when joining tables with same-named attributes from different origins
  • schema.lineage property for viewing all lineages
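
A minimal sketch of the lineage check, assuming each attribute's lineage can be represented as a string naming where it was first defined. The `schema.table.attribute` format and the function `lineage_conflicts` are assumptions for this example, not the format used by the `~lineage` table.

```python
import warnings


def lineage_conflicts(left, right):
    """Warn about same-named attributes whose lineages differ; return their names."""
    conflicts = []
    for name, origin in left.items():
        other = right.get(name)
        if other is not None and other != origin:
            warnings.warn(
                f"attribute {name!r} has different lineages: {origin} vs {other}",
                stacklevel=2,
            )
            conflicts.append(name)
    return conflicts


session = {"subject_id": "lab.subject.subject_id", "name": "lab.session.name"}
experimenter = {"subject_id": "lab.subject.subject_id", "name": "lab.person.name"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    conflicts = lineage_conflicts(session, experimenter)
print(conflicts)
```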

Optimized progress() Method

  • Single aggregation query instead of two separate queries
  • Handles 1:many relationships correctly with COUNT(DISTINCT)
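
The single-query idea can be demonstrated with SQLite from the standard library. The table and column names are made up for the example, and the SQL that DataJoint generates for MySQL may differ; the sketch only shows why COUNT(DISTINCT) is needed when one key maps to several target rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE key_source (session_id INTEGER PRIMARY KEY);
    CREATE TABLE target (
        session_id INTEGER,
        unit INTEGER,
        PRIMARY KEY (session_id, unit)
    );
    INSERT INTO key_source VALUES (1), (2), (3);
    -- session 1 produced two rows, session 2 one row, session 3 none yet
    INSERT INTO target VALUES (1, 1), (1, 2), (2, 1);
    """
)

# One aggregation instead of two separate counts; DISTINCT keeps the
# two rows of session 1 from being counted as two completed keys.
total, done = con.execute(
    """
    SELECT COUNT(DISTINCT k.session_id), COUNT(DISTINCT t.session_id)
    FROM key_source AS k
    LEFT JOIN target AS t ON t.session_id = k.session_id
    """
).fetchone()
remaining = total - done
print(remaining, total)
```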

Migration Utility

  • add_job_metadata_columns() function to retrofit existing tables

Breaking Changes

  • Removed config.add_hidden_timestamp setting (see rationale below)
  • Removed target property from AutoPopulate (always uses self)
  • Deprecated limit and order parameters in populate() (use max_calls and priority)

Rationale: Removing add_hidden_timestamp

The old add_hidden_timestamp feature has been removed for several reasons:

  1. Hash-based naming obsolete: Used _<sha1_hash>_timestamp to avoid NATURAL JOIN collisions. With the switch to USING clauses, hidden attributes are automatically excluded from joins, making hash-based naming unnecessary.

  2. Not modern best practice: General insert/update timestamps on all tables should be handled by server-side database auditing features (MySQL Enterprise Audit, MariaDB Audit Plugin, binary logs) rather than application-level hidden columns. Server-side auditing:

    • Catches ALL changes including direct SQL
    • Cannot be bypassed by application code
    • Provides tamper-evident audit trails
    • Doesn't pollute schema with audit columns

  3. Job metadata is the right use case: Hidden attributes are appropriate for computation provenance (_job_start_time, _job_duration, _job_version) which is tightly coupled to computed data and useful for reproducibility.

Specification Documents

This PR includes comprehensive design specifications:

Configuration

New settings under config.jobs:

dj.config.jobs.auto_refresh = True       # Auto-refresh on populate
dj.config.jobs.keep_completed = False    # Keep success records
dj.config.jobs.stale_timeout = 3600      # Seconds before stale cleanup
dj.config.jobs.default_priority = 5      # Default job priority
dj.config.jobs.add_job_metadata = False  # Add hidden metadata columns
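
The spec notes in this PR state that config sets defaults and call parameters override them. A minimal sketch of that resolution rule follows, using a stand-in dataclass that mirrors the values shown above. This dataclass and `resolve_priority` are hypothetical; the real settings class lives in src/datajoint/settings.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class JobsSettings:
    """Stand-in mirroring the settings listed above (illustrative values)."""

    auto_refresh: bool = True
    keep_completed: bool = False
    stale_timeout: int = 3600  # seconds
    default_priority: int = 5
    add_job_metadata: bool = False


def resolve_priority(settings, priority=None):
    """Use the explicit argument when given, otherwise fall back to the config default."""
    return settings.default_priority if priority is None else priority


settings = JobsSettings()
from_config = resolve_priority(settings)
from_call = resolve_priority(settings, priority=0)
print(from_config, from_call)
```

Comparing against `None` rather than testing truthiness matters here, since `0` is a legitimate priority value.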

Related Issues & PRs

Closes #1258 - FEAT: Autopopulate 2.0
Addresses #1203 - Confusing limit vs max_calls behavior (deprecated limit)

Base branch PRs (must be merged first):

Related PR:

Test Plan

  • All existing tests pass (515 passed, 2 skipped)
  • Job table creation and status transitions
  • Hidden job metadata population
  • FK-only PK validation for new tables
  • Legacy table support with non-FK PK attributes
  • USING clause join behavior
  • Migration utility for existing tables

🤖 Generated with Claude Code

dimitri-yatsenko and others added 10 commits January 3, 2026 12:41
AutoPopulate 1.0 spec (specs/autopopulate-1.0.md):
- Documents legacy system for reference
- Key source generation and jobs_to_do computation
- Schema-level ~jobs table with hash-based keys
- Job reservation flow (reserve/complete/error/ignore)
- Make method invocation (regular and generator patterns)
- Transaction management and error handling
- Limitations addressed by 2.0 (linked to GitHub issues)

AutoPopulate 2.0 spec (docs/src/compute/autopopulate2.0-spec.md):
- Per-table ~table__jobs with FK-derived primary keys
- FK-only PK constraint for new tables (legacy supported)
- Extended status: pending, reserved, success, error, ignore
- Priority (uint8) and scheduled_time for job ordering
- Duration tracking (float64) and version field
- refresh() with stale_timeout and orphan_timeout
- Deprecated: order, limit, keys parameters
- reserve_jobs=False falls back to 1.0 behavior
- Config sets defaults, parameters override

Design decisions documented:
- No target property (populate always populates self)
- max_calls total across all processes
- Ignore jobs permanent until manual delete
- Success jobs re-pended if key in key_source but not in table

Related: #1258, #1203, #749, #873, #665

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implementation plan (specs/autopopulate-2.0-implementation.md):
- JobsTable class with ~~ prefix naming convention
- FK-only PK constraint for new tables, legacy support
- Two execution modes: direct (default) and distributed
- AutoPopulate mixin updates (jobs property, populate paths)
- Schema.jobs returning list of JobsTable objects
- Configuration options and testing strategy

Spec updates (docs/src/compute/autopopulate2.0-spec.md):
- Changed table naming from ~table__jobs to ~~table
- Removed default value from priority (set by refresh())
- Priority default controlled by config['jobs.default_priority']

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Job class (src/datajoint/jobs.py):
- Per-table job queue with ~~ prefix naming convention
- Removed legacy JobTable class
- FK-derived primary key extraction from target table
- Status filter properties: pending, reserved, errors, ignored, completed
- Core methods: refresh(), reserve(), complete(), error(), ignore(), progress()
- Uses update1() for status transitions
- Timestamps with millisecond precision (timestamp(3))

schema.jobs property (src/datajoint/schemas.py):
- Now returns list of Job objects instead of single JobTable
- Only returns Job for tables where both target and ~~job table exist
- Job tables created lazily on first access to table.jobs

Jobs configuration (src/datajoint/settings.py):
- JobsSettings class with auto_refresh, keep_completed, stale_timeout,
  default_priority, and version_method options

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add jobs property to AutoPopulate for per-table Job access
- Add _declare_check hook pattern for table validation
- Implement FK-only PK constraint for Computed/Imported tables
- Split populate() into _populate_direct() and _populate_distributed()
- Update _populate1() to use new Job API with duration tracking
- Remove deprecated parameters (keys, order, limit)
- Add new parameters (priority, refresh) for distributed mode
- Remove leftover swap file

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Job management changes:
- Use dependency graph to identify FK-derived PK attributes (not lineage)
- Fix semantic_check conflict in Job.refresh() operations
- Fix SQL escaping for LIKE '~~%%' pattern
- Add allow_new_pk_fields_in_computed_tables config option

Schema changes:
- Update schema.jobs to return list of Job objects for existing tables

Test updates:
- Update conftest.py fixtures to use new schema.jobs list API
- Enable allow_new_pk_fields_in_computed_tables for legacy test tables
- Rewrite test_autopopulate.py for new populate() signature
- Rewrite test_jobs.py for per-table Job API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace int/smallint/tinyint with int32/int16/uint8
- Replace float/double with float32/float64
- Replace timestamp with datetime
- Convert Auto from autoincrement to explicit Lookup values
- Update fixture teardown to check schema.exists
- Update test_alter_part regex to handle type comments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add fractional seconds precision support to datetime core type
- Replace timestamp(3) with datetime(3) in Job table definition
- Eliminates native type warnings for job table timestamps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add config.jobs.add_job_metadata setting for hidden metadata columns
- Add _job_start_time, _job_duration, _job_version to Computed/Imported tables
- Replace NATURAL JOIN with explicit USING clause to exclude hidden attributes
- Hidden attributes (prefixed with _) excluded from all binary operators
- Add subquery requirement when joining multi-table expressions
- Update jobs.py version field from varchar(255) to varchar(64)
- Add tests for hidden job metadata feature

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove add_hidden_timestamp config setting (replaced by job metadata)
- Remove hash-based timestamp column generation from declare.py
- Remove unused sha1 import
- Remove unused test fixtures: monkeysession, monkeymodule, enable_adapted_types
- Clean up empty Utility Fixtures section in conftest.py

Job metadata feature (config.jobs.add_job_metadata) remains for computed
table provenance tracking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add add_job_metadata_columns() migration utility to migrate.py
  - Adds hidden columns to existing Computed/Imported tables
  - Supports single tables or entire schemas
  - Dry-run mode for previewing changes

- Optimize AutoPopulate.progress() with single aggregation query
  - Uses LEFT JOIN with COUNT(DISTINCT) for efficiency
  - Handles 1:many relationships correctly
  - Falls back to two-query method when no common attributes

- Remove target property from AutoPopulate (always uses self)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions (bot) added the enhancement and documentation labels Jan 4, 2026
@dimitri-yatsenko changed the base branch from master to claude/semantic-match January 4, 2026 04:15
@dimitri-yatsenko changed the base branch from claude/semantic-match to claude/pk-rules January 4, 2026 04:16
@dimitri-yatsenko self-assigned this Jan 4, 2026
@dimitri-yatsenko added the breaking and feature labels Jan 4, 2026
@dimitri-yatsenko added this to the DataJoint 2.0 milestone Jan 4, 2026
Allow passing transaction, safemode, and force_masters kwargs
to Part.delete() so users can nest Part deletions within larger
transactions.

Fixes #1276

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dimitri-yatsenko

This PR also fixes #1276 - Part.delete now passes through kwargs (transaction, safemode, force_masters) to Table.delete, allowing Part deletions to be nested within larger transactions.
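
The pass-through pattern described in this comment can be sketched with two toy classes. These are not DataJoint's `Table` and `Part`: the method bodies are placeholders, and the `force` guard and the default values are assumptions made for the example.

```python
class Table:
    def delete(self, transaction=True, safemode=None, force_masters=False):
        # Toy body: report which options were received.
        return {
            "transaction": transaction,
            "safemode": safemode,
            "force_masters": force_masters,
        }


class Part(Table):
    def delete(self, force=False, **kwargs):
        # Assumed guard: direct deletes from a part table require force=True.
        if not force:
            raise RuntimeError("cannot delete from a Part directly; pass force=True")
        # Forward the remaining options unchanged to the parent implementation.
        return super().delete(**kwargs)


# transaction=False lets the caller manage the surrounding transaction.
received = Part().delete(force=True, transaction=False, safemode=False)
print(received)
```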

@github-actions (bot) removed the breaking and feature labels Jan 4, 2026
Base automatically changed from claude/pk-rules to pre/v2.0 January 7, 2026 14:59
@dimitri-yatsenko merged commit 83b380f into pre/v2.0 Jan 7, 2026
6 of 7 checks passed
@dimitri-yatsenko deleted the claude/autopopulate-2.0 branch January 7, 2026 14:59
@dimitri-yatsenko added the breaking and feature labels Jan 8, 2026