@dimitri-yatsenko dimitri-yatsenko commented Dec 21, 2025

Description

This PR implements a major redesign of DataJoint's type system and storage architecture for v2.0. The changes establish a clean three-layer type architecture and modernize external storage handling.

Three-Layer Type Architecture

┌───────────────────────────────────────────────────────────────────┐
│                     AttributeTypes (Layer 3)                       │
│  Built-in:  <djblob>  <object>  <content>  <filepath@s>  <xblob>  │
│  User:      <custom>  <mytype>   ...                               │
├───────────────────────────────────────────────────────────────────┤
│                 Core DataJoint Types (Layer 2)                     │
│  float32  float64  int64  uint64  int32  uint32  int16  uint16    │
│  int8  uint8  bool  uuid  json  blob  date  datetime              │
│  char(n)  varchar(n)  enum(...)                                    │
├───────────────────────────────────────────────────────────────────┤
│               Native Database Types (Layer 1)                      │
│  MySQL: TINYINT  SMALLINT  INT  BIGINT  FLOAT  DOUBLE  ...        │
└───────────────────────────────────────────────────────────────────┘

Key Changes

New Type System

  • Core types (int32, float64, bool, uuid, json, blob, enum(...)) - scientist-friendly, portable across backends
  • AttributeTypes (<djblob>, <xblob>, <object>, <content>, <attach>, <xattach>, <filepath>) - composable encode/decode with angle bracket syntax
  • Native passthrough - MySQL types allowed with warning for non-standard usage
  • Core type blob now stores raw bytes; use <djblob> for serialized Python objects
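
The layering above can be pictured as a two-step lookup: an AttributeType resolves to its underlying dtype, and a core type resolves to a native column type. The following is a minimal sketch under stated assumptions, not DataJoint's implementation. The function `resolve_native` and both dictionaries are hypothetical. Only the `uuid` to `binary(16)` and `<djblob>`/`<attach>` to `longblob` pairs come from this PR; the integer and float mappings are guesses based on the MySQL types named in the diagram.

```python
# Illustrative tables only; these are NOT DataJoint's internal mappings.
CORE_TO_NATIVE = {
    "int8": "tinyint",      # assumed
    "int16": "smallint",    # assumed
    "int32": "int",         # assumed
    "int64": "bigint",      # assumed
    "float32": "float",     # assumed
    "float64": "double",    # assumed
    "uuid": "binary(16)",   # stated in this PR (describe() example)
}

ATTRIBUTE_TYPE_DTYPE = {
    "<djblob>": "longblob",  # stated in the built-in types table
    "<attach>": "longblob",  # stated in the built-in types table
}


def resolve_native(declared: str) -> str:
    """Resolve a declared type to a native column type (hypothetical helper)."""
    declared = declared.strip()
    # Layer 3 -> underlying dtype
    declared = ATTRIBUTE_TYPE_DTYPE.get(declared, declared)
    # Layer 2 -> Layer 1; unknown names pass through as native types
    return CORE_TO_NATIVE.get(declared, declared)
```

Anything not found in either table falls through unchanged, which is one way to model the "native passthrough" bullet; the warning for non-standard usage is omitted here.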

Built-in AttributeTypes

| Type | Storage | Addressing | Returns |
|------|---------|------------|---------|
| <djblob> | Database (longblob) | N/A | Python object |
| <xblob@store> | Content-addressed | Hash | Python object |
| <object@store> | Path-addressed | Primary key | ObjectRef |
| <content@store> | Content-addressed | Hash | bytes |
| <attach> | Database (longblob) | N/A | Local file path |
| <xattach@store> | Content-addressed | Hash | Local file path |
| <filepath@store> | Configured store | Relative path | ObjectRef |

Storage Architecture (OAS - Object-Augmented Schema)

  • Object storage: {schema}/{table}/{pk}/ - path-addressed, deleted with row
  • Content storage: _content/{hash} - content-addressed, deduplicated, garbage collected
  • fsspec backend: Unified storage interface supporting file, S3, GCS, Azure
  • Project-level ContentRegistry: Replaces per-schema ~external_* tables
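
To make the two addressing schemes concrete, here is a minimal in-memory sketch. It is not the PR's storage code: `ContentStore` and `object_path` are hypothetical names, SHA-256 is an assumed hash function (the description does not name one), and the `key=value` encoding of the primary key is likewise an assumption.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressed store (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256; the PR only says "hash".
        digest = hashlib.sha256(data).hexdigest()
        path = f"_content/{digest}"
        # Identical bytes map to the same path, so they are stored once.
        self.objects.setdefault(path, data)
        return path

    def get(self, path: str) -> bytes:
        return self.objects[path]


def object_path(schema: str, table: str, pk: dict) -> str:
    """Path-addressed location derived from the primary key.

    Assumption: the key is rendered as sorted name=value segments.
    """
    key = "/".join(f"{name}={value}" for name, value in sorted(pk.items()))
    return f"{schema}/{table}/{key}/"
```

Because a content path depends only on the bytes, two rows inserting the same data share one stored object, which is why that storage needs garbage collection rather than deletion with the row.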

Configuration System

  • Pydantic-based settings with validation
  • Recursive config file search (datajoint.json)
  • New object_storage.stores.* configuration for external stores
  • New dj.config.save_template() method to create template configuration files
  • Removed legacy config.save() methods

Configuration Precedence (12-Factor App Compliant)

Environment variables now take precedence over config file values, following the 12-Factor App methodology for configuration:

Priority (highest → lowest):
1. Environment variables (DJ_HOST, DJ_USER, DJ_PASS, DJ_PORT, etc.)
2. Secrets files (.secrets/database.password, etc.)
3. Config file (datajoint.json)
4. Defaults

This enables:

  • Committing datajoint.json with sensible defaults to version control
  • Overriding settings via environment variables in Docker/Kubernetes/CI
  • Ensuring secrets injected via env vars are never overwritten by config files
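
The precedence order can be sketched as a lookup that consults each layer in turn. This is an illustration under stated assumptions rather than the settings code in this PR: `resolve_setting` and `ENV_VARS` are hypothetical, and the dotted setting names other than `database.password` are assumed.

```python
from typing import Any, Mapping

# Hypothetical mapping from setting name to environment variable.
# The variable names appear in the PR description; the dotted keys are assumed.
ENV_VARS = {
    "database.host": "DJ_HOST",
    "database.user": "DJ_USER",
    "database.password": "DJ_PASS",
    "database.port": "DJ_PORT",
}


def resolve_setting(
    key: str,
    env: Mapping[str, str],
    secrets: Mapping[str, Any],
    config_file: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> Any:
    """Return the value from the highest-priority layer that defines `key`."""
    env_name = ENV_VARS.get(key)
    if env_name is not None and env_name in env:
        return env[env_name]          # 1. environment variable
    for layer in (secrets, config_file, defaults):
        if key in layer:              # 2. secrets, 3. config file, 4. defaults
            return layer[key]
    raise KeyError(key)
```

Passing `os.environ` as `env` would give the real behavior; taking it as a parameter keeps the sketch easy to test.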

Removed: Schema Log Table (~log)

The per-schema ~log table has been removed. It automatically recorded DDL operations (table creation, alteration, drops) and some DML operations in each schema.

Why it was removed:

  • Outdated pattern: Modern applications use centralized logging (ELK stack, CloudWatch, etc.) rather than per-database tables
  • Limited utility: The log table only captured DataJoint operations, missing direct SQL changes
  • Performance overhead: Every DDL/DML operation incurred an additional INSERT
  • No retention policy: Log tables grew unbounded with no automatic cleanup
  • Better alternatives: Python's logging module, database audit logs, or external observability tools provide superior logging capabilities

Migration: If you relied on schema.log for auditing, consider:

  • Using Python's logging module with appropriate handlers
  • Enabling MySQL's general query log or audit plugin
  • Using an external observability platform
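
For the first option, a minimal sketch using only the standard library might look like the following. The logger name `pipeline.audit` and the helpers `configure_audit_log` and `log_event` are hypothetical; nothing here is provided by DataJoint, and the calls would have to be placed in your own pipeline code.

```python
import logging

# Hypothetical logger name for pipeline audit events.
logger = logging.getLogger("pipeline.audit")


def configure_audit_log(handler=None):
    """Attach a handler (stderr by default) with a timestamped format."""
    handler = handler or logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return handler


def log_event(action: str, table: str, user: str) -> None:
    """Record one audit event; call this around your own DDL/DML operations."""
    logger.info("action=%s table=%s user=%s", action, table, user)
```

Swapping the handler (for example `logging.FileHandler` or `logging.handlers.SysLogHandler`) routes the same events to a file or a central collector without changing call sites.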

Breaking Changes

  • longblob no longer auto-serializes - use <djblob> instead
  • Legacy blob@store, attach@store, filepath@store syntax replaced with <xblob@store>, <xattach@store>, <filepath@store>
  • Removed AttributeAdapter - use AttributeType with @dj.register_type
  • Removed set_password() function
  • Removed bypass_serialization context manager
  • Removed external.py module (deprecated)
  • Removed Table.external property and legacy external table support
  • Removed Schema.log property and ~log table functionality
  • Config files no longer override environment variables (env vars now take precedence)
  • Dropped support for Python < 3.10 and MySQL < 8.0

New Features

  • Type composition: AttributeTypes can use other AttributeTypes as dtype
  • Garbage collection for content-addressed storage
  • ObjectRef for lazy file access with fsspec streaming
  • Filename collision handling in attachment downloads
  • Table.describe() now shows core type names (e.g., uuid) instead of native types (e.g., binary(16))
  • Developer Guide added to README

Testing Infrastructure

New Test Modules

| Test File | Coverage |
|-----------|----------|
| test_attribute_type.py | AttributeType system, registration, encode/decode |
| test_content_storage.py | Content-addressed storage, deduplication |
| test_gc.py | Garbage collection for content/object storage |
| test_object.py | Object type, ObjectRef, path-addressed storage |
| test_type_aliases.py | Core type aliases (int32, float64, etc.) |
| test_type_composition.py | AttributeType chaining and composition |
| test_settings.py | Expanded pydantic config tests including env var precedence |

Removed Obsolete Tests

  • test_admin.py - tested removed set_password() function
  • test_bypass_serialization.py - tested removed context manager
  • test_external.py - tested legacy external storage
  • test_log.py - tested removed ~log table
  • test_filepath.py - tested legacy external table API
  • test_external_class.py - tested legacy external table API
  • test_s3.py - tested legacy external table API

Updated Test Schemas

  • schema_object.py - new schema for object type tests
  • schema_type_aliases.py - new schema for type alias tests
  • All schemas updated to use <djblob> instead of longblob for serialized data
  • schema_external.py updated to use new <xblob@store>, <xattach@store>, <filepath@store> syntax

Infrastructure Improvements

  • Simplified conftest.py (714 lines changed, net reduction)
  • New object_storage.stores.* fixture configuration
  • Docker Compose services streamlined for local testing
  • DevContainer configuration updated

Migration from Legacy Types

| Legacy | New |
|--------|-----|
| longblob (auto-serialized) | <djblob> |
| blob@store | <xblob@store> |
| attach | <attach> |
| attach@store | <xattach@store> |
| filepath@store | <filepath@store> |
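
When updating table definitions by hand, the mapping above is mechanical enough to script. The helper below is a hypothetical sketch and is not part of DataJoint; `migrate_type` and its patterns are assumptions, and it rewrites only a single type string, not a full definition.

```python
import re

# (pattern, replacement) pairs taken from the migration table above.
# Assumption: store names consist of word characters and hyphens.
_LEGACY_RULES = [
    (re.compile(r"^blob@([\w-]+)$"), r"<xblob@\1>"),
    (re.compile(r"^attach@([\w-]+)$"), r"<xattach@\1>"),
    (re.compile(r"^filepath@([\w-]+)$"), r"<filepath@\1>"),
    (re.compile(r"^attach$"), "<attach>"),
    # Only correct for columns that relied on automatic serialization;
    # columns meant to hold raw bytes should not be rewritten.
    (re.compile(r"^longblob$"), "<djblob>"),
]


def migrate_type(legacy: str) -> str:
    """Return the new-style type for a legacy type string, or the input unchanged."""
    text = legacy.strip()
    for pattern, replacement in _LEGACY_RULES:
        if pattern.match(text):
            return pattern.sub(replacement, text)
    return text
```

Types that are already in the new syntax, and core types such as `int32`, pass through untouched.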

Test Results

  • 471 passed, 2 skipped (macOS multiprocessing limitation)

This commit introduces a modern, extensible custom type system for DataJoint:

**New Features:**
- AttributeType base class with encode()/decode() methods
- Global type registry with @register_type decorator
- Entry point discovery for third-party type packages (datajoint.types)
- Type chaining: dtype can reference another custom type
- Automatic validation via validate() method before encoding
- resolve_dtype() for resolving chained types

**API Changes:**
- New: dj.AttributeType, dj.register_type, dj.list_types
- AttributeAdapter is now deprecated (backward-compatible wrapper)
- Feature flag DJ_SUPPORT_ADAPTED_TYPES is no longer required

**Entry Point Specification:**
Third-party packages can declare types in pyproject.toml:
  [project.entry-points."datajoint.types"]
  zarr_array = "dj_zarr:ZarrArrayType"

**Migration Path:**
Old AttributeAdapter subclasses continue to work but emit
DeprecationWarning. Migrate to AttributeType with encode/decode.
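
The encode/decode pattern with registration and chaining can be illustrated without DataJoint at all. The sketch below is standalone and is not the library's implementation: the class and decorator names echo this commit message, but every body, the `key` keyword, and the helpers `resolve_chain`, `encode_value`, and `decode_value` are assumptions made for illustration.

```python
import json

_REGISTRY = {}  # type_name -> instance (stand-in for a global registry)


class AttributeType:
    type_name = None
    dtype = None  # a storage type, or "<other_type>" to chain

    def validate(self, value):
        pass

    def encode(self, value, *, key=None):
        raise NotImplementedError

    def decode(self, stored, *, key=None):
        raise NotImplementedError


def register_type(cls):
    _REGISTRY[cls.type_name] = cls()
    return cls


@register_type
class JsonText(AttributeType):
    type_name = "json_text"
    dtype = "longblob"

    def encode(self, value, *, key=None):
        return json.dumps(value).encode("utf-8")

    def decode(self, stored, *, key=None):
        return json.loads(stored.decode("utf-8"))


@register_type
class Graph(AttributeType):
    type_name = "graph"
    dtype = "<json_text>"  # chained: Graph output feeds JsonText

    def validate(self, value):
        if not all(len(edge) == 2 for edge in value):
            raise ValueError("each edge must be a (source, target) pair")

    def encode(self, edges, *, key=None):
        return [list(edge) for edge in edges]

    def decode(self, stored, *, key=None):
        return {tuple(edge) for edge in stored}


def resolve_chain(dtype):
    """Follow <name> references until a non-bracketed storage type is reached."""
    chain = []
    while dtype.startswith("<") and dtype.endswith(">"):
        attr_type = _REGISTRY[dtype[1:-1]]
        chain.append(attr_type)
        dtype = attr_type.dtype
    return dtype, chain


def encode_value(dtype, value):
    for attr_type in resolve_chain(dtype)[1]:
        attr_type.validate(value)  # validate before each encode step
        value = attr_type.encode(value)
    return value


def decode_value(dtype, stored):
    for attr_type in reversed(resolve_chain(dtype)[1]):
        stored = attr_type.decode(stored)
    return stored
```

Encoding walks the chain outermost-first and decoding walks it in reverse, so `decode_value("<graph>", encode_value("<graph>", edges))` returns the original edge set.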
- Rewrite customtype.md with comprehensive documentation:
  - Overview of encode/decode pattern
  - Required components (type_name, dtype, encode, decode)
  - Type registration with @dj.register_type decorator
  - Validation with validate() method
  - Storage types (dtype options)
  - Type chaining for composable types
  - Key parameter for context-aware encoding
  - Entry point packages for distribution
  - Complete neuroscience example
  - Migration guide from AttributeAdapter
  - Best practices

- Update attributes.md to reference custom types
@dimitri-yatsenko dimitri-yatsenko added this to the DataJoint 2.0 milestone Dec 21, 2025
@github-actions github-actions bot added enhancement Indicates new improvements documentation Issues related to documentation labels Dec 21, 2025
Introduces `<djblob>` as an explicit AttributeType for DataJoint's
native blob serialization, allowing users to be explicit about
serialization behavior in table definitions.

Key changes:
- Add DJBlobType class with `serializes=True` flag to indicate
  it handles its own serialization (avoiding double pack/unpack)
- Update table.py and fetch.py to respect the `serializes` flag,
  skipping blob.pack/unpack when adapter handles serialization
- Add `dj.migrate` module with utilities for migrating existing
  schemas to use explicit `<djblob>` type declarations
- Add tests for DJBlobType functionality
- Document `<djblob>` type and migration procedure

The migration is metadata-only - blob data format is unchanged.
Existing `longblob` columns continue to work with implicit
serialization for backward compatibility.

Simplified design:
- Plain longblob columns store/return raw bytes (no serialization)
- <djblob> type handles serialization via encode/decode
- Legacy AttributeAdapter handles blob pack/unpack internally
  for backward compatibility

This eliminates the need for the serializes flag by making
blob serialization the responsibility of the adapter/type,
not the framework. Migration to <djblob> is now required
for existing schemas that rely on implicit serialization.
@dimitri-yatsenko dimitri-yatsenko added breaking Not backward compatible changes feature Indicates new features and removed enhancement Indicates new improvements labels Dec 21, 2025
@github-actions github-actions bot added enhancement Indicates new improvements and removed breaking Not backward compatible changes feature Indicates new features labels Dec 21, 2025
@dimitri-yatsenko dimitri-yatsenko changed the base branch from claude/add-file-column-type-LtXQt to pre/v2.0 December 22, 2025 16:18
@dimitri-yatsenko dimitri-yatsenko changed the base branch from pre/v2.0 to claude/add-file-column-type-LtXQt December 24, 2025 19:31
Base automatically changed from claude/add-file-column-type-LtXQt to pre/v2.0 December 24, 2025 20:09
Design document for reimplementing blob, attach, filepath, and object
types as a coherent AttributeType system. Separates storage location
(@store) from encoding behavior.
claude and others added 18 commits December 25, 2025 21:24
…lization

Breaking changes:
- Remove attribute_adapter.py entirely (hard deprecate)
- Remove bypass_serialization flag from blob.py - blobs always serialize now
- Remove unused 'database' field from Attribute in heading.py

Import get_adapter from attribute_type instead of attribute_adapter.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Document function-based content storage (not registry class)
- Add implementation status table
- Explain design decision: functions vs database table
- Update Phase 5 GC design for scanning approach
- Document removed/deprecated items

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Create builtin_types.py with DJBlobType, ContentType, XBlobType
- Types serve as examples for users creating custom types
- Module docstring includes example of defining a custom GraphType
- Add get_adapter() function to attribute_type.py for compatibility
- Auto-register built-in types via import at module load

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add <object> type for files and folders (Zarr, HDF5, etc.):
- Path derived from primary key: {schema}/{table}/objects/{pk}/{field}_{token}
- Supports bytes, files, and directories
- Returns ObjectRef for lazy fsspec-based access
- No deduplication (unlike <content>)

Update implementation plan with Phase 2b documenting ObjectType.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Migration utilities are out of scope for now. This is a breaking
change version - users will need to recreate tables with new types.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Document staged_insert.py for direct object storage writes
- Add flow comparison: normal insert vs staged insert
- Include staged_insert.py in critical files summary

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add remaining built-in AttributeTypes:
- <attach>: Internal file attachment stored in longblob
- <xattach>: External file attachment via <content> with deduplication
- <filepath@store>: Reference to existing file (no copy, returns ObjectRef)

Update implementation plan to mark Phase 3 complete.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add garbage collection module (gc.py) for content-addressed storage:
- scan_references() to find content hashes in schemas
- list_stored_content() to enumerate _content/ directory
- scan() for orphan detection without deletion
- collect() for orphan removal with dry_run option
- format_stats() for human-readable output
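
The core of orphan detection is a set difference between what the schemas reference and what the store holds. The sketch below shows that idea only; `scan_orphans` and `collect_orphans` are hypothetical names with assumed signatures and do not reproduce gc.py, and a plain dict stands in for the storage backend.

```python
def scan_orphans(referenced: set, stored: set) -> dict:
    """Report stored paths that no table row references (no deletion)."""
    orphaned = stored - referenced
    return {
        "referenced": len(referenced),
        "stored": len(stored),
        "orphaned": sorted(orphaned),
    }


def collect_orphans(referenced: set, storage: dict, dry_run: bool = True) -> dict:
    """Delete orphaned entries from `storage` unless dry_run is set."""
    report = scan_orphans(set(referenced), set(storage))
    if not dry_run:
        for path in report["orphaned"]:
            del storage[path]
    report["deleted"] = 0 if dry_run else len(report["orphaned"])
    return report
```

Defaulting `dry_run` to `True` is a design choice in this sketch, so that a first call reports what would be removed before anything is deleted.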

Add test files:
- test_content_storage.py for content_registry.py functions
- test_type_composition.py for type chain encoding/decoding
- test_gc.py for garbage collection

Update implementation plan to mark all phases complete.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Extend gc.py to handle both storage patterns:
- Content-addressed storage: <content>, <xblob>, <xattach>
- Path-addressed storage: <object>

New functions added:
- _uses_object_storage() - detect object type attributes
- _extract_object_refs() - extract path refs from JSON
- scan_object_references() - scan schemas for object paths
- list_stored_objects() - list all objects in storage
- delete_object() - delete object directory tree

Updated scan() and collect() to handle both storage types,
with combined and per-type statistics in the output.

Updated tests for new statistics format.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
External tables are deprecated in favor of the new storage type system.
Move the constant to external.py where it's used, keeping declare.py clean.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
External tables (~external_*) are deprecated in favor of the new
AttributeType-based storage system. The new types (<xblob>, <content>,
<object>) store data directly to storage via StorageBackend without
tracking tables.

- Remove src/datajoint/external.py entirely
- Remove ExternalMapping from schemas.py
- Remove external table pre-declaration from table.py

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Python 3.10+ doesn't have a built-in class property decorator (the
@classmethod + @property chaining was deprecated in 3.11). The modern
approach is to define properties on the metaclass, which automatically
makes them work at the class level.

- Move connection, table_name, full_table_name properties to TableMeta
- Create PartMeta subclass with overridden properties for Part tables
- Remove ClassProperty class from utils.py
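
The metaclass approach can be shown in a few lines. This is a generic sketch of the technique, not the PR's code: `DemoTableMeta`, `DemoTable`, and the CamelCase-to-snake_case rule are hypothetical stand-ins.

```python
import re


class DemoTableMeta(type):
    """Properties defined here are available on the class itself."""

    @property
    def table_name(cls) -> str:
        # Assumption for illustration: CamelCase class name -> snake_case.
        return re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()


class DemoTable(metaclass=DemoTableMeta):
    @property
    def table_name(self) -> str:
        # Instance lookup does not consult the metaclass, so delegate explicitly.
        return type(self).table_name


class SessionTrial(DemoTable):
    pass
```

With this in place both `SessionTrial.table_name` and `SessionTrial().table_name` evaluate to `"session_trial"`. The explicit instance-level property is needed because attributes defined on a metaclass are not visible through instances of the class.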

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Replace pytest-managed Docker containers with external docker-compose services.
This removes complexity, improves reliability, and allows running tests both
from the host machine and inside the devcontainer.

- Remove docker container lifecycle management from conftest.py
- Add pixi tasks for running tests (services-up, test, test-cov)
- Expose MySQL and MinIO ports in docker-compose.yaml for host access
- Simplify devcontainer to extend the main docker-compose.yaml
- Remove docker dependency from test requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix Table.table_name property to delegate to metaclass for UserTable subclasses
  (table_name was returning None instead of computed name)
- Fix heading type loading to preserve database type for core types (uuid, etc.)
  instead of overwriting with alias from comment
- Add original_type field to Attribute for storing the alias while keeping
  the actual SQL type in type field
- Fix tests: remove obsolete test_external.py, update resolve_dtype tests
  to expect 3 return values, update type alias tests to use CORE_TYPE_SQL
- Update pyproject.toml pytest_env to use D: prefix for default-only vars

Test results improved from 174 passed/284 errors to 381 passed/62 errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Type system changes:
- Core type `blob` stores raw bytes without serialization
- Built-in type `<djblob>` handles automatic serialization/deserialization
- Update jobs table to use <djblob> for key and error_stack columns
- Remove enable_python_native_blobs config check (always enabled)

Bug fixes:
- Fix is_blob detection to include NATIVE_BLOB types (longblob, mediumblob, etc.)
- Fix original_type fallback when None
- Fix test_type_aliases to use lowercase keys for CORE_TYPE_SQL lookup
- Allow None context for built-in types in heading initialization
- Update native type warning message wording

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update settings access tests to check type instead of specific value
  (safemode is set to False by conftest fixtures)
- Fix config.load() to handle nested JSON dicts in addition to flat
  dot-notation keys
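
Accepting both layouts usually comes down to flattening nested dictionaries into dotted keys before applying them. The following is a minimal sketch of that step, assuming a hypothetical helper named `flatten`; it is not the code in config.load().

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Turn {"database": {"host": "x"}} into {"database.host": "x"}.

    Keys that are already dotted pass through unchanged.
    """
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))  # recurse into nested sections
        else:
            flat[dotted] = value
    return flat
```

One caveat of this simple version: a setting whose value is itself meant to be a dictionary would also be flattened, so a real loader would need to know which keys are leaves.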

Test results: 417 passed (was 414)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update GraphType and LayoutToFilepathType to use <djblob> dtype
  (old filepath@store syntax no longer supported)
- Fix local_schema and schema_virtual_module fixtures to pass connection
- Remove unused imports

Test results: 421 passed, 58 errors, 13 failed (was 417/62/13)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Source code fixes:
- Add download_path setting and squeeze handling in fetch.py
- Add filename collision handling in AttachType and XAttachType
- Fix is_blob detection to check both BLOB and NATIVE_BLOB patterns
- Fix FilepathType.validate to accept Path objects
- Add proper error message for undecorated tables

Test infrastructure updates:
- Update schema_external.py to use new <xblob@store>, <xattach@store>, <filepath@store> syntax
- Update all test tables to use <djblob> instead of longblob for serialization
- Configure object_storage.stores in conftest.py fixtures
- Remove obsolete test_admin.py (set_password was removed)
- Fix connection passing in various tests to avoid credential prompts
- Fix test_query_caching to handle existing directories

README:
- Add Developer Guide section with setup, test, and pre-commit instructions

Test results: 408 passed, 2 skipped (macOS multiprocessing limitation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dimitri-yatsenko dimitri-yatsenko marked this pull request as ready for review January 1, 2026 00:10
dimitri-yatsenko and others added 9 commits December 31, 2025 21:02
- Add save_template() method to Config for creating datajoint.json templates
- Add default_store setting to ObjectStorageSettings
- Fix get_store_backend() to use default object storage when no store specified
- Fix StorageBackend._full_path() to prepend location for all protocols
- Fix StorageBackend.open() to create parent directories for write mode
- Fix ObjectType to support tuple (extension, data) format for streams
- Fix ObjectType to pass through pre-computed metadata for staged inserts
- Fix staged_insert.py path handling (use relative paths consistently)
- Fix table.py __make_placeholder to handle None values for adapter types
- Update schema_object.py to use <object> syntax (angle brackets required)
- Remove legacy external table support (Table.external property)
- Remove legacy external tests (test_filepath, test_external_class, test_s3)
- Add tests for save_template() method

Test results: 471 passed, 2 skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove Log class from table.py
- Remove _log property and all _log() calls from Table class
- Remove log property from Schema class
- Remove ~log table special handling in heading.py
- Remove test_log.py
- Bump version from 0.14.6 to 2.0.0a1
- Remove version length assertion (was only for log table compatibility)

The log table was an outdated approach to event logging. Modern systems
should use standard Python logging, external log aggregation services,
or database audit logs instead.

Test results: 470 passed, 2 skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a column is declared with a core type (like uuid, int32, float64),
describe() now displays the original core type name instead of the
underlying database type (e.g., shows "uuid" instead of "binary(16)").

Uses the Attribute.original_type field which stores the core type alias.

Bump version to 2.0.0a2.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Following the 12-Factor App methodology, environment variables now take
precedence over config file values. This is the standard DevOps practice
for deployments where secrets and environment-specific settings should
be injected via environment variables (Docker, Kubernetes, CI/CD).

Priority order (highest to lowest):
1. Environment variables (DJ_*)
2. Secrets files (.secrets/)
3. Config file (datajoint.json)
4. Defaults

Added ENV_VAR_MAPPING to track which settings have env var overrides.
The _update_from_flat_dict() method now skips file values when the
corresponding env var is set.

Added test_env_var_overrides_config_file to verify the new behavior.

Bump version to 2.0.0a3.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test optional dependency to pyproject.toml for docker-compose
- Remove deprecated minio-based s3.py client (using fsspec/s3fs now)
- Replace minio test fixtures with s3fs
- Fix Path.walk() for Python 3.10 compatibility (use os.walk)
- Use introspection instead of try/except TypeError for encoder params
- Make test_settings tests environment-agnostic (localhost vs docker)

All 473 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused methods that were superseded by AttributeType system:
- _process_object_value (replaced by ObjectType.encode)
- _build_object_url (only used by _process_object_value)
- get_object_storage (only used by _process_object_value)
- object_storage property (wrapper for get_object_storage)

Also removed unused imports: mimetypes, datetime, timezone,
StorageBackend, build_object_path, verify_or_create_store_metadata

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Test organization:
- Split tests into unit/ and integration/ folders
- Unit tests (test_attribute_type, test_hash, test_settings) run without Docker
- Integration tests require MySQL and MinIO services
- Update imports to use absolute paths (from tests.schema import ...)

Configuration simplification:
- Change pytest_env defaults to localhost (was Docker hostnames)
- Simplify pixi tasks (env vars now use defaults)
- Update devcontainer to set Docker-specific env vars
- Update docker-compose comments

Tests now run with just:
  pip install -e ".[test]"
  docker compose up -d db minio
  pytest tests/

Unit tests only:
  pytest tests/unit/

Bump version to 2.0.0a5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update to new pyparsing API (snake_case):
- setResultsName -> set_results_name
- delimitedList -> DelimitedList
- parseString -> parse_string
- parseAll -> parse_all
- endQuoteChar -> end_quote_char
- unquoteResults -> unquote_results

Reduces test warnings from 5854 to 3.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Access model_fields from class instead of instance:
- self.model_fields -> type(self).model_fields

Reduces test warnings from 3 to 1 (remaining is intentional user warning).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dimitri-yatsenko dimitri-yatsenko merged commit 164e7cc into pre/v2.0 Jan 1, 2026
4 checks passed
@dimitri-yatsenko dimitri-yatsenko deleted the claude/upgrade-adapted-type-1W3ap branch January 1, 2026 07:07

Labels

documentation Issues related to documentation enhancement Indicates new improvements

3 participants