From 8f6f10b5fd40dd9c9885b3949f5b12d88530996c Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 4 Dec 2025 12:38:34 -0800
Subject: [PATCH 01/17] Add comprehensive style guide for
 cuda/core/experimental

This commit adds a complete style guide covering conventions for Python
and Cython code in cuda/core/experimental. The guide includes:

- File structure and organization (SPDX headers, imports, __all__, ordering)
- Package layout (.pyx/.pxd/.py files, subpackage patterns)
- Import statement organization (5 groups with alphabetical sorting)
- Class and function definition ordering (dunder methods, methods, properties)
- Naming conventions (PascalCase, snake_case, UPPER_SNAKE_CASE)
- Type annotations (PEP 604 union syntax, forward references)
- Docstrings (NumPy style with comprehensive examples)
- Error handling and warnings (custom exceptions, stacklevel, one-time warnings)
- Memory management (resource lifecycle, cleanup patterns)
- Thread safety and concurrency (locks, thread-local storage)
- Cython-specific features (cdef/cpdef/def, nogil, inline functions)
- Constants and magic numbers (naming, CUDA constants)
- Comments and inline documentation (TODO, NOTE patterns)
- Code organization within files
- Performance considerations (GIL management, C types)
- API design principles (public vs private, backward compatibility)
- CUDA-specific patterns (GIL management for driver API calls)
- Copyright and licensing (SPDX format)

The guide follows PEP 8 as the base and promotes modern Python practices
(PEP 604, PEP 563) while documenting current codebase patterns.
---
 cuda_core/cuda/core/style-guide.md | 1818 ++++++++++++++++++++++++++++
 1 file changed, 1818 insertions(+)
 create mode 100644 cuda_core/cuda/core/style-guide.md

diff --git a/cuda_core/cuda/core/style-guide.md b/cuda_core/cuda/core/style-guide.md
new file mode 100644
index 0000000000..f9ea81abb3
--- /dev/null
+++ b/cuda_core/cuda/core/style-guide.md
@@ -0,0 +1,1818 @@
+# CUDA Core Style Guide
+
+This style guide defines conventions for Python and Cython code in `cuda/core/experimental`.
+
+**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide.** The rules in this document highlight project-specific conventions and extensions beyond PEP 8, particularly for Cython code and the structure of this codebase.
+
+## Table of Contents
+
+1. [File Structure](#file-structure)
+2. [Package Layout](#package-layout)
+3. [Import Statements](#import-statements)
+4. [Class and Function Definitions](#class-and-function-definitions)
+5. [Naming Conventions](#naming-conventions)
+6. [Type Annotations and Declarations](#type-annotations-and-declarations)
+7. [Docstrings](#docstrings)
+8. [Errors and Warnings](#errors-and-warnings)
+9. [Memory Management](#memory-management)
+10. [Thread Safety and Concurrency](#thread-safety-and-concurrency)
+11. [Cython-Specific Features](#cython-specific-features)
+12. [Constants and Magic Numbers](#constants-and-magic-numbers)
+13. [Comments and Inline Documentation](#comments-and-inline-documentation)
+14. [Code Organization Within Files](#code-organization-within-files)
+15. [Performance Considerations](#performance-considerations)
+16. [API Design Principles](#api-design-principles)
+17. [CUDA-Specific Patterns](#cuda-specific-patterns)
+18. [Copyright and Licensing](#copyright-and-licensing)
+
+---
+
+## File Structure
+
+Files in `cuda/core/experimental` must follow a consistent structure. The ordering of elements within a file is as follows:
+
+### 1. SPDX Copyright Header
+
+The file must begin with the SPDX copyright header as specified in [Copyright and Licensing](#copyright-and-licensing).
+
+### 2. Import Statements
+
+Import statements come immediately after the copyright header. Follow the import ordering conventions specified in [Import Statements](#import-statements).
+
+### 3. `__all__` Declaration
+
+If the module exports public API elements, include an `__all__` list after the imports and before any other code. This explicitly defines the public API of the module.
+
+```python
+__all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
+```
+
+### 4. Type Aliases and Constants
+
+Type aliases and module-level constants may immediately follow `__all__` (if present) or come after imports:
+
+```python
+DevicePointerT = driver.CUdeviceptr | int | None
+"""Type union for device pointer representations."""
+
+LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
+```
+
+### 5. Principal Class or Function
+
+If the file principally implements a single class or function (e.g., `_buffer.pyx` defines the `Buffer` class, `_device.pyx` defines the `Device` class), that principal class or function should come next, after `__all__` and any type aliases or constants.
+
+**The principal class or function is an exception to alphabetical ordering** and appears first in its section.
+
+### 6. Other Public Classes and Functions
+
+Following the principal class or function, define other public classes and functions. These include:
+
+- **Auxiliary classes**: Supporting classes that are part of the public API (e.g., `DeviceMemoryResourceOptions` is an auxiliary class used by `DeviceMemoryResource`)
+- **Abstract base classes**: ABCs that define interfaces (e.g., `MemoryResource` in `_buffer.pyx`)
+- **Other public classes**: Additional classes exported by the module
+
+**All classes and functions in this section should be sorted alphabetically by name**, regardless of their relationship to the principal class. The principal class appears first as an exception to this rule.
+
+**Example:** In `_device_memory_resource.pyx`, `DeviceMemoryResource` is the principal class and appears first. Then `DeviceMemoryResourceOptions` appears after it (alphabetically after the principal class), even though it's an auxiliary/options class.
+
+### 7. Public Module Functions
+
+After all classes, define public module-level functions that are part of the API.
+
+### 8. Private or Implementation Functions
+
+Finally, define private functions and implementation details. These include:
+
+- Functions with names starting with `_` (private)
+- `cdef inline` functions used for internal implementation
+- Helper functions not part of the public API
+
+### Example Structure
+
+```python
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
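+# Imports (cimports first, then regular imports; groups separated by blank lines)
+from libc.stdint cimport uintptr_t
+
+from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
+
+import abc
+
+from cuda.core.experimental._utils.cuda_utils import driver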
+
+__all__ = ['Buffer', 'MemoryResource', 'some_public_function']
+
+# Type aliases (if any)
+DevicePointerT = driver.CUdeviceptr | int | None
+"""Type union for device pointer representations."""
+
+# Principal class
+cdef class Buffer:
+    """Principal class for this module."""
+    # ...
+
+# Other public classes
+cdef class MemoryResource:
+    """Abstract base class."""
+    # ...
+
+# Public module functions
+def some_public_function():
+    """Public API function."""
+    # ...
+
+# Private implementation functions
+cdef inline void Buffer_close(Buffer self, stream):
+    """Private implementation helper."""
+    # ...
+```
+
+### Notes
+
+- Not every file will have all sections. For example, a utility module may not have a principal class.
+- The distinction between "principal" and "other" classes is based on the file's primary purpose. If a file exists primarily to define one class, that class is the principal class.
+- Private implementation functions should be placed at the end of the file to keep the public API visible at the top.
+- **Within each section**, classes and functions should be sorted alphabetically by name. The principal class or function is an exception to this rule, as it appears first in its respective section.
+
+## Package Layout
+
+### File Types
+
+The `cuda/core/experimental` package uses three types of files:
+
+1. **`.pyx` files**: Cython implementation files containing the actual code
+2. **`.pxd` files**: Cython declaration files containing type definitions and function signatures for C-level access
+3. **`.py` files**: Pure Python files for utilities and high-level interfaces
+
+### File Naming Conventions
+
+- **Implementation files**: Use `.pyx` for Cython code, `.py` for pure Python code
+- **Declaration files**: Use `.pxd` for Cython type declarations
+- **Private modules**: Prefix with underscore (e.g., `_buffer.pyx`, `_device.pyx`)
+- **Public modules**: No underscore prefix (e.g., `utils.py`)
+
+### Relationship Between `.pxd` and `.pyx` Files
+
+For each `.pyx` file that defines classes or functions used by other Cython modules, create a corresponding `.pxd` file:
+
+- **`.pxd` file**: Contains `cdef` class declarations, `cdef`/`cpdef` function signatures, and `cdef` attribute declarations
+- **`.pyx` file**: Contains the full implementation including Python methods, docstrings, and implementation details
+
+**Example:**
+
+`_buffer.pxd`:
+```python
+cdef class Buffer:
+    cdef:
+        uintptr_t      _ptr
+        size_t         _size
+        MemoryResource _memory_resource
+        object         _ipc_data
+```
+
+`_buffer.pyx`:
+```python
+cdef class Buffer:
+    """Full implementation with methods and docstrings."""
+    # The cdef attributes are declared once, in _buffer.pxd.
+    # Cython rejects a second declaration of them in the .pyx file.
+
+    def close(self, stream=None):
+        """Implementation here."""
+        # ...
+```
+
+### Module Organization
+
+#### Simple Top-Level Modules
+
+For simple modules at the `cuda/core/experimental` level, define classes and functions directly in the module file with an `__all__` list:
+
+```python
+# _device.pyx
+__all__ = ['Device', 'DeviceProperties']
+
+cdef class Device:
+    # ...
+
+cdef class DeviceProperties:
+    # ...
+```
+
+#### Complex Subpackages
+
+For complex subpackages that require extra structure (like `_memory/`), use the following pattern:
+
+1. **Private submodules**: Each component is implemented in a private submodule (e.g., `_buffer.pyx`, `_device_memory_resource.pyx`)
+2. **Submodule `__all__`**: Each submodule defines its own `__all__` list with the symbols it exports
+3. **Subpackage `__init__.py`**: The subpackage `__init__.py` uses `from ._module import *` to assemble the package
+
+**Example structure for `_memory/` subpackage:**
+
+`_memory/_buffer.pyx`:
+```python
+__all__ = ['Buffer', 'MemoryResource']
+
+cdef class Buffer:
+    # ...
+
+cdef class MemoryResource:
+    # ...
+```
+
+`_memory/_device_memory_resource.pyx`:
+```python
+__all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
+
+cdef class DeviceMemoryResource:
+    # ...
+
+cdef class DeviceMemoryResourceOptions:
+    # ...
+```
+
+`_memory/__init__.py`:
+```python
+from ._buffer import *  # noqa: F403
+from ._device_memory_resource import *  # noqa: F403
+from ._graph_memory_resource import *  # noqa: F403
+from ._ipc import *  # noqa: F403
+from ._legacy import *  # noqa: F403
+from ._virtual_memory_resource import *  # noqa: F403
+```
+
+This pattern allows:
+- **Modular organization**: Each component lives in its own file
+- **Clear public API**: Each submodule explicitly defines what it exports via `__all__`
+- **Clean package interface**: The subpackage `__init__.py` assembles all exports into a single namespace
+- **Easier refactoring**: Components can be moved or reorganized without changing the public API
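+
+As a rough illustration of how `__all__` limits what a wildcard import re-exports, the following pure-Python sketch builds a throwaway module in memory. The module name `_example_buffer` and its contents are hypothetical stand-ins, not part of `cuda.core`:
+
+```python
+import sys
+import types
+
+# Build a stand-in submodule in memory (hypothetical name and contents).
+source = """
+__all__ = ['Buffer']
+
+class Buffer:
+    pass
+
+def internal_helper():
+    pass
+"""
+submodule = types.ModuleType("_example_buffer")
+exec(source, submodule.__dict__)
+sys.modules["_example_buffer"] = submodule
+
+# Emulate what a subpackage __init__.py does with a wildcard import.
+package_namespace = {}
+exec("from _example_buffer import *", package_namespace)
+
+# Collect the names the wildcard import brought in, ignoring dunder entries.
+exported = sorted(name for name in package_namespace if not name.startswith("__"))
+print(exported)
+```
+
+Here `internal_helper` has no leading underscore, yet it is not re-exported because it is missing from `__all__`.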
+
+**Migration guidance**: Simple top-level modules can be migrated to this subpackage structure when they become sufficiently complex (e.g., when a module grows to multiple related classes or when logical grouping would improve maintainability).
+
+### Guidelines
+
+1. **Always create `.pxd` files for shared Cython types**: If a class or function is `cimport`ed by other modules, provide a `.pxd` declaration file.
+
+2. **Keep `.pxd` files minimal**: Only include declarations needed for Cython compilation. Omit implementation details, docstrings, and Python-only code.
+
+3. **Use `__all__` in submodules**: Each submodule should define `__all__` to explicitly declare its public API.
+
+4. **Use `from ._module import *` in subpackage `__init__.py`**: This pattern assembles the subpackage API from its submodules. Use `# noqa: F403` to suppress linting warnings about wildcard imports.
+
+5. **Migrate to subpackage structure when complex**: When a top-level module becomes complex (multiple related classes, logical grouping needed), consider refactoring to the subpackage pattern.
+
+6. **Separate concerns**: Use `.py` files for pure Python utilities, `.pyx` files for Cython implementations that need C-level performance.
+
+## Import Statements
+
+Import statements must be organized into five groups, in the order listed below.
+
+**Note**: Within each group, imports should be sorted alphabetically.
+
+### 1. `__future__` Imports
+
+`__future__` imports must come first, before all other imports.
+
+```python
+from __future__ import annotations
+```
+
+### 2. External `cimport` Statements
+
+External Cython imports from standard libraries and third-party packages. This includes:
+
+- `libc.*` (e.g., `libc.stdint`, `libc.stdlib`, `libc.string`)
+- `cpython`
+- `cython`
+- `cuda.bindings` (CUDA bindings package)
+
+```python
+cimport cpython
+from cuda.bindings cimport cydriver
+from libc.stdint cimport uintptr_t
+from libc.stdlib cimport malloc, free
+```
+
+### 3. cuda-core `cimport` Statements
+
+Cython imports from within the `cuda.core.experimental` package.
+
+```python
+from cuda.core.experimental._memory._buffer cimport Buffer, MemoryResource
+from cuda.core.experimental._stream cimport Stream_accept, Stream
+from cuda.core.experimental._utils.cuda_utils cimport (
+    HANDLE_RETURN,
+    check_or_create_options,
+)
+```
+
+### 4. External `import` Statements
+
+Regular Python imports from standard libraries and third-party packages. This includes:
+
+- Standard library modules (e.g., `abc`, `typing`, `threading`, `dataclasses`)
+- Third-party packages
+
+```python
+import abc
+import threading
+from dataclasses import dataclass
+```
+
+### 5. cuda-core `import` Statements
+
+Regular Python imports from within the `cuda.core.experimental` package.
+
+```python
+from cuda.core.experimental._context import Context, ContextOptions
+from cuda.core.experimental._dlpack import DLDeviceType, make_py_capsule
+from cuda.core.experimental._utils.cuda_utils import (
+    CUDAError,
+    driver,
+    handle_return,
+)
+```
+
+### Additional Rules
+
+1. **Alphabetical Ordering**: Within each group, imports should be sorted alphabetically by module name.
+
+2. **Multi-line Imports**: When importing multiple items from a single module, use parentheses for multi-line formatting:
+   ```python
+   from cuda.core.experimental._utils.cuda_utils cimport (
+       HANDLE_RETURN,
+       check_or_create_options,
+   )
+   ```
+
+3. **Type-only imports**: With `from __future__ import annotations`, types can be imported normally even if only used in annotations. Avoid `TYPE_CHECKING` blocks (see [Type Annotations and Declarations](#type-annotations-and-declarations) for details).
+
+4. **Blank Lines**: Use blank lines to separate the five import groups. Do not use blank lines within a group unless using multi-line import formatting.
+
+5. **`try/except` Blocks**: Import fallbacks (e.g., for optional dependencies) should be placed in the appropriate group (external or cuda-core) using `try/except` blocks.
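+
+A minimal sketch of rule 5, assuming a hypothetical optional package named `_example_fast_json`; the fallback sits in the external import group next to the other external imports:
+
+```python
+import json
+
+try:
+    # Hypothetical optional accelerator; it is not a real dependency.
+    import _example_fast_json as json_backend
+except ImportError:
+    # Fall back to the standard library when the optional package is absent.
+    json_backend = json
+
+
+def dumps(obj):
+    """Serialize ``obj`` with whichever backend was importable."""
+    return json_backend.dumps(obj)
+```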
+
+### Example
+
+```python
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+# 1. __future__ imports
+from __future__ import annotations
+
+# 2. External cimports
+cimport cpython
+from cuda.bindings cimport cydriver
+from libc.stdint cimport uintptr_t
+from libc.stdlib cimport malloc, free
+
+# 3. cuda-core cimports
+from cuda.core.experimental._memory._buffer cimport Buffer, MemoryResource
+from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+
+# 4. External imports
+import abc
+from dataclasses import dataclass
+
+# 5. cuda-core imports
+from cuda.core.experimental._context import Context
+from cuda.core.experimental._device import Device
+from cuda.core.experimental._utils.cuda_utils import driver
+```
+
+## Class and Function Definitions
+
+### Class Definition Order
+
+Within a class definition, elements must be organized in the following order:
+
+1. **Special (dunder) methods**: Methods with names starting and ending with double underscores (e.g., `__init__`, `__cinit__`, `__dealloc__`, `__reduce__`, `__dlpack__`)
+
+2. **Methods**: Regular instance methods, class methods (`@classmethod`), and static methods (`@staticmethod`)
+
+3. **Properties**: Properties defined with `@property` decorator
+
+**Note**: Within each section (dunder methods, methods, properties), elements should be sorted alphabetically by name.
+
+### Example
+
+```python
+cdef class Buffer:
+    """Example class demonstrating the ordering."""
+
+    # 1. Special (dunder) methods (alphabetically sorted)
+    def __buffer__(self, flags: int, /) -> memoryview:
+        """Buffer protocol support."""
+        # ...
+
+    def __cinit__(self):
+        """Cython initialization."""
+        # ...
+
+    def __dealloc__(self):
+        """Cleanup."""
+        # ...
+
+    def __dlpack__(self, *, stream=None):
+        """DLPack protocol support."""
+        # ...
+
+    def __init__(self, *args, **kwargs):
+        """Python initialization."""
+        # ...
+
+    def __reduce__(self):
+        """Pickle support."""
+        # ...
+
+    # 2. Methods (alphabetically sorted)
+    def close(self, stream=None):
+        """Close the buffer."""
+        # ...
+
+    def copy_from(self, src, *, stream):
+        """Copy data from source buffer."""
+        # ...
+
+    def copy_to(self, dst=None, *, stream):
+        """Copy data to destination buffer."""
+        # ...
+
+    @classmethod
+    def from_handle(cls, ptr, size, mr=None):
+        """Create buffer from handle."""
+        # ...
+
+    def get_ipc_descriptor(self):
+        """Get IPC descriptor."""
+        # ...
+
+    # 3. Properties (alphabetically sorted)
+    @property
+    def device_id(self) -> int:
+        """Device ID property."""
+        # ...
+
+    @property
+    def handle(self):
+        """Handle property."""
+        # ...
+
+    @property
+    def size(self) -> int:
+        """Size property."""
+        # ...
+```
+
+### Helper Functions
+
+Sometimes, implementation details are moved outside of the class definition to improve readability. Helper functions should be placed at the end of the file (in the private/implementation section) when:
+
+- The indentation level exceeds 4 levels
+- A method definition is long (>20 lines)
+- The class definition itself is very long
+
+In Cython files, these are often `cdef` or `cdef inline` functions. The helper function name typically follows the pattern `ClassName_methodname` (e.g., `DMR_close`, `Buffer_close`).
+
+**Example:**
+
+```python
+cdef class DeviceMemoryResource:
+    def close(self):
+        """Close the memory resource."""
+        DMR_close(self)  # Calls helper function
+
+# ... other classes and functions ...
+
+# Helper function at end of file
+cdef inline DMR_close(DeviceMemoryResource self):
+    """Implementation moved outside class for readability."""
+    if self._handle == NULL:
+        return
+    # ... implementation ...
+```
+
+### Function Definitions
+
+For module-level functions (outside of classes), follow the ordering specified in [File Structure](#file-structure): principal functions first (if applicable), then other public functions, then private functions. Within each group, sort alphabetically.
+
+## Naming Conventions
+
+### Class Names
+
+Use **PascalCase** (also known as CapWords) for class names.
+
+```python
+cdef class Buffer:
+    # ...
+
+cdef class DeviceMemoryResource:
+    # ...
+
+class CUDAError(Exception):
+    # ...
+```
+
+### Function and Method Names
+
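+Use **snake_case** for function and method names. The exception is a module-level helper that implements a method, which follows the `ClassName_methodname` pattern described in [Helper Functions](#helper-functions).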
+
+```python
+def allocate(self, size_t size, stream=None) -> Buffer:
+    # ...
+
+def get_ipc_descriptor(self) -> IPCBufferDescriptor:
+    # ...
+
+cdef inline void Buffer_close(Buffer self, stream):
+    # ...
+```
+
+### Variable Names
+
+#### Python Variables
+
+Use **snake_case** for Python variables.
+
+```python
+device_id = 0
+memory_resource = DeviceMemoryResource(device_id)
+buffer_size = 1024
+```
+
+#### Private Attributes
+
+Use **snake_case** with a leading underscore for private instance attributes.
+
+```python
+cdef class Buffer:
+    cdef:
+        uintptr_t _ptr
+        size_t _size
+        MemoryResource _memory_resource
+        object _ipc_data
+```
+
+#### Cython `cdef` Variables
+
+Consider prefixing `cdef` variables with `c_` to distinguish them from Python variables. This improves code readability by making it clear which variables are C-level types.
+
+**Preferred:**
+```python
+def copy_to(self, dst: Buffer | None = None, *, stream: Stream | GraphBuilder) -> Buffer:
+    stream = Stream_accept(stream)
+    cdef size_t c_src_size = self._size
+
+    if dst is None:
+        dst = self._memory_resource.allocate(c_src_size, stream)
+
+    cdef size_t c_dst_size = dst._size
+    if c_dst_size != c_src_size:
+        raise ValueError(f"buffer sizes mismatch: src={c_src_size}, dst={c_dst_size}")
+    # ...
+```
+
+**Also acceptable (if context is clear):**
+```python
+cdef cydriver.CUdevice get_device_from_ctx(
+        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
+    cdef bint switch_context = (curr_ctx != target_ctx)
+    cdef cydriver.CUcontext ctx
+    cdef cydriver.CUdevice target_dev
+    # ...
+```
+
+The `c_` prefix is particularly helpful when mixing Python and Cython variables in the same scope, or when the variable name would otherwise be ambiguous.
+
+### Constants
+
+Use **UPPER_SNAKE_CASE** for module-level constants.
+
+```python
+LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
+PER_THREAD_DEFAULT_STREAM = C_PER_THREAD_DEFAULT_STREAM
+
+RUNTIME_CUDA_ERROR_EXPLANATIONS = {
+    # ...
+}
+```
+
+### Private Module-Level Names
+
+Use a leading underscore for private module-level functions, classes, and variables. Functions and variables otherwise use **snake_case**; private classes keep **PascalCase** after the underscore.
+
+```python
+_fork_warning_checked = False
+
+def _reduce_3_tuple(t: tuple):
+    # ...
+
+cdef inline void _helper_function():
+    # ...
+```
+
+## Type Annotations and Declarations
+
+### Python Type Annotations
+
+#### PEP 604 Union Syntax
+
+Use the modern [PEP 604](https://peps.python.org/pep-0604/) union syntax (`X | Y`) instead of `typing.Union` or `typing.Optional`.
+
+**Preferred:**
+```python
+def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+    # ...
+
+def close(self, stream: Stream | None = None):
+    # ...
+```
+
+**Avoid:**
+```python
+from typing import Optional, Union
+
+def allocate(self, size_t size, stream: Optional[Union[Stream, GraphBuilder]] = None) -> Buffer:
+    # ...
+
+def close(self, stream: Optional[Stream] = None):
+    # ...
+```
+
+#### Forward References and `from __future__ import annotations`
+
+Where needed, files should include `from __future__ import annotations` at the top (after the SPDX header). This enables:
+
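+1. **Forward references**: Type annotations can reference types that are defined later in the file or in other modules without quoting them and without requiring `TYPE_CHECKING` blocks.
+
+2. **Deferred evaluation**: Annotations are stored as strings instead of being evaluated when the function is defined ([PEP 563](https://peps.python.org/pep-0563/)), so an annotation cannot fail at import time because a name is not yet bound. This helps with circular imports.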
+
+**Preferred:**
+```python
+from __future__ import annotations
+
+# Can reference Stream even if it's defined later or in another module
+def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
+    # ...
+```
+
+**Avoid:**
+```python
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from cuda.core.experimental._stream import Stream
+
+def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
+    # ...
+```
+
+#### Guidelines
+
+1. **Use `from __future__ import annotations`**: This should be present in all `.py` and `.pyx` files with type annotations.
+
+2. **Use `|` for unions**: Prefer `X | Y` over `Union[X, Y]` and `X | None` over `Optional[X]`.
+
+3. **Avoid `TYPE_CHECKING` blocks**: With `from __future__ import annotations`, forward references work without `TYPE_CHECKING` guards.
+
+4. **Import types normally**: Even if a type is only used in annotations, import it normally (not in a `TYPE_CHECKING` block).
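+
+The deferred evaluation behind these guidelines can be observed in plain Python. The sketch below uses stand-in `Buffer` and `Stream` classes (not the real `cuda.core` types) that are defined after the function that names them:
+
+```python
+from __future__ import annotations
+
+
+def allocate(size: int, stream: Stream | None = None) -> Buffer:
+    """Stand-in function whose annotations name classes defined further down."""
+    return Buffer()
+
+
+class Buffer:
+    pass
+
+
+class Stream:
+    pass
+
+
+# The annotations were not evaluated at definition time; they are kept as strings.
+print(allocate.__annotations__)
+```
+
+Without the `__future__` import, the `def` statement would raise `NameError` because `Stream` is not yet bound when the annotations are evaluated.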
+
+### Cython Type Declarations
+
+Cython uses `cdef` declarations for C-level types. These follow different rules:
+
+```python
+cdef class Buffer:
+    cdef:
+        uintptr_t _ptr
+        size_t _size
+        MemoryResource _memory_resource
+```
+
+For Cython-specific type declarations, see [Cython-Specific Features](#cython-specific-features).
+
+## Docstrings
+
+This project uses the **NumPy docstring style** for all documentation. This format is well-suited for scientific and technical libraries and integrates well with Sphinx documentation generation.
+
+### Format Overview
+
+Docstrings use triple double-quotes (`"""`) and follow this general structure:
+
+```python
+"""Summary line.
+
+Extended description (optional).
+
+Parameters
+----------
+param1 : type
+    Description of param1.
+param2 : type, optional
+    Description of param2. Default is value.
+
+Returns
+-------
+return_type
+    Description of return value.
+
+Raises
+------
+ExceptionType
+    Description of when this exception is raised.
+
+Notes
+-----
+Additional notes and implementation details.
+
+Examples
+--------
+>>> example_code()
+result
+"""
+```
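+
+To make the template concrete, here is a small sketch that applies it to a function. The `align_up` helper is hypothetical and exists only to show the sections filled in:
+
+```python
+def align_up(size: int, alignment: int = 256) -> int:
+    """Round a size up to the next multiple of an alignment.
+
+    Parameters
+    ----------
+    size : int
+        The size to round up, in bytes.
+    alignment : int, optional
+        The required alignment, in bytes. Default is 256.
+
+    Returns
+    -------
+    int
+        The smallest multiple of ``alignment`` that is at least ``size``.
+
+    Raises
+    ------
+    ValueError
+        If ``size`` is negative or ``alignment`` is not positive.
+
+    Examples
+    --------
+    >>> align_up(1000)
+    1024
+    """
+    if size < 0 or alignment <= 0:
+        raise ValueError(f"invalid arguments: size={size}, alignment={alignment}")
+    # Ceiling division via negated floor division, then scale back up.
+    return -(-size // alignment) * alignment
+```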
+
+### Module Docstrings
+
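+Module docstrings should appear after imports and `__all__` (if present), before any classes or functions. They should provide a brief overview of the module's purpose. Note that Python populates `__doc__` only from a string literal that is the first statement in the file, so a docstring placed after the imports documents the source but is not available through `module.__doc__`.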
+
+```python
+"""Module for managing CUDA device memory resources.
+
+This module provides classes and functions for allocating and managing
+device memory using CUDA's stream-ordered memory pool API.
+"""
+```
+
+For simple utility modules, a single-line docstring may suffice:
+
+```python
+"""Utility functions for CUDA error handling."""
+```
+
+### Class Docstrings
+
+Class docstrings should include:
+
+1. **Summary line**: A one-line description of the class
+2. **Extended description** (optional): Additional context about the class
+3. **Parameters section**: If the class is constructed with arguments (has `__init__`), document the constructor parameters
+4. **Attributes section**: Document public attributes (if any)
+5. **Notes section**: Important usage notes or implementation details
+6. **Examples section**: Usage examples (if helpful)
+
+**Example:**
+
+```python
+cdef class DeviceMemoryResource(MemoryResource):
+    """
+    A device memory resource managing a stream-ordered memory pool.
+
+    Parameters
+    ----------
+    device_id : Device | int
+        Device or device ordinal for which a memory resource is constructed.
+    options : DeviceMemoryResourceOptions, optional
+        Memory resource creation options. If None, uses the driver's current
+        or default memory pool for the specified device.
+
+    Attributes
+    ----------
+    device_id : int
+        The device ID associated with this memory resource.
+    is_ipc_enabled : bool
+        Whether this memory resource supports IPC.
+
+    Notes
+    -----
+    To create an IPC-enabled memory resource, specify ``ipc_enabled=True``
+    in the options. IPC-enabled resources can share allocations between
+    processes.
+
+    Examples
+    --------
+    >>> dmr = DeviceMemoryResource(0)
+    >>> buffer = dmr.allocate(1024)
+    """
+```
+
+For simple classes, a brief docstring may be sufficient:
+
+```python
+@dataclass
+cdef class DeviceMemoryResourceOptions:
+    """Customizable DeviceMemoryResource options.
+
+    Attributes
+    ----------
+    ipc_enabled : bool, optional
+        Whether to create an IPC-enabled memory pool. Default is False.
+    max_size : int, optional
+        Maximum pool size. Default is 0 (system-dependent).
+    """
+```
+
+### Method and Function Docstrings
+
+Method and function docstrings should include:
+
+1. **Summary line**: A one-line description starting with a verb (e.g., "Allocate", "Return", "Create")
+2. **Extended description** (optional): Additional details about behavior
+3. **Parameters section**: All parameters with types and descriptions
+4. **Returns section**: Return type and description
+5. **Raises section**: Exceptions that may be raised (if any)
+6. **Notes section**: Important implementation details or usage notes (if needed)
+7. **Examples section**: Usage examples (if helpful)
+
+**Example:**
+
+```python
+def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+    """Allocate a buffer of the requested size.
+
+    Parameters
+    ----------
+    size : int
+        The size of the buffer to allocate, in bytes.
+    stream : Stream | GraphBuilder, optional
+        The stream on which to perform the allocation asynchronously.
+        If None, an internal stream is used.
+
+    Returns
+    -------
+    Buffer
+        The allocated buffer object, which is accessible on the device
+        that this memory resource was created for.
+
+    Raises
+    ------
+    TypeError
+        If called on a mapped IPC-enabled memory resource.
+    RuntimeError
+        If allocation fails.
+
+    Notes
+    -----
+    The allocated buffer is associated with this memory resource and will
+    be deallocated when the buffer is closed or when this resource is closed.
+    """
+```
+
+For simple functions, a brief docstring may suffice:
+
+```python
+def get_ipc_descriptor(self) -> IPCBufferDescriptor:
+    """Export a buffer allocated for sharing between processes."""
+```
+
+### Property Docstrings
+
+Property docstrings should be concise and focus on what the property represents. For read-write properties, document both getter and setter behavior.
+
+**Read-only property:**
+
+```python
+@property
+def device_id(self) -> int:
+    """Return the device ordinal of this buffer."""
+```
+
+**Read-write property:**
+
+```python
+@property
+def peer_accessible_by(self):
+    """
+    Get or set the devices that can access allocations from this memory pool.
+
+    Returns
+    -------
+    tuple of int
+        A tuple of sorted device IDs that currently have peer access to
+        allocations from this memory pool.
+
+    Notes
+    -----
+    When setting, accepts a sequence of Device objects or device IDs.
+    Setting to an empty sequence revokes all peer access.
+
+    Examples
+    --------
+    >>> dmr.peer_accessible_by = [1]  # Grant access to device 1
+    >>> assert dmr.peer_accessible_by == (1,)
+    """
+```
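+
+The example above shows only the getter. As a sketch of how the setter fits in, the pure-Python class below uses a hypothetical `ExamplePool` stand-in; the docstring lives on the getter because that is the one Python exposes as the property's `__doc__`:
+
+```python
+class ExamplePool:
+    """Hypothetical stand-in for a memory resource with a settable property."""
+
+    def __init__(self):
+        self._peers = ()
+
+    @property
+    def peer_accessible_by(self):
+        """Get or set the device IDs that can access this pool.
+
+        Returns
+        -------
+        tuple of int
+            Sorted device IDs that currently have access.
+        """
+        return self._peers
+
+    @peer_accessible_by.setter
+    def peer_accessible_by(self, device_ids):
+        # No docstring here: the property takes its docstring from the getter.
+        self._peers = tuple(sorted(set(device_ids)))
+```
+
+Setting `pool.peer_accessible_by = [2, 1]` on an instance stores the IDs as a sorted tuple, matching the return type documented on the getter.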
+
+### Type References in Docstrings
+
+Use Sphinx-style cross-references for types:
+
+- **Classes**: ``:class:`Buffer` `` or ``:class:`~_memory.Buffer` `` (with `~` to hide module path)
+- **Methods**: ``:meth:`DeviceMemoryResource.allocate` ``
+- **Attributes**: ``:attr:`device_id` ``
+- **Modules**: ``:mod:`multiprocessing` ``
+- **Objects**: ``:obj:`~_memory.DevicePointerT` ``
+
+**Example:**
+
+```python
+def from_handle(
+    ptr: DevicePointerT, size_t size, mr: MemoryResource | None = None
+) -> Buffer:
+    """Create a new :class:`Buffer` object from a pointer.
+
+    Parameters
+    ----------
+    ptr : :obj:`~_memory.DevicePointerT`
+        Allocated buffer handle object.
+    size : int
+        Memory size of the buffer.
+    mr : :class:`~_memory.MemoryResource`, optional
+        Memory resource associated with the buffer.
+    """
+```
+
+### Guidelines
+
+1. **Always include docstrings**: All public classes, methods, functions, and properties should have docstrings.
+
+2. **Start with a verb**: Summary lines for methods and functions should start with a verb in imperative mood (e.g., "Allocate", "Return", "Create", not "Allocates", "Returns", "Creates").
+
+3. **Be concise but complete**: Provide enough information for users to understand and use the API, but avoid unnecessary verbosity.
+
+4. **Use proper sections**: Include Parameters, Returns, Raises sections when applicable. Use Notes and Examples sections when they add value.
+
+5. **Document optional parameters**: Clearly indicate optional parameters and their default values.
+
+6. **Use type hints**: Type information in docstrings should complement (not duplicate) type annotations. Use docstrings to provide additional context about types.
+
+7. **Cross-reference related APIs**: Use Sphinx cross-references to link to related classes, methods, and attributes.
+
+8. **Keep private methods brief**: Private methods (starting with `_`) may have minimal docstrings, but should still document non-obvious behavior.
+
+9. **Update docstrings with code changes**: Keep docstrings synchronized with implementation changes.
+
+## Errors and Warnings
+
+### Exception Types
+
+#### Custom Exceptions
+
+The project defines custom exception types for CUDA-specific errors:
+
+- **`CUDAError`**: Base exception for CUDA-related errors
+- **`NVRTCError`**: Exception for NVRTC (compiler) errors, inherits from `CUDAError`
+
+```python
+from cuda.core.experimental._utils.cuda_utils import CUDAError, NVRTCError
+
+raise CUDAError("CUDA operation failed")
+raise NVRTCError("NVRTC compilation error")
+```
+
+#### Standard Python Exceptions
+
+Use standard Python exceptions when appropriate:
+
+- **`ValueError`**: Invalid argument values
+- **`TypeError`**: Invalid argument types
+- **`RuntimeError`**: Runtime errors that don't fit other categories
+- **`NotImplementedError`**: Features that are not yet implemented
+- **`BufferError`**: Buffer protocol-related errors
+
+```python
+if size < 0:
+    raise ValueError(f"size must be non-negative, got {size}")
+
+if not isinstance(stream, Stream):
+    raise TypeError(f"stream must be a Stream, got {type(stream)}")
+
+if not self.is_ipc_enabled:
+    raise RuntimeError("Memory resource is not IPC-enabled")
+```
+
+### Raising Errors
+
+#### Error Messages
+
+Error messages should be clear and include context:
+
+**Preferred:**
+```python
+if dst_size != src_size:
+    raise ValueError(
+        f"buffer sizes mismatch between src and dst "
+        f"(sizes are: src={src_size}, dst={dst_size})"
+    )
+```
+
+**Avoid:**
+```python
+if dst_size != src_size:
+    raise ValueError("sizes don't match")
+```
+
+#### CUDA API Error Handling
+
+For CUDA Driver API calls, use the `HANDLE_RETURN` macro in `nogil` contexts:
+
+```python
+cdef int allocate_buffer(cydriver.CUdeviceptr* ptr, size_t size) except?-1 nogil:
+    HANDLE_RETURN(cydriver.cuMemAlloc(ptr, size))
+    return 0
+```
+
+For Python-level CUDA error handling, use `handle_return()`:
+
+```python
+from cuda.core.experimental._utils.cuda_utils import handle_return
+
+err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
+handle_return((err,))
+```
+
+Or use `raise_if_driver_error()` for direct error raising:
+
+```python
+from cuda.core.experimental._utils.cuda_utils cimport (
+    _check_driver_error as raise_if_driver_error,
+)
+
+err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
+raise_if_driver_error(err)
+```
+
+#### Error Explanations
+
+CUDA errors include explanations from dictionaries (`DRIVER_CU_RESULT_EXPLANATIONS`, `RUNTIME_CUDA_ERROR_EXPLANATIONS`) when available. The error checking functions (`_check_driver_error()`, `_check_runtime_error()`) automatically include these explanations in the error message.
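+
+The lookup described above can be sketched in plain Python. This is only an illustration: the `EXPLANATIONS` table, its wording, and the `format_error()` and `check()` helpers are hypothetical stand-ins, not the project's actual `_check_driver_error()` implementation.
+
+```python
+# Hypothetical stand-in for an explanation table keyed by numeric error code.
+EXPLANATIONS = {
+    1: "One or more parameters passed to the API call are out of range.",
+    2: "The API call failed because it could not allocate enough memory.",
+}
+
+
+class CUDAError(Exception):
+    """Stand-in for the project's CUDAError."""
+
+
+def format_error(name: str, code: int) -> str:
+    """Return "<name>: <explanation>" when one is known, else just the name."""
+    expl = EXPLANATIONS.get(code)
+    return name if expl is None else f"{name}: {expl}"
+
+
+def check(name: str, code: int) -> None:
+    """Raise CUDAError for any non-zero code; do nothing on success."""
+    if code != 0:
+        raise CUDAError(format_error(name, code))
+```
+
+Falling back to the bare error name keeps the message useful even when the table has no entry for a code.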
+
+### Warnings
+
+#### Warning Categories
+
+Use appropriate warning categories:
+
+- **`UserWarning`**: For user-facing warnings about potentially problematic usage
+- **`DeprecationWarning`**: For deprecated features that will be removed in future versions
+
+```python
+import warnings
+
+warnings.warn(
+    "multiprocessing start method is 'fork', which CUDA does not support. "
+    "Forked subprocesses exhibit undefined behavior. "
+    "Set the start method to 'spawn' before creating processes that use CUDA.",
+    UserWarning,
+    stacklevel=3
+)
+
+warnings.warn(
+    "Implementing __cuda_stream__ as an attribute is deprecated; "
+    "it must be implemented as a method",
+    DeprecationWarning,
+    stacklevel=3
+)
+```
+
+#### Stack Level
+
+Always specify the `stacklevel` parameter to point to the caller, not the warning location:
+
+```python
+warnings.warn(message, UserWarning, stacklevel=3)
+```
+
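+The right value depends on how many frames separate `warnings.warn()` from user code. `stacklevel=1` (the default) points at the `warnings.warn()` call itself, `stacklevel=2` points at the caller of the function that issues the warning, and `stacklevel=3` is needed when the warning is issued from a helper that the public function calls.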
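+
+To see how `stacklevel` changes attribution, the following self-contained sketch routes a warning through a helper and records where Python reports it. All function names here are hypothetical and not part of `cuda.core`.
+
+```python
+import warnings
+
+
+def _warn_helper(message: str) -> None:
+    # Frame 1 is this helper, frame 2 is public_api(), frame 3 is its caller.
+    warnings.warn(message, UserWarning, stacklevel=3)
+
+
+def public_api() -> None:
+    _warn_helper("public_api() was called with a questionable argument")
+
+
+def user_code() -> None:
+    public_api()  # the warning should be attributed to this line
+
+
+def attributed_line() -> int:
+    """Run user_code() and return the line number the warning points at."""
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always")
+        user_code()
+    return caught[0].lineno
+```
+
+With `stacklevel=2` the same warning would instead be attributed to the `_warn_helper(...)` call inside `public_api()`, which is rarely what the user needs to see.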
+
+#### One-Time Warnings
+
+For warnings that should only be emitted once per process, use a module-level flag:
+
+```python
+_fork_warning_checked = False
+
+def check_multiprocessing_start_method():
+    global _fork_warning_checked
+    if _fork_warning_checked:
+        return
+    _fork_warning_checked = True
+
+    # ... check condition and emit warning ...
+    warnings.warn(message, UserWarning, stacklevel=3)
+```
+
+#### Deprecation Warnings
+
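+For deprecation warnings, call `warnings.simplefilter("once", DeprecationWarning)` before warning so that each distinct deprecation message is shown only once. Note that this changes the process-wide warning filters, not just those of the calling module: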
+
+```python
+warnings.simplefilter("once", DeprecationWarning)
+warnings.warn(
+    "Feature X is deprecated and will be removed in a future version",
+    DeprecationWarning,
+    stacklevel=3
+)
+```
+
+### Guidelines
+
+1. **Use specific exception types**: Choose the most appropriate exception type for the error condition.
+
+2. **Include context in error messages**: Error messages should include relevant values and context to help users diagnose issues.
+
+3. **Use custom exceptions for CUDA errors**: Use `CUDAError` or `NVRTCError` for CUDA-specific errors rather than generic exceptions.
+
+4. **Specify stacklevel for warnings**: Always include `stacklevel` parameter in `warnings.warn()` calls to point to the actual caller.
+
+5. **Use one-time warnings for repeated operations**: When a warning could be triggered multiple times, use a flag to ensure it's only shown once.
+
+6. **Prefer warnings over errors for recoverable issues**: Use warnings for issues that don't prevent execution but may cause problems.
+
+## Memory Management
+
+### Resource Lifecycle
+
+CUDA memory resources and buffers follow a clear lifecycle pattern:
+
+1. **Creation**: Resources and buffers are created through factory methods or constructors
+2. **Usage**: Objects are used for CUDA operations
+3. **Cleanup**: Resources are explicitly closed or automatically cleaned up
+
+### Explicit Cleanup
+
+Always provide explicit cleanup methods for resources that manage CUDA handles:
+
+```python
+cdef class DeviceMemoryResource:
+    def close(self):
+        """Close the memory resource and release CUDA handles."""
+        DMR_close(self)
+
+    def __dealloc__(self):
+        """Automatic cleanup when object is garbage collected."""
+        DMR_close(self)
+```
+
+### Buffer Lifecycle
+
+Buffers are associated with memory resources and should be closed when no longer needed:
+
+```python
+cdef class Buffer:
+    def close(self, stream: Stream | GraphBuilder | None = None):
+        """Deallocate this buffer asynchronously on the given stream."""
+        Buffer_close(self, stream)
+
+    def __dealloc__(self):
+        """Automatic cleanup if not explicitly closed."""
+        self.close(self._alloc_stream)
+```
+
+### Guidelines
+
+1. **Provide explicit `close()` methods**: All resources managing CUDA handles should have a `close()` method for explicit cleanup.
+
+2. **Implement `__dealloc__` as a safety net**: Use `__dealloc__` to ensure cleanup happens even if users forget to call `close()`, but don't rely on it for normal operation.
+
+3. **Document cleanup behavior**: Clearly document when cleanup happens automatically versus when it must be called explicitly.
+
+4. **Handle cleanup errors gracefully**: Cleanup methods should be idempotent (safe to call multiple times) and handle errors without raising exceptions when possible.
+
+5. **Use stream-ordered deallocation**: When deallocating buffers, use the appropriate stream for asynchronous cleanup to avoid blocking operations.
+
+6. **Track resource ownership**: Clearly document which objects own CUDA handles and are responsible for cleanup.
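+
+Guidelines 1 and 4 can be illustrated with a minimal pure-Python sketch of an idempotent `close()`. The `ManagedHandle` class, its `_release()` hook, and the context-manager support are assumptions made for the example; they are not classes or requirements from this project.
+
+```python
+class ManagedHandle:
+    """Hypothetical owner of a native handle."""
+
+    def __init__(self, handle: int) -> None:
+        self._handle = handle
+        self.release_count = 0  # for demonstration only
+
+    def close(self) -> None:
+        """Release the handle; calling close() again is a no-op."""
+        if self._handle is None:
+            return
+        try:
+            self._release(self._handle)
+        finally:
+            # Clear the handle even if the release failed, so a later
+            # close() or finalizer never releases it twice.
+            self._handle = None
+
+    def _release(self, handle: int) -> None:
+        self.release_count += 1  # a real implementation would call the driver
+
+    def __enter__(self) -> "ManagedHandle":
+        return self
+
+    def __exit__(self, exc_type, exc, tb) -> None:
+        self.close()
+```
+
+Clearing the handle in a `finally` block mirrors the `DMR_close` pattern shown later in this guide.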
+
+## Thread Safety and Concurrency
+
+### Thread-Local Storage
+
+Use `threading.local()` for thread-local state that needs to persist across function calls:
+
+```python
+import threading
+
+_tls = threading.local()
+
+def some_function():
+    if not hasattr(_tls, 'devices'):
+        _tls.devices = []
+    return _tls.devices
+```
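+
+As a quick check of the isolation this provides, the sketch below runs the same accessor from two threads. The `worker()` and `run()` helpers are hypothetical and exist only for the demonstration.
+
+```python
+import threading
+
+_tls = threading.local()
+
+
+def get_devices() -> list:
+    """Return this thread's list, creating it on first use."""
+    if not hasattr(_tls, "devices"):
+        _tls.devices = []
+    return _tls.devices
+
+
+def worker(name: str, seen: dict) -> None:
+    get_devices().append(name)
+    seen[name] = list(get_devices())  # snapshot of this thread's state
+
+
+def run() -> dict:
+    """Start two workers and return what each one saw in its own list."""
+    seen = {}
+    threads = [threading.Thread(target=worker, args=(n, seen)) for n in ("a", "b")]
+    for t in threads:
+        t.start()
+    for t in threads:
+        t.join()
+    return seen
+```
+
+Each worker sees only its own entry, and the main thread's list stays empty.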
+
+### Locks for Shared State
+
+Use `threading.Lock()` to protect shared mutable state:
+
+```python
+import threading
+
+_lock = threading.Lock()
+
+def thread_safe_operation():
+    with _lock:
+        # Critical section
+        # Modify shared state
+        pass
+```
+
+### Combining Locks with `nogil`
+
+When protecting CUDA operations, acquire the lock before entering `nogil` context:
+
+```python
+def thread_safe_cuda_operation():
+    with _lock, nogil:
+        HANDLE_RETURN(cydriver.cuSomeOperation())
+```
+
+### One-Time Initialization
+
+For one-time initialization that must be thread-safe, use a lock with a flag:
+
+```python
+cdef bint _initialized = False
+_lock = threading.Lock()
+
+def initialize():
+    global _initialized
+    with _lock:
+        if not _initialized:
+            # Perform initialization
+            _initialized = True
+```
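+
+A pure-Python version of the same pattern makes the guarantee easy to verify. In this sketch `init_calls` and `run()` are hypothetical additions used only to count how often the setup work executes.
+
+```python
+import threading
+
+_lock = threading.Lock()
+_initialized = False
+init_calls = 0  # how many times the setup work actually ran
+
+
+def initialize() -> None:
+    global _initialized, init_calls
+    with _lock:
+        if not _initialized:
+            init_calls += 1  # stand-in for the real initialization work
+            _initialized = True
+
+
+def run(n_threads: int = 8) -> int:
+    """Call initialize() from several threads and return the setup count."""
+    threads = [threading.Thread(target=initialize) for _ in range(n_threads)]
+    for t in threads:
+        t.start()
+    for t in threads:
+        t.join()
+    return init_calls
+```
+
+Checking the flag inside the lock is what prevents two threads from both observing `_initialized` as false and initializing twice.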
+
+### Guidelines
+
+1. **Use thread-local storage for per-thread state**: When state needs to be isolated per thread, use `threading.local()`.
+
+2. **Protect shared mutable state**: Use locks to protect any shared mutable state that could be accessed from multiple threads.
+
+3. **Minimize lock scope**: Keep critical sections as short as possible to reduce contention.
+
+4. **Document thread safety**: Clearly document which operations are thread-safe and which require external synchronization.
+
+5. **Avoid global mutable state**: Prefer thread-local storage or instance variables over global mutable state when possible.
+
+6. **Combine locks with `nogil` correctly**: Acquire locks before entering `nogil` contexts, not inside them.
+
+## Cython-Specific Features
+
+### Function Declarations
+
+Cython provides three types of function declarations:
+
+1. **`def`**: Python function, callable from Python, slower than C functions
+2. **`cdef`**: C function, not callable from Python, fastest
+3. **`cpdef`**: Hybrid function, callable from both Python and C, faster than `def` but slower than `cdef`
+
+**Guidelines:**
+
+- Use `cdef` for internal helper functions that are only called from Cython code
+- Use `cpdef` when a function needs to be callable from Python but performance is important
+- Use `def` for public Python API functions where flexibility is more important than performance
+
+```python
+# Internal helper - only used in Cython
+cdef inline void Buffer_close(Buffer self, stream):
+    # ...
+
+# Callable from both Python and Cython, performance important
+cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
+    # ...
+
+# Public API - standard Python function
+def allocate(self, size_t size, stream=None) -> Buffer:
+    # ...
+```
+
+### Class Declarations
+
+Use `cdef class` for Cython extension types:
+
+```python
+cdef class Buffer:
+    cdef:
+        uintptr_t _ptr
+        size_t _size
+        MemoryResource _memory_resource
+```
+
+### The `nogil` Context
+
+Use `nogil` to release the Global Interpreter Lock (GIL) for performance-critical C operations. See [CUDA-Specific Patterns](#cuda-specific-patterns) for detailed guidelines.
+
+### Exception Handling
+
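+Use an `except` or `except?` clause so that exceptions raised inside a `cdef` function, including a `nogil` one, propagate to the caller. With `except? value`, returning `value` tells Cython to check whether an exception is actually pending:
+
+```python
+cdef int get_device_from_ctx(...) except?cydriver.CU_DEVICE_INVALID nogil:
+    # Returning CU_DEVICE_INVALID signals a *possible* error; the caller then
+    # checks for a pending exception and re-raises it if there is one.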
+```
+
+### Type Declarations
+
+Declare C types explicitly for performance:
+
+```python
+cdef:
+    int device_id
+    size_t buffer_size
+    cydriver.CUdeviceptr ptr
+```
+
+### Inline Functions
+
+Use `inline` for small, frequently-called functions:
+
+```python
+cdef inline void Buffer_close(Buffer self, stream):
+    # ...
+```
+
+### Guidelines
+
+1. **Choose the right function type**: Use `cdef` for internal code, `cpdef` for performance-critical public APIs, `def` for standard public APIs.
+
+2. **Declare types explicitly**: Use `cdef` declarations for C-level types to enable optimizations.
+
+3. **Use `inline` judiciously**: Mark small, frequently-called functions as `inline`, but avoid overuse.
+
+4. **Handle exceptions properly**: Use appropriate exception clauses (`except`, `except?`) for `nogil` functions.
+
+5. **Document Cython-specific behavior**: When using Cython features that affect the Python API, document them clearly.
+
+## Constants and Magic Numbers
+
+### Naming Constants
+
+Use **UPPER_SNAKE_CASE** for module-level constants:
+
+```python
+LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
+PER_THREAD_DEFAULT_STREAM = C_PER_THREAD_DEFAULT_STREAM
+
+RUNTIME_CUDA_ERROR_EXPLANATIONS = {
+    # ...
+}
+```
+
+### CUDA Constants
+
+For CUDA API constants, use the bindings directly or create aliases with descriptive names:
+
+```python
+from cuda.bindings cimport cydriver
+
+# Use CUDA constants directly
+cdef cydriver.CUdevice device_id = cydriver.CU_DEVICE_INVALID
+
+# Or create descriptive aliases
+cdef object CU_DEVICE_INVALID = cydriver.CU_DEVICE_INVALID
+```
+
+### Avoid Magic Numbers
+
+Replace magic numbers with named constants:
+
+**Avoid:**
+```python
+if flags & 1:  # What does 1 mean?
+    # ...
+```
+
+**Preferred:**
+```python
+if flags & cydriver.CUstream_flags.CU_STREAM_NON_BLOCKING:
+    # ...
+```
+
+### Dictionary Mappings
+
+Use dictionaries to map between string representations and constants:
+
+```python
+_access_flags = {
+    "rw": cydriver.CU_MEM_ACCESS_FLAGS_PROT_READWRITE,
+    "r": cydriver.CU_MEM_ACCESS_FLAGS_PROT_READ,
+    None: 0
+}
+```
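+
+One way to consume such a mapping is to translate the lookup failure into a descriptive `ValueError`. The sketch below assumes plain integer values in place of the `cydriver` constants, and the `access_flag()` helper is hypothetical.
+
+```python
+# Illustrative integer values; real code would use the cydriver constants.
+_ACCESS_FLAGS = {"rw": 3, "r": 1, None: 0}
+
+
+def access_flag(mode):
+    """Translate a user-facing mode into its flag value."""
+    try:
+        return _ACCESS_FLAGS[mode]
+    except KeyError:
+        valid = ", ".join(repr(k) for k in _ACCESS_FLAGS)
+        raise ValueError(f"mode must be one of {valid}, got {mode!r}") from None
+```
+
+Building the list of valid keys from the dictionary keeps the error message in sync when new modes are added.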
+
+### Guidelines
+
+1. **Name all constants**: Avoid magic numbers and strings. Use descriptive constant names.
+
+2. **Use UPPER_SNAKE_CASE**: Follow Python convention for module-level constants.
+
+3. **Group related constants**: Define related constants together, optionally in a dictionary or class.
+
+4. **Document non-obvious constants**: If a constant's purpose isn't immediately clear, add a comment explaining it.
+
+5. **Prefer CUDA bindings**: Use constants from `cuda.bindings` directly when possible rather than redefining them.
+
+## Comments and Inline Documentation
+
+### TODO Comments
+
+Use `TODO` comments to mark incomplete work or future improvements:
+
+```python
+# TODO: It is better to take a stream for later deallocation
+return Buffer._init(ptr, size, mr=mr)
+
+# TODO: consider lowering this to Cython
+expl = DRIVER_CU_RESULT_EXPLANATIONS.get(int(error))
+```
+
+### NOTE Comments
+
+Use `NOTE` comments to explain non-obvious implementation details:
+
+```python
+# NOTE: match this behavior to DeviceMemoryResource.allocate()
+stream = default_stream()
+
+# NOTE: this is referenced in instructions to debug nvbug 5698116
+cpdef DMR_mempool_get_access(DeviceMemoryResource dmr, int device_id):
+```
+
+### Implementation Comments
+
+Add comments to explain complex logic or non-obvious behavior:
+
+```python
+# Must not serialize the parent's stream!
+return Buffer.from_ipc_descriptor, (self.memory_resource, self.get_ipc_descriptor())
+
+# This works around nvbug 5698116. When a memory pool handle is recycled
+# the new handle inherits the peer access state of the previous handle.
+if self._peer_accessible_by:
+    self.peer_accessible_by = []
+```
+
+### Tool Directive Comments
+
+Use inline tool directives, such as linter suppressions, sparingly and only when the underlying issue cannot be fixed in the code:
+
+```python
+import platform  # no-cython-lint
+```
+
+### Guidelines
+
+1. **Use TODO for incomplete work**: Mark known limitations, future improvements, or incomplete features with `TODO` comments.
+
+2. **Use NOTE for important context**: Add `NOTE` comments to explain non-obvious implementation decisions or workarounds.
+
+3. **Explain complex logic**: Add comments to explain why code is written a certain way, not what it does (the code should be self-explanatory).
+
+4. **Keep comments up-to-date**: Update or remove comments when code changes.
+
+5. **Avoid obvious comments**: Don't comment what the code clearly shows. Focus on the "why" rather than the "what".
+
+6. **Document workarounds**: Always document workarounds for bugs (include bug numbers when available) and explain why they're necessary.
+
+## Code Organization Within Files
+
+### Overall Structure
+
+Follow the ordering specified in [File Structure](#file-structure):
+
+1. SPDX copyright header
+2. Import statements
+3. `__all__` declaration
+4. Type aliases and constants (optional)
+5. Principal class/function
+6. Other public classes and functions
+7. Public module functions
+8. Private/implementation functions
+
+### Within Classes
+
+Follow the ordering specified in [Class and Function Definitions](#class-and-function-definitions):
+
+1. Special (dunder) methods (alphabetically sorted)
+2. Methods (alphabetically sorted)
+3. Properties (alphabetically sorted)
+
+### Helper Functions
+
+Move complex implementation details to helper functions at the end of the file. See [Class and Function Definitions - Helper Functions](#helper-functions) for details.
+
+### Type Aliases and Constants
+
+Type aliases and module-level constants should be defined after `__all__` (if present) or after imports, before classes. See [File Structure](#file-structure) for the complete ordering.
+
+```python
+DevicePointerT = driver.CUdeviceptr | int | None
+"""Type union for device pointer representations."""
+
+LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
+```
+
+### Guidelines
+
+1. **Follow the established ordering**: Maintain consistency with the file structure and class definition ordering rules.
+
+2. **Group related code**: Keep related functions and classes together.
+
+3. **Separate public and private**: Clearly separate public API from implementation details.
+
+4. **Use helper functions**: Extract complex logic into helper functions to improve readability.
+
+5. **Keep related code close**: Place helper functions near the code that uses them, or group all helpers at the end of the file.
+
+## Performance Considerations
+
+### Use Cython Types
+
+Declare C types explicitly for performance-critical code:
+
+```python
+cdef:
+    int device_id
+    size_t buffer_size
+    cydriver.CUdeviceptr ptr
+```
+
+### Prefer `cdef` for Internal Functions
+
+Use `cdef` functions for internal operations that don't need to be callable from Python:
+
+```python
+cdef inline void Buffer_close(Buffer self, stream):
+    # Fast C-level function
+```
+
+### Release GIL for CUDA Operations
+
+Always release the GIL when calling CUDA driver APIs. See [CUDA-Specific Patterns](#cuda-specific-patterns) for details.
+
+### Minimize Python Object Creation
+
+Avoid creating Python objects in hot paths:
+
+```python
+# Avoid: Creates Python list
+result = []
+for i in range(n):
+    result.append(i)
+
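+# Preferred: Fill a C array (requires `from libc.stdlib cimport malloc, free`)
+cdef int* c_result = <int*>malloc(n * sizeof(int))
+if c_result == NULL:
+    raise MemoryError()
+# ... fill and use c_result, then release it with free(c_result) ...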
+```
+
+### Use `inline` for Small Functions
+
+Mark small, frequently-called functions as `inline`:
+
+```python
+cdef inline int get_device_id(DeviceMemoryResource mr):
+    return mr._dev_id
+```
+
+### Avoid Unnecessary Type Conversions
+
+Minimize conversions between C and Python types:
+
+```python
+# Avoid: Unnecessary conversion
+cdef int size = int(self._size)
+
+# Preferred: Use C type directly
+cdef size_t size = self._size
+```
+
+### Guidelines
+
+1. **Profile before optimizing**: Don't optimize prematurely. Use profiling to identify actual bottlenecks.
+
+2. **Use C types in hot paths**: Declare C types (`cdef`) for variables used in performance-critical loops.
+
+3. **Release GIL appropriately**: Always release GIL for CUDA operations, but be careful about Python object access.
+
+4. **Minimize Python overhead**: Avoid Python object creation, method calls, and attribute access in hot paths.
+
+5. **Use `inline` judiciously**: Mark small, frequently-called functions as `inline`, but don't overuse (compiler may ignore if function is too large).
+
+6. **Cache expensive lookups**: Cache results of expensive operations (e.g., dictionary lookups, attribute access) when used repeatedly.
+
+## API Design Principles
+
+### Public vs Private API
+
+Use naming conventions to distinguish public and private APIs:
+
+- **Public API**: No leading underscore, documented in docstrings, included in `__all__`
+- **Private API**: Leading underscore (`_`), may have minimal documentation, not in `__all__`
+
+```python
+__all__ = ['Buffer', 'MemoryResource']  # Public API
+
+# Public API
+cdef class Buffer:
+    def allocate(self):  # Public method
+        # ...
+
+# Private API
+cdef inline void Buffer_close(Buffer self, stream):  # Private helper
+    # ...
+```
+
+### Backward Compatibility
+
+Maintain backward compatibility when possible:
+
+- **Deprecation warnings**: Use `DeprecationWarning` for APIs that will be removed
+- **Gradual migration**: Provide both old and new APIs during transition periods
+- **Version documentation**: Document when APIs were introduced or deprecated
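+
+A gradual migration can be as small as keeping the old name as a thin wrapper that warns and forwards. The names `old_name()` and `new_name()` below are hypothetical placeholders.
+
+```python
+import warnings
+
+
+def new_name(x: int) -> int:
+    """Return x doubled (stand-in for the replacement API)."""
+    return 2 * x
+
+
+def old_name(x: int) -> int:
+    """Deprecated alias for new_name()."""
+    warnings.warn(
+        "old_name() is deprecated; use new_name() instead",
+        DeprecationWarning,
+        stacklevel=2,  # attribute the warning to the caller of old_name()
+    )
+    return new_name(x)
+```
+
+Because the wrapper forwards to the new function, both names stay behaviorally identical for the length of the transition period.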
+
+### Consistency
+
+Maintain consistency across the API:
+
+- **Naming patterns**: Use consistent naming patterns (e.g., `from_*` for factory methods)
+- **Parameter ordering**: Use consistent parameter ordering across similar functions
+- **Return types**: Use consistent return types for similar operations
+
+### Factory Methods
+
+Use class methods or static methods for factory functions:
+
+```python
+@classmethod
+def from_ipc_descriptor(cls, mr, ipc_descriptor, stream=None) -> Buffer:
+    """Factory method to create Buffer from IPC descriptor."""
+    # ...
+
+@staticmethod
+def from_handle(ptr, size, mr=None) -> Buffer:
+    """Factory method to create Buffer from handle."""
+    # ...
+```
+
+### Error Handling
+
+Design APIs to fail fast with clear error messages:
+
+- **Validate inputs early**: Check parameters at the start of functions
+- **Use appropriate exceptions**: Raise specific exception types for different error conditions
+- **Provide context**: Include relevant values and context in error messages
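+
+The three points above can be combined in a single function. This is a minimal sketch using built-in byte buffers; `copy_prefix()` is a hypothetical helper, not an API from this project.
+
+```python
+def copy_prefix(src: bytes, dst: bytearray, size: int) -> None:
+    """Copy the first `size` bytes of `src` into `dst`."""
+    # Validate everything before touching dst, so a failure leaves it unchanged.
+    if not isinstance(size, int):
+        raise TypeError(f"size must be an int, got {type(size).__name__}")
+    if size < 0:
+        raise ValueError(f"size must be non-negative, got {size}")
+    if size > len(src) or size > len(dst):
+        raise ValueError(
+            "size exceeds buffer length "
+            f"(size={size}, src={len(src)}, dst={len(dst)})"
+        )
+    dst[:size] = src[:size]
+```
+
+Doing all checks up front means callers never observe a partially completed operation.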
+
+### Guidelines
+
+1. **Minimize public API surface**: Keep the public API small and focused. Use private helpers for implementation details.
+
+2. **Document public APIs**: All public APIs must have complete docstrings following the [Docstrings](#docstrings) guidelines.
+
+3. **Use `__all__` explicitly**: List all public symbols in `__all__` to clearly define the module's public API.
+
+4. **Design for extensibility**: Consider future needs when designing APIs, but don't over-engineer.
+
+5. **Follow Python conventions**: Adhere to Python naming and design conventions (PEP 8, PEP 20).
+
+6. **Provide clear error messages**: When APIs fail, provide error messages that help users understand and fix the problem.
+
+7. **Use type hints**: Provide type annotations for all public APIs to improve IDE support and documentation.
+
+## CUDA-Specific Patterns
+
+### GIL Management for CUDA Driver API Calls
+
+**Always release the Global Interpreter Lock (GIL) when calling CUDA driver API functions.** Doing so lets other Python threads run while a driver call is in progress, which matters most for calls that block.
+
+#### Using `with nogil:` Blocks
+
+Wrap CUDA driver API calls in `with nogil:` blocks:
+
+```python
+cdef cydriver.CUstream s
+with nogil:
+    HANDLE_RETURN(cydriver.cuStreamCreateWithPriority(&s, flags, prio))
+self._handle = s
+```
+
+For multiple driver calls, group them in a single `with nogil:` block:
+
+```python
+cdef int high, low
+cdef cydriver.CUstream s
+with nogil:
+    HANDLE_RETURN(cydriver.cuCtxGetStreamPriorityRange(&high, &low))
+    HANDLE_RETURN(cydriver.cuStreamCreateWithPriority(&s, flags, prio))
+```
+
+#### Function-Level `nogil` Declaration
+
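+For functions that primarily call CUDA driver APIs, declare the function `nogil`. The declaration states that the function is safe to call without the GIL; it does not by itself release a GIL held by the caller: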
+
+```python
+cdef int get_device_from_ctx(
+        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
+    """Get device ID from the given ctx."""
+    cdef bint switch_context = (curr_ctx != target_ctx)
+    cdef cydriver.CUcontext ctx
+    cdef cydriver.CUdevice target_dev
+    with nogil:
+        if switch_context:
+            HANDLE_RETURN(cydriver.cuCtxPopCurrent(&ctx))
+            HANDLE_RETURN(cydriver.cuCtxPushCurrent(target_ctx))
+        HANDLE_RETURN(cydriver.cuCtxGetDevice(&target_dev))
+        if switch_context:
+            HANDLE_RETURN(cydriver.cuCtxPopCurrent(&ctx))
+            HANDLE_RETURN(cydriver.cuCtxPushCurrent(curr_ctx))
+    return target_dev
+```
+
+#### Raising Exceptions from `nogil` Context
+
+When raising exceptions from a `nogil` context, acquire the GIL first using `with gil:`:
+
+```python
+cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
+    if error == cydriver.CUresult.CUDA_SUCCESS:
+        return 0
+    cdef const char* name
+    name_err = cydriver.cuGetErrorName(error, &name)
+    if name_err != cydriver.CUresult.CUDA_SUCCESS:
+        with gil:
+            raise CUDAError(f"UNEXPECTED ERROR CODE: {error}")
+    with gil:
+        expl = DRIVER_CU_RESULT_EXPLANATIONS.get(int(error))
+        if expl is not None:
+            raise CUDAError(f"{name.decode()}: {expl}")
+    # ... rest of error handling ...
+```
+
+#### Guidelines
+
+1. **Always use `with nogil:` for CUDA driver calls**: Every call to `cydriver.*` functions should be within a `with nogil:` block.
+
+2. **Use `HANDLE_RETURN` within `nogil` blocks**: The `HANDLE_RETURN` macro is designed to work in `nogil` contexts.
+
+3. **Acquire GIL before raising exceptions**: When raising Python exceptions from a `nogil` context, use `with gil:` to acquire the GIL first.
+
+4. **Group related driver calls**: If multiple driver calls are made sequentially, group them in a single `with nogil:` block for efficiency.
+
+5. **Declare functions `nogil` when appropriate**: Functions that primarily call CUDA driver APIs and don't need Python object access should be declared `nogil` at the function level.
+
+### Example
+
+```python
+cdef inline void DMR_close(DeviceMemoryResource self):
+    if self._handle == NULL:
+        return
+
+    try:
+        if self._mempool_owned:
+            with nogil:
+                HANDLE_RETURN(cydriver.cuMemPoolDestroy(self._handle))
+    finally:
+        self._dev_id = cydriver.CU_DEVICE_INVALID
+        self._handle = NULL
+        # ... cleanup ...
+```
+
+## Copyright and Licensing
+
+All source files in `cuda/core/experimental` must include a copyright header at the top of the file using the SPDX format.
+
+### Required Header Format
+
+Every `.py`, `.pyx`, and `.pxd` file must begin with the following header:
+
+```python
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+```
+
+### Guidelines
+
+1. **Placement**: The copyright header must be the first lines of the file, before any imports or other code.
+
+2. **Separator Lines**: Include an empty comment line (`#`) between the copyright notice and the license identifier, and a blank line after the license identifier before the code begins.
+
+3. **Year Range**:
+   - The beginning year reflects the year the file was first added to the repository.
+   - The end year reflects the most recent year in which the file was modified.
+   - For new files, use the current year alone (e.g., `2025`); it becomes a range once the file is modified in a later year.
+   - Update the end year when making modifications to existing files.
+
+4. **Consistency**: All files must use the same copyright text and license identifier (`Apache-2.0`).
+
+5. **SPDX Format**: The header uses the SPDX (Software Package Data Exchange) format, which is a standard way to communicate license and copyright information.
+
+### Example
+
+```python
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+# ... rest of the file ...
+```

From 61994b2e4a2caadf22ab6271693efca9cf983985 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 4 Dec 2025 16:34:49 -0800
Subject: [PATCH 02/17] Move style guide to cuda_core root directory

---
 cuda_core/{cuda/core => }/style-guide.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename cuda_core/{cuda/core => }/style-guide.md (100%)

diff --git a/cuda_core/cuda/core/style-guide.md b/cuda_core/style-guide.md
similarity index 100%
rename from cuda_core/cuda/core/style-guide.md
rename to cuda_core/style-guide.md

From 9769d11e7667bfa2b823d50ea372b6c549093fe8 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 4 Dec 2025 16:40:50 -0800
Subject: [PATCH 03/17] Add Development Lifecycle section to style guide

Document the two-phase development approach:
- Phase 1: Start with Python driver implementation and tests
- Phase 2: Optimize by switching to cydriver with nogil blocks

Includes step-by-step conversion guide and before/after examples.
---
 cuda_core/style-guide.md | 164 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 163 insertions(+), 1 deletion(-)

diff --git a/cuda_core/style-guide.md b/cuda_core/style-guide.md
index f9ea81abb3..5d1c7e5206 100644
--- a/cuda_core/style-guide.md
+++ b/cuda_core/style-guide.md
@@ -23,7 +23,8 @@ This style guide defines conventions for Python and Cython code in `cuda/core/ex
 15. [Performance Considerations](#performance-considerations)
 16. [API Design Principles](#api-design-principles)
 17. [CUDA-Specific Patterns](#cuda-specific-patterns)
-18. [Copyright and Licensing](#copyright-and-licensing)
+18. [Development Lifecycle](#development-lifecycle)
+19. [Copyright and Licensing](#copyright-and-licensing)
 
 ---
 
@@ -1777,6 +1778,167 @@ cdef inline void DMR_close(DeviceMemoryResource self):
         # ... cleanup ...
 ```
 
+## Development Lifecycle
+
+### Two-Phase Development Approach
+
+When implementing new CUDA functionality, follow a two-phase development approach:
+
+1. **Phase 1: Python Implementation with Tests**
+   - Start with an implementation that calls the Python-level `driver` bindings
+   - Write comprehensive tests to verify correctness
+   - Ensure all tests pass before proceeding to Phase 2
+
+2. **Phase 2: Cythonization for Performance**
+   - After tests are passing, optimize by switching to `cydriver`
+   - Add `with nogil:` blocks around CUDA driver API calls
+   - Use `HANDLE_RETURN` macro for error handling
+   - Verify tests still pass after optimization
+
+### Phase 1: Initial Python Implementation
+
+Begin with a straightforward Python implementation using the `driver` module from `cuda.core.experimental._utils.cuda_utils`:
+
+```python
+from cuda.core.experimental._utils.cuda_utils import driver
+from cuda.core.experimental._utils.cuda_utils cimport (
+    _check_driver_error as raise_if_driver_error,
+)
+
+def copy_to(self, dst: Buffer | None = None, *, stream: Stream | GraphBuilder) -> Buffer:
+    stream = Stream_accept(stream)
+    cdef size_t src_size = self._size
+
+    # ... validation logic ...
+
+    err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
+    raise_if_driver_error(err)
+    return dst
+```
+
+**Benefits of starting with Python:**
+- Faster iteration during development
+- Easier debugging with Python stack traces
+- Simpler error handling
+- Focus on correctness before optimization
+
+### Phase 2: Cythonization Process
+
+Once tests are passing, optimize the implementation by:
+
+1. **Switching to `cydriver`**: Replace `driver` module calls with direct `cydriver` calls
+2. **Adding `with nogil:` blocks**: Wrap CUDA driver API calls to release the GIL
+3. **Using `HANDLE_RETURN`**: Replace `raise_if_driver_error()` with the `HANDLE_RETURN` macro
+4. **Casting stream handles**: Access the C-level stream handle for `cydriver` calls
+
+#### Step-by-Step Conversion
+
+**Step 1: Update imports**
+
+```python
+# Remove Python driver import
+# from cuda.core.experimental._utils.cuda_utils import driver
+
+# Add cydriver cimport
+from cuda.bindings cimport cydriver
+
+# Add HANDLE_RETURN
+from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+```
+
+**Step 2: Cast stream and extract C-level handle**
+
+```python
+stream = Stream_accept(stream)
+cdef Stream s_stream = <Stream>stream
+cdef cydriver.CUstream s = s_stream._handle
+```
+
+**Step 3: Wrap driver calls in `with nogil:` and use `HANDLE_RETURN`**
+
+```python
+# Before (Python driver):
+err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
+raise_if_driver_error(err)
+
+# After (cydriver):
+with nogil:
+    HANDLE_RETURN(cydriver.cuMemcpyAsync(
+        <cydriver.CUdeviceptr>dst._ptr,
+        <cydriver.CUdeviceptr>self._ptr,
+        src_size,
+        s
+    ))
+```
+
+**Step 4: Cast pointers to `cydriver.CUdeviceptr`**
+
+All device pointers passed to `cydriver` functions must be cast to `cydriver.CUdeviceptr`:
+
+```python
+<cydriver.CUdeviceptr>self._ptr
+```
+
+### Complete Example: Before and After
+
+**Before (Python driver implementation):**
+
+```python
+from cuda.core.experimental._utils.cuda_utils import driver
+from cuda.core.experimental._utils.cuda_utils cimport (
+    _check_driver_error as raise_if_driver_error,
+)
+
+def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
+    stream = Stream_accept(stream)
+    cdef size_t buffer_size = self._size
+    cdef unsigned char c_value8
+    
+    # Validation...
+    if width == 1:
+        c_value8 = <unsigned char>value
+        N = buffer_size
+        err, = driver.cuMemsetD8Async(self._ptr, c_value8, N, stream.handle)
+        raise_if_driver_error(err)
+```
+
+**After (Cythonized with cydriver):**
+
+```python
+from cuda.bindings cimport cydriver
+from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+
+def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
+    stream = Stream_accept(stream)
+    cdef Stream s_stream = <Stream>stream
+    cdef cydriver.CUstream s = s_stream._handle
+    cdef size_t buffer_size = self._size
+    cdef unsigned char c_value8
+    
+    # Validation...
+    if width == 1:
+        c_value8 = <unsigned char>value
+        N = buffer_size
+        with nogil:
+            HANDLE_RETURN(cydriver.cuMemsetD8Async(
+                <cydriver.CUdeviceptr>self._ptr, c_value8, N, s
+            ))
+```
+
+### Guidelines
+
+1. **Write tests before optimizing**: Implement comprehensive tests during Phase 1, before any Cythonization. This establishes correctness before performance work begins.
+
+2. **Verify tests after optimization**: After converting to `cydriver`, run all tests to ensure behavior is unchanged.
+
+3. **Don't skip Phase 1**: Even if you're confident about the implementation, starting with Python helps catch logic errors early.
+
+4. **Performance benefits**: The Cythonized version calls the driver through `cydriver` at the C level, avoiding the Python-level call overhead of the `driver` module, and releases the GIL for the duration of the call so that other Python threads can run.
+
+5. **Consistent pattern**: Follow this pattern for all new CUDA driver API wrappers to maintain consistency across the codebase.
+
+6. **Error handling**: The `HANDLE_RETURN` macro can be called inside `with nogil:` blocks. When a driver call returns an error code, it raises the corresponding exception.
+
 ## Copyright and Licensing
 
 All source files in `cuda/core/experimental` must include a copyright header at the top of the file using the SPDX format.

From 3756e2a5c42d03411a979bc688313091b02cc755 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 10:31:23 -0800
Subject: [PATCH 04/17] Move and rename style guide to
 cuda_core/docs/developer-guide.md

---
 cuda_core/{style-guide.md => docs/developer-guide.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename cuda_core/{style-guide.md => docs/developer-guide.md} (100%)

diff --git a/cuda_core/style-guide.md b/cuda_core/docs/developer-guide.md
similarity index 100%
rename from cuda_core/style-guide.md
rename to cuda_core/docs/developer-guide.md

From b7cce0100fc8dd926d42badcad95f5962179a161 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 11:14:20 -0800
Subject: [PATCH 05/17] Refine developer guide: relax ordering rules and
 simplify SPDX guidance

- Change ordering rules from strict requirements to suggestions
- Prefer logical ordering, with alphabetical as fallback
- Make __init__/__cinit__ first by convention in dunder methods
- Keep import ordering strict (enforced by ruff linter)
- Simplify SPDX header guidance to reference existing patterns
- Use CamelCase terminology for class names
---
 cuda_core/docs/developer-guide.md | 111 ++++++++++--------------------
 1 file changed, 36 insertions(+), 75 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 5d1c7e5206..137738e153 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -30,11 +30,11 @@ This style guide defines conventions for Python and Cython code in `cuda/core/ex
 
 ## File Structure
 
-Files in `cuda/core/experimental` must follow a consistent structure. The ordering of elements within a file is as follows:
+Files in `cuda/core/experimental` should follow a consistent structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
 
 ### 1. SPDX Copyright Header
 
-The file must begin with the SPDX copyright header as specified in [Copyright and Licensing](#copyright-and-licensing).
+The file must begin with an SPDX copyright header. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.
 
 ### 2. Import Statements
 
@@ -63,7 +63,7 @@ LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
 
 If the file principally implements a single class or function (e.g., `_buffer.pyx` defines the `Buffer` class, `_device.pyx` defines the `Device` class), that principal class or function should come next, immediately after `__all__` (if present).
 
-**The principal class or function is an exception to alphabetical ordering** and appears first in its section.
+The principal class or function typically appears first in its section.
 
 ### 6. Other Public Classes and Functions
 
@@ -73,9 +73,9 @@ Following the principal class or function, define other public classes and funct
 - **Abstract base classes**: ABCs that define interfaces (e.g., `MemoryResource` in `_buffer.pyx`)
 - **Other public classes**: Additional classes exported by the module
 
-**All classes and functions in this section should be sorted alphabetically by name**, regardless of their relationship to the principal class. The principal class appears first as an exception to this rule.
+Consider organizing classes and functions logically—for example, by grouping related functionality or by order of typical usage. When no clear logical ordering exists, alphabetical ordering can help with discoverability.
 
-**Example:** In `_device_memory_resource.pyx`, `DeviceMemoryResource` is the principal class and appears first. Then `DeviceMemoryResourceOptions` appears after it (alphabetically after the principal class), even though it's an auxiliary/options class.
+**Example:** In `_device_memory_resource.pyx`, `DeviceMemoryResource` is the principal class and appears first, followed by `DeviceMemoryResourceOptions` (its options class).
 
 ### 7. Public Module Functions
 
@@ -92,9 +92,7 @@ Finally, define private functions and implementation details. These include:
 ### Example Structure
 
 ```python
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#
-# SPDX-License-Identifier: Apache-2.0
+# <SPDX copyright header>
 
 # Imports (cimports first, then regular imports)
 from libc.stdint cimport uintptr_t
@@ -133,7 +131,7 @@ cdef inline void Buffer_close(Buffer self, stream):
 - Not every file will have all sections. For example, a utility module may not have a principal class.
 - The distinction between "principal" and "other" classes is based on the file's primary purpose. If a file exists primarily to define one class, that class is the principal class.
 - Private implementation functions should be placed at the end of the file to keep the public API visible at the top.
-- **Within each section**, classes and functions should be sorted alphabetically by name. The principal class or function is an exception to this rule, as it appears first in its respective section.
+- **Within each section**, prefer logical ordering (e.g., by functionality or typical usage). Alphabetical ordering is a reasonable fallback when no clear logical structure exists.
 
 ## Package Layout
 
@@ -269,8 +267,9 @@ This pattern allows:
 
 ## Import Statements
 
-Import statements must be organized into five groups, in the following order:
-**Note**: Within each section, imports should be sorted alphabetically.
+Import statements must be organized into five groups, in the following order.
+
+**Note**: Within each group, imports must be sorted alphabetically. This is enforced by pre-commit linters (`ruff`).
 
 ### 1. `__future__` Imports
 
@@ -339,7 +338,7 @@ from cuda.core.experimental._utils.cuda_utils import (
 
 ### Additional Rules
 
-1. **Alphabetical Ordering**: Within each group, imports should be sorted alphabetically by module name.
+1. **Alphabetical Ordering**: Within each group, imports must be sorted alphabetically by module name. This is enforced by pre-commit linters.
 
 2. **Multi-line Imports**: When importing multiple items from a single module, use parentheses for multi-line formatting:
    ```python
@@ -358,9 +357,7 @@ from cuda.core.experimental._utils.cuda_utils import (
 ### Example
 
 ```python
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#
-# SPDX-License-Identifier: Apache-2.0
+# <SPDX copyright header>
 
 # 1. __future__ imports
 from __future__ import annotations
@@ -389,15 +386,15 @@ from cuda.core.experimental._utils.cuda_utils import driver
 
 ### Class Definition Order
 
-Within a class definition, elements must be organized in the following order:
+Within a class definition, the suggested organization is:
 
-1. **Special (dunder) methods**: Methods with names starting and ending with double underscores (e.g., `__init__`, `__cinit__`, `__dealloc__`, `__reduce__`, `__dlpack__`)
+1. **Special (dunder) methods**: Methods with names starting and ending with double underscores. By convention, `__init__` (or `__cinit__` in Cython) should be first among dunder methods, as it defines the class interface.
 
 2. **Methods**: Regular instance methods, class methods (`@classmethod`), and static methods (`@staticmethod`)
 
 3. **Properties**: Properties defined with `@property` decorator
 
-**Note**: Within each section (dunder methods, methods, properties), elements should be sorted alphabetically by name.
+**Note**: Within each section, prefer logical ordering (e.g., grouping related methods). Alphabetical ordering is acceptable when no clear logical structure exists. Developers should use their judgment.
 
 ### Example
 
@@ -405,15 +402,19 @@ Within a class definition, elements must be organized in the following order:
 cdef class Buffer:
     """Example class demonstrating the ordering."""
 
-    # 1. Special (dunder) methods (alphabetically sorted)
-    def __buffer__(self, flags: int, /) -> memoryview:
-        """Buffer protocol support."""
-        # ...
-
+    # 1. Special (dunder) methods (__cinit__/__init__ first by convention)
     def __cinit__(self):
         """Cython initialization."""
         # ...
 
+    def __init__(self, *args, **kwargs):
+        """Python initialization."""
+        # ...
+
+    def __buffer__(self, flags: int, /) -> memoryview:
+        """Buffer protocol support."""
+        # ...
+
     def __dealloc__(self):
         """Cleanup."""
         # ...
@@ -422,15 +423,11 @@ cdef class Buffer:
         """DLPack protocol support."""
         # ...
 
-    def __init__(self, *args, **kwargs):
-        """Python initialization."""
-        # ...
-
     def __reduce__(self):
         """Pickle support."""
         # ...
 
-    # 2. Methods (alphabetically sorted)
+    # 2. Methods
     def close(self, stream=None):
         """Close the buffer."""
         # ...
@@ -452,7 +449,7 @@ cdef class Buffer:
         """Get IPC descriptor."""
         # ...
 
-    # 3. Properties (alphabetically sorted)
+    # 3. Properties
     @property
     def device_id(self) -> int:
         """Device ID property."""
@@ -499,13 +496,13 @@ cdef inline DMR_close(DeviceMemoryResource self):
 
 ### Function Definitions
 
-For module-level functions (outside of classes), follow the ordering specified in [File Structure](#file-structure): principal functions first (if applicable), then other public functions, then private functions. Within each group, sort alphabetically.
+For module-level functions (outside of classes), follow the ordering specified in [File Structure](#file-structure): principal functions first (if applicable), then other public functions, then private functions. Within each group, prefer logical ordering; alphabetical ordering is a reasonable fallback.
 
 ## Naming Conventions
 
 ### Class Names
 
-Use **PascalCase** (also known as CapWords) for class names.
+Use **CamelCase** for class names.
 
 ```python
 cdef class Buffer:
@@ -1497,9 +1494,9 @@ Follow the ordering specified in [File Structure](#file-structure):
 
 Follow the ordering specified in [Class and Function Definitions](#class-and-function-definitions):
 
-1. Special (dunder) methods (alphabetically sorted)
-2. Methods (alphabetically sorted)
-3. Properties (alphabetically sorted)
+1. Special (dunder) methods (`__init__`/`__cinit__` first by convention)
+2. Methods
+3. Properties
 
 ### Helper Functions
 
@@ -1808,9 +1805,9 @@ from cuda.core.experimental._utils.cuda_utils cimport (
 def copy_to(self, dst: Buffer = None, *, stream: Stream | GraphBuilder) -> Buffer:
     stream = Stream_accept(stream)
     cdef size_t src_size = self._size
-    
+
     # ... validation logic ...
-    
+
     err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
     raise_if_driver_error(err)
     return dst
@@ -1893,7 +1890,7 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
     stream = Stream_accept(stream)
     cdef size_t buffer_size = self._size
     cdef unsigned char c_value8
-    
+
     # Validation...
     if width == 1:
         c_value8 = <unsigned char>value
@@ -1914,7 +1911,7 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
     cdef cydriver.CUstream s = s_stream._handle
     cdef size_t buffer_size = self._size
     cdef unsigned char c_value8
-    
+
     # Validation...
     if width == 1:
         c_value8 = <unsigned char>value
@@ -1941,40 +1938,4 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
 
 ## Copyright and Licensing
 
-All source files in `cuda/core/experimental` must include a copyright header at the top of the file using the SPDX format.
-
-### Required Header Format
-
-Every `.py`, `.pyx`, and `.pxd` file must begin with the following header:
-
-```python
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#
-# SPDX-License-Identifier: Apache-2.0
-```
-
-### Guidelines
-
-1. **Placement**: The copyright header must be the first lines of the file, before any imports or other code.
-
-2. **Blank Lines**: Include a blank line between the copyright notice and the license identifier, and another blank line after the license identifier before the code begins.
-
-3. **Year Range**:
-   - The beginning year reflects the year the file was first added to the repository.
-   - The end year reflects the most recent year in which the file was modified.
-   - For new files, use a single year (e.g., `2025`) or the current year range if created mid-year.
-   - Update the end year when making modifications to existing files.
-
-4. **Consistency**: All files must use the same copyright text and license identifier (`Apache-2.0`).
-
-5. **SPDX Format**: The header uses the SPDX (Software Package Data Exchange) format, which is a standard way to communicate license and copyright information.
-
-### Example
-
-```python
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#
-# SPDX-License-Identifier: Apache-2.0
-
-# ... rest of the file ...
-```
+All source files in `cuda/core/experimental` must include a copyright header at the top of the file using the SPDX format. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.

From ed8a8457165aa036b7f8e78bacee6e9ca95a6d98 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 11:22:22 -0800
Subject: [PATCH 06/17] Refine developer guide: remove PEP 8 duplicates and
 update namespace

- Rename title from "Style Guide" to "Developer Guide"
- Remove naming convention rules that duplicate PEP 8
- Keep Cython-specific c_ prefix guidance
- Remove generic comment advice covered by PEP 8
- Simplify constants and API design guidelines
- Replace cuda/core/experimental with cuda/core throughout
- Update all cuda.core.experimental imports to cuda.core
---
 cuda_core/docs/developer-guide.md | 176 +++++++-----------------------
 1 file changed, 40 insertions(+), 136 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 137738e153..f8f929f01b 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -1,8 +1,8 @@
-# CUDA Core Style Guide
+# CUDA Core Developer Guide
 
-This style guide defines conventions for Python and Cython code in `cuda/core/experimental`.
+This guide defines conventions for Python and Cython code in `cuda/core`.
 
-**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide.** The rules in this document highlight project-specific conventions and extensions beyond PEP 8, particularly for Cython code and the structure of this codebase.
+**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide.** The conventions in this document extend PEP 8 with project-specific patterns, particularly for Cython code and the structure of this codebase. Standard PEP 8 conventions (naming, whitespace, etc.) are not repeated here.
 
 ## Table of Contents
 
@@ -30,7 +30,7 @@ This style guide defines conventions for Python and Cython code in `cuda/core/ex
 
 ## File Structure
 
-Files in `cuda/core/experimental` should follow a consistent structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
+Files in `cuda/core` should follow a consistent structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
 
 ### 1. SPDX Copyright Header
 
@@ -96,7 +96,7 @@ Finally, define private functions and implementation details. These include:
 
 # Imports (cimports first, then regular imports)
 from libc.stdint cimport uintptr_t
-from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
+from cuda.core._memory._device_memory_resource cimport DeviceMemoryResource
 import abc
 
 __all__ = ['Buffer', 'MemoryResource', 'some_public_function']
@@ -137,7 +137,7 @@ cdef inline void Buffer_close(Buffer self, stream):
 
 ### File Types
 
-The `cuda/core/experimental` package uses three types of files:
+The `cuda/core` package uses three types of files:
 
 1. **`.pyx` files**: Cython implementation files containing the actual code
 2. **`.pxd` files**: Cython declaration files containing type definitions and function signatures for C-level access
@@ -188,7 +188,7 @@ cdef class Buffer:
 
 #### Simple Top-Level Modules
 
-For simple modules at the `cuda/core/experimental` level, define classes and functions directly in the module file with an `__all__` list:
+For simple modules at the `cuda/core` level, define classes and functions directly in the module file with an `__all__` list:
 
 ```python
 # _device.pyx
@@ -298,12 +298,12 @@ from cuda.bindings cimport cydriver
 
 ### 3. cuda-core `cimport` Statements
 
-Cython imports from within the `cuda.core.experimental` package.
+Cython imports from within the `cuda.core` package.
 
 ```python
-from cuda.core.experimental._memory._buffer cimport Buffer, MemoryResource
-from cuda.core.experimental._stream cimport Stream_accept, Stream
-from cuda.core.experimental._utils.cuda_utils cimport (
+from cuda.core._memory._buffer cimport Buffer, MemoryResource
+from cuda.core._stream cimport Stream_accept, Stream
+from cuda.core._utils.cuda_utils cimport (
     HANDLE_RETURN,
     check_or_create_options,
 )
@@ -324,12 +324,12 @@ from dataclasses import dataclass
 
 ### 5. cuda-core `import` Statements
 
-Regular Python imports from within the `cuda.core.experimental` package.
+Regular Python imports from within the `cuda.core` package.
 
 ```python
-from cuda.core.experimental._context import Context, ContextOptions
-from cuda.core.experimental._dlpack import DLDeviceType, make_py_capsule
-from cuda.core.experimental._utils.cuda_utils import (
+from cuda.core._context import Context, ContextOptions
+from cuda.core._dlpack import DLDeviceType, make_py_capsule
+from cuda.core._utils.cuda_utils import (
     CUDAError,
     driver,
     handle_return,
@@ -342,7 +342,7 @@ from cuda.core.experimental._utils.cuda_utils import (
 
 2. **Multi-line Imports**: When importing multiple items from a single module, use parentheses for multi-line formatting:
    ```python
-   from cuda.core.experimental._utils.cuda_utils cimport (
+   from cuda.core._utils.cuda_utils cimport (
        HANDLE_RETURN,
        check_or_create_options,
    )
@@ -369,17 +369,17 @@ from libc.stdlib cimport malloc, free
 from cuda.bindings cimport cydriver
 
 # 3. cuda-core cimports
-from cuda.core.experimental._memory._buffer cimport Buffer, MemoryResource
-from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+from cuda.core._memory._buffer cimport Buffer, MemoryResource
+from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
 # 4. External imports
 import abc
 from dataclasses import dataclass
 
 # 5. cuda-core imports
-from cuda.core.experimental._context import Context
-from cuda.core.experimental._device import Device
-from cuda.core.experimental._utils.cuda_utils import driver
+from cuda.core._context import Context
+from cuda.core._device import Device
+from cuda.core._utils.cuda_utils import driver
 ```
 
 ## Class and Function Definitions
@@ -500,62 +500,9 @@ For module-level functions (outside of classes), follow the ordering specified i
 
 ## Naming Conventions
 
-### Class Names
+Follow PEP 8 naming conventions (CamelCase for classes, snake_case for functions/variables, UPPER_SNAKE_CASE for constants, leading underscore for private names).
 
-Use **CamelCase** for class names.
-
-```python
-cdef class Buffer:
-    # ...
-
-cdef class DeviceMemoryResource:
-    # ...
-
-class CUDAError(Exception):
-    # ...
-```
-
-### Function and Method Names
-
-Use **snake_case** for function and method names.
-
-```python
-def allocate(self, size_t size, stream=None) -> Buffer:
-    # ...
-
-def get_ipc_descriptor(self) -> IPCBufferDescriptor:
-    # ...
-
-cdef inline void Buffer_close(Buffer self, stream):
-    # ...
-```
-
-### Variable Names
-
-#### Python Variables
-
-Use **snake_case** for Python variables.
-
-```python
-device_id = 0
-memory_resource = DeviceMemoryResource(device_id)
-buffer_size = 1024
-```
-
-#### Private Attributes
-
-Use **snake_case** with a leading underscore for private instance attributes.
-
-```python
-cdef class Buffer:
-    cdef:
-        uintptr_t _ptr
-        size_t _size
-        MemoryResource _memory_resource
-        object _ipc_data
-```
-
-#### Cython `cdef` Variables
+### Cython `cdef` Variables
 
 Consider prefixing `cdef` variables with `c_` to distinguish them from Python variables. This improves code readability by making it clear which variables are C-level types.
 
@@ -586,33 +533,6 @@ cdef cydriver.CUdevice get_device_from_ctx(
 
 The `c_` prefix is particularly helpful when mixing Python and Cython variables in the same scope, or when the variable name would otherwise be ambiguous.
 
-### Constants
-
-Use **UPPER_SNAKE_CASE** for module-level constants.
-
-```python
-LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
-PER_THREAD_DEFAULT_STREAM = C_PER_THREAD_DEFAULT_STREAM
-
-RUNTIME_CUDA_ERROR_EXPLANATIONS = {
-    # ...
-}
-```
-
-### Private Module-Level Names
-
-Use **snake_case** with a leading underscore for private module-level functions, classes, and variables.
-
-```python
-_fork_warning_checked = False
-
-def _reduce_3_tuple(t: tuple):
-    # ...
-
-cdef inline void _helper_function():
-    # ...
-```
-
 ## Type Annotations and Declarations
 
 ### Python Type Annotations
@@ -663,7 +583,7 @@ def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
 from typing import TYPE_CHECKING
 
 if TYPE_CHECKING:
-    from cuda.core.experimental._stream import Stream
+    from cuda.core._stream import Stream
 
 def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
     # ...
@@ -966,7 +886,7 @@ The project defines custom exception types for CUDA-specific errors:
 - **`NVRTCError`**: Exception for NVRTC (compiler) errors, inherits from `CUDAError`
 
 ```python
-from cuda.core.experimental._utils.cuda_utils import CUDAError, NVRTCError
+from cuda.core._utils.cuda_utils import CUDAError, NVRTCError
 
 raise CUDAError("CUDA operation failed")
 raise NVRTCError("NVRTC compilation error")
@@ -1027,7 +947,7 @@ cdef int allocate_buffer(uintptr_t* ptr, size_t size) except?-1 nogil:
 For Python-level CUDA error handling, use `handle_return()`:
 
 ```python
-from cuda.core.experimental._utils.cuda_utils import handle_return
+from cuda.core._utils.cuda_utils import handle_return
 
 err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
 handle_return((err,))
@@ -1036,7 +956,7 @@ handle_return((err,))
 Or use `raise_if_driver_error()` for direct error raising:
 
 ```python
-from cuda.core.experimental._utils.cuda_utils cimport (
+from cuda.core._utils.cuda_utils cimport (
     _check_driver_error as raise_if_driver_error,
 )
 
@@ -1403,16 +1323,12 @@ _access_flags = {
 
 ### Guidelines
 
-1. **Name all constants**: Avoid magic numbers and strings. Use descriptive constant names.
+1. **Avoid magic numbers and strings**: Use descriptive constant names.
 
-2. **Use UPPER_SNAKE_CASE**: Follow Python convention for module-level constants.
+2. **Prefer CUDA bindings**: Use constants from `cuda.bindings` directly when possible rather than redefining them.
 
 3. **Group related constants**: Define related constants together, optionally in a dictionary or class.
 
-4. **Document non-obvious constants**: If a constant's purpose isn't immediately clear, add a comment explaining it.
-
-5. **Prefer CUDA bindings**: Use constants from `cuda.bindings` directly when possible rather than redefining them.
-
 ## Comments and Inline Documentation
 
 ### TODO Comments
@@ -1467,13 +1383,7 @@ import platform  # no-cython-lint
 
 2. **Use NOTE for important context**: Add `NOTE` comments to explain non-obvious implementation decisions or workarounds.
 
-3. **Explain complex logic**: Add comments to explain why code is written a certain way, not what it does (the code should be self-explanatory).
-
-4. **Keep comments up-to-date**: Update or remove comments when code changes.
-
-5. **Avoid obvious comments**: Don't comment what the code clearly shows. Focus on the "why" rather than the "what".
-
-6. **Document workarounds**: Always document workarounds for bugs (include bug numbers when available) and explain why they're necessary.
+3. **Document workarounds**: Always document workarounds for bugs (include bug numbers when available) and explain why they're necessary.
 
 ## Code Organization Within Files
 
@@ -1670,13 +1580,7 @@ Design APIs to fail fast with clear error messages:
 
 3. **Use `__all__` explicitly**: List all public symbols in `__all__` to clearly define the module's public API.
 
-4. **Design for extensibility**: Consider future needs when designing APIs, but don't over-engineer.
-
-5. **Follow Python conventions**: Adhere to Python naming and design conventions (PEP 8, PEP 20).
-
-6. **Provide clear error messages**: When APIs fail, provide error messages that help users understand and fix the problem.
-
-7. **Use type hints**: Provide type annotations for all public APIs to improve IDE support and documentation.
+4. **Use type hints**: Provide type annotations for all public APIs to improve IDE support and documentation.
 
 ## CUDA-Specific Patterns
 
@@ -1794,11 +1698,11 @@ When implementing new CUDA functionality, follow a two-phase development approac
 
 ### Phase 1: Initial Python Implementation
 
-Begin with a straightforward Python implementation using the `driver` module from `cuda.core.experimental._utils.cuda_utils`:
+Begin with a straightforward Python implementation using the `driver` module from `cuda.core._utils.cuda_utils`:
 
 ```python
-from cuda.core.experimental._utils.cuda_utils import driver
-from cuda.core.experimental._utils.cuda_utils cimport (
+from cuda.core._utils.cuda_utils import driver
+from cuda.core._utils.cuda_utils cimport (
     _check_driver_error as raise_if_driver_error,
 )
 
@@ -1834,13 +1738,13 @@ Once tests are passing, optimize the implementation by:
 
 ```python
 # Remove Python driver import
-# from cuda.core.experimental._utils.cuda_utils import driver
+# from cuda.core._utils.cuda_utils import driver
 
 # Add cydriver cimport
 from cuda.bindings cimport cydriver
 
 # Add HANDLE_RETURN
-from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 ```
 
 **Step 2: Cast stream and extract C-level handle**
@@ -1881,8 +1785,8 @@ All device pointers passed to `cydriver` functions must be cast to `cydriver.CUd
 **Before (Python driver implementation):**
 
 ```python
-from cuda.core.experimental._utils.cuda_utils import driver
-from cuda.core.experimental._utils.cuda_utils cimport (
+from cuda.core._utils.cuda_utils import driver
+from cuda.core._utils.cuda_utils cimport (
     _check_driver_error as raise_if_driver_error,
 )
 
@@ -1903,7 +1807,7 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
 
 ```python
 from cuda.bindings cimport cydriver
-from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
+from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
 def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
     stream = Stream_accept(stream)
@@ -1938,4 +1842,4 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
 
 ## Copyright and Licensing
 
-All source files in `cuda/core/experimental` must include a copyright header at the top of the file using the SPDX format. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.
+All source files in `cuda/core` must include a copyright header at the top of the file using the SPDX format. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.

From 7b5700b1eac6eecb98534500eff38e57acf3a397 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 11:44:44 -0800
Subject: [PATCH 07/17] Address review feedback: __all__, file structure
 rationale, and more

- Clarify __all__ is for star-import behavior, not "public API"
- Remove __all__ from public/private API distinction (redundant)
- Add rationale for file structure ordering (important-to-detailed)
- Add hedge: placing helper functions near call sites is fine
- Extend private functions section with broader category
---
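Note (illustrative only, not part of the patch): a minimal sketch of why this change describes `__all__` as controlling star imports rather than defining the public API. The module `demo_module` and everything in it are hypothetical stand-ins built in memory; none of it is taken from `cuda.core`.

```python
import sys
import types

# Build a throwaway module in memory (hypothetical names throughout).
demo = types.ModuleType("demo_module")
exec(
    "__all__ = ['Buffer']\n"
    "class Buffer: pass\n"
    "class MemoryResource: pass\n"
    "def _helper(): pass\n",
    demo.__dict__,
)
sys.modules["demo_module"] = demo

# Run a star import into a fresh namespace and see which names arrive.
namespace = {}
exec("from demo_module import *", namespace)
star_names = sorted(name for name in namespace if not name.startswith("__"))

print(star_names)
print(hasattr(demo, "MemoryResource"))
```

In this sketch `MemoryResource` has no leading underscore, so it is public by the naming convention and still reachable as `demo_module.MemoryResource`, yet the star import skips it because it is not listed in `__all__`.
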
 cuda_core/docs/developer-guide.md | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index f8f929f01b..25b36cd4f7 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -30,7 +30,11 @@ This guide defines conventions for Python and Cython code in `cuda/core`.
 
 ## File Structure
 
-Files in `cuda/core` should follow a consistent structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
+Files in `cuda/core` should follow a sensible structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
+
+The guiding principle is that content should flow from most important to least important—principal classes first, then supporting classes, then implementation details. This allows readers to start at the top and quickly find what's most relevant. Unlike C/C++ where definitions must precede uses, Python has no such constraint, so we can optimize for readability.
+
+This is not a strict rule. It is sometimes better (and perfectly fine) to place small helper functions near their point of use. When in doubt, optimize for readability.
 
 ### 1. SPDX Copyright Header
 
@@ -42,7 +46,7 @@ Import statements come immediately after the copyright header. Follow the import
 
 ### 3. `__all__` Declaration
 
-If the module exports public API elements, include an `__all__` list after the imports and before any other code. This explicitly defines the public API of the module.
+Each submodule should define `__all__` to specify symbols included in star imports.
 
 ```python
 __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
@@ -88,6 +92,7 @@ Finally, define private functions and implementation details. These include:
 - Functions with names starting with `_` (private)
 - `cdef inline` functions used for internal implementation
 - Helper functions not part of the public API
+- Any specialized or low-level code that would distract from the principal content
 
 ### Example Structure
 
@@ -206,7 +211,7 @@ cdef class DeviceProperties:
 For complex subpackages that require extra structure (like `_memory/`), use the following pattern:
 
 1. **Private submodules**: Each component is implemented in a private submodule (e.g., `_buffer.pyx`, `_device_memory_resource.pyx`)
-2. **Submodule `__all__`**: Each submodule defines its own `__all__` list with the symbols it exports
+2. **Submodule `__all__`**: Each submodule defines its own `__all__` list
 3. **Subpackage `__init__.py`**: The subpackage `__init__.py` uses `from ._module import *` to assemble the package
 
 **Example structure for `_memory/` subpackage:**
@@ -245,7 +250,7 @@ from ._virtual_memory_resource import *  # noqa: F403
 
 This pattern allows:
 - **Modular organization**: Each component lives in its own file
-- **Clear public API**: Each submodule explicitly defines what it exports via `__all__`
+- **Clear star-import behavior**: Each submodule explicitly defines what it exports via `__all__`
 - **Clean package interface**: The subpackage `__init__.py` assembles all exports into a single namespace
 - **Easier refactoring**: Components can be moved or reorganized without changing the public API
 
@@ -257,7 +262,7 @@ This pattern allows:
 
 2. **Keep `.pxd` files minimal**: Only include declarations needed for Cython compilation. Omit implementation details, docstrings, and Python-only code.
 
-3. **Use `__all__` in submodules**: Each submodule should define `__all__` to explicitly declare its public API.
+3. **Use `__all__` in submodules**: Each submodule should define `__all__`.
 
 4. **Use `from ._module import *` in subpackage `__init__.py`**: This pattern assembles the subpackage API from its submodules. Use `# noqa: F403` to suppress linting warnings about wildcard imports.
 
@@ -1516,12 +1521,10 @@ cdef size_t size = self._size
 
 Use naming conventions to distinguish public and private APIs:
 
-- **Public API**: No leading underscore, documented in docstrings, included in `__all__`
-- **Private API**: Leading underscore (`_`), may have minimal documentation, not in `__all__`
+- **Public API**: No leading underscore, documented in docstrings
+- **Private API**: Leading underscore (`_`), may have minimal documentation
 
 ```python
-__all__ = ['Buffer', 'MemoryResource']  # Public API
-
 # Public API
 cdef class Buffer:
     def allocate(self):  # Public method
@@ -1578,7 +1581,7 @@ Design APIs to fail fast with clear error messages:
 
 2. **Document public APIs**: All public APIs must have complete docstrings following the [Docstrings](#docstrings) guidelines.
 
-3. **Use `__all__` explicitly**: List all public symbols in `__all__` to clearly define the module's public API.
+3. **Use `__all__` explicitly**: Each submodule should define `__all__` to specify symbols included in star imports.
 
 4. **Use type hints**: Provide type annotations for all public APIs to improve IDE support and documentation.
 

From e9898ec5d3d42e179e498f86a8e50ba0d6f27b6a Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 12:06:07 -0800
Subject: [PATCH 08/17] Streamline developer guide: remove redundant sections

- Rewrite File Structure intro for clarity (readability/maintainability focus)
- Remove sections: Thread Safety, Cython-Specific Features, Constants,
  Code Organization, API Design, Copyright, Comments, Memory Management
- These were either too prescriptive, duplicative, or better left to
  existing documentation and developer judgment
---
 cuda_core/docs/developer-guide.md | 479 +-----------------------------
 1 file changed, 6 insertions(+), 473 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 25b36cd4f7..375b398a21 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -14,27 +14,19 @@ This guide defines conventions for Python and Cython code in `cuda/core`.
 6. [Type Annotations and Declarations](#type-annotations-and-declarations)
 7. [Docstrings](#docstrings)
 8. [Errors and Warnings](#errors-and-warnings)
-9. [Memory Management](#memory-management)
-10. [Thread Safety and Concurrency](#thread-safety-and-concurrency)
-11. [Cython-Specific Features](#cython-specific-features)
-12. [Constants and Magic Numbers](#constants-and-magic-numbers)
-13. [Comments and Inline Documentation](#comments-and-inline-documentation)
-14. [Code Organization Within Files](#code-organization-within-files)
-15. [Performance Considerations](#performance-considerations)
-16. [API Design Principles](#api-design-principles)
-17. [CUDA-Specific Patterns](#cuda-specific-patterns)
-18. [Development Lifecycle](#development-lifecycle)
-19. [Copyright and Licensing](#copyright-and-licensing)
+9. [Performance Considerations](#performance-considerations)
+10. [CUDA-Specific Patterns](#cuda-specific-patterns)
+11. [Development Lifecycle](#development-lifecycle)
 
 ---
 
 ## File Structure
 
-Files in `cuda/core` should follow a sensible structure. The suggested ordering of elements within a file is as follows. Developers are free to deviate when a different organization makes more sense for a particular file.
+The goal is **readability and maintainability**. A well-organized file lets readers quickly find what they're looking for and understand how the pieces fit together.
 
-The guiding principle is that content should flow from most important to least important—principal classes first, then supporting classes, then implementation details. This allows readers to start at the top and quickly find what's most relevant. Unlike C/C++ where definitions must precede uses, Python has no such constraint, so we can optimize for readability.
+To support this, we suggest organizing content from most important to least important: principal classes first, then supporting classes, then implementation details. This way, readers can start at the top and immediately see what matters most. Unlike C/C++ where definitions must precede uses, Python imposes no such constraint—we're free to optimize for the reader.
 
-This is not a strict rule. It is sometimes better (and perfectly fine) to place small helper functions near their point of use. When in doubt, optimize for readability.
+These are guidelines, not rules. Place helper functions near their call sites if that's clearer. Group related code together if it aids understanding. When in doubt, choose whatever makes the code easiest to read and maintain.
 
 ### 1. SPDX Copyright Header
 
@@ -1055,391 +1047,6 @@ warnings.warn(
 
 6. **Prefer warnings over errors for recoverable issues**: Use warnings for issues that don't prevent execution but may cause problems.
 
-## Memory Management
-
-### Resource Lifecycle
-
-CUDA memory resources and buffers follow a clear lifecycle pattern:
-
-1. **Creation**: Resources and buffers are created through factory methods or constructors
-2. **Usage**: Objects are used for CUDA operations
-3. **Cleanup**: Resources are explicitly closed or automatically cleaned up
-
-### Explicit Cleanup
-
-Always provide explicit cleanup methods for resources that manage CUDA handles:
-
-```python
-cdef class DeviceMemoryResource:
-    def close(self):
-        """Close the memory resource and release CUDA handles."""
-        DMR_close(self)
-
-    def __dealloc__(self):
-        """Automatic cleanup when object is garbage collected."""
-        DMR_close(self)
-```
-
-### Buffer Lifecycle
-
-Buffers are associated with memory resources and should be closed when no longer needed:
-
-```python
-cdef class Buffer:
-    def close(self, stream: Stream | GraphBuilder | None = None):
-        """Deallocate this buffer asynchronously on the given stream."""
-        Buffer_close(self, stream)
-
-    def __dealloc__(self):
-        """Automatic cleanup if not explicitly closed."""
-        self.close(self._alloc_stream)
-```
-
-### Guidelines
-
-1. **Provide explicit `close()` methods**: All resources managing CUDA handles should have a `close()` method for explicit cleanup.
-
-2. **Implement `__dealloc__` as safety net**: Use `__dealloc__` to ensure cleanup happens even if users forget to call `close()`, but don't rely on it for normal operation.
-
-3. **Document cleanup behavior**: Clearly document when cleanup happens automatically versus when it must be called explicitly.
-
-4. **Handle cleanup errors gracefully**: Cleanup methods should be idempotent (safe to call multiple times) and handle errors without raising exceptions when possible.
-
-5. **Use stream-ordered deallocation**: When deallocating buffers, use the appropriate stream for asynchronous cleanup to avoid blocking operations.
-
-6. **Track resource ownership**: Clearly document which objects own CUDA handles and are responsible for cleanup.
-
-## Thread Safety and Concurrency
-
-### Thread-Local Storage
-
-Use `threading.local()` for thread-local state that needs to persist across function calls:
-
-```python
-import threading
-
-_tls = threading.local()
-
-def some_function():
-    if not hasattr(_tls, 'devices'):
-        _tls.devices = []
-    return _tls.devices
-```
-
-### Locks for Shared State
-
-Use `threading.Lock()` to protect shared mutable state:
-
-```python
-import threading
-
-_lock = threading.Lock()
-
-def thread_safe_operation():
-    with _lock:
-        # Critical section
-        # Modify shared state
-        pass
-```
-
-### Combining Locks with `nogil`
-
-When protecting CUDA operations, acquire the lock before entering `nogil` context:
-
-```python
-def thread_safe_cuda_operation():
-    with _lock, nogil:
-        HANDLE_RETURN(cydriver.cuSomeOperation())
-```
-
-### One-Time Initialization
-
-For one-time initialization that must be thread-safe, use a lock with a flag:
-
-```python
-cdef bint _initialized = False
-_lock = threading.Lock()
-
-def initialize():
-    global _initialized
-    with _lock:
-        if not _initialized:
-            # Perform initialization
-            _initialized = True
-```
-
-### Guidelines
-
-1. **Use thread-local storage for per-thread state**: When state needs to be isolated per thread, use `threading.local()`.
-
-2. **Protect shared mutable state**: Use locks to protect any shared mutable state that could be accessed from multiple threads.
-
-3. **Minimize lock scope**: Keep critical sections as short as possible to reduce contention.
-
-4. **Document thread safety**: Clearly document which operations are thread-safe and which require external synchronization.
-
-5. **Avoid global mutable state**: Prefer thread-local storage or instance variables over global mutable state when possible.
-
-6. **Combine locks with `nogil` correctly**: Acquire locks before entering `nogil` contexts, not inside them.
-
-## Cython-Specific Features
-
-### Function Declarations
-
-Cython provides three types of function declarations:
-
-1. **`def`**: Python function, callable from Python, slower than C functions
-2. **`cdef`**: C function, not callable from Python, fastest
-3. **`cpdef`**: Hybrid function, callable from both Python and C, faster than `def` but slower than `cdef`
-
-**Guidelines:**
-
-- Use `cdef` for internal helper functions that are only called from Cython code
-- Use `cpdef` when a function needs to be callable from Python but performance is important
-- Use `def` for public Python API functions where flexibility is more important than performance
-
-```python
-# Internal helper - only used in Cython
-cdef inline void Buffer_close(Buffer self, stream):
-    # ...
-
-# Public API - callable from Python, performance important
-cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
-    # ...
-
-# Public API - standard Python function
-def allocate(self, size_t size, stream=None) -> Buffer:
-    # ...
-```
-
-### Class Declarations
-
-Use `cdef class` for Cython extension types:
-
-```python
-cdef class Buffer:
-    cdef:
-        uintptr_t _ptr
-        size_t _size
-        MemoryResource _memory_resource
-```
-
-### The `nogil` Context
-
-Use `nogil` to release the Global Interpreter Lock (GIL) for performance-critical C operations. See [CUDA-Specific Patterns](#cuda-specific-patterns) for detailed guidelines.
-
-### Exception Handling
-
-Use `except?` or `except` clauses to propagate exceptions from `nogil` functions:
-
-```python
-cdef int get_device_from_ctx(...) except?cydriver.CU_DEVICE_INVALID nogil:
-    # Returns CU_DEVICE_INVALID on error, otherwise raises exception
-```
-
-### Type Declarations
-
-Declare C types explicitly for performance:
-
-```python
-cdef:
-    int device_id
-    size_t buffer_size
-    cydriver.CUdeviceptr ptr
-```
-
-### Inline Functions
-
-Use `inline` for small, frequently-called functions:
-
-```python
-cdef inline void Buffer_close(Buffer self, stream):
-    # ...
-```
-
-### Guidelines
-
-1. **Choose the right function type**: Use `cdef` for internal code, `cpdef` for performance-critical public APIs, `def` for standard public APIs.
-
-2. **Declare types explicitly**: Use `cdef` declarations for C-level types to enable optimizations.
-
-3. **Use `inline` judiciously**: Mark small, frequently-called functions as `inline`, but avoid overuse.
-
-4. **Handle exceptions properly**: Use appropriate exception clauses (`except`, `except?`) for `nogil` functions.
-
-5. **Document Cython-specific behavior**: When using Cython features that affect the Python API, document them clearly.
-
-## Constants and Magic Numbers
-
-### Naming Constants
-
-Use **UPPER_SNAKE_CASE** for module-level constants:
-
-```python
-LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
-PER_THREAD_DEFAULT_STREAM = C_PER_THREAD_DEFAULT_STREAM
-
-RUNTIME_CUDA_ERROR_EXPLANATIONS = {
-    # ...
-}
-```
-
-### CUDA Constants
-
-For CUDA API constants, use the bindings directly or create aliases with descriptive names:
-
-```python
-from cuda.bindings cimport cydriver
-
-# Use CUDA constants directly
-cdef cydriver.CUdevice device_id = cydriver.CU_DEVICE_INVALID
-
-# Or create descriptive aliases
-cdef object CU_DEVICE_INVALID = cydriver.CU_DEVICE_INVALID
-```
-
-### Avoid Magic Numbers
-
-Replace magic numbers with named constants:
-
-**Avoid:**
-```python
-if flags & 1:  # What does 1 mean?
-    # ...
-```
-
-**Preferred:**
-```python
-if flags & cydriver.CUstream_flags.CU_STREAM_NON_BLOCKING:
-    # ...
-```
-
-### Dictionary Mappings
-
-Use dictionaries to map between string representations and constants:
-
-```python
-_access_flags = {
-    "rw": cydriver.CU_MEM_ACCESS_FLAGS_PROT_READWRITE,
-    "r": cydriver.CU_MEM_ACCESS_FLAGS_PROT_READ,
-    None: 0
-}
-```
-
-### Guidelines
-
-1. **Avoid magic numbers and strings**: Use descriptive constant names.
-
-2. **Prefer CUDA bindings**: Use constants from `cuda.bindings` directly when possible rather than redefining them.
-
-3. **Group related constants**: Define related constants together, optionally in a dictionary or class.
-
-## Comments and Inline Documentation
-
-### TODO Comments
-
-Use `TODO` comments to mark incomplete work or future improvements:
-
-```python
-# TODO: It is better to take a stream for latter deallocation
-return Buffer._init(ptr, size, mr=mr)
-
-# TODO: consider lower this to Cython
-expl = DRIVER_CU_RESULT_EXPLANATIONS.get(int(error))
-```
-
-### NOTE Comments
-
-Use `NOTE` comments to explain non-obvious implementation details:
-
-```python
-# NOTE: match this behavior to DeviceMemoryResource.allocate()
-stream = default_stream()
-
-# NOTE: this is referenced in instructions to debug nvbug 5698116
-cpdef DMR_mempool_get_access(DeviceMemoryResource dmr, int device_id):
-```
-
-### Implementation Comments
-
-Add comments to explain complex logic or non-obvious behavior:
-
-```python
-# Must not serialize the parent's stream!
-return Buffer.from_ipc_descriptor, (self.memory_resource, self.get_ipc_descriptor())
-
-# This works around nvbug 5698116. When a memory pool handle is recycled
-# the new handle inherits the peer access state of the previous handle.
-if self._peer_accessible_by:
-    self.peer_accessible_by = []
-```
-
-### Inline Type Comments
-
-Use type comments sparingly, only when type annotations aren't sufficient:
-
-```python
-import platform  # no-cython-lint
-```
-
-### Guidelines
-
-1. **Use TODO for incomplete work**: Mark known limitations, future improvements, or incomplete features with `TODO` comments.
-
-2. **Use NOTE for important context**: Add `NOTE` comments to explain non-obvious implementation decisions or workarounds.
-
-3. **Document workarounds**: Always document workarounds for bugs (include bug numbers when available) and explain why they're necessary.
-
-## Code Organization Within Files
-
-### Overall Structure
-
-Follow the ordering specified in [File Structure](#file-structure):
-
-1. SPDX copyright header
-2. Import statements
-3. `__all__` declaration
-4. Type aliases and constants (optional)
-5. Principal class/function
-6. Other public classes and functions
-7. Public module functions
-8. Private/implementation functions
-
-### Within Classes
-
-Follow the ordering specified in [Class and Function Definitions](#class-and-function-definitions):
-
-1. Special (dunder) methods (`__init__`/`__cinit__` first by convention)
-2. Methods
-3. Properties
-
-### Helper Functions
-
-Move complex implementation details to helper functions at the end of the file. See [Class and Function Definitions - Helper Functions](#helper-functions) for details.
-
-### Type Aliases and Constants
-
-Type aliases and module-level constants should be defined after `__all__` (if present) or after imports, before classes. See [File Structure](#file-structure) for the complete ordering.
-
-```python
-DevicePointerT = driver.CUdeviceptr | int | None
-"""Type union for device pointer representations."""
-
-LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
-```
-
-### Guidelines
-
-1. **Follow the established ordering**: Maintain consistency with the file structure and class definition ordering rules.
-
-2. **Group related code**: Keep related functions and classes together.
-
-3. **Separate public and private**: Clearly separate public API from implementation details.
-
-4. **Use helper functions**: Extract complex logic into helper functions to improve readability.
-
-5. **Keep related code close**: Place helper functions near the code that uses them, or group all helpers at the end of the file.
-
 ## Performance Considerations
 
 ### Use Cython Types
@@ -1515,76 +1122,6 @@ cdef size_t size = self._size
 
 6. **Cache expensive lookups**: Cache results of expensive operations (e.g., dictionary lookups, attribute access) when used repeatedly.
 
-## API Design Principles
-
-### Public vs Private API
-
-Use naming conventions to distinguish public and private APIs:
-
-- **Public API**: No leading underscore, documented in docstrings
-- **Private API**: Leading underscore (`_`), may have minimal documentation
-
-```python
-# Public API
-cdef class Buffer:
-    def allocate(self):  # Public method
-        # ...
-
-# Private API
-cdef inline void Buffer_close(Buffer self, stream):  # Private helper
-    # ...
-```
-
-### Backward Compatibility
-
-Maintain backward compatibility when possible:
-
-- **Deprecation warnings**: Use `DeprecationWarning` for APIs that will be removed
-- **Gradual migration**: Provide both old and new APIs during transition periods
-- **Version documentation**: Document when APIs were introduced or deprecated
-
-### Consistency
-
-Maintain consistency across the API:
-
-- **Naming patterns**: Use consistent naming patterns (e.g., `from_*` for factory methods)
-- **Parameter ordering**: Use consistent parameter ordering across similar functions
-- **Return types**: Use consistent return types for similar operations
-
-### Factory Methods
-
-Use class methods or static methods for factory functions:
-
-```python
-@classmethod
-def from_ipc_descriptor(cls, mr, ipc_descriptor, stream=None) -> Buffer:
-    """Factory method to create Buffer from IPC descriptor."""
-    # ...
-
-@staticmethod
-def from_handle(ptr, size, mr=None) -> Buffer:
-    """Factory method to create Buffer from handle."""
-    # ...
-```
-
-### Error Handling
-
-Design APIs to fail fast with clear error messages:
-
-- **Validate inputs early**: Check parameters at the start of functions
-- **Use appropriate exceptions**: Raise specific exception types for different error conditions
-- **Provide context**: Include relevant values and context in error messages
-
-### Guidelines
-
-1. **Minimize public API surface**: Keep the public API small and focused. Use private helpers for implementation details.
-
-2. **Document public APIs**: All public APIs must have complete docstrings following the [Docstrings](#docstrings) guidelines.
-
-3. **Use `__all__` explicitly**: Each submodule should define `__all__` to specify symbols included in star imports.
-
-4. **Use type hints**: Provide type annotations for all public APIs to improve IDE support and documentation.
-
 ## CUDA-Specific Patterns
 
 ### GIL Management for CUDA Driver API Calls
@@ -1842,7 +1379,3 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
 5. **Consistent pattern**: Follow this pattern for all new CUDA driver API wrappers to maintain consistency across the codebase.
 
 6. **Error handling**: The `HANDLE_RETURN` macro is designed to work in `nogil` contexts and will automatically raise appropriate exceptions when needed.
-
-## Copyright and Licensing
-
-All source files in `cuda/core` must include a copyright header at the top of the file using the SPDX format. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.

From 76c986262309797ad5abb102d492c21d7555c60e Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 12:32:25 -0800
Subject: [PATCH 09/17] Refine File Structure section for clarity and
 consistency

- Add introductory clause before numbered subsections
- Mark __all__ and Type Aliases as optional
- Fix contradiction in Principal Class section
- Make language consistent with "guidelines, not rules" framing
- Streamline all subsections for parallel structure
---
 cuda_core/docs/developer-guide.md | 39 +++++++++++--------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 375b398a21..396f9c2abe 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -28,25 +28,27 @@ To support this, we suggest organizing content from most important to least impo
 
 These are guidelines, not rules. Place helper functions near their call sites if that's clearer. Group related code together if it aids understanding. When in doubt, choose whatever makes the code easiest to read and maintain.
 
+The following is a suggested file organization:
+
 ### 1. SPDX Copyright Header
 
-The file must begin with an SPDX copyright header. Follow the pattern used in existing files. The pre-commit hook will add or update these notices automatically when necessary.
+Every file begins with an SPDX copyright header. The pre-commit hook adds or updates these automatically.
 
 ### 2. Import Statements
 
-Import statements come immediately after the copyright header. Follow the import ordering conventions specified in [Import Statements](#import-statements).
+Imports come next. See [Import Statements](#import-statements) for ordering conventions.
 
-### 3. `__all__` Declaration
+### 3. `__all__` Declaration (Optional)
 
-Each submodule should define `__all__` to specify symbols included in star imports.
+If present, `__all__` specifies symbols included in star imports.
 
 ```python
 __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
 ```
 
-### 4. Type Aliases and Constants
+### 4. Type Aliases and Constants (Optional)
 
-Type aliases and module-level constants may immediately follow `__all__` (if present) or come after imports:
+Type aliases and module-level constants, if any, come next.
 
 ```python
 DevicePointerT = driver.CUdeviceptr | int | None
@@ -57,34 +59,19 @@ LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
 
 ### 5. Principal Class or Function
 
-If the file principally implements a single class or function (e.g., `_buffer.pyx` defines the `Buffer` class, `_device.pyx` defines the `Device` class), that principal class or function should come next, immediately after `__all__` (if present).
-
-The principal class or function typically appears first in its section.
+If the file centers on a single class or function (e.g., `_buffer.pyx` defines `Buffer`, `_device.pyx` defines `Device`), that principal element comes first among the definitions.
 
 ### 6. Other Public Classes and Functions
 
-Following the principal class or function, define other public classes and functions. These include:
-
-- **Auxiliary classes**: Supporting classes that are part of the public API (e.g., `DeviceMemoryResourceOptions` is an auxiliary class used by `DeviceMemoryResource`)
-- **Abstract base classes**: ABCs that define interfaces (e.g., `MemoryResource` in `_buffer.pyx`)
-- **Other public classes**: Additional classes exported by the module
-
-Consider organizing classes and functions logically—for example, by grouping related functionality or by order of typical usage. When no clear logical ordering exists, alphabetical ordering can help with discoverability.
-
-**Example:** In `_device_memory_resource.pyx`, `DeviceMemoryResource` is the principal class and appears first, followed by `DeviceMemoryResourceOptions` (its options class).
+Other public classes and functions follow. These might include auxiliary classes (e.g., `DeviceMemoryResourceOptions`), abstract base classes, or additional exports. Organize them logically—by related functionality or typical usage.
 
 ### 7. Public Module Functions
 
-After all classes, define public module-level functions that are part of the API.
-
-### 8. Private or Implementation Functions
+Public module-level functions come after classes.
 
-Finally, define private functions and implementation details. These include:
+### 8. Private and Implementation Details
 
-- Functions with names starting with `_` (private)
-- `cdef inline` functions used for internal implementation
-- Helper functions not part of the public API
-- Any specialized or low-level code that would distract from the principal content
+Finally, private functions and implementation details: functions prefixed with `_`, `cdef inline` helpers, and any specialized code that would distract from the principal content.
 
 ### Example Structure
 

From f89d169b5f8ca058d758c51f19414e9bef1c1898 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 13:13:23 -0800
Subject: [PATCH 10/17] Refine developer guide: docstrings, cross-references,
 errors section

- Add PEP 257 reference and module docstring placement guidance
- Add Sphinx cross-reference roles documentation with link
- Update examples with proper :class: cross-references
- Streamline Errors and Warnings section (CUDA-specific only)
- Remove .pyx cdef variable duplication in example
- Simplify Helper Functions section
- Clean up Import Statements example
---
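Note (illustrative only, not part of the patch): a small check of the module docstring placement guidance, using the standard library `ast` module. The two source strings are made-up examples; the comment header stands in for the real SPDX header.

```python
import ast

# Docstring directly after the comment header, before imports.
DOCSTRING_FIRST = '''# SPDX-style comment header (placeholder)
"""Module for buffer management."""
import os
'''

# Same string placed after an import: it is only an expression statement.
DOCSTRING_AFTER_IMPORT = '''# SPDX-style comment header (placeholder)
import os
"""This string is not treated as the module docstring."""
'''

first = ast.get_docstring(ast.parse(DOCSTRING_FIRST))
late = ast.get_docstring(ast.parse(DOCSTRING_AFTER_IMPORT))

print(repr(first))
print(repr(late))
```

Comments do not count as statements, so the header does not prevent the string from being the docstring; an import placed ahead of it does.
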
 cuda_core/docs/developer-guide.md | 260 +++++++-----------------------
 1 file changed, 57 insertions(+), 203 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 396f9c2abe..5ed78d035d 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -2,7 +2,7 @@
 
 This guide defines conventions for Python and Cython code in `cuda/core`.
 
-**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide.** The conventions in this document extend PEP 8 with project-specific patterns, particularly for Cython code and the structure of this codebase. Standard PEP 8 conventions (naming, whitespace, etc.) are not repeated here.
+**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide and [PEP 257](https://peps.python.org/pep-0257/) for docstring conventions.** The guidance in this document extends these with project-specific patterns, particularly for Cython code and the structure of this codebase. Standard conventions are not repeated here.
 
 ## Table of Contents
 
@@ -34,11 +34,15 @@ The following is a suggested file organization:
 
 Every file begins with an SPDX copyright header. The pre-commit hook adds or updates these automatically.
 
-### 2. Import Statements
+### 2. Module Docstring (Optional)
+
+If present, the module docstring comes immediately after the copyright header, before any imports. Per PEP 257, this is the standard location for module-level documentation.
+
+### 3. Import Statements
 
 Imports come next. See [Import Statements](#import-statements) for ordering conventions.
 
-### 3. `__all__` Declaration (Optional)
+### 4. `__all__` Declaration (Optional)
 
 If present, `__all__` specifies symbols included in star imports.
 
@@ -46,7 +50,7 @@ If present, `__all__` specifies symbols included in star imports.
 __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
 ```
 
-### 4. Type Aliases and Constants (Optional)
+### 5. Type Aliases and Constants (Optional)
 
 Type aliases and module-level constants, if any, come next.
 
@@ -57,19 +61,19 @@ DevicePointerT = driver.CUdeviceptr | int | None
 LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
 ```
 
-### 5. Principal Class or Function
+### 6. Principal Class or Function
 
 If the file centers on a single class or function (e.g., `_buffer.pyx` defines `Buffer`, `_device.pyx` defines `Device`), that principal element comes first among the definitions.
 
-### 6. Other Public Classes and Functions
+### 7. Other Public Classes and Functions
 
 Other public classes and functions follow. These might include auxiliary classes (e.g., `DeviceMemoryResourceOptions`), abstract base classes, or additional exports. Organize them logically—by related functionality or typical usage.
 
-### 7. Public Module Functions
+### 8. Public Module Functions
 
 Public module-level functions come after classes.
 
-### 8. Private and Implementation Details
+### 9. Private and Implementation Details
 
 Finally, private functions and implementation details: functions prefixed with `_`, `cdef inline` helpers, and any specialized code that would distract from the principal content.
 
@@ -77,34 +81,29 @@ Finally, private functions and implementation details: functions prefixed with `
 
 ```python
 # <SPDX copyright header>
+"""Module for buffer and memory resource management."""
 
-# Imports (cimports first, then regular imports)
 from libc.stdint cimport uintptr_t
 from cuda.core._memory._device_memory_resource cimport DeviceMemoryResource
 import abc
 
 __all__ = ['Buffer', 'MemoryResource', 'some_public_function']
 
-# Type aliases (if any)
 DevicePointerT = driver.CUdeviceptr | int | None
 """Type union for device pointer representations."""
 
-# Principal class
 cdef class Buffer:
     """Principal class for this module."""
     # ...
 
-# Other public classes
 cdef class MemoryResource:
     """Abstract base class."""
     # ...
 
-# Public module functions
 def some_public_function():
     """Public API function."""
     # ...
 
-# Private implementation functions
 cdef inline void Buffer_close(Buffer self, stream):
     """Private implementation helper."""
     # ...
@@ -157,11 +156,6 @@ cdef class Buffer:
 ```python
 cdef class Buffer:
     """Full implementation with methods and docstrings."""
-    cdef:
-        uintptr_t      _ptr
-        size_t         _size
-        MemoryResource _memory_resource
-        object         _ipc_data
 
     def close(self, stream=None):
         """Implementation here."""
@@ -241,7 +235,7 @@ This pattern allows:
 
 2. **Keep `.pxd` files minimal**: Only include declarations needed for Cython compilation. Omit implementation details, docstrings, and Python-only code.
 
-3. **Use `__all__` in submodules**: Each submodule should define `__all__`.
+3. **Use `__all__` when helpful**: Define `__all__` to control exported symbols when it simplifies or clarifies the module structure.
 
 4. **Use `from ._module import *` in subpackage `__init__.py`**: This pattern assembles the subpackage API from its submodules. Use `# noqa: F403` to suppress linting warnings about wildcard imports.
 
@@ -343,24 +337,19 @@ from cuda.core._utils.cuda_utils import (
 ```python
 # <SPDX copyright header>
 
-# 1. __future__ imports
 from __future__ import annotations
 
-# 2. External cimports
 cimport cpython
 from libc.stdint cimport uintptr_t
 from libc.stdlib cimport malloc, free
 from cuda.bindings cimport cydriver
 
-# 3. cuda-core cimports
 from cuda.core._memory._buffer cimport Buffer, MemoryResource
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
-# 4. External imports
 import abc
 from dataclasses import dataclass
 
-# 5. cuda-core imports
 from cuda.core._context import Context
 from cuda.core._device import Device
 from cuda.core._utils.cuda_utils import driver
@@ -452,13 +441,9 @@ cdef class Buffer:
 
 ### Helper Functions
 
-Sometimes, implementation details are moved outside of the class definition to improve readability. Helper functions should be placed at the end of the file (in the private/implementation section) when:
-
-- The indentation level exceeds 4 levels
-- A method definition is long (>20 lines)
-- The class definition itself is very long
+When a class grows long or a method becomes deeply nested, consider extracting implementation details into helper functions. The goal is to keep class definitions easy to navigate—readers shouldn't have to scroll through hundreds of lines to understand a class's interface.
 
-In Cython files, these are often `cdef` or `cdef inline` functions. The helper function name typically follows the pattern `ClassName_methodname` (e.g., `DMR_close`, `Buffer_close`).
+In Cython files, helpers are typically `cdef` or `cdef inline` functions named with the pattern `ClassName_methodname` (e.g., `DMR_close`, `Buffer_close`). Place them at the end of the file or near their call sites, whichever aids readability.
 
 **Example:**
 
@@ -466,13 +451,10 @@ In Cython files, these are often `cdef` or `cdef inline` functions. The helper f
 cdef class DeviceMemoryResource:
     def close(self):
         """Close the memory resource."""
-        DMR_close(self)  # Calls helper function
-
-# ... other classes and functions ...
+        DMR_close(self)
 
-# Helper function at end of file
+# Helper function (at end of file or nearby)
 cdef inline DMR_close(DeviceMemoryResource self):
-    """Implementation moved outside class for readability."""
     if self._handle == NULL:
         return
     # ... implementation ...
@@ -640,14 +622,18 @@ result
 
 ### Module Docstrings
 
-Module docstrings should appear after imports and `__all__` (if present), before any classes or functions. They should provide a brief overview of the module's purpose.
+Per PEP 257, module docstrings appear at the top of the file, immediately after the copyright header and before any imports. They provide a brief overview of the module's purpose.
 
 ```python
+# <SPDX copyright header>
 """Module for managing CUDA device memory resources.
 
 This module provides classes and functions for allocating and managing
 device memory using CUDA's stream-ordered memory pool API.
 """
+
+from __future__ import annotations
+# ... imports ...
 ```
 
 For simple utility modules, a single-line docstring may suffice:
@@ -676,9 +662,9 @@ cdef class DeviceMemoryResource(MemoryResource):
 
     Parameters
     ----------
-    device_id : Device | int
-        Device or Device ordinal for which a memory resource is constructed.
-    options : DeviceMemoryResourceOptions, optional
+    device_id : :class:`Device` | int
+        Device or device ordinal for which a memory resource is constructed.
+    options : :class:`DeviceMemoryResourceOptions`, optional
         Memory resource creation options. If None, uses the driver's current
         or default memory pool for the specified device.
 
@@ -740,13 +726,13 @@ def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) ->
     ----------
     size : int
         The size of the buffer to allocate, in bytes.
-    stream : Stream | GraphBuilder, optional
+    stream : :class:`Stream` | :class:`GraphBuilder`, optional
         The stream on which to perform the allocation asynchronously.
         If None, an internal stream is used.
 
     Returns
     -------
-    Buffer
+    :class:`Buffer`
         The allocated buffer object, which is accessible on the device
         that this memory resource was created for.
 
@@ -768,7 +754,7 @@ For simple functions, a brief docstring may suffice:
 
 ```python
 def get_ipc_descriptor(self) -> IPCBufferDescriptor:
-    """Export a buffer allocated for sharing between processes."""
+    """Export a :class:`Buffer` for sharing between processes."""
 ```
 
 ### Property Docstrings
@@ -799,7 +785,7 @@ def peer_accessible_by(self):
 
     Notes
     -----
-    When setting, accepts a sequence of Device objects or device IDs.
+    When setting, accepts a sequence of :class:`Device` objects or device IDs.
     Setting to an empty sequence revokes all peer access.
 
     Examples
@@ -811,13 +797,20 @@ def peer_accessible_by(self):
 
 ### Type References in Docstrings
 
-Use Sphinx-style cross-references for types:
+Use Sphinx cross-reference roles to link to other documented objects. Choose the most specific role for the kind of object being referenced:
+
+| Role | Use for | Example |
+|------|---------|---------|
+| `:class:` | Classes | `` :class:`Buffer` `` |
+| `:func:` | Functions | `` :func:`launch` `` |
+| `:meth:` | Methods | `` :meth:`Device.create_stream` `` |
+| `:attr:` | Attributes | `` :attr:`device_id` `` |
+| `:mod:` | Modules | `` :mod:`multiprocessing` `` |
+| `:obj:` | Type aliases, other objects | `` :obj:`DevicePointerT` `` |
+
+The `~` prefix displays only the final component: `` :class:`~cuda.core.Buffer` `` renders as "Buffer" while still linking to the full path.
 
-- **Classes**: ``:class:`Buffer` `` or ``:class:`~_memory.Buffer` `` (with `~` to hide module path)
-- **Methods**: ``:meth:`DeviceMemoryResource.allocate` ``
-- **Attributes**: ``:attr:`device_id` ``
-- **Modules**: ``:mod:`multiprocessing` ``
-- **Objects**: ``:obj:`~_memory.DevicePointerT` ``
+For more details, see the [Sphinx Python domain documentation](https://www.sphinx-doc.org/en/master/usage/domains/python.html#cross-referencing-python-objects).
 
 **Example:**
 
@@ -825,15 +818,15 @@ Use Sphinx-style cross-references for types:
 def from_handle(
     ptr: DevicePointerT, size_t size, mr: MemoryResource | None = None
 ) -> Buffer:
-    """Create a new :class:`Buffer` object from a pointer.
+    """Create a new :class:`Buffer` from a pointer.
 
     Parameters
     ----------
-    ptr : :obj:`~_memory.DevicePointerT`
+    ptr : :obj:`DevicePointerT`
         Allocated buffer handle object.
     size : int
         Memory size of the buffer.
-    mr : :obj:`~_memory.MemoryResource`, optional
+    mr : :class:`MemoryResource`, optional
         Memory resource associated with the buffer.
     """
 ```
@@ -860,179 +853,40 @@ def from_handle(
 
 ## Errors and Warnings
 
-### Exception Types
-
-#### Custom Exceptions
-
-The project defines custom exception types for CUDA-specific errors:
-
-- **`CUDAError`**: Base exception for CUDA-related errors
-- **`NVRTCError`**: Exception for NVRTC (compiler) errors, inherits from `CUDAError`
-
-```python
-from cuda.core._utils.cuda_utils import CUDAError, NVRTCError
-
-raise CUDAError("CUDA operation failed")
-raise NVRTCError("NVRTC compilation error")
-```
-
-#### Standard Python Exceptions
-
-Use standard Python exceptions when appropriate:
-
-- **`ValueError`**: Invalid argument values
-- **`TypeError`**: Invalid argument types
-- **`RuntimeError`**: Runtime errors that don't fit other categories
-- **`NotImplementedError`**: Features that are not yet implemented
-- **`BufferError`**: Buffer protocol-related errors
+### CUDA Exceptions
 
-```python
-if size < 0:
-    raise ValueError(f"size must be non-negative, got {size}")
-
-if not isinstance(stream, Stream):
-    raise TypeError(f"stream must be a Stream, got {type(stream)}")
+The project defines custom exceptions for CUDA-specific errors:
 
-if self.is_mapped:
-    raise RuntimeError("Memory resource is not IPC-enabled")
-```
+- **`CUDAError`**: Base exception for CUDA driver errors
+- **`NVRTCError`**: Exception for NVRTC compiler errors (inherits from `CUDAError`)
 
-### Raising Errors
+Use these instead of generic exceptions when reporting CUDA failures.
 
-#### Error Messages
+### CUDA API Error Handling
 
-Error messages should be clear and include context:
+In `nogil` contexts, use the `HANDLE_RETURN` macro:
 
-**Preferred:**
-```python
-if dst_size != src_size:
-    raise ValueError(
-        f"buffer sizes mismatch between src and dst "
-        f"(sizes are: src={src_size}, dst={dst_size})"
-    )
-```
-
-**Avoid:**
 ```python
-if dst_size != src_size:
-    raise ValueError("sizes don't match")
-```
-
-#### CUDA API Error Handling
-
-For CUDA Driver API calls, use the `HANDLE_RETURN` macro in `nogil` contexts:
-
-```python
-cdef int allocate_buffer(uintptr_t* ptr, size_t size) except?-1 nogil:
+with nogil:
     HANDLE_RETURN(cydriver.cuMemAlloc(ptr, size))
-    return 0
 ```
 
-For Python-level CUDA error handling, use `handle_return()`:
+At the Python level, use `handle_return()` or `raise_if_driver_error()`:
 
 ```python
-from cuda.core._utils.cuda_utils import handle_return
-
 err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
 handle_return((err,))
 ```
 
-Or use `raise_if_driver_error()` for direct error raising:
-
-```python
-from cuda.core._utils.cuda_utils cimport (
-    _check_driver_error as raise_if_driver_error,
-)
-
-err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
-raise_if_driver_error(err)
-```
-
-#### Error Explanations
-
-CUDA errors include explanations from dictionaries (`DRIVER_CU_RESULT_EXPLANATIONS`, `RUNTIME_CUDA_ERROR_EXPLANATIONS`) when available. The error checking functions (`_check_driver_error()`, `_check_runtime_error()`) automatically include these explanations in the error message.
-
 ### Warnings
 
-#### Warning Categories
-
-Use appropriate warning categories:
-
-- **`UserWarning`**: For user-facing warnings about potentially problematic usage
-- **`DeprecationWarning`**: For deprecated features that will be removed in future versions
-
-```python
-import warnings
-
-warnings.warn(
-    "multiprocessing start method is 'fork', which CUDA does not support. "
-    "Forked subprocesses exhibit undefined behavior. "
-    "Set the start method to 'spawn' before creating processes that use CUDA.",
-    UserWarning,
-    stacklevel=3
-)
-
-warnings.warn(
-    "Implementing __cuda_stream__ as an attribute is deprecated; "
-    "it must be implemented as a method",
-    DeprecationWarning,
-    stacklevel=3
-)
-```
-
-#### Stack Level
-
-Always specify the `stacklevel` parameter to point to the caller, not the warning location:
+When emitting warnings, always specify `stacklevel` so the warning points to the caller:
 
 ```python
 warnings.warn(message, UserWarning, stacklevel=3)
 ```
 
-The `stacklevel` value depends on the call depth. Use `stacklevel=2` for direct function calls, `stacklevel=3` for calls through helper functions.
-
-#### One-Time Warnings
-
-For warnings that should only be emitted once per process, use a module-level flag:
-
-```python
-_fork_warning_checked = False
-
-def check_multiprocessing_start_method():
-    global _fork_warning_checked
-    if _fork_warning_checked:
-        return
-    _fork_warning_checked = True
-
-    # ... check condition and emit warning ...
-    warnings.warn(message, UserWarning, stacklevel=3)
-```
-
-#### Deprecation Warnings
-
-For deprecation warnings, use `warnings.simplefilter("once", DeprecationWarning)` to ensure each deprecation message is shown only once:
-
-```python
-warnings.simplefilter("once", DeprecationWarning)
-warnings.warn(
-    "Feature X is deprecated and will be removed in a future version",
-    DeprecationWarning,
-    stacklevel=3
-)
-```
-
-### Guidelines
-
-1. **Use specific exception types**: Choose the most appropriate exception type for the error condition.
-
-2. **Include context in error messages**: Error messages should include relevant values and context to help users diagnose issues.
-
-3. **Use custom exceptions for CUDA errors**: Use `CUDAError` or `NVRTCError` for CUDA-specific errors rather than generic exceptions.
-
-4. **Specify stacklevel for warnings**: Always include `stacklevel` parameter in `warnings.warn()` calls to point to the actual caller.
-
-5. **Use one-time warnings for repeated operations**: When a warning could be triggered multiple times, use a flag to ensure it's only shown once.
-
-6. **Prefer warnings over errors for recoverable issues**: Use warnings for issues that don't prevent execution but may cause problems.
+The value depends on call depth—typically `stacklevel=2` for direct calls, `stacklevel=3` when called through a helper.
 
 ## Performance Considerations
 

From e26f6357ed1da85b9cb0c5466396159c7672e034 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 13:23:21 -0800
Subject: [PATCH 11/17] Streamline CUDA-Specific Patterns and remove
 Performance section

- Remove Performance Considerations section (generic Cython advice)
- Rewrite GIL management section with softer language
- Frame GIL release as optimization, not requirement
- Reference Development Lifecycle for pure Python phase
- Remove Function-Level nogil subsection (note alternative inline)
- Simplify exception raising guidance
---
 cuda_core/docs/developer-guide.md | 156 ++----------------------------
 1 file changed, 10 insertions(+), 146 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 5ed78d035d..1703c72296 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -14,9 +14,8 @@ This guide defines conventions for Python and Cython code in `cuda/core`.
 6. [Type Annotations and Declarations](#type-annotations-and-declarations)
 7. [Docstrings](#docstrings)
 8. [Errors and Warnings](#errors-and-warnings)
-9. [Performance Considerations](#performance-considerations)
-10. [CUDA-Specific Patterns](#cuda-specific-patterns)
-11. [Development Lifecycle](#development-lifecycle)
+9. [CUDA-Specific Patterns](#cuda-specific-patterns)
+10. [Development Lifecycle](#development-lifecycle)
 
 ---
 
@@ -888,90 +887,17 @@ warnings.warn(message, UserWarning, stacklevel=3)
 
 The value depends on call depth—typically `stacklevel=2` for direct calls, `stacklevel=3` when called through a helper.
 
-## Performance Considerations
-
-### Use Cython Types
-
-Declare C types explicitly for performance-critical code:
-
-```python
-cdef:
-    int device_id
-    size_t buffer_size
-    cydriver.CUdeviceptr ptr
-```
-
-### Prefer `cdef` for Internal Functions
-
-Use `cdef` functions for internal operations that don't need to be callable from Python:
-
-```python
-cdef inline void Buffer_close(Buffer self, stream):
-    # Fast C-level function
-```
-
-### Release GIL for CUDA Operations
-
-Always release the GIL when calling CUDA driver APIs. See [CUDA-Specific Patterns](#cuda-specific-patterns) for details.
-
-### Minimize Python Object Creation
-
-Avoid creating Python objects in hot paths:
-
-```python
-# Avoid: Creates Python list
-result = []
-for i in range(n):
-    result.append(i)
-
-# Preferred: Use C array or pre-allocate
-cdef int* c_result = <int*>malloc(n * sizeof(int))
-```
-
-### Use `inline` for Small Functions
-
-Mark small, frequently-called functions as `inline`:
-
-```python
-cdef inline int get_device_id(DeviceMemoryResource mr):
-    return mr._dev_id
-```
-
-### Avoid Unnecessary Type Conversions
-
-Minimize conversions between C and Python types:
-
-```python
-# Avoid: Unnecessary conversion
-cdef int size = int(self._size)
-
-# Preferred: Use C type directly
-cdef size_t size = self._size
-```
-
-### Guidelines
-
-1. **Profile before optimizing**: Don't optimize prematurely. Use profiling to identify actual bottlenecks.
-
-2. **Use C types in hot paths**: Declare C types (`cdef`) for variables used in performance-critical loops.
-
-3. **Release GIL appropriately**: Always release GIL for CUDA operations, but be careful about Python object access.
-
-4. **Minimize Python overhead**: Avoid Python object creation, method calls, and attribute access in hot paths.
-
-5. **Use `inline` judiciously**: Mark small, frequently-called functions as `inline`, but don't overuse (compiler may ignore if function is too large).
-
-6. **Cache expensive lookups**: Cache results of expensive operations (e.g., dictionary lookups, attribute access) when used repeatedly.
-
 ## CUDA-Specific Patterns
 
 ### GIL Management for CUDA Driver API Calls
 
-**Always release the Global Interpreter Lock (GIL) when calling CUDA driver API functions.** This is critical for performance and thread safety.
+For optimized Cython code, release the GIL when calling CUDA driver APIs. This improves performance and allows other Python threads to run during CUDA operations.
+
+During initial development, it's fine to use the Python `driver` module without releasing the GIL (see [Development Lifecycle](#development-lifecycle)). GIL release is a performance optimization that can be applied once the implementation is correct.
 
 #### Using `with nogil:` Blocks
 
-Wrap CUDA driver API calls in `with nogil:` blocks:
+Wrap `cydriver` calls in `with nogil:` blocks (or declare entire functions as `nogil`):
 
 ```python
 cdef cydriver.CUstream s
@@ -980,7 +906,7 @@ with nogil:
 self._handle = s
 ```
 
-For multiple driver calls, group them in a single `with nogil:` block:
+Group multiple driver calls in a single block:
 
 ```python
 cdef int high, low
@@ -989,75 +915,13 @@ with nogil:
     HANDLE_RETURN(cydriver.cuStreamCreateWithPriority(&s, flags, prio))
 ```
 
-#### Function-Level `nogil` Declaration
-
-For functions that primarily call CUDA driver APIs, declare the function `nogil`:
-
-```python
-cdef int get_device_from_ctx(
-        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
-    """Get device ID from the given ctx."""
-    cdef bint switch_context = (curr_ctx != target_ctx)
-    cdef cydriver.CUcontext ctx
-    cdef cydriver.CUdevice target_dev
-    with nogil:
-        if switch_context:
-            HANDLE_RETURN(cydriver.cuCtxPopCurrent(&ctx))
-            HANDLE_RETURN(cydriver.cuCtxPushCurrent(target_ctx))
-        HANDLE_RETURN(cydriver.cuCtxGetDevice(&target_dev))
-        if switch_context:
-            HANDLE_RETURN(cydriver.cuCtxPopCurrent(&ctx))
-            HANDLE_RETURN(cydriver.cuCtxPushCurrent(curr_ctx))
-    return target_dev
-```
-
 #### Raising Exceptions from `nogil` Context
 
-When raising exceptions from a `nogil` context, acquire the GIL first using `with gil:`:
-
-```python
-cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
-    if error == cydriver.CUresult.CUDA_SUCCESS:
-        return 0
-    cdef const char* name
-    name_err = cydriver.cuGetErrorName(error, &name)
-    if name_err != cydriver.CUresult.CUDA_SUCCESS:
-        with gil:
-            raise CUDAError(f"UNEXPECTED ERROR CODE: {error}")
-    with gil:
-        expl = DRIVER_CU_RESULT_EXPLANATIONS.get(int(error))
-        if expl is not None:
-            raise CUDAError(f"{name.decode()}: {expl}")
-    # ... rest of error handling ...
-```
-
-#### Guidelines
-
-1. **Always use `with nogil:` for CUDA driver calls**: Every call to `cydriver.*` functions should be within a `with nogil:` block.
-
-2. **Use `HANDLE_RETURN` within `nogil` blocks**: The `HANDLE_RETURN` macro is designed to work in `nogil` contexts.
-
-3. **Acquire GIL before raising exceptions**: When raising Python exceptions from a `nogil` context, use `with gil:` to acquire the GIL first.
-
-4. **Group related driver calls**: If multiple driver calls are made sequentially, group them in a single `with nogil:` block for efficiency.
-
-5. **Declare functions `nogil` when appropriate**: Functions that primarily call CUDA driver APIs and don't need Python object access should be declared `nogil` at the function level.
-
-### Example
+To raise exceptions from a `nogil` context, acquire the GIL first:
 
 ```python
-cdef inline void DMR_close(DeviceMemoryResource self):
-    if self._handle == NULL:
-        return
-
-    try:
-        if self._mempool_owned:
-            with nogil:
-                HANDLE_RETURN(cydriver.cuMemPoolDestroy(self._handle))
-    finally:
-        self._dev_id = cydriver.CU_DEVICE_INVALID
-        self._handle = NULL
-        # ... cleanup ...
+with gil:
+    raise CUDAError(f"CUDA operation failed: {error}")
 ```
 
 ## Development Lifecycle

From 0f3d9767d2d9a7b88652d0a07dd6225e971e4b2c Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 13:28:35 -0800
Subject: [PATCH 12/17] Streamline Development Lifecycle section

- Rewrite as "common pattern" rather than mandatory approach
- Remove rigid Guidelines subsection ("Always", "Don't skip")
- Simplify examples, keeping one before/after pair
- Add concise bullet list of key conversion changes
- Frame as practical advice, not mandates
---
 cuda_core/docs/developer-guide.md | 154 +++++-------------------------
 1 file changed, 25 insertions(+), 129 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 1703c72296..2f29b600c0 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -926,108 +926,19 @@ with gil:
 
 ## Development Lifecycle
 
-### Two-Phase Development Approach
+### Two-Phase Development
 
-When implementing new CUDA functionality, follow a two-phase development approach:
+A common pattern when implementing CUDA functionality is to develop in two phases:
 
-1. **Phase 1: Python Implementation with Tests**
-   - Start with a pure Python implementation using the CUDA driver module
-   - Write comprehensive tests to verify correctness
-   - Ensure all tests pass before proceeding to Phase 2
+1. **Start with Python**: Use the `driver` module for a straightforward implementation. Write tests to verify correctness. This allows faster iteration and easier debugging.
 
-2. **Phase 2: Cythonization for Performance**
-   - After tests are passing, optimize by switching to `cydriver`
-   - Add `with nogil:` blocks around CUDA driver API calls
-   - Use `HANDLE_RETURN` macro for error handling
-   - Verify tests still pass after optimization
+2. **Optimize with Cython**: Once the implementation is correct, switch to `cydriver` with `nogil` blocks and `HANDLE_RETURN` for better performance.
 
-### Phase 1: Initial Python Implementation
+This approach separates correctness from optimization. Getting the logic right first—with Python's better error messages and stack traces—often saves time overall.
 
-Begin with a straightforward Python implementation using the `driver` module from `cuda.core._utils.cuda_utils`:
+### Python Implementation
 
-```python
-from cuda.core._utils.cuda_utils import driver
-from cuda.core._utils.cuda_utils cimport (
-    _check_driver_error as raise_if_driver_error,
-)
-
-def copy_to(self, dst: Buffer = None, *, stream: Stream | GraphBuilder) -> Buffer:
-    stream = Stream_accept(stream)
-    cdef size_t src_size = self._size
-
-    # ... validation logic ...
-
-    err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
-    raise_if_driver_error(err)
-    return dst
-```
-
-**Benefits of starting with Python:**
-- Faster iteration during development
-- Easier debugging with Python stack traces
-- Simpler error handling
-- Focus on correctness before optimization
-
-### Phase 2: Cythonization Process
-
-Once tests are passing, optimize the implementation by:
-
-1. **Switching to `cydriver`**: Replace `driver` module calls with direct `cydriver` calls
-2. **Adding `with nogil:` blocks**: Wrap CUDA driver API calls to release the GIL
-3. **Using `HANDLE_RETURN`**: Replace `raise_if_driver_error()` with the `HANDLE_RETURN` macro
-4. **Casting stream handles**: Access the C-level stream handle for `cydriver` calls
-
-#### Step-by-Step Conversion
-
-**Step 1: Update imports**
-
-```python
-# Remove Python driver import
-# from cuda.core._utils.cuda_utils import driver
-
-# Add cydriver cimport
-from cuda.bindings cimport cydriver
-
-# Add HANDLE_RETURN
-from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
-```
-
-**Step 2: Cast stream and extract C-level handle**
-
-```python
-stream = Stream_accept(stream)
-cdef Stream s_stream = <Stream>stream
-cdef cydriver.CUstream s = s_stream._handle
-```
-
-**Step 3: Wrap driver calls in `with nogil:` and use `HANDLE_RETURN`**
-
-```python
-# Before (Python driver):
-err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
-raise_if_driver_error(err)
-
-# After (cydriver):
-with nogil:
-    HANDLE_RETURN(cydriver.cuMemcpyAsync(
-        <cydriver.CUdeviceptr>dst._ptr,
-        <cydriver.CUdeviceptr>self._ptr,
-        src_size,
-        s
-    ))
-```
-
-**Step 4: Cast pointers to `cydriver.CUdeviceptr`**
-
-All device pointers passed to `cydriver` functions must be cast to `cydriver.CUdeviceptr`:
-
-```python
-<cydriver.CUdeviceptr>self._ptr
-```
-
-### Complete Example: Before and After
-
-**Before (Python driver implementation):**
+Use the `driver` module from `cuda.core._utils.cuda_utils`:
 
 ```python
 from cuda.core._utils.cuda_utils import driver
@@ -1037,18 +948,14 @@ from cuda.core._utils.cuda_utils cimport (
 
 def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
     stream = Stream_accept(stream)
-    cdef size_t buffer_size = self._size
-    cdef unsigned char c_value8
-
-    # Validation...
-    if width == 1:
-        c_value8 = <unsigned char>value
-        N = buffer_size
-        err, = driver.cuMemsetD8Async(self._ptr, c_value8, N, stream.handle)
-        raise_if_driver_error(err)
+    # ... validation ...
+    err, = driver.cuMemsetD8Async(self._ptr, value, size, stream.handle)
+    raise_if_driver_error(err)
 ```
 
-**After (Cythonized with cydriver):**
+### Cython Optimization
+
+When ready to optimize, convert to `cydriver`:
 
 ```python
 from cuda.bindings cimport cydriver
@@ -1058,29 +965,18 @@ def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
     stream = Stream_accept(stream)
     cdef Stream s_stream = <Stream>stream
     cdef cydriver.CUstream s = s_stream._handle
-    cdef size_t buffer_size = self._size
-    cdef unsigned char c_value8
-
-    # Validation...
-    if width == 1:
-        c_value8 = <unsigned char>value
-        N = buffer_size
-        with nogil:
-            HANDLE_RETURN(cydriver.cuMemsetD8Async(
-                <cydriver.CUdeviceptr>self._ptr, c_value8, N, s
-            ))
+    # ... validation ...
+    with nogil:
+        HANDLE_RETURN(cydriver.cuMemsetD8Async(
+            <cydriver.CUdeviceptr>self._ptr, value, size, s
+        ))
 ```
 
-### Guidelines
-
-1. **Always write tests first**: Implement comprehensive tests before optimizing. This ensures correctness is established before performance improvements.
-
-2. **Verify tests after optimization**: After converting to `cydriver`, run all tests to ensure behavior is unchanged.
-
-3. **Don't skip Phase 1**: Even if you're confident about the implementation, starting with Python helps catch logic errors early.
-
-4. **Performance benefits**: The Cythonized version eliminates Python overhead and releases the GIL, providing significant performance improvements for CUDA operations.
-
-5. **Consistent pattern**: Follow this pattern for all new CUDA driver API wrappers to maintain consistency across the codebase.
+Key changes:
+- Replace `driver` with `cydriver`
+- Extract C-level handles (e.g., `s_stream._handle`)
+- Wrap calls in `with nogil:`
+- Use `HANDLE_RETURN` instead of `raise_if_driver_error`
+- Cast pointers to `cydriver.CUdeviceptr`
 
-6. **Error handling**: The `HANDLE_RETURN` macro is designed to work in `nogil` contexts and will automatically raise appropriate exceptions when needed.
+Run tests after optimization to verify behavior is unchanged.

From 4cc1745a7739d6254822211640e476f43f294a11 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 13:49:29 -0800
Subject: [PATCH 13/17] Simplify Development Lifecycle and GIL examples

- Use device attribute queries instead of buffer/stream operations
- Remove resource handle complexity (cu(), _h_ptr, _h_stream)
- Keep examples focused on Cythonization process
- Align GIL management and Development Lifecycle examples
---
 cuda_core/docs/developer-guide.md | 33 +++++++++++--------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 2f29b600c0..38a1cf8399 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -900,19 +900,17 @@ During initial development, it's fine to use the Python `driver` module without
 Wrap `cydriver` calls in `with nogil:` blocks (or declare entire functions as `nogil`):
 
 ```python
-cdef cydriver.CUstream s
+cdef int value
 with nogil:
-    HANDLE_RETURN(cydriver.cuStreamCreateWithPriority(&s, flags, prio))
-self._handle = s
+    HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, attr, device_id))
 ```
 
 Group multiple driver calls in a single block:
 
 ```python
-cdef int high, low
+cdef int low, high
 with nogil:
-    HANDLE_RETURN(cydriver.cuCtxGetStreamPriorityRange(&high, &low))
-    HANDLE_RETURN(cydriver.cuStreamCreateWithPriority(&s, flags, prio))
+    HANDLE_RETURN(cydriver.cuCtxGetStreamPriorityRange(&low, &high))
 ```
 
 #### Raising Exceptions from `nogil` Context
@@ -946,37 +944,30 @@ from cuda.core._utils.cuda_utils cimport (
     _check_driver_error as raise_if_driver_error,
 )
 
-def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
-    stream = Stream_accept(stream)
-    # ... validation ...
-    err, = driver.cuMemsetD8Async(self._ptr, value, size, stream.handle)
+def get_attribute(self, attr: int) -> int:
+    err, value = driver.cuDeviceGetAttribute(attr, self._id)
     raise_if_driver_error(err)
+    return value
 ```
 
 ### Cython Optimization
 
-When ready to optimize, convert to `cydriver`:
+When ready to optimize, switch to `cydriver`:
 
 ```python
 from cuda.bindings cimport cydriver
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
-def fill(self, value: int, width: int, *, stream: Stream | GraphBuilder):
-    stream = Stream_accept(stream)
-    cdef Stream s_stream = <Stream>stream
-    cdef cydriver.CUstream s = s_stream._handle
-    # ... validation ...
+def get_attribute(self, attr: int) -> int:
+    cdef int value
     with nogil:
-        HANDLE_RETURN(cydriver.cuMemsetD8Async(
-            <cydriver.CUdeviceptr>self._ptr, value, size, s
-        ))
+        HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, attr, self._id))
+    return value
 ```
 
 Key changes:
 - Replace `driver` with `cydriver`
-- Extract C-level handles (e.g., `s_stream._handle`)
 - Wrap calls in `with nogil:`
 - Use `HANDLE_RETURN` instead of `raise_if_driver_error`
-- Cast pointers to `cydriver.CUdeviceptr`
 
 Run tests after optimization to verify behavior is unchanged.

From 7027fc27bdf25b190f0015a445e6ea3d33e7f915 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 14:06:38 -0800
Subject: [PATCH 14/17] Fix backtick rendering in Sphinx cross-reference
 examples

---
 cuda_core/docs/developer-guide.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
index 38a1cf8399..61dcac8878 100644
--- a/cuda_core/docs/developer-guide.md
+++ b/cuda_core/docs/developer-guide.md
@@ -800,14 +800,14 @@ Use Sphinx cross-reference roles to link to other documented objects. Use the mo
 
 | Role | Use for | Example |
 |------|---------|---------|
-| `:class:` | Classes | `:class:`Buffer`` |
-| `:func:` | Functions | `:func:`launch`` |
-| `:meth:` | Methods | `:meth:`Device.create_stream`` |
-| `:attr:` | Attributes | `:attr:`device_id`` |
-| `:mod:` | Modules | `:mod:`multiprocessing`` |
-| `:obj:` | Type aliases, other objects | `:obj:`DevicePointerT`` |
-
-The `~` prefix displays only the final component: `:class:`~cuda.core.Buffer`` renders as "Buffer" while still linking to the full path.
+| `:class:` | Classes | `` :class:`Buffer` `` |
+| `:func:` | Functions | `` :func:`launch` `` |
+| `:meth:` | Methods | `` :meth:`Device.create_stream` `` |
+| `:attr:` | Attributes | `` :attr:`device_id` `` |
+| `:mod:` | Modules | `` :mod:`multiprocessing` `` |
+| `:obj:` | Type aliases, other objects | `` :obj:`DevicePointerT` `` |
+
+The `~` prefix displays only the final component: `` :class:`~cuda.core.Buffer` `` renders as "Buffer" while still linking to the full path.
 
 For more details, see the [Sphinx Python domain documentation](https://www.sphinx-doc.org/en/master/usage/domains/python.html#cross-referencing-python-objects).
 

From 98a0a648baf0ea27f34e09b155947218f51600b0 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Thu, 8 Jan 2026 14:39:15 -0800
Subject: [PATCH 15/17] Convert developer guide to RST and integrate with
 Sphinx docs

- Convert developer-guide.md to developer-guide.rst using pandoc
- Move to docs/source/ for Sphinx integration
- Add to toctree in index.rst
- Add SPDX header
- Delete original markdown file
---
 cuda_core/docs/developer-guide.md         |  973 ----------------
 cuda_core/docs/source/developer-guide.rst | 1258 +++++++++++++++++++++
 cuda_core/docs/source/index.rst           |    1 +
 3 files changed, 1259 insertions(+), 973 deletions(-)
 delete mode 100644 cuda_core/docs/developer-guide.md
 create mode 100644 cuda_core/docs/source/developer-guide.rst

diff --git a/cuda_core/docs/developer-guide.md b/cuda_core/docs/developer-guide.md
deleted file mode 100644
index 61dcac8878..0000000000
--- a/cuda_core/docs/developer-guide.md
+++ /dev/null
@@ -1,973 +0,0 @@
-# CUDA Core Developer Guide
-
-This guide defines conventions for Python and Cython code in `cuda/core`.
-
-**This project follows [PEP 8](https://peps.python.org/pep-0008/) as the base style guide and [PEP 257](https://peps.python.org/pep-0257/) for docstring conventions.** The guidance in this document extends these with project-specific patterns, particularly for Cython code and the structure of this codebase. Standard conventions are not repeated here.
-
-## Table of Contents
-
-1. [File Structure](#file-structure)
-2. [Package Layout](#package-layout)
-3. [Import Statements](#import-statements)
-4. [Class and Function Definitions](#class-and-function-definitions)
-5. [Naming Conventions](#naming-conventions)
-6. [Type Annotations and Declarations](#type-annotations-and-declarations)
-7. [Docstrings](#docstrings)
-8. [Errors and Warnings](#errors-and-warnings)
-9. [CUDA-Specific Patterns](#cuda-specific-patterns)
-10. [Development Lifecycle](#development-lifecycle)
-
----
-
-## File Structure
-
-The goal is **readability and maintainability**. A well-organized file lets readers quickly find what they're looking for and understand how the pieces fit together.
-
-To support this, we suggest organizing content from most important to least important: principal classes first, then supporting classes, then implementation details. This way, readers can start at the top and immediately see what matters most. Unlike C/C++ where definitions must precede uses, Python imposes no such constraint—we're free to optimize for the reader.
-
-These are guidelines, not rules. Place helper functions near their call sites if that's clearer. Group related code together if it aids understanding. When in doubt, choose whatever makes the code easiest to read and maintain.
-
-The following is a suggested file organization:
-
-### 1. SPDX Copyright Header
-
-Every file begins with an SPDX copyright header. The pre-commit hook adds or updates these automatically.
-
-### 2. Module Docstring (Optional)
-
-If present, the module docstring comes immediately after the copyright header, before any imports. Per PEP 257, this is the standard location for module-level documentation.
-
-### 3. Import Statements
-
-Imports come next. See [Import Statements](#import-statements) for ordering conventions.
-
-### 4. `__all__` Declaration (Optional)
-
-If present, `__all__` specifies symbols included in star imports.
-
-```python
-__all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
-```
-
-### 5. Type Aliases and Constants (Optional)
-
-Type aliases and module-level constants, if any, come next.
-
-```python
-DevicePointerT = driver.CUdeviceptr | int | None
-"""Type union for device pointer representations."""
-
-LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
-```
-
-### 6. Principal Class or Function
-
-If the file centers on a single class or function (e.g., `_buffer.pyx` defines `Buffer`, `_device.pyx` defines `Device`), that principal element comes first among the definitions.
-
-### 7. Other Public Classes and Functions
-
-Other public classes and functions follow. These might include auxiliary classes (e.g., `DeviceMemoryResourceOptions`), abstract base classes, or additional exports. Organize them logically—by related functionality or typical usage.
-
-### 8. Public Module Functions
-
-Public module-level functions come after classes.
-
-### 9. Private and Implementation Details
-
-Finally, private functions and implementation details: functions prefixed with `_`, `cdef inline` helpers, and any specialized code that would distract from the principal content.
-
-### Example Structure
-
-```python
-# <SPDX copyright header>
-"""Module for buffer and memory resource management."""
-
-from libc.stdint cimport uintptr_t
-from cuda.core._memory._device_memory_resource cimport DeviceMemoryResource
-import abc
-
-__all__ = ['Buffer', 'MemoryResource', 'some_public_function']
-
-DevicePointerT = driver.CUdeviceptr | int | None
-"""Type union for device pointer representations."""
-
-cdef class Buffer:
-    """Principal class for this module."""
-    # ...
-
-cdef class MemoryResource:
-    """Abstract base class."""
-    # ...
-
-def some_public_function():
-    """Public API function."""
-    # ...
-
-cdef inline void Buffer_close(Buffer self, stream):
-    """Private implementation helper."""
-    # ...
-```
-
-### Notes
-
-- Not every file will have all sections. For example, a utility module may not have a principal class.
-- The distinction between "principal" and "other" classes is based on the file's primary purpose. If a file exists primarily to define one class, that class is the principal class.
-- Private implementation functions should be placed at the end of the file to keep the public API visible at the top.
-- **Within each section**, prefer logical ordering (e.g., by functionality or typical usage). Alphabetical ordering is a reasonable fallback when no clear logical structure exists.
-
-## Package Layout
-
-### File Types
-
-The `cuda/core` package uses three types of files:
-
-1. **`.pyx` files**: Cython implementation files containing the actual code
-2. **`.pxd` files**: Cython declaration files containing type definitions and function signatures for C-level access
-3. **`.py` files**: Pure Python files for utilities and high-level interfaces
-
-### File Naming Conventions
-
-- **Implementation files**: Use `.pyx` for Cython code, `.py` for pure Python code
-- **Declaration files**: Use `.pxd` for Cython type declarations
-- **Private modules**: Prefix with underscore (e.g., `_buffer.pyx`, `_device.pyx`)
-- **Public modules**: No underscore prefix (e.g., `utils.py`)
-
-### Relationship Between `.pxd` and `.pyx` Files
-
-For each `.pyx` file that defines classes or functions used by other Cython modules, create a corresponding `.pxd` file:
-
-- **`.pxd` file**: Contains `cdef` class declarations, `cdef`/`cpdef` function signatures, and `cdef` attribute declarations
-- **`.pyx` file**: Contains the full implementation including Python methods, docstrings, and implementation details
-
-**Example:**
-
-`_buffer.pxd`:
-```python
-cdef class Buffer:
-    cdef:
-        uintptr_t      _ptr
-        size_t         _size
-        MemoryResource _memory_resource
-        object         _ipc_data
-```
-
-`_buffer.pyx`:
-```python
-cdef class Buffer:
-    """Full implementation with methods and docstrings."""
-
-    def close(self, stream=None):
-        """Implementation here."""
-        # ...
-```
-
-### Module Organization
-
-#### Simple Top-Level Modules
-
-For simple modules at the `cuda/core` level, define classes and functions directly in the module file with an `__all__` list:
-
-```python
-# _device.pyx
-__all__ = ['Device', 'DeviceProperties']
-
-cdef class Device:
-    # ...
-
-cdef class DeviceProperties:
-    # ...
-```
-
-#### Complex Subpackages
-
-For complex subpackages that require extra structure (like `_memory/`), use the following pattern:
-
-1. **Private submodules**: Each component is implemented in a private submodule (e.g., `_buffer.pyx`, `_device_memory_resource.pyx`)
-2. **Submodule `__all__`**: Each submodule defines its own `__all__` list
-3. **Subpackage `__init__.py`**: The subpackage `__init__.py` uses `from ._module import *` to assemble the package
-
-**Example structure for `_memory/` subpackage:**
-
-`_memory/_buffer.pyx`:
-```python
-__all__ = ['Buffer', 'MemoryResource']
-
-cdef class Buffer:
-    # ...
-
-cdef class MemoryResource:
-    # ...
-```
-
-`_memory/_device_memory_resource.pyx`:
-```python
-__all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
-
-cdef class DeviceMemoryResourceOptions:
-    # ...
-
-cdef class DeviceMemoryResource:
-    # ...
-```
-
-`_memory/__init__.py`:
-```python
-from ._buffer import *  # noqa: F403
-from ._device_memory_resource import *  # noqa: F403
-from ._graph_memory_resource import *  # noqa: F403
-from ._ipc import *  # noqa: F403
-from ._legacy import *  # noqa: F403
-from ._virtual_memory_resource import *  # noqa: F403
-```
-
-This pattern allows:
-- **Modular organization**: Each component lives in its own file
-- **Clear star-import behavior**: Each submodule explicitly defines what it exports via `__all__`
-- **Clean package interface**: The subpackage `__init__.py` assembles all exports into a single namespace
-- **Easier refactoring**: Components can be moved or reorganized without changing the public API
-
-**Migration guidance**: Simple top-level modules can be migrated to this subpackage structure when they become sufficiently complex (e.g., when a module grows to multiple related classes or when logical grouping would improve maintainability).
-
-### Guidelines
-
-1. **Always create `.pxd` files for shared Cython types**: If a class or function is `cimport`ed by other modules, provide a `.pxd` declaration file.
-
-2. **Keep `.pxd` files minimal**: Only include declarations needed for Cython compilation. Omit implementation details, docstrings, and Python-only code.
-
-3. **Use `__all__` when helpful**: Define `__all__` to control exported symbols when it simplifies or clarifies the module structure.
-
-4. **Use `from ._module import *` in subpackage `__init__.py`**: This pattern assembles the subpackage API from its submodules. Use `# noqa: F403` to suppress linting warnings about wildcard imports.
-
-5. **Migrate to subpackage structure when complex**: When a top-level module becomes complex (multiple related classes, logical grouping needed), consider refactoring to the subpackage pattern.
-
-6. **Separate concerns**: Use `.py` files for pure Python utilities, `.pyx` files for Cython implementations that need C-level performance.
-
-## Import Statements
-
-Import statements must be organized into five groups, in the following order.
-
-**Note**: Within each group, imports must be sorted alphabetically. This is enforced by pre-commit linters (`ruff`).
-
-### 1. `__future__` Imports
-
-`__future__` imports must come first, before all other imports.
-
-
-```python
-from __future__ import annotations
-```
-
-### 2. External `cimport` Statements
-
-External Cython imports from standard libraries and third-party packages. This includes:
-
-- `libc.*` (e.g., `libc.stdint`, `libc.stdlib`, `libc.string`)
-- `cpython`
-- `cython`
-- `cuda.bindings` (CUDA bindings package)
-
-```python
-cimport cpython
-from libc.stdint cimport uintptr_t
-from libc.stdlib cimport malloc, free
-from cuda.bindings cimport cydriver
-```
-
-### 3. cuda-core `cimport` Statements
-
-Cython imports from within the `cuda.core` package.
-
-```python
-from cuda.core._memory._buffer cimport Buffer, MemoryResource
-from cuda.core._stream cimport Stream_accept, Stream
-from cuda.core._utils.cuda_utils cimport (
-    HANDLE_RETURN,
-    check_or_create_options,
-)
-```
-
-### 4. External `import` Statements
-
-Regular Python imports from standard libraries and third-party packages. This includes:
-
-- Standard library modules (e.g., `abc`, `typing`, `threading`, `dataclasses`)
-- Third-party packages
-
-```python
-import abc
-import threading
-from dataclasses import dataclass
-```
-
-### 5. cuda-core `import` Statements
-
-Regular Python imports from within the `cuda.core` package.
-
-```python
-from cuda.core._context import Context, ContextOptions
-from cuda.core._dlpack import DLDeviceType, make_py_capsule
-from cuda.core._utils.cuda_utils import (
-    CUDAError,
-    driver,
-    handle_return,
-)
-```
-
-### Additional Rules
-
-1. **Alphabetical Ordering**: Within each group, imports must be sorted alphabetically by module name. This is enforced by pre-commit linters.
-
-2. **Multi-line Imports**: When importing multiple items from a single module, use parentheses for multi-line formatting:
-   ```python
-   from cuda.core._utils.cuda_utils cimport (
-       HANDLE_RETURN,
-       check_or_create_options,
-   )
-   ```
-
-3. **Type-only imports**: With `from __future__ import annotations`, types can be imported normally even if only used in annotations. Avoid `TYPE_CHECKING` blocks (see [Type Annotations and Declarations](#type-annotations-and-declarations) for details).
-
-4. **Blank Lines**: Use blank lines to separate the five import groups. Do not use blank lines within a group unless using multi-line import formatting.
-
-5. **`try/except` Blocks**: Import fallbacks (e.g., for optional dependencies) should be placed in the appropriate group (external or cuda-core) using `try/except` blocks.
-
-### Example
-
-```python
-# <SPDX copyright header>
-
-from __future__ import annotations
-
-cimport cpython
-from libc.stdint cimport uintptr_t
-from libc.stdlib cimport malloc, free
-from cuda.bindings cimport cydriver
-
-from cuda.core._memory._buffer cimport Buffer, MemoryResource
-from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
-
-import abc
-from dataclasses import dataclass
-
-from cuda.core._context import Context
-from cuda.core._device import Device
-from cuda.core._utils.cuda_utils import driver
-```
-
-## Class and Function Definitions
-
-### Class Definition Order
-
-Within a class definition, the suggested organization is:
-
-1. **Special (dunder) methods**: Methods with names starting and ending with double underscores. By convention, `__init__` (or `__cinit__` in Cython) should be first among dunder methods, as it defines the class interface.
-
-2. **Methods**: Regular instance methods, class methods (`@classmethod`), and static methods (`@staticmethod`)
-
-3. **Properties**: Properties defined with `@property` decorator
-
-**Note**: Within each section, prefer logical ordering (e.g., grouping related methods). Alphabetical ordering is acceptable when no clear logical structure exists. Developers should use their judgment.
-
-### Example
-
-```python
-cdef class Buffer:
-    """Example class demonstrating the ordering."""
-
-    # 1. Special (dunder) methods (__cinit__/__init__ first by convention)
-    def __cinit__(self):
-        """Cython initialization."""
-        # ...
-
-    def __init__(self, *args, **kwargs):
-        """Python initialization."""
-        # ...
-
-    def __buffer__(self, flags: int, /) -> memoryview:
-        """Buffer protocol support."""
-        # ...
-
-    def __dealloc__(self):
-        """Cleanup."""
-        # ...
-
-    def __dlpack__(self, *, stream=None):
-        """DLPack protocol support."""
-        # ...
-
-    def __reduce__(self):
-        """Pickle support."""
-        # ...
-
-    # 2. Methods
-    def close(self, stream=None):
-        """Close the buffer."""
-        # ...
-
-    def copy_from(self, src, *, stream):
-        """Copy data from source buffer."""
-        # ...
-
-    def copy_to(self, dst=None, *, stream):
-        """Copy data to destination buffer."""
-        # ...
-
-    @classmethod
-    def from_handle(cls, ptr, size, mr=None):
-        """Create buffer from handle."""
-        # ...
-
-    def get_ipc_descriptor(self):
-        """Get IPC descriptor."""
-        # ...
-
-    # 3. Properties
-    @property
-    def device_id(self) -> int:
-        """Device ID property."""
-        # ...
-
-    @property
-    def handle(self):
-        """Handle property."""
-        # ...
-
-    @property
-    def size(self) -> int:
-        """Size property."""
-        # ...
-```
-
-### Helper Functions
-
-When a class grows long or a method becomes deeply nested, consider extracting implementation details into helper functions. The goal is to keep class definitions easy to navigate—readers shouldn't have to scroll through hundreds of lines to understand a class's interface.
-
-In Cython files, helpers are typically `cdef` or `cdef inline` functions named with the pattern `ClassName_methodname` (e.g., `DMR_close`, `Buffer_close`). Place them at the end of the file or near their call sites, whichever aids readability.
-
-**Example:**
-
-```python
-cdef class DeviceMemoryResource:
-    def close(self):
-        """Close the memory resource."""
-        DMR_close(self)
-
-# Helper function (at end of file or nearby)
-cdef inline DMR_close(DeviceMemoryResource self):
-    if self._handle == NULL:
-        return
-    # ... implementation ...
-```
-
-### Function Definitions
-
-For module-level functions (outside of classes), follow the ordering specified in [File Structure](#file-structure): principal functions first (if applicable), then other public functions, then private functions. Within each group, prefer logical ordering; alphabetical ordering is a reasonable fallback.
-
-## Naming Conventions
-
-Follow PEP 8 naming conventions (CamelCase for classes, snake_case for functions/variables, UPPER_SNAKE_CASE for constants, leading underscore for private names).
-
-### Cython `cdef` Variables
-
-Consider prefixing `cdef` variables with `c_` to distinguish them from Python variables. This improves code readability by making it clear which variables are C-level types.
-
-**Preferred:**
-```python
-def copy_to(self, dst: Buffer = None, *, stream: Stream | GraphBuilder) -> Buffer:
-    stream = Stream_accept(stream)
-    cdef size_t c_src_size = self._size
-
-    if dst is None:
-        dst = self._memory_resource.allocate(c_src_size, stream)
-
-    cdef size_t c_dst_size = dst._size
-    if c_dst_size != c_src_size:
-        raise ValueError(f"buffer sizes mismatch: src={c_src_size}, dst={c_dst_size}")
-    # ...
-```
-
-**Also acceptable (if context is clear):**
-```python
-cdef cydriver.CUdevice get_device_from_ctx(
-        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
-    cdef bint switch_context = (curr_ctx != target_ctx)
-    cdef cydriver.CUcontext ctx
-    cdef cydriver.CUdevice target_dev
-    # ...
-```
-
-The `c_` prefix is particularly helpful when mixing Python and Cython variables in the same scope, or when the variable name would otherwise be ambiguous.
-
-## Type Annotations and Declarations
-
-### Python Type Annotations
-
-#### PEP 604 Union Syntax
-
-Use the modern [PEP 604](https://peps.python.org/pep-0604/) union syntax (`X | Y`) instead of `typing.Union` or `typing.Optional`.
-
-**Preferred:**
-```python
-def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
-    # ...
-
-def close(self, stream: Stream | None = None):
-    # ...
-```
-
-**Avoid:**
-```python
-from typing import Optional, Union
-
-def allocate(self, size_t size, stream: Optional[Union[Stream, GraphBuilder]] = None) -> Buffer:
-    # ...
-
-def close(self, stream: Optional[Stream] = None):
-    # ...
-```
-
-#### Forward References and `from __future__ import annotations`
-
-Where needed, files should include `from __future__ import annotations` at the top (after the SPDX header). This enables:
-
-1. **Forward references**: Type annotations can reference types that are defined later in the file or in other modules without requiring `TYPE_CHECKING` blocks.
-
-2. **Cleaner syntax**: Annotations are evaluated as strings, avoiding circular import issues.
-
-**Preferred:**
-```python
-from __future__ import annotations
-
-# Can reference Stream even if it's defined later or in another module
-def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
-    # ...
-```
-
-**Avoid:**
-```python
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from cuda.core._stream import Stream
-
-def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
-    # ...
-```
-
-#### Guidelines
-
-1. **Use `from __future__ import annotations`**: This should be present in all `.py` and `.pyx` files with type annotations.
-
-2. **Use `|` for unions**: Prefer `X | Y | None` over `Union[X, Y]` or `Optional[X]`.
-
-3. **Avoid `TYPE_CHECKING` blocks**: With `from __future__ import annotations`, forward references work without `TYPE_CHECKING` guards.
-
-4. **Import types normally**: Even if a type is only used in annotations, import it normally (not in a `TYPE_CHECKING` block).
-
-### Cython Type Declarations
-
-Cython uses `cdef` declarations for C-level types. These follow different rules:
-
-```python
-cdef class Buffer:
-    cdef:
-        uintptr_t _ptr
-        size_t _size
-        MemoryResource _memory_resource
-```
-
-For Cython-specific type declarations, see [Cython-Specific Features](#cython-specific-features).
-
-## Docstrings
-
-This project uses the **NumPy docstring style** for all documentation. This format is well-suited for scientific and technical libraries and integrates well with Sphinx documentation generation.
-
-### Format Overview
-
-Docstrings use triple double-quotes (`"""`) and follow this general structure:
-
-```python
-"""Summary line.
-
-Extended description (optional).
-
-Parameters
-----------
-param1 : type
-    Description of param1.
-param2 : type, optional
-    Description of param2. Default is value.
-
-Returns
--------
-return_type
-    Description of return value.
-
-Raises
-------
-ExceptionType
-    Description of when this exception is raised.
-
-Notes
------
-Additional notes and implementation details.
-
-Examples
---------
->>> example_code()
-result
-"""
-```
-
-### Module Docstrings
-
-Per PEP 257, module docstrings appear at the top of the file, immediately after the copyright header and before any imports. They provide a brief overview of the module's purpose.
-
-```python
-# <SPDX copyright header>
-"""Module for managing CUDA device memory resources.
-
-This module provides classes and functions for allocating and managing
-device memory using CUDA's stream-ordered memory pool API.
-"""
-
-from __future__ import annotations
-# ... imports ...
-```
-
-For simple utility modules, a single-line docstring may suffice:
-
-```python
-"""Utility functions for CUDA error handling."""
-```
-
-### Class Docstrings
-
-Class docstrings should include:
-
-1. **Summary line**: A one-line description of the class
-2. **Extended description** (optional): Additional context about the class
-3. **Parameters section**: If the class is callable (has `__init__`), document constructor parameters
-4. **Attributes section**: Document public attributes (if any)
-5. **Notes section**: Important usage notes, implementation details, or examples
-6. **Examples section**: Usage examples (if helpful)
-
-**Example:**
-
-```python
-cdef class DeviceMemoryResource(MemoryResource):
-    """
-    A device memory resource managing a stream-ordered memory pool.
-
-    Parameters
-    ----------
-    device_id : :class:`Device` | int
-        Device or device ordinal for which a memory resource is constructed.
-    options : :class:`DeviceMemoryResourceOptions`, optional
-        Memory resource creation options. If None, uses the driver's current
-        or default memory pool for the specified device.
-
-    Attributes
-    ----------
-    device_id : int
-        The device ID associated with this memory resource.
-    is_ipc_enabled : bool
-        Whether this memory resource supports IPC.
-
-    Notes
-    -----
-    To create an IPC-enabled memory resource, specify ``ipc_enabled=True``
-    in the options. IPC-enabled resources can share allocations between
-    processes.
-
-    Examples
-    --------
-    >>> dmr = DeviceMemoryResource(0)
-    >>> buffer = dmr.allocate(1024)
-    """
-```
-
-For simple classes, a brief docstring may be sufficient:
-
-```python
-@dataclass
-cdef class DeviceMemoryResourceOptions:
-    """Customizable DeviceMemoryResource options.
-
-    Attributes
-    ----------
-    ipc_enabled : bool, optional
-        Whether to create an IPC-enabled memory pool. Default is False.
-    max_size : int, optional
-        Maximum pool size. Default is 0 (system-dependent).
-    """
-```
-
-### Method and Function Docstrings
-
-Method and function docstrings should include:
-
-1. **Summary line**: A one-line description starting with a verb (e.g., "Allocate", "Return", "Create")
-2. **Extended description** (optional): Additional details about behavior
-3. **Parameters section**: All parameters with types and descriptions
-4. **Returns section**: Return type and description
-5. **Raises section**: Exceptions that may be raised (if any)
-6. **Notes section**: Important implementation details or usage notes (if needed)
-7. **Examples section**: Usage examples (if helpful)
-
-**Example:**
-
-```python
-def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
-    """Allocate a buffer of the requested size.
-
-    Parameters
-    ----------
-    size : int
-        The size of the buffer to allocate, in bytes.
-    stream : :class:`Stream` | :class:`GraphBuilder`, optional
-        The stream on which to perform the allocation asynchronously.
-        If None, an internal stream is used.
-
-    Returns
-    -------
-    :class:`Buffer`
-        The allocated buffer object, which is accessible on the device
-        that this memory resource was created for.
-
-    Raises
-    ------
-    TypeError
-        If called on a mapped IPC-enabled memory resource.
-    RuntimeError
-        If allocation fails.
-
-    Notes
-    -----
-    The allocated buffer is associated with this memory resource and will
-    be deallocated when the buffer is closed or when this resource is closed.
-    """
-```
-
-For simple functions, a brief docstring may suffice:
-
-```python
-def get_ipc_descriptor(self) -> IPCBufferDescriptor:
-    """Export a :class:`Buffer` for sharing between processes."""
-```
-
-### Property Docstrings
-
-Property docstrings should be concise and focus on what the property represents. For read-write properties, document both getter and setter behavior.
-
-**Read-only property:**
-
-```python
-@property
-def device_id(self) -> int:
-    """Return the device ordinal of this buffer."""
-```
-
-**Read-write property:**
-
-```python
-@property
-def peer_accessible_by(self):
-    """
-    Get or set the devices that can access allocations from this memory pool.
-
-    Returns
-    -------
-    tuple of int
-        A tuple of sorted device IDs that currently have peer access to
-        allocations from this memory pool.
-
-    Notes
-    -----
-    When setting, accepts a sequence of :class:`Device` objects or device IDs.
-    Setting to an empty sequence revokes all peer access.
-
-    Examples
-    --------
-    >>> dmr.peer_accessible_by = [1]  # Grant access to device 1
-    >>> assert dmr.peer_accessible_by == (1,)
-    """
-```
-
-### Type References in Docstrings
-
-Use Sphinx cross-reference roles to link to other documented objects. Use the most specific role for each type:
-
-| Role | Use for | Example |
-|------|---------|---------|
-| `:class:` | Classes | `` :class:`Buffer` `` |
-| `:func:` | Functions | `` :func:`launch` `` |
-| `:meth:` | Methods | `` :meth:`Device.create_stream` `` |
-| `:attr:` | Attributes | `` :attr:`device_id` `` |
-| `:mod:` | Modules | `` :mod:`multiprocessing` `` |
-| `:obj:` | Type aliases, other objects | `` :obj:`DevicePointerT` `` |
-
-The `~` prefix displays only the final component: `` :class:`~cuda.core.Buffer` `` renders as "Buffer" while still linking to the full path.
-
-For more details, see the [Sphinx Python domain documentation](https://www.sphinx-doc.org/en/master/usage/domains/python.html#cross-referencing-python-objects).
-
-**Example:**
-
-```python
-def from_handle(
-    ptr: DevicePointerT, size_t size, mr: MemoryResource | None = None
-) -> Buffer:
-    """Create a new :class:`Buffer` from a pointer.
-
-    Parameters
-    ----------
-    ptr : :obj:`DevicePointerT`
-        Allocated buffer handle object.
-    size : int
-        Memory size of the buffer.
-    mr : :class:`MemoryResource`, optional
-        Memory resource associated with the buffer.
-    """
-```
-
-### Guidelines
-
-1. **Always include docstrings**: All public classes, methods, functions, and properties should have docstrings.
-
-2. **Start with a verb**: Summary lines for methods and functions should start with a verb in imperative mood (e.g., "Allocate", "Return", "Create", not "Allocates", "Returns", "Creates").
-
-3. **Be concise but complete**: Provide enough information for users to understand and use the API, but avoid unnecessary verbosity.
-
-4. **Use proper sections**: Include Parameters, Returns, Raises sections when applicable. Use Notes and Examples sections when they add value.
-
-5. **Document optional parameters**: Clearly indicate optional parameters and their default values.
-
-6. **Use type hints**: Type information in docstrings should complement (not duplicate) type annotations. Use docstrings to provide additional context about types.
-
-7. **Cross-reference related APIs**: Use Sphinx cross-references to link to related classes, methods, and attributes.
-
-8. **Keep private methods brief**: Private methods (starting with `_`) may have minimal docstrings, but should still document non-obvious behavior.
-
-9. **Update docstrings with code changes**: Keep docstrings synchronized with implementation changes.
-
-## Errors and Warnings
-
-### CUDA Exceptions
-
-The project defines custom exceptions for CUDA-specific errors:
-
-- **`CUDAError`**: Base exception for CUDA driver errors
-- **`NVRTCError`**: Exception for NVRTC compiler errors (inherits from `CUDAError`)
-
-Use these instead of generic exceptions when reporting CUDA failures.
-
-### CUDA API Error Handling
-
-In `nogil` contexts, use the `HANDLE_RETURN` macro:
-
-```python
-with nogil:
-    HANDLE_RETURN(cydriver.cuMemAlloc(ptr, size))
-```
-
-At the Python level, use `handle_return()` or `raise_if_driver_error()`:
-
-```python
-err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
-handle_return((err,))
-```
-
-### Warnings
-
-When emitting warnings, always specify `stacklevel` so the warning points to the caller:
-
-```python
-warnings.warn(message, UserWarning, stacklevel=3)
-```
-
-The value depends on call depth—typically `stacklevel=2` for direct calls, `stacklevel=3` when called through a helper.
-
-## CUDA-Specific Patterns
-
-### GIL Management for CUDA Driver API Calls
-
-For optimized Cython code, release the GIL when calling CUDA driver APIs. This improves performance and allows other Python threads to run during CUDA operations.
-
-During initial development, it's fine to use the Python `driver` module without releasing the GIL (see [Development Lifecycle](#development-lifecycle)). GIL release is a performance optimization that can be applied once the implementation is correct.
-
-#### Using `with nogil:` Blocks
-
-Wrap `cydriver` calls in `with nogil:` blocks (or declare entire functions as `nogil`):
-
-```python
-cdef int value
-with nogil:
-    HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, attr, device_id))
-```
-
-Group multiple driver calls in a single block:
-
-```python
-cdef int low, high
-with nogil:
-    HANDLE_RETURN(cydriver.cuCtxGetStreamPriorityRange(&low, &high))
-```
-
-#### Raising Exceptions from `nogil` Context
-
-To raise exceptions from a `nogil` context, acquire the GIL first:
-
-```python
-with gil:
-    raise CUDAError(f"CUDA operation failed: {error}")
-```
-
-## Development Lifecycle
-
-### Two-Phase Development
-
-A common pattern when implementing CUDA functionality is to develop in two phases:
-
-1. **Start with Python**: Use the `driver` module for a straightforward implementation. Write tests to verify correctness. This allows faster iteration and easier debugging.
-
-2. **Optimize with Cython**: Once the implementation is correct, switch to `cydriver` with `nogil` blocks and `HANDLE_RETURN` for better performance.
-
-This approach separates correctness from optimization. Getting the logic right first—with Python's better error messages and stack traces—often saves time overall.
-
-### Python Implementation
-
-Use the `driver` module from `cuda.core._utils.cuda_utils`:
-
-```python
-from cuda.core._utils.cuda_utils import driver
-from cuda.core._utils.cuda_utils cimport (
-    _check_driver_error as raise_if_driver_error,
-)
-
-def get_attribute(self, attr: int) -> int:
-    err, value = driver.cuDeviceGetAttribute(attr, self._id)
-    raise_if_driver_error(err)
-    return value
-```
-
-### Cython Optimization
-
-When ready to optimize, switch to `cydriver`:
-
-```python
-from cuda.bindings cimport cydriver
-from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
-
-def get_attribute(self, attr: int) -> int:
-    cdef int value
-    with nogil:
-        HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, attr, self._id))
-    return value
-```
-
-Key changes:
-- Replace `driver` with `cydriver`
-- Wrap calls in `with nogil:`
-- Use `HANDLE_RETURN` instead of `raise_if_driver_error`
-
-Run tests after optimization to verify behavior is unchanged.
diff --git a/cuda_core/docs/source/developer-guide.rst b/cuda_core/docs/source/developer-guide.rst
new file mode 100644
index 0000000000..4b6d851cf8
--- /dev/null
+++ b/cuda_core/docs/source/developer-guide.rst
@@ -0,0 +1,1258 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+CUDA Core Developer Guide
+=========================
+
+This guide defines conventions for Python and Cython code in
+``cuda/core``.
+
+**This project follows** `PEP 8 <https://peps.python.org/pep-0008/>`__
+**as the base style guide and** `PEP
+257 <https://peps.python.org/pep-0257/>`__ **for docstring
+conventions.** The guidance in this document extends these with
+project-specific patterns, particularly for Cython code and the
+structure of this codebase. Standard conventions are not repeated here.
+
+Table of Contents
+-----------------
+
+1.  `File Structure <#file-structure>`__
+2.  `Package Layout <#package-layout>`__
+3.  `Import Statements <#import-statements>`__
+4.  `Class and Function Definitions <#class-and-function-definitions>`__
+5.  `Naming Conventions <#naming-conventions>`__
+6.  `Type Annotations and
+    Declarations <#type-annotations-and-declarations>`__
+7.  `Docstrings <#docstrings>`__
+8.  `Errors and Warnings <#errors-and-warnings>`__
+9.  `CUDA-Specific Patterns <#cuda-specific-patterns>`__
+10. `Development Lifecycle <#development-lifecycle>`__
+
+--------------
+
+File Structure
+--------------
+
+The goal is **readability and maintainability**. A well-organized file
+lets readers quickly find what they're looking for and understand how
+the pieces fit together.
+
+To support this, we suggest organizing content from most important to
+least important: principal classes first, then supporting classes, then
+implementation details. This way, readers can start at the top and
+immediately see what matters most. Unlike C/C++ where definitions must
+precede uses, Python imposes no such constraint—we're free to optimize
+for the reader.
+
+These are guidelines, not rules. Place helper functions near their call
+sites if that's clearer. Group related code together if it aids
+understanding. When in doubt, choose whatever makes the code easiest to
+read and maintain.
+
+The following is a suggested file organization:
+
+.. _1-spdx-copyright-header:
+
+1. SPDX Copyright Header
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Every file begins with an SPDX copyright header. The pre-commit hook
+adds or updates these automatically.
+
+.. _2-module-docstring-optional:
+
+2. Module Docstring (Optional)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If present, the module docstring comes immediately after the copyright
+header, before any imports. Per PEP 257, this is the standard location
+for module-level documentation.
+
+.. _3-import-statements:
+
+3. Import Statements
+~~~~~~~~~~~~~~~~~~~~
+
+Imports come next. See `Import Statements <#import-statements>`__ for
+ordering conventions.
+
+.. _4-__all__-declaration-optional:
+
+4. ``__all__`` Declaration (Optional)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If present, ``__all__`` specifies symbols included in star imports.
+
+.. code:: python
+
+   __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
+
+.. _5-type-aliases-and-constants-optional:
+
+5. Type Aliases and Constants (Optional)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Type aliases and module-level constants, if any, come next.
+
+.. code:: python
+
+   DevicePointerT = driver.CUdeviceptr | int | None
+   """Type union for device pointer representations."""
+
+   LEGACY_DEFAULT_STREAM = C_LEGACY_DEFAULT_STREAM
+
+.. _6-principal-class-or-function:
+
+6. Principal Class or Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the file centers on a single class or function (e.g., ``_buffer.pyx``
+defines ``Buffer``, ``_device.pyx`` defines ``Device``), that principal
+element comes first among the definitions.
+
+.. _7-other-public-classes-and-functions:
+
+7. Other Public Classes and Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Other public classes and functions follow. These might include auxiliary
+classes (e.g., ``DeviceMemoryResourceOptions``), abstract base classes,
+or additional exports. Organize them logically—by related functionality
+or typical usage.
+
+.. _8-public-module-functions:
+
+8. Public Module Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Public module-level functions come after classes.
+
+.. _9-private-and-implementation-details:
+
+9. Private and Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Finally, private functions and implementation details: functions
+prefixed with ``_``, ``cdef inline`` helpers, and any specialized code
+that would distract from the principal content.
+
+Example Structure
+~~~~~~~~~~~~~~~~~
+
+.. code:: python
+
+   # <SPDX copyright header>
+   """Module for buffer and memory resource management."""
+
+   from libc.stdint cimport uintptr_t
+   from cuda.core._memory._device_memory_resource cimport DeviceMemoryResource
+   import abc
+
+   __all__ = ['Buffer', 'MemoryResource', 'some_public_function']
+
+   DevicePointerT = driver.CUdeviceptr | int | None
+   """Type union for device pointer representations."""
+
+   cdef class Buffer:
+       """Principal class for this module."""
+       # ...
+
+   cdef class MemoryResource:
+       """Abstract base class."""
+       # ...
+
+   def some_public_function():
+       """Public API function."""
+       # ...
+
+   cdef inline void Buffer_close(Buffer self, stream):
+       """Private implementation helper."""
+       # ...
+
+Notes
+~~~~~
+
+- Not every file will have all sections. For example, a utility module
+  may not have a principal class.
+- The distinction between "principal" and "other" classes is based on
+  the file's primary purpose. If a file exists primarily to define one
+  class, that class is the principal class.
+- Private implementation functions should be placed at the end of the
+  file to keep the public API visible at the top.
+- **Within each section**, prefer logical ordering (e.g., by
+  functionality or typical usage). Alphabetical ordering is a reasonable
+  fallback when no clear logical structure exists.
+
+Package Layout
+--------------
+
+File Types
+~~~~~~~~~~
+
+The ``cuda/core`` package uses three types of files:
+
+1. ``.pyx`` **files**: Cython implementation files containing the actual
+   code
+2. ``.pxd`` **files**: Cython declaration files containing type
+   definitions and function signatures for C-level access
+3. ``.py`` **files**: Pure Python files for utilities and high-level
+   interfaces
+
+File Naming Conventions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+- **Implementation files**: Use ``.pyx`` for Cython code, ``.py`` for
+  pure Python code
+- **Declaration files**: Use ``.pxd`` for Cython type declarations
+- **Private modules**: Prefix with underscore (e.g., ``_buffer.pyx``,
+  ``_device.pyx``)
+- **Public modules**: No underscore prefix (e.g., ``utils.py``)
+
+.. _relationship-between-pxd-and-pyx-files:
+
+Relationship Between ``.pxd`` and ``.pyx`` Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For each ``.pyx`` file that defines classes or functions used by other
+Cython modules, create a corresponding ``.pxd`` file:
+
+- ``.pxd`` **file**: Contains ``cdef`` class declarations,
+  ``cdef``/``cpdef`` function signatures, and ``cdef`` attribute
+  declarations
+- ``.pyx`` **file**: Contains the full implementation including Python
+  methods, docstrings, and implementation details
+
+**Example:**
+
+``_buffer.pxd``:
+
+.. code:: python
+
+   cdef class Buffer:
+       cdef:
+           uintptr_t      _ptr
+           size_t         _size
+           MemoryResource _memory_resource
+           object         _ipc_data
+
+``_buffer.pyx``:
+
+.. code:: python
+
+   cdef class Buffer:
+       """Full implementation with methods and docstrings."""
+
+       def close(self, stream=None):
+           """Implementation here."""
+           # ...
+
+Module Organization
+~~~~~~~~~~~~~~~~~~~
+
+Simple Top-Level Modules
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+For simple modules at the ``cuda/core`` level, define classes and
+functions directly in the module file with an ``__all__`` list:
+
+.. code:: python
+
+   # _device.pyx
+   __all__ = ['Device', 'DeviceProperties']
+
+   cdef class Device:
+       # ...
+
+   cdef class DeviceProperties:
+       # ...
+
+Complex Subpackages
+^^^^^^^^^^^^^^^^^^^
+
+For complex subpackages that require extra structure (like
+``_memory/``), use the following pattern:
+
+1. **Private submodules**: Each component is implemented in a private
+   submodule (e.g., ``_buffer.pyx``, ``_device_memory_resource.pyx``)
+2. **Submodule** ``__all__``: Each submodule defines its own ``__all__``
+   list
+3. **Subpackage** ``__init__.py``: The subpackage ``__init__.py`` uses
+   ``from ._module import *`` to assemble the package
+
+**Example structure for the** ``_memory/`` **subpackage:**
+
+``_memory/_buffer.pyx``:
+
+.. code:: python
+
+   __all__ = ['Buffer', 'MemoryResource']
+
+   cdef class Buffer:
+       # ...
+
+   cdef class MemoryResource:
+       # ...
+
+``_memory/_device_memory_resource.pyx``:
+
+.. code:: python
+
+   __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
+
+   cdef class DeviceMemoryResourceOptions:
+       # ...
+
+   cdef class DeviceMemoryResource:
+       # ...
+
+``_memory/__init__.py``:
+
+.. code:: python
+
+   from ._buffer import *  # noqa: F403
+   from ._device_memory_resource import *  # noqa: F403
+   from ._graph_memory_resource import *  # noqa: F403
+   from ._ipc import *  # noqa: F403
+   from ._legacy import *  # noqa: F403
+   from ._virtual_memory_resource import *  # noqa: F403
+
+This pattern allows:
+
+- **Modular organization**: Each component lives in its own file
+- **Clear star-import behavior**: Each submodule explicitly defines what
+  it exports via ``__all__``
+- **Clean package interface**: The subpackage ``__init__.py`` assembles
+  all exports into a single namespace
+- **Easier refactoring**: Components can be moved or reorganized without
+  changing the public API
+
+**Migration guidance**: Simple top-level modules can be migrated to this
+subpackage structure when they become sufficiently complex (e.g., when a
+module grows to multiple related classes or when logical grouping would
+improve maintainability).
+
+Guidelines
+~~~~~~~~~~
+
+1. **Always create** ``.pxd`` **files for shared Cython types**: If a class
+   or function is ``cimport``\ ed by other modules, provide a ``.pxd``
+   declaration file.
+
+2. **Keep** ``.pxd`` **files minimal**: Only include declarations needed for
+   Cython compilation. Omit implementation details, docstrings, and
+   Python-only code.
+
+3. **Use** ``__all__`` **when helpful**: Define ``__all__`` to control
+   exported symbols when it simplifies or clarifies the module
+   structure.
+
+4. **Use** ``from ._module import *`` **in subpackage** ``__init__.py``:
+   This pattern assembles the subpackage API from its submodules. Use
+   ``# noqa: F403`` to suppress linting warnings about wildcard imports.
+
+5. **Migrate to subpackage structure when complex**: When a top-level
+   module becomes complex (multiple related classes, logical grouping
+   needed), consider refactoring to the subpackage pattern.
+
+6. **Separate concerns**: Use ``.py`` files for pure Python utilities,
+   ``.pyx`` files for Cython implementations that need C-level
+   performance.
+
+Import Statements
+-----------------
+
+Import statements must be organized into five groups, in the following
+order.
+
+**Note**: Within each group, imports must be sorted alphabetically. This
+is enforced by pre-commit linters (``ruff``).
+
+.. _1-__future__-imports:
+
+1. ``__future__`` Imports
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``__future__`` imports must come first, before all other imports.
+
+.. code:: python
+
+   from __future__ import annotations
+
+.. _2-external-cimport-statements:
+
+2. External ``cimport`` Statements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+External Cython imports from standard libraries and third-party
+packages. This includes:
+
+- ``libc.*`` (e.g., ``libc.stdint``, ``libc.stdlib``, ``libc.string``)
+- ``cpython``
+- ``cython``
+- ``cuda.bindings`` (CUDA bindings package)
+
+.. code:: python
+
+   cimport cpython
+   from cuda.bindings cimport cydriver
+   from libc.stdint cimport uintptr_t
+   from libc.stdlib cimport free, malloc
+
+.. _3-cuda-core-cimport-statements:
+
+3. cuda-core ``cimport`` Statements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cython imports from within the ``cuda.core`` package.
+
+.. code:: python
+
+   from cuda.core._memory._buffer cimport Buffer, MemoryResource
+   from cuda.core._stream cimport Stream, Stream_accept
+   from cuda.core._utils.cuda_utils cimport (
+       HANDLE_RETURN,
+       check_or_create_options,
+   )
+
+.. _4-external-import-statements:
+
+4. External ``import`` Statements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Regular Python imports from standard libraries and third-party packages.
+This includes:
+
+- Standard library modules (e.g., ``abc``, ``typing``, ``threading``,
+  ``dataclasses``)
+- Third-party packages
+
+.. code:: python
+
+   import abc
+   import threading
+   from dataclasses import dataclass
+
+.. _5-cuda-core-import-statements:
+
+5. cuda-core ``import`` Statements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Regular Python imports from within the ``cuda.core`` package.
+
+.. code:: python
+
+   from cuda.core._context import Context, ContextOptions
+   from cuda.core._dlpack import DLDeviceType, make_py_capsule
+   from cuda.core._utils.cuda_utils import (
+       CUDAError,
+       driver,
+       handle_return,
+   )
+
+Additional Rules
+~~~~~~~~~~~~~~~~
+
+1. **Alphabetical Ordering**: Within each group, imports must be sorted
+   alphabetically by module name. This is enforced by pre-commit
+   linters.
+
+2. **Multi-line Imports**: When importing multiple items from a single
+   module, use parentheses for multi-line formatting:
+
+   .. code:: python
+
+      from cuda.core._utils.cuda_utils cimport (
+          HANDLE_RETURN,
+          check_or_create_options,
+      )
+
+3. **Type-only imports**: With ``from __future__ import annotations``,
+   types can be imported normally even if only used in annotations.
+   Avoid ``TYPE_CHECKING`` blocks (see `Type Annotations and
+   Declarations <#type-annotations-and-declarations>`__ for details).
+
+4. **Blank Lines**: Use blank lines to separate the five import groups.
+   Do not use blank lines within a group unless using multi-line import
+   formatting.
+
+5. ``try/except`` **Blocks**: Import fallbacks (e.g., for optional
+   dependencies) should be placed in the appropriate group (external or
+   cuda-core) using ``try/except`` blocks.
+
+Example
+~~~~~~~
+
+.. code:: python
+
+   # <SPDX copyright header>
+
+   from __future__ import annotations
+
+   cimport cpython
+   from cuda.bindings cimport cydriver
+   from libc.stdint cimport uintptr_t
+   from libc.stdlib cimport free, malloc
+
+   from cuda.core._memory._buffer cimport Buffer, MemoryResource
+   from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
+
+   import abc
+   from dataclasses import dataclass
+
+   from cuda.core._context import Context
+   from cuda.core._device import Device
+   from cuda.core._utils.cuda_utils import driver
+
+Class and Function Definitions
+------------------------------
+
+Class Definition Order
+~~~~~~~~~~~~~~~~~~~~~~
+
+Within a class definition, the suggested organization is:
+
+1. **Special (dunder) methods**: Methods with names starting and ending
+   with double underscores. By convention, ``__init__`` (or
+   ``__cinit__`` in Cython) should be first among dunder methods, as it
+   defines the class interface.
+
+2. **Methods**: Regular instance methods, class methods
+   (``@classmethod``), and static methods (``@staticmethod``)
+
+3. **Properties**: Properties defined with ``@property`` decorator
+
+**Note**: Within each section, prefer logical ordering (e.g., grouping
+related methods). Alphabetical ordering is acceptable when no clear
+logical structure exists. Developers should use their judgment.
+
+.. _example-1:
+
+Example
+~~~~~~~
+
+.. code:: python
+
+   cdef class Buffer:
+       """Example class demonstrating the ordering."""
+
+       # 1. Special (dunder) methods (__cinit__/__init__ first by convention)
+       def __cinit__(self):
+           """Cython initialization."""
+           # ...
+
+       def __init__(self, *args, **kwargs):
+           """Python initialization."""
+           # ...
+
+       def __buffer__(self, flags: int, /) -> memoryview:
+           """Buffer protocol support."""
+           # ...
+
+       def __dealloc__(self):
+           """Cleanup."""
+           # ...
+
+       def __dlpack__(self, *, stream=None):
+           """DLPack protocol support."""
+           # ...
+
+       def __reduce__(self):
+           """Pickle support."""
+           # ...
+
+       # 2. Methods
+       def close(self, stream=None):
+           """Close the buffer."""
+           # ...
+
+       def copy_from(self, src, *, stream):
+           """Copy data from source buffer."""
+           # ...
+
+       def copy_to(self, dst=None, *, stream):
+           """Copy data to destination buffer."""
+           # ...
+
+       @classmethod
+       def from_handle(cls, ptr, size, mr=None):
+           """Create buffer from handle."""
+           # ...
+
+       def get_ipc_descriptor(self):
+           """Get IPC descriptor."""
+           # ...
+
+       # 3. Properties
+       @property
+       def device_id(self) -> int:
+           """Device ID property."""
+           # ...
+
+       @property
+       def handle(self):
+           """Handle property."""
+           # ...
+
+       @property
+       def size(self) -> int:
+           """Size property."""
+           # ...
+
+Helper Functions
+~~~~~~~~~~~~~~~~
+
+When a class grows long or a method becomes deeply nested, consider
+extracting implementation details into helper functions. The goal is to
+keep class definitions easy to navigate—readers shouldn't have to scroll
+through hundreds of lines to understand a class's interface.
+
+In Cython files, helpers are typically ``cdef`` or ``cdef inline``
+functions named with the pattern ``ClassName_methodname`` (e.g.,
+``DMR_close``, ``Buffer_close``). Place them at the end of the file or
+near their call sites, whichever aids readability.
+
+**Example:**
+
+.. code:: python
+
+   cdef class DeviceMemoryResource:
+       def close(self):
+           """Close the memory resource."""
+           DMR_close(self)
+
+   # Helper function (at end of file or nearby)
+   cdef inline DMR_close(DeviceMemoryResource self):
+       if self._handle == NULL:
+           return
+       # ... implementation ...
+
+Function Definitions
+~~~~~~~~~~~~~~~~~~~~
+
+For module-level functions (outside of classes), follow the ordering
+specified in `File Structure <#file-structure>`__: principal functions
+first (if applicable), then other public functions, then private
+functions. Within each group, prefer logical ordering; alphabetical
+ordering is a reasonable fallback.
+
+Naming Conventions
+------------------
+
+Follow PEP 8 naming conventions (CamelCase for classes, snake_case for
+functions/variables, UPPER_SNAKE_CASE for constants, leading underscore
+for private names).
+
+Cython ``cdef`` Variables
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consider prefixing ``cdef`` variables with ``c_`` to distinguish them
+from Python variables. This improves code readability by making it clear
+which variables are C-level types.
+
+**Preferred:**
+
+.. code:: python
+
+   def copy_to(self, dst: Buffer = None, *, stream: Stream | GraphBuilder) -> Buffer:
+       stream = Stream_accept(stream)
+       cdef size_t c_src_size = self._size
+
+       if dst is None:
+           dst = self._memory_resource.allocate(c_src_size, stream)
+
+       cdef size_t c_dst_size = dst._size
+       if c_dst_size != c_src_size:
+           raise ValueError(f"buffer sizes mismatch: src={c_src_size}, dst={c_dst_size}")
+       # ...
+
+**Also acceptable (if context is clear):**
+
+.. code:: python
+
+   cdef cydriver.CUdevice get_device_from_ctx(
+           cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
+       cdef bint switch_context = (curr_ctx != target_ctx)
+       cdef cydriver.CUcontext ctx
+       cdef cydriver.CUdevice target_dev
+       # ...
+
+The ``c_`` prefix is particularly helpful when mixing Python and Cython
+variables in the same scope, or when the variable name would otherwise
+be ambiguous.
+
+Type Annotations and Declarations
+---------------------------------
+
+Python Type Annotations
+~~~~~~~~~~~~~~~~~~~~~~~
+
+PEP 604 Union Syntax
+^^^^^^^^^^^^^^^^^^^^
+
+Use the modern `PEP 604 <https://peps.python.org/pep-0604/>`__ union
+syntax (``X | Y``) instead of ``typing.Union`` or ``typing.Optional``.
+
+**Preferred:**
+
+.. code:: python
+
+   def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+       # ...
+
+   def close(self, stream: Stream | None = None):
+       # ...
+
+**Avoid:**
+
+.. code:: python
+
+   from typing import Optional, Union
+
+   def allocate(self, size_t size, stream: Optional[Union[Stream, GraphBuilder]] = None) -> Buffer:
+       # ...
+
+   def close(self, stream: Optional[Stream] = None):
+       # ...
+
+Forward References and ``from __future__ import annotations``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Where needed, files should include
+``from __future__ import annotations`` at the top (after the SPDX
+header and any module docstring). This enables:
+
+1. **Forward references**: Type annotations can reference types that are
+   defined later in the file or in other modules without requiring
+   ``TYPE_CHECKING`` blocks.
+
+2. **Deferred evaluation**: Annotations are stored as strings and are not
+   evaluated at definition time, which helps avoid circular import issues.
+
+**Preferred:**
+
+.. code:: python
+
+   from __future__ import annotations
+
+   # Can reference Stream even if it's defined later or in another module
+   def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
+       # ...
+
+**Avoid:**
+
+.. code:: python
+
+   from typing import TYPE_CHECKING
+
+   if TYPE_CHECKING:
+       from cuda.core._stream import Stream
+
+   def allocate(self, size_t size, stream: Stream | None = None) -> Buffer:
+       # ...
+
+.. _guidelines-1:
+
+Guidelines
+^^^^^^^^^^
+
+1. **Use** ``from __future__ import annotations``: This should be
+   present in all ``.py`` and ``.pyx`` files with type annotations.
+
+2. **Use** ``|`` **for unions**: Prefer ``X | Y | None`` over
+   ``Union[X, Y]`` or ``Optional[X]``.
+
+3. **Avoid** ``TYPE_CHECKING`` **blocks**: With
+   ``from __future__ import annotations``, forward references work
+   without ``TYPE_CHECKING`` guards.
+
+4. **Import types normally**: Even if a type is only used in
+   annotations, import it normally (not in a ``TYPE_CHECKING`` block).
+
+Cython Type Declarations
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cython uses ``cdef`` declarations for C-level types. These are separate
+from Python annotations and are not covered by the rules above:
+
+.. code:: python
+
+   cdef class Buffer:
+       cdef:
+           uintptr_t _ptr
+           size_t _size
+           MemoryResource _memory_resource
+
+For Cython-specific type declarations, see `Cython-Specific
+Features <#cython-specific-features>`__.
+
+Docstrings
+----------
+
+This project uses the **NumPy docstring style** for all documentation.
+This format is well-suited for scientific and technical libraries and
+integrates well with Sphinx documentation generation.
+
+Format Overview
+~~~~~~~~~~~~~~~
+
+Docstrings use triple double-quotes (``"""``) and follow this general
+structure:
+
+.. code:: python
+
+   """Summary line.
+
+   Extended description (optional).
+
+   Parameters
+   ----------
+   param1 : type
+       Description of param1.
+   param2 : type, optional
+       Description of param2. Default is value.
+
+   Returns
+   -------
+   return_type
+       Description of return value.
+
+   Raises
+   ------
+   ExceptionType
+       Description of when this exception is raised.
+
+   Notes
+   -----
+   Additional notes and implementation details.
+
+   Examples
+   --------
+   >>> example_code()
+   result
+   """
+
+Module Docstrings
+~~~~~~~~~~~~~~~~~
+
+Per PEP 257, module docstrings appear at the top of the file,
+immediately after the copyright header and before any imports. They
+provide a brief overview of the module's purpose.
+
+.. code:: python
+
+   # <SPDX copyright header>
+   """Module for managing CUDA device memory resources.
+
+   This module provides classes and functions for allocating and managing
+   device memory using CUDA's stream-ordered memory pool API.
+   """
+
+   from __future__ import annotations
+   # ... imports ...
+
+For simple utility modules, a single-line docstring may suffice:
+
+.. code:: python
+
+   """Utility functions for CUDA error handling."""
+
+Class Docstrings
+~~~~~~~~~~~~~~~~
+
+Class docstrings should include:
+
+1. **Summary line**: A one-line description of the class
+2. **Extended description** (optional): Additional context about the
+   class
+3. **Parameters section**: If users construct the class directly (via
+   ``__init__``), document the constructor parameters
+4. **Attributes section**: Document public attributes (if any)
+5. **Notes section**: Important usage notes or implementation
+   details
+6. **Examples section**: Usage examples (if helpful)
+
+**Example:**
+
+.. code:: python
+
+   cdef class DeviceMemoryResource(MemoryResource):
+       """
+       A device memory resource managing a stream-ordered memory pool.
+
+       Parameters
+       ----------
+       device_id : :class:`Device` | int
+           Device or device ordinal for which a memory resource is constructed.
+       options : :class:`DeviceMemoryResourceOptions`, optional
+           Memory resource creation options. If None, uses the driver's current
+           or default memory pool for the specified device.
+
+       Attributes
+       ----------
+       device_id : int
+           The device ID associated with this memory resource.
+       is_ipc_enabled : bool
+           Whether this memory resource supports IPC.
+
+       Notes
+       -----
+       To create an IPC-enabled memory resource, specify ``ipc_enabled=True``
+       in the options. IPC-enabled resources can share allocations between
+       processes.
+
+       Examples
+       --------
+       >>> dmr = DeviceMemoryResource(0)
+       >>> buffer = dmr.allocate(1024)
+       """
+
+For simple classes, a brief docstring may be sufficient:
+
+.. code:: python
+
+   @dataclass
+   cdef class DeviceMemoryResourceOptions:
+       """Customizable DeviceMemoryResource options.
+
+       Attributes
+       ----------
+       ipc_enabled : bool, optional
+           Whether to create an IPC-enabled memory pool. Default is False.
+       max_size : int, optional
+           Maximum pool size. Default is 0 (system-dependent).
+       """
+
+Method and Function Docstrings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Method and function docstrings should include:
+
+1. **Summary line**: A one-line description starting with a verb (e.g.,
+   "Allocate", "Return", "Create")
+2. **Extended description** (optional): Additional details about
+   behavior
+3. **Parameters section**: All parameters with types and descriptions
+4. **Returns section**: Return type and description
+5. **Raises section**: Exceptions that may be raised (if any)
+6. **Notes section**: Important implementation details or usage notes
+   (if needed)
+7. **Examples section**: Usage examples (if helpful)
+
+**Example:**
+
+.. code:: python
+
+   def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+       """Allocate a buffer of the requested size.
+
+       Parameters
+       ----------
+       size : int
+           The size of the buffer to allocate, in bytes.
+       stream : :class:`Stream` | :class:`GraphBuilder`, optional
+           The stream on which to perform the allocation asynchronously.
+           If None, an internal stream is used.
+
+       Returns
+       -------
+       :class:`Buffer`
+           The allocated buffer object, which is accessible on the device
+           that this memory resource was created for.
+
+       Raises
+       ------
+       TypeError
+           If called on a mapped IPC-enabled memory resource.
+       RuntimeError
+           If allocation fails.
+
+       Notes
+       -----
+       The allocated buffer is associated with this memory resource and will
+       be deallocated when the buffer is closed or when this resource is closed.
+       """
+
+For simple functions, a brief docstring may suffice:
+
+.. code:: python
+
+   def get_ipc_descriptor(self) -> IPCBufferDescriptor:
+       """Export a :class:`Buffer` for sharing between processes."""
+
+Property Docstrings
+~~~~~~~~~~~~~~~~~~~
+
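+Property docstrings should be concise and focus on what the property
+represents. For read-write properties, document both getter and setter
+behavior in the getter's docstring, because Python uses the getter's
+docstring as the property's ``__doc__``.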
+
+**Read-only property:**
+
+.. code:: python
+
+   @property
+   def device_id(self) -> int:
+       """Return the device ordinal of this buffer."""
+
+**Read-write property:**
+
+.. code:: python
+
+   @property
+   def peer_accessible_by(self):
+       """
+       Get or set the devices that can access allocations from this memory pool.
+
+       Returns
+       -------
+       tuple of int
+           A tuple of sorted device IDs that currently have peer access to
+           allocations from this memory pool.
+
+       Notes
+       -----
+       When setting, accepts a sequence of :class:`Device` objects or device IDs.
+       Setting to an empty sequence revokes all peer access.
+
+       Examples
+       --------
+       >>> dmr.peer_accessible_by = [1]  # Grant access to device 1
+       >>> assert dmr.peer_accessible_by == (1,)
+       """
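+
+The following is a minimal pure-Python sketch of where a read-write
+property's documentation lives. ``Pool`` and ``peer_ids`` are
+hypothetical names used only for illustration and are not part of
+``cuda.core``. The setter carries no docstring of its own; the getter's
+docstring is what ``help()`` and Sphinx see.
+
+.. code:: python
+
+   class Pool:
+       """Hypothetical class used only to illustrate property docstrings."""
+
+       def __init__(self):
+           self._peers = ()
+
+       @property
+       def peer_ids(self) -> tuple:
+           """Get or set the peer device IDs as a sorted tuple of int."""
+           return self._peers
+
+       @peer_ids.setter
+       def peer_ids(self, ids):
+           # No docstring here: the getter's docstring covers both directions.
+           self._peers = tuple(sorted(set(ids)))
+
+
+   pool = Pool()
+   pool.peer_ids = [2, 1, 1]
+   doc = Pool.peer_ids.__doc__  # taken from the getter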
+
+Type References in Docstrings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use Sphinx cross-reference roles to link to other documented objects.
+Use the most specific role for each type:
+
++-------------+---------------------------+-------------------------------------------+
+| Role        | Use for                   | Example                                   |
++=============+===========================+===========================================+
+| ``:class:`` | Classes                   | :literal:`:class:\`Buffer\``              |
++-------------+---------------------------+-------------------------------------------+
+| ``:func:``  | Functions                 | :literal:`:func:\`launch\``               |
++-------------+---------------------------+-------------------------------------------+
+| ``:meth:``  | Methods                   | :literal:`:meth:\`Device.create_stream\`` |
++-------------+---------------------------+-------------------------------------------+
+| ``:attr:``  | Attributes                | :literal:`:attr:\`device_id\``            |
++-------------+---------------------------+-------------------------------------------+
+| ``:mod:``   | Modules                   | :literal:`:mod:\`multiprocessing\``       |
++-------------+---------------------------+-------------------------------------------+
+| ``:obj:``   | Type aliases, other       | :literal:`:obj:\`DevicePointerT\``        |
+|             | objects                   |                                           |
++-------------+---------------------------+-------------------------------------------+
+
+The ``~`` prefix displays only the final component:
+:literal:`:class:\`~cuda.core.Buffer\`` renders as "Buffer" while still
+linking to the full path.
+
+For more details, see the `Sphinx Python domain
+documentation <https://www.sphinx-doc.org/en/master/usage/domains/python.html#cross-referencing-python-objects>`__.
+
+**Example:**
+
+.. code:: python
+
+   def from_handle(
+       ptr: DevicePointerT, size_t size, mr: MemoryResource | None = None
+   ) -> Buffer:
+       """Create a new :class:`Buffer` from a pointer.
+
+       Parameters
+       ----------
+       ptr : :obj:`DevicePointerT`
+           Allocated buffer handle object.
+       size : int
+           Memory size of the buffer.
+       mr : :class:`MemoryResource`, optional
+           Memory resource associated with the buffer.
+       """
+
+.. _guidelines-2:
+
+Guidelines
+~~~~~~~~~~
+
+1. **Always include docstrings**: All public classes, methods,
+   functions, and properties should have docstrings.
+
+2. **Start with a verb**: Summary lines for methods and functions should
+   start with a verb in imperative mood (e.g., "Allocate", "Return",
+   "Create", not "Allocates", "Returns", "Creates").
+
+3. **Be concise but complete**: Provide enough information for users to
+   understand and use the API, but avoid unnecessary verbosity.
+
+4. **Use proper sections**: Include Parameters, Returns, Raises sections
+   when applicable. Use Notes and Examples sections when they add value.
+
+5. **Document optional parameters**: Clearly indicate optional
+   parameters and their default values.
+
+6. **Use type hints**: Type information in docstrings should complement
+   (not duplicate) type annotations. Use docstrings to provide
+   additional context about types.
+
+7. **Cross-reference related APIs**: Use Sphinx cross-references to link
+   to related classes, methods, and attributes.
+
+8. **Keep private methods brief**: Private methods (starting with ``_``)
+   may have minimal docstrings, but should still document non-obvious
+   behavior.
+
+9. **Update docstrings with code changes**: Keep docstrings synchronized
+   with implementation changes.
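+
+As an illustration of guidelines 2, 5, and 6, consider the sketch below.
+``align_up`` is a hypothetical helper, not an existing ``cuda.core``
+function. The summary line starts with an imperative verb, the optional
+parameter states its default, and the descriptions add units and
+constraints that the type annotations alone cannot express.
+
+.. code:: python
+
+   def align_up(size: int, alignment: int = 256) -> int:
+       """Round a byte count up to the next multiple of an alignment.
+
+       Parameters
+       ----------
+       size : int
+           Requested size in bytes. Must be non-negative.
+       alignment : int, optional
+           Alignment in bytes. Must be positive. Default is 256.
+
+       Returns
+       -------
+       int
+           The smallest multiple of ``alignment`` that is >= ``size``.
+
+       Raises
+       ------
+       ValueError
+           If ``size`` is negative or ``alignment`` is not positive.
+
+       Examples
+       --------
+       >>> align_up(1000)
+       1024
+       """
+       if size < 0 or alignment <= 0:
+           raise ValueError("size must be >= 0 and alignment must be > 0")
+       # Ceiling division using integer arithmetic only.
+       return -(-size // alignment) * alignment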
+
+Errors and Warnings
+-------------------
+
+CUDA Exceptions
+~~~~~~~~~~~~~~~
+
+The project defines custom exceptions for CUDA-specific errors:
+
+- **``CUDAError``**: Base exception for CUDA driver errors
+- **``NVRTCError``**: Exception for NVRTC compiler errors (inherits from
+  ``CUDAError``)
+
+Use these instead of generic exceptions when reporting CUDA failures.
+
+CUDA API Error Handling
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In ``nogil`` contexts, use the ``HANDLE_RETURN`` macro:
+
+.. code:: python
+
+   with nogil:
+       HANDLE_RETURN(cydriver.cuMemAlloc(ptr, size))
+
+At the Python level, use ``handle_return()`` or
+``raise_if_driver_error()``:
+
+.. code:: python
+
+   err, = driver.cuMemcpyAsync(dst._ptr, self._ptr, src_size, stream.handle)
+   handle_return((err,))
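+
+Python-level driver calls return a tuple whose first element is the
+status code, as in the snippet above. The stand-alone sketch below
+illustrates that convention only. ``FakeResult``, ``FakeCUDAError``,
+``check_result``, and ``fake_mem_alloc`` are hypothetical stand-ins, not
+the project's ``handle_return`` or the real driver bindings, whose exact
+behavior may differ.
+
+.. code:: python
+
+   from enum import IntEnum
+
+
+   class FakeResult(IntEnum):
+       # Hypothetical stand-in for the driver's status enum.
+       SUCCESS = 0
+       ERROR_OUT_OF_MEMORY = 2
+
+
+   class FakeCUDAError(Exception):
+       """Hypothetical stand-in for the project's CUDAError."""
+
+
+   def check_result(result):
+       """Raise on a non-success status, otherwise return the payload."""
+       err, *values = result
+       if err != FakeResult.SUCCESS:
+           raise FakeCUDAError(f"driver call failed: {err.name}")
+       if not values:
+           return None
+       return values[0] if len(values) == 1 else tuple(values)
+
+
+   def fake_mem_alloc(size):
+       # Hypothetical driver-style call returning (status, value).
+       if size > 1024:
+           return (FakeResult.ERROR_OUT_OF_MEMORY, None)
+       return (FakeResult.SUCCESS, 0xDEAD0000)
+
+
+   ptr = check_result(fake_mem_alloc(256))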
+
+Warnings
+~~~~~~~~
+
+When emitting warnings, always specify ``stacklevel`` so the warning
+points to the caller:
+
+.. code:: python
+
+   warnings.warn(message, UserWarning, stacklevel=3)
+
+The correct value depends on call depth: use ``stacklevel=2`` when the
+public function calls ``warnings.warn`` directly, and ``stacklevel=3``
+when the call goes through one internal helper.
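+
+A minimal sketch of the helper case follows. ``_warn_deprecated`` and
+``old_api`` are hypothetical names used only for illustration. With
+``stacklevel=3`` the warning skips both the helper and the public
+function, so it is attributed to the user's call site.
+
+.. code:: python
+
+   import warnings
+
+
+   def _warn_deprecated(name):
+       # Hypothetical helper. Level 1 is this function, level 2 is the
+       # public function that called it, level 3 is the user's code.
+       warnings.warn(f"{name} is deprecated", DeprecationWarning, stacklevel=3)
+
+
+   def old_api():
+       # Hypothetical public function that delegates to the helper.
+       _warn_deprecated("old_api")
+       return 42
+
+
+   with warnings.catch_warnings(record=True) as caught:
+       warnings.simplefilter("always")
+       result = old_api()  # the recorded warning is attributed to this line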
+
+CUDA-Specific Patterns
+----------------------
+
+GIL Management for CUDA Driver API Calls
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimized Cython code, release the GIL when calling CUDA driver
+APIs. This improves performance and allows other Python threads to run
+during CUDA operations.
+
+During initial development, it's fine to use the Python ``driver``
+module without releasing the GIL (see `Development
+Lifecycle <#development-lifecycle>`__). GIL release is a performance
+optimization that can be applied once the implementation is correct.
+
+Using ``with nogil:`` Blocks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Wrap ``cydriver`` calls in ``with nogil:`` blocks (or declare entire
+functions as ``nogil``):
+
+.. code:: python
+
+   cdef int value
+   with nogil:
+       HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, attr, device_id))
+
+Group multiple driver calls in a single block:
+
+.. code:: python
+
+   cdef cydriver.CUdevice dev
+   cdef int low, high
+   with nogil:
+       HANDLE_RETURN(cydriver.cuCtxGetDevice(&dev))
+       HANDLE_RETURN(cydriver.cuCtxGetStreamPriorityRange(&low, &high))
+
+Raising Exceptions from ``nogil`` Context
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To raise exceptions from a ``nogil`` context, acquire the GIL first:
+
+.. code:: python
+
+   with gil:
+       raise CUDAError(f"CUDA operation failed: {error}")
+
+Development Lifecycle
+---------------------
+
+Two-Phase Development
+~~~~~~~~~~~~~~~~~~~~~
+
+A common pattern when implementing CUDA functionality is to develop in
+two phases:
+
+1. **Start with Python**: Use the ``driver`` module for a
+   straightforward implementation. Write tests to verify correctness.
+   This allows faster iteration and easier debugging.
+
+2. **Optimize with Cython**: Once the implementation is correct, switch
+   to ``cydriver`` with ``nogil`` blocks and ``HANDLE_RETURN`` for
+   better performance.
+
+This approach separates correctness from optimization. Getting the logic
+right first—with Python's better error messages and stack traces—often
+saves time overall.
+
+Python Implementation
+~~~~~~~~~~~~~~~~~~~~~
+
+Use the ``driver`` module from ``cuda.core._utils.cuda_utils``:
+
+.. code:: python
+
+   from cuda.core._utils.cuda_utils import driver
+   from cuda.core._utils.cuda_utils cimport (
+       _check_driver_error as raise_if_driver_error,
+   )
+
+   def get_attribute(self, attr: int) -> int:
+       err, value = driver.cuDeviceGetAttribute(attr, self._id)
+       raise_if_driver_error(err)
+       return value
+
+Cython Optimization
+~~~~~~~~~~~~~~~~~~~
+
+When ready to optimize, switch to ``cydriver``:
+
+.. code:: python
+
+   from cuda.bindings cimport cydriver
+   from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
+
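+   def get_attribute(self, attr: int) -> int:
+       # Convert Python objects to C values while the GIL is still held.
+       cdef cydriver.CUdevice_attribute c_attr = attr
+       cdef int device_id = self._id
+       cdef int value
+       with nogil:
+           HANDLE_RETURN(cydriver.cuDeviceGetAttribute(&value, c_attr, device_id))
+       return value
+
+Key changes:
+
+- Replace ``driver`` with ``cydriver``
+- Copy Python-level arguments into C-typed locals before the ``nogil``
+  block, because Python objects cannot be accessed without the GIL
+- Wrap calls in ``with nogil:``
+- Use ``HANDLE_RETURN`` instead of ``raise_if_driver_error``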
+
+Run tests after optimization to verify behavior is unchanged.
diff --git a/cuda_core/docs/source/index.rst b/cuda_core/docs/source/index.rst
index b6907de160..42330a3853 100644
--- a/cuda_core/docs/source/index.rst
+++ b/cuda_core/docs/source/index.rst
@@ -15,6 +15,7 @@ Welcome to the documentation for ``cuda.core``.
    interoperability
    api
    contribute
+   developer-guide
 
 .. toctree::
    :maxdepth: 1

From f5dc2db684de438c56f04ba17c5a4a69a1c49bb3 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Fri, 9 Jan 2026 11:19:01 -0800
Subject: [PATCH 16/17] Show developer-guide as single link in docs TOC

---
 cuda_core/docs/source/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cuda_core/docs/source/index.rst b/cuda_core/docs/source/index.rst
index 42330a3853..cb93225f2c 100644
--- a/cuda_core/docs/source/index.rst
+++ b/cuda_core/docs/source/index.rst
@@ -15,11 +15,11 @@ Welcome to the documentation for ``cuda.core``.
    interoperability
    api
    contribute
-   developer-guide
 
 .. toctree::
    :maxdepth: 1
 
+   developer-guide
    conduct
    license
 

From e72f8a0611b0749793094cfacc95b7a49ece21e1 Mon Sep 17 00:00:00 2001
From: Andy Jost <ajost@nvidia.com>
Date: Fri, 9 Jan 2026 13:24:55 -0800
Subject: [PATCH 17/17] Fix RST nested markup in developer guide

RST does not support nesting inline markup, so bold labels containing
code (e.g., **``__all__``**) render incorrectly. Remove backticks from
within bold markers to fix rendering.
---
 cuda_core/docs/source/developer-guide.rst | 36 +++++++++++------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/cuda_core/docs/source/developer-guide.rst b/cuda_core/docs/source/developer-guide.rst
index 4b6d851cf8..a8c8de811f 100644
--- a/cuda_core/docs/source/developer-guide.rst
+++ b/cuda_core/docs/source/developer-guide.rst
@@ -192,11 +192,11 @@ File Types
 
 The ``cuda/core`` package uses three types of files:
 
-1. **``.pyx`` files**: Cython implementation files containing the actual
+1. **.pyx files**: Cython implementation files containing the actual
    code
-2. **``.pxd`` files**: Cython declaration files containing type
+2. **.pxd files**: Cython declaration files containing type
    definitions and function signatures for C-level access
-3. **``.py`` files**: Pure Python files for utilities and high-level
+3. **.py files**: Pure Python files for utilities and high-level
    interfaces
 
 File Naming Conventions
@@ -217,10 +217,10 @@ Relationship Between ``.pxd`` and ``.pyx`` Files
 For each ``.pyx`` file that defines classes or functions used by other
 Cython modules, create a corresponding ``.pxd`` file:
 
-- **``.pxd`` file**: Contains ``cdef`` class declarations,
+- **.pxd file**: Contains ``cdef`` class declarations,
   ``cdef``/``cpdef`` function signatures, and ``cdef`` attribute
   declarations
-- **``.pyx`` file**: Contains the full implementation including Python
+- **.pyx file**: Contains the full implementation including Python
   methods, docstrings, and implementation details
 
 **Example:**
@@ -275,12 +275,12 @@ For complex subpackages that require extra structure (like
 
 1. **Private submodules**: Each component is implemented in a private
    submodule (e.g., ``_buffer.pyx``, ``_device_memory_resource.pyx``)
-2. **Submodule ``__all__``**: Each submodule defines its own ``__all__``
+2. **Submodule __all__**: Each submodule defines its own ``__all__``
    list
-3. **Subpackage ``__init__.py``**: The subpackage ``__init__.py`` uses
+3. **Subpackage __init__.py**: The subpackage ``__init__.py`` uses
    ``from ._module import *`` to assemble the package
 
-**Example structure for ``_memory/`` subpackage:**
+**Example structure for _memory/ subpackage:**
 
 ``_memory/_buffer.pyx``:
 
@@ -335,19 +335,19 @@ improve maintainability).
 Guidelines
 ~~~~~~~~~~
 
-1. **Always create ``.pxd`` files for shared Cython types**: If a class
+1. **Always create .pxd files for shared Cython types**: If a class
    or function is ``cimport``\ ed by other modules, provide a ``.pxd``
    declaration file.
 
-2. **Keep ``.pxd`` files minimal**: Only include declarations needed for
+2. **Keep .pxd files minimal**: Only include declarations needed for
    Cython compilation. Omit implementation details, docstrings, and
    Python-only code.
 
-3. **Use ``__all__`` when helpful**: Define ``__all__`` to control
+3. **Use __all__ when helpful**: Define ``__all__`` to control
    exported symbols when it simplifies or clarifies the module
    structure.
 
-4. **Use ``from ._module import *`` in subpackage ``__init__.py``**:
+4. **Use from ._module import * in subpackage __init__.py**:
    This pattern assembles the subpackage API from its submodules. Use
    ``# noqa: F403`` to suppress linting warnings about wildcard imports.
 
@@ -476,7 +476,7 @@ Additional Rules
    Do not use blank lines within a group unless using multi-line import
    formatting.
 
-5. **``try/except`` Blocks**: Import fallbacks (e.g., for optional
+5. **try/except blocks**: Import fallbacks (e.g., for optional
    dependencies) should be placed in the appropriate group (external or
    cuda-core) using ``try/except`` blocks.
 
@@ -756,13 +756,13 @@ header). This enables:
 Guidelines
 ^^^^^^^^^^
 
-1. **Use ``from __future__ import annotations``**: This should be
+1. **Use from __future__ import annotations**: This should be
    present in all ``.py`` and ``.pyx`` files with type annotations.
 
-2. **Use ``|`` for unions**: Prefer ``X | Y | None`` over
+2. **Use | for unions**: Prefer ``X | Y | None`` over
    ``Union[X, Y]`` or ``Optional[X]``.
 
-3. **Avoid ``TYPE_CHECKING`` blocks**: With
+3. **Avoid TYPE_CHECKING blocks**: With
    ``from __future__ import annotations``, forward references work
    without ``TYPE_CHECKING`` guards.
 
@@ -1113,8 +1113,8 @@ CUDA Exceptions
 
 The project defines custom exceptions for CUDA-specific errors:
 
-- **``CUDAError``**: Base exception for CUDA driver errors
-- **``NVRTCError``**: Exception for NVRTC compiler errors (inherits from
+- **CUDAError**: Base exception for CUDA driver errors
+- **NVRTCError**: Exception for NVRTC compiler errors (inherits from
   ``CUDAError``)
 
 Use these instead of generic exceptions when reporting CUDA failures.