Ensure that OneFormerProcessor places text task_inputs on the same device as other inputs #42722

@simonreise

Description

System Info

  • transformers version: 5.0.0.dev0
  • Platform: Windows-10-10.0.26200-SP0
  • Python version: 3.11.9
  • Huggingface_hub version: 1.2.1
  • Safetensors version: 0.5.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA GeForce RTX 2060

Who can help?

@yonigozlan @molbap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Fast image processors can keep the output tensors on the same device as the input tensors.

This works for most processors, but OneFormerProcessor, which combines OneFormerImageProcessorFast with a tokenizer and takes both image and text inputs, does not ensure that both outputs end up on the same device.

Let's create a simple script that loads an image, moves it to CUDA, and then preprocesses it with OneFormerProcessor:

import transformers
from PIL import Image
import requests
from torchvision import transforms

to_tensor_transform = transforms.ToTensor()

processor = transformers.OneFormerImageProcessorFast()
processor = transformers.OneFormerProcessor(
    image_processor=processor,
    tokenizer=transformers.AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"),
)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Convert image to tensor and move it to cuda
image = to_tensor_transform(image).to("cuda")

# Semantic Segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")

When we check the inputs produced by the processor, we see that pixel_values is on cuda, but task_inputs remains on the CPU.

Expected behavior

OneFormerProcessor should probably not only tokenize the text inputs but also move them to the same device as the images.
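Until the processor handles this itself, a possible workaround is to move every tensor-like value in the processor output to the image's device after the call. A minimal sketch (the `move_to_device` helper is my own name, not a transformers API, and the check for a `.to` method is an assumption about which values are tensors):

```python
def move_to_device(batch, device):
    """Move every tensor-like value (anything with a .to method) in a
    processor output mapping to the given device; leave other entries as-is."""
    return {
        key: (value.to(device) if hasattr(value, "to") else value)
        for key, value in batch.items()
    }
```

With the reproduction script above, this would be used as `inputs = move_to_device(dict(inputs), image.device)`, so that task_inputs lands on the same CUDA device as pixel_values.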
