System Info
- transformers version: 5.0.0.dev0
- Platform: Windows-10-10.0.26200-SP0
- Python version: 3.11.9
- Huggingface_hub version: 1.2.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA GeForce RTX 2060
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Fast image processors keep their output tensors on the same device as the input tensors.
This works for most processors, but OneFormerProcessor, which combines OneFormerImageProcessorFast with a tokenizer and takes both image and text inputs, does not ensure that both outputs end up on the same device.
The following script loads an image, moves it to CUDA, and then preprocesses it with OneFormerProcessor:
```python
import requests
import transformers
from PIL import Image
from torchvision import transforms

to_tensor_transform = transforms.ToTensor()

image_processor = transformers.OneFormerImageProcessorFast()
processor = transformers.OneFormerProcessor(
    image_processor=image_processor,
    tokenizer=transformers.AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"),
)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image to a tensor and move it to CUDA
image = to_tensor_transform(image).to("cuda")

# Semantic segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")
```

When we check the inputs generated by the processor, we see that pixel_values is on cuda, but task_inputs is not.
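For example, a quick device check on the returned tensors (a sketch, assuming both entries are plain tensors in the returned BatchFeature) shows the mismatch:

```python
# pixel_values follows the input image to CUDA, while task_inputs stays on CPU
print(inputs["pixel_values"].device)  # cuda:0
print(inputs["task_inputs"].device)   # cpu
```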
Expected behavior
OneFormerProcessor should probably not only tokenize the text inputs but also move them to the same device as the image tensors.
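In the meantime, a minimal workaround sketch (assuming task_inputs is a plain tensor in the returned BatchFeature) is to move it to the device of the pixel values after calling the processor:

```python
# Workaround sketch: align the tokenized task inputs with the image tensors.
# Assumes `inputs` is the BatchFeature returned by the call above.
inputs["task_inputs"] = inputs["task_inputs"].to(inputs["pixel_values"].device)
```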