System Info
- transformers version: 5.0.0.dev0
- Platform: Windows-10-10.0.26200-SP0
- Python version: 3.11.9
- Huggingface_hub version: 1.2.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA GeForce RTX 2060
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Fast image processors keep their output tensors on the same device as the input tensors.
This works for most processors, but OneFormerProcessor, which combines OneFormerImageProcessorFast with a tokenizer and takes both image and text inputs, does not ensure that both outputs end up on the same device.
The following script loads an image, moves it to CUDA, and then preprocesses it with OneFormerProcessor:
```python
import requests
import transformers
from PIL import Image
from torchvision import transforms

to_tensor_transform = transforms.ToTensor()

image_processor = transformers.OneFormerImageProcessorFast()
processor = transformers.OneFormerProcessor(
    image_processor=image_processor,
    tokenizer=transformers.AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"),
)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image to a tensor and move it to CUDA
image = to_tensor_transform(image).to("cuda")

# Semantic segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")
```

When we check the inputs generated by the processor, we see that pixel_values is on cuda, but task_inputs is not.
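For example, a quick device check on the returned tensors (a sketch, assuming both entries are plain tensors in the returned BatchFeature) shows the mismatch:

```python
# pixel_values follows the input image to CUDA, while task_inputs stays on CPU
print(inputs["pixel_values"].device)  # cuda:0
print(inputs["task_inputs"].device)   # cpu
```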
Expected behavior
OneFormerProcessor should probably not only tokenize the text inputs but also move them to the same device as the image tensors.
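In the meantime, a minimal workaround sketch (assuming task_inputs is a plain tensor in the returned BatchFeature) is to move it to the device of the pixel values after calling the processor:

```python
# Workaround sketch: align the tokenized task inputs with the image tensors.
# Assumes `inputs` is the BatchFeature returned by the call above.
inputs["task_inputs"] = inputs["task_inputs"].to(inputs["pixel_values"].device)
```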