PyTorch-style tensor library & deep-learning framework written almost entirely in Rust (with a sprinkle of C/CUDA under the hood).
• Autograd • CUDA & Multi-GPU • nn Modules • Optimisers • Datasets • torch-style API
Rust gives us memory-safety, fearless concurrency and zero-cost abstractions. RStorch keeps PyTorch’s ergonomics while enjoying Rust’s compile-time guarantees.
- Write models in plain idiomatic Rust (`Tensor`, `nn::Module`, `optim::SGD`, …)
- Run them on CPU or GPU with the same code (`tensor.to("cuda")`)
- Scale to multi-GPU via NCCL & MPI bindings (`distributed::init_process_group`)
- Unit-tested end-to-end – run `cargo test` and watch it train MNIST in seconds.
The repo is split into two worlds – high-level Rust and low-level C/CUDA. The Mermaid map shows the relationships:
```mermaid
classDiagram
    Tensor <|-- CTensor
    Autograd --> Tensor : builds graph of
    AutogradFunctions --> Autograd
    NN --> Tensor : consumes
    NN --> Autograd : uses gradients
    NNmodules --> NN
    NNactivation --> NN
    NNloss --> NN
    NNparallel --> Distributed
    Optim --> Tensor : updates params
    Optim --> NN : accesses Parameter
    SGD --> Optim
    UtilsData --> Tensor : wraps batches
    TorchVision --> UtilsData : yields
    Distributed --> Tensor
    Distributed --> NN
```
`Tensor` (Rust) mirrors `CTensor` (C-repr) so we can FFI into highly optimised kernels.
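As a rough illustration of that boundary, the Rust side holds a `#[repr(C)]` struct whose layout the C side agrees on. The field names and the extern function below are illustrative assumptions, not the crate's actual definitions:

```rust
// Sketch only – field names, types and the extern function are assumptions.
#[repr(C)]
pub struct CTensor {
    data: *mut f32,     // flat buffer, host or device memory
    shape: *const i64,  // dimensions, length = ndim
    ndim: i32,
    device: i32,        // e.g. 0 = CPU, 1 = CUDA
}

extern "C" {
    // Implemented in src/csrc/, dispatched to cpu.cpp or cuda.cu at runtime.
    fn rstorch_add(a: *const CTensor, b: *const CTensor, out: *mut CTensor);
}
```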
`autograd::functions` implements ops with forward & backward; a dynamic graph records `Function` instances so `tensor.backward()` just works.
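In spirit, each differentiable op provides a forward and a backward pass and gets recorded as a node in the graph. The trait below is a minimal sketch of that contract; the real trait and method signatures in `autograd/` may differ:

```rust
use rstorch::tensor::Tensor;

// Sketch of the contract a differentiable op fulfils (names are assumptions).
pub trait Function {
    /// Compute the output and stash whatever backward will need.
    fn forward(&mut self, inputs: &[Tensor]) -> Tensor;
    /// Given dL/d(output), return dL/d(input) for every input.
    fn backward(&mut self, grad_output: &Tensor) -> Vec<Tensor>;
}
```

Each op pushes one such node into the output tensor's history, which `tensor.backward()` then walks in reverse.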
High-level layers live in `nn/` (linear, convolution, activations, loss, parallel). Implement the `Module` trait, register `Parameter`s and compose like PyTorch.
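Composition might look like the sketch below. The exact `Module` trait surface (method receiver, how parameters get registered) is assumed here, and `relu()` is a hypothetical activation helper:

```rust
use rstorch::{tensor::Tensor, nn::{Linear, Module}};

// A tiny two-layer MLP composed from existing layers (sketch).
struct Mlp {
    fc1: Linear,
    fc2: Linear,
}

impl Mlp {
    fn new() -> Self {
        Self { fc1: Linear::new(784, 128), fc2: Linear::new(128, 10) }
    }
}

impl Module for Mlp {
    fn forward(&mut self, x: &Tensor) -> Tensor {
        // relu() is an assumed activation helper; use whatever nn::activation provides.
        self.fc2.forward(&self.fc1.forward(x).relu())
    }
}
```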
`optim::Optimizer` trait with an `SGD` impl today – call `step()` every iteration.
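The trait is roughly of this shape – a sketch only; the actual method set may differ:

```rust
// Sketch of the Optimizer contract (names assumed to mirror PyTorch).
pub trait Optimizer {
    /// Apply one update to every registered parameter.
    fn step(&mut self);
    /// Reset accumulated gradients before the next backward pass.
    fn zero_grad(&mut self);
}
```

A momentum or Adam optimiser would simply be another struct implementing the same two methods over `model.parameters()`.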
`utils::data` wraps `Dataset`, `DataLoader`, batched iteration and shuffling.
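Usage might look like the following; the module paths, constructor names and argument order are assumptions, not the crate's documented API:

```rust
use rstorch::utils::data::DataLoader;   // assumed path
use rstorch::torchvision::MNIST;        // assumed path

let train = MNIST::new("data/", true);          // hypothetical: (root, train)
let loader = DataLoader::new(train, 64, true);  // hypothetical: (dataset, batch_size, shuffle)

for (images, labels) in loader {
    // each batch is a pair of Tensors, ready for model.forward(&images)
}
```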
`csrc/` contains CPU (`cpu.cpp`) and CUDA (`cuda.cu`) kernels. `distributed/` wraps NCCL for all-reduce & broadcast so `nn::DataParallel` works.
```
rstorch/
├─ src/
│  ├─ tensor.rs       # Safe Rust wrapper around CTensor + helper methods
│  ├─ autograd/       # Dynamic graph & Function defs
│  ├─ nn/             # Modules, functional API, losses, parallel
│  ├─ optim/          # Optimisers
│  ├─ utils/          # Dataset / dataloader helpers
│  ├─ torchvision/    # Example datasets & transforms (MNIST)
│  └─ distributed/    # Rust side of multi-GPU + run examples
├─ src/csrc/          # C/CUDA backend (CPU/GPU kernels, NCCL wrappers)
│  ├─ tensor.{h,cpp}       # Dispatcher exposed to Rust
│  ├─ cpu.{h,cpp}          # CPU reference implementations
│  ├─ cuda.{h,cu}          # Hand-written CUDA kernels (+ host wrappers)
│  └─ distributed.{h,cpp}  # NCCL + MPI helpers
├─ tests/             # Exhaustive unit & integration tests
├─ build.rs           # Compiles C/CUDA and links into the crate
└─ map.mermaid        # The diagram above
```
```bash
# add the crate – local path for now
git clone https://github.com/<you>/rstorch && cd rstorch
cargo test    # run suite (CPU only)
```

```rust
use rstorch::tensor::Tensor;

let a = Tensor::of_slice(&[1., 2., 3.]).reshape(&[3, 1]);
let b = Tensor::ones(&[3, 1]);
let c = (&a + &b).sin();
println!("{}", c);
```

```rust
use rstorch::{tensor::Tensor, nn::{Linear, Module}, optim::{SGD, Optimizer}};

// toy data
let x = Tensor::randn(&[64, 10], true);
let y = Tensor::randn(&[64, 1], false);
// model
let mut model = Linear::new(10, 1);
let mut opt = SGD::new(model.parameters(), 1e-2);
for _epoch in 0..1000 {
let pred = model.forward(&x);
let loss = (&pred - &y).pow(2.0).mean(None);
opt.zero_grad();
loss.backward();
opt.step();
}
```

Same idioms as PyTorch: build `Module`s, call `forward`, compute the loss, `backward()`, `step()`.
```rust
let d = Tensor::ones(&[1024, 1024]).to("cuda");
let e = Tensor::randn(&[1024, 1024]).to("cuda");
let f = d.matmul(&e);
```

The backing data is moved with `cudaMalloc`/`cudaMemcpy`, and every op thereafter launches a CUDA kernel.
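Continuing that snippet, the result can be brought back to host memory for printing; `"cpu"` as a device string is an assumption mirroring `"cuda"`:

```rust
// Sketch: copy the GPU result back to host memory before printing.
let f_host = f.to("cpu");
println!("{}", f_host);
```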
```rust
rstorch::distributed::init_process_group("nccl", rank, world_size);

// wrap the model
let dp = nn::DataParallel::new(model);
let out = dp.forward(&input);
```

All-reduce happens automatically in the underlying kernels.
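Putting it together, one data-parallel training step might look like the sketch below. How `rank` and `world_size` reach each process is launcher-specific – reading them from environment variables here is purely an assumption, as is `parameters()` being forwarded by the wrapper:

```rust
use rstorch::{distributed, nn::{self, Linear, Module}, optim::{SGD, Optimizer}, tensor::Tensor};

// Assumption: the launcher (e.g. mpirun) exports RANK / WORLD_SIZE.
let rank: i32 = std::env::var("RANK").unwrap().parse().unwrap();
let world_size: i32 = std::env::var("WORLD_SIZE").unwrap().parse().unwrap();
distributed::init_process_group("nccl", rank, world_size);

let model = Linear::new(10, 1);
let mut dp = nn::DataParallel::new(model);
let mut opt = SGD::new(dp.parameters(), 1e-2);   // parameters() on the wrapper is assumed

let x = Tensor::randn(&[64, 10], true).to("cuda");
let y = Tensor::randn(&[64, 1], false).to("cuda");

let pred = dp.forward(&x);
let loss = (&pred - &y).pow(2.0).mean(None);
opt.zero_grad();
loss.backward();
opt.step();   // gradients have already been all-reduced across ranks
```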
You need nvcc and a working CUDA runtime. The `cc`-based build script (`build.rs`) detects nvcc and builds `src/csrc/cuda.cu`; if CUDA is absent it silently falls back to a CPU-only build.
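A minimal sketch of what such a build script can look like with the `cc` crate; the exact file list and feature wiring in the real `build.rs` may differ, and the nvcc auto-detection / graceful fallback is omitted here:

```rust
// build.rs (sketch) – compile the C/CUDA backend and link it into the crate.
fn main() {
    // Always build the CPU reference kernels.
    let mut build = cc::Build::new();
    build.cpp(true)
        .file("src/csrc/tensor.cpp")
        .file("src/csrc/cpu.cpp");

    // Only build the CUDA kernels when the `cuda` feature is enabled.
    if std::env::var("CARGO_FEATURE_CUDA").is_ok() {
        cc::Build::new()
            .cuda(true)
            .file("src/csrc/cuda.cu")
            .compile("rstorch_cuda");
        println!("cargo:rustc-link-lib=cudart");
    }

    build.compile("rstorch_csrc");
}
```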
```bash
# on Linux / macOS with the CUDA toolkit in PATH
cargo build --features=cuda
```

Unit, integration & training-loop smoke tests:

```bash
cargo test -- --nocapture
```

MIT.