# Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Paper: [arXiv:2506.00633](https://arxiv.org/abs/2506.00633)
### How to Use
Run `diff_model_demo.py` from the code release for a one-off generation from text. This repository provides three checkpoints:

- `autoencoder_epoch273.pt`: 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt`: Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt`: CLIP3D weights for encoding reports.

Download the weights from the Hub:

```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"
snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)
```
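A minimal sketch of loading the downloaded state dicts with PyTorch. Only the file names come from this repository; the model classes (3D VAE, diffusion UNet, CLIP3D) are defined in the code release, so the loading pattern below is an assumption.

```python
import torch

# Path passed as local_dir to snapshot_download above; adjust as needed.
weights_dir = "your_local_path"

# These load raw state dicts only; instantiate the matching modules from the
# code release before calling load_state_dict on them.
vae_state = torch.load(f"{weights_dir}/autoencoder_epoch273.pt", map_location="cpu")
unet_state = torch.load(f"{weights_dir}/unet_rflow_200ep.pt", map_location="cpu")
clip3d_state = torch.load(f"{weights_dir}/CLIP3D_Finding_Impression_30ep.pt", map_location="cpu")
```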
### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on paired CT volumes and radiology reports (a CLIP-style contrastive sketch follows this list).
- VAE checkpoint from the MONAI MAISI tutorial: https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings (see the rectified-flow sketch below).
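A minimal sketch of the CLIP-style contrastive objective used for vision-language alignment, with random tensors standing in for the encoder outputs. The encoders, embedding size, and temperature value are assumptions; only the symmetric InfoNCE loss itself is the named technique.

```python
import torch
import torch.nn.functional as F

batch = 4
# Stand-ins for the 3D image encoder and report (findings/impression) encoder outputs.
img_emb = F.normalize(torch.randn(batch, 512), dim=-1)
txt_emb = F.normalize(torch.randn(batch, 512), dim=-1)

logit_scale = 14.3  # ~1/0.07; a learnable parameter in real CLIP-style training
logits = logit_scale * img_emb @ txt_emb.t()  # cosine-similarity matrix

# Symmetric InfoNCE: matched volume/report pairs sit on the diagonal.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```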
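A minimal sketch of one rectified-flow training step in latent space, conditioned on a text embedding. The toy `TinyUNet3D`, the latent and embedding shapes, and the conditioning scheme are assumptions standing in for the real UNet from the code release; the straight-path interpolation and velocity-matching loss are the rectified-flow objective itself.

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Hypothetical stand-in for the real conditioned 3D diffusion UNet."""
    def __init__(self, channels=4, cond_dim=512):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, x, t, cond):
        # Broadcast the text conditioning over the 3D latent grid (toy scheme).
        c = self.cond_proj(cond)[:, :, None, None, None]
        return self.conv(x + c) * t[:, None, None, None, None]

unet = TinyUNet3D()
opt = torch.optim.AdamW(unet.parameters(), lr=1e-4)

z1 = torch.randn(2, 4, 8, 8, 8)   # "data": VAE latents of CT volumes (toy shape)
text_emb = torch.randn(2, 512)    # CLIP3D report embeddings (toy values)
z0 = torch.randn_like(z1)         # noise endpoint
t = torch.rand(z1.size(0))        # uniform time in [0, 1]

# Rectified flow: linear interpolation between noise and data ...
zt = (1 - t)[:, None, None, None, None] * z0 + t[:, None, None, None, None] * z1
# ... with a constant target velocity along the straight path.
target_velocity = z1 - z0

pred_velocity = unet(zt, t, text_emb)
loss = torch.mean((pred_velocity - target_velocity) ** 2)

opt.zero_grad()
loss.backward()
opt.step()
```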
### Evaluation
- See the paper for quantitative and qualitative results.
### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
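A minimal sketch for fetching the released synthetic scans with `huggingface_hub`. The `local_dir` name is arbitrary, and the directory layout and volume file format inside the dataset repo are not described here, so inspect the download before loading.

```python
from huggingface_hub import snapshot_download

# Fetch the 1,000 generated CT scans (a dataset repo, hence repo_type="dataset").
scans_dir = snapshot_download(
    repo_id="dmolino/CT-RATE_Generated_Scans",
    repo_type="dataset",
    local_dir="ct_rate_generated_scans",
)
print(scans_dir)  # inspect contents before assuming a particular volume format
```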
### Environmental Impact
- Not reported. Training used a multi-GPU setup.
### Citation
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```