# Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Paper: [arXiv:2506.00633](https://arxiv.org/abs/2506.00633)
### How to Use
Run `diff_model_demo.py` from the code release for a one-off generation from text. This repository provides three checkpoints:

- `autoencoder_epoch273.pt`: 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt`: Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt`: CLIP3D weights for encoding reports.

Download the weights from the Hub:

```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"
snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)
```
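A minimal sketch of loading the downloaded state dicts with PyTorch. Only the file names come from this repository; the model classes (3D VAE, diffusion UNet, CLIP3D) are defined in the code release, so the loading pattern below is an assumption.

```python
import torch

# Path passed as local_dir to snapshot_download above; adjust as needed.
weights_dir = "your_local_path"

# These load raw state dicts only; instantiate the matching modules from the
# code release before calling load_state_dict on them.
vae_state = torch.load(f"{weights_dir}/autoencoder_epoch273.pt", map_location="cpu")
unet_state = torch.load(f"{weights_dir}/unet_rflow_200ep.pt", map_location="cpu")
clip3d_state = torch.load(f"{weights_dir}/CLIP3D_Finding_Impression_30ep.pt", map_location="cpu")
```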
### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on paired CT volumes and radiology reports (a CLIP-style contrastive sketch follows this list).
- VAE checkpoint from the MONAI MAISI tutorial: https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings (see the rectified-flow sketch below).
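A minimal sketch of the CLIP-style contrastive objective used for vision-language alignment, with random tensors standing in for the encoder outputs. The encoders, embedding size, and temperature value are assumptions; only the symmetric InfoNCE loss itself is the named technique.

```python
import torch
import torch.nn.functional as F

batch = 4
# Stand-ins for the 3D image encoder and report (findings/impression) encoder outputs.
img_emb = F.normalize(torch.randn(batch, 512), dim=-1)
txt_emb = F.normalize(torch.randn(batch, 512), dim=-1)

logit_scale = 14.3  # ~1/0.07; a learnable parameter in real CLIP-style training
logits = logit_scale * img_emb @ txt_emb.t()  # cosine-similarity matrix

# Symmetric InfoNCE: matched volume/report pairs sit on the diagonal.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```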
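A minimal sketch of one rectified-flow training step in latent space, conditioned on a text embedding. The toy `TinyUNet3D`, the latent and embedding shapes, and the conditioning scheme are assumptions standing in for the real UNet from the code release; the straight-path interpolation and velocity-matching loss are the rectified-flow objective itself.

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Hypothetical stand-in for the real conditioned 3D diffusion UNet."""
    def __init__(self, channels=4, cond_dim=512):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, x, t, cond):
        # Broadcast the text conditioning over the 3D latent grid (toy scheme).
        c = self.cond_proj(cond)[:, :, None, None, None]
        return self.conv(x + c) * t[:, None, None, None, None]

unet = TinyUNet3D()
opt = torch.optim.AdamW(unet.parameters(), lr=1e-4)

z1 = torch.randn(2, 4, 8, 8, 8)   # "data": VAE latents of CT volumes (toy shape)
text_emb = torch.randn(2, 512)    # CLIP3D report embeddings (toy values)
z0 = torch.randn_like(z1)         # noise endpoint
t = torch.rand(z1.size(0))        # uniform time in [0, 1]

# Rectified flow: linear interpolation between noise and data ...
zt = (1 - t)[:, None, None, None, None] * z0 + t[:, None, None, None, None] * z1
# ... with a constant target velocity along the straight path.
target_velocity = z1 - z0

pred_velocity = unet(zt, t, text_emb)
loss = torch.mean((pred_velocity - target_velocity) ** 2)

opt.zero_grad()
loss.backward()
opt.step()
```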
### Evaluation
- See the paper for quantitative and qualitative results.
### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
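A minimal sketch for fetching the released synthetic scans with `huggingface_hub`. The `local_dir` name is arbitrary, and the directory layout and volume file format inside the dataset repo are not described here, so inspect the download before loading.

```python
from huggingface_hub import snapshot_download

# Fetch the 1,000 generated CT scans (a dataset repo, hence repo_type="dataset").
scans_dir = snapshot_download(
    repo_id="dmolino/CT-RATE_Generated_Scans",
    repo_type="dataset",
    local_dir="ct_rate_generated_scans",
)
print(scans_dir)  # inspect contents before assuming a particular volume format
```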
### Environmental Impact
- Not reported. Training used a multi-GPU setup.
### Citation
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```