
Text-to-CT Model Weights

Checkpoints for “Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining” (Molino et al., 2025).

Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Model Description

  • Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
  • Model type: 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.
  • License: Apache 2.0 (same as the code release).
  • Sources: Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633
  • Demo: Use diff_model_demo.py from the code release for a one-off generation from text (a high-level sampling sketch follows this list).
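
At a high level, generation encodes the report with CLIP3D, integrates the learned latent velocity field from noise to a CT latent, and decodes with the VAE. The sketch below illustrates this flow with plain Euler integration; `unet`, `vae`, `text_emb`, and the latent shape are hypothetical stand-ins, not the release's actual API (see diff_model_demo.py for the real entry point).

```python
# Illustrative rectified-flow sampling sketch (hypothetical names, not the
# release's API): Euler-integrate the predicted velocity from noise (t=0)
# to data (t=1) in latent space, then decode to a CT volume with the VAE.
import torch

@torch.no_grad()
def generate_ct(unet, vae, text_emb, steps=50, latent_shape=(1, 4, 32, 32, 32)):
    z = torch.randn(latent_shape, device=text_emb.device)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=z.device)
        z = z + dt * unet(z, t, text_emb)  # one Euler step along the velocity field
    return vae.decode(z)  # map the latent back to voxel space
```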

Intended Use

  • Direct use: Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
  • Downstream use: Fine-tuning or integration into broader research pipelines.
  • Out of scope: Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.

Risks & Limitations

  • Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
  • Synthetic outputs may contain artifacts; do not use for patient care.

Files included

  • autoencoder_epoch273.pt — 3D VAE for latent compression/decoding.
  • unet_rflow_200ep.pt — Diffusion UNet trained with rectified flow.
  • CLIP3D_Finding_Impression_30ep.pt — CLIP3D weights for encoding reports.
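
As a quick sanity check after downloading, the files can be opened as ordinary PyTorch checkpoints. The sketch below assumes each is a state dict (or a dict wrapping one); that layout is an assumption, not documented structure.

```python
# Sanity-check sketch: count the parameter tensors in each checkpoint,
# assuming standard PyTorch serialization (state dict or a wrapping dict).
import torch

for name in [
    "autoencoder_epoch273.pt",
    "unet_rflow_200ep.pt",
    "CLIP3D_Finding_Impression_30ep.pt",
]:
    ckpt = torch.load(name, map_location="cpu")  # may need weights_only=False on newer PyTorch
    state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    print(f"{name}: {len(state)} entries")
```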

How to Get Started (Python)

```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"

# Download all three checkpoints into a local directory.
snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",  # destination for the .pt files
)
```
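
If you only need one of the files, `hf_hub_download` fetches a single checkpoint instead of the full snapshot:

```python
from huggingface_hub import hf_hub_download

# Download just the diffusion UNet checkpoint.
unet_file = hf_hub_download(
    repo_id="dmolino/text2ct-weights",
    filename="unet_rflow_200ep.pt",
    local_dir="your_local_path",
)
```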

Use these in the code release configs (a minimal path-wiring sketch follows the list):

  • `trained_autoencoder_path` → `autoencoder_path`
  • `existing_ckpt_filepath` / `model_filename` → `unet_path`
  • `clip_weights` (for report embeddings) → `clip_path`
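
A minimal sketch, assuming the config fields above take plain filesystem paths, of building them from the download directory:

```python
import os

# Hypothetical wiring: point each config field at the matching checkpoint file.
local_dir = "your_local_path"  # same directory passed to snapshot_download
autoencoder_path = os.path.join(local_dir, "autoencoder_epoch273.pt")
unet_path = os.path.join(local_dir, "unet_rflow_200ep.pt")
clip_path = os.path.join(local_dir, "CLIP3D_Finding_Impression_30ep.pt")
```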


### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.

### Training Procedure (summary)
- CLIP3D trained for contrastive vision-language alignment on paired CT volumes and reports.
- 3D VAE checkpoint taken from MONAI's MAISI tutorial: https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on the text embeddings (see the sketch below).
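
For intuition, a rectified-flow training step regresses the UNet's predicted velocity onto the straight-line velocity between a noise sample and the clean latent. The sketch below is illustrative only; `unet`, `z1`, and `text_emb` are hypothetical names, not the repository's training code.

```python
# Illustrative RFlow objective: match the model's velocity prediction to the
# constant velocity of the straight path from noise z0 to clean latent z1.
import torch
import torch.nn.functional as F

def rflow_loss(unet, z1, text_emb):
    """z1: clean VAE latents (B, C, D, H, W); text_emb: CLIP3D report embeddings."""
    z0 = torch.randn_like(z1)                      # noise endpoint of the path
    t = torch.rand(z1.shape[0], device=z1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)                    # broadcast over 3D latent dims
    zt = (1 - t_) * z0 + t_ * z1                   # point on the straight path
    v_pred = unet(zt, t, text_emb)                 # text-conditioned velocity
    return F.mse_loss(v_pred, z1 - z0)             # regress onto path velocity
```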

### Evaluation
- See paper for quantitative and qualitative results.

### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
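
These can be pulled with the same `snapshot_download` pattern used above, for example:

```python
from huggingface_hub import snapshot_download

# Fetch the 1,000 released synthetic CT scans (dataset repo, not model repo).
snapshot_download(
    repo_id="dmolino/CT-RATE_Generated_Scans",
    repo_type="dataset",
    local_dir="generated_scans",
)
```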

### Environmental Impact
- Not reported. Training used a multi-GPU setup.

### Citation
If you use these weights or code, please cite the paper:

```bibtex
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```
