ChartQA Multimodal Fine-Tuning using Qwen2-VL

This repository contains LoRA adapters fine-tuned for visual question answering on chart images using the ChartQA dataset.

The adapters were trained on top of the base model Qwen/Qwen2-VL-2B-Instruct using parameter-efficient fine-tuning.


Model Details

Model Name: chartqa-qwen2vl-lora

Developed by: Vraj Detroja

Model Type: Vision-Language Model (Multimodal Transformer)

Base Model: Qwen/Qwen2-VL-2B-Instruct

Fine-Tuning Method: LoRA (Low-Rank Adaptation)

Training Platform: Kaggle

GPU: Tesla T4 (16GB VRAM)

Frameworks Used

  • PyTorch
  • Hugging Face Transformers
  • PEFT
  • BitsAndBytes
  • Accelerate

Model Description

This is a LoRA adapter for the Qwen2-VL-2B-Instruct vision-language model, fine-tuned on the ChartQA dataset to answer questions about chart images.

Example task:

Input: a chart image plus the question "Is the value of Favorable 38 in 2015?"

Output: Yes

The model processes both visual and textual information to generate answers.


Dataset

Training dataset: ChartQA

Dataset link: https://huggingface.co/datasets/HuggingFaceM4/ChartQA

ChartQA is a visual question answering dataset for chart understanding.

Dataset structure:

Field             Description
image             Chart image
query             Question about the chart
label             Ground-truth answer
human_or_machine  Annotation source (human or machine-generated)
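
As a rough illustration of how a record with these fields might be turned into a model input, the sketch below maps one record to the chat-message format used by Qwen2-VL processors. The helper name `record_to_messages` is hypothetical, and accepting `label` as either a string or a list of strings is an assumption about how answers are stored, not something stated in this card.

```python
from typing import Any


def record_to_messages(record: dict[str, Any]) -> tuple[list[dict], str]:
    """Map one ChartQA-style record to (chat messages, target answer).

    Hypothetical helper: the image itself is passed to the processor
    separately, so the message only carries an image placeholder.
    """
    label = record["label"]
    # Assumption: `label` may be a plain string or a list of strings.
    answer = label[0] if isinstance(label, (list, tuple)) else label
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": record["query"]},
            ],
        }
    ]
    return messages, str(answer)


# Usage with a stand-in record (the image field is omitted here).
sample = {"query": "Is the value of Favorable 38 in 2015?", "label": ["Yes"]}
messages, answer = record_to_messages(sample)
print(messages[0]["content"][1]["text"], "->", answer)
```

The returned messages have the same shape as the `messages` list used in the inference example later in this card.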

Dataset splits:

Split       Samples
Train       28,299
Validation  1,920
Test        2,500

For this project, 1,000 samples from the training split were used for fine-tuning.


Training Details

Fine-Tuning Method

The model was fine-tuned using LoRA (Low-Rank Adaptation).

Instead of training the entire model, LoRA freezes the base weights and trains small low-rank matrices added alongside selected weight matrices of the transformer.
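
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-style linear layer, assuming the standard formulation in which a frozen weight W is combined with a trainable low-rank product B·A scaled by alpha / r. The sizes, values, and function names are illustrative and are not taken from this project's training code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W is frozen (d_out x d_in); only A (r x d_in) and B (d_out x r) train.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = 3, d_out = 2, rank r = 1 (illustrative values only).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_init = [[0.0], [0.0]]  # B starts at zero, so the adapter is initially a no-op
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=2, r=1))  # equals the frozen output W x

B_trained = [[1.0], [0.0]]  # pretend training moved B away from zero
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))
```

Only A and B would receive gradients during training; W stays fixed, which is why the adapter file is small compared with the base model.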

Training statistics:

Metric                Value
Total parameters      ~2.21B
Trainable parameters  ~2.17M
Trainable percentage  ~0.1%

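Because gradients and optimizer state are kept only for the ~0.1% of parameters that are trainable, this significantly reduces GPU memory usage compared with full fine-tuning.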
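
A back-of-the-envelope check of these figures can be sketched as follows. Each adapted weight of shape d_out × d_in contributes r × (d_in + d_out) trainable parameters. The rank, the choice of target modules, and the layer dimensions below are all assumptions made for illustration (the dimensions should be verified against the base model's config.json); this card does not state the actual LoRA configuration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


# Assumed (not from the source): rank 16 on q_proj and v_proj in every layer.
RANK = 16
HIDDEN = 1536    # assumed hidden size
KV_DIM = 256     # assumed key/value projection width
NUM_LAYERS = 28  # assumed number of decoder layers

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable parameters under these assumptions")

# Trainable share using the reported totals (~2.17M of ~2.21B).
reported_share = 2.17e6 / 2.21e9 * 100
print(f"{reported_share:.3f}% of parameters are trainable")
```

Under these assumed settings the count lands near the reported ~2.17M, but that does not confirm which configuration was actually used.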


Quantization

The model was loaded using 4-bit quantization via BitsAndBytes.

Configuration:

load_in_4bit = True
bnb_4bit_quant_type = "nf4"
bnb_4bit_compute_dtype = torch.float16
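
These settings correspond to the `BitsAndBytesConfig` class in Hugging Face Transformers. A minimal sketch of how they might be passed when loading the base model follows; the function name `load_quantized_base` is hypothetical, and the heavy imports sit inside the function only so the settings dictionary can be inspected on a machine without a GPU stack.

```python
# Settings as listed above; the dtype is stored by name and resolved at load time.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
}


def load_quantized_base(model_id: str = "Qwen/Qwen2-VL-2B-Instruct"):
    """Load the base model in 4-bit NF4 (requires a CUDA GPU and bitsandbytes)."""
    import torch
    from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

    settings = dict(QUANT_SETTINGS)
    # Turn the dtype name into the actual torch dtype object.
    settings["bnb_4bit_compute_dtype"] = getattr(
        torch, settings["bnb_4bit_compute_dtype"]
    )
    quant_config = BitsAndBytesConfig(**settings)
    return Qwen2VLForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `load_quantized_base()` downloads the base weights, so it is best run in the same GPU environment used for training or inference.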

Benefits:

  • Reduced VRAM usage
  • Faster training
  • Enables training on a T4 GPU

Training Configuration

Hyperparameters used:

Batch size: 1
Gradient accumulation: 4
Learning rate: 2e-4
Epochs: 1
Training samples: 1000
Precision: FP16

Gradient checkpointing was enabled to reduce memory consumption.
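
If the Hugging Face `Trainer` were used (an assumption; the training script is not shown in this card), these hyperparameters would map onto `TrainingArguments` roughly as follows. The `output_dir` value and the helper name `build_training_args` are placeholders.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 1,
    "fp16": True,
    "gradient_checkpointing": True,
}


def build_training_args(output_dir: str = "chartqa-qwen2vl-lora-out"):
    """Create TrainingArguments; fp16=True expects a CUDA-capable GPU."""
    from transformers import TrainingArguments  # imported lazily: pulls in torch

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


# Effective batch size per optimizer step on a single GPU.
effective_batch = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
print(effective_batch)
```

Gradient accumulation lets a batch size of 1 fit in memory while each optimizer update still averages over several samples.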

Training results:

Metric               Value
Training steps       250
Final training loss  ~7.59

Hardware Used

Component  Value
GPU        Tesla T4
VRAM       16 GB
Platform   Kaggle
Framework  PyTorch

Training time: ~9 minutes for 250 training steps.
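
The reported step count and duration are consistent with the hyperparameters listed earlier, as this short calculation shows. It assumes a single GPU and treats the ~9 minute figure as exact, so the per-step time is only approximate.

```python
import math

train_samples = 1000
batch_size = 1
grad_accum = 4
epochs = 1

# One optimizer step consumes batch_size * grad_accum samples (single GPU assumed).
steps_per_epoch = math.ceil(train_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs
print(total_steps)

# Rough throughput from the reported ~9 minutes of training.
training_seconds = 9 * 60
print(round(training_seconds / total_steps, 2), "seconds per optimizer step")
print(round(train_samples / training_seconds, 2), "samples per second")
```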


How to Use the Model

This repository contains LoRA adapters, not the full model.

You must load the base model first.

Example:

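from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

# Load the base model; device_map="auto" places the weights on the available device(s).
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto"
)

# Attach the LoRA adapter weights from this repository to the base model.
model = PeftModel.from_pretrained(
    base_model,
    "vrajdetrojapes/chartqa-qwen2vl-lora"
)
model.eval()  # inference mode

# The processor (tokenizer plus image preprocessing) comes from the base model.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct"
)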

Example Inference

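from PIL import Image

# Assumes `model` and `processor` were loaded as shown in the previous section.
image = Image.open("sample_chart.png").convert("RGB")

question = "Is the value of Favorable 38 in 2015?"

# Chat-format message: an image placeholder followed by the question text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ],
    }
]

# Render the chat template to a prompt string; the image is passed separately below.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=50
)

# generate() returns prompt + completion, so drop the prompt tokens before decoding.
answer_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(answer_ids, skip_special_tokens=True))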

Intended Use

The model is intended for:

  • Chart question answering
  • Multimodal reasoning research
  • Vision-language experimentation
  • Educational purposes

Limitations

The model has several limitations:

  • Trained on a small subset of the dataset (1,000 of 28,299 training samples, for one epoch)
  • May struggle with complex chart reasoning
  • Limited generalization beyond chart datasets
  • Not suitable for production systems

Ethical Considerations

Users should be aware that:

  • The model may generate incorrect answers.
  • Chart interpretation errors are possible.
  • Outputs should be validated for critical applications.

Citation

@misc{chartqa_qwen2vl_lora,
  author = {Detroja, Vraj},
  title = {ChartQA Multimodal Fine-Tuning using Qwen2-VL},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/vrajdetrojapes/chartqa-qwen2vl-lora}
}

Author

Vraj Detroja

Natural Language Processing with Deep Learning
Multimodal Fine-Tuning Project
