# Parallel-T5-Translation-PyTorch

## Project Title and Introduction

**Parallel-T5-Translation-PyTorch** is a custom, optimized Transformer-based sequence-to-sequence model inspired by **T5-Small**, developed for **English-to-French Machine Translation**.

The core innovation in this project is the **Parallel Multi-Head Attention** mechanism, designed to enable experimentation with **model parallelism** and improve **attention efficiency**. This implementation provides a foundation for studying how attention heads can be executed concurrently to enhance performance in translation tasks.

---

## Custom Model Architecture: Parallel Attention

### Overview

Our model, **ParallelT5Small**, replaces the standard Multi-Head Attention (MHA) with a **novel Parallel Multi-Head Attention (P-MHA)** layer.

- **Standard MHA:** Computes one set of **Query (Q)**, **Key (K)**, and **Value (V)** projections, then splits the resulting vectors across all heads.
- **Parallel MHA (Proposed):** Splits the attention mechanism into **two parallel streams**, each using separate **Q/K/V projection weights** for half of the attention heads. The results from both parallel streams are **independently projected** back to the hidden dimension and then **summed** to form the final attention output (see the sketch in the Model Architecture section below).

### Goal

This architecture serves as a foundation for:

- Exploring **architectural variants** of the Transformer.
- Studying the **effects of parallelized attention** on translation performance.
- Investigating **scalability** in distributed training and **efficiency** on specialized hardware (e.g., GPUs or TPUs).

---

## Model Architecture

*Figure: Parallel T5 Architecture*
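The parallel attention described in the Overview can be sketched roughly as follows. This is a simplified, self-attention-only illustration: masking, dropout, cross-attention, and T5's relative position bias are omitted, and the class name `ParallelMultiHeadAttention` is a placeholder rather than the project's actual module.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelMultiHeadAttention(nn.Module):
    """Two half-sized attention streams with separate Q/K/V weights, summed at the end."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert n_heads % 2 == 0, "n_heads must be even to split into two streams"
        self.h = n_heads // 2                 # heads handled by each stream
        self.d_head = d_model // n_heads      # per-head dimension
        d_stream = self.h * self.d_head       # width of one stream's projections

        # Separate Q/K/V projection weights for each parallel stream.
        self.qkv_a = nn.Linear(d_model, 3 * d_stream)
        self.qkv_b = nn.Linear(d_model, 3 * d_stream)

        # Each stream is independently projected back to the hidden dimension.
        self.out_a = nn.Linear(d_stream, d_model)
        self.out_b = nn.Linear(d_stream, d_model)

    def _stream(self, x: torch.Tensor, qkv: nn.Linear, out: nn.Linear) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = qkv(x).chunk(3, dim=-1)
        # (B, T, h * d_head) -> (B, h, T, d_head)
        q, k, v = (t.view(B, T, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, self.h * self.d_head)
        return out(ctx)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two streams share no parameters, so they could be dispatched to
        # separate devices or CUDA streams; here they run sequentially for clarity.
        return self._stream(x, self.qkv_a, self.out_a) + self._stream(x, self.qkv_b, self.out_b)
```

A quick shape check: `ParallelMultiHeadAttention()(torch.randn(2, 10, 512)).shape` yields `torch.Size([2, 10, 512])`. Because the two streams are fully independent, they are natural candidates for the model-parallel experiments this project targets.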

---

## Training & Evaluation Metrics (Epoch 37)

| Metric | Train Result (Epoch 37) | Validation Result (Epoch 37) | Goal |
| --- | --- | --- | --- |
| Loss (Cross-Entropy) | 4.2213 | 4.8907 | Decrease loss below 2.0 |
| Token Accuracy | ≈ 18.18% | ≈ 15.20% | Achieve 60%+ |
| BLEU Score | To be implemented | To be implemented | Target: 30–40 |
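Since BLEU is listed as "to be implemented", here is a minimal sketch of how token accuracy and corpus-level BLEU could be computed. It assumes the `sacrebleu` package and T5's padding id of 0, neither of which is prescribed by this README.

```python
from typing import List

import torch
import sacrebleu


def token_accuracy(logits: torch.Tensor, labels: torch.Tensor, pad_id: int = 0) -> float:
    """Fraction of non-padding target tokens predicted exactly (T5's pad id is 0)."""
    preds = logits.argmax(dim=-1)                      # (batch, seq_len)
    mask = labels.ne(pad_id)                           # ignore padding positions
    correct = (preds.eq(labels) & mask).sum()
    return (correct.float() / mask.sum().clamp(min=1)).item()


def corpus_bleu(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level BLEU over decoded strings, one reference per hypothesis."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```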
---

## Installation and Setup

### Installation

To set up the project locally, follow these steps. **Python 3.8+** is required.

---

#### 1. Clone the Repository

```bash
git clone https://github.com/YourUsername/Parallel-T5-Translation-PyTorch.git
cd Parallel-T5-Translation-PyTorch
```

#### 2. Create and Activate a Conda Environment

```bash
conda create -n parallel-t5 python=3.9
conda activate parallel-t5
```

#### 3. Install PyTorch

```bash
# Install PyTorch (use the appropriate CUDA version for your setup)
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
```

#### 4. Install Project Dependencies

```bash
# Install project dependencies
pip install -r requirements.txt
```

## Training and Preprocessing

The workflow consists of two main steps: **data preparation** and **model training**.

---

### Step 1: Data Preprocessing

This step performs the following:

- Downloads the **GlobalVoices EN-FR** dataset
- Tokenizes data using the **T5 tokenizer**
- Splits into **train**, **validation**, and **test** sets
- Saves processed tensors to `./data/processed`

An illustrative sketch of these steps appears at the end of this README.

**Run the preprocessing:**

```bash
python run.py
```

---

## References

This project is built upon the foundational work of the **T5 model** and utilizes the publicly available **GlobalVoices dataset**.

### 🔹 T5 (Text-to-Text Transfer Transformer)

The model architecture is heavily inspired by the **T5 framework**, which casts all NLP problems into a text-to-text format.

**Paper:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019).*
**Link:** [https://arxiv.org/abs/1910.10683](https://arxiv.org/abs/1910.10683)

### 🔹 OPUS - GlobalVoices Dataset

The parallel English-French data used for training is sourced from the **OPUS collection's GlobalVoices corpus**.

**Resource:** Jörg Tiedemann. *The OPUS Parallel Corpus (2012).*
**Link (GlobalVoices source):** [https://object.pouta.csc.fi/OPUS-GlobalVoices/v2018q4/moses/en-fr.txt.zip](https://object.pouta.csc.fi/OPUS-GlobalVoices/v2018q4/moses/en-fr.txt.zip)

### 🔹 Hugging Face Transformers Library

The **AutoTokenizer** and several best practices for transformer training and dataset handling are derived from the **Hugging Face** ecosystem.

**Library:** [https://huggingface.co/docs/transformers/index](https://huggingface.co/docs/transformers/index)

---
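As a companion to the Step 1 description above, here is a minimal, hypothetical sketch of the preprocessing stage: tokenize EN-FR pairs with the T5 tokenizer via `AutoTokenizer`, split into train/validation/test, and save tensors to `./data/processed`. The function name, split fractions, and output file names are illustrative assumptions; the authoritative implementation is `run.py`.

```python
# Illustrative only: the actual preprocessing logic lives in run.py and may differ.
from pathlib import Path

import torch
from transformers import AutoTokenizer


def preprocess(pairs, out_dir="./data/processed", max_length=128, val_frac=0.1, test_frac=0.1):
    """Tokenize (english, french) sentence pairs and save train/val/test tensors."""
    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    en, fr = zip(*pairs)

    # T5 uses a shared vocabulary, so source and target go through the same tokenizer.
    src = tokenizer(list(en), padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    tgt = tokenizer(list(fr), padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")

    n = len(pairs)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    splits = {
        "train": slice(0, n - n_val - n_test),
        "validation": slice(n - n_val - n_test, n - n_test),
        "test": slice(n - n_test, n),
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, sl in splits.items():
        # Note: for training, pad ids in `labels` are often replaced with -100
        # so the cross-entropy loss ignores padded positions.
        torch.save({
            "input_ids": src["input_ids"][sl],
            "attention_mask": src["attention_mask"][sl],
            "labels": tgt["input_ids"][sl],
        }, out / f"{name}.pt")
```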