# Parallel-T5-Translation-PyTorch

## Project Title and Introduction

**Parallel-T5-Translation-PyTorch** is a custom, optimized Transformer-based sequence-to-sequence model inspired by **T5-Small**, developed for **English-to-French Machine Translation**.

The core innovation in this project is the **Parallel Multi-Head Attention** mechanism, designed to enable experimentation with **model parallelism** and improve **attention efficiency**. This implementation provides a foundation for studying how attention heads can be executed concurrently to enhance performance in translation tasks.

---

## Custom Model Architecture: Parallel Attention

### Overview

Our model, **ParallelT5Small**, replaces the standard Multi-Head Attention (MHA) with a **novel Parallel Multi-Head Attention (P-MHA)** layer.

- **Standard MHA:** Computes one set of **Query (Q)**, **Key (K)**, and **Value (V)** projections, then splits the resulting vectors across all heads.
- **Parallel MHA (Proposed):** Splits the attention mechanism into **two parallel streams**, each using separate **Q/K/V projection weights** for half of the attention heads. The results from both parallel streams are **independently projected** back to the hidden dimension and then **summed** to form the final attention output (see the sketch in the Model Architecture section below).

### Goal

This architecture serves as a foundation for:

- Exploring **architectural variants** of the Transformer.
- Studying the **effects of parallelized attention** on translation performance.
- Investigating **scalability** in distributed training and **efficiency** on specialized hardware (e.g., GPUs or TPUs).

---

## Model Architecture

*Figure: Parallel T5 Architecture*
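The parallel attention described in the Overview can be sketched roughly as follows. This is a simplified, self-attention-only illustration: masking, dropout, cross-attention, and T5's relative position bias are omitted, and the class name `ParallelMultiHeadAttention` is a placeholder rather than the project's actual module.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelMultiHeadAttention(nn.Module):
    """Two half-sized attention streams with separate Q/K/V weights, summed at the end."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert n_heads % 2 == 0, "n_heads must be even to split into two streams"
        self.h = n_heads // 2                 # heads handled by each stream
        self.d_head = d_model // n_heads      # per-head dimension
        d_stream = self.h * self.d_head       # width of one stream's projections

        # Separate Q/K/V projection weights for each parallel stream.
        self.qkv_a = nn.Linear(d_model, 3 * d_stream)
        self.qkv_b = nn.Linear(d_model, 3 * d_stream)

        # Each stream is independently projected back to the hidden dimension.
        self.out_a = nn.Linear(d_stream, d_model)
        self.out_b = nn.Linear(d_stream, d_model)

    def _stream(self, x: torch.Tensor, qkv: nn.Linear, out: nn.Linear) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = qkv(x).chunk(3, dim=-1)
        # (B, T, h * d_head) -> (B, h, T, d_head)
        q, k, v = (t.view(B, T, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, self.h * self.d_head)
        return out(ctx)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two streams share no parameters, so they could be dispatched to
        # separate devices or CUDA streams; here they run sequentially for clarity.
        return self._stream(x, self.qkv_a, self.out_a) + self._stream(x, self.qkv_b, self.out_b)
```

A quick shape check: `ParallelMultiHeadAttention()(torch.randn(2, 10, 512)).shape` yields `torch.Size([2, 10, 512])`. Because the two streams are fully independent, they are natural candidates for the model-parallel experiments this project targets.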

---

## Training & Evaluation Metrics (Epoch 37)

| Metric | Train Result (Epoch 37) | Validation Result (Epoch 37) | Goal |
| --- | --- | --- | --- |
| Loss (Cross-Entropy) | 4.2213 | 4.8907 | Decrease loss below 2.0 |
| Token Accuracy | ≈ 18.18% | ≈ 15.20% | Achieve 60%+ |
| BLEU Score | To be implemented | To be implemented | Target: 30–40 |
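Since BLEU is listed as "to be implemented", here is a minimal sketch of how token accuracy and corpus-level BLEU could be computed. It assumes the `sacrebleu` package and T5's padding id of 0, neither of which is prescribed by this README.

```python
from typing import List

import torch
import sacrebleu


def token_accuracy(logits: torch.Tensor, labels: torch.Tensor, pad_id: int = 0) -> float:
    """Fraction of non-padding target tokens predicted exactly (T5's pad id is 0)."""
    preds = logits.argmax(dim=-1)                      # (batch, seq_len)
    mask = labels.ne(pad_id)                           # ignore padding positions
    correct = (preds.eq(labels) & mask).sum()
    return (correct.float() / mask.sum().clamp(min=1)).item()


def corpus_bleu(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level BLEU over decoded strings, one reference per hypothesis."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```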
---

## Installation and Setup

### Installation

To set up the project locally, follow these steps. **Python 3.8+** is required.

---

#### 1. Clone the Repository

```bash
git clone https://github.com/YourUsername/Parallel-T5-Translation-PyTorch.git
cd Parallel-T5-Translation-PyTorch
```

#### 2. Create and Activate a Conda Environment

```bash
conda create -n parallel-t5 python=3.9
conda activate parallel-t5
```

#### 3. Install PyTorch

```bash
# Install PyTorch (use the appropriate CUDA version for your setup)
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
```

#### 4. Install Project Dependencies

```bash
# Install project dependencies
pip install -r requirements.txt
```

## Training and Preprocessing

The workflow consists of two main steps: **data preparation** and **model training**.

---

### Step 1: Data Preprocessing

This step performs the following:

- Downloads the **GlobalVoices EN-FR** dataset
- Tokenizes data using the **T5 tokenizer**
- Splits into **train**, **validation**, and **test** sets
- Saves processed tensors to `./data/processed`

An illustrative sketch of these steps appears at the end of this README.

**Run the preprocessing:**

```bash
python run.py
```

---

## References

This project is built upon the foundational work of the **T5 model** and utilizes the publicly available **GlobalVoices dataset**.

### 🔹 T5 (Text-to-Text Transfer Transformer)

The model architecture is heavily inspired by the **T5 framework**, which casts all NLP problems into a text-to-text format.

**Paper:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019).*
**Link:** [https://arxiv.org/abs/1910.10683](https://arxiv.org/abs/1910.10683)

### 🔹 OPUS - GlobalVoices Dataset

The parallel English-French data used for training is sourced from the **OPUS collection's GlobalVoices corpus**.

**Resource:** Jörg Tiedemann. *The OPUS Parallel Corpus (2012).*
**Link (GlobalVoices source):** [https://object.pouta.csc.fi/OPUS-GlobalVoices/v2018q4/moses/en-fr.txt.zip](https://object.pouta.csc.fi/OPUS-GlobalVoices/v2018q4/moses/en-fr.txt.zip)

### 🔹 Hugging Face Transformers Library

The **AutoTokenizer** and several best practices for transformer training and dataset handling are derived from the **Hugging Face** ecosystem.

**Library:** [https://huggingface.co/docs/transformers/index](https://huggingface.co/docs/transformers/index)

---
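As a companion to the Step 1 description above, here is a minimal, hypothetical sketch of the preprocessing stage: tokenize EN-FR pairs with the T5 tokenizer via `AutoTokenizer`, split into train/validation/test, and save tensors to `./data/processed`. The function name, split fractions, and output file names are illustrative assumptions; the authoritative implementation is `run.py`.

```python
# Illustrative only: the actual preprocessing logic lives in run.py and may differ.
from pathlib import Path

import torch
from transformers import AutoTokenizer


def preprocess(pairs, out_dir="./data/processed", max_length=128, val_frac=0.1, test_frac=0.1):
    """Tokenize (english, french) sentence pairs and save train/val/test tensors."""
    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    en, fr = zip(*pairs)

    # T5 uses a shared vocabulary, so source and target go through the same tokenizer.
    src = tokenizer(list(en), padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    tgt = tokenizer(list(fr), padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")

    n = len(pairs)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    splits = {
        "train": slice(0, n - n_val - n_test),
        "validation": slice(n - n_val - n_test, n - n_test),
        "test": slice(n - n_test, n),
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, sl in splits.items():
        # Note: for training, pad ids in `labels` are often replaced with -100
        # so the cross-entropy loss ignores padded positions.
        torch.save({
            "input_ids": src["input_ids"][sl],
            "attention_mask": src["attention_mask"][sl],
            "labels": tgt["input_ids"][sl],
        }, out / f"{name}.pt")
```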