---
license: mit
base_model:
- bilalzafar/CentralBank-BERT
pipeline_tag: token-classification
tags:
- NER
- named-entity-recognition
- central-bank
- BIS
- speeches
- finance
- economics
- monetary policy
datasets:
- bilalzafar/BIS-Speeches-NER-dataset
language:
- en
metrics:
- f1
- accuracy
library_name: transformers
---

# Central Bank-BERT for Named Entity Recognition (NER)

A **domain-adapted BERT model ([`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT))** was fine-tuned for **Named Entity Recognition (NER)** in central banking discourse. The model automatically identifies and labels key entities in central bank speeches and related documents, focusing on three categories of interest:

* **AUTHOR / SPEAKER** – the individual delivering the speech or statement
* **POSITION** – the official title or role of the speaker (e.g., Governor, Deputy Governor, Board Member)
* **AFFILIATION** – the institution or organization associated with the speaker (e.g., Bank of Japan, European Central Bank, Bank of England)

The **COUNTRY** label was not explicitly modeled, since this information can be reliably **inferred from the affiliation of the central bank**.

## Data

* **Source**: **BIS database of central bank speeches (1996–2024)**
* **Corpus Size**: 17,648 speeches, with 1,961 held out for validation.
* **Input Field**: *Speech descriptions*, which typically contain a short speech title along with the name, position, and institutional affiliation of the speaker.

**Annotation Process**:

1. A subset of short speech descriptions was **manually annotated** with entity spans for Author, Position, and Affiliation.
2. This annotated subset was used to **train an initial NER model**.
3. The model was then applied to the larger dataset (1996–2024) to generate preliminary labels.
4. All generated labels were **manually reviewed and corrected**, ensuring complete and consistent annotation across the entire corpus of available speeches.

This approach combined **manual expertise** with **machine-assisted annotation**, making it feasible to build a large-scale, high-quality dataset covering nearly three decades of central bank communication.

## Data Preparation

1. **Normalization**: Lowercasing, removal of diacritics, and unification of punctuation.
2. **Alias resolution**: Institution abbreviations normalized (e.g., “BOJ” → “Bank of Japan”, “ECB” → “European Central Bank”).
3. **Entity alignment**: Fuzzy string matching used to locate annotated entities in the raw text.
4. **BIO Encoding**:
   * Tokenization with the *BERT WordPiece tokenizer*.
   * Conversion of annotations into **BIO tags** (`B-`, `I-`, `O`) at the token level.
   * Construction of a training file in **JSONL format** with `tokens` and `ner_tags`.

## Model Training

* **Base model**: [`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a domain-adapted BERT trained on central banking corpora.
* **Task head**: Token classification layer with `num_labels = 7` (BIO scheme for Author, Position, Affiliation).
* **Token alignment**: Word-to-token mapping with subword label propagation (`-100` used for ignored positions); see the sketch after this list.
* **Training setup**:
  * Optimizer: AdamW with weight decay `0.01`
  * Learning rate: `2e-5`
  * Batch size: `16` (train and eval)
  * Epochs: `3`
  * Mixed precision (`fp16`) when available
  * Evaluation with `seqeval` metrics (precision, recall, F1)
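The sub-word label alignment described above follows the common Hugging Face pattern for token classification. The sketch below is a minimal illustration rather than the exact training code: the `LABELS` ordering and the example record are hypothetical (the authoritative `id2label` map ships with the model config), but the record mirrors the `tokens`/`ner_tags` JSONL layout, and the function shows the `-100` masking of special tokens and continuation word pieces.

```python
from transformers import AutoTokenizer

# Hypothetical 7-tag BIO inventory; check the model config for the real id2label mapping.
LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]

# Any BERT WordPiece tokenizer works here; the base model's tokenizer is assumed.
tokenizer = AutoTokenizer.from_pretrained("bilalzafar/CentralBank-BERT")

def align_labels(tokens, ner_tags):
    """Tokenize pre-split words and propagate BIO label ids to WordPiece sub-tokens.

    Special tokens and non-initial word pieces receive -100 so the loss ignores them.
    Requires a fast tokenizer (needed for word_ids()).
    """
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    labels, prev_word = [], None
    for word_id in enc.word_ids():
        if word_id is None:             # [CLS] / [SEP]
            labels.append(-100)
        elif word_id != prev_word:      # first piece of a word keeps the word's label
            labels.append(ner_tags[word_id])
        else:                           # continuation pieces are masked out
            labels.append(-100)
        prev_word = word_id
    enc["labels"] = labels
    return enc

# One illustrative JSONL-style record (label ids index into LABELS above).
record = {"tokens": ["speech", "by", "mr", "yi", "gang", ",", "governor",
                     "of", "the", "people's", "bank", "of", "china"],
          "ner_tags": [0, 0, 0, 1, 2, 0, 3, 0, 0, 5, 6, 6, 6]}

print(align_labels(record["tokens"], record["ner_tags"])["labels"])
```

Records encoded this way can be fed to a token-classification head with `num_labels = 7` and trained with the hyperparameters listed above.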
## Results

The model was trained on **17,648 annotated speeches** with a **1,961-speech validation set**. Evaluation metrics are reported using **entity-level precision, recall, and F1-score** from the `seqeval` library.

**Final Validation Performance (Epoch 3):**

| Entity Type     | Precision  | Recall     | F1-score   | Support |
| --------------- | ---------- | ---------- | ---------- | ------- |
| **Affiliation** | 0.9850     | 0.9862     | 0.9856     | 1,734   |
| **Author**      | 0.9816     | 0.9912     | 0.9864     | 1,936   |
| **Position**    | 0.9735     | 0.9846     | 0.9790     | 1,942   |
| **Overall**     | **0.9798** | **0.9862** | **0.9830** | —       |

* **Accuracy (token-level):** 0.9956
* **Overall F1 (macro):** 0.983

The results show **high precision and recall across all three categories**, confirming that the model provides reliable structured metadata extraction from central bank communications.
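For reference, the snippet below is a minimal sketch of how `seqeval` computes such entity-level scores; the tag sequences are illustrative and not drawn from the validation data.

```python
from seqeval.metrics import classification_report

# Illustrative gold and predicted BIO sequences; one inner list per speech description.
y_true = [["O", "B-AUTHOR", "I-AUTHOR", "O", "B-POSITION", "O", "B-AFFILIATION", "I-AFFILIATION"]]
y_pred = [["O", "B-AUTHOR", "I-AUTHOR", "O", "B-POSITION", "O", "B-AFFILIATION", "O"]]

# Entity-level precision / recall / F1 per class: a predicted span only counts as
# correct if both its boundaries and its type match the gold span exactly.
print(classification_report(y_true, y_pred, digits=4))
```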
---

## Other CBDC Models

This model is part of the **CentralBank-BERT / CBDC model family**, a suite of domain-adapted classifiers for analyzing central-bank communication.

| **Model** | **Purpose** | **Intended Use** | **Link** |
| --- | --- | --- | --- |
| **bilalzafar/CentralBank-BERT** | Domain-adaptive masked LM trained on BIS speeches (1996–2024). | Base encoder for CBDC downstream tasks; fill-mask tasks. | [CentralBank-BERT](https://huggingface.co/bilalzafar/CentralBank-BERT) |
| **bilalzafar/CBDC-BERT** | Binary classifier: CBDC vs. non-CBDC. | Flagging CBDC-related discourse in large corpora. | [CBDC-BERT](https://huggingface.co/bilalzafar/CBDC-BERT) |
| **bilalzafar/CBDC-Stance** | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | [CBDC-Stance](https://huggingface.co/bilalzafar/CBDC-Stance) |
| **bilalzafar/CBDC-Sentiment** | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | [CBDC-Sentiment](https://huggingface.co/bilalzafar/CBDC-Sentiment) |
| **bilalzafar/CBDC-Type** | Classifies Retail, Wholesale, and General CBDC mentions. | Distinguishing policy focus (retail vs. wholesale). | [CBDC-Type](https://huggingface.co/bilalzafar/CBDC-Type) |
| **bilalzafar/CBDC-Discourse** | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | [CBDC-Discourse](https://huggingface.co/bilalzafar/CBDC-Discourse) |
| **bilalzafar/CentralBank-NER** | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | [CentralBank-NER](https://huggingface.co/bilalzafar/CentralBank-NER) |

## Repository and Replication Package

All **training pipelines, preprocessing scripts, evaluation notebooks, and result outputs** are available in the companion GitHub repository:
🔗 **[https://github.com/bilalezafar/CentralBank-BERT](https://github.com/bilalezafar/CentralBank-BERT)**

---

## Usage

```python
from transformers import pipeline

# HF model repo
model = "bilalzafar/CentralBank-NER"

ner = pipeline(
    task="token-classification",
    model=model,
    tokenizer=model,
    aggregation_strategy="simple"  # merges sub-word pieces into whole entities
)

# Example text
text = "Speech by Mr Yi Gang, Governor of the People's Bank of China, at the IMF Annual Meeting."

for ent in ner(text):
    print(f"{ent['entity_group']:12} {ent['word']:<25} score={ent['score']:.3f}")

# Example output:
# AUTHOR       yi gang                   score=0.997
# POSITION     governor                  score=0.999
# AFFILIATION  people ' s bank of china  score=0.999
```

---

## Citation

If you use this model, please cite as:

**Zafar, M. B. (2025). *CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse*. SSRN. [https://papers.ssrn.com/abstract=5404456](https://papers.ssrn.com/abstract=5404456)**

```bibtex
@article{zafar2025centralbankbert,
  title={CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse},
  author={Zafar, Muhammad Bilal},
  year={2025},
  journal={SSRN Electronic Journal},
  url={https://papers.ssrn.com/abstract=5404456}
}
```