---
tags:
- spacy
- token-classification
- ancient-greek
language:
- grc
license: mit
model-index:
- name: grc_dep_web_md
  results:
  - task:
      name: POS Tagging
      type: token-classification
    metrics:
    - name: POS Accuracy
      type: accuracy
      value: 0.9175
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9154
  - task:
      name: Lemmatization
      type: token-classification
    metrics:
    - name: Lemma Accuracy
      type: accuracy
      value: 0.9359
  - task:
      name: Dependency Parsing
      type: token-classification
    metrics:
    - name: Labeled Attachment Score
      type: f_score
      value: 0.6731
---

# grc_dep_web_md

**Ancient Greek** pipeline for [spaCy](https://spacy.io), part of the [LatinCy](https://huggingface.co/latincy) project.

**Experimental beta release.** This is part of the first generation of Ancient Greek models porting the [LatinCy](https://huggingface.co/latincy) Latin pipeline infrastructure to Ancient Greek. Expect rough edges; scores and component behavior will improve as training data is harmonized and curated through the LatinCy flywheel (train, evaluate, curate, retrain).

Medium model with 50,000-key floret vectors (300 dimensions). Trained on Universal Dependencies Ancient Greek treebanks (PTNK, PROIEL, Perseus) with a 1.2M-entry lookup lemmatizer overlay built from CLTK Morpheus, UD treebanks, and Wiktionary.

| Feature | Description |
| --- | --- |
| **Name** | `grc_dep_web_md` |
| **Version** | `3.8.1` |
| **spaCy** | `>=3.8.11,<3.9.0` |
| **Default Pipeline** | `senter`, `tok2vec`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser` |
| **Components** | `senter`, `tok2vec`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser` |
| **Vectors** | floret, 50,000 unique vectors (300 dimensions) |
| **License** | `MIT` |
| **Author** | [Patrick J. Burns](https://huggingface.co/latincy) |

## Install

```bash
pip install https://huggingface.co/latincy/grc_dep_web_md/resolve/main/grc_dep_web_md-3.8.1-py3-none-any.whl
```

## Usage

```python
import spacy

nlp = spacy.load("grc_dep_web_md")
doc = nlp("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")

for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)
```

## Evaluation

Scores on held-out UD test data (combined PTNK + PROIEL + Perseus).

| Metric | Score |
| --- | --- |
| **POS (UPOS) Accuracy** | 91.75 |
| **TAG (XPOS) Accuracy** | 91.54 |
| **Morph (UFeats) Accuracy** | 81.32 |
| **Lemma Accuracy** | 93.59 |
| **Unlabeled Attachment Score (UAS)** | 75.71 |
| **Labeled Attachment Score (LAS)** | 67.31 |
| **Sentences F-Score** | 88.18 |

## Training Data

| Source | Description |
| --- | --- |
| [UD_Ancient_Greek-PTNK](https://github.com/UniversalDependencies/UD_Ancient_Greek-PTNK) | Septuagint (Codex Alexandrinus) |
| [UD_Ancient_Greek-PROIEL](https://github.com/UniversalDependencies/UD_Ancient_Greek-PROIEL) | PROIEL Ancient Greek treebank |
| [UD_Ancient_Greek-Perseus](https://github.com/UniversalDependencies/UD_Ancient_Greek-Perseus) | Perseus Ancient Greek treebank |

## Components

- **tok2vec** -- Shared token-to-vector encoder (CNN, width 96)
- **tagger** -- Fine-grained POS tagger (XPOS, harmonized 16-tag tagset)
- **morphologizer** -- Morphological feature assignment (UPOS + UFeats)
- **trainable_lemmatizer** -- Edit-tree lemmatizer
- **lookup_lemmatizer** -- 1.2M-entry dictionary lemmatizer overlay (CLTK Morpheus + UD + Wiktionary); normalizes grave accents to acute at query time
- **parser** -- Dependency parser (transition-based)
- **senter** -- Sentence segmenter

## Label Scheme

1796 labels for 3 components:

**`tagger`**: `adjective`, `adverb`, `conjunction`, `conjunction_adverb`, `conjunction_pronoun`, `determiner`, `interjection`, `noun`, `number`, `particle`, `preposition`, `pronoun`, `proper_noun`, `punc`, `unknown`, `verb`

**`morphologizer`**: 1749 morphological feature combinations

**`parser`**: `ROOT`, `acl`, `advcl`, `advmod`, `amod`, `appos`, `aux`, `case`, `cc`, `ccomp`, `conj`, `cop`, `csubj`, `dep`, `det`, `discourse`, `dislocated`, `fixed`, `flat`, `iobj`, `mark`, `nmod`, `nsubj`, `nummod`, `obj`, `obl`, `orphan`, `parataxis`, `punct`, `vocative`, `xcomp`
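
The Components list notes that `lookup_lemmatizer` normalizes grave accents to acute at query time. The snippet below is a minimal sketch of that idea using only the Python standard library. It is not the pipeline's actual code: the helper name `grave_to_acute` is hypothetical, and doing the swap through Unicode NFD decomposition is an assumption about one reasonable way to implement it.

```python
import unicodedata

# Combining marks as they appear in Unicode NFD (decomposed) form.
GRAVE = "\u0300"  # combining grave accent
ACUTE = "\u0301"  # combining acute accent


def grave_to_acute(form: str) -> str:
    """Hypothetical helper: replace every grave accent with an acute.

    Assumption: decompose to NFD, swap the combining mark, recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", form)
    swapped = decomposed.replace(GRAVE, ACUTE)
    return unicodedata.normalize("NFC", swapped)


# A word-final grave (as in running text) becomes the acute citation form,
# so both spellings can hit the same dictionary key.
print(grave_to_acute("\u03b8\u03b5\u1f70"))
```

Normalizing at lookup time means a dictionary only has to store the acute-accented form, while tokens from running text, where a final acute is written as a grave before another word, can still be matched.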