---
language:
- en
license: apache-2.0
datasets:
- HuggingFaceFW/finepdfs_eng_Latn_labeled
---

# FinePDFs-Edu v2 classifier (English)

## Model summary

This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 1,304,547 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_eng_Latn_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from the [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.

**Unlike the original FW-EDU, we are not filtering for undergraduate content, which results in a high inclusion of papers!**

### How to use in transformers

To load the FinePDFs-Edu-v2 classifier, use the following code:

```python
import re

from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHUNK_SIZE = 2048 - 2  # leave room for the two special tokens
MAX_CHARS = 10_000

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn")
regex_whitespace = re.compile(r'\s')

def create_text_chunks(text: str, tokenizer):
    ...  # body truncated in this diff
```
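Since `create_text_chunks` is truncated above, here is an illustrative stand-in for how the pieces can fit together. This is a sketch, not the card's actual helper: `split_on_whitespace` and `score_text` are our own names, and the per-chunk scoring is left as a callable that would wrap the tokenizer and model in practice.

```python
import re

MAX_CHARS = 10_000

regex_whitespace = re.compile(r"\s")

def split_on_whitespace(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most max_chars characters,
    preferring to cut at the last whitespace before the limit."""
    chunks = []
    while len(text) > max_chars:
        cut = max_chars
        # search backwards for a whitespace so words are not split mid-way
        for i in range(max_chars, 0, -1):
            if regex_whitespace.match(text[i - 1]):
                cut = i
                break
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks

def score_text(text: str, score_chunk) -> float:
    """Score a document as the mean of its per-chunk scores.
    score_chunk is a callable; in practice it would tokenize the
    chunk and read the model's single regression logit."""
    chunks = split_on_whitespace(text)
    return sum(score_chunk(c) for c in chunks) / len(chunks)
```

In a real pipeline, `score_chunk` would be something like tokenizing the chunk (truncated to `CHUNK_SIZE`) and reading `model(**inputs).logits.squeeze()`; how chunk scores are aggregated into a document score is our assumption here.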

Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
```
Below is an extract from a PDF file. Evaluate whether the extract exhibits properties suitable for educational training data using the 6-point scoring system described below. Select the single score that best represents the extract's educational quality level:

**Score 0: No Educational Value**
- Award 0 points for content with zero educational merit including spam, promotional material, garbled text, random sequences, severely corrupted formatting, or content that provides no learning opportunities whatsoever.

**Score 1: Minimal Educational Content**
- Award 1 point for content with very limited educational value such as basic data listings, simple contact information, minimal factual statements without context, brief announcements, or content that presents isolated facts without meaningful educational framework.

**Score 2: Basic Informational Content**
- Award 2 points for content that provides basic information but lacks depth, context, or clear educational structure. This includes simple news items, basic product descriptions, brief summaries, casual observations, or informational content that states facts without explanation or educational development.

**Score 3: Moderate Educational Value**
- Award 3 points for content that offers solid educational information with some context and explanation. This includes informative articles with background information, basic explanatory content, introductory-level material, general knowledge content, or well-written informational pieces that provide context and some depth.

**Score 4: Strong Educational Content**
- Award 4 points for content with clear educational merit featuring detailed explanations, multiple perspectives, analytical depth, or comprehensive coverage of topics. This includes academic articles, detailed tutorials, in-depth analyses, research-based content, or material that demonstrates critical thinking and provides substantial learning value.

**Score 5: Exceptional Educational Value**
- Award 5 points for content with outstanding educational merit that demonstrates expert-level knowledge, sophisticated analysis, comprehensive understanding, and significant pedagogical value. This includes advanced academic research, expert commentary with deep insights, comprehensive educational material with multiple learning dimensions, or content that advances understanding through original thinking and thorough exploration.

## Evaluation Process
The extract: {example}

After examining the extract:
- Briefly justify your total score, focusing on the educational depth, context provided, and learning potential, up to 100 words.
- Conclude with the score using the format: "Educational value score: <total points>"
```
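The prompt pins the response format down to the final line, so the annotation score can be recovered with a small regex. A minimal sketch — the helper name and the fallback behaviour are our choices, not part of the published pipeline:

```python
import re

# matches the exact format the prompt asks the LLM to emit
SCORE_RE = re.compile(r"Educational value score:\s*(\d+)")

def parse_score(response: str):
    """Extract the integer score from an annotation response,
    returning None when the model did not follow the format."""
    match = SCORE_RE.search(response)
    if match is None:
        return None
    score = int(match.group(1))
    # the prompt defines a 0-5 scale; reject out-of-range values
    return score if 0 <= score <= 5 else None
```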

We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers, and trained the model for 5,000 steps with a learning rate of 3e-4.
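The partial fine-tuning described above can be sketched as a generic freezing helper. This is an illustrative sketch over plain `torch.nn` modules; with ModernBERT-large you would apply it to the encoder's layer list (`model.model.layers` in the transformers implementation, to our understanding) with the classification head kept trainable.

```python
import torch
from torch import nn

def freeze_all_but_last_n(encoder_layers: nn.ModuleList, head: nn.Module, n: int = 4) -> None:
    """Freeze every encoder layer except the last n; the head stays trainable."""
    for layer in encoder_layers[:-n]:
        for p in layer.parameters():
            p.requires_grad = False  # excluded from gradient updates
    for layer in encoder_layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True
    for p in head.parameters():
        p.requires_grad = True
```

The regression head itself corresponds to loading the backbone with a single output label (`num_labels=1`), which makes `AutoModelForSequenceClassification` produce one continuous logit per input.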

**Classification report**

We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 10,000 Qwen3-235B-A22B-Instruct-2507-annotated samples.
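A sketch of how such a report can be produced with scikit-learn; the round-and-clip step is our reading of "treat the predictions as discrete classes", and the arrays below are illustrative values, not the real hold-out set:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def to_classes(predictions: np.ndarray, low: int = 0, high: int = 5) -> np.ndarray:
    """Round continuous regression outputs to the nearest integer class
    and clip them to the 0-5 score range."""
    return np.clip(np.round(predictions), low, high).astype(int)

# toy predictions standing in for the model's regression outputs
y_true = np.array([0, 1, 2, 3, 4, 5])
y_pred = to_classes(np.array([-0.3, 1.2, 2.6, 2.9, 4.4, 5.7]))
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```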
```
Validation Report:
| class | precision | recall | f1-score | support |
...

Confusion Matrix:
...
```

## Limitations

While the FinePDFs-Edu-v2 classifier performs well in distinguishing high-quality educational content for the FinePDFs dataset, there are some limitations:

- Scope: The model's performance might change on other datasets, in particular on out-of-distribution samples. Unlike the original FW-Edu classifier, it is NOT focused on educational content relevant to primary and grade school levels, and it may not perform as well on content intended for undergraduate levels or specialized domains.
- Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic-looking content for the higher scores, but we have not found the classifier to improve the scores.
- Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.

The training and inference code is available on GitHub.