Create own dataset for NER

Hello all,

I have the following challenge: I want to build a custom NER model with BERT.
Following these instructions (link), I have already fine-tuned bert-base-german-cased on the german-ler dataset.
Now, in a second step, I would like to create my own dataset and fine-tune the same BERT model on it.
However, I could not find anything suitable in the documentation for creating my own dataset, only something about Q&A (link). None of the other instructions I found worked either.

So far I have done the following: using BERT’s tokenizer, I split my texts into subwords and labelled them in IOB2 format in a separate column. I then saved the data as a CSV file.

Alex B-PER
is O
going O
with O
Marty B-PER
A. I-PER

Next, I read the .csv into a DataFrame, converted it into a Dataset, and split that into a training and a test set.

import pandas as pd
from datasets import Dataset, DatasetDict, ClassLabel

raw_data = pd.read_csv(r'C:\Users\dataset.csv', encoding='unicode_escape', on_bad_lines='skip', delimiter=';')

ges_dataset = Dataset.from_pandas(raw_data)
ges_dataset = ges_dataset.class_encode_column("label")

train_dataset, validation_dataset = ges_dataset.train_test_split(test_size=0.2).values()
dataDict = DatasetDict({"train": train_dataset, "validation": validation_dataset})

When I now load this dataset and follow the fine-tuning instructions, I always get an error:

    label_names_1 = dataset_1["train"].features["tags"].feature.names

AttributeError: 'Value' object has no attribute 'feature'

I suspect that the error arises from the fact that I am not creating the dataset correctly.

The following questions therefore arise for me:

  1. How do I correctly create a dataset from my own text? Does anyone happen to have good documentation or instructions?
  2. How should the dataset be structured? If I put every token in its own row, it is no longer clear where one sentence ends and the next begins. If I keep a whole sentence in one row instead, I can no longer distinguish the individual tags: [B-Start, I-Start], for example, would be treated as one combined label rather than two separate ones.

I would be very happy if anyone who has had the same difficulties could help me here.
Many thanks and best regards, Tom

Did you solve this issue yet? I’m having the same issue.

Hi Dnsibu,
unfortunately, I have not yet been able to solve this.

I found this post helpful:

python - Creating HuggingFace Dataset to train an BIO tagger - Stack Overflow