Fading Coder

Fixing KeyError: 'instruction' When Splitting Datasets in Python

Python raises KeyError: 'instruction' when code indexes entry['instruction'] on an entry that has no such key. This can happen while splitting a dataset loaded from a JSON file even though most entries match the expected schema. A correctly formatted entry looks like this:

{
  "instruction": "Describe the core principles of object-oriented programming (OOP).",
  "input": "OOP principles include encapsulation, inheritance, polymorphism, and abstraction, which enable organized, maintainable code.",
  "output": "Evaluation: You have a solid grasp of OOP principles. How have these principles guided your code writing throughout your development experience?"
}

For large datasets, it is easy to miss that a small number of entries use different key names. Printing the keys for every entry reveals the mixed schema issue:

Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
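A listing like the one above can be produced with a short loop over the loaded entries. The sketch below also tallies each distinct key set with collections.Counter, which is easier to read than one line per entry on a large dataset. The helper name summarize_key_sets and the inline sample_entries list are hypothetical stand-ins for the real dataset:

```python
from collections import Counter


def summarize_key_sets(entries):
    """Count how many entries share each distinct (sorted) set of keys."""
    return Counter(tuple(sorted(entry.keys())) for entry in entries)


# Hypothetical inline sample standing in for the data loaded from JSON
sample_entries = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"question": "q", "answer": "r", "output": "s"},
    {"question": "q2", "answer": "r2", "output": "s2"},
]

# One line per entry, in the same format as the listing above
for entry in sample_entries:
    print(f"Keys in entry: {entry.keys()}")

# Compact summary: one line per distinct schema, most common first
for key_set, count in summarize_key_sets(sample_entries).most_common():
    print(f"{count} entries with keys {key_set}")
```

If the summary shows more than one key set, the dataset mixes schemas and any code that assumes a single schema will fail on the minority entries.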

To resolve this error, validate the schema of each entry during data loading and extract the required fields based on the keys that entry actually contains. Entries that match neither known schema are logged and skipped, so a mixed-schema dataset no longer raises KeyError.

The updated working code is below:

import json
from sklearn.model_selection import train_test_split

# Load raw dataset from JSON file
with open("raw_dataset.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)

input_texts = []
output_texts = []

# Process each entry based on its key structure
for entry in raw_data:
    # Handle the standard instruction-input-output schema
    if all(key in entry for key in ["instruction", "input", "output"]):
        combined_input = f"{entry['instruction']} {entry['input']}"
        input_texts.append(combined_input)
        output_texts.append(entry["output"])
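    # Handle the alternate question-answer-output schema
    # Assumption: 'question' and 'answer' play the same roles as
    # 'instruction' and 'input', so both are combined into the model input
    elif all(key in entry for key in ["question", "answer", "output"]):
        combined_input = f"{entry['question']} {entry['answer']}"
        input_texts.append(combined_input)
        output_texts.append(entry["output"])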
    # Log malformed/unsupported entries for inspection
    else:
        print(f"Unsupported entry format: {entry}")

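# Split into training, validation, and test sets (60/20/20):
# hold out 20% for test, then take 25% of the remaining 80% (20% overall) for validation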
X_train_temp, X_test, y_train_temp, y_test = train_test_split(
    input_texts, output_texts, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_temp, y_train_temp, test_size=0.25, random_state=42
)

# Save split datasets to individual JSON files
with open("train_split.json", "w", encoding="utf-8") as train_out:
    json.dump({"inputs": X_train, "outputs": y_train}, train_out, ensure_ascii=False)

with open("val_split.json", "w", encoding="utf-8") as val_out:
    json.dump({"inputs": X_val, "outputs": y_val}, val_out, ensure_ascii=False)

with open("test_split.json", "w", encoding="utf-8") as test_out:
    json.dump({"inputs": X_test, "outputs": y_test}, test_out, ensure_ascii=False)
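As an alternative to branching on each schema inline, the per-entry checks in the code above can be driven by a small lookup table, so that supporting another key layout means adding one row. This is only a sketch: KNOWN_SCHEMAS, normalize_entry, and normalize_dataset are hypothetical names, and treating "question"/"answer" as equivalents of "instruction"/"input" is an assumption about the dataset, not something it documents. All three fields are kept so a later step can decide how to combine them:

```python
# Hypothetical schema table: (prompt key, context key, target key) per layout.
# Assumption: "question"/"answer" mirror "instruction"/"input".
KNOWN_SCHEMAS = [
    ("instruction", "input", "output"),
    ("question", "answer", "output"),
]


def normalize_entry(entry):
    """Return the entry rewritten to instruction/input/output keys, or None."""
    if not isinstance(entry, dict):
        return None
    for prompt_key, context_key, target_key in KNOWN_SCHEMAS:
        if all(key in entry for key in (prompt_key, context_key, target_key)):
            return {
                "instruction": entry[prompt_key],
                "input": entry[context_key],
                "output": entry[target_key],
            }
    return None


def normalize_dataset(entries):
    """Separate entries into normalized records and rejected originals."""
    normalized, rejected = [], []
    for entry in entries:
        record = normalize_entry(entry)
        if record is None:
            rejected.append(entry)
        else:
            normalized.append(record)
    return normalized, rejected


# Hypothetical sample: two supported layouts and one unsupported entry
sample = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"question": "q", "answer": "r", "output": "s"},
    {"prompt": "p", "completion": "c"},
]
normalized, rejected = normalize_dataset(sample)
print(f"{len(normalized)} normalized, {len(rejected)} rejected")
```

Returning the rejected entries instead of only printing them makes it possible to write them to a separate file for inspection, or to fail the run if too many entries are dropped.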

If this fix does not resolve your issue, see the related solution for KeyError: 'instruction' that occurs during ChatGLM3-6B fine-tuning with LLAMA-Factory.
