Fixing KeyError: 'instruction' When Splitting Datasets in Python
A KeyError: 'instruction' often occurs when splitting datasets loaded from JSON files, even when most entries appear to match the expected schema. A typical correctly formatted entry looks like this:
{
"instruction": "Describe the core principles of object-oriented programming (OOP).",
"input": "OOP principles include encapsulation, inheritance, polymorphism, and abstraction, which enable organized, maintainable code.",
"output": "Evaluation: You have a solid grasp of OOP principles. How have these principles guided your code writing throughout your development experience?"
}
For large datasets, it is easy to miss that a small number of entries use different key names. Printing the keys for every entry reveals the mixed schema issue:
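A minimal diagnostic loop like the one below produces this kind of output. Here a two-entry inline sample stands in for the article's raw_dataset.json; in practice you would load the file with json.load as shown later.

```python
import json

# Two hypothetical sample entries standing in for raw_dataset.json
sample = json.loads("""
[
  {"instruction": "Describe OOP.", "input": "OOP principles...", "output": "Evaluation: ..."},
  {"question": "What is OOP?", "answer": "A paradigm...", "output": "Evaluation: ..."}
]
""")

# Print the key set of every entry to spot schema mismatches
for entry in sample:
    print(f"Keys in entry: {entry.keys()}")
```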
Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['question', 'answer', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
Keys in entry: dict_keys(['instruction', 'input', 'output'])
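For datasets too large to scan by eye, a frequency count of key sets is quicker than reading thousands of printed lines. The sketch below uses collections.Counter on a small hypothetical sample; in practice you would pass the loaded raw_data list instead.

```python
from collections import Counter

# Hypothetical sample entries; replace with the loaded raw_data list
sample = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"question": "q1", "answer": "r1", "output": "c1"},
    {"question": "q2", "answer": "r2", "output": "c2"},
]

# Count how many entries share each (sorted) key set
schema_counts = Counter(tuple(sorted(entry.keys())) for entry in sample)
for keys, count in schema_counts.items():
    print(f"{count} entries with keys {keys}")
```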
To resolve the error, validate each entry's keys during loading and extract the required fields based on whichever schema the entry uses. This handles mixed-schema datasets without raising a KeyError.
The updated working code is below:
import json
from sklearn.model_selection import train_test_split

# Load raw dataset from JSON file
with open("raw_dataset.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)

input_texts = []
output_texts = []

# Process each entry based on its key structure
for entry in raw_data:
    # Handle the standard instruction-input-output schema
    if all(key in entry for key in ["instruction", "input", "output"]):
        combined_input = f"{entry['instruction']} {entry['input']}"
        input_texts.append(combined_input)
        output_texts.append(entry["output"])
    # Handle the alternate question-answer-output schema
    elif all(key in entry for key in ["question", "answer", "output"]):
        input_texts.append(entry["question"])
        output_texts.append(entry["output"])
    # Log malformed/unsupported entries for inspection
    else:
        print(f"Unsupported entry format: {entry}")

# Split into training (60%), validation (20%), and test (20%) sets
X_train_temp, X_test, y_train_temp, y_test = train_test_split(
    input_texts, output_texts, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_temp, y_train_temp, test_size=0.25, random_state=42  # 0.25 of 80% = 20%
)

# Save split datasets to individual JSON files
with open("train_split.json", "w", encoding="utf-8") as train_out:
    json.dump({"inputs": X_train, "outputs": y_train}, train_out, ensure_ascii=False)
with open("val_split.json", "w", encoding="utf-8") as val_out:
    json.dump({"inputs": X_val, "outputs": y_val}, val_out, ensure_ascii=False)
with open("test_split.json", "w", encoding="utf-8") as test_out:
    json.dump({"inputs": X_test, "outputs": y_test}, test_out, ensure_ascii=False)
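The two-stage split yields 60% train, 20% validation, and 20% test, since 0.25 of the remaining 80% is 20%. A quick sketch with ten hypothetical examples confirms the proportions:

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical input/output pairs to check the split proportions
inputs = [f"prompt {i}" for i in range(10)]
outputs = [f"reply {i}" for i in range(10)]

# First split: 20% held out for test
X_tmp, X_test, y_tmp, y_test = train_test_split(
    inputs, outputs, test_size=0.2, random_state=42
)
# Second split: 25% of the remaining 80% held out for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```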
If this fix does not resolve your issue, see the related solution for KeyError: 'instruction' that occurs during ChatGLM3-6B fine-tuning with LLAMA-Factory.