Fading Coder

One Final Commit for the Last Sprint


Practical Guide to Data Preprocessing Transforms in MindSpore


Raw data loaded from storage is rarely in a form that a neural network can consume directly. MindSpore provides a suite of modular transform operations that plug into data processing pipelines via the map method, supporting image, text, and audio preprocessing alongside custom user-defined logic.

import numpy as np
from PIL import Image
from download import download
import mindspore.dataset as ds
from mindspore.dataset import vision, text
  • numpy: Library for high-performance numerical array operations, used for underlying data manipulation
  • PIL.Image: Toolkit for image I/O, format conversion, and basic pixel editing
  • download: Utility for fetching and unpacking remote dataset archives
  • mindspore.dataset: Core module containing all data loading, transformation, and pipeline management utilities

Common Transforms

The transforms submodule includes utility operations that work across all data types. The Compose operation is used to chain multiple transforms into a single reusable pipeline.

mnist_dataset_url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip"
dataset_save_path = download(mnist_dataset_url, "./", kind="zip", replace=True)
train_split = ds.MnistDataset('MNIST_Data/train')

The download function fetches the compressed MNIST archive, unpacks it to the specified working directory, and overwrites existing files if present. MnistDataset loads the unpacked training split into a structured dataset object.

sample_img, sample_label = next(train_split.create_tuple_iterator())
print(f"Raw image shape: {sample_img.shape}")

create_tuple_iterator generates an iterable that returns dataset entries as (image, label) tuples. The next call retrieves the first sample to inspect raw data dimensions.

preprocess_pipeline = ds.transforms.Compose([
    vision.Rescale(1.0 / 255.0, 0),
    vision.Normalize(mean=(0.1307,), std=(0.3081,)),
    vision.HWC2CHW()
])
train_split = train_split.map(operations=preprocess_pipeline, input_columns='image')
processed_img, _ = next(train_split.create_tuple_iterator())
print(f"Processed image shape: {processed_img.shape}")

Compose accepts an ordered list of transform operations and applies them sequentially. The pipeline above scales pixel values to the 0-1 range, standardizes them using MNIST's global mean and standard deviation, then rearranges dimensions from height-width-channel to the channel-height-width format required by most convolutional neural networks. The map method applies the full pipeline to every entry in the dataset's image column.
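As a sanity check, the same three steps can be reproduced in plain NumPy (a sketch, not MindSpore's implementation; the 28x28 shape stands in for one MNIST sample, and the mean/std are MNIST's published statistics used above):

```python
import numpy as np

# Stand-in for one raw MNIST sample: 28x28 grayscale, uint8, HWC layout
raw = np.random.randint(0, 256, (28, 28, 1), dtype=np.uint8)

# Step 1: Rescale(1/255, 0) -> pixel * scale + shift
rescaled = raw.astype(np.float32) * (1.0 / 255.0) + 0

# Step 2: Normalize(mean=0.1307, std=0.3081) -> (pixel - mean) / std
normalized = (rescaled - 0.1307) / 0.3081

# Step 3: HWC2CHW -> move the channel axis to the front
chw = np.transpose(normalized, (2, 0, 1))

print(chw.shape)  # (1, 28, 28)
```

Running the composed MindSpore pipeline on the same input should produce the same array, which is what makes Compose a convenient way to bundle the steps into one reusable operation.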

Vision Transforms

The vision submodule includes specialized transforms for image processing, covering augmentation, normalization, and format conversion use cases.

Rescale

Rescale adjusts pixel value ranges using a linear transformation, following the formula output_pixel = input_pixel * scale_factor + shift_offset.

# Generate 48x48 single-channel random sample
grayscale_sample_arr = np.random.randint(0, 256, (48, 48), dtype=np.uint8)  # high bound is exclusive
random_pil_img = Image.fromarray(grayscale_sample_arr)
print(grayscale_sample_arr)

This code generates a random 48x48 grayscale image with pixel values in the 0-255 range, then converts the numpy array to a PIL Image object for processing.

rescale_op = vision.Rescale(scale=1.0/255.0, shift=0)
scaled_arr = rescale_op(random_pil_img)
print(f"Scaled value range: {np.min(scaled_arr)} ~ {np.max(scaled_arr)}")

The instantiated Rescale operation is applied directly to the PIL image, returning a numpy array with values normalized to the 0-1 range.

Normalize

Normalize standardizes image values per channel to a mean of 0 and a standard deviation of 1, improving model training stability. It accepts per-channel mean and standard deviation values, plus a flag indicating whether the input is in HWC format.

normalize_op = vision.Normalize(mean=(0.1307,), std=(0.3081,), is_hwc=False)
normalized_arr = normalize_op(scaled_arr)
print(normalized_arr)

This operation applies the standardization formula output_channel = (input_channel - channel_mean) / channel_std to the single-channel scaled image.
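The formula is easy to verify in plain NumPy, and inverting it recovers the original values (a sketch; the random array stands in for the scaled image above):

```python
import numpy as np

# A scaled single-channel image with values in [0, 1] (stand-in for scaled_arr)
scaled = np.random.rand(48, 48).astype(np.float32)

mean, std = 0.1307, 0.3081
normalized = (scaled - mean) / std  # the same per-channel formula Normalize applies

# Undoing the transform recovers the original values (up to float rounding)
recovered = normalized * std + mean
print(np.allclose(recovered, scaled, atol=1e-5))  # True
```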

HWC2CHW

Different hardware accelerators and model architectures expect either HWC (height, width, channel) or CHW (channel, height, width) dimension ordering. The HWC2CHW transform converts HWC-formatted data to CHW.

# Add channel dimension to create HWC format array
hwc_formatted_arr = np.expand_dims(normalized_arr, axis=-1)
hwc_to_chw_op = vision.HWC2CHW()
chw_formatted_arr = hwc_to_chw_op(hwc_formatted_arr)
print(f"HWC shape: {hwc_formatted_arr.shape}, CHW shape: {chw_formatted_arr.shape}")

np.expand_dims adds a channel dimension to the end of the 2D normalized array to create valid HWC input. The transform then moves the channel dimension to the first position, resulting in the CHW format.
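The dimension shuffle itself is equivalent to a NumPy axis transpose, which can serve as a reference check (a sketch; the random array stands in for the normalized image above):

```python
import numpy as np

# 48x48 single-channel array in HWC layout (stand-in for hwc_formatted_arr)
hwc = np.random.rand(48, 48, 1).astype(np.float32)

# HWC2CHW amounts to moving the last axis to the front
chw = np.transpose(hwc, (2, 0, 1))

print(hwc.shape, chw.shape)  # (48, 48, 1) (1, 48, 48)
```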

Text Transforms

The text submodule provides specialized processing operations for natural language data, including tokenization, vocabulary building, and token-to-index mapping.

def text_yielder(input_texts):
    for entry in input_texts:
        yield (entry,)

sample_texts = ["Welcome to Beijing"]
text_ds = ds.GeneratorDataset(source=text_yielder(sample_texts), column_names=["raw_text"])

GeneratorDataset creates a dataset object from a custom generator function, which yields individual text entries from the input list.

PythonTokenizer

PythonTokenizer wraps custom user-defined tokenization logic to integrate with MindSpore data pipelines.

def custom_space_splitter(content):
    return content.split()

text_ds = text_ds.map(operations=text.PythonTokenizer(custom_space_splitter), input_columns=["raw_text"])
print(next(text_ds.create_tuple_iterator()))

The custom_space_splitter function splits input text on whitespace to produce tokens. Wrapping it in PythonTokenizer allows this logic to be applied directly via the map method.
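The splitter can also be exercised on its own, outside the pipeline. Note that str.split() with no separator collapses runs of whitespace and drops leading/trailing spaces:

```python
def custom_space_splitter(content):
    # str.split() with no separator splits on any run of whitespace
    return content.split()

print(custom_space_splitter("Welcome to Beijing"))  # ['Welcome', 'to', 'Beijing']
print(custom_space_splitter("  extra   spaces  "))  # ['extra', 'spaces']
```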

Lookup

The Lookup transform maps text tokens to integer indices using a prebuilt vocabulary. Vocabularies can be loaded from existing files or built directly from a dataset.

vocab = text.Vocab.from_dataset(text_ds, columns=["raw_text"])
print(f"Generated vocab mapping: {vocab.vocab()}")

text_ds = text_ds.map(operations=text.Lookup(vocab), input_columns=["raw_text"])
print(next(text_ds.create_tuple_iterator()))

Vocab.from_dataset scans all token entries in the dataset to build a token-to-index mapping. The Lookup transform then converts each token in the input sequence to its corresponding integer index for model input.
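Conceptually, the vocab-then-lookup step boils down to building a token-to-index dictionary and mapping through it. A pure-Python sketch (illustrative only; MindSpore's Vocab may assign indices in a different order, e.g. by token frequency):

```python
# Tokens produced by the whitespace tokenizer above
tokens = ["Welcome", "to", "Beijing"]

# Build a token -> index mapping from the dataset's unique tokens
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Lookup replaces each token with its integer index
ids = [vocab[tok] for tok in tokens]
print(vocab)  # {'Beijing': 0, 'Welcome': 1, 'to': 2}
print(ids)    # [1, 2, 0]
```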

Lambda Transforms

Lambda transforms support arbitrary custom logic via anonymous functions, providing maximum flexibility for one-off or specialized data processing steps.

numeric_ds = ds.GeneratorDataset(source=[1,2,3], column_names=["value"], shuffle=False)
numeric_ds = numeric_ds.map(operations=lambda x: x * 2)
print(list(numeric_ds.create_tuple_iterator()))

This simple lambda function multiplies each input value by 2, applied directly to all entries in the dataset.

def quadratic_transform(x):
    return x ** 2 + 3

numeric_ds = numeric_ds.map(operations=lambda val: quadratic_transform(val))
print(list(numeric_ds.create_tuple_iterator()))

Lambda transforms can also wrap more complex named functions, allowing arbitrary Python logic to be integrated into the data processing pipeline.

