Fading Coder

One Final Commit for the Last Sprint


Practical Guide to Data Preprocessing Transforms in MindSpore


Raw data loaded from storage is rarely in a form that a neural network can consume directly. MindSpore provides a suite of modular transform operations that plug into data processing pipelines via the map method, supporting image, text, and audio preprocessing alongside custom user-defined logic.

import numpy as np
from PIL import Image
from download import download
import mindspore.dataset as ds
from mindspore.dataset import vision, text
  • numpy: Library for high-performance numerical array operations, used for underlying data manipulation
  • PIL.Image: Toolkit for image I/O, format conversion, and basic pixel editing
  • download: Utility for fetching and unpacking remote dataset archives
  • mindspore.dataset: Core module containing all data loading, transformation, and pipeline management utilities

Common Transforms

The transforms submodule includes utility operations that work across all data types. The Compose operation is used to chain multiple transforms into a single reusable pipeline.

mnist_dataset_url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip"
dataset_save_path = download(mnist_dataset_url, "./", kind="zip", replace=True)
train_split = ds.MnistDataset('MNIST_Data/train')

The download function fetches the compressed MNIST archive, unpacks it to the specified working directory, and overwrites existing files if present. MnistDataset loads the unpacked training split into a structured dataset object.

sample_img, sample_label = next(train_split.create_tuple_iterator())
print(f"Raw image shape: {sample_img.shape}")

create_tuple_iterator generates an iterable that returns dataset entries as (image, label) tuples. The next call retrieves the first sample to inspect raw data dimensions.

preprocess_pipeline = ds.transforms.Compose([
    vision.Rescale(1.0 / 255.0, 0),
    vision.Normalize(mean=(0.1307,), std=(0.3081,)),
    vision.HWC2CHW()
])
train_split = train_split.map(operations=preprocess_pipeline, input_columns='image')
processed_img, _ = next(train_split.create_tuple_iterator())
print(f"Processed image shape: {processed_img.shape}")

Compose accepts an ordered list of transform operations and applies them sequentially. The pipeline above scales pixel values to the 0-1 range, standardizes them using MNIST's global mean and standard deviation, then rearranges dimensions from height-width-channel to the channel-height-width format required by most convolutional neural networks. The map method applies the full pipeline to every entry in the dataset's image column.
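As a sanity check, the same three steps can be reproduced in plain NumPy (a sketch, not MindSpore's implementation; the 28x28 shape stands in for one MNIST sample, and the mean/std are MNIST's published statistics used above):

```python
import numpy as np

# Stand-in for one raw MNIST sample: 28x28 grayscale, uint8, HWC layout
raw = np.random.randint(0, 256, (28, 28, 1), dtype=np.uint8)

# Step 1: Rescale(1/255, 0) -> pixel * scale + shift
rescaled = raw.astype(np.float32) * (1.0 / 255.0) + 0

# Step 2: Normalize(mean=0.1307, std=0.3081) -> (pixel - mean) / std
normalized = (rescaled - 0.1307) / 0.3081

# Step 3: HWC2CHW -> move the channel axis to the front
chw = np.transpose(normalized, (2, 0, 1))

print(chw.shape)  # (1, 28, 28)
```

Running the composed MindSpore pipeline on the same input should produce the same array, which is what makes Compose a convenient way to bundle the steps into one reusable operation.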

Vision Transforms

The vision submodule includes specialized transforms for image processing, covering augmentation, normalization, and format conversion use cases.

Rescale

Rescale adjusts pixel value ranges using a linear transformation, following the formula output_pixel = input_pixel * scale_factor + shift_offset.

# Generate 48x48 single-channel random sample
grayscale_sample_arr = np.random.randint(0, 256, (48, 48), dtype=np.uint8)  # high bound is exclusive
random_pil_img = Image.fromarray(grayscale_sample_arr)
print(grayscale_sample_arr)

This code generates a random 48x48 grayscale image with pixel values in the 0-255 range, then converts the numpy array to a PIL Image object for processing.

rescale_op = vision.Rescale(scale=1.0/255.0, shift=0)
scaled_arr = rescale_op(random_pil_img)
print(f"Scaled value range: {np.min(scaled_arr)} ~ {np.max(scaled_arr)}")

The instantiated Rescale operation is applied directly to the PIL image, returning a numpy array with values normalized to the 0-1 range.

Normalize

Normalize standardizes image values per channel to a mean of 0 and a standard deviation of 1, improving model training stability. It accepts per-channel mean and standard deviation values, plus a flag indicating whether the input is in HWC format.

normalize_op = vision.Normalize(mean=(0.1307,), std=(0.3081,), is_hwc=False)
normalized_arr = normalize_op(scaled_arr)
print(normalized_arr)

This operation applies the standardization formula output_channel = (input_channel - channel_mean) / channel_std to the single-channel scaled image.
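The formula is easy to verify in plain NumPy, and inverting it recovers the original values (a sketch; the random array stands in for the scaled image above):

```python
import numpy as np

# A scaled single-channel image with values in [0, 1] (stand-in for scaled_arr)
scaled = np.random.rand(48, 48).astype(np.float32)

mean, std = 0.1307, 0.3081
normalized = (scaled - mean) / std  # the same per-channel formula Normalize applies

# Undoing the transform recovers the original values (up to float rounding)
recovered = normalized * std + mean
print(np.allclose(recovered, scaled, atol=1e-5))  # True
```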

HWC2CHW

Different hardware accelerators and model architectures expect either HWC (height, width, channel) or CHW (channel, height, width) dimension ordering. The HWC2CHW transform converts HWC-formatted data to CHW.

# Add channel dimension to create HWC format array
hwc_formatted_arr = np.expand_dims(normalized_arr, axis=-1)
hwc_to_chw_op = vision.HWC2CHW()
chw_formatted_arr = hwc_to_chw_op(hwc_formatted_arr)
print(f"HWC shape: {hwc_formatted_arr.shape}, CHW shape: {chw_formatted_arr.shape}")

np.expand_dims adds a channel dimension to the end of the 2D normalized array to create valid HWC input. The transform then moves the channel dimension to the first position, resulting in the CHW format.
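The dimension shuffle itself is equivalent to a NumPy axis transpose, which can serve as a reference check (a sketch; the random array stands in for the normalized image above):

```python
import numpy as np

# 48x48 single-channel array in HWC layout (stand-in for hwc_formatted_arr)
hwc = np.random.rand(48, 48, 1).astype(np.float32)

# HWC2CHW amounts to moving the last axis to the front
chw = np.transpose(hwc, (2, 0, 1))

print(hwc.shape, chw.shape)  # (48, 48, 1) (1, 48, 48)
```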

Text Transforms

The text submodule provides specialized processing operations for natural language data, including tokenization, vocabulary building, and token-to-index mapping.

def text_yielder(input_texts):
    for entry in input_texts:
        yield (entry,)

sample_texts = ["Welcome to Beijing"]
text_ds = ds.GeneratorDataset(source=text_yielder(sample_texts), column_names=["raw_text"])

GeneratorDataset creates a dataset object from a custom generator function, which yields individual text entries from the input list.

PythonTokenizer

PythonTokenizer wraps custom user-defined tokenization logic to integrate with MindSpore data pipelines.

def custom_space_splitter(content):
    return content.split()

text_ds = text_ds.map(operations=text.PythonTokenizer(custom_space_splitter), input_columns=["raw_text"])
print(next(text_ds.create_tuple_iterator()))

The custom_space_splitter function splits input text on whitespace to produce tokens. Wrapping it in PythonTokenizer allows this logic to be applied directly via the map method.
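The splitter can also be exercised on its own, outside the pipeline. Note that str.split() with no separator collapses runs of whitespace and drops leading/trailing spaces:

```python
def custom_space_splitter(content):
    # str.split() with no separator splits on any run of whitespace
    return content.split()

print(custom_space_splitter("Welcome to Beijing"))  # ['Welcome', 'to', 'Beijing']
print(custom_space_splitter("  extra   spaces  "))  # ['extra', 'spaces']
```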

Lookup

The Lookup transform maps text tokens to integer indices using a prebuilt vocabulary. Vocabularies can be loaded from existing files or built directly from a dataset.

vocab = text.Vocab.from_dataset(text_ds, columns=["raw_text"])
print(f"Generated vocab mapping: {vocab.vocab()}")

text_ds = text_ds.map(operations=text.Lookup(vocab), input_columns=["raw_text"])
print(next(text_ds.create_tuple_iterator()))

Vocab.from_dataset scans all token entries in the dataset to build a token-to-index mapping. The Lookup transform then converts each token in the input sequence to its corresponding integer index for model input.
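Conceptually, the vocab-then-lookup step boils down to building a token-to-index dictionary and mapping through it. A pure-Python sketch (illustrative only; MindSpore's Vocab may assign indices in a different order, e.g. by token frequency):

```python
# Tokens produced by the whitespace tokenizer above
tokens = ["Welcome", "to", "Beijing"]

# Build a token -> index mapping from the dataset's unique tokens
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Lookup replaces each token with its integer index
ids = [vocab[tok] for tok in tokens]
print(vocab)  # {'Beijing': 0, 'Welcome': 1, 'to': 2}
print(ids)    # [1, 2, 0]
```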

Lambda Transforms

Lambda transforms support arbitrary custom logic via anonymous functions, providing maximum flexibility for one-off or specialized data processing steps.

numeric_ds = ds.GeneratorDataset(source=[1,2,3], column_names=["value"], shuffle=False)
numeric_ds = numeric_ds.map(operations=lambda x: x * 2)
print(list(numeric_ds.create_tuple_iterator()))

This simple lambda function multiplies each input value by 2, applied directly to all entries in the dataset.

def quadratic_transform(x):
    return x ** 2 + 3

numeric_ds = numeric_ds.map(operations=lambda val: quadratic_transform(val))
print(list(numeric_ds.create_tuple_iterator()))

Lambda transforms can also wrap more complex named functions, allowing arbitrary Python logic to be integrated into the data processing pipeline.

