Baseline Analysis for iFLYTEK Machine Translation Challenge at Datawhale AI Summer Camp
Dataset Overview
The official competition dataset includes 140,000 training sentence pairs, a test set for model evaluation, and a bilingual term dictionary for standardizing specialized vocabulary translations. Each line in the training file train.txt contains an English sentence and a corresponding Chinese sentence separated by a tab character \t, which will be parsed during data loading.
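As a sketch, the tab-separated format can be parsed with a few lines of standard-library Python (the helper name and sample sentences here are illustrative, not part of the official baseline):

```python
import tempfile

def load_parallel_pairs(path):
    """Parse a train.txt-style file: one English<TAB>Chinese pair per line."""
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            eng, zh = line.split('\t', maxsplit=1)
            pairs.append((eng, zh))
    return pairs

# Tiny demo with two sample lines in the same format
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Hello world\t你好世界\nGood morning\t早上好\n")
pairs = load_parallel_pairs(tmp.name)
print(pairs)
```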
Core NLP Model Foundations
This baseline uses a sequence-to-sequence (seq2seq) framework with GRU-based encoder and decoder components.
Gated Recurrent Unit (GRU)
GRU is a variant of the recurrent neural network (RNN) designed to mitigate the vanishing- and exploding-gradient problems of standard RNN architectures. A standard RNN processes sequential input one token at a time, updating a hidden state at each step; a GRU adds update and reset gates that let it better capture long-range semantic dependencies in sequences. For practical implementation, GRU accepts the same input and output formats as a standard RNN but delivers more consistent performance on long input sequences.
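Concretely, at each time step $t$ a GRU computes the following (one common formulation; $\sigma$ denotes the sigmoid function and $\odot$ element-wise multiplication, with bias terms omitted for brevity):

- Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$
- Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$
- Candidate state: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}))$
- New hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The update gate controls how much of the previous hidden state is carried forward, and the reset gate controls how much of it contributes to the candidate state.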
Sequence-to-Sequence (Seq2Seq) Model
The seq2seq architecture consists of two core components, an encoder and a decoder, both implemented here with GRU layers.
Encoder Module
- Accepts an input source sequence $X = [x_1, x_2, ..., x_T]$
- Feeds each token sequentially into the GRU network, generating a hidden state $h_t$ for each time step $t$. The final hidden state $h_T$ serves as the context vector $c$ for the decoding phase.
Decoder Module
- Initializes its hidden state using the context vector $c$ produced by the encoder
- Generates the target translation sequence $Y = [y_1, y_2, ..., y_{T'}]$ step by step: the first decoder step consumes the start-of-sequence token with $c$ as its initial hidden state to produce $y_1$ and $h'_1$; each subsequent step feeds the previous hidden state and the previous output token back in to generate the next token.
The full pipeline encodes the input sequence into a fixed-size context vector, then decodes this vector into the target translation sequence, making this framework ideal for machine translation and text summarization tasks.
Data Preprocessing Pipeline
Neural networks cannot directly process raw text strings, so the core goal of this stage is to convert text data into numerical tensors compatible with model training. The preprocessing workflow includes:
- Text cleaning
- Custom dataset class implementation
- Vocabulary construction with tokenization
- Integration of the official term dictionary
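One piece the later scripts rely on is load_terminology_dictionary for en-zh.dic. Its implementation is not part of the baseline listing; assuming one tab-separated English/Chinese entry per line (the file format and sample terms below are assumptions), it could look like this:

```python
import tempfile

def load_terminology_dictionary(dict_path):
    """Assumed format: one 'english_term<TAB>中文术语' entry per line."""
    term_dict = {}
    with open(dict_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            en_term, zh_term = line.split('\t', maxsplit=1)
            term_dict[en_term] = zh_term
    return term_dict

# Tiny demo with two hypothetical entries
with tempfile.NamedTemporaryFile('w', suffix='.dic', delete=False, encoding='utf-8') as tmp:
    tmp.write("neural network\t神经网络\ngradient\t梯度\n")
term_dict = load_terminology_dictionary(tmp.name)
print(term_dict)
```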
Text Cleaning
First perform data exploration to identify and remove noisy content. For English text, expand contractions (e.g., "There's" → "There is"), remove non-printable characters, and filter to retain only alphanumeric characters and basic punctuation. Example code:
import re
import unicodedata
import contractions

def normalize_unicode(text):
    # Strip combining marks after NFD normalization
    return ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')

def clean_english_text(raw_text):
    cleaned = normalize_unicode(raw_text.strip())
    cleaned = contractions.fix(cleaned)
    # Remove parenthetical annotations
    cleaned = re.sub(r'\([^)]*\)', '', cleaned)
    # Keep allowed characters only
    cleaned = re.sub(r"[^a-zA-Z0-9.!?]+", " ", cleaned)
    return cleaned
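To see what the cleaning does, here is a dependency-free demo of the same regex steps (the contractions and Unicode-normalization steps are skipped so it runs with the standard library alone; the function name is illustrative):

```python
import re

def demo_clean_english(raw_text):
    # Same regex steps as clean_english_text, minus the third-party
    # contractions/unicodedata handling, so this demo is dependency-free
    cleaned = re.sub(r'\([^)]*\)', '', raw_text.strip())   # drop "(Applause)" etc.
    cleaned = re.sub(r"[^a-zA-Z0-9.!?]+", " ", cleaned)    # keep letters, digits, . ! ?
    return cleaned.strip()

print(demo_clean_english("Thank you! (Applause) It's day 2."))
```

Note that in the full pipeline contractions.fix runs first, so "It's" would become "It is" before the apostrophe is stripped, avoiding the dangling "s" this simplified demo produces.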
For Chinese text, remove parenthetical filler content like "(掌声)" and filter to retain only Chinese characters, common full-width punctuation, and digits:

def clean_chinese_text(raw_text):
    # Remove all parenthetical annotations (both half- and full-width parentheses;
    # the parentheses must be escaped, or the pattern becomes a capture group)
    cleaned = re.sub(r'\([^)]*\)|（[^）]*）', '', raw_text)
    # Keep allowed Chinese characters, full-width punctuation, and digits
    cleaned = re.sub(r"[^\u4e00-\u9fa5，。！？0-9]", "", cleaned)
    return cleaned
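A quick self-contained demo of the Chinese cleaner (note the escaped parentheses: an unescaped ( … ) pair is a regex capture group rather than literal parentheses, and would delete nearly the whole sentence):

```python
import re

def demo_clean_chinese(raw_text):
    # Remove half- and full-width parenthetical annotations
    cleaned = re.sub(r'\([^)]*\)|（[^）]*）', '', raw_text)
    # Keep Chinese characters, full-width punctuation, and digits
    cleaned = re.sub(r"[^\u4e00-\u9fa5，。！？0-9]", "", cleaned)
    return cleaned

print(demo_clean_chinese("(掌声) 大家好，今天是第2天！"))
print(demo_clean_chinese("（笑声）谢谢。"))
```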
Custom Translation Dataset Class
Implement a PyTorch Dataset class to handle loading, tokenization, and tensor conversion of the training pairs:
from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer
from collections import Counter
import torch

class TranslationPairDataset(Dataset):
    def __init__(self, data_path, term_dict):
        self.raw_pairs = []
        # Load and parse tab-separated training data
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                eng_sent, ch_sent = line.strip().split('\t')
                self.raw_pairs.append((eng_sent, ch_sent))
        self.term_dict = term_dict
        # Initialize tokenizers: basic English tokenizer, character-level for Chinese
        self.eng_tokenizer = get_tokenizer('basic_english')
        self.ch_tokenizer = list
        # Count token frequencies, seeding the English counter with dictionary terms
        eng_vocab_counter = Counter(self.term_dict.keys())
        ch_vocab_counter = Counter()
        for eng, ch in self.raw_pairs:
            eng_vocab_counter.update(self.eng_tokenizer(eng))
            ch_vocab_counter.update(self.ch_tokenizer(ch))
        # Add special tokens and build final vocabularies; dict.fromkeys deduplicates
        # while preserving order, since dictionary terms also appear in the counts
        special_tokens = ['<pad>', '<sos>', '<eos>']
        eng_words = special_tokens + list(self.term_dict.keys()) + [word for word, _ in eng_vocab_counter.most_common(10000)]
        self.eng_vocab = list(dict.fromkeys(eng_words))
        self.ch_vocab = special_tokens + [word for word, _ in ch_vocab_counter.most_common(10000)]
        # Create word-to-index mappings
        self.eng_w2i = {word: idx for idx, word in enumerate(self.eng_vocab)}
        self.ch_w2i = {word: idx for idx, word in enumerate(self.ch_vocab)}

    def __len__(self):
        return len(self.raw_pairs)

    def __getitem__(self, idx):
        eng_sent, ch_sent = self.raw_pairs[idx]
        # Convert tokens to index tensors; out-of-vocabulary tokens fall back to
        # <sos> in this baseline (a dedicated <unk> token would be cleaner).
        # The target gets both <sos> (consumed as the first decoder input) and <eos>.
        eng_tensor = torch.tensor([self.eng_w2i.get(token, self.eng_w2i['<sos>']) for token in self.eng_tokenizer(eng_sent)] + [self.eng_w2i['<eos>']])
        ch_tensor = torch.tensor([self.ch_w2i['<sos>']] + [self.ch_w2i.get(token, self.ch_w2i['<sos>']) for token in self.ch_tokenizer(ch_sent)] + [self.ch_w2i['<eos>']])
        return eng_tensor, ch_tensor
Model Architecture Implementation
We implement a standard GRU-based seq2seq model for machine translation:
Encoder Network
import torch.nn as nn

class EnglishEncoder(nn.Module):
    def __init__(self, src_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src_input):
        # src_input shape: [batch_size, src_sequence_length]
        embedded = self.dropout(self.embedding(src_input))
        # embedded shape: [batch_size, src_sequence_length, emb_dim]
        outputs, final_hidden = self.gru(embedded)
        return outputs, final_hidden
Decoder Network
class ChineseDecoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.output_dim = tgt_vocab_size
        self.embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.fc_output = nn.Linear(hid_dim, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, tgt_input, hidden_state):
        # tgt_input shape: [batch_size, 1]
        # hidden_state shape: [num_layers, batch_size, hid_dim]
        embedded = self.dropout(self.embedding(tgt_input))
        # embedded shape: [batch_size, 1, emb_dim]
        outputs, new_hidden = self.gru(embedded, hidden_state)
        # outputs shape: [batch_size, 1, hid_dim]
        prediction = self.fc_output(outputs.squeeze(1))
        return prediction, new_hidden
Full Seq2Seq Model
import random

class TranslationSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src_seq, tgt_seq, teacher_forcing_ratio=0.5):
        batch_size = src_seq.shape[0]
        tgt_len = tgt_seq.shape[1]
        tgt_vocab_size = self.decoder.output_dim
        # Initialize output tensor for all time steps
        all_outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        # Get encoder final hidden state as decoder initial state
        _, hidden = self.encoder(src_seq)
        # Initial decoder input is the start-of-sequence token
        dec_input = tgt_seq[:, 0].unsqueeze(1)
        for step in range(1, tgt_len):
            output, hidden = self.decoder(dec_input, hidden)
            all_outputs[:, step, :] = output
            # Use teacher forcing randomly during training
            use_teacher_forcing = random.random() < teacher_forcing_ratio
            top_pred_token = output.argmax(1)
            dec_input = tgt_seq[:, step].unsqueeze(1) if use_teacher_forcing else top_pred_token.unsqueeze(1)
        return all_outputs
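A quick sanity check of the tensor shapes through the forward pass can be sketched with condensed, self-contained copies of the classes above (vocabulary sizes and sequence lengths here are illustrative, and dropout is omitted for brevity):

```python
import random
import torch
import torch.nn as nn

class Enc(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, batch_first=True)

    def forward(self, src):
        return self.gru(self.embedding(src))

class Dec(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers):
        super().__init__()
        self.output_dim = vocab_size
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):
        out, hidden = self.gru(self.embedding(tgt), hidden)
        return self.fc(out.squeeze(1)), hidden

def seq2seq_forward(enc, dec, src, tgt, teacher_forcing_ratio=0.5):
    batch_size, tgt_len = tgt.shape
    all_outputs = torch.zeros(batch_size, tgt_len, dec.output_dim)
    _, hidden = enc(src)                 # encoder final hidden -> decoder init
    dec_input = tgt[:, 0].unsqueeze(1)   # start-of-sequence token
    for step in range(1, tgt_len):
        pred, hidden = dec(dec_input, hidden)
        all_outputs[:, step, :] = pred
        teacher = random.random() < teacher_forcing_ratio
        dec_input = tgt[:, step].unsqueeze(1) if teacher else pred.argmax(1).unsqueeze(1)
    return all_outputs

enc = Enc(vocab_size=100, emb_dim=16, hid_dim=32, num_layers=2)
dec = Dec(vocab_size=120, emb_dim=16, hid_dim=32, num_layers=2)
src = torch.randint(0, 100, (4, 9))   # batch of 4 source sequences, length 9
tgt = torch.randint(0, 120, (4, 7))   # matching targets, length 7
out = seq2seq_forward(enc, dec, src, tgt)
print(out.shape)
```

The output has shape [batch_size, tgt_len, tgt_vocab_size], with step 0 left as zeros, which is why the training loop slices off the first time step before computing the loss.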
Training Workflow
The standard training loop for this model follows these steps:
- Load batched data via DataLoader
- Forward pass through the model to get predictions
- Calculate cross-entropy loss between predictions and target sequences
- Backpropagate gradients
- Clip gradients to prevent explosion, then update model weights
Example training loop code:
import torch.optim as optim

def train_one_epoch(model, data_loader, optimizer, loss_fn, grad_clip, device):
    model.train()
    total_loss = 0.0
    for batch_idx, (src_batch, tgt_batch) in enumerate(data_loader):
        src_batch = src_batch.to(device)
        tgt_batch = tgt_batch.to(device)
        optimizer.zero_grad()
        model_outputs = model(src_batch, tgt_batch)
        # Reshape outputs and targets for loss calculation, skipping step 0
        output_flat = model_outputs[:, 1:].contiguous().view(-1, model_outputs.shape[-1])
        tgt_flat = tgt_batch[:, 1:].contiguous().view(-1)
        loss = loss_fn(output_flat, tgt_flat)
        loss.backward()
        # Clip gradients before the optimizer step to prevent explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)
Full main training script:
import torch
from torch.utils.data import DataLoader, Subset
import time

def collate_fn(batch):
    # Pad variable-length sequences in a batch (padding index 0 == '<pad>')
    src_list, tgt_list = zip(*batch)
    src_padded = torch.nn.utils.rnn.pad_sequence(src_list, batch_first=True, padding_value=0)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_list, batch_first=True, padding_value=0)
    return src_padded, tgt_padded

if __name__ == "__main__":
    start_time = time.time()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Load term dictionary (implementation omitted for brevity)
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    # Initialize dataset
    full_dataset = TranslationPairDataset("./data/train.txt", term_dict)
    # Use a subset of the data for quick testing
    sample_count = 1000
    subset_indices = list(range(sample_count))
    train_subset = Subset(full_dataset, subset_indices)
    train_loader = DataLoader(train_subset, batch_size=32, shuffle=True, collate_fn=collate_fn)
    # Initialize model components
    SRC_VOCAB_SIZE = len(full_dataset.eng_vocab)
    TGT_VOCAB_SIZE = len(full_dataset.ch_vocab)
    EMB_DIM = 256
    HID_DIM = 512
    NUM_LAYERS = 2
    DROPOUT = 0.5
    encoder = EnglishEncoder(SRC_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    decoder = ChineseDecoder(TGT_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)
    # Define optimizer and loss function (ignore padding token in loss)
    optimizer = optim.Adam(model.parameters())
    pad_idx = full_dataset.ch_w2i['<pad>']
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)
    # Training parameters
    EPOCHS = 10
    GRAD_CLIP = 1.0
    for epoch in range(EPOCHS):
        epoch_loss = train_one_epoch(model, train_loader, optimizer, loss_fn, GRAD_CLIP, device)
        print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.3f}")
    # Save trained model
    torch.save(model.state_dict(), "./translation_model_gru.pth")
    elapsed = (time.time() - start_time) / 60
    print(f"Total training time: {elapsed:.2f} minutes")
The standard GRU-based seq2seq model has limited performance and slow training speed. A Transformer-based baseline was also developed, which reduced 10-epoch training time to ~1 hour on an RTX 4090 and achieved a BLEU-4 score of ~13.9, a significant improvement over the GRU baseline.
BLEU-4 Evaluation Metric
BLEU (Bilingual Evaluation Understudy) is a standard metric for evaluating machine translation quality, with BLEU-4 using 1- to 4-gram overlapping matches between generated and reference translations.
BLEU-4 Calculation Steps
- Tokenize both generated and reference translations
- Count n-gram overlaps for n=1 to 4
- Calculate clipped precision for each n-gram
- Compute weighted geometric mean of the precisions
- Apply brevity penalty to account for short generated sequences
- Combine the weighted mean and penalty to get the final BLEU-4 score
Example calculation:
- Reference: "The cat is on the mat"; Generated: "The cat sat on the mat"
- Tokenized reference: ["The", "cat", "is", "on", "the", "mat"]; tokenized generated: ["The", "cat", "sat", "on", "the", "mat"]
- Clipped n-gram matches: 1-gram: 5 of 6, 2-gram: 3 of 5, 3-gram: 1 of 4, 4-gram: 0 of 3
- Clipped precisions: $p_1 = 5/6 \approx 0.833$, $p_2 = 3/5 = 0.6$, $p_3 = 1/4 = 0.25$, $p_4 = 0/3 = 0$
- Weighted geometric mean: $\exp\left(\frac{1}{4}(\log p_1 + \log p_2 + \log p_3 + \log p_4)\right)$, which collapses to 0 whenever any $p_n$ is 0
- Brevity penalty: 1.0 (candidate and reference lengths match)
- Because the 4-gram precision is 0 here, the unsmoothed BLEU-4 score for this single pair is 0. In practice a smoothing method (sacrebleu applies one by default) assigns small positive counts to zero matches, and BLEU is normally reported at the corpus level, where the 4-gram precision is rarely zero.
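Clipped n-gram precisions can be computed directly with a short standard-library script (the function name is illustrative):

```python
from collections import Counter

def clipped_precisions(candidate, reference, max_n=4):
    """Clipped n-gram precisions for one candidate/reference token-list pair."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Each candidate n-gram is credited at most as often as it occurs in the reference
        matches = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(matches / total)
    return precisions

reference = "The cat is on the mat".split()
candidate = "The cat sat on the mat".split()
precisions = clipped_precisions(candidate, reference)
print(precisions)
```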
Example evaluation code:
from sacrebleu.metrics import BLEU
from typing import List

def load_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def translate_single_sentence(sentence: str, model: TranslationSeq2Seq, dataset: TranslationPairDataset, term_dict, device, max_len=50):
    model.eval()
    # Reverse mapping (Chinese term -> English source term); note that with the
    # character-level Chinese tokenizer a multi-character term can never equal a
    # single predicted character, so this post-hoc hook rarely fires
    term_ch_to_en = {v: k for k, v in term_dict.items()}
    # Tokenize and convert to an index tensor (OOV tokens fall back to <sos>)
    eng_tokens = dataset.eng_tokenizer(sentence)
    src_tensor = torch.LongTensor([dataset.eng_w2i.get(t, dataset.eng_w2i['<sos>']) for t in eng_tokens]).unsqueeze(0).to(device)
    with torch.no_grad():
        _, enc_hidden = model.encoder(src_tensor)
        translated_tokens = []
        dec_input = torch.LongTensor([[dataset.ch_w2i['<sos>']]]).to(device)
        current_hidden = enc_hidden
        for _ in range(max_len):
            output, current_hidden = model.decoder(dec_input, current_hidden)
            pred_token = output.argmax(1).item()
            pred_word = dataset.ch_vocab[pred_token]
            if pred_word == '<eos>':
                break
            # Replace with the standardized term if one is available
            if pred_word in term_ch_to_en:
                pred_word = term_ch_to_en[pred_word]
            translated_tokens.append(pred_word)
            dec_input = torch.LongTensor([[pred_token]]).to(device)
    return ''.join(translated_tokens)

def evaluate_bleu_score(model, dataset, src_file, ref_file, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    ref_sents = load_text_file(ref_file)
    translated_sents = []
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    # Use sacrebleu's built-in Chinese tokenizer for character-aware scoring
    bleu_scorer = BLEU(tokenize='zh')
    final_score = bleu_scorer.corpus_score(translated_sents, [ref_sents])
    return final_score
Inference Pipeline
After training, use the model to generate translations for the test set:
def run_inference(model, dataset, src_file, save_path, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    translated_sents = []
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(translated_sents))

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    dataset = TranslationPairDataset("./data/train.txt", term_dict)
    # Reinitialize model and load weights (map_location handles CPU-only machines)
    encoder = EnglishEncoder(len(dataset.eng_vocab), 256, 512, 2, 0.5)
    decoder = ChineseDecoder(len(dataset.ch_vocab), 256, 512, 2, 0.5)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)
    model.load_state_dict(torch.load("./translation_model_gru.pth", map_location=device))
    run_inference(model, dataset, "./data/test_en.txt", "./data/submit.txt", term_dict, device)
    print("Translation complete! Results saved to ./data/submit.txt")