Baseline Analysis for iFLYTEK Machine Translation Challenge at Datawhale AI Summer Camp

Dataset Overview

The official competition dataset includes 140,000 training sentence pairs, a test set for model evaluation, and a bilingual term dictionary for standardizing specialized vocabulary translations. Each line in the training file train.txt contains an English sentence and a corresponding Chinese sentence separated by a tab character \t, which will be parsed during data loading.
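A quick way to sanity-check this format (a minimal sketch; the file path matches the training script later in this post):

# Print the first three English/Chinese pairs from the training file
with open("./data/train.txt", encoding="utf-8") as f:
    for _ in range(3):
        eng_sent, ch_sent = next(f).strip().split("\t")
        print(eng_sent, "|", ch_sent)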

Core NLP Model Foundations

This baseline uses a sequence-to-sequence (seq2seq) framework with GRU-based encoder and decoder components.

Gated Recurrent Unit (GRU)

GRU is a gated variant of the recurrent neural network (RNN) designed to mitigate the vanishing and exploding gradient problems of standard RNN architectures. While a standard RNN processes sequential input one token at a time and updates a single hidden state at each step, GRU adds update and reset gates that help it capture long-range semantic dependencies in sequences. For practical implementation, GRU accepts the same input and output formats as a standard RNN but delivers more consistent performance on long input sequences.
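For reference, PyTorch's nn.GRU (used for both the encoder and decoder below) computes the following updates at each step $t$, where $\sigma$ is the sigmoid function, $r_t$ the reset gate, $z_t$ the update gate, and $\odot$ element-wise multiplication:

$r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr})$

$z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz})$

$n_t = \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn}))$

$h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$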

Sequence-to-Sequence (Seq2Seq) Model

The seq2seq architecture consists of two core modular components: an encoder and a decoder, both implemented with GRU layers.

Encoder Module

  1. Accepts an input source sequence $X = [x_1, x_2, ..., x_T]$
  2. Feeds each token sequentially into the GRU network, generating a hidden state $h_t$ for each time step $t$. The final hidden state $h_T$ serves as the context vector $c$ for the decoding phase.

Decoder Module

  1. Initializes its hidden state using the context vector $c$ produced by the encoder
  2. Generates the target translation sequence $Y = [y_1, y_2, ..., y_{T'}]$ step by step: the first decoding step takes the start-of-sequence token together with the initial hidden state $c$ to produce $y_1$ and $h'_1$, and each subsequent step uses the previous hidden state and the previously generated token to produce the next token.

The full pipeline encodes the input sequence into a fixed-size context vector, then decodes this vector into the target translation sequence, making this framework ideal for machine translation and text summarization tasks.
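In probabilistic terms, the decoder factorizes the translation probability autoregressively, conditioned on the context vector $c$:

$P(Y \mid X) = \prod_{t=1}^{T'} P(y_t \mid y_1, \ldots, y_{t-1}, c)$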

Data Preprocessing Pipeline

Neural networks cannot directly process raw text strings, so the core goal of this stage is to convert text data into numerical tensors compatible with model training. The preprocessing workflow includes:

  1. Text cleaning
  2. Custom dataset class implementation
  3. Vocabulary construction with tokenization
  4. Integration of the official term dictionary

Text Cleaning

First perform data exploration to identify and remove noisy content. For English text, expand contractions (e.g., "There’s" → "There is"), remove non-printable characters, and keep only alphanumeric characters and basic punctuation. Example code:

import re
import unicodedata
import contractions

def normalize_unicode(text):
    return ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')

def clean_english_text(raw_text):
    cleaned = normalize_unicode(raw_text.strip())
    cleaned = contractions.fix(cleaned)
    # Remove parenthetical annotations
    cleaned = re.sub(r'\([^)]*\)', '', cleaned)
    # Keep allowed characters only
    cleaned = re.sub(r"[^a-zA-Z0-9.!?]+", " ", cleaned)
    return cleaned

For Chinese text, remove parenthetical filler content like "(掌声)" and filter to retain only Chinese characters, standard punctuation, and numerical values:

def clean_chinese_text(raw_text):
    # Remove parenthetical annotations (both half-width and full-width parentheses)
    cleaned = re.sub(r'\([^)]*\)|（[^）]*）', '', raw_text)
    # Keep allowed Chinese characters, punctuation, and numbers
    cleaned = re.sub(r"[^\u4e00-\u9fa5，。！？0-9]", "", cleaned)
    return cleaned
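A quick check of both cleaners on a hypothetical raw pair (outputs shown approximately in the comments; the whitespace run left by the removed parenthetical is collapsed by the final character-filter step):

sample_en = "There's a very (Laughter) good idea!"
sample_zh = "（笑声）这是一个很好的主意！"
print(clean_english_text(sample_en))   # -> roughly "There is a very good idea!"
print(clean_chinese_text(sample_zh))   # -> roughly "这是一个很好的主意！"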

Custom Translation Dataset Class

Implement a PyTorch Dataset class to handle loading, tokenization, and tensor conversion of the training pairs:

from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer
from collections import Counter
import torch

class TranslationPairDataset(Dataset):
    def __init__(self, data_path, term_dict):
        self.raw_pairs = []
        # Load and parse training data
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                eng_sent, ch_sent = line.strip().split('\t')
                self.raw_pairs.append((eng_sent, ch_sent))
        
        self.term_dict = term_dict
        # Initialize tokenizers: basic English tokenizer, character-level for Chinese
        self.eng_tokenizer = get_tokenizer('basic_english')
        self.ch_tokenizer = list

        # Build vocabulary, including terms from the official dictionary
        eng_vocab_counter = Counter(self.term_dict.keys())
        ch_vocab_counter = Counter()

        for eng, ch in self.raw_pairs:
            eng_vocab_counter.update(self.eng_tokenizer(eng))
            ch_vocab_counter.update(self.ch_tokenizer(ch))

        # Add special tokens and build final vocabularies
        special_tokens = ['<pad>', '<sos>', '<eos>']
        self.eng_vocab = special_tokens + list(self.term_dict.keys()) + [word for word, _ in eng_vocab_counter.most_common(10000)]
        self.ch_vocab = special_tokens + [word for word, _ in ch_vocab_counter.most_common(10000)]

        # Create word-to-index mappings
        self.eng_w2i = {word: idx for idx, word in enumerate(self.eng_vocab)}
        self.ch_w2i = {word: idx for idx, word in enumerate(self.ch_vocab)}

    def __len__(self):
        return len(self.raw_pairs)

    def __getitem__(self, idx):
        eng_sent, ch_sent = self.raw_pairs[idx]
        # Convert tokens to index tensors with SOS/EOS markers; out-of-vocabulary tokens fall back to the <sos> index since no <unk> token is defined
        eng_tensor = torch.tensor([self.eng_w2i.get(token, self.eng_w2i['<sos>']) for token in self.eng_tokenizer(eng_sent)] + [self.eng_w2i['<eos>']])
        ch_tensor = torch.tensor([self.ch_w2i.get(token, self.ch_w2i['<sos>']) for token in self.ch_tokenizer(ch_sent)] + [self.ch_w2i['<eos>']])
        return eng_tensor, ch_tensor
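A quick usage sketch (load_terminology_dictionary is the dictionary-loading helper used in the main script; a possible implementation is sketched further below):

term_dict = load_terminology_dictionary("./data/en-zh.dic")
dataset = TranslationPairDataset("./data/train.txt", term_dict)
eng_tensor, ch_tensor = dataset[0]
print(len(dataset), eng_tensor.shape, ch_tensor.shape)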

Model Architecture Implementation

We implement a standard GRU-based seq2seq model for machine translation:

Encoder Network

import torch.nn as nn

class EnglishEncoder(nn.Module):
    def __init__(self, src_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src_input):
        # src_input shape: [batch_size, src_sequence_length]
        embedded = self.dropout(self.embedding(src_input))
        # embedded shape: [batch_size, src_sequence_length, emb_dim]
        outputs, final_hidden = self.gru(embedded)
        return outputs, final_hidden

Decoder Network

class ChineseDecoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.output_dim = tgt_vocab_size
        self.embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.fc_output = nn.Linear(hid_dim, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, tgt_input, hidden_state):
        # tgt_input shape: [batch_size, 1]
        # hidden_state shape: [num_layers, batch_size, hid_dim]
        embedded = self.dropout(self.embedding(tgt_input))
        # embedded shape: [batch_size, 1, emb_dim]
        outputs, new_hidden = self.gru(embedded, hidden_state)
        # outputs shape: [batch_size, 1, hid_dim]
        prediction = self.fc_output(outputs.squeeze(1))
        return prediction, new_hidden

Full Seq2Seq Model

import random

class TranslationSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src_seq, tgt_seq, teacher_forcing_ratio=0.5):
        batch_size = src_seq.shape[0]
        tgt_len = tgt_seq.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        # Initialize output tensor for all time steps
        all_outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        # Get encoder final hidden state as decoder initial state
        _, encoder_final_hidden = self.encoder(src_seq)
        # Initial decoder input is the start-of-sequence token
        dec_input = tgt_seq[:, 0].unsqueeze(1)

        for step in range(1, tgt_len):
            output, new_hidden = self.decoder(dec_input, encoder_final_hidden)
            all_outputs[:, step, :] = output
            # Use teacher forcing randomly during training
            use_teacher_forcing = random.random() < teacher_forcing_ratio
            top_pred_token = output.argmax(1)
            dec_input = tgt_seq[:, step].unsqueeze(1) if use_teacher_forcing else top_pred_token.unsqueeze(1)
            encoder_final_hidden = new_hidden

        return all_outputs
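A quick shape check with random token indices (a minimal sketch; the hyperparameters mirror the training script below, and the two vocabulary sizes are made up for illustration):

import torch

enc = EnglishEncoder(src_vocab_size=1000, emb_dim=256, hid_dim=512, num_layers=2, dropout_rate=0.5)
dec = ChineseDecoder(tgt_vocab_size=1200, emb_dim=256, hid_dim=512, num_layers=2, dropout_rate=0.5)
toy_model = TranslationSeq2Seq(enc, dec, torch.device("cpu"))

src = torch.randint(0, 1000, (4, 12))  # [batch_size, src_len]
tgt = torch.randint(0, 1200, (4, 15))  # [batch_size, tgt_len]
print(toy_model(src, tgt).shape)       # torch.Size([4, 15, 1200])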

Training Workflow

The standard training loop for this model follows these steps:

  1. Load batched data via DataLoader
  2. Forward pass through the model to get predictions
  3. Calculate cross-entropy loss between predictions and target sequences
  4. Backpropagate gradients
  5. Clip gradients to prevent explosion
  6. Update model weights via the optimizer

Example training loop code:

import torch.optim as optim

def train_one_epoch(model, data_loader, optimizer, loss_fn, grad_clip, device):
    model.train()
    total_loss = 0.0
    for batch_idx, (src_batch, tgt_batch) in enumerate(data_loader):
        src_batch = src_batch.to(device)
        tgt_batch = tgt_batch.to(device)
        optimizer.zero_grad()
        model_outputs = model(src_batch, tgt_batch)
        # Reshape outputs and targets for loss calculation
        output_flat = model_outputs[:, 1:].contiguous().view(-1, model_outputs.shape[-1])
        tgt_flat = tgt_batch[:, 1:].contiguous().view(-1)
        loss = loss_fn(output_flat, tgt_flat)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)

Full main training script:

import torch
from torch.utils.data import DataLoader, Subset
import time

def collate_fn(batch):
    # Custom collate function to pad sequences in a batch
    src_list, tgt_list = zip(*batch)
    src_padded = torch.nn.utils.rnn.pad_sequence(src_list, batch_first=True, padding_value=0)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_list, batch_first=True, padding_value=0)
    return src_padded, tgt_padded

if __name__ == "__main__":
    start_time = time.time()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load term dictionary (implementation omitted for brevity)
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    # Initialize dataset
    full_dataset = TranslationPairDataset("./data/train.txt", term_dict)
    # Use subset of data for quick testing
    sample_count = 1000
    subset_indices = list(range(sample_count))
    train_subset = Subset(full_dataset, subset_indices)
    train_loader = DataLoader(train_subset, batch_size=32, shuffle=True, collate_fn=collate_fn)

    # Initialize model components
    SRC_VOCAB_SIZE = len(full_dataset.eng_vocab)
    TGT_VOCAB_SIZE = len(full_dataset.ch_vocab)
    EMB_DIM = 256
    HID_DIM = 512
    NUM_LAYERS = 2
    DROPOUT = 0.5

    encoder = EnglishEncoder(SRC_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    decoder = ChineseDecoder(TGT_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)

    # Define optimizer and loss function (ignore padding token in loss)
    optimizer = optim.Adam(model.parameters())
    pad_idx = full_dataset.ch_w2i['<pad>']
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)

    # Training parameters
    EPOCHS = 10
    GRAD_CLIP = 1.0

    for epoch in range(EPOCHS):
        epoch_loss = train_one_epoch(model, train_loader, optimizer, loss_fn, GRAD_CLIP, device)
        print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.3f}")

    # Save trained model
    torch.save(model.state_dict(), "./translation_model_gru.pth")
    elapsed = (time.time() - start_time) / 60
    print(f"Total training time: {elapsed:.2f} minutes")

The standard GRU-based seq2seq model delivers limited translation quality and trains slowly. A Transformer-based baseline was also developed; it reduced 10-epoch training time to ~1 hour on an RTX 4090 and achieved a BLEU-4 score of ~13.9, a significant improvement over the GRU baseline.

BLEU-4 Evaluation Metric

BLEU (Bilingual Evaluation Understudy) is a standard metric for evaluating machine translation quality, with BLEU-4 using 1- to 4-gram overlapping matches between generated and reference translations.

BLEU-4 Calculation Steps

  1. Tokenize both generated and reference translations
  2. Count n-gram overlaps for n=1 to 4
  3. Calculate clipped precision for each n-gram
  4. Compute weighted geometric mean of the precisions
  5. Apply brevity penalty to account for short generated sequences
  6. Combine the weighted mean and penalty to get the final BLEU-4 score
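Formally, with clipped n-gram precisions $p_n$, uniform weights $w_n = \tfrac{1}{4}$, candidate length $c$, and reference length $r$, steps 4-6 combine into:

$\text{BLEU-4} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$, where $\text{BP} = 1$ if $c > r$ and $\text{BP} = e^{1 - r/c}$ otherwise.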

Example calculation:

Reference: "The cat is on the mat"
Generated: "The cat sat on the mat"

  1. Tokenized reference: ["The", "cat", "is", "on", "the", "mat"]; tokenized generated: ["The", "cat", "sat", "on", "the", "mat"]
  2. Clipped n-gram matches: 1-gram: 5, 2-gram: 3, 3-gram: 1, 4-gram: 0
  3. Clipped precisions: $p_1 = 5/6 \approx 0.833$, $p_2 = 3/5 = 0.6$, $p_3 = 1/4 = 0.25$, $p_4 = 0/3 = 0$
  4. Weighted geometric mean: $\exp\left(\frac{1}{4}\sum_{n=1}^{4}\log p_n\right)$; because $p_4 = 0$, the unsmoothed mean collapses to 0, which is why toolkits such as sacrebleu apply a smoothing method that assigns a small non-zero value to empty n-gram counts
  5. Brevity penalty: 1.0 (generated and reference lengths match)
  6. Final BLEU-4 score: 0 without smoothing, and a small positive value once smoothing is applied

Example evaluation code:

from sacrebleu.metrics import BLEU
from typing import List

def load_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def translate_single_sentence(sentence: str, model: TranslationSeq2Seq, dataset: TranslationPairDataset, term_dict, device, max_len=50):
    model.eval()
    # Create reversed term mapping for quick lookup
    term_ch_to_en = {v: k for k, v in term_dict.items()}
    # Tokenize and convert to tensor
    eng_tokens = dataset.eng_tokenizer(sentence)
    src_tensor = torch.LongTensor([dataset.eng_w2i.get(t, dataset.eng_w2i['<sos>']) for t in eng_tokens]).unsqueeze(0).to(device)
    
    with torch.no_grad():
        _, enc_hidden = model.encoder(src_tensor)
    
    translated_tokens = []
    dec_input = torch.LongTensor([[dataset.ch_w2i['<sos>']]]).to(device)
    current_hidden = enc_hidden

    for _ in range(max_len):
        output, current_hidden = model.decoder(dec_input, current_hidden)
        pred_token = output.argmax(1).item()
        pred_word = dataset.ch_vocab[pred_token]
        
        if pred_word == '<eos>':
            break
        # Replace with standardized term if available
        if pred_word in term_ch_to_en:
            pred_word = term_ch_to_en[pred_word]
        translated_tokens.append(pred_word)
        dec_input = torch.LongTensor([[pred_token]]).to(device)
    
    return ''.join(translated_tokens)

def evaluate_bleu_score(model, dataset, src_file, ref_file, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    ref_sents = load_text_file(ref_file)
    translated_sents = []
    
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    
    bleu_scorer = BLEU()
    final_score = bleu_scorer.corpus_score(translated_sents, [ref_sents])
    return final_score
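A usage sketch, assuming hypothetical held-out files dev_en.txt and dev_zh.txt with one sentence per line (sacrebleu's corpus_score returns an object whose .score attribute holds the BLEU value):

# Hypothetical dev-set paths; adjust to your own split
bleu_result = evaluate_bleu_score(model, dataset, "./data/dev_en.txt", "./data/dev_zh.txt", term_dict, device)
print(f"BLEU-4: {bleu_result.score:.2f}")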

Inference Pipeline

After training, use the model to generate translations for the test set:

def run_inference(model, dataset, src_file, save_path, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    translated_sents = []
    
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(translated_sents))

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    dataset = TranslationPairDataset("./data/train.txt", term_dict)
    
    # Reinitialize model and load weights
    encoder = EnglishEncoder(len(dataset.eng_vocab), 256, 512, 2, 0.5)
    decoder = ChineseDecoder(len(dataset.ch_vocab), 256, 512, 2, 0.5)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)
    model.load_state_dict(torch.load("./translation_model_gru.pth", map_location=device))
    
    run_inference(model, dataset, "./data/test_en.txt", "./data/submit.txt", term_dict, device)
    print(f"Translation complete! Results saved to ./data/submit.txt")
