Baseline Analysis for iFLYTEK Machine Translation Challenge at Datawhale AI Summer Camp
Dataset Overview
The official competition dataset includes 140,000 training sentence pairs, a test set for model evaluation, and a bilingual term dictionary for standardizing specialized vocabulary translations. Each line in the training file train.txt contains an English sentence and a corresponding Chinese sentence separated by a tab character \t, which will be parsed during data loading.
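As a sketch, the tab-separated format can be parsed with a few lines of standard-library Python (the helper name and sample sentences here are illustrative, not part of the official baseline):

```python
import tempfile

def load_parallel_pairs(path):
    """Parse a train.txt-style file: one English<TAB>Chinese pair per line."""
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            eng, zh = line.split('\t', maxsplit=1)
            pairs.append((eng, zh))
    return pairs

# Tiny demo with two sample lines in the same format
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Hello world\t你好世界\nGood morning\t早上好\n")
pairs = load_parallel_pairs(tmp.name)
print(pairs)
```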
Core NLP Model Foundations
This baseline uses a sequence-to-sequence (seq2seq) framework with GRU-based encoder and decoder components.
Gated Recurrent Unit (GRU)
GRU is a variant of the recurrent neural network (RNN) designed to mitigate the vanishing- and exploding-gradient problems of standard RNN architectures. A standard RNN processes sequential input one token at a time, updating a hidden state at each step; a GRU adds update and reset gates that let it better capture long-range semantic dependencies in sequences. For practical implementation, GRU accepts the same input and output formats as a standard RNN but delivers more consistent performance on long input sequences.
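Concretely, at each time step $t$ a GRU computes the following (one common formulation; $\sigma$ denotes the sigmoid function and $\odot$ element-wise multiplication, with bias terms omitted for brevity):

- Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$
- Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$
- Candidate state: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}))$
- New hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The update gate controls how much of the previous hidden state is carried forward, and the reset gate controls how much of it contributes to the candidate state.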
Sequence-to-Sequence (Seq2Seq) Model
The seq2seq architecture consists of two core components, an encoder and a decoder, both implemented here with GRU layers.
Encoder Module
- Accepts an input source sequence $X = [x_1, x_2, ..., x_T]$
- Feeds each token sequentially into the GRU network, generating a hidden state $h_t$ for each time step $t$. The final hidden state $h_T$ serves as the context vector $c$ for the decoding phase.
Decoder Module
- Initializes its hidden state using the context vector $c$ produced by the encoder
- Generates the target translation sequence $Y = [y_1, y_2, ..., y_{T'}]$ step by step: the first decoder step consumes the start-of-sequence token with $c$ as its initial hidden state to produce $y_1$ and $h'_1$; each subsequent step feeds the previous hidden state and the previous output token back in to generate the next token.
The full pipeline encodes the input sequence into a fixed-size context vector, then decodes this vector into the target translation sequence, making this framework ideal for machine translation and text summarization tasks.
Data Preprocessing Pipeline
Neural networks cannot directly process raw text strings, so the core goal of this stage is to convert text data into numerical tensors compatible with model training. The preprocessing workflow includes:
- Text cleaning
- Custom dataset class implementation
- Vocabulary construction with tokenization
- Integration of the official term dictionary
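One piece the later scripts rely on is load_terminology_dictionary for en-zh.dic. Its implementation is not part of the baseline listing; assuming one tab-separated English/Chinese entry per line (the file format and sample terms below are assumptions), it could look like this:

```python
import tempfile

def load_terminology_dictionary(dict_path):
    """Assumed format: one 'english_term<TAB>中文术语' entry per line."""
    term_dict = {}
    with open(dict_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            en_term, zh_term = line.split('\t', maxsplit=1)
            term_dict[en_term] = zh_term
    return term_dict

# Tiny demo with two hypothetical entries
with tempfile.NamedTemporaryFile('w', suffix='.dic', delete=False, encoding='utf-8') as tmp:
    tmp.write("neural network\t神经网络\ngradient\t梯度\n")
term_dict = load_terminology_dictionary(tmp.name)
print(term_dict)
```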
Text Cleaning
First perform data exploration to identify and remove noisy content. For English text, expand contractions (e.g., "There's" → "There is"), remove non-printable characters, and filter to retain only alphanumeric characters and basic punctuation. Example code:
import re
import unicodedata
import contractions

def normalize_unicode(text):
    # Strip combining marks after NFD normalization
    return ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')

def clean_english_text(raw_text):
    cleaned = normalize_unicode(raw_text.strip())
    cleaned = contractions.fix(cleaned)
    # Remove parenthetical annotations
    cleaned = re.sub(r'\([^)]*\)', '', cleaned)
    # Keep allowed characters only
    cleaned = re.sub(r"[^a-zA-Z0-9.!?]+", " ", cleaned)
    return cleaned
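To see what the cleaning does, here is a dependency-free demo of the same regex steps (the contractions and Unicode-normalization steps are skipped so it runs with the standard library alone; the function name is illustrative):

```python
import re

def demo_clean_english(raw_text):
    # Same regex steps as clean_english_text, minus the third-party
    # contractions/unicodedata handling, so this demo is dependency-free
    cleaned = re.sub(r'\([^)]*\)', '', raw_text.strip())   # drop "(Applause)" etc.
    cleaned = re.sub(r"[^a-zA-Z0-9.!?]+", " ", cleaned)    # keep letters, digits, . ! ?
    return cleaned.strip()

print(demo_clean_english("Thank you! (Applause) It's day 2."))
```

Note that in the full pipeline contractions.fix runs first, so "It's" would become "It is" before the apostrophe is stripped, avoiding the dangling "s" this simplified demo produces.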
For Chinese text, remove parenthetical filler content like "(掌声)" and filter to retain only Chinese characters, common full-width punctuation, and digits:

def clean_chinese_text(raw_text):
    # Remove all parenthetical annotations (both half- and full-width parentheses;
    # the parentheses must be escaped, or the pattern becomes a capture group)
    cleaned = re.sub(r'\([^)]*\)|（[^）]*）', '', raw_text)
    # Keep allowed Chinese characters, full-width punctuation, and digits
    cleaned = re.sub(r"[^\u4e00-\u9fa5，。！？0-9]", "", cleaned)
    return cleaned
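A quick self-contained demo of the Chinese cleaner (note the escaped parentheses: an unescaped ( … ) pair is a regex capture group rather than literal parentheses, and would delete nearly the whole sentence):

```python
import re

def demo_clean_chinese(raw_text):
    # Remove half- and full-width parenthetical annotations
    cleaned = re.sub(r'\([^)]*\)|（[^）]*）', '', raw_text)
    # Keep Chinese characters, full-width punctuation, and digits
    cleaned = re.sub(r"[^\u4e00-\u9fa5，。！？0-9]", "", cleaned)
    return cleaned

print(demo_clean_chinese("(掌声) 大家好，今天是第2天！"))
print(demo_clean_chinese("（笑声）谢谢。"))
```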
Custom Translation Dataset Class
Implement a PyTorch Dataset class to handle loading, tokenization, and tensor conversion of the training pairs:
from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer
from collections import Counter
import torch

class TranslationPairDataset(Dataset):
    def __init__(self, data_path, term_dict):
        self.raw_pairs = []
        # Load and parse tab-separated training data
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                eng_sent, ch_sent = line.strip().split('\t')
                self.raw_pairs.append((eng_sent, ch_sent))
        self.term_dict = term_dict
        # Initialize tokenizers: basic English tokenizer, character-level for Chinese
        self.eng_tokenizer = get_tokenizer('basic_english')
        self.ch_tokenizer = list
        # Count token frequencies, seeding the English counter with dictionary terms
        eng_vocab_counter = Counter(self.term_dict.keys())
        ch_vocab_counter = Counter()
        for eng, ch in self.raw_pairs:
            eng_vocab_counter.update(self.eng_tokenizer(eng))
            ch_vocab_counter.update(self.ch_tokenizer(ch))
        # Add special tokens and build final vocabularies; dict.fromkeys deduplicates
        # while preserving order, since dictionary terms also appear in the counts
        special_tokens = ['<pad>', '<sos>', '<eos>']
        eng_words = special_tokens + list(self.term_dict.keys()) + [word for word, _ in eng_vocab_counter.most_common(10000)]
        self.eng_vocab = list(dict.fromkeys(eng_words))
        self.ch_vocab = special_tokens + [word for word, _ in ch_vocab_counter.most_common(10000)]
        # Create word-to-index mappings
        self.eng_w2i = {word: idx for idx, word in enumerate(self.eng_vocab)}
        self.ch_w2i = {word: idx for idx, word in enumerate(self.ch_vocab)}

    def __len__(self):
        return len(self.raw_pairs)

    def __getitem__(self, idx):
        eng_sent, ch_sent = self.raw_pairs[idx]
        # Convert tokens to index tensors; out-of-vocabulary tokens fall back to
        # <sos> in this baseline (a dedicated <unk> token would be cleaner).
        # The target gets both <sos> (consumed as the first decoder input) and <eos>.
        eng_tensor = torch.tensor([self.eng_w2i.get(token, self.eng_w2i['<sos>']) for token in self.eng_tokenizer(eng_sent)] + [self.eng_w2i['<eos>']])
        ch_tensor = torch.tensor([self.ch_w2i['<sos>']] + [self.ch_w2i.get(token, self.ch_w2i['<sos>']) for token in self.ch_tokenizer(ch_sent)] + [self.ch_w2i['<eos>']])
        return eng_tensor, ch_tensor
Model Architecture Implementation
We implement a standard GRU-based seq2seq model for machine translation:
Encoder Network
import torch.nn as nn

class EnglishEncoder(nn.Module):
    def __init__(self, src_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src_input):
        # src_input shape: [batch_size, src_sequence_length]
        embedded = self.dropout(self.embedding(src_input))
        # embedded shape: [batch_size, src_sequence_length, emb_dim]
        outputs, final_hidden = self.gru(embedded)
        return outputs, final_hidden
Decoder Network
class ChineseDecoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.output_dim = tgt_vocab_size
        self.embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout_rate, batch_first=True)
        self.fc_output = nn.Linear(hid_dim, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, tgt_input, hidden_state):
        # tgt_input shape: [batch_size, 1]
        # hidden_state shape: [num_layers, batch_size, hid_dim]
        embedded = self.dropout(self.embedding(tgt_input))
        # embedded shape: [batch_size, 1, emb_dim]
        outputs, new_hidden = self.gru(embedded, hidden_state)
        # outputs shape: [batch_size, 1, hid_dim]
        prediction = self.fc_output(outputs.squeeze(1))
        return prediction, new_hidden
Full Seq2Seq Model
import random

class TranslationSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src_seq, tgt_seq, teacher_forcing_ratio=0.5):
        batch_size = src_seq.shape[0]
        tgt_len = tgt_seq.shape[1]
        tgt_vocab_size = self.decoder.output_dim
        # Initialize output tensor for all time steps
        all_outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        # Get encoder final hidden state as decoder initial state
        _, hidden = self.encoder(src_seq)
        # Initial decoder input is the start-of-sequence token
        dec_input = tgt_seq[:, 0].unsqueeze(1)
        for step in range(1, tgt_len):
            output, hidden = self.decoder(dec_input, hidden)
            all_outputs[:, step, :] = output
            # Use teacher forcing randomly during training
            use_teacher_forcing = random.random() < teacher_forcing_ratio
            top_pred_token = output.argmax(1)
            dec_input = tgt_seq[:, step].unsqueeze(1) if use_teacher_forcing else top_pred_token.unsqueeze(1)
        return all_outputs
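A quick sanity check of the tensor shapes through the forward pass can be sketched with condensed, self-contained copies of the classes above (vocabulary sizes and sequence lengths here are illustrative, and dropout is omitted for brevity):

```python
import random
import torch
import torch.nn as nn

class Enc(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, batch_first=True)

    def forward(self, src):
        return self.gru(self.embedding(src))

class Dec(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers):
        super().__init__()
        self.output_dim = vocab_size
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):
        out, hidden = self.gru(self.embedding(tgt), hidden)
        return self.fc(out.squeeze(1)), hidden

def seq2seq_forward(enc, dec, src, tgt, teacher_forcing_ratio=0.5):
    batch_size, tgt_len = tgt.shape
    all_outputs = torch.zeros(batch_size, tgt_len, dec.output_dim)
    _, hidden = enc(src)                 # encoder final hidden -> decoder init
    dec_input = tgt[:, 0].unsqueeze(1)   # start-of-sequence token
    for step in range(1, tgt_len):
        pred, hidden = dec(dec_input, hidden)
        all_outputs[:, step, :] = pred
        teacher = random.random() < teacher_forcing_ratio
        dec_input = tgt[:, step].unsqueeze(1) if teacher else pred.argmax(1).unsqueeze(1)
    return all_outputs

enc = Enc(vocab_size=100, emb_dim=16, hid_dim=32, num_layers=2)
dec = Dec(vocab_size=120, emb_dim=16, hid_dim=32, num_layers=2)
src = torch.randint(0, 100, (4, 9))   # batch of 4 source sequences, length 9
tgt = torch.randint(0, 120, (4, 7))   # matching targets, length 7
out = seq2seq_forward(enc, dec, src, tgt)
print(out.shape)
```

The output has shape [batch_size, tgt_len, tgt_vocab_size], with step 0 left as zeros, which is why the training loop slices off the first time step before computing the loss.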
Training Workflow
The standard training loop for this model follows these steps:
- Load batched data via DataLoader
- Forward pass through the model to get predictions
- Calculate cross-entropy loss between predictions and target sequences
- Backpropagate gradients
- Clip gradients to prevent explosion, then update model weights
Example training loop code:
import torch.optim as optim

def train_one_epoch(model, data_loader, optimizer, loss_fn, grad_clip, device):
    model.train()
    total_loss = 0.0
    for batch_idx, (src_batch, tgt_batch) in enumerate(data_loader):
        src_batch = src_batch.to(device)
        tgt_batch = tgt_batch.to(device)
        optimizer.zero_grad()
        model_outputs = model(src_batch, tgt_batch)
        # Reshape outputs and targets for loss calculation, skipping step 0
        output_flat = model_outputs[:, 1:].contiguous().view(-1, model_outputs.shape[-1])
        tgt_flat = tgt_batch[:, 1:].contiguous().view(-1)
        loss = loss_fn(output_flat, tgt_flat)
        loss.backward()
        # Clip gradients before the optimizer step to prevent explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)
Full main training script:
import torch
from torch.utils.data import DataLoader, Subset
import time

def collate_fn(batch):
    # Pad variable-length sequences in a batch (padding index 0 == '<pad>')
    src_list, tgt_list = zip(*batch)
    src_padded = torch.nn.utils.rnn.pad_sequence(src_list, batch_first=True, padding_value=0)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_list, batch_first=True, padding_value=0)
    return src_padded, tgt_padded

if __name__ == "__main__":
    start_time = time.time()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Load term dictionary (implementation omitted for brevity)
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    # Initialize dataset
    full_dataset = TranslationPairDataset("./data/train.txt", term_dict)
    # Use a subset of the data for quick testing
    sample_count = 1000
    subset_indices = list(range(sample_count))
    train_subset = Subset(full_dataset, subset_indices)
    train_loader = DataLoader(train_subset, batch_size=32, shuffle=True, collate_fn=collate_fn)
    # Initialize model components
    SRC_VOCAB_SIZE = len(full_dataset.eng_vocab)
    TGT_VOCAB_SIZE = len(full_dataset.ch_vocab)
    EMB_DIM = 256
    HID_DIM = 512
    NUM_LAYERS = 2
    DROPOUT = 0.5
    encoder = EnglishEncoder(SRC_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    decoder = ChineseDecoder(TGT_VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROPOUT)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)
    # Define optimizer and loss function (ignore padding token in loss)
    optimizer = optim.Adam(model.parameters())
    pad_idx = full_dataset.ch_w2i['<pad>']
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)
    # Training parameters
    EPOCHS = 10
    GRAD_CLIP = 1.0
    for epoch in range(EPOCHS):
        epoch_loss = train_one_epoch(model, train_loader, optimizer, loss_fn, GRAD_CLIP, device)
        print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.3f}")
    # Save trained model
    torch.save(model.state_dict(), "./translation_model_gru.pth")
    elapsed = (time.time() - start_time) / 60
    print(f"Total training time: {elapsed:.2f} minutes")
The standard GRU-based seq2seq model has limited performance and slow training speed. A Transformer-based baseline was also developed, which reduced 10-epoch training time to ~1 hour on an RTX 4090 and achieved a BLEU-4 score of ~13.9, a significant improvement over the GRU baseline.
BLEU-4 Evaluation Metric
BLEU (Bilingual Evaluation Understudy) is a standard metric for evaluating machine translation quality, with BLEU-4 using 1- to 4-gram overlapping matches between generated and reference translations.
BLEU-4 Calculation Steps
- Tokenize both generated and reference translations
- Count n-gram overlaps for n=1 to 4
- Calculate clipped precision for each n-gram
- Compute weighted geometric mean of the precisions
- Apply brevity penalty to account for short generated sequences
- Combine the weighted mean and penalty to get the final BLEU-4 score
Example calculation:
- Reference: "The cat is on the mat"; Generated: "The cat sat on the mat"
- Tokenized reference: ["The", "cat", "is", "on", "the", "mat"]; tokenized generated: ["The", "cat", "sat", "on", "the", "mat"]
- Clipped n-gram matches: 1-gram: 5 of 6, 2-gram: 3 of 5, 3-gram: 1 of 4, 4-gram: 0 of 3
- Clipped precisions: $p_1 = 5/6 \approx 0.833$, $p_2 = 3/5 = 0.6$, $p_3 = 1/4 = 0.25$, $p_4 = 0/3 = 0$
- Weighted geometric mean: $\exp\left(\frac{1}{4}(\log p_1 + \log p_2 + \log p_3 + \log p_4)\right)$, which collapses to 0 whenever any $p_n$ is 0
- Brevity penalty: 1.0 (candidate and reference lengths match)
- Because the 4-gram precision is 0 here, the unsmoothed BLEU-4 score for this single pair is 0. In practice a smoothing method (sacrebleu applies one by default) assigns small positive counts to zero matches, and BLEU is normally reported at the corpus level, where the 4-gram precision is rarely zero.
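Clipped n-gram precisions can be computed directly with a short standard-library script (the function name is illustrative):

```python
from collections import Counter

def clipped_precisions(candidate, reference, max_n=4):
    """Clipped n-gram precisions for one candidate/reference token-list pair."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Each candidate n-gram is credited at most as often as it occurs in the reference
        matches = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(matches / total)
    return precisions

reference = "The cat is on the mat".split()
candidate = "The cat sat on the mat".split()
precisions = clipped_precisions(candidate, reference)
print(precisions)
```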
Example evaluation code:
from sacrebleu.metrics import BLEU
from typing import List

def load_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def translate_single_sentence(sentence: str, model: TranslationSeq2Seq, dataset: TranslationPairDataset, term_dict, device, max_len=50):
    model.eval()
    # Reverse mapping (Chinese term -> English source term); note that with the
    # character-level Chinese tokenizer a multi-character term can never equal a
    # single predicted character, so this post-hoc hook rarely fires
    term_ch_to_en = {v: k for k, v in term_dict.items()}
    # Tokenize and convert to an index tensor (OOV tokens fall back to <sos>)
    eng_tokens = dataset.eng_tokenizer(sentence)
    src_tensor = torch.LongTensor([dataset.eng_w2i.get(t, dataset.eng_w2i['<sos>']) for t in eng_tokens]).unsqueeze(0).to(device)
    with torch.no_grad():
        _, enc_hidden = model.encoder(src_tensor)
        translated_tokens = []
        dec_input = torch.LongTensor([[dataset.ch_w2i['<sos>']]]).to(device)
        current_hidden = enc_hidden
        for _ in range(max_len):
            output, current_hidden = model.decoder(dec_input, current_hidden)
            pred_token = output.argmax(1).item()
            pred_word = dataset.ch_vocab[pred_token]
            if pred_word == '<eos>':
                break
            # Replace with the standardized term if one is available
            if pred_word in term_ch_to_en:
                pred_word = term_ch_to_en[pred_word]
            translated_tokens.append(pred_word)
            dec_input = torch.LongTensor([[pred_token]]).to(device)
    return ''.join(translated_tokens)

def evaluate_bleu_score(model, dataset, src_file, ref_file, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    ref_sents = load_text_file(ref_file)
    translated_sents = []
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    # Use sacrebleu's built-in Chinese tokenizer for character-aware scoring
    bleu_scorer = BLEU(tokenize='zh')
    final_score = bleu_scorer.corpus_score(translated_sents, [ref_sents])
    return final_score
Inference Pipeline
After training, use the model to generate translations for the test set:
def run_inference(model, dataset, src_file, save_path, term_dict, device):
    model.eval()
    src_sents = load_text_file(src_file)
    translated_sents = []
    for sent in src_sents:
        translated = translate_single_sentence(sent, model, dataset, term_dict, device)
        translated_sents.append(translated)
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(translated_sents))

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    term_dict = load_terminology_dictionary("./data/en-zh.dic")
    dataset = TranslationPairDataset("./data/train.txt", term_dict)
    # Reinitialize model and load weights (map_location handles CPU-only machines)
    encoder = EnglishEncoder(len(dataset.eng_vocab), 256, 512, 2, 0.5)
    decoder = ChineseDecoder(len(dataset.ch_vocab), 256, 512, 2, 0.5)
    model = TranslationSeq2Seq(encoder, decoder, device).to(device)
    model.load_state_dict(torch.load("./translation_model_gru.pth", map_location=device))
    run_inference(model, dataset, "./data/test_en.txt", "./data/submit.txt", term_dict, device)
    print("Translation complete! Results saved to ./data/submit.txt")