Troubleshooting Common Development and Data Collection Errors
1. SSL/TLS Connection Failures
An SSLError(SSLEOFError(...)) usually means the connection was cut off mid-handshake, a TLS protocol violation. When scraping foreign websites, a common culprit is proxy settings (often inherited from environment variables) interfering with the connection.
Solution: Check for proxy environment variables (HTTP_PROXY, HTTPS_PROXY, ALL_PROXY) and unset them, then test connectivity with:
curl -vv https://www.github.com
If the error occurs during pip install, switching to a domestic mirror source can resolve it.
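The first step above can be sketched in Python: clearing the standard proxy variables for the current process so HTTP clients fall back to a direct connection. The helper name clear_proxy_env is illustrative, not a library function.

```python
import os

# The standard proxy variables honored by most HTTP clients (requests, urllib, pip).
PROXY_VARS = ["HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
              "http_proxy", "https_proxy", "all_proxy"]

def clear_proxy_env():
    """Remove proxy variables from this process's environment and
    return the old values so they can be restored later."""
    removed = {}
    for var in PROXY_VARS:
        if var in os.environ:
            removed[var] = os.environ.pop(var)
    return removed
```

Clearing the variables only affects the current process, so it is safer than editing system-wide proxy settings while debugging.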
2. SSH Authentication and Connection Issues
Encountering kex_exchange_identification: Connection closed by remote host during git push means the server, or an intermediary such as a firewall or proxy, closed the SSH connection before key exchange completed.
Solution: Use verbose SSH testing to diagnose the connection:
ssh -Tv git@github.com
The -v output shows exactly where the connection fails (DNS, TCP connect, key exchange, or authentication). The failure is often transient or proxy-related; once the test reaches GitHub's "successfully authenticated" message, a subsequent git push should succeed.
3. Windows DLL Import Failures
An ImportError: DLL load failed while importing _igraph on Windows indicates that the package's native extension cannot find the Microsoft Visual C++ runtime libraries it was built against.
Solution: Download and install the Visual C++ Redistributable package (vc_redist.x64.exe) to provide the required runtime dependencies.
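The failure can be surfaced with a clearer message by wrapping the import. Note that import_with_hint is a hypothetical helper sketched here, not part of igraph or the standard library:

```python
import importlib

def import_with_hint(module_name, hint):
    """Import a module, re-raising ImportError with an actionable hint
    (e.g., pointing Windows users at vc_redist.x64.exe)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(f"{module_name} failed to load: {hint}") from err

# Usage (assumes python-igraph is installed but its DLL dependency is missing):
# igraph = import_with_hint(
#     "igraph", "install the Visual C++ Redistributable (vc_redist.x64.exe)")
```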
4. Hostname Resolution and Connection Errors
A ConnectionError with getaddrinfo failed for raw.githubusercontent.com, while api.github.com works, points to DNS resolution or network blocking of that specific subdomain.
Solution: A VPN can bypass the regional network restrictions or DNS blocks that prevent access to specific GitHub subdomains.
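The DNS half of this diagnosis can be reproduced in Python: socket.gaierror is the exception behind the "getaddrinfo failed" message. The helper name can_resolve is illustrative:

```python
import socket

def can_resolve(hostname, port=443):
    """Return True if the system resolver can resolve hostname,
    False on a DNS failure (socket.gaierror)."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# Comparing the two hosts isolates DNS problems from HTTP-level errors:
# can_resolve("api.github.com") vs can_resolve("raw.githubusercontent.com")
```

If one host resolves and the other does not, the problem is DNS or blocking, not your request code.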
5. Transformer Model Training Issues
5.1 Token Overflow Warning
The tokenizer warns that overflowing tokens are not returned when sequence pairs are truncated with the 'longest_first' strategy. The warning is informational: inputs are still truncated correctly.
Solution: If the behavior is expected, lower the transformers log level so only errors are shown:
from transformers import logging
logging.set_verbosity_error()  # hides warning and info messages
5.2 Attribute Error on List
AttributeError: 'list' object has no attribute 'to' occurs when calling a tensor method on a plain Python list: .to(device) exists on torch.Tensor, not on list.
Solution: Convert the data to a PyTorch tensor before calling .to(device):
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data_tensor = torch.tensor(your_list_data).to(device)  # your_list_data: a numeric list
5.3 Tensor Creation Error
ValueError: Unable to create tensor is raised when a feature column has excessive nesting, e.g., a list where an int is expected, so the batch cannot be converted to a tensor.
Solution: After tokenization, the dataset may still contain the original text columns, which the data collator cannot convert to tensors. Remove them before training:
train_dataset = train_dataset.remove_columns(['text1', 'text2'])  # substitute your actual text column names
Also ensure the tokenizer pads and truncates to a uniform length: padding=True, truncation=True.
5.4 Significant Accuracy Drop on Test Set
A large discrepancy where test accuracy is much lower than validation accuracy typically indicates data leakage: validation examples that also appear in the training set inflate validation accuracy, while the untouched test set reveals the true performance.
Solution: Verify that the training and validation sets are disjoint, and split the data before any preprocessing or augmentation that could copy examples across splits.
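A minimal leakage check, assuming the raw input texts for each split are available (the function name and toy data are illustrative):

```python
def split_overlap(train_texts, eval_texts):
    """Return examples that appear verbatim in both splits.
    A non-empty result signals train/eval leakage."""
    return set(train_texts) & set(eval_texts)

# Toy data: one sentence leaked from train into validation.
train = ["a cat sat", "dogs bark", "fish swim"]
valid = ["dogs bark", "birds fly"]
leaked = split_overlap(train, valid)
print(leaked)  # {'dogs bark'}
```

Exact string matching misses near-duplicates; in practice, normalizing text (lowercasing, stripping whitespace) or fuzzy matching may be needed before comparing splits.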