
Converting Between Unicode and UTF-8 Using the C++ Standard Library

Tech May 18 18

Converting Character Encodings with C++11 Standard Library codecvt

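The C++11 standard library provides std::wstring_convert and the facets in the <codecvt> header, all built on the older std::codecvt facet, for converting between character encodings. These utilities were deprecated in C++17, which did not add a replacement, but they remain fully usable in C++11 and C++14 environments.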

std::wstring_convert works in conjunction with a std::codecvt facet. The former is the transcoder: it takes the facet as a template argument and converts between wide-character strings and byte-oriented strings. The latter defines the encoding conversion itself. For instance, std::codecvt_utf8<wchar_t> converts between UTF-8 and UCS-2 or UCS-4, depending on whether wchar_t is 16 or 32 bits wide on the platform.
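As a small illustration of how the facet argument selects the encoding, the sketch below swaps in std::codecvt_utf8<char32_t>, so the wide side is always UTF-32 regardless of the platform's wchar_t. The helper names utf32ToUtf8 and utf8ToUtf32 are made up for this example; only std::wstring_convert and std::codecvt_utf8 come from the standard library.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32ToUtf8(const std::u32string& codePoints) {
    // The second template argument must match the facet's wide character type.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(codePoints);
}

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string utf8ToUtf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

With these helpers, each element of the returned std::u32string is one Unicode code point, so its size() counts characters rather than bytes.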

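The following example converts a wide string to UTF-8 and back. Where wchar_t is 16 bits (Windows), the wide string holds UCS-2, the part of UTF-16 without surrogate pairs; where it is 32 bits (typical on Linux and macOS), it holds UTF-32.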

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
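    std::wstring sourceWide = L"Hello, World! 你好世界！";
    
    // codecvt_utf8<wchar_t> maps UTF-8 to UCS-2 (16-bit wchar_t) or UCS-4 (32-bit wchar_t)
    std::wstring_convert<std::codecvt_utf8<wchar_t>> transcoder;
    std::string resultUtf8 = transcoder.to_bytes(sourceWide);
    
    std::cout << "UTF-8 Output: " << resultUtf8 << std::endl;
    
    // Reuse the same facet for the reverse direction. A plain std::codecvt
    // cannot be used here: its destructor is protected, and it would apply
    // the locale's encoding rather than UTF-8.
    std::wstring backToWide = transcoder.from_bytes(resultUtf8);
    
    // Report the round trip on std::cout; mixing std::cout and std::wcout
    // on the same stream is unreliable on some platforms.
    std::cout << "Round trip OK: " << std::boolalpha
              << (backToWide == sourceWide) << std::endl;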
    
    return 0;
}

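Converting between wide strings and narrow strings in the locale's native encoding follows a similar pattern, this time with a locale-based facet instead of a UTF-8 one.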

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

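// std::codecvt_byname has a protected destructor, so wrap it in a type
// that std::wstring_convert is able to delete.
struct LocaleFacet : std::codecvt_byname<wchar_t, char, std::mbstate_t> {
    explicit LocaleFacet(const char* name)
        : std::codecvt_byname<wchar_t, char, std::mbstate_t>(name) {}
    ~LocaleFacet() {}
};

int main() {
    std::wstring wideStr = L"测试字符串";
    // "" selects the user's preferred locale from the environment.
    // In the "C" locale, non-ASCII input typically makes to_bytes throw std::range_error.
    std::wstring_convert<LocaleFacet> conv(new LocaleFacet(""));
    
    std::string narrow = conv.to_bytes(wideStr);
    std::wstring wide = conv.from_bytes(narrow);
    
    std::cout << narrow << std::endl;
    // Verify the round trip on std::cout rather than mixing in std::wcout.
    std::cout << "Round trip OK: " << std::boolalpha << (wide == wideStr) << std::endl;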
    
    return 0;
}
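
The examples above assume every conversion succeeds. When input cannot be converted, std::wstring_convert throws std::range_error unless an error string was passed to its constructor. A minimal sketch of guarding a conversion follows; tryUtf8ToWide is a hypothetical helper name, not a library function.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns false instead of propagating the exception
// when the bytes are not valid UTF-8. On success, 'out' holds the wide string.
bool tryUtf8ToWide(const std::string& bytes, std::wstring& out) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        // Conversion failed, for example on a byte that cannot start a UTF-8 sequence.
        return false;
    }
}
```

Calling tryUtf8ToWide with a byte such as 0xFF, which never appears in valid UTF-8, returns false rather than terminating the program.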

C++17 Encoding Conversion and Modern Alternatives

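In C++17, std::wstring_convert and the <codecvt> facets carry a deprecation warning, prompting developers to consider alternative approaches. When the wide side must be UTF-16 rather than UCS-2 or UCS-4, std::codecvt_utf8_utf16 is the appropriate facet. Note that it comes from the same deprecated header, so it is a better-fitting facet rather than a non-deprecated replacement.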

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
    std::wstring wideStr = L"Unicode Text 文字";
    
    // The wide string is treated as UTF-16 code units, whatever the width of wchar_t
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::string utf8Result = converter.to_bytes(wideStr);
    
    std::cout << "UTF-8: " << utf8Result << std::endl;
    
    std::wstring restored = converter.from_bytes(utf8Result);
    
    // Confirm the round trip on std::cout instead of mixing in std::wcout
    std::cout << "Restored matches: " << std::boolalpha
              << (restored == wideStr) << std::endl;
    
    return 0;
}

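Because std::wstring_convert default-constructs the facet, no locale object or facet pointer has to be created by hand, and a single converter instance handles both directions. One caveat: where wchar_t is 32 bits wide (typical on Linux and macOS), a character outside the Basic Multilingual Plane in a wide literal is stored as one element rather than a surrogate pair, so this facet may reject it.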
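
Where the toolchain supports it, pairing the same facet with char16_t and std::u16string keeps the UTF-16 side at 16 bits on every platform. The sketch below is one way to write that; utf8ToUtf16 and utf16ToUtf8 are illustrative names, not library functions.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
std::u16string utf8ToUtf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
std::string utf16ToUtf8(const std::u16string& units) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(units);
}
```

For example, the four-byte UTF-8 sequence for U+1F600 comes back from utf8ToUtf16 as two code units, the surrogate pair 0xD83D 0xDE00.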

The standard library provides no codecvt facets for GBK or other legacy encodings. What it can do is decode UTF-8 into a UTF-16 intermediate form and encode it again, as the following example shows. Converting that intermediate form to or from a legacy encoding requires an external library.

#include <iostream>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    std::string inputData = "Test Data 测试数据";
    
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> transformer;
    std::wstring intermediate = transformer.from_bytes(inputData);
    
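    // Printing wide text needs an imbued locale and should not be mixed with
    // std::cout, so report the number of UTF-16 code units instead.
    std::cout << "Intermediate Form: " << intermediate.size()
              << " code units" << std::endl;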
    
    std::string outputData = transformer.to_bytes(intermediate);
    std::cout << "Output: " << outputData << std::endl;
    
    return 0;
}

For production applications that need GBK or other locale-specific encodings, third-party libraries such as libiconv, or platform-specific APIs such as MultiByteToWideChar on Windows, provide far broader coverage than the standard library's deprecated facilities.
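
As a rough sketch of the libiconv route, the helper below wraps the POSIX iconv interface (iconv_open, iconv, iconv_close). The name convertEncoding is hypothetical, and the sketch assumes the installed iconv implementation recognizes the requested encoding names such as "GBK". Some older libiconv builds declare the input pointer as const char**, in which case the cast needs adjusting.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts 'input' from encoding 'fromCode' to 'toCode'.
// Throws std::runtime_error if the conversion is unsupported or the input is invalid.
std::string convertEncoding(const std::string& input,
                            const char* fromCode, const char* toCode) {
    iconv_t cd = iconv_open(toCode, fromCode);  // note the order: target first
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: unsupported conversion");
    }

    std::string output;
    char buffer[256];
    char* inPtr = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* outPtr = buffer;
        std::size_t outLeft = sizeof buffer;
        std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        int err = (rc == (std::size_t)-1) ? errno : 0;
        output.append(buffer, sizeof buffer - outLeft);
        // E2BIG only means the buffer filled up; loop again to continue.
        if (err != 0 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence for stateful encodings.
    char* outPtr = buffer;
    std::size_t outLeft = sizeof buffer;
    iconv(cd, nullptr, nullptr, &outPtr, &outLeft);
    output.append(buffer, sizeof buffer - outLeft);

    iconv_close(cd);
    return output;
}
```

A call such as convertEncoding(gbkBytes, "GBK", "UTF-8") would then return the UTF-8 form of GBK-encoded text, provided the platform's iconv supports GBK.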

Tags: C++, Unicode, UTF-8, UTF-16, Character Encoding
