Fading Coder


Converting Between Unicode and UTF-8 Using the C++ Standard Library

Tech May 18 2

Converting Character Encodings with C++11 Standard Library codecvt

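C++11 added std::wstring_convert together with the <codecvt> header, which supplies Unicode conversion facets built on the older std::codecvt class template. These C++11 additions were deprecated in C++17, but they remain fully usable in C++11 and C++14 environments, and later language modes typically still accept them with a deprecation warning.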

std::wstring_convert works in conjunction with a std::codecvt facet. The former is the transcoder: it takes the facet type as a template argument and converts between wide-character strings and byte-oriented strings. The latter determines which encodings are involved. For instance, std::codecvt_utf8<wchar_t> converts between UTF-8 and UCS-2 or UTF-32 stored in wchar_t, depending on the size of wchar_t on the platform.

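The following example converts a wide string to UTF-8 and back again.

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
    std::wstring sourceWide = L"Hello, World! 你好世界!";

    // codecvt_utf8<wchar_t> maps UCS-2 or UTF-32 (depending on the size of
    // wchar_t) to UTF-8.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> transcoder;
    std::string resultUtf8 = transcoder.to_bytes(sourceWide);

    std::cout << "UTF-8 Output: " << resultUtf8 << std::endl;

    // Reuse the same converter for the reverse direction. std::codecvt cannot
    // be passed to wstring_convert directly: its destructor is protected, and
    // a default-constructed instance would not decode UTF-8 anyway.
    std::wstring backToWide = transcoder.from_bytes(resultUtf8);

    // Report the round trip on std::cout; mixing std::cout and std::wcout
    // on the same underlying stream is not reliable.
    std::cout << "Round trip matches: " << std::boolalpha
              << (backToWide == sourceWide) << std::endl;

    return 0;
}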
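
Both to_bytes and from_bytes throw std::range_error when the input cannot be converted, unless the converter was constructed with fallback strings. The sketch below shows the two behaviours side by side. The helper names tryDecodeUtf8 and decodeUtf8OrFallback are illustrative only and are not part of the standard library.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Strict decoding: returns false when the bytes are not valid UTF-8.
// (Hypothetical helper name, chosen for this example.)
inline bool tryDecodeUtf8(const std::string& bytes, std::wstring& out) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> strict;
    try {
        out = strict.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// Lenient decoding: the two constructor arguments are the strings returned
// by to_bytes and from_bytes, respectively, when a conversion fails.
// (Hypothetical helper name, chosen for this example.)
inline std::wstring decodeUtf8OrFallback(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> lenient("?", L"?");
    return lenient.from_bytes(bytes);
}
```

The fallback form replaces the whole result with the error string instead of substituting single characters, so the strict form is usually the better fit when the caller needs to know that the input was malformed.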

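Converting between wide strings and the narrow encoding of the current locale follows a similar pattern. The locale-aware facet is std::codecvt_byname; because its destructor is protected, it has to be wrapped in a derived type before std::wstring_convert can own it.

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

// std::codecvt_byname has a protected destructor, so wrap it in a type
// that wstring_convert is allowed to delete.
template <class Facet>
struct DeletableFacet : Facet {
    using Facet::Facet;
    ~DeletableFacet() {}
};

using LocaleCodecvt =
    DeletableFacet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;

int main() {
    std::wstring wideStr = L"测试字符串";

    // "" selects the user's preferred locale. Its encoding must be able to
    // represent the text, otherwise to_bytes throws std::range_error.
    std::wstring_convert<LocaleCodecvt> conv(new LocaleCodecvt(""));

    std::string narrow = conv.to_bytes(wideStr);
    std::wstring wide = conv.from_bytes(narrow);

    std::cout << narrow << std::endl;
    std::cout << "Round trip matches: " << std::boolalpha
              << (wide == wideStr) << std::endl;

    return 0;
}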

C++17 Encoding Conversion and Modern Alternatives

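In C++17, std::wstring_convert and the <codecvt> facets carry a deprecation warning, prompting developers to consider alternative approaches. std::codecvt_utf8_utf16 belongs to the same deprecated header, so it is not a replacement. It is, however, the appropriate facet when the wide side holds UTF-16, because it handles surrogate pairs, which std::codecvt_utf8 with a 16-bit element type does not.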

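#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
    std::wstring wideStr = L"Unicode Text 文字";

    // The wide side is treated as UTF-16 code units. This matches wchar_t on
    // Windows; where wchar_t is 32 bits, characters outside the BMP come back
    // as surrogate pairs rather than single UTF-32 values.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::string utf8Result = converter.to_bytes(wideStr);

    std::cout << "UTF-8: " << utf8Result << std::endl;

    std::wstring restored = converter.from_bytes(utf8Result);

    // Report the round trip on std::cout; mixing std::cout and std::wcout
    // on the same underlying stream is not reliable.
    std::cout << "Restored matches: " << std::boolalpha
              << (restored == wideStr) << std::endl;

    return 0;
}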

No facet has to be constructed by hand here, because std::codecvt_utf8_utf16 is default constructible and has a public destructor. The same converter object handles both directions, from wide string to UTF-8 and back.
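
The surrogate-pair handling is easiest to see with a character outside the Basic Multilingual Plane. The following sketch uses char16_t so that the wide side is UTF-16 on every platform. The function names utf16ToUtf8 and utf8ToUtf16 are hypothetical, and the facilities used are the same deprecated ones discussed above.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Encode UTF-16 code units (including surrogate pairs) as UTF-8 bytes.
// (Hypothetical helper name, chosen for this example.)
inline std::string utf16ToUtf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(text);
}

// Decode UTF-8 bytes into UTF-16 code units.
// (Hypothetical helper name, chosen for this example.)
inline std::u16string utf8ToUtf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}
```

For U+1F600, the literal u"\U0001F600" holds two char16_t code units (a surrogate pair), and utf16ToUtf8 turns them into the four-byte UTF-8 sequence F0 9F 98 80.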

For GBK and other legacy encodings, the standard library provides no dedicated codecvt facets. The following example illustrates UTF-8 to UTF-16 conversion, which can serve as the Unicode-side step of a workflow in which another tool handles the legacy encoding.

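#include <iostream>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Assumes this source file is saved as UTF-8, so the literal holds UTF-8.
    std::string inputData = "Test Data 测试数据";

    // Decode UTF-8 into UTF-16 code units as the intermediate form.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> transformer;
    std::wstring intermediate = transformer.from_bytes(inputData);

    // Print the length instead of the wide text to avoid mixing
    // std::wcout and std::cout on the same underlying stream.
    std::cout << "Intermediate length: " << intermediate.size()
              << " code units" << std::endl;

    std::string outputData = transformer.to_bytes(intermediate);
    std::cout << "Output: " << outputData << std::endl;

    return 0;
}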

For production applications requiring GBK support or other locale-specific encodings, third-party libraries such as libiconv or platform-specific APIs provide more comprehensive coverage than the standard library's deprecated facilities.
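
As a rough illustration of the iconv route, the sketch below wraps the POSIX iconv_open, iconv, and iconv_close calls in a single function. The name convertEncoding and the 256-byte buffer size are choices made for this example. Which encoding names are accepted depends on the iconv implementation installed, so support for "GBK" is an assumption that should be checked on the target system.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
inline std::string convertEncoding(const std::string& input,
                                   const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // argument order: target first
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: unsupported encoding pair");
    }

    std::string output;
    char buffer[256];
    // iconv takes a non-const char** even though it does not modify the input.
    char* inPtr = const_cast<char*>(input.data());
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* outPtr = buffer;
        std::size_t outLeft = sizeof(buffer);
        std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        int err = errno;  // capture before any other call can change it
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the buffer filled up; loop again to continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or truncated input");
        }
    }
    // Stateful target encodings would also need a final flush call here.
    iconv_close(cd);
    return output;
}
```

With an iconv build that supports it, a GBK byte string would be converted with convertEncoding(gbkBytes, "GBK", "UTF-8"). The same function works for any pair the implementation knows, for example "UTF-16LE" to "UTF-8".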
