Converting Between Unicode and UTF-8 Using the C++ Standard Library
Converting Character Encodings with C++11 Standard Library codecvt
C++11 added std::wstring_convert and the <codecvt> header (std::codecvt_utf8, std::codecvt_utf16, std::codecvt_utf8_utf16) for character encoding conversions, building on the std::codecvt facet that has been in the library since C++98. These additions are deprecated in C++17, but they remain functional in C++11 and C++14 environments.
std::wstring_convert works in conjunction with a std::codecvt facet. The former is a transcoder that takes the facet as a template argument and converts between wide-character strings and byte-oriented strings. The latter supplies the encoding rules, so the choice of facet determines which encodings are involved. For instance, std::codecvt_utf8<wchar_t> converts between UTF-8 and UCS-2 or UTF-32 wide characters, depending on the width of wchar_t.
The following example converts a wide string to UTF-8 and back.
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
    std::wstring sourceWide = L"Hello, World! 你好世界!";

    // codecvt_utf8<wchar_t> maps wchar_t code points to UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> transcoder;
    std::string resultUtf8 = transcoder.to_bytes(sourceWide);
    std::cout << "UTF-8 Output: " << resultUtf8 << std::endl;

    // Reuse the same converter for the reverse direction. std::codecvt itself
    // cannot be given to wstring_convert because its destructor is protected.
    std::wstring backToWide = transcoder.from_bytes(resultUtf8);

    // Report through std::cout only; mixing std::cout and std::wcout on the
    // same output stream is unreliable on some platforms.
    std::cout << "Round trip "
              << (backToWide == sourceWide ? "succeeded" : "failed") << std::endl;
    return 0;
}
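Invalid input deserves a note here. By default, from_bytes and to_bytes throw std::range_error when a conversion fails; constructing the converter with error strings makes them return a fallback instead. The sketch below shows both styles. The helper names try_utf8_to_wide and utf8_to_wide_or are hypothetical, chosen for illustration, and the fallback is assumed to be non-empty because some implementations treat an empty error string as absent.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and fills `out` when `bytes` is valid UTF-8.
bool try_utf8_to_wide(const std::string& bytes, std::wstring& out) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        // Thrown when the byte sequence cannot be decoded.
        out.clear();
        return false;
    }
}

// Hypothetical helper: returns `fallback` instead of throwing on bad input.
// Assumption: `fallback` is non-empty.
std::wstring utf8_to_wide_or(const std::string& bytes, const std::wstring& fallback) {
    // The two-argument constructor takes a byte error string and a wide error string.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv("?", fallback);
    return conv.from_bytes(bytes);
}
```

The throwing style suits code that must reject malformed data, while the fallback style is convenient for logging or display paths where a placeholder is acceptable.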
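Converting between wide strings and the narrow encoding of a named locale follows a similar pattern, with std::codecvt_byname as the facet. Because that facet has a protected destructor, it needs a small wrapper before std::wstring_convert can own it.
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

// Gives a locale facet the public destructor that wstring_convert requires.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;  // inherit the facet's constructors
    ~deletable_facet() {}
};

int main() {
    using LocaleFacet =
        deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
    std::wstring wideStr = L"测试字符串";

    // "" selects the user's preferred locale; construction throws
    // std::runtime_error if that locale is not available.
    std::wstring_convert<LocaleFacet> conv(new LocaleFacet(""));

    // to_bytes throws std::range_error if the locale's encoding cannot
    // represent the text (for example, in the plain "C" locale).
    std::string narrow = conv.to_bytes(wideStr);
    std::wstring wide = conv.from_bytes(narrow);

    std::cout << narrow << std::endl;
    std::cout << "Round trip " << (wide == wideStr ? "succeeded" : "failed") << std::endl;
    return 0;
}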
C++17 Encoding Conversion and Modern Alternatives
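In C++17, the <codecvt> header and std::wstring_convert are deprecated, so compilers may emit deprecation warnings and new code should consider alternatives. Where the deprecated facilities are still acceptable, std::codecvt_utf8_utf16 is the facet to use for UTF-8 and UTF-16 interoperation because, unlike std::codecvt_utf8, it handles surrogate pairs.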
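#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main() {
    // The facet treats wchar_t elements as UTF-16 code units, which matches
    // platforms with a 16-bit wchar_t such as Windows.
    std::wstring wideStr = L"Unicode Text 文字";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;

    std::string utf8Result = converter.to_bytes(wideStr);
    std::cout << "UTF-8: " << utf8Result << std::endl;

    std::wstring restored = converter.from_bytes(utf8Result);
    // Report through std::cout only to avoid mixing narrow and wide output.
    std::cout << "Restored " << restored.size() << " code units, round trip "
              << (restored == wideStr ? "succeeded" : "failed") << std::endl;
    return 0;
}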
This approach needs no locale name for the conversion itself, because std::wstring_convert default-constructs the facet. The same converter object handles both directions, from wide string to UTF-8 and back.
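To make the surrogate-pair behavior concrete, the following sketch uses char16_t and std::u16string, so the element type is 16 bits wide on every platform. The function names utf8_to_utf16 and utf16_to_utf8 are hypothetical wrappers introduced for illustration, and the sketch still relies on the deprecated <codecvt> facilities.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical wrapper: encode UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

A code point outside the Basic Multilingual Plane, such as U+1F600 (UTF-8 bytes F0 9F 98 80), comes back from utf8_to_utf16 as two code units, the surrogate pair D83D DE00.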
For GBK and other legacy encodings, the standard library provides no dedicated codecvt facets. The following example illustrates UTF-8 to UTF-16 conversion, which can serve as the intermediate step in such workflows: text is decoded to UTF-16 first, and a separate library then handles the legacy encoding.
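#include <iostream>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Assumes this source file is saved as UTF-8, so the literal holds UTF-8 bytes.
    std::string inputData = "Test Data 测试数据";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> transformer;

    std::wstring intermediate = transformer.from_bytes(inputData);
    // Print the length rather than the wide text, so that std::wcout and
    // std::cout are not mixed on the same output stream.
    std::cout << "Intermediate Form: " << intermediate.size()
              << " UTF-16 code units" << std::endl;

    std::string outputData = transformer.to_bytes(intermediate);
    std::cout << "Output: " << outputData << std::endl;
    return 0;
}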
For production applications requiring GBK support or other locale-specific encodings, third-party libraries such as libiconv or platform-specific APIs provide more comprehensive coverage than the standard library's deprecated facilities.
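When only UTF-8 output is needed and an extra dependency is undesirable, the encoding step can also be written by hand without <codecvt>. The following is a minimal sketch under that assumption, not a replacement for a full library: it takes UTF-32 input, rejects surrogates and out-of-range values, and does not cover GBK or any other legacy encoding. The names append_utf8 and utf32_to_utf8 are hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: appends one Unicode code point to `out` as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    // Surrogates and values above U+10FFFF are not Unicode scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encodes a whole UTF-32 string as UTF-8.
std::string utf32_to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

For example, utf32_to_utf8(U"\u4F60") yields the three bytes E4 BD A0. Decoding UTF-8 by hand needs more validation (overlong forms, truncated sequences), which is one reason a dedicated library is usually preferable for untrusted input.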