Understanding Character Encoding in Computing Systems

Software operates on hardware; the three core components relevant to execution are the CPU, memory, and storage drive. Before execution, program instructions and data reside on the storage drive. Launching an application copies required data from storage into memory, where the CPU fetches and processes it. Any runtime-generated data first appears in memory; persisting it requires writing back to storage.

When a text viewer opens a file, it loads content from storage into memory and renders it on screen. This process ignores language-specific syntax; interpreters, unlike viewers, parse and execute code according to their language rules.

Foundations of Character Encoding

Encoding mechanisms apply solely to textual data, not binary media such as images, audio, or video. Machines operate on binary digits; character encoding tables map glyphs to numeric patterns so humans can work with readable symbols.

Character encoding defines a conversion scheme translating characters into storable or transmittable forms. For instance, one legacy scheme uses seven bits of a byte for mapping, assigning capital A the value 65 (hex 0x41, binary 01000001). Widely used schemes include UTF-8, GBK, and Big5.
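
A quick illustration in Python (the language used for the examples later in this article): the built-ins ord and chr expose this character-to-number mapping directly.

print(ord('A'))                 # 65
print(hex(ord('A')))            # 0x41
print(format(ord('A'), '08b'))  # 01000001
print(chr(65))                  # A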

Evolution of Encoding Schemes

Initial Standard: ASCII

In the 1960s, American computing adopted ASCII, defining 128 mappings between English glyphs and binary codes. Each entry occupied the lower seven bits of a byte, with the most significant bit fixed at zero. Values 0–31 represented control functions; 32–126 covered printable characters available via keyboard.
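
A brief sketch of the seven-bit property: every ASCII code point is below 128, so the most significant bit of the byte is always zero.

for ch in 'Hello!':
    cp = ord(ch)
    assert cp < 0x80              # ASCII fits in seven bits
    print(ch, format(cp, '08b'))  # the leading bit is always 0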

Regional Extensions

While sufficient for English, ASCII could not cover other alphabets. European implementers repurposed the unused high bit of the byte, expanding capacity to 256 symbols, a scheme known as Extended ASCII (EASCII). However, linguistic diversity across countries led to incompatible assignments for the upper 128 values. This resulted in multiple ISO-8859 variants (e.g., ISO-8859-1 through ISO-8859-16), each tailored to a specific region while retaining ASCII compatibility in the lower half.
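
The incompatibility is easy to demonstrate: the same upper-half byte decodes to different glyphs under different ISO-8859 variants. A minimal sketch using two codecs shipped with Python:

raw = bytes([0xE9])
print(raw.decode('iso8859-1'))  # é  (Western European)
print(raw.decode('iso8859-7'))  # ι  (Greek)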

Universal Set: Unicode

To encompass global scripts, Unicode was created as a comprehensive catalog of over one million distinct code points. Examples: U+0639 for Arabic ‘Ain’, U+0041 for Latin ‘A’, U+4E25 for Chinese ‘严’. Unicode specifies identifiers, not storage formats, leaving ambiguity: a sequence of bytes might represent one symbol or several, and fixed multi-byte representations waste space for predominantly ASCII text.
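
In Python, ord returns exactly these code point identifiers:

print(hex(ord('A')))   # 0x41  (U+0041)
print(hex(ord('严')))  # 0x4e25  (U+4E25)
print('\u4e25')        # 严; the escape and the literal denote the same code point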

Unicode Transformation Formats (UTF) address these issues by defining concrete encodings. Memory typically holds text as Unicode code points; persistence can use any compatible mapping:

  • English glyphs → Unicode scalar → ASCII pattern
  • Chinese and English glyphs → Unicode scalar → GBK pattern
  • Japanese and English glyphs → Unicode scalar → Shift JIS pattern
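
A hedged sketch of this idea: the same in-memory string serializes to different byte sequences depending on the chosen encoding (exact GBK byte values omitted here; they simply differ from the UTF-8 ones).

text = '严A'
print(text.encode('utf-8'))  # b'\xe4\xb8\xa5A'  (three bytes for 严, one for A)
print(text.encode('gbk'))    # two bytes for 严, one for A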

Emergence of UTF-8

UTF-8, designed by Ken Thompson and Rob Pike in 1992 and later standardized by the IETF (currently RFC 3629), is a variable-length encoding for Unicode. It represents any code point in one to four bytes, covering Latin, CJK, and all other Unicode scripts.

Among the transformation formats, UTF-8 (unlike UTF-16 or UTF-32) dominates internet usage. Unicode itself is merely a set of abstract numbers; UTF-8 provides a compact, ASCII-compatible serialization of them.

Key property: ASCII bytes remain unchanged in UTF-8, ensuring efficient handling of predominantly English text. Multi-byte sequences signal non-ASCII characters through leading bit patterns.
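
These leading patterns are fixed by the UTF-8 specification: 0xxxxxxx introduces a one-byte sequence, while 110xxxxx, 1110xxxx, and 11110xxx introduce two-, three-, and four-byte sequences, with continuation bytes of the form 10xxxxxx. A minimal sketch that infers sequence length from the first byte:

def utf8_seq_length(first_byte):
    # Infer how many bytes a UTF-8 sequence occupies from its leading byte
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError('continuation or invalid leading byte')

encoded = '严'.encode('utf-8')      # b'\xe4\xb8\xa5'
print(utf8_seq_length(encoded[0]))  # 3
print(utf8_seq_length(ord('A')))    # 1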

Practical Usage

Correct display requires reading a file with the same encoding that was used to write it; a mismatch produces garbled output. In Python 2, source files default to ASCII unless declared otherwise (# coding: utf-8). To embed non-ASCII literals, prefix them with u (e.g., s = u'你好').
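
A minimal Python 2 sketch of both conventions (Python 3 source defaults to UTF-8, so neither is needed there):

# -*- coding: utf-8 -*-
# Without the declaration above, Python 2 rejects non-ASCII bytes in source files
s = u'你好'  # the u prefix creates a unicode object rather than a byte string
print(s)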

Encoding and Decoding Processes

Converting a character into a numeric form, whether its Unicode code point in memory or a byte sequence under a scheme such as UTF-8, is encoding; converting that numeric or byte form back into a human-readable character is decoding.

Example in Python:

text_chunk = '趁年轻'
encoded_bytes = text_chunk.encode('utf-8')   # encode to UTF-8 byte sequence
print(encoded_bytes)                        # b'\xe8\xb6\x81\xe5\xb9\xb4\xe8\xbd\xbb'

decoded_text = encoded_bytes.decode('utf-8') # decode back to string
print(decoded_text)                         # 趁年轻

another = 'hello world'
print(another.encode('utf-8'))              # b'hello world'

When decoding unknown byte streams, attempt common encodings such as UTF-8 or GBK.
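
A hedged sketch of that trial-and-error approach; note that a successful decode does not guarantee meaningful text, since many byte sequences are valid in more than one encoding.

raw = b'\xe8\xb6\x81\xe5\xb9\xb4\xe8\xbd\xbb'  # bytes of unknown origin
for candidate in ('utf-8', 'gbk', 'big5'):
    try:
        print(candidate, '->', raw.decode(candidate))
        break  # stop at the first encoding that decodes cleanly
    except UnicodeDecodeError:
        continue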
