Efficient File I/O Techniques in Python
Reading and Writing Text Files
In Python 2, strings are byte sequences by default (ASCII), while Unicode strings require a u'' prefix. All text processing should use Unicode internally, with explicit encoding/decoding at I/O boundaries to avoid corruption.
Python 3 simplifies this: the str type is Unicode by default, and bytes represent raw data (prefixed with b). The built-in open() function supports an encoding parameter for transparent conversion:
message = '你好'
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(message)
with open('output.txt', 'r', encoding='utf-8') as f:
    print(f.read())
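To make the encode/decode boundary concrete, here is a minimal sketch of what happens to the same text under different codecs. The variable names are illustrative only, and the byte counts apply to this particular two-character string.

```python
# Text lives in memory as str; bytes appear only at the I/O boundary.
text = '你好'

# Encoding turns str into bytes; the byte count depends on the codec.
utf8_bytes = text.encode('utf-8')       # 3 bytes per character for this text
utf16_bytes = text.encode('utf-16-le')  # 2 bytes per character for this text

# Decoding with the matching codec recovers the original text.
roundtrip = utf8_bytes.decode('utf-8')

# Decoding with a codec that cannot represent the bytes raises an error.
try:
    utf8_bytes.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Passing encoding= to open() performs the same conversion on every read and write, so the rest of the program only ever sees str.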
Handling Binary Files
Binary formats like WAV audio files contain structured headers followed by raw data. To parse them:
- Open in binary mode ('rb' or 'wb')
- Use struct.unpack() to interpret fixed-size fields
- Read directly into pre-allocated buffers (e.g., NumPy arrays) for efficiency
import struct
import numpy as np

def locate_chunk(file_obj, target):
    """Return (data offset, size) of the first chunk whose ID equals target."""
    file_obj.seek(12)  # Skip the 12-byte RIFF/WAVE header
    while True:
        name = file_obj.read(4)
        if len(name) < 4:
            raise ValueError("Chunk not found")
        size = struct.unpack('<I', file_obj.read(4))[0]
        if name == target:
            return file_obj.tell(), size
        # Skip the chunk body; RIFF pads odd-sized chunks to an even length
        file_obj.seek(size + (size & 1), 1)

with open('audio.wav', 'rb') as src:
    offset, data_size = locate_chunk(src, b'data')
    # Assumes 16-bit PCM samples (2 bytes each)
    buffer = np.empty(data_size // 2, dtype=np.int16)
    src.readinto(buffer)
    buffer //= 8  # Reduce amplitude
    with open('processed.wav', 'wb') as dst:
        src.seek(0)
        dst.write(src.read(offset))  # Copy the header bytes unchanged
        buffer.tofile(dst)
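The fixed-size fields mentioned above can be decoded in a single struct.unpack() call per record. The following is a sketch only: read_fmt_chunk is a hypothetical helper, and it assumes a canonical PCM WAV layout in which the 'fmt ' chunk immediately follows the RIFF header, which real files do not guarantee. The header is built in memory so the example runs without an audio file.

```python
import io
import struct

def read_fmt_chunk(file_obj):
    """Return (channels, sample_rate, bits_per_sample).

    Assumes the 'fmt ' chunk is the first chunk after the RIFF header.
    """
    riff, _, wave = struct.unpack('<4sI4s', file_obj.read(12))
    if riff != b'RIFF' or wave != b'WAVE':
        raise ValueError("Not a RIFF/WAVE stream")
    name, size = struct.unpack('<4sI', file_obj.read(8))
    if name != b'fmt ':
        raise ValueError("Expected the 'fmt ' chunk first")
    body = file_obj.read(size)
    # Fields: audio format, channels, sample rate, byte rate,
    # block align, bits per sample (all little-endian)
    _, channels, rate, _, _, bits = struct.unpack('<HHIIHH', body[:16])
    return channels, rate, bits

# Build a tiny stereo, 44.1 kHz, 16-bit header in memory for the demo.
fmt_body = struct.pack('<HHIIHH', 1, 2, 44100, 44100 * 4, 4, 16)
header = (struct.pack('<4sI4s', b'RIFF', 4 + 8 + len(fmt_body), b'WAVE')
          + struct.pack('<4sI', b'fmt ', len(fmt_body))
          + fmt_body)
wav_info = read_fmt_chunk(io.BytesIO(header))
```

Reading the sample width from the header this way would let the processing code pick a matching NumPy dtype instead of hard-coding int16.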
Controlling File Buffering
File writes are buffered to minimize system calls. Buffering strategies include:
- Full buffering: Data flushed when buffer fills (typically 4–8 KB)
- Line buffering: Flushes on newline (\n), only in text mode
- Unbuffered: Immediate writes (use buffering=0 for binary files)
Explicit buffer size control:
# Full buffering with custom size
with open('data.bin', 'wb', buffering=8192) as f:
    f.write(b'...')
# Line buffering (text mode only)
with open('log.txt', 'w', buffering=1) as f:
    f.write("Entry\n")  # Flushed immediately
# Unbuffered binary write
with open('realtime.bin', 'wb', buffering=0) as f:
    f.write(b'immediate')
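One way to observe the effect of buffering is to compare the on-disk size of a file before and after a flush. This is a rough sketch that assumes CPython's standard buffered writer and a write smaller than the buffer; the file name and variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'buffered.bin')

    # Buffered: a small write stays in Python's buffer until flushed.
    with open(path, 'wb', buffering=8192) as f:
        f.write(b'0123456789')
        size_before_flush = os.path.getsize(path)
        f.flush()
        size_after_flush = os.path.getsize(path)

    # Unbuffered: every write goes straight to the operating system.
    with open(path, 'wb', buffering=0) as f:
        f.write(b'0123456789')
        size_unbuffered = os.path.getsize(path)

print(size_before_flush, size_after_flush, size_unbuffered)
```

Note that flush() only hands data to the operating system; code that needs the bytes to reach the storage device would additionally call os.fsync(f.fileno()).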
Memory-Mapped Files
The mmap module maps files directly into memory, enabling:
- Random access to large files without full loading
- Shared memory between processes
- Direct hardware register access (e.g., /dev/mem)
import mmap
with open('/dev/fb0', 'r+b') as fb:
    screen_size = 8294400  # Example framebuffer size
    with mmap.mmap(fb.fileno(), screen_size) as mm:
        # Fill first half of screen with white (RGBA)
        mm[:screen_size//2] = b'\xff\xff\xff\x00' * (screen_size // 8)
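Device files such as a framebuffer are not available on every system, so the random-access use case can also be sketched against an ordinary file. The file name and offsets below are arbitrary choices for illustration.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')

    # Create a 4 KB file of zero bytes to map.
    with open(path, 'wb') as f:
        f.write(b'\x00' * 4096)

    with open(path, 'r+b') as f:
        # A length of 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0) as mm:
            # Slice assignment must not change the mapping's length.
            mm[1024:1029] = b'hello'
            mm.flush()
            found_at = mm.find(b'hello')

    # Read the bytes back through normal file I/O to confirm the write.
    with open(path, 'rb') as f:
        f.seek(1024)
        on_disk = f.read(5)
```

Only the pages that are actually touched need to be brought into memory, which is what makes this approach attractive for files much larger than the one used here.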
Retrieving File Metadata
Use os.stat() to obtain detailed file attributes:
import os
import stat
import time
info = os.stat('script.py')
print(f"Size: {info.st_size} bytes")
print(f"Modified: {time.ctime(info.st_mtime)}")
# Check file type
if stat.S_ISREG(info.st_mode):
    print("Regular file")
# Check permissions
if info.st_mode & stat.S_IRUSR:
    print("User-readable")
Convenience functions in os.path simplify common checks:
os.path.isfile('data.txt') # True if regular file
os.path.getsize('data.txt') # File size in bytes
os.path.getmtime('data.txt') # Modification time
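The individual checks above can be gathered into one summary per path. The helper below is a hypothetical convenience (describe is not a standard-library function); it relies on stat.filemode(), which renders the mode bits in the familiar ls -l style.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone

def describe(path):
    """Collect a few os.stat() fields into a plain dictionary."""
    info = os.stat(path)
    return {
        'size': info.st_size,
        'mode': stat.filemode(info.st_mode),   # e.g. a string like '-rw-r--r--'
        'is_dir': stat.S_ISDIR(info.st_mode),
        'modified': datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.txt')
    with open(sample, 'wb') as f:
        f.write(b'12345')
    file_summary = describe(sample)
    dir_summary = describe(tmp)
```

Calling os.stat() once and reusing the result avoids the repeated system calls that separate os.path.getsize() and os.path.getmtime() calls would make.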
Working with Temporary Files
For transient data that shouldn’t persist:
- TemporaryFile(): Anonymous, auto-deleted on close
- NamedTemporaryFile(): Has a filesystem path; auto-deleted unless delete=False
from tempfile import TemporaryFile, NamedTemporaryFile
# Anonymous temporary file
with TemporaryFile() as tf:
    tf.write(b'temporary data')
    tf.seek(0)
    print(tf.read(5))
# Named temporary file
with NamedTemporaryFile() as ntf:
    print(f"Path: {ntf.name}")
    ntf.write(b'shared data')
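When the file has to outlive its with block, for example so another process can open it by path, delete=False keeps it on disk and leaves cleanup to the caller. A minimal sketch, with illustrative variable names:

```python
import os
from tempfile import NamedTemporaryFile

# delete=False keeps the file on disk after the with block closes it.
with NamedTemporaryFile(suffix='.txt', delete=False) as ntf:
    ntf.write(b'shared data')
    saved_path = ntf.name

existed_after_close = os.path.exists(saved_path)

# The file can now be reopened by path like any other file.
with open(saved_path, 'rb') as f:
    contents = f.read()

# Cleanup is the caller's responsibility in this mode.
os.remove(saved_path)
exists_after_remove = os.path.exists(saved_path)
```

Closing the file before reopening it by name also sidesteps platform differences in whether a still-open temporary file can be opened a second time.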