Working with Various File Formats in Python
CSV Files
Writing CSV Files
import csv
characters = [
['Doctor', 'No'],
['Rosa', 'Klebb'],
['Mister', 'Big'],
['Auric', 'Goldfinger'],
['Ernst', 'Blofeld']
]
with open('characters.csv', 'w', newline='') as output_file:
writer = csv.writer(output_file)
writer.writerows(characters)
Reading CSV Files
import csv
with open('characters.csv', 'r') as input_file:
reader = csv.reader(input_file)
data = [row for row in reader]
print(data)
Using DictWriter for Named Columns
import csv
villains = [
{'first': 'Doctor', 'last': 'No'},
{'first': 'Rosa', 'last': 'Klebb'},
{'first': 'Mister', 'last': 'Big'},
{'first': 'Auric', 'last': 'Goldfinger'},
{'first': 'Ernst', 'last': 'Blofeld'}
]
with open('villains.csv', 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=['first', 'last'])
writer.writeheader()
writer.writerows(villains)
Using DictReader for Named Columns
import csv
with open('villains.csv', 'r') as input_file:
reader = csv.DictReader(input_file, fieldnames=['first', 'last'])
villains = [row for row in reader]
print(villains)
XML Files
sample.xml:
<?xml version="1.0"?>
<menu>
<breakfast hours="7-11">
<item price="$6.00">breakfast burritos</item>
<item price="$4.00">pancakes</item>
</breakfast>
<lunch hours="11-3">
<item price="$5.00">hamburger</item>
</lunch>
<dinner hours="3-10">
<item price="$8.00">spaghetti</item>
</dinner>
</menu>
Parsing XML with ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)
for child in root:
print(f'tag: {child.tag}, attributes: {child.attrib}')
for grandchild in child:
print(f'\ttag: {grandchild.tag}, attributes: {grandchild.attrib}')
Alternative XML Libraries
xml.dom: Loads the entire XML document into memory for processing.
xml.sax: Parses XML dynamically without requiring the entire document to be loaded at once.
HTML Files
Web frameworks like Flask and Django, combined with templating engines such as Jinja2, are commonly used to render HTML content for browser display.
JSON Files
Converting Python Dictionary to JSON String
menu = {
"breakfast": {
"hours": "7-11",
"items": {
"breakfast burritos": "$6.00",
"pancakes": "$4.00"
}
},
"lunch": {
"hours": "11-3",
"items": {
"hamburger": "$5.00"
}
},
"dinner": {
"hours": "3-10",
"items": {
"spaghetti": "$8.00"
}
}
}
import json
menu_json = json.dumps(menu, indent=2)
print(menu_json)
Converting JSON String to Python Dictionary
import json
json_string = '{"breakfast": {"hours": "7-11", "items": {"breakfast burritos": "$6.00", "pancakes": "$4.00"}}, "lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}}, "dinner": {"hours": "3-10", "items": {"spaghetti": "$8.00"}}}'
menu_data = json.loads(json_string)
print(menu_data)
Handling datetime Objects
import datetime
import json
now = datetime.datetime.utcnow()
json_output = json.dumps(now, default=str)
print(json_output)
YAML Files
The Python standard library does not include YAML support, so the third-party pyyaml library must be installed:
$ pip install pyyaml
sample.yml:
name:
first: James
last: McIntyre
dates:
birth: 1828-05-25
death: 1906-03-31
details:
bearded: true
themes: [cheese, Canada]
poems:
- title: 'Motto'
text: |
Politeness, perseverance and pluck,
To their possessor will bring good luck
- title: 'Canadian Charms'
text: |
Here industry is not in vain,
For we have bounteous crops of grain
Loading YAML Files
import yaml
with open('sample.yaml', 'r') as input_file:
content = input_file.read()
data = yaml.safe_load(content)
print(data)
Tablib
Tablib handles tabular data in CSV, JSON, YAML formats, aswell as Microsoft Excel and Pandas DataFrame.
$ pip install tablib
Pandas
Pandas is a powerful library for structured data manipulation, ideal for handling various data processing tasks.
Capabilities:
- Read and write multiple text and binary file formats
- Support for CSV, fixed-width text, Excel, JSON, HTML, SQL, HDF5
- Data operations: grouping, splitting, merging, indexing, slicing, sorting, selection
- Data type conversion
- Reshaping and reshaping operations
- Handling missing data
- Random value generation
- Time series management
The read_* functions return DataFrame objects, which represent two-dimensional data (rows and columns), similar to spreadsheets or database tables. One-dimensional data is represented using Series objects.
Reading CSV with Pandas
import pandas
data = pandas.read_csv('villains.csv')
print(data)
The resulting DataFrame is well-suited for nmuerical computations with NumPy and machine learning data preparation.
INI/CFG Configuration Files
The standard library configparser module handles INI file processing.
settings.cfg:
[english]
greeting = Hello
[french]
greeting = Bonjour
[files]
home = /usr/local
bin = %(home)s/bin
Reading INI Files
import configparser
config = configparser.ConfigParser()
config.read('settings.cfg')
print(config)
Microsoft Excel Files
Popular third-party libraries for Excel manipulation include xlrd, tablib, python-excel, xlwt, and xlutils.
Reading Excel with Pandas
import pandas as pd
data = pd.read_excel(r"/home/data.xlsx", sheet_name="Sheet1", sheet_index=0, header=0)
The returned DataFrame contains the Excel data.
Writing Excel with Pandas
import pandas as pd
df = pd.DataFrame(data)
df.to_excel("/home/output.xlsx", sheet_name="Sheet1", na_rep="N/A", header=True)