Mastering Pandas DataFrame Operations
Iterating Over DataFrames
Processing data row by row or column by column is a common task, though vectorization is preferred for performance. When iteration is necessary, Pandas offers several methods.
Row Iteration
The iterrows() method yields each row index as a key and the row data as a Series. This allows access to values via column labels.
for identifier, row_data in dataset.iterrows():
print(row_data['id'], row_data['value'])
For significantly better performance, itertuples() converts rows into named tuples. This approach is faster and attribute access is cleaner.
for row in dataset.itertuples(index=True, name='Row'):
print(getattr(row, 'id'), getattr(row, 'value'))
Column Iteration
To iterate through columns, use items() (or iteritems() in older versions). This returns the column label and the content as a Series.
for label, content in dataset.items():
print(f"Column: {label}")
print(content.values)
Data Filtering Techniques
Subset Selection with filter()
The filter() method is versatile for selecting specific rows or columns based on names, patterns, or regular expressions.
- items: Specific list of labels to keep.
- like: Keeps labels containing a specific substring.
- regex: Keeps labels matching a regular expression pattern.
- axis: The target axis (0 for 'index', 1 for 'columns').
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.array([[10, 20, 30], [40, 50, 60]]),
index=['row_x', 'row_y'],
columns=['col_a', 'col_b', 'col_c']
)
# Select specific columns
selected = df.filter(items=['col_a', 'col_c'])
# Select columns via regex (e.g., ending with 'c')
regex_cols = df.filter(regex='c$', axis=1)
# Select rows containing 'y'
filtered_rows = df.filter(like='y', axis=0)
Boolean Indexing
Direct filtering can be achieved by passing a boolean series derived from conditional logic.
# Filter rows where column value is greater than 25
filtered_df = original_df[original_df['quantity'] > 25]
Essential DataFrame Attributes
Understanding the fundamental attributes of a DataFrame is crucial for debugging and data inspection.
| Attribute | Description |
|---|---|
| T | Transposes rows and columns. |
| axes | Returns a list of row and column axis labels. |
| dtypes | Returns the data type of each column. |
| empty | Returns True if the DataFrame is entirely empty. |
| ndim | The number of axes (dimensionality). |
| shape | A tuple representing the dimensionality (rows, columns). |
| size | Total number of elements in the DataFrame. |
| values | Returns the DataFrame data as a NumPy array. |
| head(n) | Returns the first n rows. |
| tail(n) | Returns the last n rows. |
| shift(periods) | Shifts the index by the desired number of periods. |
Converting DataFrames to Dictionaries
The to_dict() function transforms DataFrame data into various Python dictionary formats. This is useful for serialization or API responses.
The structure of the output is controlled by the orient parameter.
- 'dict' (default):
{column_label: {index_label: value}} - 'list':
{column_label: [values]} - 'series':
{column_label: Series(values)} - 'split':
{'index': [indices], 'columns': [cols], 'data': [values]} - 'records':
[{column_label: value}, ...] - 'index':
{index_label: {column_label: value}}
Mapping External Values
A common requirement is enriching a DataFrame by mapping values in a column to values from an external dictionary. The map() function is ideal for this operation.
import pandas as pd
status_lookup = {'open': 1, 'closed': 0, 'pending': 2}
df = pd.DataFrame({'order_status': ['open', 'pending', 'closed', 'open'], 'amount': [100, 200, 150, 50]})
# Create a new column by mapping status strings to integers
df['status_code'] = df['order_status'].map(status_lookup)