Essential Pandas and NumPy Functions for Efficient Data Analysis
Pandas and NumPy are fundamental Python libraries for data analysis and scientific computing. They provide powerful tools that streamline workflows and enhance productivity. This article highlights 12 key functions from these libraries that can significantly improve analysis efficiency.
At the end of this article, readers can access a Jupyter Notebook containing all code examples discussed.
NumPy Functions
NumPy is a core package for scientific computing in Python, offering features such as:
- Powerful N-dimensional array objects
- Sophisticated broadcasting capabilities
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number functionalities
Beyond scientific applications, NumPy serves as an efficient container for multidimensional data, supporting arbitrary data types and enabling seamless integration with various databases.
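As a quick illustration of the N-dimensional array object and broadcasting mentioned above, here is a minimal sketch (the variable names are illustrative):

```python
import numpy as np

# A 2-D array: the N-dimensional array object at the heart of NumPy
matrix = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the 1-D row is "stretched" across both rows of the matrix
row = np.array([10, 20, 30])
result = matrix + row
print(result.tolist())  # [[10, 21, 32], [13, 24, 35]]
```

Broadcasting avoids explicit loops and temporary copies, which is a large part of why NumPy code is both concise and fast.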
1. allclose()
The allclose() function compares two arrays element-wise and returns a single boolean: False if any pair of elements differs by more than the specified tolerance, True otherwise. This makes it ideal for checking whether two arrays are approximately equal, which is tedious to verify manually.
import numpy as np
first_array = np.array([0.12, 0.17, 0.24, 0.29])
second_array = np.array([0.13, 0.19, 0.26, 0.31])
# The third positional argument is the relative tolerance (rtol); with rtol=0.1, returns False
np.allclose(first_array, second_array, 0.1)
# Output: False
# With rtol=0.2, every element pair falls within tolerance, so it returns True
np.allclose(first_array, second_array, 0.2)
# Output: True
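A closely related function, np.isclose(), performs the same comparison but returns an element-wise boolean array rather than a single value, which is handy for seeing exactly which elements disagree. A small sketch using the same arrays:

```python
import numpy as np

a = np.array([0.12, 0.17, 0.24, 0.29])
b = np.array([0.13, 0.19, 0.26, 0.31])

# One boolean per element pair; only the second pair exceeds rtol=0.1
print(np.isclose(a, b, rtol=0.1).tolist())  # [True, False, True, True]
```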
2. argpartition()
This function performs an indirect partial sort, which makes it an efficient way to find the indices of the N largest values without fully sorting the array. The returned indices can then be used to retrieve and sort just those values.
values = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
indices = np.argpartition(values, -4)[-4:]
# Output: array([1, 8, 2, 0], dtype=int64)
np.sort(values[indices])
# Output: array([10, 12, 12, 16])
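The same idea works in the other direction: partitioning around a positive index n and taking the first n indices yields the N smallest values. A minimal sketch:

```python
import numpy as np

values = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
n = 4
# After partitioning around position n, the first n indices point at the n smallest values
smallest = np.sort(values[np.argpartition(values, n)[:n]])
print(smallest.tolist())  # [0, 0, 1, 4]
```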
3. clip()
clip() confines array values within a specified interval. Values outside the bounds are adjusted to the nearest edge.
data = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(data, 2, 5)
# Output: array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
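clip() can also be applied to one side only: passing None for a bound leaves that side open. A small sketch using the same data:

```python
import numpy as np

data = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
# Clip only the upper bound; None leaves the lower side untouched
capped = np.clip(data, None, 5)
print(capped.tolist())  # [3, 5, 5, 5, 2, 2, 5, 5, 1, 2, 5, 0]
```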
4. extract()
As the name suggests, extract() retrieves elements from an array that satisfy a condition. Conditions can be combined with the element-wise logical operators & (and) and | (or).
# Generate random integers
random_array = np.random.randint(20, size=12)
# Output example: array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
# Condition: elements where remainder is 1 when divided by 2
condition = np.mod(random_array, 2) == 1
# Output example: array([False, True, False, True, False, False, False, True, False, True, False, True])
np.extract(condition, random_array)
# Output example: array([1, 19, 11, 13, 3])
# Direct condition application
np.extract(((random_array < 3) | (random_array > 15)), random_array)
# Output example: array([0, 1, 19, 16, 18, 2])
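For 1-D arrays, plain boolean indexing returns the same elements as np.extract(), which is often the more idiomatic choice. A quick sketch with a fixed array (the random example above is not reproducible):

```python
import numpy as np

arr = np.array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
cond = np.mod(arr, 2) == 1
# Boolean indexing on a 1-D array is equivalent to np.extract(cond, arr)
print(arr[cond].tolist())  # [1, 19, 11, 13, 3]
```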
5. percentile()
percentile() computes the nth percentile of array elements along a specified axis.
array_a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of array_a, axis = 0:", np.percentile(array_a, 50, axis=0))
# Output: 50th Percentile of array_a, axis = 0: 6.0
array_b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of array_b, axis = 0:", np.percentile(array_b, 30, axis=0))
# Output: 30th Percentile of array_b, axis = 0: [5.1 3.5 1.9]
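The axis argument also works row-wise. As a sketch, axis=1 computes the percentile across each row of the same 2-D array:

```python
import numpy as np

array_b = np.array([[10, 7, 4], [3, 2, 1]])
# axis=1 reduces across each row instead of each column
row_medians = np.percentile(array_b, 50, axis=1)
print(row_medians.tolist())  # [7.0, 2.0]
```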
6. where()
where() returns the index positions of elements that satisfy a condition, similar to SQL's WHERE clause. Given two additional arguments, it instead builds a new array, choosing one value where the condition holds and the other where it does not.
y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Get indices where y > 5
np.where(y > 5)
# Output: (array([2, 3, 5, 7, 8], dtype=int64),)
# Replace values based on condition
np.where(y > 5, "Hit", "Miss")
# Output: array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')
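The index tuple returned by the single-argument form can be used directly to pull out the matching values, as in this short sketch:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Indexing with the result of np.where selects exactly the matching elements
idx = np.where(y > 5)
print(y[idx].tolist())  # [6, 8, 7, 6, 9]
```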
Pandas Functions
Pandas is a Python library designed for data manipulation and analysis, offering intuitive data structures for handling structured and time-series data. It excels with various data types, including heterogeneous tabular data, time series, and labeled matrices.
Key advantages of Pandas include:
- Handling missing data efficiently
- Flexible data alignment and grouping
- Easy conversion from other data structures
- Powerful indexing and subsetting capabilities
- Robust I/O tools for multiple file formats
- Time-series-specific functionalities
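Two of the advantages above, data alignment and missing-data handling, can be seen in one short sketch (the labels and values are illustrative):

```python
import pandas as pd

# Series are combined by index label, not position;
# labels present on only one side produce NaN (missing data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])
total = s1 + s2
print(total['b'])  # 12.0
print(total['a'])  # nan  ('a' exists only in s1)
```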
1. apply()
apply() applies a function along an axis of a DataFrame (column-wise by default) or element-wise to a Pandas Series.
import pandas as pd
import numpy as np
# Define a lambda function for range calculation
range_fn = lambda x: x.max() - x.min()
# Apply to a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df.apply(range_fn)
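Because the example above uses random data, its output is not reproducible. The following sketch uses a fixed frame (df_demo is an illustrative name) to show how the axis parameter changes the direction of the computation:

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['col1', 'col2', 'col3'])
range_fn = lambda x: x.max() - x.min()

# Default axis=0 applies the function to each column; axis=1 to each row
print(df_demo.apply(range_fn).tolist())          # [9, 9, 9]
print(df_demo.apply(range_fn, axis=1).tolist())  # [2, 2, 2, 2]
```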
2. copy()
copy() creates an independent copy of a Pandas object, preventing unintended modifications to the original data.
# Create a sample Series
countries = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
# Issue with direct assignment
countries_copy = countries
countries_copy[0] = 'USA' # Also changes original Series
# Solution using copy()
new_copy = countries.copy()
new_copy[1] = 'Changed value'
print(new_copy)
print(countries)
3. read_csv(nrows=n)
When working with large CSV files, read_csv() with the nrows parameter allows reading only a specified number of rows, saving memory and time.
import io
import requests
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
content = requests.get(url).content
# Read only first 10 rows
df = pd.read_csv(io.StringIO(content.decode('utf-8')), nrows=10, index_col=0)
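A related option for large files is chunksize, which returns an iterator of DataFrames so the file can be processed piece by piece. A self-contained sketch with inline CSV data (the sample rows are illustrative, not the real dataset):

```python
import io
import pandas as pd

csv_data = "time,value\n1949.00,112\n1949.08,118\n1949.17,132\n"

# chunksize=2 yields DataFrames of at most 2 rows each
chunks = list(pd.read_csv(io.StringIO(csv_data), chunksize=2))
print([len(chunk) for chunk in chunks])  # [2, 1]
```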
4. map()
map() transforms Series values based on a mapping from a function, dictionary, or another Series.
# Create a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['b', 'd', 'e'], index=['India', 'USA', 'China', 'Russia'])
# Format floating-point values to two decimal places
format_func = lambda x: f'{x:.2f}'
df['d'].map(format_func)
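Since the text mentions dictionary-based mapping, here is a short sketch of that form (the sample values are illustrative). Values missing from the dictionary are mapped to NaN:

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'cat', 'bird'])
# Each value is looked up as a dictionary key; 'bird' has no entry, so it becomes NaN
mapped = animals.map({'cat': 'kitten', 'dog': 'puppy'})
print(mapped[:3].tolist())  # ['kitten', 'puppy', 'kitten']
print(pd.isna(mapped[3]))   # True
```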
5. isin()
isin() filters DataFrames by selecting rows with specific values in one or more columns.
# Using the DataFrame from the read_csv example above (its columns are 'time' and 'value')
value_filter = df["value"].isin([112])
time_filter = df["time"].isin([1949.000000])
df[value_filter & time_filter]
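The example above depends on the downloaded dataset; the sketch below uses a small stand-in frame (df_small is an illustrative name) to show that isin() accepts a list of candidates and that ~ inverts the mask:

```python
import pandas as pd

df_small = pd.DataFrame({'time': [1949.00, 1949.08, 1949.17],
                         'value': [112, 118, 132]})

mask = df_small['value'].isin([112, 132])
print(df_small[mask]['value'].tolist())   # [112, 132]
# ~ negates the mask, selecting the rows that do NOT match
print(df_small[~mask]['value'].tolist())  # [118]
```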
6. select_dtypes()
select_dtypes() returns a subset of DataFrame columns based on their data types, allowing inclusion or exclusion of specific types.
# Select only float64 columns
float_columns = df.select_dtypes(include="float64")
# Returns columns with float64 dtype
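A self-contained sketch showing both include and exclude (df_mixed and its columns are illustrative). Note that string columns carry the 'object' dtype:

```python
import pandas as pd

df_mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': [1, 2], 'c': ['x', 'y']})

print(df_mixed.select_dtypes(include='float64').columns.tolist())  # ['a']
# exclude drops matching dtypes; the string column 'c' has dtype 'object'
print(df_mixed.select_dtypes(exclude='object').columns.tolist())   # ['a', 'b']
```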