Essential Pandas and NumPy Functions for Efficient Data Analysis
Pandas and NumPy are fundamental Python libraries for data analysis and scientific computing. They provide powerful tools that streamline workflows and enhance productivity. This article highlights 12 key functions from these libraries that can significantly improve analysis efficiency.
At the end of this article, readers can access a Jupyter Notebook containing all code examples discussed.
NumPy Functions
NumPy is a core package for scientific computing in Python, offering features such as:
- Powerful N-dimensional array objects
- Sophisticated broadcasting capabilities
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number functionalities
Beyond scientific applications, NumPy serves as an efficient container for multidimensional data, supporting arbitrary data types and enabling seamless integration with various databases.
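As a quick illustration of the N-dimensional array object and broadcasting mentioned above, here is a minimal sketch (the variable names are illustrative):

```python
import numpy as np

# A 2-D array: the N-dimensional array object at the heart of NumPy
matrix = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the 1-D row is "stretched" across both rows of the matrix
row = np.array([10, 20, 30])
result = matrix + row
print(result.tolist())  # [[10, 21, 32], [13, 24, 35]]
```

Broadcasting avoids explicit loops and temporary copies, which is a large part of why NumPy code is both concise and fast.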
1. allclose()
The allclose() function compares two arrays element-wise and returns a single boolean: False if any pair of elements differs by more than the specified tolerance, True otherwise. This makes it ideal for checking whether two arrays are approximately equal, which is tedious to verify manually.
import numpy as np
first_array = np.array([0.12, 0.17, 0.24, 0.29])
second_array = np.array([0.13, 0.19, 0.26, 0.31])
# The third positional argument is the relative tolerance (rtol); with rtol=0.1, returns False
np.allclose(first_array, second_array, 0.1)
# Output: False
# With rtol=0.2, every element pair falls within tolerance, so it returns True
np.allclose(first_array, second_array, 0.2)
# Output: True
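A closely related function, np.isclose(), performs the same comparison but returns an element-wise boolean array rather than a single value, which is handy for seeing exactly which elements disagree. A small sketch using the same arrays:

```python
import numpy as np

a = np.array([0.12, 0.17, 0.24, 0.29])
b = np.array([0.13, 0.19, 0.26, 0.31])

# One boolean per element pair; only the second pair exceeds rtol=0.1
print(np.isclose(a, b, rtol=0.1).tolist())  # [True, False, True, True]
```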
2. argpartition()
This function performs an indirect partial sort, which makes it an efficient way to find the indices of the N largest values without fully sorting the array. The returned indices can then be used to retrieve and sort just those values.
values = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
indices = np.argpartition(values, -4)[-4:]
# Output: array([1, 8, 2, 0], dtype=int64)
np.sort(values[indices])
# Output: array([10, 12, 12, 16])
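The same idea works in the other direction: partitioning around a positive index n and taking the first n indices yields the N smallest values. A minimal sketch:

```python
import numpy as np

values = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
n = 4
# After partitioning around position n, the first n indices point at the n smallest values
smallest = np.sort(values[np.argpartition(values, n)[:n]])
print(smallest.tolist())  # [0, 0, 1, 4]
```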
3. clip()
clip() confines array values within a specified interval. Values outside the bounds are adjusted to the nearest edge.
data = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(data, 2, 5)
# Output: array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
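clip() can also be applied to one side only: passing None for a bound leaves that side open. A small sketch using the same data:

```python
import numpy as np

data = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
# Clip only the upper bound; None leaves the lower side untouched
capped = np.clip(data, None, 5)
print(capped.tolist())  # [3, 5, 5, 5, 2, 2, 5, 5, 1, 2, 5, 0]
```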
4. extract()
As the name suggests, extract() retrieves elements from an array that satisfy a condition. Conditions can be combined with the element-wise logical operators & (and) and | (or).
# Generate random integers
random_array = np.random.randint(20, size=12)
# Output example: array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
# Condition: elements where remainder is 1 when divided by 2
condition = np.mod(random_array, 2) == 1
# Output example: array([False, True, False, True, False, False, False, True, False, True, False, True])
np.extract(condition, random_array)
# Output example: array([1, 19, 11, 13, 3])
# Direct condition application
np.extract(((random_array < 3) | (random_array > 15)), random_array)
# Output example: array([0, 1, 19, 16, 18, 2])
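For 1-D arrays, plain boolean indexing returns the same elements as np.extract(), which is often the more idiomatic choice. A quick sketch with a fixed array (the random example above is not reproducible):

```python
import numpy as np

arr = np.array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
cond = np.mod(arr, 2) == 1
# Boolean indexing on a 1-D array is equivalent to np.extract(cond, arr)
print(arr[cond].tolist())  # [1, 19, 11, 13, 3]
```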
5. percentile()
percentile() computes the nth percentile of array elements along a specified axis.
array_a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of array_a, axis = 0:", np.percentile(array_a, 50, axis=0))
# Output: 50th Percentile of array_a, axis = 0: 6.0
array_b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of array_b, axis = 0:", np.percentile(array_b, 30, axis=0))
# Output: 30th Percentile of array_b, axis = 0: [5.1 3.5 1.9]
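The axis argument also works row-wise. As a sketch, axis=1 computes the percentile across each row of the same 2-D array:

```python
import numpy as np

array_b = np.array([[10, 7, 4], [3, 2, 1]])
# axis=1 reduces across each row instead of each column
row_medians = np.percentile(array_b, 50, axis=1)
print(row_medians.tolist())  # [7.0, 2.0]
```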
6. where()
where() returns the index positions of elements that satisfy a condition, similar to SQL's WHERE clause. Given two additional arguments, it instead builds a new array, choosing one value where the condition holds and the other where it does not.
y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Get indices where y > 5
np.where(y > 5)
# Output: (array([2, 3, 5, 7, 8], dtype=int64),)
# Replace values based on condition
np.where(y > 5, "Hit", "Miss")
# Output: array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')
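The index tuple returned by the single-argument form can be used directly to pull out the matching values, as in this short sketch:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Indexing with the result of np.where selects exactly the matching elements
idx = np.where(y > 5)
print(y[idx].tolist())  # [6, 8, 7, 6, 9]
```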
Pandas Functions
Pandas is a Python library designed for data manipulation and analysis, offering intuitive data structures for handling structured and time-series data. It excels with various data types, including heterogeneous tabular data, time series, and labeled matrices.
Key advantages of Pandas include:
- Handling missing data efficiently
- Flexible data alignment and grouping
- Easy conversion from other data structures
- Powerful indexing and subsetting capabilities
- Robust I/O tools for multiple file formats
- Time-series-specific functionalities
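Two of the advantages above, data alignment and missing-data handling, can be seen in one short sketch (the labels and values are illustrative):

```python
import pandas as pd

# Series are combined by index label, not position;
# labels present on only one side produce NaN (missing data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])
total = s1 + s2
print(total['b'])  # 12.0
print(total['a'])  # nan  ('a' exists only in s1)
```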
1. apply()
apply() applies a function along an axis of a DataFrame (column-wise by default) or element-wise to a Pandas Series.
import pandas as pd
import numpy as np
# Define a lambda function for range calculation
range_fn = lambda x: x.max() - x.min()
# Apply to a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df.apply(range_fn)
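Because the example above uses random data, its output is not reproducible. The following sketch uses a fixed frame (df_demo is an illustrative name) to show how the axis parameter changes the direction of the computation:

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['col1', 'col2', 'col3'])
range_fn = lambda x: x.max() - x.min()

# Default axis=0 applies the function to each column; axis=1 to each row
print(df_demo.apply(range_fn).tolist())          # [9, 9, 9]
print(df_demo.apply(range_fn, axis=1).tolist())  # [2, 2, 2, 2]
```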
2. copy()
copy() creates an independent copy of a Pandas object, preventing unintended modifications to the original data.
# Create a sample Series
countries = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
# Issue with direct assignment
countries_copy = countries
countries_copy[0] = 'USA' # Also changes original Series
# Solution using copy()
new_copy = countries.copy()
new_copy[1] = 'Changed value'
print(new_copy)
print(countries)
3. read_csv(nrows=n)
When working with large CSV files, read_csv() with the nrows parameter allows reading only a specified number of rows, saving memory and time.
import io
import requests
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
content = requests.get(url).content
# Read only first 10 rows
df = pd.read_csv(io.StringIO(content.decode('utf-8')), nrows=10, index_col=0)
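A related option for large files is chunksize, which returns an iterator of DataFrames so the file can be processed piece by piece. A self-contained sketch with inline CSV data (the sample rows are illustrative, not the real dataset):

```python
import io
import pandas as pd

csv_data = "time,value\n1949.00,112\n1949.08,118\n1949.17,132\n"

# chunksize=2 yields DataFrames of at most 2 rows each
chunks = list(pd.read_csv(io.StringIO(csv_data), chunksize=2))
print([len(chunk) for chunk in chunks])  # [2, 1]
```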
4. map()
map() transforms Series values based on a mapping from a function, dictionary, or another Series.
# Create a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['b', 'd', 'e'], index=['India', 'USA', 'China', 'Russia'])
# Format floating-point values to two decimal places
format_func = lambda x: f'{x:.2f}'
df['d'].map(format_func)
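Since the text mentions dictionary-based mapping, here is a short sketch of that form (the sample values are illustrative). Values missing from the dictionary are mapped to NaN:

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'cat', 'bird'])
# Each value is looked up as a dictionary key; 'bird' has no entry, so it becomes NaN
mapped = animals.map({'cat': 'kitten', 'dog': 'puppy'})
print(mapped[:3].tolist())  # ['kitten', 'puppy', 'kitten']
print(pd.isna(mapped[3]))   # True
```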
5. isin()
isin() filters DataFrames by selecting rows with specific values in one or more columns.
# Using the DataFrame from the read_csv example above (its columns are 'time' and 'value')
value_filter = df["value"].isin([112])
time_filter = df["time"].isin([1949.000000])
df[value_filter & time_filter]
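The example above depends on the downloaded dataset; the sketch below uses a small stand-in frame (df_small is an illustrative name) to show that isin() accepts a list of candidates and that ~ inverts the mask:

```python
import pandas as pd

df_small = pd.DataFrame({'time': [1949.00, 1949.08, 1949.17],
                         'value': [112, 118, 132]})

mask = df_small['value'].isin([112, 132])
print(df_small[mask]['value'].tolist())   # [112, 132]
# ~ negates the mask, selecting the rows that do NOT match
print(df_small[~mask]['value'].tolist())  # [118]
```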
6. select_dtypes()
select_dtypes() returns a subset of DataFrame columns based on their data types, allowing inclusion or exclusion of specific types.
# Select only float64 columns
float_columns = df.select_dtypes(include="float64")
# Returns columns with float64 dtype
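A self-contained sketch showing both include and exclude (df_mixed and its columns are illustrative). Note that string columns carry the 'object' dtype:

```python
import pandas as pd

df_mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': [1, 2], 'c': ['x', 'y']})

print(df_mixed.select_dtypes(include='float64').columns.tolist())  # ['a']
# exclude drops matching dtypes; the string column 'c' has dtype 'object'
print(df_mixed.select_dtypes(exclude='object').columns.tolist())   # ['a', 'b']
```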