Fading Coder



Essential Pandas and NumPy Functions for Efficient Data Analysis


Pandas and NumPy are fundamental libraries in Python for data analysis and scientific computing. They provide powerful tools that streamline workflows and enhance productivity. This article highlights 12 key functions from these libraries that can significantly improve analysis efficiency.

At the end of this article, readers can access a Jupyter Notebook containing all code examples discussed.

NumPy Functions

NumPy is a core package for scientific computing in Python, offering features such as:

  • Powerful N-dimensional array objects
  • Sophisticated broadcasting capabilities
  • Tools for integrating C/C++ and Fortran code
  • Useful linear algebra, Fourier transform, and random number functionalities

Beyond scientific applications, NumPy serves as an efficient container for multidimensional data, supporting arbitrary data types and enabling seamless integration with various databases.
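The broadcasting capability mentioned above is worth a quick illustration. As a minimal sketch: arrays of different but compatible shapes combine element-wise without explicit loops, with NumPy virtually "stretching" the smaller shape to match.

```python
import numpy as np

# Broadcasting: a (3, 1) column combines with a length-4 row
# to produce a (3, 4) result without explicit loops.
col = np.arange(3).reshape(3, 1)   # shape (3, 1): [[0], [1], [2]]
row = np.arange(4)                 # shape (4,):   [0, 1, 2, 3]

grid = col * 10 + row              # shapes broadcast to (3, 4)
print(grid)
# [[ 0  1  2  3]
#  [10 11 12 13]
#  [20 21 22 23]]
```

Shapes are compatible when, aligned from the right, each dimension pair is equal or one of them is 1.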

1. allclose()

The allclose() function compares two arrays element-wise and returns a single boolean. It returns True only when every pair of elements satisfies |a - b| <= atol + rtol * |b|, making it ideal for checking whether two arrays are approximately equal, which can be tedious to do manually.

import numpy as np

first_array = np.array([0.12, 0.17, 0.24, 0.29])
second_array = np.array([0.13, 0.19, 0.26, 0.31])

# With rtol=0.1 (the third positional argument is the relative tolerance)
np.allclose(first_array, second_array, 0.1)
# Output: False

# With rtol=0.2, all element-wise differences fall within tolerance
np.allclose(first_array, second_array, 0.2)
# Output: True

2. argpartition()

This function performs an efficient indirect partial sort: it returns an index array whose last N positions hold the indices of the N largest values, in no particular order. Those indices can then be used to retrieve and sort the values as needed.


values = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])

# Indices of the 4 largest values, in no particular order
indices = np.argpartition(values, -4)[-4:]
# Output: array([1, 8, 2, 0], dtype=int64)

np.sort(values[indices])
# Output: array([10, 12, 12, 16])

3. clip()

clip() confines array values within a specified interval. Values outside the bounds are adjusted to the nearest edge.

data = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(data, 2, 5)
# Output: array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])

4. extract()

As the name suggests, extract() retrieves elements from an array based on a condition. Conditions can be combined with the element-wise & (and) and | (or) operators.

# Generate random integers
random_array = np.random.randint(20, size=12)
# Output example: array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])

# Condition: elements where remainder is 1 when divided by 2
condition = np.mod(random_array, 2) == 1
# Output example: array([False, True, False, True, False, False, False, True, False, True, False, True])

np.extract(condition, random_array)
# Output example: array([1, 19, 11, 13, 3])

# Direct condition application
np.extract(((random_array < 3) | (random_array > 15)), random_array)
# Output example: array([0, 1, 19, 16, 18, 2])

5. percentile()

percentile() computes the nth percentile of array elements along a specified axis.

array_a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of array_a, axis = 0:", np.percentile(array_a, 50, axis=0))
# Output: 50th Percentile of array_a, axis = 0: 6.0

array_b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of array_b, axis = 0:", np.percentile(array_b, 30, axis=0))
# Output: 30th Percentile of array_b, axis = 0: [5.1 3.5 1.9]

6. where()

where() returns the indices of elements that satisfy a condition, similar to SQL's WHERE clause. When given two additional arguments, it instead builds a new array by choosing between them based on the condition.


y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# Get indices where y > 5
np.where(y > 5)
# Output: (array([2, 3, 5, 7, 8], dtype=int64),)

# Replace values based on condition
np.where(y > 5, "Hit", "Miss")
# Output: array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')

Pandas Functions

Pandas is a Python library designed for data manipulation and analysis, offering intuitive data structures for handling structured and time-series data. It excels with various data types, including heterogeneous tabular data, time series, and labeled matrices.

Key advantages of Pandas include:

  • Handling missing data efficiently
  • Flexible data alignment and grouping
  • Easy conversion from other data structures
  • Powerful indexing and subsetting capabilities
  • Robust I/O tools for multiple file formats
  • Time-series-specific functionalities
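The missing-data handling listed above can be seen in a minimal sketch (illustrative data, not from the article's examples): isna() locates missing values, fillna() replaces them, and dropna() removes rows that contain them.

```python
import pandas as pd
import numpy as np

# A small frame with one missing value (illustrative data)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

print(df.isna().sum())   # count of missing values per column
print(df.fillna(0))      # replace NaN with a constant
print(df.dropna())       # drop rows containing any NaN
```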

1. apply()

apply() applies a function along an axis of a DataFrame (column-wise by default) or element-wise to the values of a Series.

import pandas as pd
import numpy as np

# Define a lambda function for range calculation
range_fn = lambda x: x.max() - x.min()

# Apply column-wise to a DataFrame: returns max - min for each column
df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df.apply(range_fn)

2. copy()

copy() creates an independent copy of a Pandas object, preventing unintended modifications to the original data.

# Create a sample Series
countries = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# Issue with direct assignment
countries_copy = countries
countries_copy[0] = 'USA'  # Also changes original Series

# Solution using copy()
new_copy = countries.copy()
new_copy[1] = 'Changed value'
print(new_copy)
print(countries)

3. read_csv(nrows=n)

When working with large CSV files, read_csv() with the nrows parameter reads only a specified number of rows, saving memory and time.


import io
import requests

url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
content = requests.get(url).content

# Read only first 10 rows
df = pd.read_csv(io.StringIO(content.decode('utf-8')), nrows=10, index_col=0)

4. map()

map() transforms Series values based on a mapping from a function, dictionary, or another Series.

# Create a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['b', 'd', 'e'], index=['India', 'USA', 'China', 'Russia'])

# Format floating-point values to two decimal places
format_func = lambda x: f'{x:.2f}'
df['d'].map(format_func)

5. isin()

isin() filters DataFrames by selecting rows with specific values in one or more columns.

# Using the DataFrame from read_csv example
value_filter = df["value"].isin([112])
time_filter = df["time"].isin([1949.000000])
df[value_filter & time_filter]

6. select_dtypes()

select_dtypes() returns a subset of DataFrame columns based on their data types, allowing inclusion or exclusion of specific types.

# Select only float64 columns
float_columns = df.select_dtypes(include="float64")
# Returns columns with float64 dtype
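For a self-contained sketch (the column names here are hypothetical, chosen only to show the dtypes), select_dtypes() also accepts an exclude parameter, and the string "number" matches all numeric dtypes at once:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [1.5, 2.0],    # float64
    "count": [3, 4],        # int64
    "label": ["a", "b"],    # object
})

floats = df.select_dtypes(include="float64")      # keeps only 'price'
non_numeric = df.select_dtypes(exclude="number")  # keeps only 'label'
print(list(floats.columns), list(non_numeric.columns))
```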

