Unlocking the Power of Pandas: A Beginner’s Guide to Data Analysis
Pandas is one of the most versatile libraries in Python for working with structured data. Whether you’re dealing with spreadsheets, databases, or any tabular data, Pandas provides powerful tools to clean, manipulate, and analyze it. It’s a must-know for anyone stepping into data science or analytics.
With Pandas, you work primarily with two types of data structures:
- Series: A one-dimensional array-like structure with labels.
- DataFrame: A two-dimensional table with labeled rows and columns, similar to an Excel sheet or SQL table.
Let’s dive into the basics and learn how to get the most out of Pandas.
Getting Started with Pandas
Here’s how you can create a Series in Pandas from several kinds of Python objects:
Creating a Series
From a Tuple
import pandas as pd
t = (10, 11, 12)
d = pd.Series(t)
print(d)
From a List
l1 = [45, 78, 56, 445, 78]
d = pd.Series(l1)
print(d)
From a NumPy Array
import numpy as np
arr = np.array([1, 2, 3, 4])
d1 = pd.Series(arr)
print(d1)
From a Dictionary
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
b = pd.Series(d)
print(b)
Slicing a Series
Pandas makes it easy to slice and dice your data:
d = pd.Series(np.arange(10))
print(d[2:7])
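One detail worth knowing: slicing by position and slicing by label treat the end point differently. Here is a minimal sketch, assuming a small made-up Series with string labels (the values and labels are only for illustration):

```python
import pandas as pd

# Hypothetical Series with string labels, used only for illustration
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Position-based slicing with iloc: the end position is excluded
by_position = s.iloc[1:3]   # rows labeled 'b' and 'c'

# Label-based slicing with loc: the end label is included
by_label = s.loc['b':'d']   # rows labeled 'b', 'c', and 'd'

print(by_position)
print(by_label)
```

Using iloc and loc explicitly makes it clear which of the two behaviors you are asking for.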
Working with DataFrames
Creating a DataFrame
From a List of Lists
data = [['abc', 1], ['def', 2], ['ghi', 3], ['hij', 4], ['klm', 5]]
a = pd.DataFrame(data)
print(a)
Adding Column Names and Custom Indexes
d = pd.DataFrame(data, columns=['A', 'B'], index=['A', 'B', 'C', 'D', 'E'])
print(d)
From a Dictionary
data = {'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'age': [45, 12, 48, 25, 12, 56]}
d = pd.DataFrame(data)
# Renaming Columns
d.rename({'age': 'Age_in_Years', 'name': 'Candidate_Name'}, axis=1, inplace=True)
print(d)
Exploring Your Data
The examples below use the well-known Titanic dataset, which you can download from Kaggle.
Read the Data
data = pd.read_csv('titanic.csv')
Once you have your DataFrame, here are some handy methods to understand your data:
Viewing Data
print(data.head()) # First 5 rows
print(data.tail()) # Last 5 rows
Checking Structure and Summary
print(data.columns) # Column names
data.info() # Data types and non-null counts (info() prints its report itself and returns None)
print(data.describe()) # Summary statistics for numeric columns
print(data.describe(include='O')) # Summary for categorical columns
Filtering Data
Pandas makes filtering intuitive and efficient:
Basic Filtering
# Rows where Age > 25
df = data.loc[data['Age'] > 25]
print(df)
Complex Conditions
# Females in First Class who survived
females_first_class = data.loc[(data['Sex'] == 'female') & (data['Pclass'] == 1) & (data['Survived'] == 1)]
print(females_first_class)
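Conditions can also be combined with | (or), negated with ~ (not), or written as a membership test with isin. The sketch below uses a tiny made-up DataFrame that only mimics a few Titanic columns, so it runs without the CSV file; the rows are assumptions, not real records:

```python
import pandas as pd

# Small hypothetical sample that mimics a few Titanic columns
sample = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Pclass': [1, 3, 2, 1],
    'Age': [29.0, 40.0, 8.0, 54.0],
})

# OR: passengers who are under 10 or over 50
young_or_old = sample.loc[(sample['Age'] < 10) | (sample['Age'] > 50)]

# NOT: everyone who is not in third class
not_third = sample.loc[~(sample['Pclass'] == 3)]

# isin: keep rows whose Pclass is one of the listed values
upper_classes = sample.loc[sample['Pclass'].isin([1, 2])]

print(young_or_old)
print(not_third)
print(upper_classes)
```

Each condition needs its own parentheses because & and | bind more tightly than comparison operators in Python.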
Adding and Dropping Columns
Adding Columns
data['New_Column'] = np.arange(1, len(data) + 1)
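New columns are often derived from existing ones rather than filled with a counter. A minimal sketch, assuming the dataset has numeric SibSp and Parch columns (the sample rows and the FamilySize name are made up for illustration):

```python
import pandas as pd

# Hypothetical rows; SibSp and Parch are assumed to be numeric columns
sample = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Derive a new column from existing ones (FamilySize is an illustrative name)
sample['FamilySize'] = sample['SibSp'] + sample['Parch'] + 1

print(sample)
```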
Dropping Columns
data.drop(['New_Column'], axis=1, inplace=True)
Sorting and Grouping
Sorting
data.sort_values('Fare', ascending=True, ignore_index=True, inplace=True)
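You can also sort by more than one column and pick a direction for each. Here is a small sketch on made-up rows (not real Titanic records) that sorts by class first and then by fare from highest to lowest:

```python
import pandas as pd

# Hypothetical rows for illustration
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Fare': [7.0, 80.0, 8.0, 60.0],
})

# Sort by Pclass ascending, then by Fare descending within each class
ordered = sample.sort_values(['Pclass', 'Fare'],
                             ascending=[True, False],
                             ignore_index=True)

print(ordered)
```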
Grouping
grouped = data.groupby('Pclass').agg({'Fare': 'max', 'Age': 'mean'})
print(grouped)
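If you want the result columns to carry descriptive names, named aggregation is one option. This is a sketch on a few made-up rows; the output names max_fare, mean_age, and passengers are my own choices, not part of the dataset:

```python
import pandas as pd

# Hypothetical rows for illustration
sample = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Fare': [80.0, 60.0, 20.0, 8.0, 7.0],
    'Age': [40.0, 30.0, 25.0, 20.0, 22.0],
})

# Named aggregation: output_name=(source_column, aggregation)
summary = sample.groupby('Pclass').agg(
    max_fare=('Fare', 'max'),
    mean_age=('Age', 'mean'),
    passengers=('Fare', 'count'),
)

print(summary)
```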
Handling Missing Values
# Check for missing values
print(data.isnull().sum())
# Fill missing Age values with mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
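A single overall mean can blur differences between groups. As an alternative sketch (not what the walkthrough above does), you could fill each missing Age with the median of the passenger's own class. The rows below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing ages
sample = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Age': [40.0, np.nan, 20.0, np.nan],
})

# transform('median') returns one value per row: the median Age of that row's Pclass
group_median = sample.groupby('Pclass')['Age'].transform('median')

# Assign the result back instead of filling in place
sample['Age'] = sample['Age'].fillna(group_median)

print(sample)
```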
Changing Data Types
You can optimize memory usage by changing data types:
data['Age'] = data['Age'].astype('float32') # Age is already float64 after read_csv, so float32 is what saves memory
data['Fare'] = pd.to_numeric(data['Fare'], downcast='float')
data['Parch'] = pd.to_numeric(data['Parch'], downcast='integer')
data.info() # info() prints its report itself, so no print() is needed
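To check that a type change actually saves memory, you can compare memory_usage before and after. A minimal sketch on a tiny made-up DataFrame (on the real dataset the savings would be proportionally larger):

```python
import numpy as np
import pandas as pd

# Hypothetical rows stored with the default 64-bit types
sample = pd.DataFrame({
    'Fare': np.array([7.25, 71.5, 8.0], dtype='float64'),
    'Parch': np.array([0, 1, 2], dtype='int64'),
})

before = sample.memory_usage(deep=True).sum()

# Downcast to the smallest type that still holds the values
sample['Fare'] = pd.to_numeric(sample['Fare'], downcast='float')
sample['Parch'] = pd.to_numeric(sample['Parch'], downcast='integer')

after = sample.memory_usage(deep=True).sum()

print(sample.dtypes)
print(before, after)
```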
Merging and Pivot Tables
Merging DataFrames
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
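The inner join above keeps only keys found in both frames, and the two value columns come back as value_x and value_y. Here is a sketch of an outer join on the same two frames, with custom suffixes and an indicator column showing where each row came from (the suffix names are my own choice):

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})

# Outer join: keep every key from both frames, filling gaps with NaN
outer = pd.merge(left, right, on='key', how='outer',
                 suffixes=('_left', '_right'), indicator=True)

print(outer)
```

The _merge column reads both, left_only, or right_only for each row, which is handy when checking why rows went missing in a join.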
Pivot Table
pivot = data.pivot_table(index='Pclass', values='Fare', aggfunc='mean')
print(pivot)
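A pivot table becomes more useful once you add a second dimension through the columns argument. A sketch on made-up rows that shows the mean fare by class and sex:

```python
import pandas as pd

# Hypothetical rows for illustration
sample = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Sex': ['female', 'male', 'female', 'male'],
    'Fare': [90.0, 70.0, 9.0, 7.0],
})

# Rows are classes, columns are sexes, cells are mean fares
fare_table = sample.pivot_table(index='Pclass', columns='Sex',
                                values='Fare', aggfunc='mean')

print(fare_table)
```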
Visualizing Data
Pandas integrates well with visualization libraries like Matplotlib:
import matplotlib.pyplot as plt
data['Age'].hist(bins=20)
plt.title('Age Distribution')
plt.show()
Insights from Titanic Data
- Survival by Sex: Women had a higher survival rate than men.
- Age and Survival: Older passengers do not appear to have been prioritized for survival.
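Observations like these can be checked with a couple of groupby calls. The sketch below shows the pattern on a handful of made-up rows, so the printed numbers say nothing about the real Titanic data; run the same calls on your loaded DataFrame to get the actual rates. The age bands are an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical rows, not real Titanic records
sample = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Age': [22.0, 61.0, 35.0, 70.0, 28.0, 45.0],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rate_by_sex = sample.groupby('Sex')['Survived'].mean()

# Bucket ages into bands (arbitrary cut points), then compare survival rates
age_band = pd.cut(sample['Age'], bins=[0, 30, 60, 100],
                  labels=['under 30', '30 to 60', 'over 60'])
rate_by_age = sample.groupby(age_band, observed=False)['Survived'].mean()

print(rate_by_sex)
print(rate_by_age)
```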
Sample Tasks
Find Passengers Older Than 50
print(data.loc[data['Age'] > 50])
Filter by Multiple Conditions
# Female passengers under 38
print(data.loc[(data['Sex'] == 'female') & (data['Age'] < 38)])
Pandas is an essential tool for any data professional. Whether you’re cleaning messy data, conducting exploratory analysis, or preparing features for machine learning, Pandas makes it all seamless. Dive in, experiment, and unlock the full potential of your datasets!