Unlocking the Power of Pandas: A Beginner’s Guide to Data Analysis

Angel B
3 min read · Dec 20, 2024

Pandas is one of the most versatile libraries in Python for working with structured data. Whether you’re dealing with spreadsheets, databases, or any tabular data, Pandas provides powerful tools to clean, manipulate, and analyze it. It’s a must-know for anyone stepping into data science or analytics.

With Pandas, you work primarily with two types of data structures:

  1. Series: A one-dimensional array-like structure with labels.
  2. DataFrame: A two-dimensional table with labeled rows and columns, similar to an Excel sheet or SQL table.
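
To make the relationship between the two structures concrete, here is a minimal sketch. The names and values (`ages`, `people`, the cities) are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
ages = pd.Series([22, 38, 26], index=["p1", "p2", "p3"], name="age")

# A DataFrame: several columns that share one row index.
people = pd.DataFrame(
    {"age": [22, 38, 26], "city": ["Oslo", "Lima", "Pune"]},
    index=["p1", "p2", "p3"],
)

# Selecting a single column from a DataFrame gives back a Series.
age_column = people["age"]
print(type(age_column).__name__)  # Series
print(age_column.loc["p2"])       # 38
```

In other words, a DataFrame can be thought of as a collection of Series that share the same index.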

Let’s dive into the basics and learn how to get the most out of Pandas.

Getting Started with Pandas

Here’s how you can create a Series in Pandas from different kinds of input data:

Creating a Series

From a Tuple

import pandas as pd

t = (10, 11, 12)
d = pd.Series(t)
print(d)

From a List

l1 = [45, 78, 56, 445, 78]
d = pd.Series(l1)
print(d)

From a NumPy Array

import numpy as np

arr = np.array([1, 2, 3, 4])
d1 = pd.Series(arr)
print(d1)

From a Dictionary

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
b = pd.Series(d)
print(b)
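
When a Series is built from a dictionary, the keys become the index labels. The sketch below shows this, along with what happens when an explicit `index` is passed. The label `'z'` is a made-up key that is deliberately absent from the dictionary.

```python
import pandas as pd

scores = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4})
print(scores.index.tolist())  # ['a', 'b', 'c', 'd']
print(scores["c"])            # 3

# An explicit index selects and reorders by label;
# labels missing from the dictionary ('z' here) become NaN.
subset = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4}, index=["d", "a", "z"])
print(subset)
```

Because of the NaN, `subset` is stored as floats rather than integers.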

Slicing a Series

Pandas makes it easy to slice and dice your data:

d = pd.Series(np.arange(10))
print(d[2:7])
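
Plain `[]` slicing can be ambiguous when the index is not the default 0, 1, 2, and so on. A common way to be explicit is `.iloc` for positions and `.loc` for labels. The sketch below uses a made-up letter index to show the difference.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, 20), index=list("abcdefghij"))

by_position = s.iloc[2:5]   # positions 2, 3, 4 (end position excluded)
by_label = s.loc["c":"e"]   # labels c, d, e (end label included)

print(by_position.tolist())  # [12, 13, 14]
print(by_label.tolist())     # [12, 13, 14]
```

Note that label-based slices include the end label, while position-based slices follow the usual Python convention and exclude it.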

Working with DataFrames

Creating a DataFrame

From a List of Lists

# Each inner list is one row; keep rows the same length and values the same type per column
data = [['abc', 1], ['def', 2], ['ghi', 3], ['hij', 4], ['klm', 5]]
a = pd.DataFrame(data)
print(a)

Adding Column Names and Custom Indexes

d = pd.DataFrame(data, columns=['A', 'B'], index=['A', 'B', 'C', 'D', 'E'])
print(d)

From a Dictionary

data = {'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'age': [45, 12, 48, 25, 12, 56]}
d = pd.DataFrame(data)
# Renaming Columns
d.rename({'age': 'Age_in_Years', 'name': 'Candidate_Name'}, axis=1, inplace=True)
print(d)

Exploring Your Data

The examples below use the well-known Titanic dataset, which you can download from Kaggle (Titanic Dataset).

Read the Data

data = pd.read_csv('titanic.csv')
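
`read_csv` also accepts options that control which columns are loaded and which one becomes the index. The following is a small sketch that reads from an in-memory string instead of a file, so it runs without downloading anything. The rows are made up and only mimic the Titanic column names.

```python
import io

import pandas as pd

# Made-up rows that only mimic the Titanic column layout.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
101,0,3,male,30,8.05
102,1,1,female,41,60.0
103,1,2,female,,15.5
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col="PassengerId",                           # use this column as the row index
    usecols=["PassengerId", "Survived", "Sex", "Age"],  # load only these columns
)
print(sample)
```

The empty `Age` field in the last row is read as NaN, which is how Pandas represents missing numeric values.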

Once you have your DataFrame, here are some handy methods to understand your data:

Viewing Data

print(data.head())  # First 5 rows
print(data.tail()) # Last 5 rows

Checking Structure and Summary

print(data.columns)  # Column names
data.info()  # Data types and non-null counts (prints directly and returns None)
print(data.describe())  # Summary statistics for numeric columns
print(data.describe(include='O'))  # Summary for object (text) columns
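
A few other attributes and methods are handy for a first look. Here is a minimal sketch on a tiny stand-in DataFrame (`sample`, with made-up rows that reuse the Titanic column names).

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: made-up rows, same column names.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, None, 54.0],
})

print(sample.shape)                  # (4, 3): rows, columns
print(sample.dtypes)                 # data type of each column
print(sample["Sex"].value_counts())  # how often each value occurs
print(sample["Pclass"].nunique())    # 2 distinct classes
```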

Filtering Data

Pandas makes filtering intuitive and efficient:

Basic Filtering

# Rows where Age > 25
df = data.loc[data['Age'] > 25]
print(df)

Complex Conditions

# Females in First Class who survived
females_first_class = data.loc[(data['Sex'] == 'female') & (data['Pclass'] == 1) & (data['Survived'] == 1)]
print(females_first_class)
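
Conditions can also be combined with `|` (or) and `~` (not), and `isin` or `query` often read more cleanly than long chains of comparisons. The sketch below runs on a small made-up DataFrame with Titanic-style column names.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 2],
    "Age": [22.0, 38.0, 26.0, 54.0, 14.0],
    "Survived": [0, 1, 1, 0, 1],
})

# OR: first class, or younger than 18
first_or_child = sample.loc[(sample["Pclass"] == 1) | (sample["Age"] < 18)]

# NOT: everyone outside third class
not_third = sample.loc[~(sample["Pclass"] == 3)]

# isin: membership in a list of values
upper_classes = sample.loc[sample["Pclass"].isin([1, 2])]

# query: the same kind of filter written as a string expression
survivors = sample.query("Sex == 'female' and Pclass == 1 and Survived == 1")

print(len(first_or_child), len(not_third), len(upper_classes), len(survivors))  # 3 3 3 1
```

Each condition needs its own parentheses, because `&`, `|`, and `~` bind more tightly than comparison operators in Python.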

Adding and Dropping Columns

Adding Columns

data['New_Column'] = np.arange(1, len(data) + 1)

Dropping Columns

data.drop(['New_Column'], axis=1, inplace=True)
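
New columns are usually derived from existing ones. As a sketch, the example below builds a hypothetical `FamilySize` column from `SibSp` and `Parch` (the values are made up) and then drops the source columns without `inplace`.

```python
import pandas as pd

# Made-up values; SibSp and Parch are Titanic column names.
sample = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})

# Column arithmetic is element-wise: siblings/spouses + parents/children + the passenger
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# drop returns a new DataFrame; the original is left unchanged
trimmed = sample.drop(columns=["SibSp", "Parch"])

print(trimmed["FamilySize"].tolist())  # [2, 3, 5]
```

Assigning the result to a new name instead of using `inplace=True` keeps the original DataFrame available if you need to backtrack.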

Sorting and Grouping

Sorting

data.sort_values('Fare', ascending=True, ignore_index=True, inplace=True)

Grouping

grouped = data.groupby('Pclass').agg({'Fare': 'max', 'Age': 'mean'})
print(grouped)
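
Sorting can use several columns at once, and `agg` supports named aggregation to give the output columns readable names. The sketch below uses made-up fares and ages, and the output names (`max_fare`, `mean_age`, `passengers`) are chosen only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 3, 1, 2, 3],
    "Fare": [80.0, 7.5, 50.0, 13.0, 8.5],
    "Age": [38.0, 22.0, 54.0, 30.0, 26.0],
})

# Sort by class ascending, then by fare descending within each class
ordered = sample.sort_values(
    ["Pclass", "Fare"], ascending=[True, False], ignore_index=True
)
print(ordered["Fare"].tolist())  # [80.0, 50.0, 13.0, 8.5, 7.5]

# Named aggregation: output_name=(input_column, function)
summary = sample.groupby("Pclass").agg(
    max_fare=("Fare", "max"),
    mean_age=("Age", "mean"),
    passengers=("Fare", "size"),
)
print(summary)
```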

Handling Missing Values

# Check for missing values
print(data.isnull().sum())

# Fill missing Age values with mean
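# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent Pandas versions
data['Age'] = data['Age'].fillna(data['Age'].mean())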
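
A single overall mean is the simplest choice, but it is only one option. As an alternative sketch (not the approach used above), missing ages can be filled with the median of each passenger class by combining `groupby` and `transform`. The rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, None, 50.0, 20.0, 30.0, None],
})

# transform returns one value per row: the median Age of that row's class
class_median = sample.groupby("Pclass")["Age"].transform("median")

# fillna with a Series fills each gap with the value at the same index
sample["Age"] = sample["Age"].fillna(class_median)

print(sample["Age"].tolist())  # [40.0, 45.0, 50.0, 20.0, 30.0, 25.0]
```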

Changing Data Types

You can optimize memory usage by changing data types:

data['Age'] = data['Age'].astype('float32')  # Half the size of the default float64
data['Fare'] = pd.to_numeric(data['Fare'], downcast='float')
data['Parch'] = pd.to_numeric(data['Parch'], downcast='integer')
data.info()  # Prints the updated dtypes and memory usage
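
To check whether a conversion actually saves memory, you can compare `memory_usage` before and after. Here is a minimal sketch on a made-up DataFrame. It also converts a repetitive text column to the `category` dtype, which is an addition to the steps above.

```python
import numpy as np
import pandas as pd

# Made-up data: 1000 rows with a small integer column and a repetitive text column.
sample = pd.DataFrame({
    "Parch": np.zeros(1000, dtype="int64"),
    "Sex": ["male", "female"] * 500,
})

before = sample.memory_usage(deep=True).sum()

# Downcast to the smallest integer type that can hold the values
sample["Parch"] = pd.to_numeric(sample["Parch"], downcast="integer")
# Store each distinct string once and keep small integer codes per row
sample["Sex"] = sample["Sex"].astype("category")

after = sample.memory_usage(deep=True).sum()

print(sample.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

The exact byte counts depend on your Pandas version, so the example prints them rather than assuming a value.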

Merging and Pivot Tables

Merging DataFrames

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
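
Because both inputs have a `value` column, the merged result distinguishes them with suffixes, and the `how` argument decides which keys survive. The sketch below reuses the same two DataFrames and compares an inner join with an outer join. The suffix names are arbitrary choices.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "B", "D"], "value": [4, 5, 6]})

# Inner join: only keys present in both (A and B); default suffixes are _x and _y
inner = pd.merge(left, right, on="key", how="inner")
print(inner.columns.tolist())  # ['key', 'value_x', 'value_y']

# Outer join: all keys from either side; indicator adds a _merge column
outer = pd.merge(
    left, right, on="key", how="outer",
    suffixes=("_left", "_right"), indicator=True,
)
print(outer)
```

In the outer join, `C` appears with a missing `value_right` and `D` with a missing `value_left`, and the `_merge` column records which side each row came from.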

Pivot Table

pivot = data.pivot_table(index='Pclass', values='Fare', aggfunc='mean')
print(pivot)
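
A pivot table becomes more useful with a second dimension. As a sketch, passing `columns` spreads the values of another column across the top of the table. The rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male"],
    "Fare": [80.0, 50.0, 8.0, 7.0, 9.0],
})

# Rows: passenger class; columns: sex; cells: mean fare
pivot = sample.pivot_table(
    index="Pclass", columns="Sex", values="Fare", aggfunc="mean"
)
print(pivot)
print(pivot.loc[3, "male"])  # 8.0, the mean of 7.0 and 9.0
```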

Visualizing Data

Pandas integrates well with visualization libraries like Matplotlib:

import matplotlib.pyplot as plt

data['Age'].hist(bins=20)
plt.title('Age Distribution')
plt.show()

Insights from Titanic Data

  1. Survival by Sex: Comparing survival rates across the Sex column showed that women survived at a higher rate than men.
  2. Age and Survival: Older passengers did not appear to be prioritized for survival.
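
Comparisons like these can be checked with a grouped mean: since `Survived` is 0 or 1, its mean within a group is that group's survival rate. The sketch below shows the technique on made-up rows, so its numbers say nothing about the real dataset. The age bands and their labels are arbitrary choices.

```python
import pandas as pd

# Made-up rows: these numbers are NOT from the Titanic dataset.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [29.0, 62.0, 35.0, 70.0, 8.0, 45.0],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column = share of passengers in the group who survived
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Bucket ages into bands, then compute the rate per band
age_band = pd.cut(
    sample["Age"], bins=[0, 18, 60, 120], labels=["child", "adult", "senior"]
)
rate_by_age = sample.groupby(age_band, observed=True)["Survived"].mean()
print(rate_by_age)
```

Running the same two `groupby` calls on the DataFrame loaded from `titanic.csv` lets you verify both observations yourself.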

Sample Tasks

Find Passengers Older Than 50

print(data.loc[data['Age'] > 50])

Filter by Multiple Conditions

# Female passengers under 38
print(data.loc[(data['Sex'] == 'female') & (data['Age'] < 38)])

Pandas is an essential tool for any data professional. Whether you’re cleaning messy data, conducting exploratory analysis, or preparing features for machine learning, Pandas makes it all seamless. Dive in, experiment, and unlock the full potential of your datasets!
