Getting Started
Series and DataFrames
Introduction
The pandas library is the backbone of data manipulation in Python. It provides two main structures: Series and DataFrames, which are essential for handling and analyzing data. In this post, we’ll introduce these structures and explore some of the most common operations for data exploration and manipulation.
1. Introduction to Series
A Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.).
Creating a Series
You can create a Series from a list or array:
python
import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
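The same constructor also accepts a NumPy array. The following is a minimal sketch, assuming NumPy is installed alongside pandas; the names `arr` and `series_from_array` are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Build a Series from a NumPy array; the default index is 0..n-1
arr = np.array([1.5, 2.5, 3.5])
series_from_array = pd.Series(arr)

print(series_from_array)
print(series_from_array.dtype)  # the dtype is inherited from the array
```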
Setting Custom Index
You can assign custom labels to the data:
python
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
Accessing Elements
You can access elements in a Series using labels or integer positions:
python
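# Access by label
print(series['a'])
# Access by position (.iloc is explicit; plain series[0] on a label-indexed
# Series is deprecated in recent pandas versions)
print(series.iloc[0])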
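The explicit accessors `.loc` (labels) and `.iloc` (positions) also work for slices, and the two treat the end of a slice differently. A small sketch, rebuilding the Series from above so the block runs on its own (the result variable names are illustrative):

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['b']          # explicit label-based lookup
by_position = series.iloc[1]        # explicit position-based lookup
label_slice = series.loc['b':'d']   # label slices include both endpoints
position_slice = series.iloc[1:3]   # position slices exclude the stop

print(by_label, by_position)
print(label_slice)
print(position_slice)
```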
2. Introduction to DataFrames
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Creating a DataFrame
You can create a DataFrame from dictionaries, lists, or NumPy arrays:
python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Creating a DataFrame from a List of Lists
python
data = [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
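A NumPy array, the third input type mentioned above, works the same way as a list of lists. This is a sketch only: the `Salary` column and its values are made up for illustration and do not come from the earlier examples.

```python
import numpy as np
import pandas as pd

# A 3x2 numeric array; each row of the array becomes a DataFrame row
values = np.array([[25, 50000], [30, 60000], [35, 70000]])  # hypothetical data
df_from_array = pd.DataFrame(values, columns=['Age', 'Salary'])

print(df_from_array)
```

Because a NumPy array holds a single dtype, this route suits purely numeric data; mixed text and numbers are easier to build from a dictionary.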
3. Common Operations on Series and DataFrames
Accessing Data
By Column: You can access columns of a DataFrame as if they were Series:
python
print(df['Name'])
By Row: Use .loc[] for label-based indexing or .iloc[] for position-based indexing:
python
print(df.loc[0])   # First row by label
print(df.iloc[0])  # First row by position
Filtering Data
You can filter data based on conditions:
python
# Filter by Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
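Conditions can also be combined. The sketch below rebuilds the sample DataFrame so it runs on its own; `older_or_ny` and `coastal` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
older_or_ny = df[(df['Age'] > 30) | (df['City'] == 'New York')]
print(older_or_ny)

# isin() keeps rows whose value appears in the given list
coastal = df[df['City'].isin(['New York', 'Los Angeles'])]
print(coastal)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.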
Sorting Data
Sort the data by a specific column:
python
# Sorting by Age
sorted_df = df.sort_values(by='Age')
print(sorted_df)
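Sorting can also run in descending order or use a second column to break ties. In this sketch, the row for `Dana` is a hypothetical addition so that two people share the same age.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is hypothetical
    'Age': [25, 30, 35, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Oldest first
by_age_desc = df.sort_values(by='Age', ascending=False)
print(by_age_desc)

# Oldest first, ties broken alphabetically by name
by_age_then_name = df.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(by_age_then_name)
```

Note that `sort_values()` returns a new DataFrame and leaves `df` unchanged.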
4. Grouping Data
Grouping data is useful for aggregation. The groupby() method is a powerful tool for this.
Grouping by a Column
python
grouped = df.groupby('City')
print(grouped['Age'].mean())  # Find average age per city
Multiple Aggregations
You can apply multiple aggregation functions:
python
grouped = df.groupby('City').agg({'Age': ['mean', 'max'], 'Name': 'count'})
print(grouped)
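The dictionary form produces two-level column names. One alternative is named aggregation, which gives each result column a flat name of your choosing. A sketch, with a hypothetical fourth row (`Dana`) added so that one city has more than one person:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is hypothetical
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Chicago']
})

# Each keyword is an output column: name=(source column, aggregation)
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    people=('Name', 'count'),
)
print(summary)
```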
5. Handling Missing Data
Missing data is common in real-world datasets. pandas provides several methods to handle it.
Detecting Missing Data
python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})
print(df.isnull())  # Detect missing values
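On larger tables the full True/False grid is hard to read, so it often helps to summarize it. A short sketch using the same sample data (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# Count missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```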
Filling Missing Data
You can fill missing values with a specific value or method:
python
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill with the mean of the column
print(df)
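Filling with a fixed value, or carrying the previous value forward, follows the same pattern. A sketch on the original sample data; the placeholder values `'Unknown'` and `0` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# A dict fills each column with its own constant
filled_constants = df.fillna({'Name': 'Unknown', 'Age': 0})
print(filled_constants)

# Forward fill: reuse the last valid value above each gap
filled_forward = df['Age'].ffill()
print(filled_forward)
```

Which strategy is appropriate depends on the data: a mean or forward fill can hide real gaps, so it is worth checking how many values were filled.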
Dropping Missing Data
Alternatively, you can drop rows with missing values:
python
df = df.dropna()  # Drop rows with missing values
print(df)
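By default `dropna()` removes a row if any column is missing. The `subset` and `how` parameters narrow that down, as in this sketch on the original sample data (result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# Drop a row only when 'Age' is missing; a missing 'Name' is tolerated
has_age = df.dropna(subset=['Age'])
print(has_age)

# Drop a row only when every column is missing
not_all_missing = df.dropna(how='all')
print(not_all_missing)
```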
6. Practical Example: Exploring and Manipulating Data
Let’s work with a sample dataset to see how pandas can be used in practice.
Problem: A dataset contains information about employees, including their name, age, department, and salary. We will explore and manipulate this data.
python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
}
df = pd.DataFrame(data)

# Filtering data
it_department = df[df['Department'] == 'IT']
print(it_department)

# Sorting by Salary
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print(sorted_by_salary)

# Grouping by Department and calculating average salary
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()
print(avg_salary_by_dept)
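These steps can also be chained into a single expression. The sketch below answers one possible question, chosen here only for illustration: among employees older than 30, what is the average salary per department, highest first?

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

# Filter, group, aggregate, and sort in one chain
avg_salary_over_30 = (
    df[df['Age'] > 30]
    .groupby('Department')['Salary']
    .mean()
    .sort_values(ascending=False)
)
print(avg_salary_over_30)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.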
Conclusion
In this post, we’ve learned the fundamentals of using pandas to manipulate and analyze data. You now know how to create Series and DataFrames, filter, sort, and group data, and handle missing values. These are essential skills for working with data in Python.