Pandas is a powerful data manipulation library in Python that offers intuitive data structures and functions for data analysis. The name Pandas is derived from the term “Panel Data,” which refers to multi-dimensional structured data sets commonly used in statistics and econometrics. In this article, we will explore the key features of Pandas and demonstrate how to use it effectively for data manipulation and analysis.
To install Pandas, simply run the following command in your terminal or command prompt:
pip install pandas
Once the installation is complete, you can import Pandas in your Python script as follows:
import pandas as pd
The alias pd is widely adopted in the Python community and allows for a shorter and more convenient way to call Pandas functions.
Pandas provides two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns), while a Series is a one-dimensional labeled array capable of holding any data type.
You can create a DataFrame from various data sources, such as lists, dictionaries, external files (e.g., CSV or Excel), or SQL databases. Here are some examples:
## Creating a DataFrame from a dictionary
data = {
"col1": [1, 2, 3, 4],
"col2": ["A", "B", "C", "D"],
"col3": [1.1, 2.2, 3.3, 4.4],
}
df = pd.DataFrame(data)
## Creating a Series from a list
s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])
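To make the relationship between the two structures more concrete, here is a minimal sketch that reuses the sample data from above. It shows that each DataFrame column carries its own data type, that a single column is itself a Series, and that a Series can be read by label.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["A", "B", "C", "D"],
    "col3": [1.1, 2.2, 3.3, 4.4],
})
s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Each column has its own dtype, which is what "heterogeneous" means here.
print(df.dtypes)

# A single column pulled out of a DataFrame is itself a Series.
print(type(df["col1"]).__name__)  # Series

# A Series can be addressed by its labels, not only by position.
print(s["B"])  # 2
```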
To load data from an external CSV file, you can use the pd.read_csv() function:
filename = "example.csv"
df = pd.read_csv(filename)
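If you do not have a CSV file at hand, the following sketch can be run as is. It assumes a small in-memory string (the contents are made up for illustration) in place of example.csv, relying on the fact that pd.read_csv() accepts file-like objects as well as paths.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the contents are illustrative only.
csv_text = "col1,col2,col3\n1,A,1.1\n2,B,2.2\n3,C,3.3\n4,D,4.4\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows, for a quick visual check
print(df.dtypes)   # column types inferred by the parser
```

Inspecting head() and dtypes right after loading is a cheap way to catch a wrong delimiter or a numeric column that was read as text.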
Pandas provides a variety of functions for data manipulation, such as filtering, sorting, and aggregating. Here are some examples:
To select a specific column from a DataFrame, put the column name in square brackets; the result is a Series:
col1 = df["col1"]
To select multiple columns, pass a list of column names:
selected_columns = df[["col1", "col3"]]
To select rows by their integer position, use the iloc[] indexer:
first_row = df.iloc[0]
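Position-based selection has a label-based counterpart, loc[]. The sketch below contrasts the two on a small DataFrame whose row labels ("w" to "z") are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col3": [1.1, 2.2, 3.3, 4.4]},
    index=["w", "x", "y", "z"],  # hypothetical row labels
)

first_by_position = df.iloc[0]   # row at integer position 0
first_by_label = df.loc["w"]     # row whose index label is "w"

first_two = df.iloc[0:2]         # positions 0 and 1 (end excluded)
through_x = df.loc["w":"x"]      # labels "w" through "x" (end included)

print(first_by_position.equals(first_by_label))  # True
print(len(first_two), len(through_x))            # 2 2
```

Note the slicing difference: iloc[] follows Python's half-open convention, while a loc[] slice includes its end label.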
To filter rows based on a specific condition, index the DataFrame with a boolean expression:
filtered_df = df[df["col1"] > 2]
You can also combine multiple conditions using the & (AND) or | (OR) operators. Wrap each condition in parentheses, because & and | bind more tightly than comparison operators:
filtered_df = df[(df["col1"] > 2) & (df["col3"] < 4)]
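For completeness, here is a short sketch of the other boolean combinations on the same sample data: OR with |, negation with ~, and isin() as a compact alternative to chaining several equality tests.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["A", "B", "C", "D"],
    "col3": [1.1, 2.2, 3.3, 4.4],
})

# OR: rows where col1 is below 2 or col3 is above 4.
either = df[(df["col1"] < 2) | (df["col3"] > 4)]

# NOT: ~ inverts a boolean mask.
not_b = df[~(df["col2"] == "B")]

# isin() replaces several == tests joined with |.
a_or_d = df[df["col2"].isin(["A", "D"])]

print(either["col1"].tolist())  # [1, 4]
print(not_b["col2"].tolist())   # ['A', 'C', 'D']
print(a_or_d["col1"].tolist())  # [1, 4]
```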
To sort a DataFrame by a specific column, use the sort_values() method:
sorted_df = df.sort_values("col1", ascending=False)
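Sorting can also use several columns at once. The sketch below assumes a small made-up table with "group" and "score" columns and sorts by group ascending, then by score descending within each group. It also shows that sort_values() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Illustrative data; column names and values are made up for this sketch.
df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [10, 30, 20, 30],
})

# One ascending flag per sort column.
sorted_df = df.sort_values(["group", "score"], ascending=[True, False])

print(sorted_df["group"].tolist())  # ['a', 'a', 'b', 'b']
print(sorted_df["score"].tolist())  # [30, 30, 20, 10]
print(df["score"].tolist())         # [10, 30, 20, 30], original order kept
```

The sorted result keeps the original row labels; calling reset_index(drop=True) on it gives a fresh 0..n-1 index if that is what you need.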
Pandas provides various aggregation methods, such as sum(), mean(), min(), max(), and count(). You can apply them to a specific column like this:
mean_col1 = df["col1"].mean()
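As a quick sketch on the numeric sample columns, the aggregations can be called one at a time, several at once through agg(), or on the whole DataFrame to get one value per column.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col3": [1.1, 2.2, 3.3, 4.4],
})

print(df["col1"].sum())    # 10
print(df["col1"].mean())   # 2.5
print(df["col1"].count())  # 4, the number of non-missing values

# Several aggregations in one call; the result is a Series keyed by name.
stats = df["col1"].agg(["min", "max"])
print(stats["min"], stats["max"])  # 1 4

# Calling an aggregation on the DataFrame computes it per column.
col_means = df.mean()
print(col_means)
```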
To group data by a specific column and apply an aggregation function, use the groupby() method. Grouping a single column this way returns a Series indexed by the group labels:
grouped_df = df.groupby("col2")["col1"].sum()
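In the sample data every col2 value is unique, so each group holds a single row. The sketch below uses made-up data with repeated labels to show what grouping actually does, and uses agg() to compute several statistics per group.

```python
import pandas as pd

# Illustrative data with repeated group labels.
df = pd.DataFrame({
    "col2": ["A", "B", "A", "B"],
    "col1": [1, 2, 3, 4],
})

# One aggregate per group: a Series indexed by the group label.
totals = df.groupby("col2")["col1"].sum()
print(totals["A"], totals["B"])  # 4 6

# Several aggregates per group: a DataFrame with one column per statistic.
summary = df.groupby("col2")["col1"].agg(["sum", "mean", "count"])
print(summary.loc["A", "mean"])  # 2.0
```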
In this article, we introduced Pandas, its key data structures (DataFrame and Series), and demonstrated basic data manipulation techniques. Pandas is a powerful library for data analysis and manipulation in Python, and learning to use it effectively can greatly enhance your productivity as a data scientist or analyst.