Basics of Pandas: 10 Core Commands for Data Analysis

Pandas is a widely used Python library for data manipulation and analysis. It provides tools for working with structured data, such as tables and time series, which makes it a staple of data preprocessing.

Whether you’re cleaning data, exploring datasets, or preparing data for machine learning, Pandas is your go-to library. This article introduces the basics of Pandas and explores 10 essential commands for beginners.

What is Pandas?

Pandas is an open-source Python library for data manipulation and analysis. It is built on top of NumPy, another Python library for numerical computing.

Pandas introduces two main data structures:

  • Series: A one-dimensional labeled array capable of holding any data type (e.g., integers, strings, floats).
  • DataFrame: A two-dimensional labeled data structure, similar to a spreadsheet or SQL table, where data is organized in rows and columns.

To use Pandas, you need to install it first using the pip package manager:

pip install pandas

Once installed, import it in your Python script:

import pandas as pd

The alias pd is commonly used to make Pandas commands shorter and easier to write.
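
With Pandas imported, the two data structures described above can be created directly from Python lists and dictionaries. The following is a minimal sketch; the names prices and inventory and their values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label in the index.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "bread", "cheese"])

# A DataFrame: a table whose columns share one row index.
inventory = pd.DataFrame(
    {"item": ["apple", "bread", "cheese"], "quantity": [10, 4, 7]}
)

print(prices["bread"])              # look up a value by its label
print(inventory.shape)              # (number of rows, number of columns)
print(type(inventory["quantity"]))  # selecting one column gives a Series
```

Selecting a single column of a DataFrame returns a Series, which is why most Series operations also apply to columns.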

Now let’s dive into the essential commands!

1. Loading Data

Before working with data, you need to load it into a Pandas DataFrame. For CSV files, use the read_csv() function:

data = pd.read_csv('data.csv')
print(data.head())

  • read_csv('data.csv'): Reads the CSV file into a DataFrame.
  • head(): Returns the first five rows of the DataFrame (the default when no argument is given).

This command is crucial for starting any data preprocessing task.
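
If you do not have a CSV file at hand, you can try read_csv() on text held in memory, since it accepts file-like objects as well as paths. The sketch below assumes a small made-up table; the column names and values are hypothetical stand-ins for the contents of 'data.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'.
csv_text = """Name,Category,Sales
Alice,A,120
Bob,B,85
Cara,A,
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows
print(sample.dtypes)    # the empty Sales cell is read as NaN, so Sales is float
```

Note how the empty cell in the last row is loaded as a missing value rather than as an empty string.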

2. Viewing Data

To understand your dataset, you can use the following commands:

  • head(n): View the first n rows of the DataFrame.
  • tail(n): View the last n rows of the DataFrame.
  • info(): Get a summary of the DataFrame, including column names, non-null counts, and data types.
  • describe(): Get statistical summaries of numerical columns.

These commands help you quickly assess the structure and contents of your data.

data.info()  # prints the summary itself and returns None, so no print() is needed
print(data.describe())
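
To see these commands on something concrete, here is a small sketch using a made-up scores table (the column names and values are assumptions for illustration only).

```python
import pandas as pd

scores = pd.DataFrame(
    {"student": ["Ann", "Ben", "Cy", "Di"], "score": [90, 72, 85, 61]}
)

print(scores.head(2))   # first two rows
print(scores.tail(1))   # last row
scores.info()           # column names, non-null counts, and dtypes

# describe() returns a DataFrame, so individual statistics can be looked up.
summary = scores.describe()
print(summary.loc["mean", "score"])  # 77.0
```

Because describe() returns a regular DataFrame, you can select rows such as "mean" or "max" from it like from any other table.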

3. Selecting Data

To select specific rows or columns, use the following methods:

Select a single column:

column_data = data['ColumnName']

Select multiple columns:

selected_data = data[['Column1', 'Column2']]

Select rows using slicing:

rows = data[10:20]  # Rows 10 to 19

Select rows and columns using loc or iloc:

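# By labels (loc): the end label is included, so this returns labels 0 through 5
subset = data.loc[0:5, ['Column1', 'Column2']]

# By index positions (iloc): the end position is excluded, so this returns positions 0 through 4
subset = data.iloc[0:5, 0:2]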
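
The difference between loc and iloc slicing is easy to miss, so here is a short sketch that makes it visible. The DataFrame named frame is a made-up example with the default integer index.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Column1": range(10), "Column2": range(10, 20), "Column3": range(20, 30)}
)

by_label = frame.loc[0:5, ["Column1", "Column2"]]  # labels 0 through 5
by_position = frame.iloc[0:5, 0:2]                 # positions 0 through 4

print(len(by_label))     # 6
print(len(by_position))  # 5
```

The same slice 0:5 yields six rows with loc and five with iloc, because label-based slices include their end point.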

4. Filtering Data

Filtering allows you to select rows based on conditions.

filtered_data = data[data['ColumnName'] > 50]

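You can combine multiple conditions using & (AND) or | (OR). Wrap each condition in parentheses, because & and | take precedence over comparison operators such as > and <: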

filtered_data = data[(data['Column1'] > 50) & (data['Column2'] < 100)]

This is useful for narrowing down your dataset to relevant rows.
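
As a rough illustration of how conditions combine, the sketch below applies AND, OR, and NOT (the ~ operator) to a made-up orders table; the values are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame(
    {"Column1": [40, 60, 75, 90], "Column2": [150, 80, 120, 95]}
)

high = orders["Column1"] > 50    # boolean Series, one value per row
small = orders["Column2"] < 100

both = orders[high & small]         # rows where both conditions hold
either = orders[high | small]       # rows where at least one holds
neither = orders[~(high | small)]   # rows where none holds

print(len(both), len(either), len(neither))  # 2 3 1
```

Storing each condition in its own variable is optional, but it keeps longer filters readable and avoids the parentheses pitfall.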

5. Adding or Modifying Columns

You can create new columns or modify existing ones:

Add a new column:

data['NewColumn'] = data['Column1'] + data['Column2']

Modify an existing column:

data['Column1'] = data['Column1'] * 2

These operations are essential for feature engineering and data transformation.
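
Here is a small sketch of both styles of adding a column, using a made-up sales table. Direct assignment changes the DataFrame in place, while assign() returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

sales = pd.DataFrame({"price": [2.0, 5.0, 3.5], "units": [10, 3, 4]})

# Arithmetic between columns is applied row by row.
sales["revenue"] = sales["price"] * sales["units"]

# assign() returns a copy with the extra column; `sales` itself is unchanged.
with_flag = sales.assign(big_order=sales["revenue"] > 15)

print(with_flag)
```

Which style to prefer is a matter of taste; assign() is handy when you want to keep the original data intact.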

6. Handling Missing Data

Real-world datasets often contain missing values, and Pandas provides tools to handle them:

Check for missing values:

print(data.isnull().sum())

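Drop rows or columns with missing values (choose one of the two):

data = data.dropna()        # drop every row that contains a missing value
data = data.dropna(axis=1)  # or drop every column that contains a missing value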

Fill missing values with a constant or with the column mean (choose one of the two):

data['ColumnName'] = data['ColumnName'].fillna(0)
data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].mean())

Handling missing data ensures your dataset is clean and ready for analysis.
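
The sketch below walks through the three steps on a made-up readings table. The column names and the choice of fill values (the mean for one column, 0 for the other) are assumptions for illustration, not a recommendation for every dataset.

```python
import pandas as pd

nan = float("nan")
readings = pd.DataFrame(
    {"temp": [20.0, nan, 24.0], "humidity": [0.5, 0.6, nan]}
)

# Count missing values per column.
missing_per_column = readings.isnull().sum()

# Option 1: keep only the rows without any missing value.
rows_kept = readings.dropna()

# Option 2: fill each column with its own replacement value.
filled = readings.fillna({"temp": readings["temp"].mean(), "humidity": 0})

print(missing_per_column)
print(filled)
```

Passing a dictionary to fillna() lets you use a different replacement for each column in a single call.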

7. Sorting Data

To sort your dataset by one or more columns, use the sort_values() method, which returns a new, sorted DataFrame:

sorted_data = data.sort_values(by='ColumnName', ascending=True)

For multiple columns:

sorted_data = data.sort_values(by=['Column1', 'Column2'], ascending=[True, False])

Sorting is helpful for organizing data and finding patterns.
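
To see how a multi-column sort behaves, consider this sketch with a made-up staff table. It sorts by team in ascending order and, within each team, by salary in descending order.

```python
import pandas as pd

staff = pd.DataFrame(
    {"team": ["B", "A", "B", "A"], "salary": [50, 70, 65, 40]}
)

ranked = staff.sort_values(by=["team", "salary"], ascending=[True, False])
print(ranked.index.tolist())  # [1, 3, 2, 0]: rows keep their original labels

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting.
ranked_reset = ranked.reset_index(drop=True)
print(ranked_reset)
```

Sorted rows keep their original index labels, so resetting the index is a common follow-up step when you want a clean 0-based numbering.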

8. Grouping Data

The groupby() function is used to group data and perform aggregate operations:

grouped_data = data.groupby('ColumnName')['AnotherColumn'].sum()

Common aggregation functions include:

  • sum(): Sum of values.
  • mean(): Average of values.
  • count(): Count of non-null values.

Example:

grouped_data = data.groupby('Category')['Sales'].mean()

This command is essential for summarizing data.
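
If you need several aggregations at once, one option is the agg() method, which accepts a list of function names. The sketch below uses a made-up sales table with one missing value to show that count() skips nulls.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Category": ["A", "B", "A", "B", "A"],
        "Sales": [10, 20, 30, 40, float("nan")],
    }
)

# One row per category, one column per aggregation.
per_category = sales.groupby("Category")["Sales"].agg(["sum", "mean", "count"])
print(per_category)
```

Category A has three rows but a count of 2, because the missing value is ignored by sum(), mean(), and count() alike.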

9. Merging and Joining DataFrames

To combine multiple DataFrames, use the following methods:

Concatenate:

combined_data = pd.concat([data1, data2], axis=0)

Merge:

merged_data = pd.merge(data1, data2, on='KeyColumn')

Join (matches rows on the index; overlapping column names require lsuffix or rsuffix):

joined_data = data1.join(data2, how='inner')

These operations allow you to combine datasets for a comprehensive analysis.
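
The three methods differ mainly in how rows are matched. The following sketch uses two made-up tables, customers and orders, that share a KeyColumn; the names and values are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({"KeyColumn": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
orders = pd.DataFrame({"KeyColumn": [2, 3, 4], "total": [25, 40, 10]})

# merge: match rows on a column; the default is an inner join.
inner = pd.merge(customers, orders, on="KeyColumn")
left = pd.merge(customers, orders, on="KeyColumn", how="left")

# concat: stack tables on top of each other.
stacked = pd.concat([customers, customers], axis=0, ignore_index=True)

# join: match rows on the index, so the key is moved into the index first.
joined = customers.set_index("KeyColumn").join(
    orders.set_index("KeyColumn"), how="inner"
)

print(len(inner), len(left), len(stacked), len(joined))  # 2 3 6 2
```

With how="left", every customer is kept and customers without an order get a missing value in the total column.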

10. Exporting Data

After processing your data, you may need to save it using the to_csv() function:

data.to_csv('processed_data.csv', index=False)

This command saves the DataFrame to a CSV file without the index column. You can also export to other formats such as Excel, JSON, or SQL with to_excel(), to_json(), and to_sql().
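
To see what index=False changes, you can call to_csv() without a file path, in which case it returns the CSV text as a string. The sketch below uses a made-up result table and reads the text back to check the round trip.

```python
import io
import pandas as pd

result = pd.DataFrame({"Category": ["A", "B"], "Sales": [10, 20]})

with_index = result.to_csv()                # no path given: returns a string
without_index = result.to_csv(index=False)

print(with_index.splitlines()[0])      # ,Category,Sales
print(without_index.splitlines()[0])   # Category,Sales

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

When the index is written, the header starts with an empty column name, which shows up as an extra unnamed column if the file is read back without further options.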

Conclusion

Pandas is an indispensable tool for data preprocessing, offering a wide range of functions to manipulate and analyze data.

The 10 commands covered in this article provide a solid foundation for beginners to start working with Pandas. As you practice and explore more, you’ll discover the full potential of this powerful library.
