4.1 Pandas in Python
Introduction to Pandas: -
- Pandas is one of the most widely used Python libraries for data analysis and data manipulation.
- It is a high-level data manipulation tool developed by Wes McKinney.
- It is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.
- The main uses of Pandas are:
Data analysis.
Data preparation.
Data manipulation.
Data modelling.
Features of Pandas
- It has a fast and efficient DataFrame object with default and customized indexing.
- It has tools for loading data into in-memory data objects from different file formats.
- It is used for data alignment and integrated handling of missing data.
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating-point data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and sub-setting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Grouping of data (group by) for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.
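A few of the features listed above can be sketched in one short example. The table contents, column names and index labels below are made up for illustration and are not taken from the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
sales = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "amount": [100.0, np.nan, 50.0, 70.0],
    },
    index=["a", "b", "c", "d"],  # customized (label-based) index
)

# Missing data is represented as NaN; isna() marks the missing entries.
missing_count = int(sales["amount"].isna().sum())

# Label-based slicing with .loc includes both end labels.
subset = sales.loc["a":"c"]

# Group by a column and aggregate; NaN values are skipped by sum().
totals = sales.groupby("region")["amount"].sum()

# Insert a column, then delete it again.
sales["double"] = sales["amount"] * 2
sales = sales.drop(columns="double")

# Merge (join) with another table on a shared key column.
managers = pd.DataFrame({"region": ["East", "West"], "manager": ["Asha", "Ravi"]})
merged = sales.merge(managers, on="region", how="left")

print(missing_count)
print(subset)
print(totals)
print(merged)
```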
Benefits of Pandas: - The main advantages of Pandas are:
- It has functions for analyzing, cleaning, exploring, and manipulating data.
- It allows us to analyze large data sets and draw conclusions based on statistical theory.
- It can clean messy data sets, and make them readable and relevant.
- It shortens the work of handling data. With the time saved, we can focus more on data analysis algorithms.
Installing Pandas: -
- On Windows, click on the Start button to open the Start menu.
- Type "cmd"; the Command Prompt app should appear as a listing in the Start menu.
- Open the Command Prompt and enter the following command:
py -m pip install pandas
Introduction to Pandas Objects: - Pandas supports two main data structures:
1. Series: - one-dimensional labeled arrays.
2. DataFrames: - two-dimensional data structure with rows and columns, much like a table.
1. Series: -
- A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
- A Pandas Series object is like a column in a table.
- Series are generally created from:
a. Arrays
b. Lists
c. Dict
a) Create a Series from Arrays: -
- First, we import the numpy module and create an array with its array( ) function; that array is then passed to the Series( ) function of pandas.
Example: -
Output: -
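A minimal sketch of creating a Series from a numpy array. The array values and the index labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values; the array contents are an assumption.
arr = np.array([10, 20, 30, 40])

# With no index given, pandas assigns the default integer index 0, 1, 2, ...
s = pd.Series(arr)
print(s)

# A customized index can be supplied through the index parameter.
s_labeled = pd.Series(arr, index=["a", "b", "c", "d"])
print(s_labeled["c"])  # value stored under the label "c"
```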
b) Create a Series from Lists: -
- To create a Series from a list, we first create the list and then pass it to the Series( ) function.
Output: -
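A minimal sketch of creating a Series from a list. The names in the list are hypothetical.

```python
import pandas as pd

# Illustrative list; the names are hypothetical.
names = ["Asha", "Ravi", "Meena"]

# Each list element becomes one value; the index defaults to 0, 1, 2.
s = pd.Series(names)
print(s)
```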
c) Create a Series from dict: -
- We can also create a Series from a dict (dictionary).
- All the keys in the dictionary will become the indices of the Series object, whereas all the values from the key-value pairs in the dictionary will become the values (data) of the Series object.
Example: -
Output: -
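A minimal sketch of the key-to-index mapping described above. The subjects and marks are made-up values.

```python
import pandas as pd

# Illustrative dictionary; keys become the index, values become the data.
marks = {"maths": 85, "physics": 78, "chemistry": 92}

s = pd.Series(marks)
print(s)

# Values can then be looked up by their key (label).
print(s["physics"])
```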
2) DataFrame: -
- A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
- A DataFrame is an object that represents data in the form of rows and columns.
- DataFrames are generally created from:
a. List
b. List of tuples
c. Dictionary
d. Excel spreadsheet files
e. .csv (comma-separated values) files
a) Create a DataFrame from Lists: -
- A DataFrame can be created using a single list.
- To do this, we pass a Python list as a parameter to the pandas DataFrame( ) function.
- The DataFrame( ) function is used to create a DataFrame in Pandas.
Example: -
Output: -
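A minimal sketch of creating a DataFrame from a single list. The values and the column name "Value" are assumptions.

```python
import pandas as pd

# Illustrative list of values.
values = [10, 20, 30, 40]

# A single list becomes one column; the columns parameter names it
# (without it, the column is simply labeled 0).
df = pd.DataFrame(values, columns=["Value"])
print(df)
```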
b) Create a DataFrame from List of Tuples: -
- A DataFrame can be created using a list of tuples.
- Each tuple is treated as one row of data.
- For example, to store the data of 3 employees, we create 3 tuples.
Example: -
Output: -
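A minimal sketch of the three-employee case, with one tuple per row. The employee records and the "SALARY" column are hypothetical.

```python
import pandas as pd

# Three hypothetical employees, one tuple per row: (id, name, salary).
employees = [
    (1001, "Asha", 50000.0),
    (1002, "Ravi", 42000.0),
    (1003, "Meena", 61000.0),
]

# The columns parameter supplies a name for each position in the tuples.
df = pd.DataFrame(employees, columns=["EMPID", "ENAME", "SALARY"])
print(df)
```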
c) Create a DataFrame from Dictionary: -
- It is also possible to create a DataFrame from a Python dictionary that contains employee data.
- A dictionary stores data in the form of key-value pairs.
- In this case, we take 'EMPID' and 'ENAME' as keys and corresponding lists as values.
Example: -
Output: -
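A minimal sketch using the 'EMPID' and 'ENAME' keys mentioned above. The ids and names stored under them are made-up values.

```python
import pandas as pd

# Keys become column names; each list holds that column's values.
# The ids and names are hypothetical.
empdata = {
    "EMPID": [1001, 1002, 1003],
    "ENAME": ["Asha", "Ravi", "Meena"],
}

df = pd.DataFrame(empdata)
print(df)
```

All lists in the dictionary need the same length, since each one fills a whole column.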
d) Create a DataFrame from Excel Spreadsheet: -
- We can also read an Excel file as a DataFrame.
- Let us assume an Excel spreadsheet file named "Emp.xlsx".
- We have created an Excel file which contains the employee id number, employee name, job and salary of each employee. This file is saved with the file name "Emp" and the extension "xlsx".
- To create a DataFrame, we first need to import the pandas package.
- We also need a package that can read Excel files. For .xlsx files pandas uses the openpyxl package; the xlrd package is needed only for the older .xls format.
- Install these packages from the Command Prompt:
py -m pip install xlrd
py -m pip install openpyxl
- To read the data from the Emp.xlsx file, the read_excel( ) function of the pandas package is used.
Example: -
Output: -
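A minimal sketch of reading an Excel workbook, assuming the openpyxl package is installed. The helper names read_employees and write_sample_workbook, the column names and the sample rows are all hypothetical; only read_excel( ) comes from the text above.

```python
import pandas as pd


def read_employees(path):
    """Read the first worksheet of an Excel workbook into a DataFrame."""
    # For .xlsx files pandas needs the openpyxl package to be installed.
    return pd.read_excel(path)


def write_sample_workbook(path):
    """Create a small workbook to experiment with (hypothetical data)."""
    sample = pd.DataFrame(
        {
            "EMPID": [1001, 1002],
            "ENAME": ["Asha", "Ravi"],
            "JOB": ["Analyst", "Clerk"],
            "SALARY": [50000, 42000],
        }
    )
    # index=False keeps the row labels out of the spreadsheet.
    sample.to_excel(path, index=False)
```

Calling write_sample_workbook("Emp.xlsx") and then print(read_employees("Emp.xlsx")) would display the four columns as a table. If the file is not in the current working directory, the full path has to be passed instead.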
e) Create a DataFrame from .CSV file: -
- CSV stands for "Comma Separated Values."
- In many cases, the data will be in the form of .csv files.
- It is one of the simplest ways of storing tabular data as plain text.
- A CSV file is similar to an Excel file, but it holds only plain text and so usually takes less space.
- It is important to know how to work with .csv files, because data scientists rely on .csv data in their day-to-day work.
- We have created a CSV file which contains the same data as the Excel file, i.e. the employee id number, employee name, job, and salary of the employees of a company. This file is saved with the file name "Emp" and the extension "csv".
- To read the data from the Emp.csv file, the read_csv( ) function of the pandas package is used.
Example: -
Output: -
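A minimal sketch of reading a CSV file with read_csv( ). So that it runs on its own, the sketch first writes a small "Emp.csv" into a temporary folder. The column names and rows are assumptions, not the original file's contents.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file contents; the column names and rows are assumptions.
csv_text = (
    "EMPID,ENAME,JOB,SALARY\n"
    "1001,Asha,Analyst,50000\n"
    "1002,Ravi,Clerk,42000\n"
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Emp.csv")
    with open(path, "w", newline="") as f:
        f.write(csv_text)

    # read_csv uses the first row as column names and infers column types.
    df = pd.read_csv(path)

print(df)
```

With a real file, only the pd.read_csv(path) line is needed, with path pointing to the .csv file.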