In this Project I performed Exploratory Data Analysis(EDA) and Confirmatory Data Analysis (CDA) on a given dataset. I tried to find monthly, and yearly profit by products, category, states and the top customers whose bring highest profit. Also, I did a assumption and check by doing confirmatory data analysis. Lets, go through the process:
Importing modules:
#pandas module
import pandas as pd
#data visualization module
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
#set default settings
pd.set_option('mode.chained_assignment',None)
pyo.init_notebook_mode()
Lets load dataset:
data = pd.read_csv('dataset.csv', engine = 'python')
Checking data size:
data.shape
(9994, 9)
Cheking the data:
data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | |
---|---|---|---|---|---|---|---|---|---|
0 | 11/8/2016 | Claire Gute | Kentucky | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 | 2 | 41.9136 |
1 | 11/8/2016 | Claire Gute | Kentucky | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 | 3 | 219.5820 |
2 | 6/12/2016 | Darrin Van Huff | California | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 | 2 | 6.8714 |
3 | 10/11/2015 | Sean O'Donnell | Florida | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | -383.0310 |
4 | 10/11/2015 | Sean O'Donnell | Florida | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 | 2 | 2.5164 |
Checking cloumn data types:
data.dtypes
Order Date object Customer Name object State object Category object Sub-Category object Product Name object Sales float64 Quantity int64 Profit float64 dtype: object
Time Information (Order Date)
Customer Information (Customer Name)
Place Information (State name)
Hierarchical Information about the products (Category, Sub-category, Product Name)
Sale Information (sales, profit, quantity)
Now, Let's start exploration:
Timespan of the data:
data['Order Date'] = pd.to_datetime(data['Order Date'])
from_ = data['Order Date'].min()
to_ = data['Order Date'].max()
print('We have the sales information from:', from_, "to", to_)
We have the sales information from: 2014-01-03 00:00:00 to 2017-12-30 00:00:00
Sort data by the date:
data = data.sort_values(by= 'Order Date')
data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | |
---|---|---|---|---|---|---|---|---|---|
7980 | 2014-01-03 | Darren Powers | Texas | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 16.448 | 2 | 5.5512 |
739 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Labels | Avery 508 | 11.784 | 3 | 4.2717 |
740 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 272.736 | 3 | -64.7748 |
741 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 3.540 | 2 | -5.4870 |
1759 | 2014-01-05 | Mick Brown | Pennsylvania | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 19.536 | 3 | 4.8840 |
Data preparation: extract year, month, and day from the Order Date column
pd.DatetimeIndex(data['Order Date']).year
Int64Index([2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, ... 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017], dtype='int64', name='Order Date', length=9994)
data['Year'] = pd.DatetimeIndex(data['Order Date']).year
data['Month'] = pd.DatetimeIndex(data['Order Date']).month
data['Day'] = pd.DatetimeIndex(data['Order Date']).day
data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | Year | Month | Day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7980 | 2014-01-03 | Darren Powers | Texas | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 16.448 | 2 | 5.5512 | 2014 | 1 | 3 |
739 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Labels | Avery 508 | 11.784 | 3 | 4.2717 | 2014 | 1 | 4 |
740 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 272.736 | 3 | -64.7748 | 2014 | 1 | 4 |
741 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 3.540 | 2 | -5.4870 | 2014 | 1 | 4 |
1759 | 2014-01-05 | Mick Brown | Pennsylvania | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 19.536 | 3 | 4.8840 | 2014 | 1 | 5 |
Profit gained over time by different product categories:
data_time_yearly_profit = data.groupby(['Year', 'Category']).agg({'Profit':'sum'}).reset_index()
data_time_yearly_profit.head()
Year | Category | Profit | |
---|---|---|---|
0 | 2014 | Furniture | 5457.7255 |
1 | 2014 | Office Supplies | 22593.4161 |
2 | 2014 | Technology | 21492.8325 |
3 | 2015 | Furniture | 3015.2029 |
4 | 2015 | Office Supplies | 25099.5338 |
Visualizing the results using a line chart:
Analyse the monthly profits gained from sales of different product categories:
px.line(data_time_yearly_profit, x='Year', y='Profit', color= 'Category')
data_time_monthly_profit = data.groupby(['Year', 'Month', 'Category']).agg({'Profit':'sum'}).reset_index()
data_time_monthly_profit['Date'] = data_time_monthly_profit.Year.astype(str)+ '-' + data_time_monthly_profit.Month.astype(str)+ "-01"
data_time_monthly_profit.head()
Year | Month | Category | Profit | Date | |
---|---|---|---|---|---|
0 | 2014 | 1 | Furniture | 805.4665 | 2014-1-01 |
1 | 2014 | 1 | Office Supplies | 788.9506 | 2014-1-01 |
2 | 2014 | 1 | Technology | 855.7736 | 2014-1-01 |
3 | 2014 | 2 | Furniture | 120.6917 | 2014-2-01 |
4 | 2014 | 2 | Office Supplies | 176.0910 | 2014-2-01 |
Visualizing results using a linechart:
px.line(data_time_monthly_profit, x='Date', y='Profit', color='Category')
Let's see how many unique costumers do we have:
len(data['Customer Name'].unique())
793
Let's see the yearly change in number of unique customers:
customer_data = data.groupby('Year').agg({'Customer Name':'nunique'}).reset_index()
customer_data
Year | Customer Name | |
---|---|---|
0 | 2014 | 595 |
1 | 2015 | 573 |
2 | 2016 | 638 |
3 | 2017 | 693 |
Visualizing the result:
px.line(customer_data, x='Year', y='Customer Name')
Finding top 10 customers who brought the highest profit:
top10_customers = data.groupby('Customer Name').agg({'Profit':'sum'}).reset_index().sort_values('Profit', ascending=False).head(10)
top10_customers
Customer Name | Profit | |
---|---|---|
730 | Tamara Chand | 8981.3239 |
622 | Raymond Buch | 6976.0959 |
671 | Sanjit Chand | 5757.4119 |
334 | Hunter Lopez | 5622.4292 |
6 | Adrian Barton | 5444.8055 |
757 | Tom Ashbrook | 4703.7883 |
157 | Christopher Martinez | 3899.8904 |
431 | Keith Dawkins | 3038.6254 |
35 | Andy Reiter | 2884.6208 |
194 | Daniel Raglin | 2869.0760 |
Visualizing the results using bar graph:
px.bar(top10_customers, x='Customer Name', y='Profit')
Let's analyze the profits gained in different states in the US (Based on dataset):
state_data = data.groupby('State').agg({'Profit':'sum'}).reset_index()
state_data
State | Profit | |
---|---|---|
0 | Alabama | 5786.8253 |
1 | Arizona | -3427.9246 |
2 | Arkansas | 4008.6871 |
3 | California | 76381.3871 |
4 | Colorado | -6527.8579 |
5 | Connecticut | 3511.4918 |
6 | Delaware | 9977.3748 |
7 | District of Columbia | 1059.5893 |
8 | Florida | -3399.3017 |
9 | Georgia | 16250.0433 |
10 | Idaho | 826.7231 |
11 | Illinois | -12607.8870 |
12 | Indiana | 18382.9363 |
13 | Iowa | 1183.8119 |
14 | Kansas | 836.4435 |
15 | Kentucky | 11199.6966 |
16 | Louisiana | 2196.1023 |
17 | Maine | 454.4862 |
18 | Maryland | 7031.1788 |
19 | Massachusetts | 6785.5016 |
20 | Michigan | 24463.1876 |
21 | Minnesota | 10823.1874 |
22 | Mississippi | 3172.9762 |
23 | Missouri | 6436.2105 |
24 | Montana | 1833.3285 |
25 | Nebraska | 2037.0942 |
26 | Nevada | 3316.7659 |
27 | New Hampshire | 1706.5028 |
28 | New Jersey | 9772.9138 |
29 | New Mexico | 1157.1161 |
30 | New York | 74038.5486 |
31 | North Carolina | -7490.9122 |
32 | North Dakota | 230.1497 |
33 | Ohio | -16971.3766 |
34 | Oklahoma | 4853.9560 |
35 | Oregon | -1190.4705 |
36 | Pennsylvania | -15559.9603 |
37 | Rhode Island | 7285.6293 |
38 | South Carolina | 1769.0566 |
39 | South Dakota | 394.8283 |
40 | Tennessee | -5341.6936 |
41 | Texas | -25729.3563 |
42 | Utah | 2546.5335 |
43 | Vermont | 2244.9783 |
44 | Virginia | 18597.9504 |
45 | Washington | 33402.6517 |
46 | West Virginia | 185.9216 |
47 | Wisconsin | 8401.8004 |
48 | Wyoming | 100.1960 |
In dataset fullname of states are given. But, Plotly uses abbreviated two-letter postal codes for state locations so it will be necessary to create a dictionary that contains conversions of the full names of states into abbreviations.
state_codes = {
'Alabama': 'AL',
'Alaska': 'AK',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}
Mapping the abbreviated postal codes to the State column:
state_data.State = state_data.State.map(state_codes)
state_data
State | Profit | |
---|---|---|
0 | AL | 5786.8253 |
1 | AZ | -3427.9246 |
2 | AR | 4008.6871 |
3 | CA | 76381.3871 |
4 | CO | -6527.8579 |
5 | CT | 3511.4918 |
6 | DE | 9977.3748 |
7 | DC | 1059.5893 |
8 | FL | -3399.3017 |
9 | GA | 16250.0433 |
10 | ID | 826.7231 |
11 | IL | -12607.8870 |
12 | IN | 18382.9363 |
13 | IA | 1183.8119 |
14 | KS | 836.4435 |
15 | KY | 11199.6966 |
16 | LA | 2196.1023 |
17 | ME | 454.4862 |
18 | MD | 7031.1788 |
19 | MA | 6785.5016 |
20 | MI | 24463.1876 |
21 | MN | 10823.1874 |
22 | MS | 3172.9762 |
23 | MO | 6436.2105 |
24 | MT | 1833.3285 |
25 | NE | 2037.0942 |
26 | NV | 3316.7659 |
27 | NH | 1706.5028 |
28 | NJ | 9772.9138 |
29 | NM | 1157.1161 |
30 | NY | 74038.5486 |
31 | NC | -7490.9122 |
32 | ND | 230.1497 |
33 | OH | -16971.3766 |
34 | OK | 4853.9560 |
35 | OR | -1190.4705 |
36 | PA | -15559.9603 |
37 | RI | 7285.6293 |
38 | SC | 1769.0566 |
39 | SD | 394.8283 |
40 | TN | -5341.6936 |
41 | TX | -25729.3563 |
42 | UT | 2546.5335 |
43 | VT | 2244.9783 |
44 | VA | 18597.9504 |
45 | WA | 33402.6517 |
46 | WV | 185.9216 |
47 | WI | 8401.8004 |
48 | WY | 100.1960 |
Creating a choropleth map to visualize the profit gained in different states:
px.choropleth( state_data,
locations="State",
color='Profit',
locationmode="USA-states",
scope='usa',
title='Proft gained in different states',
color_continuous_scale='Blugrn')
Choropleth map to visualize the profit gained by selling technology products in different states:
tech_data = data[data.Category == 'Technology'].groupby('State').agg({'Profit':'sum'}).reset_index()
px.choropleth( tech_data,
locations=tech_data.State.map(state_codes),
color='Profit',
locationmode="USA-states",
scope='usa',
title='Proft gained in different states selling tech product',
color_continuous_scale='Pubu')
data
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | Year | Month | Day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7980 | 2014-01-03 | Darren Powers | Texas | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 16.448 | 2 | 5.5512 | 2014 | 1 | 3 |
739 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Labels | Avery 508 | 11.784 | 3 | 4.2717 | 2014 | 1 | 4 |
740 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 272.736 | 3 | -64.7748 | 2014 | 1 | 4 |
741 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 3.540 | 2 | -5.4870 | 2014 | 1 | 4 |
1759 | 2014-01-05 | Mick Brown | Pennsylvania | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 19.536 | 3 | 4.8840 | 2014 | 1 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5091 | 2017-12-30 | Jill Matthias | Colorado | Office Supplies | Fasteners | Bagged Rubber Bands | 3.024 | 3 | -0.6048 | 2017 | 12 | 30 |
908 | 2017-12-30 | Patrick O'Donnell | New York | Office Supplies | Binders | Wilson Jones Legal Size Ring Binders | 52.776 | 3 | 19.7910 | 2017 | 12 | 30 |
907 | 2017-12-30 | Patrick O'Donnell | New York | Technology | Phones | Gear Head AU3700S Headset | 90.930 | 7 | 2.7279 | 2017 | 12 | 30 |
1296 | 2017-12-30 | Erica Bern | California | Office Supplies | Binders | Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl | 13.904 | 2 | 4.5188 | 2017 | 12 | 30 |
906 | 2017-12-30 | Patrick O'Donnell | New York | Furniture | Bookcases | Bush Westfield Collection Bookcases, Fully Ass... | 323.136 | 4 | 12.1176 | 2017 | 12 | 30 |
9994 rows × 12 columns
product_data = data.groupby(['Category', 'Sub-Category']).agg({'Profit':'sum'}).reset_index()
product_data = product_data[product_data.Profit >0]
product_data['Sales'] = 'Any'
product_data
Category | Sub-Category | Profit | Sales | |
---|---|---|---|---|
1 | Furniture | Chairs | 26590.1663 | Any |
2 | Furniture | Furnishings | 13059.1436 | Any |
4 | Office Supplies | Appliances | 18138.0054 | Any |
5 | Office Supplies | Art | 6527.7870 | Any |
6 | Office Supplies | Binders | 30221.7633 | Any |
7 | Office Supplies | Envelopes | 6964.1767 | Any |
8 | Office Supplies | Fasteners | 949.5182 | Any |
9 | Office Supplies | Labels | 5546.2540 | Any |
10 | Office Supplies | Paper | 34053.5693 | Any |
11 | Office Supplies | Storage | 21278.8264 | Any |
13 | Technology | Accessories | 41936.6357 | Any |
14 | Technology | Copiers | 55617.8249 | Any |
15 | Technology | Machines | 3384.7569 | Any |
16 | Technology | Phones | 44515.7306 | Any |
import plotly.express as px
fig = px.sunburst(product_data , path = ['Sales','Category','Sub-Category'] , values = 'Profit')
fig.show()
fig = px. treemap(product_data , path = ['Sales','Category','Sub-Category'] , values = 'Profit')
fig.show()
data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | Year | Month | Day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7980 | 2014-01-03 | Darren Powers | Texas | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 16.448 | 2 | 5.5512 | 2014 | 1 | 3 |
739 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Labels | Avery 508 | 11.784 | 3 | 4.2717 | 2014 | 1 | 4 |
740 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 272.736 | 3 | -64.7748 | 2014 | 1 | 4 |
741 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 3.540 | 2 | -5.4870 | 2014 | 1 | 4 |
1759 | 2014-01-05 | Mick Brown | Pennsylvania | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 19.536 | 3 | 4.8840 | 2014 | 1 | 5 |
Checking the statistical summary of the column:
data.Quantity.describe()
count 9994.000000 mean 3.789574 std 2.225110 min 1.000000 25% 2.000000 50% 3.000000 75% 5.000000 max 14.000000 Name: Quantity, dtype: float64
px.histogram(data, x='Quantity')
px.box(data, y= 'Quantity', x= 'Category', color='Year')
Apply distribution analysis using boxplot to the Profit column:
data.Profit.describe()
count 9994.000000 mean 28.656896 std 234.260108 min -6599.978000 25% 1.728750 50% 8.666500 75% 29.364000 max 8399.976000 Name: Profit, dtype: float64
px.box(data, y='Profit')
Confirmatory Data Analysis is the process of using statistical summary and graphical representations to evaluate the validity of an assumption about the data at hand.
Assumption 1 - Every summer technology products have the highest sale quantity compared to other product categories
seasons = {
1 : "Winter",
2 : "Spring",
3 : "Summer",
4 : "Fall"
}
Creating Season column:
data['Season'] = data.Month.astype(int) % 12 //3+1
data.Season = data.Season.map(seasons)
data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | Year | Month | Day | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7980 | 2014-01-03 | Darren Powers | Texas | Office Supplies | Paper | Message Book, Wirebound, Four 5 1/2" X 4" Form... | 16.448 | 2 | 5.5512 | 2014 | 1 | 3 | Winter |
739 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Labels | Avery 508 | 11.784 | 3 | 4.2717 | 2014 | 1 | 4 | Winter |
740 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Storage | SAFCO Boltless Steel Shelving | 272.736 | 3 | -64.7748 | 2014 | 1 | 4 | Winter |
741 | 2014-01-04 | Phillina Ober | Illinois | Office Supplies | Binders | GBC Standard Plastic Binding Systems Combs | 3.540 | 2 | -5.4870 | 2014 | 1 | 4 | Winter |
1759 | 2014-01-05 | Mick Brown | Pennsylvania | Office Supplies | Art | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 19.536 | 3 | 4.8840 | 2014 | 1 | 5 | Winter |
Extracting data related to summer every year:
summer_data = data[data.Season == 'Summer']
summer_data.head()
Order Date | Customer Name | State | Category | Sub-Category | Product Name | Sales | Quantity | Profit | Year | Month | Day | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
245 | 2014-06-01 | Dianna Wilson | Minnesota | Office Supplies | Storage | Safco Steel Mobile File Cart | 166.72 | 2 | 41.6800 | 2014 | 6 | 1 | Summer |
307 | 2014-06-01 | Corey Roper | New Jersey | Office Supplies | Art | Boston Heavy-Duty Trimline Electric Pencil Sha... | 289.20 | 6 | 83.8680 | 2014 | 6 | 1 | Summer |
246 | 2014-06-01 | Dianna Wilson | Minnesota | Office Supplies | Paper | Adams Telephone Message Book w/Frequently-Call... | 47.88 | 6 | 23.9400 | 2014 | 6 | 1 | Summer |
247 | 2014-06-01 | Dianna Wilson | Minnesota | Office Supplies | Appliances | Honeywell Enviracaire Portable HEPA Air Cleane... | 1503.25 | 5 | 496.0725 | 2014 | 6 | 1 | Summer |
248 | 2014-06-01 | Dianna Wilson | Minnesota | Office Supplies | Paper | Xerox 205 | 25.92 | 4 | 12.4416 | 2014 | 6 | 1 | Summer |
Aggregating data based on Year, Category, and Season columns and summing up the Quantity:
summer_data_agg = summer_data.groupby(['Year', 'Category', 'Season']).agg({'Quantity':'sum'}).reset_index()
summer_data_agg.head()
Year | Category | Season | Quantity | |
---|---|---|---|---|
0 | 2014 | Furniture | Summer | 343 |
1 | 2014 | Office Supplies | Summer | 1031 |
2 | 2014 | Technology | Summer | 306 |
3 | 2015 | Furniture | Summer | 386 |
4 | 2015 | Office Supplies | Summer | 959 |
Visualize the result using a grouped bar chart:
px.bar(summer_data_agg, x=summer_data_agg.Year.astype('str'),
y='Quantity',
color='Category',
barmode='group')