TPSJAN22-01 EDA transcription

NowChan 2022. 1. 12. 19:03

dateutil: date and time utilities (used here to look up the date of Easter)

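matplotlib.ticker: module for setting the positions and formats of axis ticks

MaxNLocator: places ticks at "nice" positions, using at most N intervals.
FormatStrFormatter: formats tick labels with a format string.
PercentFormatter: formats tick labels as percentages.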
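
As a quick illustration of the three classes above, here is a minimal sketch. The bar values are made up, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter

fig, ax = plt.subplots()
ax.bar([2015, 2016, 2017, 2018], [0.10, 0.12, 0.15, 0.19])  # made-up values

# Attach a locator / formatter to an axis
ax.xaxis.set_major_locator(MaxNLocator(integer=True))                 # ticks only on whole years
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))  # 0.15 is shown as 15%

# The same classes can also be called directly
year_ticks = MaxNLocator(integer=True).tick_values(2015, 2018)
one_decimal = FormatStrFormatter("%.1f")(2.0)
as_percent = PercentFormatter(xmax=1.0, decimals=0).format_pct(0.15, 1.0)
plt.close(fig)
```

In the plots further down, only MaxNLocator(integer=True) is actually used, to keep year labels from showing up as fractions.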

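sklearn.compose.TransformedTargetRegressor: in regression, the target is often transformed to even out its distribution, and predictions then have to be inverse-transformed back to the original scale. This meta-estimator performs both steps.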
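
To make the description above concrete, the following sketch compares the wrapper with doing the log transform and the inverse by hand. The target values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

X = np.arange(2015, 2019).reshape(-1, 1)
y = np.array([100.0, 112.0, 121.0, 139.0])  # made-up positive targets

# Wrapper: fits on log(y), returns predictions mapped back through exp
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
wrapped.fit(X, y)
pred_wrapped = wrapped.predict(X)

# The same two steps written out manually
manual = LinearRegression().fit(X, np.log(y))
pred_manual = np.exp(manual.predict(X))
```

Both prediction arrays should agree up to floating-point error, which is all the wrapper is doing here.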

DataFrame.unstack(level): reshapes the DataFrame so that the given index level becomes the columns
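
A small sketch of unstack on a toy table (the store names A and B and all numbers are invented for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "product": ["Hat", "Mug", "Hat", "Mug"],
    "num_sold": [10, 4, 20, 8],
})

# Mean per (product, store), indexed by a two-level MultiIndex
means = toy.groupby(["product", "store"]).num_sold.mean()

# Move the 'store' level from the index into the columns
wide = means.unstack(level="store")
wide["B:A"] = wide["B"] / wide["A"]  # ratio column, one row per product
```

Once the stores sit side by side as columns, ratios between them become a simple column division, which is how the store and product comparisons below are built.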

ndarray.squeeze(): removes axes of length one from an ndarray.

>>> x = np.array([[[0], [1], [2]]])
>>> x.shape
(1, 3, 1)
>>> np.squeeze(x).shape
(3,)
>>> np.squeeze(x, axis=0).shape
(3, 1)
>>> np.squeeze(x, axis=1).shape
Traceback (most recent call last):
...
ValueError: cannot select an axis to squeeze out which has size not equal to one
>>> np.squeeze(x, axis=2).shape
(1, 3)
>>> x = np.array([[1234]])
>>> x.shape
(1, 1)
>>> np.squeeze(x)
array(1234)  # 0d array
>>> np.squeeze(x).shape
()
>>> np.squeeze(x)[()]
1234

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls below
import dateutil.easter as easter
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

train_df = pd.read_csv('/content/drive/MyDrive/military/tabular/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/military/tabular/test.csv')

for df in [train_df, test_df]:
  df['date'] = pd.to_datetime(df.date)
  df.set_index('date', inplace=True, drop=False)
train_df  # Result 1

print(train_df.groupby(['country', 'store', 'product']).date.count())
print('First day:', train_df.date.min(), '   Last day:', train_df.date.max())
print('Number of days in four years:', 365*4+1)
print(18*1461, train_df.shape, train_df.date.isnull().sum())

'''
country  store       product       
Finland  KaggleMart  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
         KaggleRama  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
Norway   KaggleMart  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
         KaggleRama  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
Sweden   KaggleMart  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
         KaggleRama  Kaggle Hat        1461
                     Kaggle Mug        1461
                     Kaggle Sticker    1461
Name: date, dtype: int64
First day: 2015-01-01 00:00:00    Last day: 2018-12-31 00:00:00
Number of days in four years: 1461
26298 (26298, 6) 0
'''

train_df.groupby(['country', 'store', 'product']).num_sold.agg(['min', 'max', 'mean'])
# Result 2

print(test_df.date.min(), test_df.date.max())
# 2019-01-01 00:00:00 2019-12-31 00:00:00

kk = train_df.groupby(['country', 'store', 'product']).num_sold.mean().unstack(level='store')
kk['KaggleRama:KaggleMart'] = kk.KaggleRama/kk.KaggleMart
kk  # Result 3

The KaggleRama store sells about 1.74 times as much as KaggleMart.

kk = train_df.groupby(['country', 'store',
                       'product', train_df.date.dt.year]).num_sold.mean().unstack(level='product')
kk['Mugs/Sticker'] = kk['Kaggle Mug'] / kk['Kaggle Sticker']
kk['Hats/Sticker'] = kk['Kaggle Hat'] / kk['Kaggle Sticker']
kk  # Result 4

Mugs and Hats also sell at a constant ratio relative to Stickers under every condition: Sticker : Mug : Hat is about 1 : 1.97 : 3.5.

kk = train_df.groupby(['product',
                       train_df.date.dt.month]).num_sold.mean().unstack(level='product')
kk['Mugs/Sticker'] = kk['Kaggle Mug'] / kk['Kaggle Sticker']
kk['Hats/Sticker'] = kk['Kaggle Hat'] / kk['Kaggle Sticker']
kk  # Result 5

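Broken down by month, however, the ratios are no longer constant. In other words, some other factor affects sales differently from month to month.

Next, let's check the distribution of sales for each product, grouped by country and store. Column 1: Hat, column 2: Mug, column 3: Sticker.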

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  ax.hist(df.num_sold, bins=50, color='pink')
  ax.set_title(combi)
plt.suptitle('Histograms of num_sold', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()  # Result 6

A few values stand far above the rest; holidays are the likely cause. Why does Hat have a wider range of sales than the other products?

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  ax.plot(df.num_sold)
  ax.set_title(combi)
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for 2015-2018', y=1.03)
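plt.show()  # Result 7

For Hat, sales are higher in the first half of the year (January to July) than in the second half, and there is always a huge spike during the holidays. These spikes are the high values seen earlier in the sales histograms.

Next, let's plot the December sales for each day of the month, averaged over 2015-2018.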

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  ax.bar(range(1, 32),
         df.num_sold[df.date.dt.month==12].groupby(df.date.dt.day).mean(),
         color=['b']*25 + ['orange']*6 )
  ax.set_title(combi)
  ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for December', y=1.03)
plt.show()  # Result 8

Sales clearly increase at the end of the month (December 26-31, drawn in orange).

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  resampled = df.resample('MS').num_sold.sum()
  ax.bar(range(len(resampled)), resampled)
  ax.set_title(combi)
  ax.set_ylim(resampled.min(), resampled.max())
plt.suptitle('Monthly sales for 2015-2018', y=1.03)
plt.tight_layout(h_pad=3.0)
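plt.show()  # Result 9

DataFrame.resample can be used when the index is a datetime index. 'MS' (month start) groups the rows by calendar month and labels each group with the first day of that month, so .sum() gives the total sales for the month. The x values 0-47 correspond to the months from January 2015 to December 2018, and the y axis shows sales.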
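
A minimal sketch of what resample('MS') returns, using a toy frame with one sale per day (invented data, same column name as the competition data):

```python
import pandas as pd

# One sale per day for the first three months of 2015
days = pd.date_range("2015-01-01", "2015-03-31", freq="D")
toy = pd.DataFrame({"num_sold": 1}, index=days)

# Group by calendar month; each group is labeled with the month's first day
monthly = toy.resample("MS")["num_sold"].sum()
labels = list(monthly.index.strftime("%Y-%m-%d"))
```

Here monthly holds the number of days in each month (31, 28, 31), indexed by 2015-01-01, 2015-02-01 and 2015-03-01, which shows that the label is the start of the bin while the value covers the whole month.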

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  # Sum only num_sold; summing the date and string columns fails in recent pandas
  monthly = df.resample('MS').num_sold.sum()
  resampled = monthly.groupby(monthly.index.month).mean()
  ax.bar(range(1, 13), resampled)
  ax.set_xticks(ticks=range(1, 13))
  ax.set_title(combi)
  ax.set_ylim(resampled.min(), resampled.max())
plt.suptitle('Monthly sales for 2015-2018', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()  # Result 10

Sales of every product are high in December. Hat sells more in the first half of the year, Mug sells more from October to January, and Sticker seems to sell slightly more in May.

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  resampled = df.resample('AS').num_sold.sum()  # yearly totals of num_sold only
  ax.bar(range(2015, 2019), resampled)
  ax.set_title(combi)
  ax.xaxis.set_major_locator(MaxNLocator(integer=True)) # only integer labels
  ax.set_ylim(0, resampled.max())
plt.suptitle('Growth of yearly sales for 2015-2018', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()  # Result 11

Sales grow steadily from year to year. A discussion thread suggests that this growth is related to each country's actual GDP.

plt.figure(figsize=(12, 90))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'product', 'store'])):
  ax = plt.subplot(18, 1, i+1, ymargin=0.5)

  resampled = df[(df.date.dt.month<12) | (df.date.dt.day<25)].resample('AS').num_sold.sum()
  resampled /= resampled.iloc[0]
  resampled_end_of_year = df[(df.date.dt.month==12) & (df.date.dt.day>=25)].resample('AS').num_sold.sum()
  resampled_end_of_year /= resampled_end_of_year.iloc[0]
  ax.bar(range(2015, 2019), resampled, color='brown')
  ax.bar(range(2015, 2019), resampled_end_of_year, color='orange', width=0.4)

  X = np.arange(2015, 2019).reshape(-1, 1)
  lr = TransformedTargetRegressor(LinearRegression(), func=np.log, inverse_func=np.exp)
  lr.fit(X, resampled)
  ax.plot(range(2015, 2019), lr.predict(X), color='brown', \
          label=f'whole year: {lr.predict([[2016]]).squeeze() -1:.1%}')
  lr.fit(X, resampled_end_of_year)
  ax.plot(range(2015, 2019), lr.predict(X), color='orange', \
          label=f'end of year: {lr.predict([[2016]]).squeeze() -1:.1%}')

  ax.legend()
  ax.set_title(f'Yearly sales for {combi}')
  ax.xaxis.set_major_locator(MaxNLocator(integer=True)) # only integer labels
plt.tight_layout(h_pad=3.0)
plt.show()  # Result 12

The end-of-year growth rate is always higher than the growth rate for the rest of the year. So the end of the year (December 25-31) should be forecast with higher sales.
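
The legend labels above read the growth rate off as the fitted 2016 value minus 1, which works because each series is normalized to 1 in 2015. As a hedged sketch on synthetic data (exactly 10% growth per year, not the competition data), the same rate can also be taken from the slope of the log-linear fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

years = np.arange(2015, 2019).reshape(-1, 1)
sales = 1.10 ** (years.ravel() - 2015)  # synthetic: 2015 = 1, then +10% per year

lr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
lr.fit(years, sales)

# As in the plot labels: fitted value for 2016, relative to 2015 = 1
rate_from_prediction = float(lr.predict([[2016]]).squeeze() - 1)

# Alternative: log(sales) = a + b * year, so yearly growth is exp(b) - 1
slope = np.ravel(lr.regressor_.coef_)[0]
rate_from_slope = float(np.exp(slope) - 1)
```

On real data the two numbers can differ slightly, because the fitted curve need not pass exactly through 1 in 2015; the slope-based rate does not depend on that.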

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_df.groupby(['country', 'store', 'product'])):
  ax = plt.subplot(6, 3, i+1, ymargin=0.5)
  resampled = df.groupby(df.index.dayofweek).num_sold.mean()  # 0 = Monday ... 6 = Sunday
  ax.bar(range(7), resampled,
         color=['b']*4 + ['g'] + ['orange']*2)
  ax.set_title(combi)
  ax.set_xticks(ticks=range(7))
  ax.set_ylim(0, resampled.max())
plt.suptitle('Sales per day of the week', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()  # Result 13

Weekend sales are always higher than weekday sales, and Friday is slightly higher than the other weekdays.
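
One way to carry this observation into a model is to encode the day of the week as features. The sketch below is only an illustration; the column names is_friday and is_weekend are hypothetical and do not come from the original notebook.

```python
import pandas as pd

week = pd.date_range("2015-01-05", periods=7, freq="D")  # 2015-01-05 is a Monday

features = pd.DataFrame({
    "dayofweek": week.dayofweek,        # 0 = Monday ... 6 = Sunday
    "is_friday": week.dayofweek == 4,   # hypothetical flag for the Friday bump
    "is_weekend": week.dayofweek >= 5,  # hypothetical flag for Saturday and Sunday
}, index=week)
```

This mirrors the color coding of the bar chart: Monday to Thursday in blue, Friday in green, Saturday and Sunday in orange.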

plt.figure(figsize=(18, 12))
for i, (year, df) in enumerate(train_df.groupby(train_df.date.dt.year)):
  df = df.reset_index(drop=True)
  ax = plt.subplot(4, 1, i+1, ymargin=0.5)
  april = df.num_sold[df.date.dt.month==4].groupby(df.date.dt.day).mean()
  date_range = pd.date_range(start=f'{year}-04-01', end=f'{year}-04-30', freq='D')
  easter_date = easter.easter(year)
  # Compare dates with dates: d is a Timestamp, easter_date is a datetime.date
  color = ['r' if d.date() == easter_date else 'lightblue' if (d.date() - easter_date).days in range(6) \
           else 'b' if d.dayofweek < 5 else 'orange' for d in date_range]
  ax.bar(range(1, 31), april, color=color)
  ax.set_title(str(year))
  ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for April', y=1.03)
plt.show()  # Result 14

Sales surge in the week after Easter. Forecasts for that period should be raised accordingly.
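
A possible way to flag those days is sketched below. The helper days_after_easter is a hypothetical name introduced for this example, and the 1-to-5-day window simply copies the light blue bars from the plot.

```python
from datetime import date

import pandas as pd
import dateutil.easter as easter

def days_after_easter(ts):
    """Days between ts and Easter Sunday of the same year (negative before Easter)."""
    return (ts.date() - easter.easter(ts.year)).days

easter_2017 = easter.easter(2017)  # Easter Sunday as a datetime.date

# Days 1 to 5 after Easter, matching the light blue bars in the plot
dates = pd.date_range("2017-04-14", "2017-04-24", freq="D")
easter_week = [d for d in dates if 1 <= days_after_easter(d) <= 5]
```

Because Easter moves between March and April from year to year, a feature like this has to be computed per year rather than tied to fixed calendar days.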
