StudentAlcohol
- Conclusion: the results are largely what one would expect.
- Boys who live far from school and who often go out with friends are more likely to drink.
- Students who drink on weekdays are also likely to drink on weekends, and vice versa (unsurprisingly).
- Students with frequent absences also show some tendency to drink, but the relationship is not strong.
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
- school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- sex - student's sex (binary: 'F' - female or 'M' - male)
- age - student's age (numeric: from 15 to 22)
- address - student's home address type (binary: 'U' - urban or 'R' - rural)
- famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
- Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- guardian - student's guardian (nominal: 'mother', 'father' or 'other')
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
These grades are related to the course subject (Math or Portuguese):
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)
In [122]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
In [123]:
%matplotlib inline
In [124]:
df1 = pd.read_csv("data/student-mat.csv") # Math course (note: the raw UCI files are ';'-delimited; pass sep=';' if using those)
df2 = pd.read_csv("data/student-por.csv") # Portuguese course
In [125]:
df1['class'] = 'math'
df2['class'] = 'por'
In [126]:
df = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0; concat stacks both course datasets
In [127]:
df.T.iloc[:,1:5]
Out[127]:
Exploratory Analysis
In [128]:
print(df['class'].value_counts())
sns.catplot(x='class', kind='count', data=df)  # factorplot was renamed catplot in newer seaborn
Out[128]:
In [129]:
print(df['sex'].value_counts())
sns.catplot(x='sex', kind='count', data=df)
Out[129]:
In [130]:
print(df['age'].value_counts())
sns.catplot(x='age', kind='count', data=df)
Out[130]:
In [131]:
print(df['school'].value_counts())
sns.catplot(x='school', kind='count', data=df)
Out[131]:
- There should be a way to arrange these in a single grid, like gridExtra in R; need to look into it (a possible approach is sketched after the cell below).
In [132]:
sns.catplot(x='famsize', kind='count', data=df)
sns.catplot(x='Pstatus', kind='count', data=df)
sns.catplot(x='Medu', kind='count', data=df)
sns.catplot(x='Fedu', kind='count', data=df)
sns.catplot(x='Mjob', kind='count', data=df)
sns.catplot(x='Fjob', kind='count', data=df)
sns.catplot(x='reason', kind='count', data=df)
sns.catplot(x='guardian', kind='count', data=df)
sns.catplot(x='traveltime', kind='count', data=df)
sns.catplot(x='studytime', kind='count', data=df)
sns.catplot(x='failures', kind='count', data=df)
Out[132]:
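One way to get a gridExtra-style layout (a minimal sketch, not part of the original notebook; it assumes the combined `df` built above) is to draw `sns.countplot` onto a grid of matplotlib axes:
# Sketch: arrange several count plots on one figure, similar to R's gridExtra
cols = ['famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob',
        'reason', 'guardian', 'traveltime', 'studytime', 'failures']
fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.flat, cols):
    sns.countplot(x=col, data=df, ax=ax)
    ax.set_title(col)
for ax in axes.flat[len(cols):]:  # hide any unused axes in the grid
    ax.set_visible(False)
plt.tight_layout()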
Finding Correlations with Alcohol
- Split the variables into continuous and categorical, and convert the categorical ones into dummy (one-hot) variables.
- Correlations cannot be computed on non-numeric columns.
In [133]:
columns = df.columns
discrete = []
continuous = []
for i in columns:
    if df[i].dtype == 'object':
        discrete.append(i)
    else:
        continuous.append(i)
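As an aside, the same split can be done in one call with pandas' `select_dtypes` (equivalent to the loop above):
# Equivalent split using select_dtypes
discrete = df.select_dtypes(include='object').columns.tolist()
continuous = df.select_dtypes(exclude='object').columns.tolist()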
In [134]:
print(discrete)
In [135]:
print(continuous)
In [136]:
dummy = pd.get_dummies(df[discrete])
In [137]:
dummy.head(3)
Out[137]:
- Combine the continuous columns and the dummy variables.
In [138]:
X = pd.concat([df[continuous], dummy], axis=1)
In [139]:
X.head()
Out[139]:
In [90]:
corr = X.corr()
In [91]:
corr
Out[91]:
Alcohol consumption
- Dalc: workday alcohol consumption
- Walc: weekend alcohol consumption
In [36]:
pd.DataFrame({'Walc':corr['Walc'], 'Dalc':corr['Dalc']}).T
Out[36]:
In [38]:
Rel_Dalc = corr['Dalc']
Rel_Walc = corr['Walc']
- Keep only variables whose correlation with alcohol exceeds 0.1 in absolute value; the absolute value is used so that variables associated with drinking less are also captured.
In [39]:
Rel_Dalc[abs(Rel_Dalc)>0.1]
Out[39]:
In [40]:
Rel_Walc[abs(Rel_Walc)>0.1]
Out[40]:
- Extract the column names (the index of each correlation series).
In [41]:
Rel_Col_Dal = Rel_Dalc[abs(Rel_Dalc)>0.1].index.tolist()
Rel_Col_Wal = Rel_Walc[abs(Rel_Walc)>0.1].index.tolist()
In [42]:
Dal_df = X[Rel_Col_Dal]
Wal_df = X[Rel_Col_Wal]
In [43]:
Dal_df.head()
Out[43]:
In [44]:
Wal_df.head()
Out[44]:
In [46]:
Dal_corr = Dal_df.corr()
Wal_corr = Wal_df.corr()
Conclusion
- Male students show higher alcohol consumption than female students.
- Unsurprisingly, students who go out with friends more often are more exposed to alcohol and drink more.
- Students with more free time show higher consumption for the same reason.
- Workday and weekend alcohol consumption are, as expected, the most strongly correlated pair.
- Frequent absences also show a slight correlation, but it is not conclusive.
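As a quick check (a minimal sketch using the `Rel_Dalc` / `Rel_Walc` series computed above), the individual coefficients behind these statements can be printed directly:
# Correlation of Dalc and Walc with a few of the variables mentioned above
for col in ['sex_M', 'goout', 'freetime', 'absences', 'Dalc', 'Walc']:
    if col in Rel_Dalc.index:
        print(col, round(Rel_Dalc[col], 3), round(Rel_Walc[col], 3))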
In [54]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(Dal_corr,
            xticklabels=Dal_corr.columns.values,
            yticklabels=Dal_corr.columns.values,
            annot=True, linewidths=.5, ax=ax)
Out[54]:
In [55]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(Wal_corr,
            xticklabels=Wal_corr.columns.values,
            yticklabels=Wal_corr.columns.values,
            annot=True, linewidths=.5, ax=ax)
Out[55]:
In [57]:
sns.boxplot(x='goout',y='Walc',data=df)
Out[57]:
In [58]:
sns.boxplot(x='goout',y='Dalc',data=df)
Out[58]:
- Female students' alcohol consumption is far lower than that of male students.
In [59]:
sns.catplot(x='Walc', kind='count', hue='sex', data=df)
Out[59]:
Machine Learning Prediction
In [140]:
X[['Walc','Dalc']].head()
Out[140]:
In [141]:
X['Alcohol'] = X['Walc'] + X['Dalc']
In [142]:
X['Alcohol'].head()
Out[142]:
In [143]:
ml_df = X.copy()
In [180]:
ml_df['Alcohol'].value_counts()
Out[180]:
In [144]:
from sklearn.utils import shuffle
In [145]:
ml_df = shuffle(ml_df)
In [146]:
ml_df = ml_df.reset_index()
In [147]:
del ml_df['index']
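Note: the same effect as the two cells above can be had in one step by dropping the old index when resetting:
# Equivalent to reset_index() followed by deleting the 'index' column
ml_df = ml_df.reset_index(drop=True)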
In [148]:
ml_df_columns = ml_df.columns.difference(['Walc','Dalc','Alcohol'])
In [149]:
ml_df_columns
Out[149]:
In [206]:
X = ml_df[ml_df_columns]
In [207]:
y = ml_df['Alcohol']
In [208]:
print(ml_df.columns)
ml_df.head()
Out[208]:
In [209]:
y[:5]
Out[209]:
Train/test split
In [158]:
from sklearn.model_selection import train_test_split
In [210]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [160]:
from sklearn.linear_model import LinearRegression
In [161]:
lm = LinearRegression()
In [162]:
X_train.head()
Out[162]:
In [163]:
lm.fit(X_train, y_train)
Out[163]:
In [164]:
lm.intercept_ # intercept
Out[164]:
In [165]:
lm.coef_ # coefficients (slopes)
Out[165]:
In [166]:
from sklearn.metrics import mean_squared_error
In [167]:
mean_squared_error(y_train,lm.predict(X_train))
Out[167]:
In [168]:
mean_squared_error(y_test,lm.predict(X_test))
Out[168]:
In [172]:
compare_alcohol = pd.DataFrame({'prediction':lm.predict(X_test),'real_value':y_test})
In [175]:
compare_alcohol['prediction'] = round(compare_alcohol['prediction'])
In [177]:
compare_alcohol['diff'] = compare_alcohol['prediction'] - compare_alcohol['real_value']
In [179]:
compare_alcohol['diff'].value_counts()
Out[179]:
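From the distribution above, the share of exactly matched (rounded) regression predictions can be read off directly; a small convenience line, not in the original:
# Fraction of test rows where the rounded prediction equals the true Alcohol score
(compare_alcohol['diff'] == 0).mean()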
Solving it as a classification problem
In [181]:
from sklearn.ensemble import GradientBoostingClassifier
In [184]:
from sklearn import metrics
In [211]:
gb = GradientBoostingClassifier(n_estimators=3000)
In [212]:
gb.fit(X_train,y_train)
Out[212]:
In [188]:
def getResult(y_test, y_pred):
    print(metrics.confusion_matrix(y_test, y_pred))
    print('accuracy:', metrics.accuracy_score(y_test, y_pred))
In [190]:
gb.predict(X_test)
Out[190]:
In [191]:
getResult(y_test, gb.predict(X_test))  # pass true labels first, matching the function signature