This post uses SageMaker to cover basic data visualization, analysis, and result evaluation, and then walks through the full workflow of data preprocessing, training, model creation, and deployment on Kaggle's Bike Rental data. It also tries plain sklearn first and compares it with how estimators are used in SageMaker's high-level interface.
These are personal study notes, so they are a bit disorganized...
1. Introduction
1-1. np, pd, plt (plt.hist/histogram, plt.scatter/scatter, plt.plot/line)
1-2. Data Preprocessing
vehicles = df['Vehicles'].fillna(0)
plt.plot(vehicles, ls='-.', alpha=0.8, label='mean')
fillna(0) / fillna(df['Car'].mean()) / df['Car'].interpolate() / fillna(method='ffill') / fillna(method='bfill')
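A minimal sketch of the fill strategies listed above, on a toy DataFrame (the 'Vehicles' column and its values are made up for illustration):
import numpy as np
import pandas as pd

# Toy data with missing values in the 'Vehicles' column
df = pd.DataFrame({'Vehicles': [10, np.nan, 14, np.nan, 20]})

filled_zero = df['Vehicles'].fillna(0)                       # replace NaN with 0
filled_mean = df['Vehicles'].fillna(df['Vehicles'].mean())   # replace NaN with the column mean
interpolated = df['Vehicles'].interpolate()                  # linear interpolation between neighbors
forward_fill = df['Vehicles'].fillna(method='ffill')         # carry the previous value forward
backward_fill = df['Vehicles'].fillna(method='bfill')        # carry the next value backward

print(pd.concat([df['Vehicles'], filled_zero, filled_mean, interpolated, forward_fill, backward_fill],
                axis=1, keys=['raw', 'zero', 'mean', 'interp', 'ffill', 'bfill']))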
1-3. Non-linear Functions
plt.plot(df['exponential']), plt.plot(df['log']), plt.plot(np.arange(-20,20,.1), df['sine']), etc.
2. Performance Evaluation
2-1. Linear Regression
Plots, RMSE, residual histograms (over/under prediction)
2-2. binary_classifier_performance
Confusion matrix
print(classification_report(df['Pass'], df[model.replace(' ','') + '_Prediction'], labels=[1,0], target_names=['Pass','Fail']))
2-3. binary_classifier_rawscore_evaluation(ROC & AUC)
roc_score = roc_auc_score(df['Pass'], df[model.replace(' ','') + '_Prediction'])
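The score above gives only the AUC number; to draw the curve itself you can use sklearn's roc_curve. A minimal sketch, assuming y_true holds the 0/1 labels (e.g. df['Pass']) and y_score the raw model scores:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Assumption: y_true = true 0/1 labels, y_score = raw model scores (not hard predictions)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label='ROC (AUC = {0:.3f})'.format(auc))
plt.plot([0, 1], [0, 1], ls='--', label='random guess')  # diagonal baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()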
Reference
2-4. multiclass_classifier_performance
Accuracy (overall hit rate, (TP+TN)/ALL), Precision (hit rate among predicted positives, TP/(TP+FP)), Recall (hit rate among actual positives, TP/(TP+FN)),
F1-score (harmonic mean of precision and recall), Support (number of true instances per class)
Micro avg: metrics computed globally over all samples / Macro avg: unweighted per-class mean / Weighted avg: per-class mean weighted by support
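A minimal multiclass sketch to see how the averages differ (the labels are made up):
from sklearn.metrics import classification_report

# Toy 3-class example: class 0 and class 2 have more support than class 1
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 2]

# macro avg = unweighted mean of the per-class scores,
# weighted avg = per-class scores weighted by support,
# micro/global averaging counts TP/FP/FN over all samples at once
print(classification_report(y_true, y_pred, digits=3))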
3. SageMaker Overview
3-1. DataFormat of SageMaker
Training Data Format
CSV, RecordIO, and algorithm-specific formats (LibSVM, JSON, Parquet) are stored in S3
Inference Format
CSV, JSON, RecordIO
import sagemaker.amazon.common as smac  # RecordIO helpers

def write_recordio_file(filename, x, y=None):
    with open(filename, 'wb') as f:
        smac.write_numpy_to_dense_tensor(f, x, y)

def read_recordio_file(filename, recordsToPrint=10):
    with open(filename, 'rb') as f:
        records = smac.read_records(f)
        for i, r in enumerate(records):
            if i >= recordsToPrint:
                break
            print("record: {}".format(i))
            print(r)
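A quick usage sketch for the two helpers above; the arrays are dummy data just to show the round trip:
import numpy as np

# Dummy feature matrix (100 x 5) and binary labels
x = np.random.rand(100, 5).astype('float32')
y = np.random.randint(0, 2, 100).astype('float32')

write_recordio_file('train.recordio', x, y)              # serialize to RecordIO protobuf
read_recordio_file('train.recordio', recordsToPrint=3)   # read back the first 3 records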
3-2. Mode
File mode and Pipe mode: File mode downloads the full dataset from S3 to the training instance before training starts, while Pipe mode streams the data directly from S3 during training.
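As a sketch of where the mode is selected, the Estimator takes an input_mode argument (the other arguments mirror the cloud-training section later in this post; not every built-in algorithm and data format supports Pipe mode, so this is illustration only):
# Sketch: same Estimator configuration as in the Bike Rental section below,
# but asking for Pipe mode instead of the default File mode
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode='Pipe',                    # default is 'File'
    output_path=s3_model_output_location,
    sagemaker_session=sess)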
4. XGBoost (Gradient Boosted Trees)
XGBoost: a tree-based gradient boosting algorithm (supports regression and classification)
4-1. Performance Evaluation (Quadratic Regression, XGBoost vs. Linear Regression)
- quadratic_xgboost_localmode.ipynb
- XGBoost
import pandas as pd
import xgboost as xgb

X_train = df_train.iloc[:,1:] # Features: 1st column onwards
y_train = df_train.iloc[:,0].ravel() # Target: 0th column
X_validation = df_validation.iloc[:,1:]
y_validation = df_validation.iloc[:,0].ravel()
# max_depth = 5,objective="reg:linear",num_round = 50
# XGBoost Training Parameter Reference:
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
regressor = xgb.XGBRegressor()
regressor
regressor.fit(X_train,y_train, eval_set = [(X_train, y_train), (X_validation, y_validation)])
[98] validation_0-rmse:1.49427 validation_1-rmse:3.71814
[99] validation_0-rmse:1.49156 validation_1-rmse:3.71637
# Get the Training RMSE and Evaluation RMSE
eval_result = regressor.evals_result()
eval_result
training_rounds = range(len(eval_result['validation_0']['rmse']))
print(training_rounds)
plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('Training Vs Validation Error')
plt.legend()
xgb.plot_importance(regressor)
plt.show()
Reference
- Validation Dataset: compare actual and predicted (the quadratic file loading step is missing from this excerpt... needs to be redone)
XGBoost!
#def quad_func (x):
# return 5*x**2 -23*x + 47
df = pd.read_csv('quadratic_all.csv')
plt.scatter(df.x,df.y,label='Target')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Simple Regression Dataset')
plt.show()
result = regressor.predict(X_validation)  # predictions on the validation set
plt.title('XGBoost - Validation Dataset')
plt.scatter(df_validation.x,df_validation.y,label='actual',marker='.')
plt.scatter(df_validation.x,result,label='predicted',marker='.')
plt.grid(True)
plt.legend()
plt.show()
# RMSE Metrics
from sklearn.metrics import mean_squared_error

print('XGBoost Algorithm Metrics')
mse = mean_squared_error(df_validation.y,result)
print(" Mean Squared Error: {0:.2f}".format(mse))
print(" Root Mean Square Error: {0:.2f}".format(mse**.5))
# Residual
# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = df_validation.y - result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('XGBoost Residual')
plt.axvline(color='r')
plt.show()
# Count number of values greater than zero and less than zero
value_counts = (residuals > 0).value_counts(sort=False)
print(' Under Estimation: {0}'.format(value_counts[True]))
print(' Over Estimation: {0}'.format(value_counts[False]))
# Plot for entire dataset
plt.scatter(df.x,df.y,label='Target')
plt.scatter(df.x,regressor.predict(df[['x']]) ,label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('XGBoost')
plt.show()
linear_regressor!
from sklearn.linear_model import LinearRegression

lin_regressor = LinearRegression()
lin_regressor.fit(X_train,y_train)
lin_regressor.coef_ #Gradient -> array([-21.4367])
lin_regressor.intercept_ # X: Zero Y Value -> 740.362
## y = -21.43 * x + 740.36
result = lin_regressor.predict(X_validation)  # predictions on the validation set
plt.title('LinearRegression - Validation Dataset')
plt.scatter(df_validation.x,df_validation.y,label='actual',marker='.')
plt.scatter(df_validation.x,result,label='predicted',marker='.')
plt.grid(True)
plt.legend()
plt.show()
# RMSE Metrics
print('Linear Regression Metrics')
mse = mean_squared_error(df_validation.y,result)
print(" Mean Squared Error: {0:.2f}".format(mse))
print(" Root Mean Square Error: {0:.2f}".format(mse**.5))
# Residual
# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = df_validation.y - result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('Linear Regression Residual')
plt.axvline(color='r')
plt.show()
# Count number of values greater than zero and less than zero
value_counts = (residuals > 0).value_counts(sort=False)
print(' Under Estimation: {0}'.format(value_counts[True]))
print(' Over Estimation: {0}'.format(value_counts[False]))
# Plot for entire dataset
plt.scatter(df.x,df.y,label='Target')
plt.scatter(df.x,lin_regressor.predict(df[['x']]) ,label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('LinearRegression')
plt.show()
Summary: XGBoost handles the non-linear case well, while Linear Regression underfits.
Note that XGBoost predictions have an upper/lower bound: as a tree model it cannot extrapolate beyond the target range it saw during training.
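A minimal sketch of that bound on toy 1-D data (not the quadratic dataset above): outside the training interval the prediction flattens near the boundary values.
import numpy as np
import xgboost as xgb

# Train only on x in [0, 10) with a simple linear target
X_toy = np.arange(0, 10, 0.1).reshape(-1, 1)
y_toy = 3 * X_toy.ravel() + 7

toy_model = xgb.XGBRegressor().fit(X_toy, y_toy)

# x=5 is interpolated well; x=-5 and x=20 stay clipped near the min/max targets
print(toy_model.predict(np.array([[-5.0], [5.0], [20.0]])))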
4-2. Steps
- Data Preparation/Exploration
df.corr()
df.corr()['count']
- Training on instance: biketrain_data_preparation_rev1/3.ipynb, xgboost_localmode_rev1/3.ipynb
The code gets updated quickly... it changed again after just 4 weeks...
# Example
# Converts to log1p(count)
# Print original count back using expm1
print('Test log and exp')
test_count = 100
print('original value', test_count)
x = np.log1p(test_count) # log (x+1)
print('log1p', x)
print('expm1', np.expm1(x)) # exp(x) - 1
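This transform is what later turns out to matter most for the Kaggle score: train on log1p(count) and convert predictions back with expm1. A minimal sketch, assuming df is the bike rental training DataFrame with a 'count' column:
# Assumption: df is the bike rental training DataFrame with a 'count' column.
# Training on log1p(count) keeps large counts from dominating the squared error
# and matches Kaggle's RMSLE metric more closely.
df['count'] = np.log1p(df['count'])

# ... write the train/validation CSVs and train as usual ...

# After predicting, bring the values back to the original scale:
# df['count_predicted'] = np.expm1(df['count_predicted'])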
regressor = xgb.XGBRegressor(max_depth=5,eta=0.1,subsample=0.7,num_round=150)
regressor.fit(X_train,y_train, eval_set = [(X_train, y_train), (X_validation, y_validation)])
eval_result = regressor.evals_result()
training_rounds = range(len(eval_result['validation_0']['rmse']))
plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('Training Vs Validation Error')
plt.legend()
xgb.plot_importance(regressor)
plt.show()
# Negative Values are predicted
df['count_predicted'].describe()
def adjust_count(x):
    if x < 0:
        return 0
    else:
        return x
df['count_predicted'] = df['count_predicted'].map(adjust_count)
df[df['count_predicted'] < 0]
plt.boxplot([df['count'],df['count_predicted']], labels=['actual','predicted'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('Target')
plt.grid(True)
# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = (df['count_predicted'] - df['count'])
plt.hist(residuals)
plt.grid(True)
plt.xlabel('(Predicted - Actual)')
plt.ylabel('Count')
plt.title('Residuals Distribution')
plt.axvline(color='g')
import sklearn.metrics as metrics
print("RMSE: {0}".format(metrics.mean_squared_error(df['count'],df['count_predicted'])**.5))
# Metric Use By Kaggle
def compute_rmsle(y_true, y_pred):
    if type(y_true) != np.ndarray:
        y_true = np.array(y_true)
    if type(y_pred) != np.ndarray:
        y_pred = np.array(y_pred)
    return np.average((np.log1p(y_pred) - np.log1p(y_true))**2)**.5
print("RMSLE: {0}".format(compute_rmsle(df['count'],df['count_predicted'])))
Predict on the Kaggle TEST file, clean up the resulting values, and upload them to Kaggle
df_test[df_test["count"] < 0]
df_test["count"] = df_test["count"].map(adjust_count)
df_test[['datetime','count']].to_csv('predicted_count.csv',index=False)
Oh... with the revised code the score dropped from 0.62 to 0.40.
The key point is applying log1p to the count target during data preparation.
- Training on SageMaker: xgboost_cloud_training_template.ipynb
[Upload Data to S3]
# Specify your bucket name
bucket_name = 'leedoing-ml-sagemaker'
training_folder = 'bikerental/training/'
validation_folder = 'bikerental/validation/'
test_folder = 'bikerental/test/'
s3_model_output_location = 's3://{0}/bikerental/model'.format(bucket_name)
s3_training_file_location = 's3://{0}/{1}'.format(bucket_name,training_folder)
s3_validation_file_location = 's3://{0}/{1}'.format(bucket_name,validation_folder)
s3_test_file_location = 's3://{0}/{1}'.format(bucket_name,test_folder)
def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

write_to_s3('bike_train.csv',
            bucket_name,
            training_folder + 'bike_train.csv')
write_to_s3('bike_validation.csv',
            bucket_name,
            validation_folder + 'bike_validation.csv')
write_to_s3('bike_test.csv',
            bucket_name,
            test_folder + 'bike_test.csv')
Each cleaned dataset is saved to S3 so SageMaker can use it.
In the past, we had to maintain the algorithm containers mapping:
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}
Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
In the past you had to define the containers via their ECR paths yourself, but that is no longer necessary.
[Training Algorithm Docker Image]
# Establish a session with AWS
sess = sagemaker.Session()
role = get_execution_role()
# Sagemaker API now maintains the algorithm container mapping for us
# Specify the region, algorithm and version
container = sagemaker.amazon.amazon_estimator.get_image_uri(
sess.boto_region_name,
"xgboost",
"latest")
print('Using SageMaker XGBoost container:\n{} ({})'.format(container, sess.boto_region_name))
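Since the post already notes how quickly this SDK changes: in SageMaker Python SDK v2 get_image_uri was replaced, and the equivalent lookup is image_uris.retrieve. A sketch (the version string is just an example of a managed XGBoost release):
# SageMaker Python SDK v2+ way to look up the managed container image
# (replaces sagemaker.amazon.amazon_estimator.get_image_uri)
container = sagemaker.image_uris.retrieve(
    framework='xgboost',
    region=sess.boto_region_name,
    version='1.2-1')   # example version string
print(container)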
[Build Model]
# Configure the training job
# Specify type and number of instances to use
# S3 location where final artifacts needs to be stored
# Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html
estimator = sagemaker.estimator.Estimator(
container,
role,
train_instance_count=1,
train_instance_type='ml.m4.xlarge',
output_path=s3_model_output_location,
sagemaker_session=sess,
base_job_name ='xgboost-bikerental-v1')
The Estimator from SageMaker's high-level interface.
# Specify hyper parameters that appropriate for the training algorithm
# XGBoost Training Parameter Reference:
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
# max_depth=5,eta=0.1,subsample=0.7,num_round=150
estimator.set_hyperparameters(max_depth=5,
objective="reg:linear",
eta=0.1,
num_round=150)
estimator.hyperparameters()
Configure the hyperparameters.
[Specify Training Data Location and Optionally, Validation Data Location]
# content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.s3_input(
s3_data=s3_training_file_location,
content_type='csv',
s3_data_type='S3Prefix')
validation_input_config = sagemaker.session.s3_input(
s3_data=s3_validation_file_location,
content_type='csv',
s3_data_type='S3Prefix'
)
data_channels = {'train': training_input_config, 'validation': validation_input_config}
print(training_input_config.config)
print(validation_input_config.config)
[Train the model]
# XGBoost supports "train", "validation" channels
# Reference: Supported channels by algorithm
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
estimator.fit(data_channels)
Model training takes about 3 minutes, though of course this depends on the data...
When it finishes you see the Completed message and can check the RMSE results in the Jupyter output, as below.
[Deploy Model]
# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html
predictor = estimator.deploy(initial_instance_count=1,
instance_type='ml.m4.xlarge',
endpoint_name = 'xgboost-bikerental-v1')
Deploy the model with estimator.deploy.
Deployment takes about 5 minutes.
[Predictions]
from sagemaker.predictor import csv_serializer, json_deserializer
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
predictor.predict([[3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3]])
Summary of using SageMaker with XGBoost
1. Upload Train and Validation file to S3
2. Specify Algorithm and Hyperparameters
3. Configure type of server and number of servers to use for Training
4. Create a real-time Endpoint for interactive use case
- Prediction using SageMaker: xgboost_cloud_prediction_template.ipynb
1. Invoke Endpoint for interactive use cases
2. Connect to an existing Endpoint
3. Endpoint Security
4. Multiple observations in a single round trip
import boto3
import re
from sagemaker import get_execution_role
import sagemaker
# Acquire a realtime endpoint
endpoint_name = 'xgboost-bikerental-v1'
predictor = sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name)
from sagemaker.predictor import csv_serializer, json_deserializer
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
df_all = pd.read_csv('bike_test.csv')
# Need to pass an array to the prediction
# can pass a numpy array or a list of values [[19,1],[20,1]]
arr_test = df_all[df_all.columns[1:]].values ## drop the datetime column and convert to a numpy array
type(arr_test)
arr_test.shape
arr_test[:5]
result = predictor.predict(arr_test[:2])
result
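The same invocation can also be made without the SageMaker SDK by calling the sagemaker-runtime API directly with boto3, which is handy from e.g. a Lambda function. A minimal sketch against the endpoint created above (the CSV body is the single observation used earlier):
import boto3

# Invoke the deployed endpoint through the low-level sagemaker-runtime API
runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName='xgboost-bikerental-v1',
    ContentType='text/csv',
    Body='3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3')

print(response['Body'].read().decode('utf-8'))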
Split the TEST array and predict in batches
# For large number of predictions, we can split the input data and
# Query the prediction service.
# array_split is convenient to specify how many splits are needed
predictions = []
for arr in np.array_split(arr_test,10):
    result = predictor.predict(arr)
    result = result.decode("utf-8")
    result = result.split(',')
    print(arr.shape)
    predictions += [float(r) for r in result]
np.expm1(predictions)
df_all['count'] = np.expm1(predictions) ## undo the log1p transform applied to count during training
df_all.head()
df_all[['datetime','count']].to_csv('predicted_count_cloud.csv', index=False)
Scaling based on Invocations
MAX RPS: 100
Safety Factor = 0.5
Target SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60
SageMakerVariantInvocationsPerInstance = 100 * 0.5 * 60 = 3,000
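That 3,000 is the value to register as the target for the SageMakerVariantInvocationsPerInstance metric. A sketch of wiring this up through the Application Auto Scaling API with boto3 (the variant name 'AllTraffic' is the SDK default, and the min/max capacities are example values):
import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/xgboost-bikerental-v1/variant/AllTraffic'

# Register the endpoint variant as a scalable target (example capacity limits)
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4)

# Target-tracking policy on invocations per instance per minute
autoscaling.put_scaling_policy(
    PolicyName='bikerental-invocations-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 3000.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'}})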
That wraps up the basics of data preprocessing, building regression models on the Kaggle data with sklearn, and using XGBoost on SageMaker.