
Amazon SageMaker & ML(sklearn)

This post uses SageMaker to cover basic data visualization, analysis, and model evaluation, then walks through the full workflow on Kaggle's Bike Rental data: preprocessing, training, model creation, and deployment. We first try plain sklearn and then compare it with the Estimator from SageMaker's high-level interface.

 

These are personal study notes, so they are a bit disorganized...

 

1. Introduction

1-1. np, pd, plt (plt.hist/histogram, plt.scatter/scatter, plt.plot/line)

plt.hist(df.petal_length)

 

plt.scatter(df[setosa].index, df[setosa].petal_length, label ='set')

 

plt.plot(df['Vehicles'], label='target')

1-2. Data Preprocessing

vehicles = df['Vehicles'].fillna(0)

plt.plot(vehicles, ls='-.', alpha=0.8, label='mean')

Options: fillna(0), fillna(df['Vehicles'].mean()), df['Vehicles'].interpolate(), fillna(method='ffill'), fillna(method='bfill')

df['Vehicles'].fillna(0)                      # replace missing values with 0

df['Vehicles'].fillna(df['Vehicles'].mean())  # replace with the column mean

df['Vehicles'].interpolate()                  # linear interpolation between neighbors

df['Vehicles'].fillna(method='ffill')         # forward-fill from the previous value

df['Vehicles'].fillna(method='bfill')         # back-fill from the next value

1-3. Non-linear Functions

plt.plot(df['exponential']), plt.plot(df['log']), plt.plot(np.arange(-20,20,.1), df['sine']), etc.

plt.plot(df['exponential'])

 

2. Performance Evaluation

2-1. Linear Regression

Plots, RMSE, Residual Histograms (Over/Under prediction)
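The notebook for this section is not reproduced here; a minimal self-contained sketch of the evaluation pattern (toy numbers, not the course data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.0, 23.0])

mse = mean_squared_error(y_true, y_pred)
print('RMSE: {0:.2f}'.format(mse ** .5))

# Residuals: positive = under-prediction, negative = over-prediction
residuals = y_true - y_pred
plt.hist(residuals)
plt.axvline(color='r')              # reference line at zero
plt.xlabel('Actual - Predicted')
plt.title('Residual Histogram')
plt.show()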

 

2-2. binary_classifier_performance

Confusion matrix

classification_report(df['Pass'], df[model.replace(' ','') + '_Prediction'], labels=[1,0], target_names=['Pass','Fail'])
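classification_report summarizes precision/recall per class; the confusion matrix itself can be printed with sklearn too. A small self-contained sketch (toy labels instead of the df columns above):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = Pass, 0 = Fail
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# labels=[1, 0] puts the positive class first, matching the report above
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)   # [[TP, FN],
            #  [FP, TN]]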

https://www.youtube.com/watch?v=nMAtFhamoRY

2-3. binary_classifier_rawscore_evaluation(ROC & AUC)

roc_score = roc_auc_score(df['Pass'], df[model.replace(' ','') + '_Prediction'])
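ROC and AUC are computed from the model's raw score rather than a thresholded label; a self-contained sketch of plotting the ROC curve together with the AUC (toy labels and scores):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.65, 0.6, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label='AUC = {0:.3f}'.format(auc))
plt.plot([0, 1], [0, 1], ls='--', label='random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()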

https://www.youtube.com/watch?v=4jRBRDbJemM

Reference: Data Science School (datascienceschool.net)

 

2-4. multiclass_classifier_performance

Accuracy (overall hit rate, (TP+TN)/ALL), Precision (hit rate among predicted positives, TP/(TP+FP)), Recall (hit rate among actual positives, TP/(TP+FN)),

F1-score (harmonic mean of precision and recall), Support (number of true samples per class)

Micro avg: pools TP/FP/FN over all classes before computing the metric / Macro avg: plain average over classes / Weighted avg: average weighted by each class's support

classification_report( df['NumericClass'], df[model.replace(' ','') + '_Prediction'], labels=labels, target_names=classes)
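A tiny worked example of how the three averages differ, using sklearn's precision_score:

from sklearn.metrics import precision_score

# Toy labels chosen so that the three averages come out differently
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

# micro: pool TP/FP over all classes -> 6/8 = 0.75
print(precision_score(y_true, y_pred, average='micro'))
# macro: plain mean of the per-class precisions (1.0, 0.5, 0.667) -> ~0.72
print(precision_score(y_true, y_pred, average='macro'))
# weighted: per-class precision weighted by support (4, 2, 2) -> ~0.79
print(precision_score(y_true, y_pred, average='weighted'))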

3. SageMaker Overview

3-1. DataFormat of SageMaker

Training Data Format

CSV, RecordIO, and algorithm-specific formats (LibSVM, JSON, Parquet) are stored in S3

 

Inference Format

CSV, JSON, RecordIO

import sagemaker.amazon.common as smac # RecordIO helpers

def write_recordio_file (filename, x, y=None):
    with open(filename, 'wb') as f:
        smac.write_numpy_to_dense_tensor(f, x, y)
        
def read_recordio_file (filename, recordsToPrint = 10):
    with open(filename, 'rb') as f:
        record = smac.read_records(f)
        for i, r in enumerate(record):
            if i >= recordsToPrint:
                break
            print ("record: {}".format(i))
            print(r)

3-2. Mode

File, Pipe Mode
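File mode downloads the entire dataset from S3 to the training instance before the job starts; Pipe mode streams the data, so training starts sooner and needs less local disk. A sketch of how the mode would be selected on the Estimator (container, role, and s3_model_output_location are the variables defined later in this post):

import sagemaker

sess = sagemaker.Session()
estimator = sagemaker.estimator.Estimator(
    container,                           # algorithm image URI (defined later in this post)
    role,                                # execution role (defined later in this post)
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode='Pipe',                   # 'File' is the default
    output_path=s3_model_output_location,
    sagemaker_session=sess)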

 

4. XGBoost (Gradient Boosted Trees)

XGBoost: a tree-based gradient boosting algorithm (supports both regression and classification).

 

4-1. Performance Evaluation (Quadratic Regression, XGBoost vs Linear Regression)

- quadratic_xgboost_localmode.ipynb

- XGBoost

X_train = df_train.iloc[:,1:] # Features: 1st column onwards 
y_train = df_train.iloc[:,0].ravel() # Target: 0th column

X_validation = df_validation.iloc[:,1:]
y_validation = df_validation.iloc[:,0].ravel()

X_train
y_train
X_validation
y_validation

# max_depth = 5,objective="reg:linear",num_round = 50
# XGBoost Training Parameter Reference: 
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
regressor = xgb.XGBRegressor()
regressor
regressor.fit(X_train,y_train, eval_set = [(X_train, y_train), (X_validation, y_validation)])

[98] validation_0-rmse:1.49427 validation_1-rmse:3.71814

[99] validation_0-rmse:1.49156 validation_1-rmse:3.71637

 

# Get the Training RMSE and Evaluation RMSE
eval_result = regressor.evals_result()
eval_result

 

regressor.evals_result()

 

eval_result = regressor.evals_result()
training_rounds = range(len(eval_result['validation_0']['rmse']))
print(training_rounds)

plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('Training Vs Validation Error')
plt.legend()

regressor.evals_result()

 

xgb.plot_importance(regressor)
plt.show()

xgb.plot_importance(regressor)

Reference: Stats Overflow, A Data Scientist's Tech Blog (aldente0630.github.io)

 

- Validation Dataset: Compare Actual and Predicted (the import of the quadratic file is missing here... this needs to be redone)

XGBoost!

#def quad_func (x):
#    return 5*x**2 -23*x + 47

df = pd.read_csv('quadratic_all.csv')
plt.scatter(df.x,df.y,label='Target')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Simple Regression Dataset')
plt.show()

Quadratic(5*x**2 - 23*x +47)
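The commented-out quad_func above suggests how the dataset was generated; a sketch of an assumed generation (seed, x range, and noise level are guesses, not the course's actual values):

import numpy as np
import pandas as pd

def quad_func(x):
    return 5 * x**2 - 23 * x + 47

np.random.seed(5)
x = np.arange(-20, 20, 0.1)
y = quad_func(x) + np.random.randn(len(x)) * 30   # quadratic signal plus noise
# recreates a stand-in for quadratic_all.csv
pd.DataFrame({'x': x, 'y': y}).to_csv('quadratic_all.csv', index=False)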

# Predictions on the validation set ('result' is not defined above; assuming it
# comes from the fitted XGBoost regressor)
result = regressor.predict(X_validation)

plt.title('XGBoost - Validation Dataset')
plt.scatter(df_validation.x,df_validation.y,label='actual',marker='.')
plt.scatter(df_validation.x,result,label='predicted',marker='.')
plt.grid(True)
plt.legend()
plt.show()

# RMSE Metrics
from sklearn.metrics import mean_squared_error

print('XGBoost Algorithm Metrics')
mse = mean_squared_error(df_validation.y,result)
print(" Mean Squared Error: {0:.2f}".format(mse))
print(" Root Mean Square Error: {0:.2f}".format(mse**.5))

XGboost RMSE

# Residual
# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = df_validation.y - result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('XGBoost Residual')
plt.axvline(color='r')
plt.show()

# Count number of values greater than zero and less than zero
value_counts = (residuals > 0).value_counts(sort=False)

print(' Under Estimation: {0}'.format(value_counts[True]))
print(' Over  Estimation: {0}'.format(value_counts[False]))

XGboost U/O

# Plot for entire dataset
plt.scatter(df.x,df.y,label='Target')
plt.scatter(df.x,regressor.predict(df[['x']]) ,label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('XGBoost')
plt.show()

XGBoost Predicted

 

linear_regressor!

from sklearn.linear_model import LinearRegression

lin_regressor = LinearRegression()
lin_regressor.fit(X_train,y_train)
lin_regressor.coef_       # slope (gradient) -> array([-21.4367])
lin_regressor.intercept_  # y value at x = 0 -> 740.362

## Fitted line: -21.43 * x + 740.36
# Validation predictions from the linear model ('result' is assumed to come from it)
result = lin_regressor.predict(X_validation)

plt.title('LinearRegression - Validation Dataset')
plt.scatter(df_validation.x,df_validation.y,label='actual',marker='.')
plt.scatter(df_validation.x,result,label='predicted',marker='.')
plt.grid(True)
plt.legend()
plt.show()

# RMSE Metrics
print('Linear Regression Metrics')
mse = mean_squared_error(df_validation.y,result)
print(" Mean Squared Error: {0:.2f}".format(mse))
print(" Root Mean Square Error: {0:.2f}".format(mse**.5))

LR RMSE

# Residual
# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = df_validation.y - result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('Linear Regression Residual')
plt.axvline(color='r')
plt.show()

# Count number of values greater than zero and less than zero
value_counts = (residuals > 0).value_counts(sort=False)

print(' Under Estimation: {0}'.format(value_counts[True]))
print(' Over  Estimation: {0}'.format(value_counts[False]))

LR U/O

# Plot for entire dataset
plt.scatter(df.x,df.y,label='Target')
plt.scatter(df.x,lin_regressor.predict(df[['x']]) ,label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('LinearRegression')
plt.show()

LR Predicted

Summary: XGBoost handles the non-linear data well, while Linear Regression underfits it.

XGBoost predictions have an upper and lower bound (tree models cannot extrapolate beyond the target range seen during training).

 

 

4-2. Steps

- Data Preparation/Exploration

df.corr()
df.corr()['count']

df.corr()
Pearson correlation coefficient
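df.corr() reports the Pearson correlation coefficient for every pair of columns, r = cov(x, y) / (std(x) * std(y)); a quick self-contained sanity check of that definition:

import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# population covariance divided by population standard deviations
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(r_manual, np.corrcoef(x, y)[0, 1])   # the two values match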

- Training on instance: biketrain_data_preparation_rev1/3.ipynb, xgboost_localmode_rev1/3.ipynb

The code gets updated fast... it changed again after only four weeks...

# Example
# Converts to log1p(count)
# Print original count back using expm1
print('Test log and exp')
test_count = 100
print('original value', test_count)
x = np.log1p(test_count) # log (x+1)
print('log1p', x)
print('expm1', np.expm1(x)) # exp(x) - 1

Hmm...

# max_depth=5, eta=0.1, subsample=0.7, num_round=150 in the native/SageMaker naming;
# the sklearn wrapper calls these learning_rate and n_estimators
regressor = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, subsample=0.7, n_estimators=150)

 

 

Reference: dmlc/xgboost - Scalable, Portable and Distributed Gradient Boosting library (github.com/dmlc/xgboost)

 

Reference: XGBoost Hyperparameters - Amazon SageMaker (docs.aws.amazon.com)

training_rounds = range(len(eval_result['validation_0']['rmse']))
plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('Training Vs Validation Error')
plt.legend()

RMSE evaluation

 

xgb.plot_importance(regressor)
plt.show()

xgb.plot_importance(regressor)

# Negative Values are predicted
df['count_predicted'].describe()

df['count_predicted'].describe()

def adjust_count(x):
    if x < 0:
        return 0
    else:
        return x
        
df['count_predicted'] = df['count_predicted'].map(adjust_count)
df[df['count_predicted'] < 0]
plt.boxplot([df['count'],df['count_predicted']], labels=['actual','predicted'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('Target')
plt.grid(True)

plt.boxplot()

# Over prediction and Under Prediction needs to be balanced
# Training Data Residuals
residuals = (df['count_predicted'] - df['count'])

plt.hist(residuals)
plt.grid(True)
plt.xlabel('(Predicted - Actual)')
plt.ylabel('Count')
plt.title('Residuals Distribution')
plt.axvline(color='g')

plt.hist(residuals)

import sklearn.metrics as metrics
print("RMSE: {0}".format(metrics.mean_squared_error(df['count'],df['count_predicted'])**.5))

# Metric Use By Kaggle
def compute_rmsle(y_true, y_pred):
    if type(y_true) != np.ndarray:
        y_true = np.array(y_true)
        
    if type(y_pred) != np.ndarray:
        y_pred = np.array(y_pred)
     
    return(np.average((np.log1p(y_pred) - np.log1p(y_true))**2)**.5)
    
print("RMSLE: {0}".format(compute_rmsle(df['count'],df['count_predicted'])))

RMSLE Metric by Kaggle

 

Predict on the Kaggle TEST file, clean up the results, and upload them to Kaggle.

df_test[df_test["count"] < 0]
df_test["count"] = df_test["count"].map(adjust_count)
df_test[['datetime','count']].to_csv('predicted_count.csv',index=False)

Kaggle Score

Oh... with the revised code the score dropped from 0.62 to 0.40.

The key is applying log1p to the count target during data preparation.
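A minimal sketch of that transform as assumed here (the actual data-preparation notebook is not reproduced in this post): train on log1p(count), then invert with expm1 after predicting.

import numpy as np
import pandas as pd

df = pd.DataFrame({'count': [16, 40, 32, 13, 1]})   # toy counts
df['count'] = np.log1p(df['count'])                 # log(count + 1) tames the long right tail
# ... train on the transformed target ...
# predictions come back on the log scale, so invert them afterwards:
# df['count_predicted'] = np.expm1(predictions)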

 

- Training on SageMaker: xgboost_cloud_training_template.ipynb

 

[Upload Data to S3]

# Specify your bucket name
bucket_name = 'leedoing-ml-sagemaker'

training_folder = 'bikerental/training/'
validation_folder = 'bikerental/validation/'
test_folder = 'bikerental/test/'

s3_model_output_location = 's3://{0}/bikerental/model'.format(bucket_name)
s3_training_file_location = 's3://{0}/{1}'.format(bucket_name,training_folder)
s3_validation_file_location = 's3://{0}/{1}'.format(bucket_name,validation_folder)
s3_test_file_location = 's3://{0}/{1}'.format(bucket_name,test_folder)

import boto3

def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)
        
write_to_s3('bike_train.csv', 
            bucket_name,
            training_folder + 'bike_train.csv')

write_to_s3('bike_validation.csv',
            bucket_name,
            validation_folder + 'bike_validation.csv')

write_to_s3('bike_test.csv',
            bucket_name,
            test_folder + 'bike_test.csv')

Each of the prepared datasets is stored in S3 so SageMaker can use it.

 

In the past, we had to maintain the algorithm containers mapping:


containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}

Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Previously you had to define the algorithm containers with their ECR paths yourself, but that is no longer necessary.

 

[Training Algorithm Docker Image]

import sagemaker
from sagemaker import get_execution_role

# Establish a session with AWS
sess = sagemaker.Session()
role = get_execution_role()

# Sagemaker API now maintains the algorithm container mapping for us
# Specify the region, algorithm and version
container = sagemaker.amazon.amazon_estimator.get_image_uri(
    sess.boto_region_name,
    "xgboost", 
    "latest")

print('Using SageMaker XGBoost container:\n{} ({})'.format(container, sess.boto_region_name))

Check the printed container image (ECR) path.

[Build Model]

# Configure the training job
# Specify type and number of instances to use
# S3 location where final artifacts needs to be stored

#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

estimator = sagemaker.estimator.Estimator(
    container,
    role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    output_path=s3_model_output_location,
    sagemaker_session=sess,
    base_job_name ='xgboost-bikerental-v1')

The Estimator from SageMaker's high-level interface.

 

# Specify hyper parameters that appropriate for the training algorithm
# XGBoost Training Parameter Reference: 
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

# max_depth=5,eta=0.1,subsample=0.7,num_round=150
estimator.set_hyperparameters(max_depth=5,
                              objective="reg:linear",
                              eta=0.1,
                              num_round=150)

estimator.hyperparameters()

hyperparameters

Hyperparameter configuration

 

Reference: XGBoost Hyperparameters - Amazon SageMaker (docs.aws.amazon.com)

 

[Specify Training Data Location and, Optionally, Validation Data Location]

# content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.s3_input(
    s3_data=s3_training_file_location,
    content_type='csv',
    s3_data_type='S3Prefix')

validation_input_config = sagemaker.session.s3_input(
    s3_data=s3_validation_file_location,
    content_type='csv',
    s3_data_type='S3Prefix'
)

data_channels = {'train': training_input_config, 'validation': validation_input_config}
print(training_input_config.config)
print(validation_input_config.config)

[Train the model]

# XGBoost supports "train", "validation" channels
# Reference: Supported channels by algorithm
#   https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
estimator.fit(data_channels)

estimator.fit(data_channels)

 

Model training takes about 3 minutes, though of course this depends on the data...

Once you see the Completed message, you can check the RMSE results in Jupyter as shown below.

estimator.fit output

[Deploy Model]

# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name = 'xgboost-bikerental-v1')

The model is deployed with estimator.deploy.

Deploying on an ml.m4.xlarge, burning money as it goes..

Deployment takes about 5 minutes.
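The endpoint keeps billing for the ml.m4.xlarge while it sits idle, so it is worth tearing it down once the experiment is done (a sketch; sess is the sagemaker.Session() created earlier):

sess.delete_endpoint('xgboost-bikerental-v1')
# or via the low-level client:
# boto3.client('sagemaker').delete_endpoint(EndpointName='xgboost-bikerental-v1')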

 

[Predictions]

from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
predictor.predict([[3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3]])

 

Summary of using SageMaker with XGBoost

1. Upload Train and Validation file to S3

2. Specify Algorithm and Hyperparameters

3. Configure type of server and number of servers to use for Training

4. Create a real-time Endpoint for interactive use case

 

- Prediction using SageMaker: xgboost_cloud_prediction_template.ipynb

1. Invoke Endpoint for interactive use cases

2. Connect to an existing Endpoint

3. Endpoint Security (see the boto3 sketch after this list)

4. Multiple observations in a single round trip
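For items 1 and 3, the endpoint can also be invoked through the low-level boto3 runtime API; requests are SigV4-signed, so the caller only needs sagemaker:InvokeEndpoint permission on this endpoint. A sketch (the CSV body reuses the single observation from the training notebook):

import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='xgboost-bikerental-v1',
    ContentType='text/csv',
    Body='3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3')
print(response['Body'].read().decode('utf-8'))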

import boto3
import re
from sagemaker import get_execution_role
import sagemaker

# Acquire a realtime endpoint
endpoint_name = 'xgboost-bikerental-v1'
predictor = sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name)

from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
df_all = pd.read_csv('bike_test.csv')
# Need to pass an array to the prediction
# can pass a numpy array or a list of values [[19,1],[20,1]]
arr_test = df_all[df_all.columns[1:]].values # drop the datetime column and convert to a numpy array
type(arr_test)
arr_test.shape
arr_test[:5]

numpy.ndarray

 

result = predictor.predict(arr_test[:2])
result

predict(arr_test[:2])

 

Split the TEST array and predict in batches.

# For large number of predictions, we can split the input data and
# Query the prediction service.
# array_split is convenient to specify how many splits are needed
predictions = []
for arr in np.array_split(arr_test,10):
    result = predictor.predict(arr)
    result = result.decode("utf-8")
    result = result.split(',')
    print (arr.shape)
    predictions += [float(r) for r in result]
np.expm1(predictions)
df_all['count'] = np.expm1(predictions) # invert the log1p transform applied to count

np.expm1(predictions)

 

 

Reference: numpy.expm1 (NumPy v1.17 Manual, docs.scipy.org)

df_all.head()
df_all[['datetime','count']].to_csv('predicted_count_cloud.csv', index=False)

Bike Rental Kaggle score

Scaling based on Invocations

MAX RPS: 100

Safety Factor = 0.5

Target SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60

SageMakerVariantInvocationsPerInstance = 100 * 0.5 * 60 = 3,000
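That target can be attached to the endpoint's production variant through Application Auto Scaling; a sketch with boto3 (the variant name 'AllTraffic' and the min/max capacities are assumptions):

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/xgboost-bikerental-v1/variant/AllTraffic'

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4)

autoscaling.put_scaling_policy(
    PolicyName='bikerental-invocations-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 3000.0,   # (MAX_RPS * SAFETY_FACTOR) * 60 from above
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'}})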

 

So far we've covered the basics such as data preprocessing, then used the Kaggle data to build regression models with sklearn, and finished with XGBoost on SageMaker.
