[Algorithm] Isolation Forest

Isolation Forest

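Isolation Forest is a model-based anomaly detection method: a model is fit to data that is assumed to be mostly normal, and that model is then used to judge whether each instance is normal or anomalous.

Because it needs no anomaly labels, it is also used to train an anomaly detection model on normal data alone.

Although the algorithm was proposed in 2008, it still performs well at outlier detection and remains widely used.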

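Algorithm

To isolate a point, the algorithm picks an attribute at random, then picks a split value at random between the minimum and maximum of that attribute, and repeats this to partition the sample recursively.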
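A single partition step as described above can be sketched in a few lines. This is only an illustration using the standard library; the function name `random_split` and the toy points are assumptions made for this example, not part of any library.

```python
import random

def random_split(points, rng):
    """Split a list of equal-length tuples on a random attribute at a random value."""
    q = rng.randrange(len(points[0]))        # randomly chosen attribute index
    lo = min(pt[q] for pt in points)         # smallest value of that attribute
    hi = max(pt[q] for pt in points)         # largest value of that attribute
    split_value = rng.uniform(lo, hi)        # random split value between lo and hi
    left = [pt for pt in points if pt[q] < split_value]
    right = [pt for pt in points if pt[q] >= split_value]
    return q, split_value, left, right

rng = random.Random(0)
# Hypothetical toy data: three nearby points and one far-away point.
points = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (8.0, 9.0)]
q, split_value, left, right = random_split(points, rng)
print(q, round(split_value, 3), left, right)
```

Applying the same step again to `left` and `right` produces the recursive partitioning that the tree represents.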

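The space partitioned this way can be represented as a decision tree.

The more normal a point is, the deeper you have to descend the tree before it is completely isolated.

An anomaly, by contrast, is likely to be isolated near the top of the tree.

This property lets us separate normal points from anomalies by how many splits it takes to isolate them.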

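The partitioning can be pictured as in the figure below.

Isolating a normal observation takes many binary splits.

Isolating an anomaly takes only a few.

Each observation is therefore assigned an anomaly score defined from its path length (the number of splits needed to isolate that observation).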
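The difference in path length can be checked with a small experiment. The sketch below is a simplified illustration, not the paper's implementation: it uses no subsampling, no height limit other than a safety cap, and follows only the branch that contains the target point. The data, the two target points, and the names `path_length` and `mean_path_length` are assumptions made for this example.

```python
import random

def path_length(target, points, rng, max_depth=100):
    """Number of random splits until `target` is alone in its partition."""
    current = points
    depth = 0
    while len(set(current)) > 1 and depth < max_depth:
        q = rng.randrange(len(target))           # random attribute
        lo = min(p[q] for p in current)
        hi = max(p[q] for p in current)
        if lo == hi:
            continue                             # this attribute cannot split; try another
        value = rng.uniform(lo, hi)              # random split value
        # Keep only the side of the split that contains the target.
        if target[q] < value:
            current = [p for p in current if p[q] < value]
        else:
            current = [p for p in current if p[q] >= value]
        depth += 1
    return depth

rng = random.Random(42)
inlier = (0.0, 0.0)                              # sits in the middle of the cluster
outlier = (6.0, 6.0)                             # sits far away from the cluster
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
points = cluster + [inlier, outlier]

def mean_path_length(target, trials=200):
    return sum(path_length(target, points, rng) for _ in range(trials)) / trials

avg_inlier = mean_path_length(inlier)
avg_outlier = mean_path_length(outlier)
print(f"inlier:  {avg_inlier:.1f} splits on average")
print(f"outlier: {avg_outlier:.1f} splits on average")
```

On data like this the far-away point should need noticeably fewer splits on average than the central one, which is the signal the anomaly score is built on.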

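Algorithm implementation

1. Randomly select a subset of observations from the full dataset.
2. On the selected observations, perform binary splits using a random variable (splitting variable) and a random split point (splitting point) until a stopping condition is met. In the original paper the stopping conditions are: the node holds a single observation, all observations in the node have identical values, or the tree reaches its height limit.
3. Build multiple iTrees by repeating the process above.
4. Compute the anomaly score from each observation's average path length and use it to flag anomalies.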
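The steps above can be sketched end to end with the standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the dictionary-based tree, the function names, the toy data, and the choice of `sample_size=128` are all made up for this example (the paper's default subsample size is 256). The normalizing term `c(n)` and the score `2 ** (-mean_path / c(n))` follow the cited paper; treating `c(2)` as 1 is a convention found in common implementations.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_itree(points, rng, depth, height_limit):
    # Step 2: stop at the height limit or when the points cannot be split further.
    if depth >= height_limit or len(set(points)) <= 1:
        return {"size": len(points)}             # external (leaf) node
    dim = len(points[0])
    while True:                                  # pick an attribute that still varies
        q = rng.randrange(dim)
        lo = min(p[q] for p in points)
        hi = max(p[q] for p in points)
        if lo < hi:
            break
    value = rng.uniform(lo, hi)
    left = [p for p in points if p[q] < value]
    right = [p for p in points if p[q] >= value]
    return {"q": q, "value": value,
            "left": build_itree(left, rng, depth + 1, height_limit),
            "right": build_itree(right, rng, depth + 1, height_limit)}

def path_length(x, node, depth=0):
    if "size" in node:
        # Unsplit leaves are credited with the expected depth of a subtree of that size.
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["value"] else node["right"]
    return path_length(x, child, depth + 1)

def fit_iforest(points, n_trees=100, sample_size=128, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    height_limit = math.ceil(math.log2(sample_size))
    # Steps 1 and 3: draw a random subsample for every tree and build many trees.
    trees = [build_itree(rng.sample(points, sample_size), rng, 0, height_limit)
             for _ in range(n_trees)]
    return trees, sample_size

def anomaly_score(x, trees, sample_size):
    # Step 4: average path length over all trees, mapped to a score in (0, 1].
    mean_path = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c(sample_size))

rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)] + [(8.0, 8.0)]
trees, sample_size = fit_iforest(data)
score_inlier = anomaly_score((0.0, 0.0), trees, sample_size)
score_outlier = anomaly_score((8.0, 8.0), trees, sample_size)
print(f"inlier score:  {score_inlier:.2f}")
print(f"outlier score: {score_outlier:.2f}")
```

A point is then flagged as an anomaly when its score exceeds a chosen threshold; the threshold itself is a modelling decision and is not fixed by the algorithm.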

Combining many such decision trees (iTrees) into an ensemble model (iForest) yields a stable anomaly score, as in the graph on the left.

According to the paper, the anomaly score stabilizes with roughly 50 to 100 trees.
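The stabilizing effect of averaging over more trees can be illustrated with a rough experiment. The sketch below is an assumption-laden simplification: it measures the path length of one fixed point on toy data, without subsampling or a height limit, and compares how much the averaged path length varies between repeated runs for different ensemble sizes. The names `path_length` and `forest_estimate` are hypothetical.

```python
import random
import statistics

def path_length(target, points, rng):
    """Number of random splits until `target` is alone in its partition."""
    current, depth = points, 0
    while len(set(current)) > 1:
        q = rng.randrange(len(target))
        lo = min(p[q] for p in current)
        hi = max(p[q] for p in current)
        if lo == hi:
            continue                             # this attribute cannot split; try another
        value = rng.uniform(lo, hi)
        if target[q] < value:
            current = [p for p in current if p[q] < value]
        else:
            current = [p for p in current if p[q] >= value]
        depth += 1
    return depth

rng = random.Random(7)
target = (0.0, 0.0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)] + [target]

def forest_estimate(n_trees):
    """Average path length of `target` over n_trees independent random trees."""
    return sum(path_length(target, points, rng) for _ in range(n_trees)) / n_trees

spread = {}
for n_trees in (1, 10, 50, 100):
    estimates = [forest_estimate(n_trees) for _ in range(20)]   # 20 repeated runs
    spread[n_trees] = statistics.pstdev(estimates)
    print(f"{n_trees:>3} trees: std of the averaged path length = {spread[n_trees]:.2f}")
```

The run-to-run spread should shrink as the number of trees grows, which is why a single tree gives a noisy score while an ensemble of this size gives a usable one.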

Anomaly score definition
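For reference, the anomaly score in the cited paper (Liu, Ting, and Zhou, 2008) is defined as follows, where h(x) is the path length of observation x in one tree, E(h(x)) is its average over all trees, n is the number of observations used to build a tree, and c(n) is the average path length of an unsuccessful search in a binary search tree, used to normalize h(x):

```latex
s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772156649
```

A score close to 1 indicates an anomaly, a score well below 0.5 indicates a normal observation, and if every observation scores around 0.5 the sample has no distinct anomalies.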

Anomaly detection

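We will walk through an anomaly detection example on protein measurements using the open-source library PyCaret.

PyCaret is an open-source Python machine learning library that automates machine learning workflows.

Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with just a few.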

The example uses the 04-Anomaly-Detection.ipynb notebook by 테디노트 (TeddyNote).

GitHub - teddylee777/machine-learning: a repository intended to help machine learning beginners and people preparing a study group (github.com)

The algorithm this code uses for anomaly detection is the Isolation Forest described above.

Anomaly-Detection.py

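# Install PyCaret
# First install the pycaret package into the interpreter (notebook cell syntax).
!pip install pycaret

# Google Colab users run the following.
# Note: enable_colab may not exist in newer PyCaret releases; skip this step if the import fails.
from pycaret.utils import enable_colab
enable_colab()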

# Import the required modules
import pandas as pd
import numpy as np
import seaborn as sns
pd.options.display.max_columns = None   # show every column when displaying a DataFrame


# Anomaly detection algorithm
# Load the dataset for the exercise
from pycaret.datasets import get_data
dataset = get_data('mice')


# This tutorial uses the UCI dataset `Mice Protein Expression`.
# It consists of the expression levels of 77 proteins/protein modifications
# that produced detectable signals in the nuclear fraction of cortex.
# It contains a total of 1080 measurements per protein; each measurement
# can be treated as an independent sample/mouse.

# Use 80% of the rows for training and hold out the rest for prediction
train = dataset.sample(frac=0.8, random_state=123)
test = dataset.drop(train.index)
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
print('Training dataset: ' + str(train.shape))
print('Prediction dataset: ' + str(test.shape))

# Setup
from pycaret.anomaly import *           # anomaly detection
s = setup(train,
          normalize=True,                # normalize the data
          ignore_features=['MouseID'],   # columns to ignore during training
          session_id=123)                # fix the random seed


# Create the model: 'iforest'
from IPython.display import Image
Image(url='https://miro.medium.com/max/1400/1*4P2vi2YVj4nHbU5SZ9i7Ig.png', width=750)

# Create the Isolation Forest model
iforest = create_model('iforest')

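# Anomaly detection: assign_model() adds the Anomaly and Anomaly_Score columns to the training data
iforest_results = assign_model(iforest)
iforest_results.head()

# Visualize the results
plot_model(iforest)

# Predict on the held-out data with predict_model()
predictions = predict_model(iforest, data=test)
predictions[['Anomaly', 'Anomaly_Score']]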

This post is based on the lecture 'Anomaly Detection - Isolation Forest' by Professor Seoung Bum Kim (김성범).

References

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation Forest”, IEEE International Conference on Data Mining 2008 (ICDM 08)

Isolation Forest (ieeexplore.ieee.org)

Anomaly Detection - Isolation Forest (velog.io)

Isolation forest - Wikipedia (en.wikipedia.org)

의사결정나무를 이용한 이상탐지 | 로그프레소 [Anomaly detection using decision trees | Logpresso] (logpresso.com)

Isolation Forest (for anomaly detection) (dodonam.tistory.com)

Isolation Forest (donghwa-kim.github.io)

What is the range of Scikit-Learn's IsolationForest decision_function scores? (stackoverflow.com)


PSY
