Classification of Spear Phishing Email using Machine Learning Approach
Keywords:
Spear Phishing, Email Classification, Machine Learning
Abstract
Spear phishing attacks targeting organizations are on the rise, and the techniques employed in spear phishing emails are becoming increasingly diverse. Although previous research has focused on identifying phishing emails based on their headers, bodies, or attachments, this study tackles spear phishing email classification using a machine learning approach. The research focuses on content-based features rather than on headers, bodies, or attachments. The proposed spear phishing email classification model comprises seven distinct phases: raw data acquisition, data pre-processing, feature extraction, n-fold cross-validation, classification algorithm selection, email classification, and model performance evaluation. The experiment uses content-based features extracted from the Enron dataset. The model's effectiveness is assessed with the Random Forest and Naïve Bayes classification algorithms, using area under the curve (AUC), precision, recall, and F1-score as evaluation metrics. Random Forest performed exceptionally well, with an AUC of 0.996, F1-score of 0.968, precision of 0.969, and recall of 0.967. Naïve Bayes achieved moderate results, with an AUC of 0.742, F1-score of 0.701, precision of 0.677, and recall of 0.727.
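As a reading aid for the n-fold cross-validation and model performance evaluation phases, the following minimal Python sketch shows how such folds and metrics can be computed. It is an illustration under stated assumptions, not the implementation used in the study: the function names (`precision_recall_f1`, `auc`, `k_fold_indices`) are hypothetical, labels are assumed to be binary with 1 for spear phishing and 0 for legitimate email, and the metrics are computed for the positive class only, since the abstract does not state whether the reported scores are averaged across classes.

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def k_fold_indices(
    n_samples: int, n_folds: int
) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into n_folds (train, test) pairs.
    Sample i goes to fold i % n_folds; no shuffling is applied here."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

In a sketch like this, each classifier would be trained on the `train` indices of every fold and scored on the corresponding `test` indices, with the per-fold metrics then combined; how the study aggregates results across folds is not specified in the abstract.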