Feature Extraction Tool for Phishing Dataset
Keywords:
Feature Extraction, Word2vec, Phishing
Abstract
Feature extraction is a fundamental technique that researchers apply to reduce the dimensionality of a dataset. Machine learning algorithms operate on feature vectors rather than raw data, so tools that convert raw text datasets into feature vectors are crucial. In this project, we propose using the Word2vec model and an object-oriented approach to extract features from raw text datasets. The tool is implemented in Python and can be applied in research on text-based feature extraction. It can extract either user-selected features or all phishing-related features from a raw text dataset, which makes it applicable to diverse research scenarios. The features included are Contains_Urgency, Contains_Free, Contains_Exclamation_Marks, Contains_Urls, Word_Count, Contain_Most_Similar_Words_To_Urgent_Word, and Contain_Most_Frequent_Word_Word. The Contain_Most_Similar_Words_To_Urgent_Word and Contain_Most_Frequent_Word_Word features are extracted using the Word2vec model. The main modules of the tool are the user registration module, user login module, upload module, download module, and feature extraction module.
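As an illustration of the object-oriented, user-selectable extraction described in the abstract, the following is a minimal sketch and not the tool's actual code. The class name `FeatureExtractor`, the method `extract`, and the exact matching rules (the keywords "urgent" and "free", the URL pattern, whitespace-based word counting) are assumptions made for this example. The two Word2vec-based features are left out because they depend on a trained model.

```python
import re
from typing import Callable, Dict, Iterable, Optional

# Assumed URL pattern for illustration: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


class FeatureExtractor:
    """Hypothetical extractor for the rule-based features named in the abstract."""

    def __init__(self) -> None:
        # Map each feature name to a function from raw text to an integer value.
        self._features: Dict[str, Callable[[str], int]] = {
            "Contains_Urgency": self._contains_word("urgent"),
            "Contains_Free": self._contains_word("free"),
            "Contains_Exclamation_Marks": lambda text: int("!" in text),
            "Contains_Urls": lambda text: int(bool(URL_PATTERN.search(text))),
            "Word_Count": lambda text: len(text.split()),
        }

    @staticmethod
    def _contains_word(word: str) -> Callable[[str], int]:
        # Whole-word, case-insensitive match (an assumed rule, not the tool's).
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        return lambda text: int(bool(pattern.search(text)))

    def extract(self, text: str, selected: Optional[Iterable[str]] = None) -> Dict[str, int]:
        """Return the selected features, or all features when none are selected."""
        names = list(self._features) if selected is None else list(selected)
        unknown = [name for name in names if name not in self._features]
        if unknown:
            raise KeyError(f"Unknown feature(s): {unknown}")
        return {name: self._features[name](text) for name in names}


if __name__ == "__main__":
    extractor = FeatureExtractor()
    message = "URGENT: claim your free prize now! Visit http://example.com"
    print(extractor.extract(message))                                   # all features
    print(extractor.extract(message, ["Contains_Urls", "Word_Count"]))  # user-selected
```

Keeping each feature as a named entry in one dictionary is one way to support both modes mentioned in the abstract: omitting `selected` yields every feature, while passing a list of names yields only those.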