intTypePromotion=1
zunia.vn Tuyển sinh 2024 dành cho Gen-Z zunia.vn zunia.vn
ADSENSE

Phishing attacks detection using genetic programming with features selection

Chia sẻ: Thi Thi | Ngày: | Loại File: PDF | Số trang:6

24
lượt xem
1
download
 
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

Phishing is a real threat on the Internet nowadays. Therefore, fighting against phishing attacks is of great importance. In this paper, we propose a solution to this problem by applying Genetic Programming with features selection methods to phishing detection problem. We conducted the experiments on a data set including both phishing and legitimate sites collected from the Internet. We compared the performance of Genetic Programming with a number of other machine learning techniques and the results showed that Genetic Programming produced the best solutions to phishing detection problem.

Chủ đề:
Lưu

Nội dung Text: Phishing attacks detection using genetic programming with features selection

Phạm Tuấn Anh và Đtg<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> 122(08): 21 - 26<br /> <br /> PHISHING ATTACKS DETECTION USING GENETIC PROGRAMMING<br /> WITH FEATURES SELECTION<br /> Tuan Anh Pham1, Thi Huong Chu2, Hoang Quan Nguyen2,<br /> Quang Uy Nguyen2, Xuan Hoai Nguyen3, Van Truong Nguyen4<br /> 2The<br /> <br /> 1Centre of IT, Military Academy of Logistics, Vietnam,<br /> Faculty of Information Technology, Le Quy Don University, Vietnam,<br /> 3IT R&D Center, Hanoi University, Vietnam,<br /> 4College of Education, TNU, Vietnam<br /> <br /> SUMMARY<br /> Phishing is a real threat on the Internet nowadays. Therefore, fighting against phishing attacks is of<br /> great importance. In this paper, we propose a solution to this problem by applying Genetic<br /> Programming with features selection methods to phishing detection problem. We conducted the<br /> experiments on a data set including both phishing and legitimate sites collected from the Internet.<br /> We compared the performance of Genetic Programming with a number of other machine learning<br /> techniques and the results showed that Genetic Programming produced the best solutions to<br /> phishing detection problem.<br /> Keywords: Genetic Programming, Phishing Attack, Machine Learning<br /> <br /> INTRODUCTION*<br /> Genetic Programming (GP) [2] is an<br /> evolutionary algorithm aimed to provide<br /> solutions to a user-defined task in the form of<br /> computer programs. Since its introduction,<br /> GP has been applied to many practical<br /> problems [2]. GP has also been used as a<br /> learning tool for solving some problems in<br /> network security [3]. However, to the best of<br /> our knowledge, there has not been any<br /> published work on the use of GP for learning<br /> to detect phishing web sites except our<br /> preliminary work in [4].<br /> In the field of network security, phishing<br /> attack is one of the main threat on the Internet<br /> nowadays [5]. Phishing attackers attempt to<br /> acquire confidential information such as<br /> usernames, passwords, and credit card details<br /> by disguising as a trustworthy entity in an<br /> online communication [5]. Due to the<br /> simplicity, phishing attacks are very popular. .<br /> According to a report released by an<br /> American security firm, RSA, there have been<br /> approximately 33,000 phishing attacks<br /> globally each month in 2012, leading to a loss<br /> of $687 million [1]. Therefore, detecting and<br /> *<br /> <br /> Tel: 0915 016063, Email: nvtruongtn@gmail.com<br /> <br /> eliminating phishing attacks is very important<br /> for not only organizations but also<br /> individuals. One popular and widely-used<br /> solution with most web browsers is to<br /> integrate blacklisted sites into them.<br /> However, this solution, which is unable to<br /> detect a new attack if the database is out of<br /> date, appears to be not effective when there<br /> are a large number of phishing attacks carried<br /> out very day.<br /> In a recent research [4], Pham et al. proposed<br /> a solution to this problem by applying<br /> Genetic Programming to phishing detection<br /> problem. The results showed that GP<br /> outperforms some other machine learning<br /> methods on this important problem. However,<br /> the research in [4] has some drawbacks.<br /> 1) The data set for training and testing was<br /> rather small. Therefore, the models created<br /> based on this data set may not generalize well<br /> in the real environment.<br /> 2) More important, the number of features<br /> used in [4] seems to be limited. Moreover,<br /> some features may not be relevant for<br /> distinguishing<br /> between<br /> phishing<br /> and<br /> legitimate sites. This may hinder the<br /> performance of machine learning methods in<br /> solving this problem.<br /> 21<br /> <br /> Phạm Tuấn Anh và Đtg<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> In this paper, we extend the work in [4] in<br /> several ways. The main contributions of this<br /> paper are:<br /> 1) We enlarged both training and testing data<br /> set by collecting more phishing and legitimate<br /> sites from the Internet.<br /> 2) We enriched the features set by adding<br /> some institutive features which may be<br /> beneficial for discriminating normal and<br /> phishing sites.<br /> 3) We used a features selection method to<br /> eliminate some irrelevant features which<br /> helps to improve the performance Genetic<br /> Programming.<br /> The remainder of the paper is organized as<br /> follows. In the next section, we briefly review<br /> some previous research on detecting phishing<br /> attacks. In Section III we present our method<br /> using GP for solving the phishing detection<br /> problem. It is followed by a section detailing<br /> our experimental settings. The experimental<br /> results are shown and discussed in Section V.<br /> The last section concludes the paper and<br /> highlights some potential future works.<br /> RELATED WORKS<br /> Since phishing attacks are very popular, there<br /> has been a number of anti-phishing solutions<br /> proposed to date. Some methods aim to solve<br /> the phishing problem at the email level by<br /> preventing users from visiting the phishing<br /> sites. That is, the emails containing phishing<br /> sites are filtered before being able to reach to<br /> the potential victims. Apparently, these<br /> techniques are closely related to anti-spam<br /> research and has been used by both Microsoft<br /> [6] and Yahoo [7]. Other solutions attempt to<br /> protect valuable information from being<br /> exposed to the phishers by replacing<br /> passwords with site-specific tokens, or by<br /> using novel authentication mechanisms.<br /> These methods have been used in some<br /> popular anti-phishing tools such as PwdHash<br /> and AntiPhish.<br /> In PwdHash [8], a domain-specific password,<br /> that is rendered useless if it is submitted to<br /> 22<br /> <br /> 122(08): 21 - 26<br /> <br /> another domain, is created (e.g., a password<br /> for www.gmail.com will be different if<br /> submitted to www.attacker.com). Conversely,<br /> AntiPhish [9] takes a different approach by<br /> keeping track of where confidential<br /> information such as a password is being<br /> submitted. That is, if it detects that a<br /> password is being entered into a form on an<br /> untrusted web site, a warning is generated and<br /> the current operation is canceled.<br /> In this paper, we will focus on the approaches<br /> that only use the information available from<br /> the URL and the pages source code.<br /> Currently, there are two main such<br /> approaches for identifying phishpages - based<br /> on URL blacklists; and based on the<br /> properties of the page and (sometimes) the<br /> URL. More detailed description about these<br /> methods can be found in [4].<br /> METHODS<br /> This section presents the methods used in this<br /> paper. The way to extract the features for<br /> each web site is presented first. The method<br /> for features selection is discussed after that.<br /> Finally, the GP system for phishing detection<br /> is described.<br /> Features Extraction<br /> The first step of using GP to tackle the<br /> phishing detection problem is features<br /> extraction/selection. The extracted features<br /> must contain information that helps to<br /> distinguish phishing and legitimate sites. In<br /> this paper, we extend the features set in [4] by<br /> adding some more features that are based on<br /> URL of the sites. Totally, eighteen features<br /> are used in this paper including twelve<br /> content-based features that have been used in<br /> [4] and six new URL-based features. These<br /> six URL-based features are taken from [10]<br /> and are described as follows.<br /> • URL1: number of ’@’ in URL (X13).<br /> • URL2: number of ’-’ in URL (X14).<br /> • URL3: number of ’.’ in URL (X15).<br /> • URL4: number of ’.’ in URL (X16).<br /> <br /> Phạm Tuấn Anh và Đtg<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> • URL5: 1 if URL contain word ’ebayisapi’,<br /> otherwise 0 (X17).<br /> • URL6: 1 if URL contain word ’banking’,<br /> otherwise 0 (X18).<br /> Features Selection<br /> Feature selection is the process of choosing a<br /> subset of features relevant to a particular<br /> application [11]. There have been a number of<br /> features selection proposed for machine<br /> learning algorithms [12]. Among them,<br /> statistics based methods have shown good<br /> performance on a number of problems [12].<br /> In this paper, we use the mutual information<br /> (MI) concept as the features selection<br /> criterion.<br /> Mutual information (MI) is a basic concept in<br /> information theory. It is a measure of general<br /> interdependence between random variables<br /> [12]. Specifically, given two random<br /> variables X and Y, the mutual information<br /> I(X;Y) is defined as follows:<br /> I (X ; Y ) = H (X ) + H (Y ) − H (X ; Y ) (1)<br /> where H() is the entropy of a random variable<br /> and measures the uncertainty associated with<br /> it. If X is a discrete random variable, H(X) is<br /> defined as follows:<br /> H (X ) = −∑ P (X )log2 (P (X )) (2)<br /> Calculating exactly mutual information (MI)<br /> between two random variables is not a<br /> straightforward task. Therefore, it is often<br /> necessary that this value is estimated. In this<br /> paper, we estimate MI using the histogram<br /> approach [12]. According to this method, the<br /> probability density function of each variable<br /> is approximated using a histogram. Then, the<br /> MI can be calculated according to the<br /> following equation:<br /> <br /> I ( X ;Y )   P( X , Y )log 2<br /> x<br /> <br /> y<br /> <br /> P( X , Y )<br /> P( X ) P(Y )<br /> <br /> (3)<br /> <br /> where the summations are calculated over the<br /> appropriately discretized values of the<br /> random variables X and Y. For each<br /> histogram bin, the joint probability<br /> <br /> 122(08): 21 - 26<br /> <br /> distribution P(X,Y) is estimated by counting<br /> the number of cases that fall into a particular<br /> bin and dividing that number with the total<br /> number of cases. The same technique is<br /> applied for the histogram approximation of<br /> the marginal distributions P(X) and P(Y).<br /> Choosing an appropriate bin is a crucial issue.<br /> In this paper, we follow [19] in choosing the<br /> number of bins based on the Gaussianity rule.<br /> With Gaussian data, the proper number of<br /> bins is log2 N + 1.<br /> System Description<br /> The evolutionary learning process of GP for<br /> solving the problem of phishing detection is<br /> divided into two stages: training and testing.<br /> The objective of training stage is to evolve the<br /> model (the classifier) that can determine a site<br /> as either phishing or legitimate based on its<br /> feature values. In the testing stage, the learnt<br /> model is used to make predictions on the<br /> unseen data. The accuracy of this prediction is<br /> used as an indicator for the quality<br /> (effectiveness) of the model.<br /> In the training stage, a set of training sites<br /> (both phishing and benign) with their labels<br /> (either as phishing or normal) are provided.<br /> The feature extraction process is called to<br /> convert every site to a feature vector. This<br /> vector is then served as the input for an<br /> individual in GP and the output of the<br /> individual is a real value. If this real value is<br /> greater than zero, this site is tagged as a<br /> phishing, otherwise it is considered as benign.<br /> The next step in the training process is to<br /> measure the fitness of an individual in GP. In<br /> this paper, we use a simple way to measure<br /> the fitness of individual where the fitness is<br /> the percentage of sites in the training set that<br /> are correctly classified. This fitness, thought<br /> may not be a good indicator if the data is<br /> imbalance, is intuitive to identify the overall<br /> quality of a model.<br /> EXPERIMENTAL SETTINGS<br /> This section outlines the settings used in our<br /> experiments. First, we present the way that<br /> 23<br /> <br /> Phạm Tuấn Anh và Đtg<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> data was collected for training and testing the<br /> systems. After that GP configurations for the<br /> experiments are described.<br /> Data Collection<br /> The data used for training and testing the<br /> system in this paper was collected from both<br /> phishing sites and legitimate sites on the<br /> Internet. The process is similar to that in[4]<br /> except the number of pages is larger. In this<br /> paper, we collected 3528 phishing pages and<br /> 3965 normal pages.<br /> From the data set, eighteen properties on each<br /> page were extracted to create the set feature<br /> vectors. We retained only one feature vector<br /> in case there is duplication in the data set.<br /> Moreover, if a feature vector presented in<br /> both phishing data and legitimate data, this<br /> vector was removed. As a result, 1800 feature<br /> vectors for phishing and 1200 feature vectors<br /> for legitimate data were retrieved. Totally, we<br /> obtained 1800+1200=3000 feature vectors of<br /> both phishing and legitimate sites. These<br /> vectors are mixed and divided into two sets:<br /> one for training (1000 samples) and the other<br /> for testing (the rest). Finally, feature values<br /> were normalized to the range between (0, 1),<br /> and the vectors extracted from phishing pages<br /> were labeled 1, otherwise labeled 0.<br /> GP Parameters Settings<br /> To tackle a problem with GP, several<br /> elements need to be clarified beforehand.<br /> These elements often depend on the problem<br /> and the experience of practitioners. The first<br /> and important element is the fitness function.<br /> As aforementioned, in this paper we use the<br /> percentage of correct classifications as the<br /> fitness measurement for each individual in the<br /> population. Other factors that strongly affect<br /> the performance of GP are the set of nonterminals and terminals. The terminal sets<br /> include 18 variables (X1, X2,...,X18 )<br /> representing 18 features extracted from the<br /> sites. The non-terminal set include 5 functions<br /> (+, -, *, /, iff). Here, we used the protected<br /> 24<br /> <br /> 122(08): 21 - 26<br /> <br /> versions of division (/), meaning that if the<br /> denominator is zero, the returned value is 1.<br /> Other evolutionary parameters are kept the<br /> same as [2].<br /> We divided our experiments into three sets. In<br /> the first, we repeated the experiments in [4]<br /> meaning that we used only twelve features<br /> from X1 to X12. However, the data sets for<br /> both training and testing in this experiment<br /> are much larger than those in [4]. We used<br /> 1000 samples for training and 2000 for testing<br /> (compared with only 516 and 288 for training<br /> and testing samples in [4]). The objective of<br /> this experiment is to see if the performance of<br /> GP on a larger data set is still maintained.<br /> In the second set we aimed to examine the<br /> impact of enriching the features set to the<br /> performance of GP in phishing detection<br /> problem. Similar to the experiment in [4], we<br /> also compared the performance of GP with<br /> several well-known machine learning<br /> techniques<br /> including<br /> Support<br /> Vector<br /> Machines, Artificial Neural Networks and<br /> Bayesian Networks.<br /> In the third set, we investigated the impact of<br /> features selection scheme that are based on<br /> the mutual information to the performance of<br /> all tested machine learning methods. This<br /> experimental set aims to see if using the<br /> features selection method help to remove<br /> some irrelevant features and leading to the<br /> better performance of learning methods. The<br /> detail about these experiments are presented<br /> in the following section.<br /> RESULTS AND DISCUSSION<br /> To determine quality of the models produced<br /> by GP, at the end of each run, we selected the<br /> best-of-the-run individual (the individual with<br /> the best fitness on the training set in the entire<br /> run). This model is then tested on the testing<br /> set and the output on the testing set is<br /> considered as the prediction error of the<br /> model. In order to experiment other machine<br /> learning techniques to solve the problem, we<br /> <br /> Phạm Tuấn Anh và Đtg<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> used their implementations in Weka. We<br /> compare the results produced by these<br /> methods with the results obtained by GP. The<br /> percentage of correct prediction of these<br /> methods in three experiments (Exp) is<br /> presented in Table 1. In this Table, GP is the<br /> results produced by genetic programming.<br /> SVM is shorthanded for Support Vector<br /> Machine while ANN stands for the Artificial<br /> Neuron Network. It should be noted that in all<br /> Figures, the greater values are better.<br /> It can be seen that the results in Table 1 are<br /> consistent with the results in [4]. It confirms<br /> that the best model produced by GP is also<br /> the best model among all models produced by<br /> all learning systems. Overall, the prediction<br /> accuracy of GP learnt model is about 71% in<br /> the first experiment. These values of other<br /> methods ranges from 54% to 67% with the<br /> lowest value is obtained by SVM while the<br /> highest value is obtained by ANN.<br /> Table 1. The Percentage of Correct Prediction<br /> Exp<br /> <br /> GP<br /> <br /> SVM<br /> <br /> ANN<br /> <br /> BayesNet<br /> <br /> Exp1<br /> Exp2<br /> Exp3<br /> <br /> 71.6<br /> 76.3<br /> 78.8<br /> <br /> 54.3<br /> 56.5<br /> 58.1<br /> <br /> 68.2<br /> 74.2<br /> 73.2<br /> <br /> 63.6<br /> 73.1<br /> 73.6<br /> <br /> The second experimental set was aimed to<br /> test if by adding more features (that are based<br /> on URL) to the features set, we can obtain<br /> better performance of these learning methods<br /> on this problems. The results of the second<br /> experiment are presented in the second row of<br /> Table 1.<br /> It can be seen that by enriching features set,<br /> the performances of almost all learning<br /> methods were improved. The most<br /> remarkable improvement is achieved with<br /> ANN and BayesNet. The accuracy of these<br /> two methods increased to around 74%. With<br /> AVM, its performance was also enhanced<br /> from 54% to around 57%. However, what is<br /> more important is that the performance of GP<br /> is also improved and it still obtained the best<br /> results amongst all tested techniques. The<br /> <br /> 122(08): 21 - 26<br /> <br /> results obtained by GP with this features set is<br /> about 76%. In general, the results in this<br /> experiment show the beneficial effect of<br /> adding some URL-based features to the<br /> features set in this problem.<br /> The results in the second experimental set<br /> show that enriching features set helps to<br /> improve the performance of learning<br /> algorithms in phishing detection problem.<br /> However, this larger features set may also<br /> contains some irrelevant features that might<br /> hinder the performance of GP and other<br /> learning<br /> methods.<br /> Therefore,<br /> this<br /> experimental set aims to examine if using the<br /> features selection method based on mutual<br /> information helps to eliminate irrelevant<br /> features and leading to the better<br /> performance. We first calculated the mutual<br /> information between each feature and the<br /> label of the whole data set (including both<br /> training and testing set). After that, we sorted,<br /> in ascending order, the features based on its<br /> mutual information with the label. We<br /> omitted X8, X17 and X18 from the features<br /> set due to its loosely related to the label and<br /> we conducted the above experiments with the<br /> new features set. The results are given in the<br /> row 3 of Table 1. It can be seen from these<br /> results that by using the features selection<br /> technique to eliminate some irrelevant<br /> features (X8, X17 and X18 in this paper), we<br /> can achieve better performance for GP. While<br /> the performance of other learning algorithms<br /> is mostly the same with the experiment in the<br /> second set, the performance of GP is keeping<br /> enhanced and it obtains the best result in all<br /> experiments at about 78%. Overall, the<br /> experiments in this paper show the ability of<br /> GP in tackling phishing detecting problem<br /> and if we enrich the features set and using<br /> features selection to eliminate irrelevant<br /> features we can achieve rather good result, up<br /> to approximate 80% of correct prediction.<br /> Comparing to the best result in [4] with only<br /> about 70%, this is a significant improvement.<br /> 25<br /> <br />
ADSENSE

CÓ THỂ BẠN MUỐN DOWNLOAD

 

Đồng bộ tài khoản
2=>2