sing group http://sing.ei.uvigo.es  

overview
Motivation
2007-08-26 02:03:20

SpamHunting is a spam filter project developed entirely by the SING research group. We carried out an in-depth analysis of the inherent features of spam messages and of the spam concept itself, and compiled the problematic issues and requirements of spam filters. The main conclusions of this analysis are the following: (i) filters should be able to detect, manage or discard the noise that spammers introduce in their messages; (ii) the spam filtering domain is affected by concept drift and, therefore, filters should use techniques for selecting the most suitable representation for each message at any time; (iii) spam is a disjoint concept, and this should be kept in mind when developing filtering software; (iv) the cost of the different kinds of mistakes is asymmetric and, therefore, spam filtering techniques should reduce false positive errors; and (v) tools for spam filtering should be able to evolve over time in order to adapt to the changing domain (i.e., they should offer high adaptation capabilities and use continuous knowledge updating techniques).

Since the vast majority of proposed techniques do not take these ideas into account, we developed a content-based spam filter called SPAMHUNTING. We chose the content-based approach because this kind of model can achieve higher generalization levels. SpamHunting has been designed following the lifecycle of a Case-Based Reasoning system. We implemented all the stages common to this kind of system: (i) retrieval, (ii) adaptation, (iii) revision and, finally, (iv) learning.

 
Goals
2007-08-28 18:04:13

The main goals of the SpamHunting project are the following:

  1. Use continuous updating capabilities: We developed a spam filter able to dynamically acquire new relevant knowledge and discard obsolete knowledge. Every time a message is classified by a spam filter, some relevant knowledge should be extracted from it in order to improve future classification tasks. Over time, some message features (generally terms) can become useless for spam classification. The system should detect and discard the information related to these attributes in order to optimize the use of knowledge and improve the results of the system.
  2. Handle concept drift: Over time, new kinds of spam messages are delivered through the Internet. The appearance of new kinds of spam emails is generally caused by the advertisement of new fraudulent or illegal products. Imagine the following situation: somebody has found a way to extend your life by means of an illegal drug and wants to get rich selling it. In this situation, spamming techniques would probably be used to advertise the product because: (i) illegal products cannot be advertised through other media such as television or radio, (ii) due to the way the Internet is financed, delivering spam messages is the cheapest form of advertising, and (iii) nowadays the Internet is a great commerce platform that can easily be used for selling illegal products. This is the beginning of the delivery of a new kind of spam message talking about "enlarge your life". If this drug becomes legal, everybody can buy it in any pharmacy, so delivering this kind of message no longer makes sense and these messages disappear from user mailboxes and Internet email servers. The continuous updating capabilities mentioned above can be useful for handling concept drift, but they are generally not enough.
  3. Manage disjoint concepts: Both spam and legitimate email are disjoint concepts. In fact, there is no similarity between a spam message advertising viagra and another one offering low-rate credit. The terms and expressions are different. Therefore, it makes no sense to use the frequency of the term "viagra" as an attribute to represent an email advertising low-rate credit. This is a common mistake in the design of current spam filters. We represent each message using some header attributes (RFC 822) and the set of words that best describe it.
  4. Manage classification accuracy: When somebody asks us about something unknown, we usually answer with something like: "I think that ... but I'm not sure about it". When a spam filter classifies a message, it should provide information about the accuracy of the classification it generates. This information can be derived from the knowledge available for constructing the solution (similarly to human behaviour).
  5. Continuously improve the way messages are represented.
  6. Implement a spam filtering technique based on human behaviour.
  7. Achieve high performance in spam filtering.
  8. Allow end users to interact with the system in order to improve the knowledge in the memory base.
  9. Handle common spam filtering problems such as the presence of messages written in many languages and the existence of several kinds of user profiles.
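
The disjoint representation described in goal 3 can be illustrated with a minimal Python sketch. It is not the SpamHunting implementation: the function name `represent`, the `top_k` parameter, the choice of the From and Subject headers, and the use of raw term frequency to pick the words that best describe a message are all assumptions made for the example.

```python
import re
from collections import Counter
from email import message_from_string


def represent(raw_message, top_k=5):
    """Describe one message by two header fields and its own top_k terms.

    Ranking terms by raw frequency is an assumption made for this sketch.
    """
    msg = message_from_string(raw_message)
    body = msg.get_payload()  # plain-text, non-multipart mail assumed
    counts = Counter(re.findall(r"[a-z]{3,}", body.lower()))
    return {
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "terms": {term for term, _ in counts.most_common(top_k)},
    }


pills = represent(
    "From: a@example.com\nSubject: Offer\n\nviagra viagra cheap viagra cheap pills\n",
    top_k=2,
)
loans = represent(
    "From: b@example.com\nSubject: Credit\n\ncredit rate credit low rate credit\n",
    top_k=2,
)
# Both messages are spam, yet their descriptive terms do not overlap.
shared = pills["terms"] & loans["terms"]
```

Because each message carries its own term set, no global attribute such as the frequency of "viagra" is forced onto a message about credit.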
 
Project details
2007-08-29 14:05:05

We selected case-based reasoning (CBR) as the methodology for developing the proposed model. The memory of the system is implemented with an efficient indexing structure known as EIRN (Enhanced Instance Retrieval Network), which is used to carry out the retrieval stage. For the reuse stage, we use a voting strategy known as PWV (Proportional Weighted Voting), while in the revision stage we compute the quality of the generated solution. Finally, SPAMHUNTING learns during the retention stage through interaction with the user and by using a forgetting technique called GTC (Garbage Term Collection). The proposed model can exchange information with other running instances of SPAMHUNTING, combining the advantages of collaborative and content-based approaches.
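
The four stages can be pictured with a toy Python sketch. It does not reproduce EIRN, PWV or GTC, whose definitions are not given here: the inverted index, the vote weight (number of shared terms), the confidence measure (winning share of the vote) and every name in the code are assumptions chosen only to show how retrieval, reuse, revision and retention fit together.

```python
from collections import defaultdict


class CaseBase:
    """Toy case memory: an inverted index from term to case ids.

    This is a stand-in for EIRN, not a reimplementation of it.
    """

    def __init__(self):
        self.labels = {}               # case id -> "spam" or "legitimate"
        self.index = defaultdict(set)  # term -> ids of cases containing it

    def retain(self, case_id, terms, label):
        """Retention stage: store a classified message as a new case."""
        self.labels[case_id] = label
        for term in terms:
            self.index[term].add(case_id)

    def retrieve(self, terms):
        """Retrieval stage: count the terms each stored case shares with the query."""
        shared = defaultdict(int)
        for term in terms:
            for case_id in self.index.get(term, ()):
                shared[case_id] += 1
        return dict(shared)

    def classify(self, terms):
        """Reuse and revision: weighted vote plus a confidence estimate."""
        votes = {"spam": 0.0, "legitimate": 0.0}
        for case_id, n_shared in self.retrieve(terms).items():
            votes[self.labels[case_id]] += n_shared  # weight = shared terms (assumed)
        total = votes["spam"] + votes["legitimate"]
        if total == 0:
            return "legitimate", 0.0  # no evidence: avoid a false positive
        label = "spam" if votes["spam"] > votes["legitimate"] else "legitimate"
        return label, votes[label] / total


memory = CaseBase()
memory.retain(1, {"viagra", "cheap", "pills"}, "spam")
memory.retain(2, {"meeting", "agenda", "pills"}, "legitimate")
label, confidence = memory.classify({"cheap", "viagra", "pills"})
# label == "spam", confidence == 0.75 (3 of the 4 vote units)
```

Ties and empty retrievals fall back to "legitimate" in this sketch, reflecting the asymmetric cost of false positives noted in the motivation.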

Regarding experimentation, we have argued for evaluating filters from a dynamic point of view in order to test their adaptation capabilities. Moreover, we have designed sudden concept drift tests to benchmark the behaviour of models when concept drift is present. Finally, we propose multiple-language tests that show the behaviour of filters working in multilingual filtering environments.
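
One simple way to look at a filter dynamically is to report its error rate per time window instead of a single aggregate figure. The following Python sketch assumes a chronological list of (predicted, actual) label pairs; the function name `windowed_error_rates` and the fixed window size are hypothetical and do not describe the test protocol actually used in the project.

```python
def windowed_error_rates(outcomes, window):
    """Split a chronological list of (predicted, actual) pairs into
    consecutive windows and return the error rate of each one."""
    rates = []
    for start in range(0, len(outcomes), window):
        chunk = outcomes[start:start + window]
        errors = sum(1 for predicted, actual in chunk if predicted != actual)
        rates.append(errors / len(chunk))
    return rates


# Four correct decisions, then a sudden drift: a new kind of spam is missed.
outcomes = (
    [("spam", "spam")] * 4
    + [("legitimate", "spam")] * 3
    + [("spam", "spam")]
)
rates = windowed_error_rates(outcomes, window=4)
```

A jump in the per-window error rate marks the point where the filter stops coping with the new messages, which a single overall accuracy value would hide.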

We have built a test corpus with more than 20,000 messages written in five different languages, and we have developed a visual tool for monitoring the changes caused by concept drift in a spam filtering environment. Moreover, this work contributes a set of relevant guidelines for the construction of spam filters. First, we highlight the use of continuous updating techniques, which make it possible to: (i) detect and learn new significant information, and (ii) discard obsolete knowledge. Next, we emphasize disjoint knowledge representation as a relevant issue in spam filtering, since the relevant features of spam messages can change as time passes. Finally, we should keep in mind that spam is a disjoint concept and, therefore, the message representation should be selected taking this fact into consideration.

 
Current and future work
2007-08-29 18:05:32

Current work is focused on:

  • Integrating semantic information into the filtering process
  • Developing a three-dimensional graph for the tool in order to study the evolution of the system
  • Full integration with the AIBench platform (http://www.aibench.org)