Iraqi Digital Repository

A Spam Email Classifier Based on Naive Bayesian Appr

Author name: سعدية فهد جبار

Supervisor name: مها ادهم البياتي

General topic: Computer Science

Specific topic: Computer Science

Degree: Master

University: Mustansiriyah University - College Of Science - Department Of Computer

Language: English

University location: Baghdad

First pages: 28T840 - p.pdf

Abstract: Email has become an important medium for many forms of group communication and is used by millions of individuals and organizations. At the same time, it has become prone to threats. One of the most common is spam, also known as unsolicited bulk email or junk email. Because spammers and filter developers are locked in a continuing race, spam keeps changing over time, which makes it a serious problem on the Internet and an increasingly difficult threat to detect. This work proposes a spam classification approach based on supervised learning. It presents a Naive Bayesian (NB) classifier that identifies an email message as spam or legitimate based on its content (i.e., its body). Each email is represented as a bag of the words (features) in its body. To keep up with spammers' latest techniques, a robust and up-to-date dataset was used: the CSDMC2010 spam corpus (last updated in 2014), a set of ".eml" files of raw email messages. To improve performance, the NB environment was supplemented with a proposed list of 149 features (words and symbols) commonly used in spam emails. The proposed NB classifier was trained on a set of 3800 email messages and tested on a set of 500 additional emails. Preprocessing was needed to drop redundant data, keeping only those parts of an email body that carry information useful for efficient classification. The bag-of-words method of feature construction was applied to each email individually, so that each email is represented as a list of features.
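As an illustration of the bag-of-words Naive Bayesian classification described in the abstract, the following is a minimal sketch, not the thesis implementation. The abstract does not state which NB variant or smoothing was used, so the multinomial model with add-one (Laplace) smoothing is an assumption, as are the tokenizer, the function names, and the four toy emails.

```python
import math
import re
from collections import Counter

def tokenize(body):
    """Lowercase an email body and split it into tokens (the bag of words).
    Keeping '$' and '!' as token characters is an assumption for illustration."""
    return re.findall(r"[a-z0-9$!']+", body.lower())

def train_nb(emails):
    """Collect per-class word counts and document counts from (body, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for body, label in emails:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(body))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def classify(body, model):
    """Return the label with the highest log-posterior under multinomial NB."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(body):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities (assumed, not from the thesis).
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the CSDMC2010 corpus.
TOY_EMAILS = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train_nb(TOY_EMAILS)
```

Calling `classify("claim your free money", model)` on this toy model yields `"spam"`, while `classify("project meeting on monday", model)` yields `"ham"`. Working in log space keeps the product of many small probabilities from underflowing.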
To further reduce the dimensionality of the feature space, two feature selection methods, information gain (IG) and word frequency (WF), were tested on these emails with promising results. Several experiments were conducted to evaluate the performance of the proposed classifier on the basis of certain criteria and to investigate the impact of the size of the feature space on the classification rate. Three proportions of the total feature space were considered: 25%, 50%, and 75%. The results showed that a proportion of 75% with the IG method achieved the highest classification rate, 91%. To handle the cases that the NB algorithm still misclassified, certain statistics were proposed as an extension of the NB algorithm. Experimental results showed that this extension raised accuracy to 100%.
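The information gain selection step can be sketched as follows. This is a hedged illustration only: the abstract does not give the exact IG formulation, so the standard definition over a binary "word occurs in the email" feature is assumed, and the function names and toy documents are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in the document' w.r.t. the label."""
    classes = sorted(set(labels))
    base = entropy([labels.count(c) for c in classes])
    remainder = 0.0
    for present in (True, False):
        subset = [lab for doc, lab in zip(docs, labels) if (word in doc) == present]
        if subset:
            remainder += (len(subset) / len(labels)
                          * entropy([subset.count(c) for c in classes]))
    return base - remainder

def select_features(docs, labels, proportion):
    """Keep the top `proportion` of the vocabulary, ranked by information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    keep = max(1, round(proportion * len(vocab)))
    return ranked[:keep]

# Hypothetical toy data: each document is the set of words in one email body.
TOY_DOCS = [
    {"free", "money", "now"},
    {"free", "prize", "now"},
    {"meeting", "monday", "now"},
    {"project", "report", "monday"},
]
TOY_LABELS = ["spam", "spam", "ham", "ham"]
```

On this toy data, `select_features(TOY_DOCS, TOY_LABELS, 0.25)` keeps the two words that separate the classes perfectly, `free` and `monday`. Passing 0.25, 0.50, or 0.75 mirrors the three proportions of the feature space considered in the abstract.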