Iraqi Digital Repository

تشابه النصوص بالاعتماد على تقنية التحليل الدلالي المستتر == Text Similarity Based On Latent Semantic Analysis Technique

Author name: خضير جاسم كاظم

Supervisor name: احمد طارق صادق العبيدي

General topic: Computer Science

Specific topic: Computer Science

Degree: Master

University: Mustansiriyah University - College Of Science - Department Of Computer

Language: Arabic

University location: Baghdad

First pages: 28T832 - p.pdf

Abstract: Most automatic Natural Language Processing (NLP) applications and text-mining tasks, such as Information Retrieval (IR), document clustering, text summarization, and machine translation, rest on one core operation: computing the similarity between two or more texts. This thesis proposes two approaches to the problem of measuring semantic similarity between texts written in English. It aims to improve the computation of semantic similarity and to make it suitable for both short sentences and long texts. The thesis uses Latent Semantic Analysis (LSA), a corpus-based technique that derives semantic similarity from the contexts in which words occur. Both proposed approaches follow the style of knowledge-based measures: the semantic relationship is first derived at the term (word) level from the semantic space, and the similarity between the two whole texts is then computed from it. The thesis addresses two problems. The first is computing the semantic similarity of texts from their constituent words and terms by exploiting the mathematical properties of the Singular Value Decomposition (SVD) used in LSA, without an external resource such as a dictionary. The second is the size of the text vector, which here depends on the length of the two texts being compared rather than on the size of the semantic space. Evaluated on three different data sets, the proposed system achieves 76% when compared against human judgment, versus 65% for standard, unmodified LSA and 69% for a free online text-similarity system. It also outperforms several competing methods used for plagiarism detection, achieving 92% where those methods achieve 60% and 89%.
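The idea in the abstract of deriving word-level relatedness from an SVD-based semantic space and then combining it into a text-level score can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy corpus, the rank `k`, the function names, and the best-match averaging rule are all assumptions chosen for illustration. It only shows one way that word vectors from a truncated SVD can yield a similarity whose intermediate vectors have the length of the compared texts rather than the size of the semantic space.

```python
import numpy as np

# Hypothetical toy corpus; a usable semantic space needs a far larger one.
CORPUS = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "the dog barked at the mailman",
    "stocks fell as the market closed",
    "investors sold stocks in the market",
]


def build_term_vectors(docs, k=2):
    """Build a term-document count matrix and keep the top-k SVD dimensions."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.split():
            counts[index[w], j] += 1.0
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    # Each row is one term's coordinates in the reduced semantic space.
    return index, u[:, :k] * s[:k]


def term_similarity(w1, w2, index, vectors):
    """Cosine similarity of two terms; unknown terms score 0 (an assumption)."""
    if w1 == w2:
        return 1.0
    if w1 not in index or w2 not in index:
        return 0.0
    a, b = vectors[index[w1]], vectors[index[w2]]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def text_similarity(t1, t2, index, vectors):
    """Average, in both directions, each word's best match in the other text."""
    words1, words2 = t1.split(), t2.split()
    if not words1 or not words2:
        return 0.0

    def directed(src, dst):
        # One score per word in src, so the vector length follows the text length.
        best = [max(term_similarity(a, b, index, vectors) for b in dst) for a in src]
        return np.mean(best)

    return float((directed(words1, words2) + directed(words2, words1)) / 2)


if __name__ == "__main__":
    idx, vecs = build_term_vectors(CORPUS, k=2)
    print(text_similarity("a dog chased the cat", "the cat sat on the mat", idx, vecs))
    print(text_similarity("a dog chased the cat", "investors sold stocks", idx, vecs))
```

In this sketch no dictionary or thesaurus is consulted; all word relatedness comes from the decomposed term-document matrix, which is the property the abstract emphasizes. A real system would add preprocessing (tokenization, stop-word removal, term weighting) that is omitted here.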