Iraqi Digital Repository

عنقدة النصوص العربية باستعمال تحليل الاحالة المشتركة == Arabic Text Clustering Using Coreference Resolution

Author name: فراس حمودي نعمة

Supervisor name: سلمى عبد الباقي محمود

General topic: Computer Science

Specific topic: Computer Science

Degree: Master

University: University of Basrah - College of Science

Language: Arabic

University location: Basrah

First pages: 28T756 - p.pdf

Abstract: Text clustering organizes texts into subsets that are coherent and internally consistent yet distinct from one another; these subsets are called clusters. Documents are grouped according to similarity measures that rely on features extracted from the documents. The technique is applied in various fields such as web mining, search engines, and information retrieval. Document clustering supports efficient information retrieval, automatic extraction, and representation without user intervention as the number of new documents keeps growing. This study aims to improve the accuracy and efficiency of Arabic text clustering. Two approaches are used to achieve this aim. The first approach uses the standard technique for Arabic text clustering; the second is a proposed approach that reduces the features of the texts. The first approach applies the K-means and K-medoids algorithms with the Euclidean and cosine similarity measures. Its problem is the huge number of features, which affects the efficiency and coherence of the clusters. The huge number of features per document adds a challenge and leads to high dimensionality, so a coreference resolution technique is applied in the second approach. This technique extracts the main subjects of each document in order to improve Arabic document clustering. The system is implemented using a corpus of 200 Arabic sports news texts. Finally, the evaluation measures precision, recall, and F-measure are used to evaluate the system, and acceptable results are obtained. In the first approach, K-means exceeds K-medoids under both similarity measures (Euclidean, cosine), with highest values of (0.60, 0.78, 0.67). In addition, the proposed second approach, with coreference resolution, also exceeds K-medoids under both similarity measures, with measure values of (0.80, 0.83, 0.81).
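As an illustration of the measures named in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes documents are represented as simple bag-of-words term counts over whitespace-separated tokens (the abstract does not specify the feature representation), and the function names `term_vector`, `euclidean_distance`, `cosine_similarity`, and `f_measure` are hypothetical. It shows the two document comparison measures and the standard F-measure, the harmonic mean of precision and recall.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a whitespace-tokenized text (assumed representation)."""
    return Counter(text.split())


def euclidean_distance(a, b):
    """Euclidean distance between two term-count vectors (smaller means more similar)."""
    terms = set(a) | set(b)
    # Counter returns 0 for terms missing from a document.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (larger means more similar)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def f_measure(precision, recall):
    """Standard F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

If each reported triple is read as (precision, recall, F-measure), which is an assumption based on the order in which the abstract lists the measures, then `f_measure(0.60, 0.78)` and `f_measure(0.80, 0.83)` come out at roughly 0.68 and 0.81, close to the reported third values of 0.67 and 0.81.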