Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
|
Author
|
Elhadi, Mohamed T.
|
Source
|
Journal of Advances in Computer Engineering and Technology - 2019 - Volume: 5 - Issue: 2 - Pages: 117-128
|
Abstract
|
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that speed up classification for any text categorization task, and it also serves as a tool for creating Arabic text corpora. In particular, we build a text classification process for Arabic news articles downloaded from web news portals and sites. The suggested procedure is a pilot project that uses a human-predefined set of documents already assigned to subjects or categories. Vectorized term frequency-inverse document frequency (TF-IDF) processing was used for the initial verification of the categories. The resulting validated categories were then used to predict categories for new documents. The experiment used 1000 initial documents preassigned to five categories, with 200 documents in each. An initial set of 2195 documents was downloaded from a number of Arabic news sources and preprocessed to test the utility of the suggested classification procedure, with cosine similarity as the classifier. Results were very encouraging, with very satisfying precision, recall, and F1 score. The authors intend to improve the procedure and to use it for Arabic corpora creation.
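The idea of classifying a new document by its cosine similarity to TF-IDF vectors of seed documents can be sketched as follows. This is a minimal illustration only, not the author's implementation: the function names (`tokenize`, `build_idf`, `tfidf`, `cosine`, `classify`), the smoothed IDF formula, the nearest-seed decision rule, and the toy English seed texts are all assumptions, and the Arabic preprocessing mentioned in the abstract is replaced by plain whitespace splitting.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for real Arabic
    # preprocessing (normalization, stop-word removal, stemming).
    return text.split()


def build_idf(docs):
    # docs: list of token lists. Smoothed IDF so every seen term
    # receives a positive weight.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    # Terms absent from the seed vocabulary are ignored.
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, seeds):
    # seeds: dict mapping category name -> list of seed document texts.
    # Hypothetical decision rule: a category's score is the highest
    # similarity between the new document and any of its seeds.
    labelled = [(cat, tokenize(doc)) for cat, docs in seeds.items() for doc in docs]
    idf = build_idf([tokens for _, tokens in labelled])
    query = tfidf(tokenize(text), idf)
    scores = {}
    for cat, tokens in labelled:
        sim = cosine(query, tfidf(tokens, idf))
        scores[cat] = max(scores.get(cat, 0.0), sim)
    return max(scores, key=scores.get), scores


# Toy seed set (placeholder English text, not data from the paper).
seeds = {
    "sports": ["football match goal team", "team won the league match"],
    "economy": ["bank interest rate market", "oil market price rise"],
}
label, scores = classify("the team scored a goal in the match", seeds)
print(label, scores)
```

Other decision rules, such as comparing against a per-category centroid vector or averaging similarities across seeds, would fit the same structure; the abstract does not specify which one was used.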
|
Keywords
|
Arabic text classification, TF-IDF, vector space model, news articles, corpora creation
|
Address
|
Zawia University, Faculty of Information Technology, Computer Technology Department, Libya
|
Email
|
mtelhadi@yahoo.com