   The textcat package for n-gram based text categorization in R  
   
Authors: Hornik K., Mair P., Rauch J., Geiger W., Buchta C., Feinerer I.
Source: Journal of Statistical Software - 2013 - Volume: 52 - Pages: 1-17
Abstract: Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
Keywords: Language identification; N-grams; R; Text categorization; Text mining; Textcat
Addresses: Institute for Statistics and Mathematics, Department of Finance, Accounting and Statistics, WU Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien, Austria (Hornik, Mair, Rauch, Geiger, Buchta); Technische Universität Wien, Austria (Feinerer)
 
     
   
  
 
 
