Automatic understanding of noisy social media text is one of the prime present-day research areas. Most research has so far concentrated on English texts; however, more than half of the users write in other languages, which makes language identification a prerequisite for comprehensive processing of social media text. Though language identification has been considered an almost solved problem in other applications, language detectors fail in the social media context due to phenomena such as code-mixing, code-switching, lexical borrowings, Anglicisms, and phonetic typing. This paper reports an initial study to understand the characteristics of code-mixing in the social media context and presents a system developed to automatically detect language boundaries in code-mixed social media text, here exemplified by Facebook messages in mixed English-Bengali and English-Hindi.

The majority of state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the rigor involved in developing training datasets and the need to repeat training for different texts, working with multilingual texts poses additional unique challenges. One of these challenges is that the developer must work with many different languages. Term expansion, such as query expansion, has been applied in numerous applications; however, a major drawback of most of these applications is that the actual meaning of terms is not usually taken into consideration. Considering the semantics of terms is necessary because most natural language words are polysemous. In this paper, as a specific contribution to the document index approach for text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic term expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for the word sense disambiguation and expansion of the topics' terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate our categorization algorithm on a multilabel text categorization problem that uses the JRC-Acquis dataset, which is based on the subject domain classification of the European Commission's EuroVoc microthesaurus. We compare the performance of the classifier with a variant of it that uses the original, unexpanded class topics. Furthermore, we compare the performance of our classifier with two state-of-the-art supervised algorithms (one each for the multilingual and cross-lingual tasks) using the same dataset. Empirical results obtained on five experimental languages show that categorization with expanded topics outperforms categorization with the original topics by a very wide margin. Our algorithm outperforms the existing supervised technique, which used the same dataset. Cross-language categorization surprisingly shows similar performance and is marginally better for some of the languages.
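The categorization step described above (expand each class topic, then score documents by semantic similarity to the expanded topic) can be illustrated with a minimal sketch. It is not the JointMC implementation: the expansion table TOY_EXPANSIONS stands in for a BabelNet lookup after word sense disambiguation, the three-dimensional TOY_VECTORS stand in for real cross-lingual word embeddings, and the function names and the 0.9 threshold are hypothetical choices made only for illustration.

```python
import math

# Hypothetical sense-disambiguated expansions; a stand-in for a BabelNet lookup.
TOY_EXPANSIONS = {
    "fishery": ["fishing", "fish", "pêche"],
    "finance": ["bank", "budget", "banque"],
}

# Hypothetical 3-dimensional word vectors shared across languages.
TOY_VECTORS = {
    "fishery": (0.9, 0.1, 0.0),
    "fishing": (0.8, 0.2, 0.0),
    "fish": (0.9, 0.0, 0.1),
    "pêche": (0.8, 0.1, 0.1),
    "quota": (0.6, 0.3, 0.1),
    "finance": (0.1, 0.9, 0.0),
    "bank": (0.0, 0.9, 0.1),
    "budget": (0.1, 0.8, 0.1),
    "banque": (0.0, 0.9, 0.1),
}


def expand_topic(topic):
    """Return the topic term followed by its (assumed) expansion terms."""
    return [topic] + TOY_EXPANSIONS.get(topic, [])


def mean_vector(tokens):
    """Average the vectors of known tokens; None if no token is known."""
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def categorize(document, topics, threshold=0.9):
    """Multilabel assignment: keep every topic whose expanded vector
    is at least `threshold` similar to the document vector."""
    doc_vec = mean_vector(document.lower().split())
    if doc_vec is None:
        return []
    labels = []
    for topic in topics:
        topic_vec = mean_vector(expand_topic(topic))
        if topic_vec is not None and cosine(doc_vec, topic_vec) >= threshold:
            labels.append(topic)
    return labels
```

Because the toy vectors are shared across languages, a French document such as "quota de pêche" can be scored against English topic labels without any training, which is the property the expanded-topic approach relies on. No training step is involved, only a similarity threshold.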