STUDY OF WORD EMBEDDING RULES AND MACHINE LEARNING BASED TEXT CLASSIFICATION
(2022-01-26) AUBAID, Asmaa; Mishra, Alok; GÖRÜR, Abdülkadir
With the rapid growth of online information and of the number of electronic documents available on the Web and in digital libraries, categorizing text documents has become difficult. Embedding, rule-based and machine learning approaches are therefore the best solutions to this problem. The rule-based approach is considered one of the most flexible methods because it opens up the black box of the text classification process: the details of the classification can be inspected, and new tools or instructions can be added to obtain good results. This approach has high value for information retrieval, e-government, information filtering, text databases, digital libraries, and other applications. Choosing the embedding technique and generating the rules are both highly significant for text categorization. The general idea of any embedding technique is to determine the importance of keywords, keeping informative words and removing non-informative ones, which then helps the text-categorization engine assign a document to a category. This thesis deals with the rule-based approach using the word-to-vector (word2vec) and document-to-vector (doc2vec) embedding techniques. These two techniques are used to prepare keywords based on the computation of similarity. Those keywords are then used to build the rules of a classifier, with the aim of achieving the best system performance as measured by accuracy, recall, precision, and F-Measure. Experiments were performed on the Reuters-21578 and 20 Newsgroups datasets, classifying the top ten categories of each.
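The pipeline described above (select keywords by embedding similarity, then classify with keyword rules) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy two-dimensional vectors, the threshold of 0.9, the category names, and the helper names `cosine`, `select_keywords` and `classify` are hypothetical and are not taken from the thesis. In practice the vectors would come from a trained word2vec or doc2vec model, and the thesis's actual rules may be more elaborate than a keyword-match count.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_keywords(category_vec, word_vecs, threshold):
    """Keep the words whose embedding is close enough to the category vector."""
    return {w for w, v in word_vecs.items()
            if cosine(category_vec, v) >= threshold}

def classify(tokens, rules, default="unknown"):
    """Assign the category whose keyword set matches the most tokens."""
    scores = {cat: sum(1 for t in tokens if t in kws)
              for cat, kws in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

# Toy 2-d embeddings (hypothetical values, not from the thesis).
word_vecs = {
    "oil": (0.9, 0.1), "barrel": (0.8, 0.2), "crude": (0.95, 0.05),
    "wheat": (0.1, 0.9), "harvest": (0.2, 0.8), "the": (0.5, 0.5),
}
# One hypothetical reference vector per category.
category_vecs = {"crude": (1.0, 0.0), "grain": (0.0, 1.0)}

# Informative words pass the threshold; "the" is dropped for both categories.
rules = {cat: select_keywords(vec, word_vecs, threshold=0.9)
         for cat, vec in category_vecs.items()}

print(rules)
print(classify("the crude oil barrel".split(), rules))
print(classify("wheat harvest".split(), rules))
```

Because the rules are plain keyword sets, they can be printed and edited by hand, which is the transparency the abstract attributes to the rule-based approach.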
The rule-based approach was implemented in Python, and its overall effectiveness was measured with the F-Measure score, error rate, and accuracy. On the Reuters-21578 dataset, the rule-based approach with the doc2vec embedding model (d2vRule) achieved 79% precision, 75% recall, 76.75% F-Measure, a 9.28% error rate and 90.72% accuracy. On the 20 Newsgroups dataset, the results were 76% precision, 66.64% recall, 70.98% F-Measure, a 9.93% error rate and 90.07% accuracy. In addition, when the machine learning algorithms J-RIPPER (JRip), One Rule (OneR) and ZeroR were applied to the Reuters-21578 dataset, we obtained F-Measure and accuracy metrics of 0.713 − 0.752, 0.506 − 0.598 and 0.219 − 0.39 for JRip, OneR and ZeroR, respectively. When those algorithms were applied to our dataset, the results agreed, and our algorithm (d2vRule) performed better than all three. Moreover, it provides a good classification process according to the evaluation metrics. When the word2vec model was used for embedding, the results were again evaluated using precision, recall and F-Measure. Finally, our rule-based approach performs better than the machine learning classifiers Naïve Bayes, Naïve Bayes Updateable, Rules.DecisionTable, Lazy.IBL and Lazy.IBK. To validate our rule-based classifier (w2vRule), we note that in a reference study the rule-based (RB) classifier has the highest accuracy, with 82.19% of instances correctly classified, while Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Bayes Net (BN) have accuracies of 81.72%, 81.49%, 81.19%, and 77.85%, respectively, and the Temporal Specificity Score (TSS) classifier correctly classified 77.19% of instances.
However, on the Reuters-21578 dataset, our word-to-vector rule-based classifier (w2vRule) achieved 73% precision, 77.71% recall, 75.09% F-Measure, a 10.09% error rate and 89.91% accuracy. It therefore achieved the best result when compared with previous rule-based and machine learning classifiers.
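For readers unfamiliar with the evaluation measures quoted above, the following sketch shows the standard definitions for a single category, computed from confusion-matrix counts. The function name `evaluate` and the counts in the example are hypothetical and are not data from the thesis. The abstract does not say how per-category scores were averaged into the reported figures, so this sketch covers one category only and is not expected to reproduce them. One relation that can be checked against the reported numbers is that the error rate is the complement of accuracy (for example, 9.28% + 90.72% = 100%).

```python
def evaluate(tp, fp, fn, tn):
    """Standard single-category measures from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives,
    false negatives, true negatives.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-Measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Error rate is the share of misclassified instances (1 - accuracy).
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }

# Hypothetical counts for one category (illustration only).
scores = evaluate(tp=30, fp=10, fn=10, tn=50)
for name, value in scores.items():
    print(f"{name}: {value:.2%}")
```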