AN INVESTIGATION OF THE IMPACT OF DIFFERENT DATA CLEANING TECHNIQUES ON METRIC RESULT QUALITY IN MACHINE LEARNING
Date
2022-06-14
Abstract
The enormous growth of data generated by e-commerce platforms and online
applications has made data analysis and processing a major challenge. It is now
common practice for e-commerce websites to let customers write reviews of products
they have purchased. Such reviews are a valuable source of information about these
products, and product reviews have become an important data source for the sentiment
analysis used by online retailers. The sheer volume of this data is itself a
challenge, and the datasets contain a variety of data issues. Typically, different
data mining techniques are applied to the data before it is deployed. This matters
especially in supervised machine learning, where models are trained on historical,
labelled data to predict unseen data, that is, data the model has never encountered
before.
In this thesis, we focus on a design-of-experiments study in machine learning
[1]. We applied Ronald Fisher's theories [2] systematically to find cause-effect
relationships. To carry out this experimental study, we chose supervised machine
learning classification algorithms applied to sentiment analysis, an approach within
natural language processing (NLP). Sentiment analysis is a popular way for
organizations to determine and categorize opinions about a product or service. It
involves the use of data mining, machine learning and artificial intelligence to mine
text for sentiment and subjective information [3]. The study used Multinomial Naïve
Bayes, Random Forest and Logistic Regression to analyze the impact of five
experimental groups (duplicate data, punctuation marks, stop words, lemmatizer,
TF-IDF transform) and compared them with one control group (no data cleaning
applied). The aim was to determine the impact of each experimental group on the three
models' efficiency and classification ratio and to explain the interesting
observations.
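The experimental groups described above can be pictured as independent cleaning steps, each compared against an untouched control. The following is only a minimal sketch of that idea, not the code used in the thesis: the function names, the `GROUPS` mapping, the sample reviews and the small stop-word list are all assumptions made for illustration. The lemmatizer and TF-IDF groups are left out because they would normally rely on an external library.

```python
import string

# Hypothetical stop-word list, for illustration only; the thesis does not
# state which list was used.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to"}


def drop_duplicates(reviews):
    """Remove exact duplicate reviews, keeping first-seen order."""
    return list(dict.fromkeys(reviews))


def strip_punctuation(text):
    """Delete ASCII punctuation characters from one review."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text):
    """Drop whitespace-separated tokens found in STOP_WORDS (case-insensitive)."""
    return " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)


def per_review(fn):
    """Lift a single-review function to a function over a list of reviews."""
    def apply(reviews):
        return [fn(r) for r in reviews]
    return apply


# One entry per group; "control" leaves the data untouched.
GROUPS = {
    "control": lambda reviews: list(reviews),
    "duplicates": drop_duplicates,
    "punctuation": per_review(strip_punctuation),
    "stop_words": per_review(remove_stop_words),
}

# Made-up sample reviews, not taken from the Amazon dataset.
sample = [
    "Great phone, love it!",
    "Great phone, love it!",
    "This is the worst case.",
]

for name, transform in GROUPS.items():
    print(name, transform(sample))
```

Keeping each group as a separate transform applied to the same raw input mirrors the control-versus-treatment design: any difference in a classifier's score can then be attributed to a single cleaning step.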
A simulation was run on 353 projects chosen randomly from the Amazon product
review dataset, covering twenty-four different categories. The dataset was collected
from Amazon.com by McAuley and Leskovec [4][5]. After the metric dataset was
collected, SPSS software was used for the analysis. A repeated-measures ANOVA was
performed to examine the research question, together with descriptive statistics of
the metrics used. The analysis shows that data cleaning affects the performance of
machine learning models in different ways. In some cases data cleaning had a positive
impact on Random Forest and a negative impact on Multinomial Naïve Bayes and Logistic
Regression; in other cases it had no impact at all. Overall, the experimental results
showed that the Random Forest classifier is more sensitive to data cleaning than the
Multinomial Naïve Bayes and Logistic Regression classifiers, both of which achieve
high classification scores on the uncleaned dataset. Moreover, the results showed
that data issues behave differently from one machine learning model to another, so
data quality issues cannot be treated as irrelevant data for every machine learning
algorithm. The analysis results are explained in detail in the results and discussion
chapters (Chapters 4 and 5).
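For readers unfamiliar with repeated-measures ANOVA, the sketch below computes the F statistic for a one-way within-subjects design, where each row is one dataset measured under every condition. It is a plain-Python illustration under stated assumptions, not the SPSS procedure used in the thesis: the function name `repeated_measures_f` and the score table are hypothetical, and no p-value or sphericity correction is computed.

```python
def repeated_measures_f(scores):
    """One-way repeated-measures ANOVA.

    scores[i][j] is the metric for subject i under condition j.
    Returns (F, df_conditions, df_error).
    """
    n = len(scores)       # number of subjects (e.g. datasets)
    k = len(scores[0])    # number of conditions (e.g. cleaning groups)
    grand_mean = sum(sum(row) for row in scores) / (n * k)

    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand_mean) ** 2 for m in subj_means)
    # Between-subject variation is removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Made-up scores: 3 subjects (rows) x 3 conditions (columns).
example_scores = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 5.0],
    [3.0, 5.0, 6.0],
]

f_stat, df_cond, df_error = repeated_measures_f(example_scores)
print(f_stat, df_cond, df_error)
```

Because every subject is measured under every condition, variation between subjects is subtracted from the error term, which is what distinguishes this design from an ordinary one-way ANOVA.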
Description
AN INVESTIGATION OF THE EFFECT OF DIFFERENT DATA CLEANING
TECHNIQUES ON RESULT METRICS IN MACHINE LEARNING
(English translation of the Turkish title)
Keywords
computer engineering