Source Themes

Effect of Missing Data Treatment on the Predictive Accuracy of C4.5 Classifier

Missing data is a common problem in machine learning applications. Missing values affect both the performance of analysis tools and the quality of the decisions drawn from them. This research analyzes the impact of four missing data treatment methods on the predictive accuracy of the C4.5 decision tree algorithm. It also investigates the imputation accuracy of each imputation method using a single dataset with missing values present in a single variable. The work was performed under three missing data assumptions, namely Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), with three missingness rates: 5%, 10%, and 15%. The methods used to treat the missing data are list-wise deletion, mean/mode imputation, k-nearest neighbor imputation, and decision tree imputation. The results of the experiments showed that the C4.5 classifier achieved better performance under the MCAR assumption, while mean/mode imputation yielded the highest C4.5 predictive accuracy under the MAR and MNAR assumptions. The k-nearest neighbor method obtained the most accurate imputation result under the MCAR assumption, whereas mean/mode imputation was the most accurate method under the MAR assumption. On the other hand, the lowest imputation accuracy levels were observed under the MNAR assumption and were attributed to the mean/mode imputation method.
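To make the treatments named above concrete, the following is a minimal pure-Python sketch of MCAR masking, list-wise deletion, mean/mode imputation, and k-nearest neighbor imputation on a single variable. It is an illustration under stated assumptions, not the study's implementation: all function names are hypothetical, the distance is plain squared Euclidean distance over the other (numeric) columns, and decision tree imputation and the MAR and MNAR mechanisms are not shown.

```python
import random
from collections import Counter


def mask_mcar(values, rate, seed=0):
    """Hide a fraction `rate` of entries uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(values)]


def listwise_delete(rows, col):
    """Drop every row whose value in column `col` is missing."""
    return [r for r in rows if r[col] is not None]


def mean_impute(values):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def mode_impute(values):
    """Replace missing categorical values with the most frequent observed one."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def knn_impute(rows, col, k=3):
    """Fill a missing rows[i][col] with the mean of that column over the
    k nearest complete rows, measuring distance on the other columns."""
    complete = [r for r in rows if r[col] is not None]
    filled = []
    for r in rows:
        new = list(r)
        if r[col] is None:
            def dist(other, r=r):
                return sum((a - b) ** 2
                           for j, (a, b) in enumerate(zip(r, other))
                           if j != col)
            nearest = sorted(complete, key=dist)[:k]
            new[col] = sum(n[col] for n in nearest) / len(nearest)
        filled.append(new)
    return filled
```

Imputation accuracy in a setup like this can be assessed by masking known values, imputing them, and comparing the imputed values with the originals; the abstract does not state which accuracy measure the study used.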

Attitudes Evaluation Toward COVID-19 Pandemic: An Application of Twitter Sentiment Analysis and Latent Dirichlet Allocation

The USA is among the countries most affected by COVID-19, accounting for the largest proportion of cases globally. This research investigates the polarity of American community opinion toward the outbreak of the virus in the US, as expressed on the Twitter social media platform. Further, a topic modeling approach was employed to gain insight into the topics most discussed by the US community during the pandemic. A total of 1,385,469 tweets were collected for the study over the period from early February to late April. Unsupervised approaches were employed in the analysis, including the VADER lexicon for sentiment analysis and Latent Dirichlet Allocation (LDA) for topic modeling. The main findings showed that the largest share of the collected tweets had positive sentiment, followed by negative and neutral. Further, monthly temporal sentiment analysis, compared against COVID-19 case counts, showed how tweet polarity changed over time from early February to late April. Overall, the polarity of the tweets was negative before the virus outbreak and positive during the outbreak. In addition, LDA analysis showed that the discussed topics were oriented toward the economy, politics, and the spread of the virus outside the US in February, while the March and April topics were oriented toward prevention of the virus and its spread inside the USA.
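The monthly polarity comparison described above can be sketched as follows. This is a hedged illustration only: it assumes each tweet has already been given a VADER compound score in [-1, 1], and it uses the cut-offs commonly suggested in the VADER documentation (at or above 0.05 positive, at or below -0.05 negative, otherwise neutral). The abstract does not state which thresholds the study applied, the function names are hypothetical, and the scores in the usage lines are invented toy values, not study data.

```python
from collections import Counter, defaultdict

# Conventional VADER cut-offs; assumed here, not confirmed by the source.
POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05
LABELS = ("positive", "negative", "neutral")


def label_from_compound(compound):
    """Map a VADER compound score to a polarity label."""
    if compound >= POS_THRESHOLD:
        return "positive"
    if compound <= NEG_THRESHOLD:
        return "negative"
    return "neutral"


def monthly_shares(scored_tweets):
    """Given (month, compound) pairs, return {month: {label: share}}."""
    counts = defaultdict(Counter)
    for month, compound in scored_tweets:
        counts[month][label_from_compound(compound)] += 1
    return {
        month: {lab: c[lab] / sum(c.values()) for lab in LABELS}
        for month, c in counts.items()
    }


# Toy scores for illustration only.
toy = [("Feb", -0.4), ("Feb", -0.2), ("Feb", 0.3), ("Mar", 0.6), ("Mar", 0.0)]
shares = monthly_shares(toy)
dominant = {m: max(s, key=s.get) for m, s in shares.items()}
```

Comparing the dominant label per month is one simple way to express a shift such as "negative before the outbreak, positive during it"; the topic modeling step with LDA is not covered by this sketch.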

Tool Wear Prediction in Computer Numerical Control Milling Operations via Machine Learning

Tool life and tool wear contribute significantly to any machining activity and directly affect the quality of the machined part, the performance of the machining device, and production rates and costs. This research investigates the performance of six supervised learning algorithms in predicting the cutting tool condition in Computer Numerical Control (CNC) milling operations using a novel form of CNC internal data that eliminates the need to install sensory devices during the machining process for data acquisition. The supervised learning algorithms employed are Decision Tree, Artificial Neural Network, Support Vector Machine, k-Nearest Neighbor, Logistic Regression, and Naive Bayes. The results showed that Decision Tree, Artificial Neural Network, k-Nearest Neighbor, and Support Vector Machine achieved overall classification accuracy greater than 85%, while Logistic Regression and Naive Bayes achieved overall classification accuracy of 57.1% and 60.1%, respectively. Further, Naive Bayes was able to correctly predict the cutting tool as worn from the test set despite its lower overall accuracy. In addition, feature importances and decision rules were extracted from the Decision Tree model, as it achieved the highest overall accuracy, to investigate the most important features that influence the tool condition. The results showed that only three features have the highest influence on the tool condition, while the decision rules were used to identify the values of these features at which the cutting tools become worn.
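The observation that a classifier can have low overall accuracy yet still identify worn tools well reflects the difference between overall accuracy and per-class recall. The sketch below illustrates that distinction under stated assumptions: the helper names are hypothetical, and the labels and predictions are invented toy values rather than the study's test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted as such."""
    predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_class) / len(predictions_for_class)


# Toy example: every worn tool is caught, but half of the unworn
# tools are also flagged as worn, which lowers overall accuracy.
y_true = ["worn"] * 4 + ["unworn"] * 6
y_pred = ["worn"] * 4 + ["worn"] * 3 + ["unworn"] * 3

overall = accuracy(y_true, y_pred)             # 7 of 10 correct
worn_recall = recall(y_true, y_pred, "worn")   # 4 of 4 worn tools found
unworn_recall = recall(y_true, y_pred, "unworn")
```

When a missed worn tool is costlier than a false alarm, recall on the worn class may matter more than overall accuracy in choosing a model.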