The task is to build a classifier that assigns a short text (a press note) to one of five thematic groups: business, entertainment, politics, sport, or tech. Working with the provided dataset of texts, develop a representation, a feature extraction method, and a final classifier that correctly labels new text samples.
The work should be performed in two stages, which can be repeated iteratively: (1) working out a data representation and a set of features to be used in the classification, and writing a program that converts the set of texts to a set of feature vectors, and (2) building and optimizing the classifier.
Popular schemes for representing variable-length texts for the purpose of automatic classification include BoW (Bag-of-Words), TF-IDF (Term Frequency-Inverse Document Frequency), and Word2Vec, optionally combined with an N-gram model (e.g. bigrams).
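The count-based schemes named above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: tokens are obtained by lowercasing and splitting on whitespace, TF-IDF uses the textbook form tf * log(N / df), and the function names and the three sample sentences are invented for the example rather than taken from the dataset.

```python
import math
from collections import Counter

def tokenize(text, n=1):
    """Lowercase, split on whitespace, and return n-grams joined by a space."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_words(docs, n=1):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d, n) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for term, count in Counter(toks).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors

def tf_idf(vectors):
    """Reweight raw counts as (count / document length) * log(N / document frequency)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    weighted = []
    for v in vectors:
        total = sum(v)
        weighted.append([
            (v[j] / total) * math.log(n_docs / df[j]) if total else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Invented sample sentences, not part of the provided dataset.
docs = ["shares rise as profits rise", "team wins cup final", "profits fall"]
vocab, counts = bag_of_words(docs)            # unigram Bag-of-Words
weights = tf_idf(counts)                      # TF-IDF weights for the same vocabulary
bigram_vocab, bigram_counts = bag_of_words(docs, n=2)  # bigram features
```

Library implementations, such as TfidfVectorizer in scikit-learn, use a smoothed and normalized variant of this formula, so their values will differ from the ones computed here.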
Any machine learning algorithm can be used to construct the classifier. It is worth doing a few experiments, starting with some statistical analysis, and then performing an initial machine learning experiment using the simplest data representation scheme and the Naive Bayes classifier. The results of this experiment can then serve as the reference against which more advanced techniques are evaluated. These can include other text representation methods as well as other machine learning algorithms, such as decision trees, nearest neighbors, SVM, neural networks, any ensemble learning approaches, etc.
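As a rough illustration of such a reference experiment, the sketch below implements a multinomial Naive Bayes classifier over raw word counts with add-one (Laplace) smoothing. The class name, the whitespace tokenization, and the four training sentences are assumptions made for the example; a real experiment would use the provided dataset and most likely a library implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)             # log prior
            for w in text.lower().split():
                if w in self.vocab:                  # ignore words never seen in training
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy data covering only two of the five classes.
train_texts = [
    "shares and profits rise",
    "bank profits fall",
    "team wins the cup",
    "striker scores in final",
]
train_labels = ["business", "business", "sport", "sport"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("profits rise at the bank"))
print(model.predict("the team scores"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once the texts are longer than these toy sentences.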
The results of each experiment should be evaluated using appropriate error measures, including (but not limited to) accuracy computed both on the training set and by cross-validation; comparing the two is the simplest way to detect overfitting.
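The comparison of training accuracy with cross-validated accuracy can be sketched as follows. The helper names are hypothetical, the folds are formed by taking every k-th sample without shuffling or stratification (a simplification), and the "memorizer" is a deliberately overfitting stand-in model that only serves to make the gap visible.

```python
from collections import Counter

def accuracy(predict, texts, labels):
    """Fraction of texts whose predicted label equals the true label."""
    hits = sum(predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)

def cross_val_accuracy(fit, texts, labels, k=5):
    """Mean accuracy over k folds; fit(texts, labels) must return a predict function."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(texts), k))   # every k-th sample, offset by fold
        pairs = list(enumerate(zip(texts, labels)))
        train = [p for i, p in pairs if i not in test_idx]
        test = [p for i, p in pairs if i in test_idx]
        predict = fit([t for t, _ in train], [y for _, y in train])
        scores.append(accuracy(predict, [t for t, _ in test], [y for _, y in test]))
    return sum(scores) / k

def fit_memorizer(texts, labels):
    """Overfitting stand-in: memorizes training texts, otherwise guesses the majority class."""
    table = dict(zip(texts, labels))
    majority = Counter(labels).most_common(1)[0][0]
    return lambda text: table.get(text, majority)

# Invented toy data, not part of the provided dataset.
texts = [
    "profits rise", "shares fall", "bank cuts rates", "team wins cup",
    "oil prices jump", "merger approved", "striker scores twice", "markets rally",
]
labels = ["business", "business", "business", "sport",
          "business", "business", "sport", "business"]

train_acc = accuracy(fit_memorizer(texts, labels), texts, labels)
cv_acc = cross_val_accuracy(fit_memorizer, texts, labels, k=4)
print(f"training accuracy: {train_acc:.2f}, cross-validation accuracy: {cv_acc:.2f}")
```

A model that scores perfectly on the data it was trained on but noticeably worse on held-out folds, as the memorizer does here, is overfitting. In a real experiment the folds should be shuffled and preferably stratified by class.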
Optimization can focus on selecting the best machine learning algorithm, tuning its parameters, or both, as well as on trying ensemble learning approaches by building hybrid classifiers. It is also possible to go back to the previous step, the representation, and attempt to modify it to achieve better classification.
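One simple way to build a hybrid classifier is hard voting, where several base classifiers each predict a label and the most frequent label wins. The sketch below assumes each base classifier is a plain function from text to label; the keyword rules are hypothetical stand-ins for trained models, chosen only to keep the example self-contained.

```python
from collections import Counter

def majority_vote(classifiers, text):
    """Return the label predicted by most classifiers (ties go to the earliest vote)."""
    votes = [clf(text) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

def by_keyword(keyword, label, default):
    """Hypothetical stand-in for a trained model: a single keyword rule."""
    return lambda text: label if keyword in text.lower().split() else default

# Three toy base classifiers that disagree on some inputs.
ensemble = [
    by_keyword("goal", "sport", "business"),
    by_keyword("match", "sport", "business"),
    by_keyword("shares", "business", "sport"),
]

print(majority_vote(ensemble, "shares jump after match sponsorship deal"))
print(majority_vote(ensemble, "late goal decides the match"))
```

With five target classes, ties between base classifiers are possible, so an odd number of voters does not guarantee a clear majority; weighting votes by each model's cross-validated accuracy is one common alternative.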
Please write up your work in the form of a report describing all important steps and the results obtained. Additionally, please prepare a development package that allows the operation of your classifier to be reproduced.
The report should have the following general structure:
The main criteria for assessing the report are brevity, clarity, and readability of the description, as well as precision and completeness. Each step of the project should be briefly justified.
Report penalties:
The development package should correspond to the best classifier model found during the project, and:
Natural Language Processing with Python (Steven Bird, Ewan Klein, and Edward Loper)
https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://scikit-learn.org/stable/index.html
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/
https://monkeylearn.com/text-classification-support-vector-machines-svm/
https://medium.com/analytics-vidhya
https://www.youtube.com/watch?v=Zt83JnjD8zg