
										Wojciech Czaja


Suggested Form of a Final Project on Machine Learning
======================================================


Proper document classification is essential to the efficient management and retrieval of knowledge. Document classifications are typically assigned by humans who read the documents and are knowledgeable in the subject matter. Many large organizations create and examine huge volumes of textual information, so some form of automation of the information flow is required. An example of an important problem in document retrieval is determining whether a document is relevant to a given query.

In this project you will start with a database of documents whose classifications are provided a priori, and your goal will be to devise a method for correctly classifying new incoming texts. This process should involve analysis of the provided classifiers, as well as your own input into how to properly classify documents.

Aspects to consider:

- An automated decomposition of textual data into separate words.

- The dictionary creation process. This can take the form of, e.g., defining additional classes, subclasses, or other attributes based on the dataset you are working with.

- The representation step for mapping each individual document into a training
sample using the above dictionary, and associating it with a label that
identifies its category.

- An induction step for finding patterns that distinguish categories.

- An evaluation step for choosing the best solution, based on minimizing the
classification error.
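
The steps above can be sketched end to end on toy data. This is only a minimal illustration under stated assumptions, not a prescribed solution: the function names, the four toy training sentences, the labels "grain" and "money", and the choice of a multinomial naive Bayes classifier with add-one smoothing as the induction step are all assumptions made for the example. It also assumes one label per document, whereas real news stories may carry several topics.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Decompose raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(documents, min_count=1):
    """Map each word seen at least min_count times to a vector index."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    vocabulary = sorted(w for w, c in counts.items() if c >= min_count)
    return {word: index for index, word in enumerate(vocabulary)}

def represent(text, dictionary):
    """Map a document to a bag-of-words count vector over the dictionary."""
    vector = [0] * len(dictionary)
    for word in tokenize(text):
        if word in dictionary:  # words outside the dictionary are ignored
            vector[dictionary[word]] += 1
    return vector

def train_naive_bayes(vectors, labels, alpha=1.0):
    """Induction step (assumed technique): multinomial naive Bayes."""
    n_words = len(vectors[0])
    class_docs = Counter(labels)
    word_totals = defaultdict(lambda: [0] * n_words)
    for vector, label in zip(vectors, labels):
        totals = word_totals[label]
        for i, count in enumerate(vector):
            totals[i] += count
    model = {}
    for label, n_docs in class_docs.items():
        totals = word_totals[label]
        denom = sum(totals) + alpha * n_words  # additive smoothing
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, vector):
    """Return the label with the highest posterior log score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(vector, log_likelihood))
    return max(model, key=score)

def error_rate(model, vectors, labels):
    """Evaluation step: fraction of misclassified documents."""
    wrong = sum(predict(model, v) != y for v, y in zip(vectors, labels))
    return wrong / len(labels)

# Toy data, invented for illustration only.
train_texts = [
    "wheat harvest rises as grain exports grow",
    "grain prices fall after wheat surplus",
    "bank raises interest rates on loans",
    "central bank cuts interest rates",
]
train_labels = ["grain", "grain", "money", "money"]
test_texts = ["wheat and grain shipments", "interest rates and bank loans"]
test_labels = ["grain", "money"]

dictionary = build_dictionary(train_texts)
X_train = [represent(t, dictionary) for t in train_texts]
X_test = [represent(t, dictionary) for t in test_texts]
model = train_naive_bayes(X_train, train_labels)
print("test error:", error_rate(model, X_test, test_labels))
```

Comparing candidate solutions then amounts to swapping the dictionary construction or the induction step and keeping whichever yields the lowest error on held-out data.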

The proposed database is the Reuters-21578 Text Categorization Collection Data Set:

http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

There are 21,450 news stories from the year 1987. All stories dated after April 7 are to be used as independent test cases, and the remaining stories as training cases. The data consist of 14,704 training cases and 6,746 test cases. There are 135 topics of interest.
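
The date-based split described above might look like the following sketch. The record layout (a dictionary with "date", "text", and "topics" keys) and the three sample stories are hypothetical; parsing the actual collection files into such records is left out.

```python
from datetime import date

# Stories dated after this day are held out as test cases.
SPLIT_DATE = date(1987, 4, 7)

def split_by_date(stories, split_date=SPLIT_DATE):
    """Split stories into (train, test) by their date field."""
    train = [s for s in stories if s["date"] <= split_date]
    test = [s for s in stories if s["date"] > split_date]
    return train, test

# Hypothetical records, invented for illustration only.
stories = [
    {"date": date(1987, 2, 26), "text": "grain exports rise", "topics": ["grain"]},
    {"date": date(1987, 4, 7), "text": "bank cuts rates", "topics": ["interest"]},
    {"date": date(1987, 4, 8), "text": "wheat surplus grows", "topics": ["wheat"]},
]

train, test = split_by_date(stories)
print(len(train), "training cases,", len(test), "test cases")

# Sanity check on the counts quoted above.
assert 14704 + 6746 == 21450
```

Splitting by date rather than at random keeps the test cases strictly later than the training cases, which mirrors how a classifier would be applied to new incoming texts.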

Alternatively, you can work with your own preferred document dataset after obtaining approval from the class head.