**Machine Learning Natural Language Text Documents**

**Overview of the Course**

This class covers machine learning applied to natural language text documents. We will cover the use of statistical algorithms for accomplishing machine learning tasks on text, rather than more traditional rule-based approaches (semantics, parsing, etc.). We'll start with an introduction to the subject matter, a comparison of statistical techniques with semantic approaches, definitions of the problems in text mining, and simple text manipulations. We'll then cover various algorithms for standard text mining problems, such as indexing, automatic classification (e.g. spam filtering), part-of-speech identification, topic modeling, and sentiment extraction.

We'll use open literature for the reading in the class and hand out those references as we go along.

**Prerequisites**

The class will employ undergraduate-level probability, calculus, and linear algebra (e.g. peruse the appendices in "Introduction to Data Mining" by Tan et al., or the references Linear Algebra and Probability Theory). You'll need some familiarity with basic machine learning algorithms (regression, logistic regression, regularized regression, SVM, ensemble methods, clustering, etc.). You can find coverage of these methods in Tan's book. If you have taken the Machine Learning 101 and 102 classes, you are well prepared for this course.

Participants should be familiar with R or be willing to pick R up outside of class. We will hand out R code for most of our examples, but we won't spend time going through introductory material on R. Come to the first class with R installed on your computer; it is available from http://cran.r-project.org/. For review, R references are here: References for R, Reference for R Comments, More R references. To integrate R with Eclipse, click here.

To get the most out of the class, participants will need to work through the homework assignments.

**General Sequence of Classes:**

**Machine Learning 101:** Supervised Learning

Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

**Machine Learning 102:** Unsupervised Learning and Fault Detection

Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

**Machine Learning 201:** Advanced Regression Techniques, Generalized Linear Models, and Generalized Additive Models

Text: "The Elements of Statistical Learning - Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

**Machine Learning 202:** Collaborative Filtering, Bayesian Belief Networks, and Advanced Trees

Text: "The Elements of Statistical Learning - Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

**Machine Learning Big Data:** Adaptation and execution of machine learning algorithms in the MapReduce framework

Mike Bowles
