Machine Learning Natural Language Text Documents
Overview of the Course
This class covers machine learning applied to natural language text documents. We will cover the use of statistical algorithms for accomplishing machine learning tasks on text, rather than more traditional rule-based approaches such as semantics and parsing. We'll start with an introduction to the subject matter, a comparison of statistical techniques with semantic approaches, definitions of the problems in text mining, and simple text manipulations. We'll then cover various algorithms for standard text mining problems, such as indexing, automatic classification (e.g., spam filtering), part-of-speech identification, topic modeling, and sentiment extraction.
We'll use open literature for the reading in the class and hand out those references as we go along.
Prerequisites
The class will employ undergraduate-level probability, calculus, and linear algebra (e.g., peruse the appendices in "Introduction to Data Mining" by Tan et al., or review Linear Algebra and Probability Theory). You'll need some familiarity with basic machine learning algorithms (regression, logistic regression, regularized regression, SVM, ensemble methods, clustering, etc.). You can find coverage of these methods in Tan's book. If you have taken the Machine Learning 101 and 102 classes, you are well prepared for this course.
Participants should be familiar with R or be willing to pick R up outside of class. We will hand out R code for most of our examples, but we won't spend time going through introductory material on R. Come to the first class with R installed on your computer (http://cran.r-project.org/). For your review, R references are here: References for R, Reference for R Comments, More R references. To integrate R with Eclipse click here.
To get the most out of the class, participants will need to work through the homework assignments.
General Sequence of Classes:
Machine Learning 101: Supervised learning
Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar
Machine Learning 102: Unsupervised Learning and Fault Detection
Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar
Machine Learning 201: Advanced Regression Techniques, Generalized Linear Models, and Generalized Additive Models
Text: "The Elements of Statistical Learning - Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Machine Learning 202: Collaborative Filtering, Bayesian Belief Networks, and Advanced Trees
Text: "The Elements of Statistical Learning - Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Machine Learning Big Data: Adaptation and execution of machine learning algorithms in the MapReduce framework
Mike Bowles