

Page history last edited by mike@mbowles.com 7 years ago

Machine Learning Natural Language Text Documents

 

Overview of the Course

This class covers machine learning applied to natural language text documents.  We will cover the use of statistical algorithms for accomplishing machine learning tasks on text, rather than more traditional rule-based semantics, parsing, etc.  We'll start with an introduction to the subject matter, a comparison of statistical techniques with semantic approaches, a definition of the problems in text mining, and simple text manipulations.  We'll then cover various algorithms for standard text mining problems, such as indexing, automatic classification (e.g. spam filtering), part-of-speech identification, topic modeling, sentiment extraction, etc. 

 

We'll use open literature for the reading in the class and hand out those references as we go along. 

 

Prerequisites

The class will employ undergraduate-level probability, calculus and linear algebra (e.g. peruse the appendices in "Introduction to Data Mining" by Tan et al., or the Linear Algebra and Probability Theory references).  You'll need some familiarity with basic machine learning algorithms (regression, logistic regression, regularized regression, SVM, ensemble methods, clustering, etc.).  You can find coverage of these methods in Tan's book.  If you have taken the Machine Learning 101 and 102 classes, you are well prepared for this course.  

 

Participants should be familiar with R or be willing to pick R up outside of class.  We will hand out R code for most of our examples, but we won't spend time going through introductory material on R.  Come to the first class with R installed on your computer (http://cran.r-project.org/).  For your review, R references are here: References for R,  Reference for R Comments,  More R references.  To integrate R with Eclipse, click here.

 

To get the most out of the class, participants will need to work through the homework assignments. 

 

General Sequence of Classes:

Machine Learning 101:   Supervised learning

Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

Machine Learning 102:   Unsupervised Learning and Fault Detection

Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

 

Machine Learning 201:    Advanced Regression Techniques, Generalized Linear Models, and Generalized Additive Models    

Text:  "The Elements of Statistical Learning - Data Mining, Inference, and Prediction"  by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

 

Machine Learning 202:   Collaborative Filtering, Bayesian Belief Networks, and Advanced Trees

Text:  "The Elements of Statistical Learning - Data Mining, Inference, and Prediction"  by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

 

Machine Learning Big Data:  Adaptation and execution of machine learning algorithms in the MapReduce framework.

 

Mike Bowles

 

Week | Topic | References
1st Week | Introduction and Basic Text Manipulations | Introduction, LatentSemanticIndex, Stemmer, 2ndLectureNotes
2nd Week | Text Classification | 3rdLecture, MLText-HW1.txt, MLTextHW2.txt
3rd Week | Topic Modeling | LDA, 4thLectureNotes, 5thLectureNotes
4th Week | Parsing - POS, Sentences, Chunking | POSTags, 6thLectureNotes, MLText-HW3, 7thLecture, gibbsSamplerNotes, 8thLectureNotes
5th Week | Machine Translation | 9thLectureNotes, 10thLectureNotes

 

 
