File history uploaded by mike@mbowles.com 12 years, 8 months ago
MLText HW2

1.  In class we combined the corpus containing articles on acquisitions with the one containing articles on crude oil prices.  Then we clustered the documents into two clusters using kmeans clustering.  In class, we got something like 60% to 65% clustering accuracy - kinda lackluster.  What could we do to improve that?  Here are some ideas.
a.) try various combinations of stemming, stopword removal, weighting and frequency cutoff in the creation of the document-term matrix.  See if that helps.  
b.) try a Gaussian mixture model clustering algo (mclust) so that the clusters aren't forced to be spherical.
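
To make idea (a) concrete, here is a minimal sketch of a document-term matrix builder with switches for stopword removal, a document-frequency cutoff and tf-idf weighting, plus a helper that scores a two-cluster result against the known topics. It is written in Python with only the standard library rather than the R packages used in class, purely for illustration. The tiny stopword list, the function names and the parameter names are all assumptions, and stemming is left out.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "of", "to", "in", "and", "for", "on"}

def tokenize(text, remove_stopwords=True):
    """Lowercase, split on whitespace, strip punctuation, optionally drop stopwords."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def doc_term_matrix(docs, remove_stopwords=True, min_doc_freq=1, tfidf=True):
    """Return (vocab, rows) where rows[i][j] is the weight of term j in document i."""
    tokenized = [tokenize(d, remove_stopwords) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Frequency cutoff: keep terms that occur in at least min_doc_freq documents.
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    n = len(docs)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            weight = counts[term]
            if tfidf:
                # Plain tf * log(N / df); a term found in every document gets weight 0.
                weight *= math.log(n / doc_freq[term])
            row.append(weight)
        rows.append(row)
    return vocab, rows

def two_cluster_accuracy(true_labels, cluster_ids):
    """Accuracy for two clusters coded 0/1, taking the better of the two labelings."""
    matches = sum(1 for t, c in zip(true_labels, cluster_ids) if t == c)
    n = len(true_labels)
    return max(matches, n - matches) / n
```

Because cluster numbers are arbitrary, the accuracy helper scores both possible matchings of cluster ids to topics and keeps the better one, so a result that is merely relabeled is not penalized. Looping over the switches in doc_term_matrix and re-clustering each time is one way to see which preprocessing choices move the accuracy.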

2.  Since we know the topic of the articles, we could use a supervised learning technique.  Try your favorite (gbm, glmnet for logistic regression, svm from package e1071, etc.).  Did you get better performance than with clustering?
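
As a rough illustration of the supervised route, the sketch below fits an L2-penalized logistic regression by batch gradient descent on the rows of a document-term matrix. It is a plain-Python stand-in, not glmnet: there is no regularization path or cross-validation, and the function names, learning rate, penalty and epoch count are assumed values chosen for a toy example.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient descent on L2-penalized logistic loss; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        # Average the gradient over documents; the penalty applies to weights only.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Predict class 1 when the fitted probability is at least 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up counts of the terms ("acquisition", "oil"); label 0 = acquisitions, 1 = crude oil.
X_train = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]]
y_train = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X_train, y_train)
train_acc = accuracy(y_train, predict(X_train, w, b))
```

For a fair comparison with the clustering result, the accuracy should be measured on articles held out from training rather than on the training set itself.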

3.  Try some other supervised learning algorithms on the spam filtering problem.  Can you get an improvement?
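
One candidate to try here is multinomial naive Bayes, a common baseline for spam filtering. This page does not say which algorithm was used in class, so treat the choice as an assumption. The sketch below is plain Python with add-one (Laplace) smoothing; the function names and the toy messages are made up for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class names."""
    classes = sorted(set(labels))
    vocab = {t for d in docs for t in d}
    counts = {c: Counter() for c in classes}
    n_docs = Counter(labels)
    for tokens, c in zip(docs, labels):
        counts[c].update(tokens)
    model = {}
    for c in classes:
        # Smoothed denominator: token count in class c plus alpha per vocabulary term.
        total = sum(counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(n_docs[c] / len(docs)),
            "log_lik": {t: math.log((counts[c][t] + alpha) / total) for t in vocab},
        }
    return model

def classify_nb(model, tokens):
    """Return the class with the highest log posterior score; unseen tokens are skipped."""
    def score(c):
        m = model[c]
        return m["log_prior"] + sum(m["log_lik"][t] for t in tokens if t in m["log_lik"])
    return max(model, key=score)

# Made-up training messages, already tokenized.
train_docs = [
    ["win", "cash", "now"],
    ["cheap", "cash", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]
nb_model = train_nb(train_docs, train_labels)
```

Calling classify_nb(nb_model, tokens) on held-out messages and comparing the error rate with the in-class result is one way to answer the question.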
