File history uploaded by mike@mbowles.com 12 years, 8 months ago
Homework for Machine Learning Text

Use the tools we reviewed in class (R, tm package, etc.) to replicate the SIAM article titles example from the paper by Berry et al. Enter the full article titles into a document corpus. Remove punctuation, convert to lower case, remove stop words, generate a term-document matrix, and then remove terms that show up in only one document. Use the R svd function to create a rank-two approximation to the term-document matrix, and plot the documents in this two-dimensional space to replicate the graph from the paper.
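The steps above can be sketched in R roughly as follows. This is a minimal sketch, not a reference solution: the titles are hypothetical placeholders (substitute the full titles from the paper), the stop-word list is tm's built-in English list, and scaling the document coordinates by the singular values is one common convention that may need adjusting to match the paper's graph.

```r
library(tm)

# Hypothetical placeholder titles; replace with the full titles from the paper.
titles <- c(
  "Linear algebra for applied differential equations",
  "Numerical methods for differential equations",
  "Applied linear algebra and numerical methods",
  "Introduction to nonlinear systems and chaos",
  "Chaos in nonlinear dynamical systems"
)

# Build the corpus and clean it: lower case, no punctuation, no stop words.
corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix: rows are terms, columns are documents.
tdm <- as.matrix(TermDocumentMatrix(corpus))

# Keep only terms that occur in more than one document.
tdm <- tdm[rowSums(tdm > 0) > 1, , drop = FALSE]

# Rank-two approximation from the leading two singular triplets.
s <- svd(tdm)
k <- 2
tdm2 <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Document coordinates in the two-dimensional space
# (assumption: right singular vectors scaled by the singular values).
doc_coords <- s$v[, 1:k] %*% diag(s$d[1:k])

plot(doc_coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(doc_coords, labels = seq_along(titles))
```

Singular vectors are determined only up to sign, so the plot may come out mirrored along either axis relative to the paper's graph.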

Then try the following alterations to the process, both separately and in combination, to see how they change the way the documents plot.
1.  Use the Porter algorithmic stemmer (after removing stop words).
2.  Use tf-idf weighting to generate the term-document matrix.
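One way to try the two alterations separately and together is to wrap the pipeline in a function with a switch for each. The sketch below makes several assumptions: build_tdm and its arguments are hypothetical names introduced here, the titles are placeholders, and tm's stemDocument (which relies on the SnowballC package) is used as a stand-in for the Porter stemmer; check that its output is acceptable for the assignment.

```r
library(tm)
library(SnowballC)  # stemDocument in tm relies on this package

# Hypothetical helper: builds the filtered term-document matrix,
# optionally with stemming and/or tf-idf weighting.
build_tdm <- function(titles, stem = FALSE, tfidf = FALSE) {
  corpus <- VCorpus(VectorSource(titles))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Stemming is applied after stop-word removal, as the assignment asks.
  if (stem) corpus <- tm_map(corpus, stemDocument)
  weighting <- if (tfidf) weightTfIdf else weightTf
  m <- as.matrix(TermDocumentMatrix(corpus,
                                    control = list(weighting = weighting)))
  # Keep terms with a nonzero weight in more than one document.
  # Under tf-idf, a term present in every document gets weight zero
  # and is therefore dropped here as well.
  m[rowSums(m > 0) > 1, , drop = FALSE]
}

# Hypothetical placeholder titles; replace with the titles from the paper.
titles <- c(
  "Linear equations and matrices",
  "Solving a linear equation",
  "Matrix methods for differential equations",
  "Methods for nonlinear differential systems"
)

# All four combinations of the two alterations.
variants <- list(
  plain      = build_tdm(titles),
  stemmed    = build_tdm(titles, stem = TRUE),
  tfidf      = build_tdm(titles, tfidf = TRUE),
  stem_tfidf = build_tdm(titles, stem = TRUE, tfidf = TRUE)
)
```

Each matrix in variants can then be passed to svd and plotted in the same way as the unaltered matrix, which makes the effect of each change easy to compare side by side.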
