Friday, December 12, 2014

Competency 8.5

Competency 8.5: Examine texts from different categories and notice characteristics they might want to include in feature space for models and then use this reasoning to start to make tentative decisions about what kinds of features to include in their models.

I tried the Bazaar activity in Prosolo (by myself, since I wasn't matched with a teammate) to explore advanced feature extraction in LightSIDE and see which features improve performance. I first used the sentiment_sentences data set and configured stretchy patterns using the pre-defined categories positive and negative. Performance improved substantially when going from unigrams only to stretchy patterns.

To look at the details, I used the Explore Results pane to analyse the results.

The more indicative words that occurred more often had stronger weights (e.g. dull, too, enjoyable). Commonly occurring words such as "a" and "of", as well as punctuation, had small or zero feature weights. The stretchy patterns helped predict many positive and negative instances correctly by taking into account the position and structure of the preceding and following words. Examples below:
STRONG-POS [GAP] , but --> the movie is loaded with good intentions , but --> neg
one [GAP] the STRONG-POS --> one of the best of the year --> pos
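
To make the two examples above concrete, here is a minimal Python sketch of how a stretchy pattern could be matched against a sentence. It is not LightSIDE's implementation: the tiny STRONG-POS word list, the function names, and the rule that a [GAP] absorbs between 1 and max_gap tokens are all assumptions made for illustration.

```python
# Hypothetical category word list; LightSIDE's real category files are larger.
CATEGORIES = {"STRONG-POS": {"good", "best", "enjoyable"}}
GAP = "[GAP]"


def token_matches(symbol, token):
    """A pattern symbol matches a token literally or through its category list."""
    if symbol in CATEGORIES:
        return token in CATEGORIES[symbol]
    return symbol == token


def matches_at(pattern, tokens, start, max_gap=3):
    """Return True if the pattern matches the tokens beginning exactly at start."""
    if not pattern:
        return True
    head, rest = pattern[0], pattern[1:]
    if head == GAP:
        # Assumption: a gap absorbs between 1 and max_gap tokens.
        return any(
            matches_at(rest, tokens, start + k, max_gap)
            for k in range(1, max_gap + 1)
            if start + k <= len(tokens)
        )
    if start < len(tokens) and token_matches(head, tokens[start]):
        return matches_at(rest, tokens, start + 1, max_gap)
    return False


def has_pattern(pattern, text, max_gap=3):
    """Check whether the pattern occurs anywhere in the whitespace-tokenized text."""
    tokens = text.lower().split()
    return any(matches_at(pattern, tokens, i, max_gap) for i in range(len(tokens)))


print(has_pattern(["STRONG-POS", GAP, ",", "but"],
                  "the movie is loaded with good intentions , but"))
print(has_pattern(["one", GAP, "the", "STRONG-POS"],
                  "one of the best of the year"))
```

Each matched pattern would then become a binary feature for the classifier, which is how a phrase like "good ... , but" can end up signalling a negative review even though it contains a positive word.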

In the newsgroup data set, some categories overlapped, such as religion & atheism and forsale & windows, because they share some words. In such cases, stretchy patterns should be used to capture more of the context.

In my test data set, which classifies plants into fruits, vegetables and flowers, the unigram features were the most predictive. The structure of the text was not important: the unigram-only feature space performed decently compared with the feature space that also included bigrams and trigrams.
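
For readers unfamiliar with the terms, the following rough sketch shows what unigram, bigram and trigram features look like for a single text. The function name and the example phrase are made up for illustration; they are not taken from my data set or from LightSIDE.

```python
from collections import Counter


def ngram_features(text, n_values=(1, 2, 3)):
    """Count the n-grams of each requested length in a whitespace-tokenized text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            # Join the tokens of one n-gram into a single feature name.
            feats["_".join(tokens[i:i + n])] += 1
    return feats


# Hypothetical example text, not from the actual data set.
print(ngram_features("red sweet apple", n_values=(1,)))
print(ngram_features("red sweet apple"))
```

When the class is mostly determined by individual words, as it seemed to be for the plant texts, the extra bigram and trigram features add many columns to the feature space without adding much information.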


All materials are based on the EdX course - Data, Analytics and Learning
This work is licensed under a Creative Commons Attribution 4.0 International License.