Competency 8.5: Examine texts from different
categories and notice characteristics they might want to include in feature
space for models and then use this reasoning to start to make tentative
decisions about what kinds of features to include in their models.
I tried the Bazaar activity in Prosolo (by myself, since I didn't get matched to a teammate) to explore advanced feature extraction in LightSIDE and see which features give better performance. I first used the sentiment_sentences data set and configured stretchy patterns using the pre-defined categories positive and negative. There was a very significant improvement in performance when moving from unigrams alone to stretchy patterns.
To look at the details, I used the Explore Results pane in LightSIDE to analyse the results.
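LightSIDE does this through its GUI, but the same kind of inspection can be sketched in code. Below is a minimal scikit-learn analogue, not LightSIDE itself; the file sentiment_sentences.csv and its "text" and "label" columns are hypothetical stand-ins. It trains a unigram logistic regression model and prints the strongest feature weights, similar in spirit to the Explore Results pane.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sentiment_sentences.csv")  # hypothetical file and columns
vec = CountVectorizer()                      # unigram features only
X = vec.fit_transform(df["text"])
clf = LogisticRegression(max_iter=1000).fit(X, df["label"])

# Pair each unigram with its learned weight and list the strongest ones.
weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                 key=lambda w: abs(w[1]), reverse=True)
for word, weight in weights[:10]:
    print(f"{word:15s} {weight:+.3f}")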
The more indicative words, which occurred more often, had stronger weights (e.g. dull, too, enjoyable). Commonly occurring words like a, of, and punctuation were assigned little or no feature weight. The stretchy patterns helped predict many positive and negative instances correctly by considering the position and structure of the surrounding words. Examples below:
STRONG-POS [GAP] , but --> "the movie is loaded with good intentions , but" --> neg
one [GAP] the STRONG-POS --> "one of the best of the year" --> pos
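To make the idea concrete, here is a rough sketch of how such gapped patterns could be generated. This is not LightSIDE's actual extractor: the category lexicon and the two fixed pattern shapes are simplifying assumptions for illustration.

def stretchy_patterns(tokens, categories):
    # Replace a token by its category label (e.g. STRONG-POS) if it has one.
    mapped = [categories.get(t.lower(), t) for t in tokens]
    patterns = []
    # Trigram-shaped patterns with the middle token replaced by [GAP].
    for i in range(len(mapped) - 2):
        patterns.append(f"{mapped[i]} [GAP] {mapped[i + 2]}")
    # Four-token patterns with a gap in the second position, matching
    # shapes like "one [GAP] the STRONG-POS" above.
    for i in range(len(mapped) - 3):
        patterns.append(f"{mapped[i]} [GAP] {mapped[i + 2]} {mapped[i + 3]}")
    return patterns

lexicon = {"best": "STRONG-POS", "good": "STRONG-POS", "dull": "STRONG-NEG"}
print(stretchy_patterns("one of the best of the year".split(), lexicon))
# ['one [GAP] the', ..., 'one [GAP] the STRONG-POS', ...]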
In the newsgroup data set, there were overlaps between some categories, such as religion & atheism and forsale & windows, because they share vocabulary. In such cases stretchy patterns can capture more of the context and help separate the categories.
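One way to quantify such overlaps outside LightSIDE is a confusion matrix over the newsgroup labels. The sketch below uses scikit-learn's built-in 20 newsgroups loader with categories chosen to mirror the ones above; the modelling choices are my own assumptions, not the course setup.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

cats = ["alt.atheism", "soc.religion.christian",
        "misc.forsale", "comp.os.ms-windows.misc"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train.data), train.target)
pred = clf.predict(vec.transform(test.data))

# Off-diagonal counts show which categories get confused with each other.
print(confusion_matrix(test.target, pred))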
In my test data set of plants classified into fruits, vegetables, and flowers, the unigram features were the most predictive. The structure of the text mattered little here: the unigram feature space gave decent predictions even compared with feature spaces that also included bigrams and trigrams.
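That comparison can be sketched as a cross-validation over different n-gram ranges. The plants.csv file and its columns are hypothetical stand-ins for my data set; the point is that ngram_range controls which feature spaces are compared.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("plants.csv")  # hypothetical "text" and "label" columns
for name, ngram_range in [("unigrams only", (1, 1)), ("uni+bi+trigrams", (1, 3))]:
    pipe = make_pipeline(CountVectorizer(ngram_range=ngram_range),
                         LogisticRegression(max_iter=1000))
    # 5-fold cross-validated accuracy for each feature space.
    scores = cross_val_score(pipe, df["text"], df["label"], cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")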