Activity: Textual data pre-processing and informal analysis
Rule 1:
I created a list of positive words (unigrams and bigrams) from the given
data and used it to classify instances: an instance is positive if it
contains any word on the list, and negative otherwise.
IF (effective OR intriguing OR breathtaking OR
captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR
charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR
fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR
well_handled OR touching OR believable OR likeable OR very_successful OR
interesting OR good OR entertaining OR great OR engaging) THEN pos ELSE neg
This rule misclassifies some negative instances, since some of them
also contain positive words.
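Rule 1 could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: hyphens are stripped before tokenizing (so "thought-provoking" becomes thoughtprovoking), words are matched exactly with no stemming, and the names POSITIVE, ngrams and rule1 are hypothetical.

```python
import re

# Positive cues copied from Rule 1. "perfect" is handled separately because
# the rule only accepts it when it is not preceded by "not".
POSITIVE = {
    "effective", "intriguing", "breathtaking", "captivated", "loved",
    "real_chemistry", "really_good", "charm", "enthralled",
    "beautifully_done", "thoughtprovoking", "poignant", "fabulous", "sweet",
    "true_chemistry", "so_well", "enjoy", "excellent", "well_handled",
    "touching", "believable", "likeable", "very_successful", "interesting",
    "good", "entertaining", "great", "engaging",
}

def ngrams(tokens, n_max=2):
    """Return all unigrams and underscore-joined bigrams, in order."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def rule1(text):
    # Assumption: lowercase, drop hyphens, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower().replace("-", ""))
    has_positive = any(f in POSITIVE for f in ngrams(tokens))
    # (NOT not)_perfect: "perfect" whose previous token is not "not".
    has_perfect = any(t == "perfect" and (i == 0 or tokens[i - 1] != "not")
                      for i, t in enumerate(tokens))
    return "pos" if has_positive or has_perfect else "neg"
```

With this sketch, a review such as "dull plot but good acting" comes out as pos because of the single word "good", which is exactly the weakness described above.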
Rule 2:
This rule is based on a list of negative words from the given data.
IF (not_perfect OR
dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR
went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR
waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR
false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR
not_original) THEN neg ELSE pos
This rule misclassifies some positive instances,
since a few of the negative words also occur in positive instances.
Rule 3:
To overcome the wrong predictions caused by
instances that contain both positive and negative words, I counted the
occurrences of each kind to see which one dominates.
FOR ALL(effective OR intriguing OR breathtaking OR captivated OR (NOT
not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled
OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR
true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR
believable OR likeable OR very_successful OR interesting OR good OR
entertaining OR great OR engaging) Add 1 to count_pos
for each occurrence
FOR ALL(not_perfect OR dull OR onedimensional OR misused OR unnatural OR
lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR
really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR
immaturity OR passionless OR false_hope OR collapse OR annoying OR
undercut OR not_so_well OR disaster OR not_original) Add 1 to count_neg
for each occurrence
IF count_pos > count_neg THEN pos
ELSE neg
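The counting idea in Rule 3 might look like this in Python. Again this is a sketch under my own assumptions, not a tested classifier: hyphens are stripped, matching is exact, n-grams up to length 3 are generated because not_so_well spans three tokens, and the names POSITIVE, NEGATIVE, ngrams and rule3 are hypothetical.

```python
import re

POSITIVE = {
    "effective", "intriguing", "breathtaking", "captivated", "loved",
    "real_chemistry", "really_good", "charm", "enthralled",
    "beautifully_done", "thoughtprovoking", "poignant", "fabulous", "sweet",
    "true_chemistry", "so_well", "enjoy", "excellent", "well_handled",
    "touching", "believable", "likeable", "very_successful", "interesting",
    "good", "entertaining", "great", "engaging",
}
NEGATIVE = {
    "not_perfect", "dull", "onedimensional", "misused", "unnatural", "lack",
    "missmarketed", "went_wrong", "worst", "shallow", "awful", "terrible",
    "really_bad", "cliché", "waste", "unintentional_laughs", "silliness",
    "immaturity", "passionless", "false_hope", "collapse", "annoying",
    "undercut", "not_so_well", "disaster", "not_original",
}

def ngrams(tokens, n_max=3):
    """Return all n-grams up to n_max tokens, joined by underscores."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def rule3(text):
    # Assumption: lowercase, drop hyphens, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower().replace("-", ""))
    feats = ngrams(tokens)
    count_pos = sum(f in POSITIVE for f in feats)
    # (NOT not)_perfect: count "perfect" only when not preceded by "not".
    count_pos += sum(t == "perfect" and (i == 0 or tokens[i - 1] != "not")
                     for i, t in enumerate(tokens))
    count_neg = sum(f in NEGATIVE for f in feats)
    # Ties fall through to neg, as in the rule above.
    return "pos" if count_pos > count_neg else "neg"
```

One side effect of counting this naively is that overlapping cues add up: "really good" scores twice, once for good and once for really_good.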
Rule 4:
The word lists are hand-picked from our sample data, so the above rule
over-fits to that data. I removed words whose meaning depends on the
context in which they occur and kept only words that are predictive in
all of their occurrences.
FOR ALL(effective OR breathtaking OR loved OR (real OR
true)_chemistry OR really_good OR enthralled OR beautifully_done OR
thoughtprovoking OR fabulous OR excellent OR well_handled OR very_successful) Add 1 to count_pos
for each occurrence
FOR ALL(dull OR unnatural OR missmarketed OR went_wrong OR worst OR
shallow OR awful OR terrible OR really_bad OR waste OR silliness OR
annoying) Add 1 to count_neg for each occurrence
IF count_pos > count_neg THEN pos
ELSE neg
Although the above rule seems to fit reasonably well, it may not be very
predictive for instances that contain words other than the ones listed,
or that use a listed word in the opposite sense (for example, after a
negation). Such cases can be captured to some extent by more complex rules
that take the proximity of word occurrences into account. More features can
be added and tested by cross-validation until we get a model with reasonable
reliability. My takeaway is that it is not at all an easy task! :)
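As one possible example of a proximity rule, the sketch below flips the polarity of a Rule 4 cue when a negation word appears within a few tokens before it. The negator list, the window size of 3 and the names NEGATORS, WINDOW and classify are my own assumptions for illustration, not something derived from the data, and contractions such as "wasn't" are not handled.

```python
import re

# Word lists from Rule 4.
POSITIVE = {
    "effective", "breathtaking", "loved", "real_chemistry", "true_chemistry",
    "really_good", "enthralled", "beautifully_done", "thoughtprovoking",
    "fabulous", "excellent", "well_handled", "very_successful",
}
NEGATIVE = {
    "dull", "unnatural", "missmarketed", "went_wrong", "worst", "shallow",
    "awful", "terrible", "really_bad", "waste", "silliness", "annoying",
}
NEGATORS = {"not", "never", "no", "hardly"}  # assumed, not from the data
WINDOW = 3                                   # assumed look-back distance

def classify(text):
    # Assumption: lowercase, drop hyphens, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower().replace("-", ""))
    count_pos = count_neg = 0
    for i in range(len(tokens)):
        for n in (1, 2):                     # unigram and bigram cues
            if i + n > len(tokens):
                continue
            cue = "_".join(tokens[i:i + n])
            if cue in POSITIVE:
                polarity = 1
            elif cue in NEGATIVE:
                polarity = -1
            else:
                continue
            # Flip the cue if a negator occurs in the WINDOW tokens before it.
            if NEGATORS & set(tokens[max(0, i - WINDOW):i]):
                polarity = -polarity
            if polarity > 0:
                count_pos += 1
            else:
                count_neg += 1
    return "pos" if count_pos > count_neg else "neg"
```

Under these assumptions, "It was not at all excellent" is counted as a negative cue, while "never dull" is counted as a positive one. A fixed window is crude, and it will misfire on sentences where the negation belongs to a different clause.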