Wednesday, December 10, 2014

Week 8 Activity - Data preparation

Activity: Textual data pre-processing and informal analysis

Rule 1:
I created a list of positive words (unigrams and bigrams) from the given data and used them to identify positive and negative instances.

IF (effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR interesting OR good OR entertaining OR great OR engaging) THEN pos ELSE neg

This rule misclassifies some negative instances, since some of them also contain positive words.

Rule 2:
This rule is based on a list of negative words from the given data.

IF (not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) THEN neg ELSE pos

This rule misclassifies some positive instances, since a few of the negative words also occur in positive instances.
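Rules 1 and 2 can be sketched in a few lines of Python. This is only a rough illustration: the word lists are abbreviated subsets of the ones above, the helper `extract_terms` and the sample review are made up for the example, and the "(NOT not)_perfect" condition is left out for brevity.

```python
import re

# Abbreviated, illustrative subsets of the two word lists above.
POSITIVE_TERMS = {"effective", "loved", "really_good", "excellent", "good", "great"}
NEGATIVE_TERMS = {"dull", "worst", "awful", "terrible", "really_bad", "went_wrong"}

def extract_terms(text):
    """Return the unigrams and bigrams (joined with '_') that occur in text."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = {f"{first}_{second}" for first, second in zip(words, words[1:])}
    return set(words) | bigrams

def rule_1(text):
    """Predict 'pos' if any positive term occurs, otherwise 'neg'."""
    return "pos" if extract_terms(text) & POSITIVE_TERMS else "neg"

def rule_2(text):
    """Predict 'neg' if any negative term occurs, otherwise 'pos'."""
    return "neg" if extract_terms(text) & NEGATIVE_TERMS else "pos"

# A made-up review that mixes both kinds of words.
mixed = "The cast is good, but the plot is dull and the ending is awful."
print(rule_1(mixed))  # pos
print(rule_2(mixed))  # neg
```

The two rules disagree on the mixed review, which is exactly the weakness described above: a single matching word decides the label.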

Rule 3: 
To overcome the wrong predictions caused by instances that contain both positive and negative words, I counted the occurrences of each kind to see which one dominates.

FOR ALL(effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR interesting OR good OR entertaining OR great OR engaging) Add 1 to count_pos for each occurrence

FOR ALL(not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) Add 1 to count_neg for each occurrence

IF count_pos > count_neg THEN pos
ELSE neg
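The counting version of the rule might look like the following in Python. It is a sketch under assumptions: the word lists are shortened, `term_counts` is a hypothetical helper, and the example sentences are invented.

```python
import re
from collections import Counter

# Abbreviated, illustrative subsets of the Rule 3 word lists.
POSITIVE_TERMS = {"effective", "loved", "really_good", "excellent", "good", "great"}
NEGATIVE_TERMS = {"dull", "worst", "awful", "terrible", "really_bad", "went_wrong"}

def term_counts(text):
    """Count every unigram and every bigram (joined with '_') in text."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = [f"{first}_{second}" for first, second in zip(words, words[1:])]
    return Counter(words + bigrams)

def rule_3(text):
    """Compare how often positive and negative terms occur."""
    counts = term_counts(text)
    # A Counter returns 0 for terms that never occur, so no key check is needed.
    count_pos = sum(counts[term] for term in POSITIVE_TERMS)
    count_neg = sum(counts[term] for term in NEGATIVE_TERMS)
    # Strict '>' as in the rule, so a tie is labelled 'neg'.
    return "pos" if count_pos > count_neg else "neg"

print(rule_3("The cast is good, but the plot is dull and the ending is awful."))  # neg
print(rule_3("Good but dull."))  # neg (tie)
```

Note that in this sketch a phrase such as "really good" adds to the count twice, once for "really_good" and once for "good", because the list holds both terms.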

Rule 4: 
The word lists are hand-picked from our sample data, so the above rule over-fits to that data. I removed words whose meaning can change with context and kept only the words that are predictive in every occurrence.

FOR ALL(effective OR breathtaking OR loved OR (real OR true)_chemistry OR really_good OR enthralled OR beautifully_done OR thoughtprovoking OR fabulous OR excellent OR well_handled OR very_successful) Add 1 to count_pos for each occurrence

FOR ALL(dull OR unnatural OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR waste OR silliness OR annoying) Add 1 to count_neg for each occurrence

IF count_pos > count_neg THEN pos
ELSE neg
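To see what pruning the lists changes, here is a small Python comparison. Everything in it is illustrative: the "full" and "pruned" sets are short stand-ins for the Rule 3 and Rule 4 lists, `classify` is a hypothetical helper, and the review is invented.

```python
import re

# Illustrative subsets: the "full" sets stand in for Rule 3, the "pruned" sets for Rule 4.
FULL_POS = {"effective", "loved", "excellent", "good", "interesting", "sweet"}
FULL_NEG = {"dull", "awful", "terrible", "lack", "waste"}
PRUNED_POS = {"effective", "loved", "excellent"}
PRUNED_NEG = {"dull", "awful", "terrible", "waste"}

def classify(text, pos_terms, neg_terms):
    """Count matching unigrams and bigrams, then compare the two totals."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = words + [f"{first}_{second}" for first, second in zip(words, words[1:])]
    count_pos = sum(term in pos_terms for term in terms)
    count_neg = sum(term in neg_terms for term in terms)
    return "pos" if count_pos > count_neg else "neg"

review = "A good idea and an interesting cast, spoiled by a dull script."
print(classify(review, FULL_POS, FULL_NEG))      # pos
print(classify(review, PRUNED_POS, PRUNED_NEG))  # neg
```

With the full lists, the context-dependent words "good" and "interesting" outvote "dull". With the pruned lists, only "dull" counts and the review is labelled negative.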

Even though the above rule seems to fit reasonably well, it may not predict well for instances that contain words outside the lists or that use a listed word in the opposite sense. Such cases can be captured to some extent by more complex rules that take the proximity of words into account. More features can be added and tested with cross-validation until we get a model with reasonable reliability. My takeaway is that this is not at all an easy task! :)
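One possible form of such a proximity rule is a negation window: if a negating word appears shortly before a listed word, the word counts for the opposite class. The sketch below is only one way to do it. The negator list, the window size of 2, the word lists and the function name are all assumptions made for the example, and contractions such as "wasn't" are not handled.

```python
import re

# Illustrative unigram lists; the negators and the window size are assumptions.
POSITIVE_WORDS = {"effective", "loved", "excellent", "fabulous"}
NEGATIVE_WORDS = {"dull", "awful", "terrible", "annoying", "worst"}
NEGATORS = {"not", "never", "no", "hardly"}

def classify_with_negation(text, window=2):
    """Count listed words, flipping polarity when a negator is close before them."""
    words = re.findall(r"[a-z]+", text.lower())
    count_pos = count_neg = 0
    for i, word in enumerate(words):
        if word in POSITIVE_WORDS:
            polarity = 1
        elif word in NEGATIVE_WORDS:
            polarity = -1
        else:
            continue
        # Look at up to `window` words immediately before the listed word.
        preceding = words[max(0, i - window):i]
        if any(w in NEGATORS for w in preceding):
            polarity = -polarity
        if polarity > 0:
            count_pos += 1
        else:
            count_neg += 1
    return "pos" if count_pos > count_neg else "neg"

print(classify_with_negation("It was not dull at all, I loved it."))  # pos
print(classify_with_negation("The acting was not very effective."))   # neg
```

A fixed window is crude: it will also flip words that merely sit near an unrelated "not", so a rule like this would still need to be checked against held-out data.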





All materials are based on the EdX course - Data, Analytics and Learning
This work is licensed under a Creative Commons Attribution 4.0 International License.