Thursday 1 March 2012

An afternoon in style on the Internet

Over at Language Log, Mark Liberman writes that "serendipitous conversational cross-fertilization comes from random encounters in the corridors and cafeterias of the internet". And the example he chooses happens to be about stylistics and measuring text similarity! Serendipitous indeed.

A couple of useful links/references (blog and academic):

Ted Underwood, "The differentiation of literary and nonliterary diction, 1700-1900" (Blog: The Stone and the Shell, 26 Feb 2012)
- Ted Underwood diachronically compares literary text genres (poetry, drama, fiction) with non-fiction. Comparing word-frequency similarity, he finds that over time (1700 to 1900) the literary genres became less and less similar to non-fiction.

Interestingly, he also finds that, over that time period, non-fiction changed the most in relation to itself - that is, it changed the most as a genre. (He discusses some of the possible reasons for this, so I won't repeat them here.)

As Mark Liberman mentions, an obvious limitation is that this measure of similarity will depend on topic as well as style or diction.
I still need to go back to this later to look again at the stats and details.
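
To make the idea of "word-frequency similarity" concrete, here is a minimal sketch in Python. It assumes cosine similarity over relative word frequencies, which is only one of several possible measures and not necessarily the one used in the post; the function names and the toy sentences are made up for illustration.

```python
from collections import Counter
import math
import re


def word_freqs(text):
    """Relative frequency of each lowercased word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def cosine_similarity(freqs_a, freqs_b):
    """Cosine of the angle between two word-frequency profiles (0.0 to 1.0)."""
    shared = set(freqs_a) & set(freqs_b)
    dot = sum(freqs_a[w] * freqs_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freqs_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freqs_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for two genres (invented sentences, not real corpus data).
poem = "the sea and the sky and the long slow tide of the evening"
essay = "the committee reports that the price of grain rose in the spring"

print(cosine_similarity(word_freqs(poem), word_freqs(essay)))
```

Note that the score rises whenever two texts share vocabulary for any reason, so two texts about the sea will look similar whatever their style. That is the topic-versus-diction limitation mentioned above.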



Arvind Narayanan, in "Is Writing Style Sufficient to Deanonymize Material Posted Online?" (Blog: 33 Bits of Entropy, 20 Feb 2012; the post also links to a draft of the forthcoming paper, "On the Feasibility of Internet-Scale Author Identification"), looks at identifying blog authors:
"We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author [...]. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs."
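
The matching task described in that quote can be sketched as a nearest-neighbour search: turn every blog into a vector of style features, then pick the candidate whose vector is closest to the anonymous one. The sketch below is only an illustration under simple assumptions; the short function-word list, the cosine measure and all names are mine, not the features or classifier from the paper.

```python
from collections import Counter
import math
import re

# A tiny, hypothetical feature set: relative frequencies of a few function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "i", "but", "so"]


def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(anonymous_text, candidates):
    """Return the name of the candidate blog most similar in style.

    `candidates` maps a blog name to its text.
    """
    target = style_vector(anonymous_text)
    return max(candidates,
               key=lambda name: cosine(target, style_vector(candidates[name])))


# Invented toy data: two "blogs" and one unlabeled text.
candidates = {
    "blog_a": "so i think that it is odd but i do like it",
    "blog_b": "the history of the city and the rise of the empire in a book",
}
print(best_match("so i guess that it is fine but i was wrong", candidates))
```

With 100,000 candidates the same loop still works in principle, but a real system would need far richer features and something smarter than a linear scan per query.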

Take-aways: 
- I discovered the term "stylometry", which is more specific and useful than what I had been calling "sort of like forensic linguistics".
- I have a lot of reading to do on stylistics (and stylometrics), because basically all I had so far was Coupland (2007), Style: Language Variation and Identity, Cambridge University Press. [reviews: 1, 2]



AND THEN

Two minutes on Twitter means I come across this, from Google Research: "Quantifying comedy on YouTube: why the number of o’s in your LOL matter", looking at the comments under YouTube videos. Variation in these "laughs" (lol, loool, aha, hahaha, etc., not to mention the recent comeback of "mwahaha") is exactly one of the things I was counting on looking at - my hypothesis is that the choice of one of these will strongly depend on the community context in which we are speaking. For example, we know that for some people, "lol" is totally uncool and would only be used ironically, in the form of "LOL" or "LULZ" in an otherwise capitalisation-free discourse.
The different types, and amount of, emphasis we might use (as the post describes, "e.g. capitalization (LOL), elongation (loooooool), repetition (lolololol), exclamation (lolllll!!!!!)") is another variable.
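
As a rough sketch of how one might pull these "laughs" out of comments and tag the kinds of emphasis listed in that quote, here is some Python. The regular expressions, the thresholds (three or more identical letters counts as elongation, two or more "lo" syllables as repetition) and the function names are my own assumptions, not anything taken from the Google post.

```python
import re

# Two hypothetical laugh families: lol-like (lol, loool, lololol)
# and haha-like (aha, haha, hahaha, mwahaha). Trailing "!" is kept.
LAUGH_RE = re.compile(
    r"\b(?P<lol>(?:l+o+)+l+)\b!*|\b(?P<haha>(?:mw)?a?(?:ha)+h?)\b!*",
    re.IGNORECASE,
)


def find_laughs(comment):
    """Return (family, token) pairs for every laugh found in a comment."""
    return [(m.lastgroup, m.group(0)) for m in LAUGH_RE.finditer(comment)]


def emphasis_features(token):
    """Tag one laugh token with the four emphasis types from the quote."""
    letters = token.rstrip("!")
    return {
        "capitalization": letters.isupper(),
        # the same letter three or more times in a row, e.g. "loool"
        "elongation": re.search(r"(.)\1{2,}", letters) is not None,
        # the "lo" syllable repeated, e.g. "lolol"; only defined for lol-like tokens
        "repetition": re.fullmatch(r"(?:lo){2,}l+", letters, re.IGNORECASE) is not None,
        "exclamation": token.endswith("!"),
    }


for family, token in find_laughs("LOL that was great, loooool!!! hahaha lollipop"):
    print(family, token, emphasis_features(token))
```

Counting these tags per community (or per video, as in the Google study) would then just be a matter of summing the dictionaries.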

Google's research question is different: they're looking at the extent to which this variation might indicate how funny the viewer found the video. Though they admit that "funniness" is a difficult question (humour preference is subjective), they apparently have some human-annotated data to train classifiers with. Also, they took video tags (and, presumably, the video title) into account. And some audio-visual features. So, that sounds like fun.
Just when I'm wondering why I can't intern at Google, here are two things I need to look into:
1) What is a "passive-aggressive" ranking algorithm? Does it leave notes on the fridge rather than speaking to you, or does it secretly give items controversial rankings? Maybe I need to watch this.
2) "human-annotated pairwise ground truth" - I've assumed this refers to a human-annotated training corpus, but I haven't heard of "ground truth" before. Hmm.
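
For what it's worth, in the machine-learning literature "passive-aggressive" names a family of online learning updates (Crammer et al., 2006): the model stays passive when it already gets an example right by a margin, and otherwise changes its weights just enough to fix the mistake. "Pairwise ground truth" would then be human judgements of the form "video A is funnier than video B". Below is a minimal sketch of the basic update applied to such pairs; whether this matches the variant Google used is an assumption, and the two features are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pa_rank_update(weights, preferred, other):
    """One passive-aggressive step for a pair where `preferred` should outrank `other`.

    Uses the hinge loss max(0, 1 - w . (preferred - other)).
    """
    diff = [p - o for p, o in zip(preferred, other)]
    loss = max(0.0, 1.0 - dot(weights, diff))
    sq_norm = dot(diff, diff)
    if loss == 0.0 or sq_norm == 0.0:
        return weights  # passive: the pair is already ranked with enough margin
    tau = loss / sq_norm  # aggressive: smallest step that reaches the margin
    return [w + tau * d for w, d in zip(weights, diff)]


# Hypothetical features per video: [average number of o's in "lol", share of comments with "!"]
funnier = [3.0, 1.0]
less_funny = [1.0, 1.0]

weights = [0.0, 0.0]
weights = pa_rank_update(weights, funnier, less_funny)
print(weights)
```

A second call with the same pair leaves the weights unchanged, which is the "passive" half of the name.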

And their references (copy-pasted):

“Opinion Mining and Sentiment Analysis,” by Bo Pang and Lillian Lee.
“A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews,” by Oren Tsur, Dmitry Davidov, and Ari Rappoport.
“That’s What She Said: Double Entendre Identification,” by Chloe Kiddon and Yuriy Brun.


Plus, from the comments:

Biel, J-I. and Gatica-Perez, D. VlogSense: Conversational Behavior and Social Attention in YouTube. ACM Trans. Multimedia Comput. Commun. Appl. 2, 3, Article 1 (May 2010)
Biel, J-I., Aran, O., and Gatica-Perez, D. "You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube". In [I think] ICWSM, The AAAI Press (2011).
