Tuesday, August 27, 2013

iambic pentameter and stratified sampling

This post is to remind me later of a cool idea I just had. I have no idea if it will ever be carried out, but if it’s written down somewhere obvious, it’s more likely to happen. (I should do this more often.)

Pentametron 2013 is a Twitter bot account that searches for tweets that happen to be written in iambic pentameter and compiles them into sonnets. They’re really just grouped into rhyming couplets, but you can see the most recently compiled sonnets at pentametron.com. (You can also read the creator’s description here.) Much of the time, this method of culling messages from the 400 million tweets produced each day results in an odd mishmash: random song lyrics (no surprise those tend to fall into the Bard’s classic style), expressions of irritation or frustration (maybe those emotions lead to the punchier iambs?), snippets of musings (the author is in a poetic frame of mind). But sometimes you get a glimpse into how someone very far removed from you personally is reacting to a common cultural event. For instance, I noticed today that several of the tweets were about back-to-school topics.

So here is my question, which I believe should be quantifiable to an extent:

What is the chance that a message retweeted by Pentametron will be about a trending topic?

This seems like a good project for a statistics class, or perhaps even my probability class this fall.

Here’s why I find this question interesting: Pentametron’s algorithm amounts to choosing a relatively small number of tweets from the vast panoply of Twitter users in a systematic way that isn’t biased too much towards one type of tweet or another (except perhaps for the above-noted frequency with which song lyrics appear). In statistics, the closely related technique is called stratified sampling, and it underlies much of the polling done on large populations. If the method of sampling selects from as many different kinds of groups (“strata”) as possible in a way that is random (or systematic in a way unrelated to the division into groups), then the results are very often representative of the entire population, even more so than a simple random selection from the whole pool without stratification. I expect this effect can be explained by theoretical arguments (presumably about how sampling within groups reduces the variability of the estimate), which I now intend to learn.
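To make the claim about stratification concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the three strata names, their sizes, and the rates at which each stratum “mentions a trending topic” are invented, not measured from Twitter. It repeatedly estimates the overall rate two ways, by simple random sampling and by proportional stratified sampling, and reports how much each estimate wobbles from trial to trial.

```python
import random
import statistics


def make_population(rng):
    """Build a toy population grouped by stratum.

    The strata, sizes, and rates are invented for illustration only.
    Each value is 1 if a hypothetical tweet mentions a trending topic, else 0.
    """
    rates = {"lyrics": 0.05, "venting": 0.30, "musing": 0.60}
    sizes = {"lyrics": 5000, "venting": 3000, "musing": 2000}
    return {
        name: [1 if rng.random() < rates[name] else 0 for _ in range(sizes[name])]
        for name in sizes
    }


def simple_random_estimate(everyone, n, rng):
    """Estimate the overall rate from n items drawn from the whole pool."""
    return sum(rng.sample(everyone, n)) / n


def stratified_estimate(by_stratum, n, rng):
    """Estimate the overall rate by sampling each stratum in proportion to its size."""
    total = sum(len(values) for values in by_stratum.values())
    estimate = 0.0
    for values in by_stratum.values():
        weight = len(values) / total
        k = max(1, round(n * weight))  # this stratum's share of the sample
        estimate += weight * (sum(rng.sample(values, k)) / k)
    return estimate


def compare(trials=2000, n=100, seed=0):
    """Return (spread of simple random estimates, spread of stratified estimates)."""
    rng = random.Random(seed)
    by_stratum = make_population(rng)
    everyone = [v for values in by_stratum.values() for v in values]
    srs = [simple_random_estimate(everyone, n, rng) for _ in range(trials)]
    strat = [stratified_estimate(by_stratum, n, rng) for _ in range(trials)]
    return statistics.pstdev(srs), statistics.pstdev(strat)


if __name__ == "__main__":
    srs_sd, strat_sd = compare()
    print(f"simple random sampling spread: {srs_sd:.4f}")
    print(f"stratified sampling spread:    {strat_sd:.4f}")
```

In this toy setup the stratified estimates should vary less, because the strata differ a lot from one another. If the invented rates were all equal, stratifying would buy essentially nothing.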

(By a strange quirk, I learned about stratified sampling from the introduction to the book 3:16 Bible Texts Illuminated by Donald Knuth (yes, that Donald Knuth). He explains that by selecting verse 16 of chapter 3 from every book of the Bible, insofar as possible, we should theoretically get a relatively good picture of the Bible’s message as a whole. I recommend reading his description of the process, and his justification of it as one way of studying the Bible, alongside others of course.)

A relatively small number of tweets turn out to be in iambic pentameter, but they are spread out among all of the different types of Twitter users. Thus, perhaps by looking at just these tweets, we can get a sense of what Twitter as a whole is talking about. Another indicator of these large-scale trends is the list of so-called trending topics, which are simply the words, phrases, or hashtags that appear with high frequency in a particular region or worldwide. My question above can be rephrased as: how often do these two indicators (the subjects of tweets in iambic pentameter and the trending topics) align? A priori, it may seem like there is no connection, but I think the above discussion suggests that a correlation is likely, measurable, and even estimable if one knows the right guesses and assumptions to make.
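One way to turn the question into a number is sketched below. It assumes you have already collected two lists by hand or by some other means: the text of tweets retweeted by Pentametron, and the trending phrases from the same period. The function names are hypothetical, nothing here calls Twitter, and “about a trending topic” is crudely approximated as “contains the trending phrase as a whole word,” which will miss paraphrases. The Wilson score interval is a standard way to put error bars on a proportion from a small sample.

```python
import math
import re


def mentions_trend(tweet, trends):
    """Return True if any trending phrase appears in the tweet as a whole word.

    Matching is case-insensitive. This is a rough stand-in for "is about the
    topic"; it is an assumption of this sketch, not a real topic classifier.
    """
    text = tweet.lower()
    for trend in trends:
        pattern = r"(?<!\w)" + re.escape(trend.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one observation")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def match_rate(tweets, trends):
    """Return (observed rate, lower bound, upper bound) for trend mentions."""
    hits = sum(1 for tweet in tweets if mentions_trend(tweet, trends))
    low, high = wilson_interval(hits, len(tweets))
    return hits / len(tweets), low, high
```

Calling `match_rate(tweets, trends)` on a day’s worth of retweets gives an estimate of the chance in question, along with a sense of how uncertain it is. Comparing that against the same rate computed on a random batch of ordinary tweets would show whether the iambic ones are any more or less topical than the rest.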

Any thoughts on how to go about studying this?
