If you want to look at documents and extract information automatically, you run into an interesting problem. Lots of words are there just for grammatical reasons (the, a, to, for, etc.) and have little meaning. These words can get in the way when you try to do analyses, and you might want to see how your analysis changes when you just cut them out.
To be properly rigorous, you don’t want to use a subjective reason for calling a word meaningless. We get the sense that the words I mentioned before will do little to distinguish the content of one page from another, but what about more borderline words like “your”, “his”, “can’t”, etc.? Where do we draw the line? It’s helpful to start by thinking about what makes a word meaningless.
By meaningless, I mean that the word doesn’t help distinguish the content of one news or blog article from another. You can try to translate that idea into quantitative terms. If we want to base this definition on total counts of words and ignore context (which is much easier to work with for large datasets), we need to come up with a measure of word use based on word counts. If a word occurs with roughly the same frequency across all documents (frequency = number of occurrences / total words in the document), then we’ll say that it’s used in the same way in those documents.
How do we measure whether or not it’s “roughly the same frequency across some set of documents”? One way is to measure the mean and standard deviation of that word’s frequency over the set of documents. If the ratio (standard deviation)/(mean) is small compared to the ratios typical of other words, then the word’s frequency doesn’t vary much across the set of all documents. That suggests (with a caveat) that the word does little to distinguish the content of one document from all the others. The caveat is that you can usually choose a subset of your documents over which the standard deviation is small. For example, while the word “cat” will have a larger (standard deviation)/(mean) over all documents, we can choose to look at only the subset of documents talking about cats. Then the ratio should be much smaller.
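The ratio described above can be sketched in a few lines of Python. This is a minimal sketch, not necessarily the code used for the numbers in this post: the function names are made up for illustration, tokenization is a naive lowercase whitespace split, and it assumes the population standard deviation (the sample version would serve equally well).

```python
from collections import Counter
from statistics import mean, pstdev


def word_frequencies(text):
    """Frequency of each word in one document: occurrences / total words."""
    words = text.lower().split()  # naive tokenizer, an assumption of this sketch
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def variation_ratios(documents):
    """Map each word to (std dev)/(mean) of its frequency across documents."""
    per_doc = [word_frequencies(doc) for doc in documents]
    vocabulary = set().union(*per_doc)
    ratios = {}
    for word in vocabulary:
        # A word that is absent from a document has frequency 0 there.
        freqs = [freq.get(word, 0.0) for freq in per_doc]
        # The mean is never 0: every word in the vocabulary occurs somewhere.
        ratios[word] = pstdev(freqs) / mean(freqs)
    return ratios


# Three toy documents, invented purely to exercise the functions.
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the bird saw the cat",
]
ratios = variation_ratios(docs)
for word in ("the", "cat", "dog"):
    print(word, round(ratios[word], 3))
```

In this toy corpus “the” appears at nearly the same frequency everywhere and gets the lowest ratio, “cat” appears in two of the three documents and lands in the middle, and “dog” appears in only one and gets the highest.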
Let’s talk numbers. I looked at a sample of 3000 blogs chosen at random from our dataset. I made a histogram of the std dev/mean ratios of all of the words just to see what interesting features pop up. Here is that histogram:

Some cool features to note: first, a word’s std dev is usually (nearly always) several times its mean frequency. Second, there are several spikes. I suspect these spikes are words that serve a common function, like identifying a particular topic of discussion or framing the post, or they might simply be memes. They will definitely be the subject of a future post.
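For anyone who wants to reproduce this kind of histogram, the binning step can be sketched as follows. The function name and the bin width of 0.5 are assumptions chosen for illustration, and the input is just a handful of the ratios quoted in this post, not the full vocabulary.

```python
import math
from collections import Counter


def ratio_histogram(ratios, bin_width=0.5):
    """Count how many ratios fall into each bin of width `bin_width`.

    Returns a dict mapping the left edge of each non-empty bin to its count.
    """
    bins = Counter(math.floor(r / bin_width) for r in ratios)
    return {round(i * bin_width, 10): n for i, n in sorted(bins.items())}


# A few std dev/mean ratios quoted in this post (THE, OF, TO, A, I, WHAT).
sample = [0.703723, 0.803520, 0.733592, 1.012831, 1.712150, 1.690779]
for left_edge, count in ratio_histogram(sample).items():
    print(f"{left_edge:4.1f} | {'#' * count}")
```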
Back to our discussion. The relevant part of the histogram for our discussion is the part on the far left side, in the low std dev/mean ratio section. Since the height of the histogram is small there, we expect that relatively few of the words we’re looking at are stop words.
We can start playing with cutoffs. For starters, we’ll take a word we think is a stop word (like “the”), and see where its std dev/mean ratio falls. Calculating it, we find that the ratio for “the” is 0.703723. This is very atypical of our data (see histogram), as its standard deviation is LESS than its mean frequency. Looking at a few more words,
OF 0.803520
TO 0.733592
A 1.012831
IS 0.939458
THAT 1.063791
THEY 1.648334
AS 1.148611
I 1.712150
WHAT 1.690779
ABOUT 1.852174
THEIR 1.587259
NOT 1.430927
MORE 1.581704
we start getting a good picture of where the ratios for stop words fall. What is more, we see that we’re getting a rough measure of how meaningful words actually are. Words with higher ratios, like “I” (1.71), carry more meaning than low-ratio words like “OF” (0.80).
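Applying a cutoff is then a simple filter. In the sketch below the ratios are the ones listed above, but the function name and the cutoff value of 1.1 are arbitrary choices for illustration, not the threshold actually settled on.

```python
# std dev/mean ratios copied from the list in this post
RATIOS = {
    "the": 0.703723, "of": 0.803520, "to": 0.733592, "a": 1.012831,
    "is": 0.939458, "that": 1.063791, "they": 1.648334, "as": 1.148611,
    "i": 1.712150, "what": 1.690779, "about": 1.852174, "their": 1.587259,
    "not": 1.430927, "more": 1.581704,
}


def stop_word_candidates(ratios, cutoff):
    """Words whose ratio is below `cutoff`, least variable first."""
    below = [word for word, ratio in ratios.items() if ratio < cutoff]
    return sorted(below, key=ratios.get)


# 1.1 is a hypothetical cutoff; try other values to see the list change.
print(stop_word_candidates(RATIOS, cutoff=1.1))
```

Raising the cutoff pulls in progressively more meaningful words, which is why the choice of threshold deserves some experimentation.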
I played around with cutoffs, and found something interesting. Remember our caveat? Since this data is coming from spidering outward from a set of political blogs, we’re expecting a higher proportion of documents with political content. One of the first “meaningful” words I found, right at the top of the threshold, was “OBAMA”. Everyone talks about him in the political blogosphere such that his name seems not to reliably distinguish the content of one political blog from another. On the set of political blogs, his name is a stop word.
Whether or not you would want to remove all of the stop words you detect is a matter left up to the researcher. You wouldn’t want to remove “obama” if you were planning to do further analysis with him as your subject. As an example of a refinement, you might want to couple this technique with a part-of-speech tagger, and try not to remove proper nouns. The technique I’ve described here is good for identifying words that might complicate your analysis. It’s also good, if used naively, for ruining a perfectly good dataset.