Mr Chatter Box: as-you-type messaging demo

The comet paradigm is here, yet we still have to wait for our interlocutor to finish typing before seeing their prose in so called ‘instant’ chat message sessions. During a typical verbal exchange every sentence’s beginning is uttered and understood before its completion. This demo project represents a technical and social experiment in the instantaneous messaging world. You may remember the feature appearing in the now late Google Wave project a few years back.

Let’s see how to pull off a character-by-character as-you-type chat session with recent technologies. On the front end, we take socket.io’s javascript library, which is traditionally used in conjunction with node.js on the backend. Instead the browser bidirectionally communicates with a server written using an extension of Facebook’s Tornado Web written in Python called tornadIO. And for this version, there is no persistence layer, all data is in the app’s local data structure held in memory. Voila, that’s it.

This is of course intended as demo only, there are known technical limitations as well as user experience issues yet to be addressed. For example, a user may edit any part of the chat box and delete it entirely, in effect changing conversation history. It’s also not clear how to deal with out of bound text, what should the behavior of the scroll bar be in this case.

As a first improvement, instead of sending the content of the whole textarea element on every keyup event, so as to not miss any pasted text, a better approach would be to use Google’s Diff Match Patch project to only send the new characters along. Then, when time comes to scale, app instances can be distributed across processes and machines while communicating via a message broker like rabbitMQ and its associated pika/tornado driver.

Have fun with the demo and checkout the source. Project title credit.

Network Inference, or “Who Got Me Sick?”

There’s a very interesting question you can ask about a network over which diffusion events occur. If, for example, you’re talking about a population of people, a diffusion event could be the spread of an epidemic. What you observe in such a network is a time series infections. The first person is infected, then another, then another, and so on. The question, then, is this: given a set of diffusion events – several different epidemics – what are the routes of infection? That is, if we know who gets infected and when, can we infer from that information alone who is infecting them? This concept generalizes to other types of diffusion events, including the spread of consumer fads and news stories. The same math can be applied to each, with only slight modifications.

The inspiration for this post comes from a paper I saw on the arxiv’s soc-ph section by Gomez-Rodriguez et al.

They propose a new algorithm using a parameterized model of infections that is very parallelizable. The use a few different functions to model a quantity describing how likely it is that person i infects person j as time goes on after i gets infected. They use a number,  \alpha_{j,i} as an adjustable parameter. The larger  \alpha_{j,i} is, the more likely it will be that j is infected after i is infected. All of the functions have the property that they get smaller over time, which just says that if j isn’t infected by i soon after i is infected, then i probably isn’t going to infect j. That is, the likelihood decreases with time.

A key way that this approach differs from previous approaches is that they allow the disease transmission paths to be heterogeneous in time. That is, they allow each transmission path to have a continuously varying parameter, the \alpha_{j,i}, describing how effective each path is at transmitting disease (or information, or fads) and also how long it can take to do it. This contrasts with other approaches that allow the different paths to have different strengths, but also force them to transmit disease over the same time scale.

Finally, a great strength is that the problem can be broken into N separate problems! You can distribute that among N processors, making the problem very parallelizable.

So how well does it work? In simulated networks, it’s edge discovery is great. It’s about 95% accurate. As for getting the weights right, it’s a little less impressive. There was about 25% error in the \alpha_{j,i}, which is actually pretty good, considering they were estimating a couple thousand real numbers. Of course it decreases as the number of cascades grows.

Still, if I’m trying to estimate the edge strength in a real network with far more nodes and edges (they used around 1000 nodes, and around 2000 edges), I’ll probably have far fewer cascades than the 5000 they used. This could make the model much less useful if I’m dealing with a sufficiently dynamic network, where transmission paths change quickly relative to the time between cascades (epidemics, fads, breaking news).

In summary, this is some cool work. You can actually infer who infected whom given a good history of past infections. You can figure out whether two people are exposed to each other with great accuracy without too much data. If, however, you want to figure out how exposed they are, you need much more data.

An Introduction to Meme Tracking

What is a “meme”? A phrase or quote? An idea? A cultural object, like a fad or custom?

We’ve heard the term thrown around, and references back to the (1976) conception by Dawkins as some self-replicating unit of information that can evolve. “Memes” have been represented by everything from twitter hashtags and URLs , to quotes, to mutating phrases. While critics are quick to (I think rightly) point out that these don’t necessarily have anything to do with “real” units of information, it’s hard to overstate the power of these various approaches.

The major shortcoming shared by all of these approaches is that words don’t directly correspond to the ideas they’re intended to represent. While a hashtag is great for helping narrow your search down to a general topic, say V-day (an Italian political movement), it doesn’t capture the particular piece of cultural information conveyed in each tweet. You’re limited to more basic analyses of volume, like “how much is V-day being talked about?”. If you add in some linguistic tricks, you can even find “how much is V-day being agreed with? disagreed with?”.

In a similar vein, you can track significant quotes, and even pieces of quotes and new ones they evolve into. These tell you that someone is likely referring to something someone said, but don’t tell you what is being said about it. They give you a real measure of the flow of information (as the quote is learned and repeated by more and more people), but don’t necessarily tell you much about the cultural ideas surrounding the quote. They don’t tell you whether someone was influenced by this information to change the way they think about the world. They simply tell you that someone possess this information.

With these limitations in mind, these methods can be powerful for pointing you in the right direction for deeper analyses. Without advanced natural language interpreters, we end up needing to have people read content to figure out what the ideas surrounding a “meme” (as defined by the above criteria) really are. You don’t know, for example, if a blog post quoting Obama on health care is agreeing with him or not. You don’t know (barring other “memes” used to support their arguments) what their particular agreements or disagreements are. You simply know that something is being said about something Obama said.

With these limitations laid out, I’d say that the approach that best captures Dawkins’ original definition of a cultural idea that replicates and evolves is probably Leskovec et al’s approach using mutating phrases. It has strength both for information tracking and potentially for measuring cultural influence. Allowing subsets of quotes to be matched with the original quote allows the researcher to track the original quote further than otherwise possible. On my second point, which parts of a quote someone selects can reflect their personal bias. The same can be said for how someone modifies or paraphrases a quote, and so there’s potential for tracking cultural differences.

So how does it work? How can it be used? I’ll attempt to explain now….

In the following, most of the examples are from this paper.

Imagine you have the entire set of blog and media articles, in machine-readable form, over several news cycles (in Leskovec et al’s paper, it was 3 months). Sometimes, you have very quotable phrases emerge, like “lipstick on a pig” or “our entire economy is in danger”. They get repeated over and over again as the information and perspectives they represent is spread, or re-emerges organically. You get small variations of these phrases, and you sometimes even find them buried in larger phrases which are themselves repeated. This can happen, for example, if a new statement is making an allusion to an earlier statement. We get all these mechanisms for mutations. Phrases, like genetic code, evolve as they reproduce (propagate).

The question, then, is how to distinguish distinct phrases from one another. How do you draw a line between phrases that have evolved smoothly into very different phrases? Is there a unique, rigorous way for doing this? We want the result to be objective and intuitive. Leskovec et al have found a solution.

The idea is to draw an arrow from phrase A to phrase B when A is roughly (up to some small changes) included in B. You end up with a long web of arrows, but never a path of arrows that brings you full-circle to any phrase. There are phrases within this web that have no arrows coming out of them, and these are the longest phrases. They can be considered “root” phrases, where the evolution (not necessarily with time) originates. The solution, then, comes from asking “what are the smallest (and least significant) total arrows I can delete in order to break this web into pieces, and where each piece contains exactly 1 root phrase?”. These pieces of the web can be thought of as the set of original phrases and all the mutational variations of them. The deleted arrows can be thought of as connections between phrases that probably weren’t very important, like people happening to use the same phrasing to convey different meanings.

So with that basic idea, you can sort out the “memes” in the political blogosphere. If you graph their volume over time (as Leskovec et al. did in figure 4 of their paper), you get a very informative picture of the rise and fall of different topics and ideas during the news cycle. While this method isn’t perfect for identifying the flow of cultural attitudes and ideas, I’d say it does an excellent job.

How To Find Stop Words: pt. 2

In my last post, I showed how you can find “meaningless” words by looking for ones with very low standard deviation to mean ratios (after talking with a linguist friend, I learned this is more commonly called “coefficient of variation” (COV)). Since then, I’ve done a little more looking into the histogram I posted then showing number of words with different COVs.

The histogram, which I’ll repost here, is based on 2999 documents from our dataset. We found what fraction of the words on a page were some word (”the”, for example), and then for each word measured the average and the standard deviation of that fraction over all the pages. (More precisely, we used normalized word vectors, which are slightly different.) We plotted the ratio of the standard deviation to the mean for each word in a histogram. Here is that histogram:
VarToMeanHist3

The first feature I wanted to explain was that huge spike on the right hand side. It goes completely off the chart. I did a little math, and calculated what the COV would be if there was only one nonzero entry in N articles. That is, what is the COV when the word only occurs in one article? It turns out that it’s independent of how many times that word occurs. When N is 2999, it turns out to be exactly where our huge spike is, at around 55:
COVvsPages

If it only occurs twice, and with the same value each time, you get another spike. This one is around 39. It turns out this effect explains all the big spikes in the data. Having more entries allows the spikes to broaden, since the standard deviation can change more. This is why the COV = 39 spike (from having two non-zero entries) is broader than the COV = 55 spike (from having 1 nonzero entry). Here is how the spike location changes with number of non-zero entries IF ALL OF THOSE ENTRIES HAVE THE SAME VALUE:
SigToMuOfJ

These results are interesting. They say that part of the COV is due simply to the fact that fewer pages mention the word. Further, the COV is becomes less sensitive to changes in number of sites using the word the more sites that use it (the curve flattens as the number of non-zero entries increases). There is another part of the COV that is due only to how much the word varies when it occurs. That part is 0 when the word occurs only once. If we can find bounds on it as the number of documents in which it occurs increases, we can see how the COV’s contribution from each of these effects changes as number of documents in which a word occurs increases.

I worked out an expression for how the COV is effected by the allowing the non-zero entries to take on a distribution with mean  \mu_x and standard deviation  \sigma_x . If there are  q zero entries in an N-entry vector, the squared COV for the word is:

 \frac{\sigma^2}{\mu^2} = \frac{1}{1-p/N}\frac{\sigma_x^2}{\mu_x^2} + \left[ \frac{(p/N)^2}{1-p/N} + \frac{q - p}{N} + p/N \right]

By setting  \sigma_x = 0 and p = N – 1, we should be able to recover our result above for N = 2999, having a peak at around 55 (it works, choose q = N-1). I allowed for you to consider q – p of the 0’s to be part of the distribution of the word (that is, p of the q 0’s should not be considered measurements of the distribution).

Notice that the first term in the squared COV above we see the word’s COV appearing. The second term is entirely due to the presence of zero entries. We might consider it to represent sampling error. Now that we’ve separated these two effects, we can come up with a new metric, a modified COV, that measures exactly what we want. We’ll leave an adjustable parameter that ranges from 0 upward, where 0 implies the COV is entirely due to the number of entries that are non-zero, and larger values imply that the COV is more due to the distribution,  X , of relative word occurrences when they do occur. For simplicity, we take p = q. Here is such a metric:

 \frac{\sigma^2}{\mu^2} = \frac{1}{1-a\epsilon}\frac{\sigma_x^2}{\mu_x^2} + \left[ \frac{(a\epsilon)^2}{1-a\epsilon} + a\epsilon \right]

Here,  a is the adjustable parameter, and  \epsilon = p/N . When  a = 0 , we just get the COV of the word when it appears. When  a = 1 we take the full error term into account. We could choose to let  a \rightarrow N/p , in which case the error term gets large compared with the word’s COV, so we only get the sampling effect.

Notice that for subsequent peaks in the COV distribution, where p = N-2, N-3, etc., we get a minimum value for  \sigma_x = 0 and higher values for non-zero  \sigma . This accounts for the spikes with tails we see in the histogram above.

Now that we’ve developed this, we can figure out what word occurrences are telling us a little more easily. We can choose a cutoff in number of documents where a word appears such that the error is small compared to the word’s COV, or we can adjust the word’s COV to subtract off error (since it is fixed!) based on the number of appearances of the word. Look forward to my next post!

How to Find Stop Words

If you want to look at documents and extract information automatically, you run into an interesting problem. Lots of words are there just for grammatical reasons (the, a, to, for, etc.) and have little meaning. These words can get in the way when you try to do analyses, and you might want to see how your analysis changes when you just cut them out.

To be properly rigorous, you don’t want to use a subjective reason for calling a word meaningless. We get the sense that the words I mentioned before will do little to distinguish the content on page contains from another, but what about more borderline words like “your”, “his”, “can’t”, etc.? Where do we draw the line? It’s helpful to start with thinking about what makes a word meaningless.

By meaningless, I mean that the word doesn’t help distinguish the content of one news or blog article from another. You can try to translate that idea into quantitative terms. If we want to base this definition on total counts of words, and ignore context (much easier to work with for large datasets), we need to come up with a measure of word use based on word counts. If a word occurs with roughly the same frequency across all documents, (frequency = number of occurrences / total words), then we’ll say that it’s used in the same way in those documents.

How do we measure whether or not it’s “roughly the same frequency across some set of documents”? One way is to measure the mean and standard deviation of that word over the set of documents. If the standard deviation in the word frequency is small (compared to the mean) relative to typical standard deviations, then the word frequency doesn’t vary much across the set of all documents. That suggests (with a caveat) that the word does little to distinguish the content of that document from all the others. The caveat is that you can usually choose a subset of your documents over which the standard deviation is small. For example, while the word “cat” will have a larger (standard deviation)/(mean), we can choose to look at the subset of documents talking about cats. Then the ratio should be much smaller.

Lets talk numbers. I looked at a selection of 3000 blogs selected at random from our dataset. I made a histogram of the std dev/mean ratios of all of the words just to see what interesting features pop up. Here is that histogram:

VarToMeanHist3

Some cool features to note are first that the std dev are usually (nearly always) several times the word’s mean frequency. Next, that there several spikes. I suspect these spikes are words that serve a common function, like identifying a particular topic of discussion, framing the post, or they might simply be memes. They will definitely be the subject of a future post.

Back to our discussion. The relevant part of the histogram for our discussion is the part on the far left side, in the low std dev/mean ratio section. Since the height of the histogram is small there, we expect that relatively few of the words we’re looking at are stop words.

We can start playing with cutoffs. For starters, we’ll take a word we think is a stop word (like “the”), and see where it’s std dev/mean ratio falls. Calculating it, we find “the”’s ratio is 0.703723. This is very atypical of our data (see histogram), as it’s standard deviation is LESS than the mean frequency. Looking at a few more words,

OF 0.803520
TO 0.733592
A 1.012831
IS 0.939458
THAT 1.063791
THEY 1.648334
AS 1.148611
I 1.712150
WHAT 1.690779
ABOUT 1.852174
THEIR 1.587259
NOT 1.430927
MORE 1.581704

we start getting a good picture of where the ratios for stop words fall. What is more, we see that we’re getting a rough measure of how meaningful words actually are. Higher ratios, like “I” (1.71) have more meaning than low ratio words like “OF” (.80).

I played around with cutoffs, and found something interesting. Remember our caveat? Since this data is coming from spidering outward from a set of political blogs, we’re expecting a higher proportion of documents with political content. One of the first “meaningful” words I found, right at the top of the threshold, was “OBAMA”. Everyone talks about him in the political blogosphere such that his name seems not to reliably distinguish the content of one political blog from another. On the set of political blogs, his name is a stop word.

Whether or not you would want to remove all of the stop words you detect is a matter left up to the researcher. You wouldn’t want to remove “obama” if you were planning to do further analysis with him as your subject. As an example of a refinement, you might want to couple this technique with a part-of-speech tagger, and try not to remove proper nouns. The technique I’ve described here is good for identifying words that might complicate your analysis. It’s also good, if used naively, for ruining a perfectly good dataset.

Manual Transformations of SVG Graphics

SVG graphics represent a major advancement in web standards. The Wikipedia page has a lot of useful information. The gist of it is; developers (and designers) can now store image data as XML. Files of both vectors and binary data can be used to represent images either inline or as the src attribute of an image.

One of the major pluses of using XML as the representation for these images is the images can now be interacted with! Since SVG supports animation; one could, hypothetically, build a fairly complicated SVG image into a page to give it a highly graphical level of interactivity.

There are a number of w3 proposals and tutorials out there for manipulating and transforming SVG graphics. Many of them operate on vectors with matrices. This is a pretty fun thing. We get to use a little linear algebra!

A rotation matrix is an orthogonal matrix that has a determinant of 1. Here is a good general example of a 2×2 rotation matrix:

You can change out the angle, theta, and that varies the degree to which the vector is rotated each time the matrix operates on it.  If you want to find the vector after it has been rotated an angle of theta you can use this simple equation:

I worked out an example in javascript to transform a vector manually. It operates on the path attribute of an SVG element. It handles only very simple paths right now, but this can be improved upon later. For now one can see how fast the transformation occurs.

Check out the link here. The main file that does the transformation is SVGTransform.js.

Naive Bayesian Data Sorting

In my last post, I introduced the problem of automatically sorting through a large data set to distinguish between one type of data and another. After manually going through our data set, we found 30 main types of data:

  • 1. News Site (msn,yahoo,aol)
    • 2. Search Engine Results
    • 3. List of Stories/Page of summaries
    • 4. Full-length Article
    • 5. Article Teaser (Single summary)
    • 6. Quasi-static Page (about/tos/etc.)
  • 7. Political Blog
    • 8. Left
    • 9. Right
    • 10. Quasi-static Page (about/tos/etc.)
  • 11. Non-Political Blog
    • 12. Personal
    • 13. Niche
    • 14. Quasi-static Page (about/tos/etc.)
  • 15. Prompt (blogger, etc.)
    • 16. Blogger Login
    • 17. Wordpress Login
    • 18. Tell-a-friend
  • 19. Blog company's main site
    • 20. Blog Search Results (List of blogs' links
    • 21. Tutorial
    • 22. Quasi-static Page (about/tos/etc.)
  • 23. Advertiser
    • 24. Internet Services Advertiser
    • 25. Other Services Advertiser
    • 26. Product Advertiser
    • 27. Quasi-static Page (about/tos/etc.)
  • 28. Pornography
  • 29. Foreign Language
  • 30. Other

The first classification was between good data and spammy data. Good data was considered categories 1, 3-5, and 7-14. The rest was considered spammy. Taking the first 90% of our 694 categorized documents to train the filter, I classified the remaining 10% and found that the pages were about 94% correctly classified. That’s a basic application of a spam filter, so we should expect it to work pretty well.

The next part gets more interesting. Can a naive bayesian classifier tell a political blog from a non-political blog? This implies that we’ll have to pare down our training data set to just the blog stuff, so we’ll expect less training. After doing this, we’re left with training data consisting of 260 blog documents, about 72% of which are political blogs.

Training on the first 90% of the data, and testing on the remaining 10%, we find the filter to be 80.8% accurate. Training on the last 90% of the data and testing on the first 10% we find the filter is 76.9% accurate (it mis-classified one more blog than in the first case). So far it looks like it’s doing a decent job. I expect that a little more training data could improve the accuracy.

An interesting issue that arises when you start trying to use these filters for more and more nuanced distinctions is exactly how your training categories are defined. We used very loose definitions of what a political or non-political blog is. If we read through it, and they seemed to be dealing with primarily political content, we called it “political”. If it didn’t, we didn’t. A far more rigorous way of dealing with this would be to set hard, quantitative definitions of what a political blog is. Doing that would naturally lead to a more precise (but not necessarily more accurate) process for automatically identifying blogs simply by checking each of a set of criteria. The advantage of the bayesian approach with manual training is that we don’t have to consider all of the qualities that make something a political blog, and we usually get nice intuitive results.

Look forward to further refinement of these procedures, as I’m eventually going to work out a hybrid decision tree/bayesian classifier approach.

Exploring the HTML5 File API

HTML5 has a lot of amazing things to offer!  I thought I would take a moment to do a working example of what the File API can do.

Until now, it has not been possible to load a file into the browser (to the best of my knowledge) without first sending the file to the server.  This technique is slow!  It requires an HTTP request to send it to the server, then a separate request to send it back to the client.  This is not to mention the bandwidth usage required for the process.

The File API provides a way to load files directly from the user’s computer to their browser.  While support is currently limited to the most recent versions of our favorite browsers; it has been pretty universally accepted.

I whipped up a quick example so you can see what this implies. You can choose an image on your computer and upload them directly into your browser’s memory. You then have the option to access that piece of memory using a temporary URL that references your file.

Here is some example code:

$("#button").click( function(){
		var myfile = files.files[0];
		var myimg = document.getElementById("img");
 
		if (window.URL){
			objectURL = window.URL.createObjectURL(myfile);
			$(myimg).attr("src",objectURL);
			myimg.onload = function () {
				t = setInterval(fitToFrame, 500);
			};
			myfile.onerror = function () {
				alert("There was an error with the upload.");
			};
		} else if (FileReader){
			var reader = new FileReader();
				reader.onerror = function () { alert("Error reading file."); };
				reader.onloadend = function () {
					objectURL = reader.result;
					$(myimg).attr("src",objectURL);
					myimg.onload = function () {
						t = setInterval(fitToFrame, 500);
					};
				}
 
			reader.readAsDataURL(myfile);
 
		} else {
			alert("The File API is not supported by your browser");
		}
	});
 
	var files = document.getElementById("myfile");
 
	files.onchange = function () {
		$("#img").css("opacity","0");
	};

Here I register an onclick event with the preview button.  The file upload input is referenced here in the variable “files”.  I have it globally defined with: var files = document.getElementById("myfile");.  We can then reference our file in the files object as a numeric array; var myfile = files.files[0]; is the first (and in this case only) file.  Now we have the file in the variable myfile.

Next we check to see that the FileReader or the window.URL objects are supported. We do this with our if statements. Assuming the window.URL test passes (as it does for Firefox); we can use that API. To get our URL in memory for the file that is uploaded using the window.URL object we simply use the method window.URL.createObjectURL() with our file as the argument. Altogether; if we want to display the image we’ve just loaded into our browser, we can just load an image with the src equal to window.URL.createObjectURL(myfile) .

With the FileReader API we go about it a little differently. This API is the more universally accepted one as best I can tell. It has a number of nice-to-use events to trigger callback functions. The complete list of callbacks are:

  • onabort – Called when the read operation is aborted.
  • onerror – Called when an error occurs.
  • onload – Called when the read operation is successfully completed.
  • onloadend – Called when the read is completed, whether successful or not. This is called after either onload or onerror.
  • onloadstart – Called when reading the data is about to begin.
  • onprogress – Called periodically while the data is being read.

You can find these in Mozilla’s documentation. I set two callbacks; onloadend and onerror. When the image finishes loading I set an interval function to fit it to the frame and display it. If there is an error with the load the user will see an alert indicating this. After defining these functions I call the reader.readAsDataURL() method with myfile as the argument. When it is finished being loaded the onloadend callback is fired and the image is stored in the reader.result attribute of the FileReader prototype copy.

This is probably the simplest FileReader example to do. There are also a number of more interesting methods available to the developer such as readAsBinaryString() and readAsText(). This leaves a lot up to the imagination.

A Working CURLOPT_WRITEFUNCTION Function for libcurl

I looked around for a long time before I was able to completely put together a working function for my curl routines that would be binary safe and threadsafe. By the end of it the writefunction was a little bit of a beast compared to what it started as.

cURL, by default, outputs downloaded data to STDOUT. This is great if you’re doing a page at a time and you just want to direct the output into a stream. In my case it was impossible to work with. With 50 or so threads all downloading with libcurl at the same time; to have them all write to STDOUT would give me a jarbled mess. cURL offers the CURLOPT_WRITEFUNCTION option and the CURLOPT_WRITEDATA function for users to store the output in memory instead of streaming it out.

The writefunction option takes a function handle of the form:
size_t static writefunction(void *ptr, size_t size, size_t nmemb, void *userdata)

The writedata option takes an address to the variable you would like to write your data to in the writefunction (the void *userdata parameter).

My first attempt at the writefunction was rather naive:

size_t static writefunction(void *ptr, size_t size, size_t nmemb, void *userdata)
{
   char **temp = (char **) userdata;
   *temp = strndup((const char *)ptr,(size_t)size*nmemb);
   return nmemb*size;
}

The problem with this function comes with the way it handles the data as it’s downloaded.  cURL uses a buffer to store data as it is downloaded.  It calls your writefunction each time the buffer is ready to be read.  It is up to the developer to handle the function being called several times before the page is finished being read.  The function above, as you can see, only copies over the buffer…  and over and over.  The result is that only the last buffer read is stored in the variable passed in the CURLOPT_WRITEDATA option.

The function above has a second shortcoming as well.  strndup() assumes the input is a char array!  This is no good if you happen to start downloading a jpeg or some other binary data.  Your program would likely crash or corrupt some memory somewhere.  A second attempt uses memcpy() to make no assumptions about the data being read:

size_t static
writefunction( void *ptr, size_t size,
               size_t nmemb, void *userdata)
{
    size_t nbytes = size*nmemb;
      struct thread_status **this_status;
      this_status = (struct thread_status **) userdata;
 
      if (!(*this_status)->initialized){
            (*this_status)->data = (void *)malloc(nbytes);
            (*this_status)->bufferlen = nbytes;
            (*this_status)->writepos = 0;
            (*this_status)->initialized = true;
      }
 
      if ((*this_status)->bufferlen < ((*this_status)->writepos + nbytes)){
         (*this_status)->bufferlen = (*this_status)->bufferlen + nbytes;
         (*this_status)->data = realloc((*this_status)->data, (size_t) ((*this_status)->writepos + nbytes));
      }
 
      assert((*this_status)->data != NULL);
      memcpy((*this_status)->data + (*this_status)->writepos, ptr, nbytes);
      (*this_status)->writepos += nbytes;
    return nbytes;
}

OK.  This function is MUCH more complex.  Now instead of giving cURL a variable to work with we’re giving it an entire void * to a structure.  This is because the function we pass with CURLOPT_WRITEFUNCTION must keep track of the buffer length, as well as the write position of the function.  We also need to keep track of whether or not our memory for the variable that stores our data in our structure has been initialized.  To do all this we make a structure of the form:

struct thread_status {
    void *data;
    size_t bufferlen;
    size_t writepos;
    int initialized;
};

This allows us to keep track of all we need.  We can then pass our function and our structure to cURL like so:

curl_easy_setopt(curl,
      CURLOPT_WRITEFUNCTION, writefunction);
   curl_easy_setopt(curl,
      CURLOPT_WRITEDATA, (void *) &this_struct);

So how does it work?  Well.  The writefunction must return the number of bytes handled to indicate success.  If it matches up with the number of bytes the function was passed cURL continues downloading.  Otherwise libcurl registers an error.  After we figure out how many bytes we’re passed we can just store it in nbytes.

In the next lines:

struct thread_status **this_status;
      this_status = (struct thread_status **) userdata;

We cast our void * to the form of our data structure passed in the CURLOPT_WRITEDATA option.  Now we can access it locally!  Then we have to ask ourselves whether or not the variable to handle the downloaded data has been allocated any memory:

if (!(*this_status)->initialized){
            (*this_status)->data = (void *)malloc(nbytes);
            (*this_status)->bufferlen = nbytes;
            (*this_status)->writepos = 0;
            (*this_status)->initialized = true;
      }

If it hasn’t then we allocate enough memory to handle the current buffer, nbytes.  We then indicate that the current buffer length is nbytes long.  Since we haven’t written anything to it yet we indicate the write position is 0, and then indicate that the structure has been initialized for subsequent calls to the writefunction.

Each other call to the writefunction will use this routine:

if ((*this_status)->bufferlen < ((*this_status)->writepos + nbytes)){
         (*this_status)->bufferlen = (*this_status)->bufferlen + nbytes;
         (*this_status)->data = realloc((*this_status)->data, (size_t) ((*this_status)->writepos + nbytes));
      }

This checks to see that the buffer length is long enough to accommodate the next nbytes to be written to it.  If the buffer isn’t long enough it simply adds nbytes to the current length.  This is a slight performance issue to be changed in a later version.  Apparently it is worth the extra memory to avoid the n^2 complexity of reallocating memory in this way.  Whatever the case, we finally write to the buffer in a binary safe way after asserting we are not out of memory (data != NULL):

assert((*this_status)->data != NULL);
      memcpy((*this_status)->data + (*this_status)->writepos, ptr, nbytes);
      (*this_status)->writepos += nbytes;

If the assertion succeeds we return nbytes and then curl handles the rest.  At the end of the download the complete downloaded page is available in the structure’s data variable.  You can also check the length of data to be read by handling functions by reading the writepos variable of the structure.

I hope this saves everyone some time.  It certainly took me a little while to get a binary-safe, thread safe writefunction.  To ensure thread safety, though, the structure passed in the CURLOPT_WRITEDATA option should be initialized in the handling thread.  cURL should also be encapsulated in that thread and the CURLOPT_NOSIGNAL option must be set to true.

Adding functions to selenium via user-extensions.js

The official selenium documentation regarding user-extensions and selenium RC mentions finding many ways to accomplish this all over the internet. Well here’s yet another one. Since I couldn’t readily find what I needed  I detail below of how to add functionality to selenium via user-extensions.js and python. Specifically, the way to add javascript functions which preform actions, return text or a boolean differs slightly.

First we define the functionality in javascript:

  1. Selenium.prototype.doSomeAction = function(inputParams)
  2. {
  3. alert("Hello there");
  4. return null;
  5. };
  6.  
  7. Selenium.prototype.getSomeText = function(inputParams)
  8. {
  9. return "some text";
  10. };
  11.  
  12. Selenium.prototype.getIsVisible = function(inputParams)
  13. {
  14. return true;
  15. };

Then we can call the functions defined above in python directly like below or by extending the selenium.py driver.

test.py   
import selenium 
 
sel = selenium.selenium("localhost", 4444, "*firefox", "http://www.google.com")
sel.start()
# The 'do' in front of the function name is added later on by selenium
# and the casing is modified.
sel.do_command("someAction", [])
print sel.get_string("getSomeText", [])
print sel.get_boolean("getIsVisible", [])
 
sel.stop()

And before runing the python script above, make sure to start the selenium RC like so:

java -jar selenium-server.jar -userExtensions user-extensions.js

I had a hard time piecing this little bit together from various mailing list posts. Watch the casing of the functions names.