Wednesday, November 14, 2007

Hacking Freerice.com: A Program to Feed the World

While I was working on some changes to Twittervision yesterday, I saw someone mention freerice.com, a site where you can go quiz yourself on vocabulary words and help feed the world. How? Each word you get right gives 10 grains of rice to, one hopes, someone who needs it.

The idea is that you will sit there for hours and look at the advertising from the do-gooder multinationals who sponsor it. Which I did for a while. I got up to level 44 or so and got to feeling pretty good about Toshiba and Macy's.

It occurred to me, though, that my computer could also play this game, and it has a much better memory for words than I do. In fact, once it learns something, it always chooses the right answer.

So I wrote a program to play the freerice.com vocabulary game. In parallel. 50 browsers at a time. Sharing what they learn with each other. Cumulatively.

It's a multithreaded Ruby program using WWW::Mechanize and Hpricot. Nothing terribly fancy, but it does learn from each right and wrong answer, and after just a few minutes seems to hit a stride of about 75-80% accuracy. And a rate of about 200,000 grains of rice per hour (depending on the speed of your connection).
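The heart of it is nothing fancier than a shared memo table. Since I've had Erlang on the brain lately (see the posts below), here's that idea sketched as an Erlang process rather than the actual Ruby -- the module, names, and protocol here are invented for illustration, not lifted from the bot:


-module(memory).
-export([start/0, lookup/1, learn/2]).

%% Not the real bot (that's Ruby, via WWW::Mechanize and Hpricot).
%% One "memory" process holds everything learned so far; all the
%% parallel workers ask it before guessing and report back after.
start() ->
    register(memory, spawn(fun() -> loop(dict:new()) end)).

lookup(Word) ->
    memory ! {self(), lookup, Word},
    receive {memory, Answer} -> Answer end.   % {ok, Choice} | unknown

learn(Word, Choice) ->
    memory ! {learn, Word, Choice}.

loop(Known) ->
    receive
        {From, lookup, Word} ->
            case dict:find(Word, Known) of
                {ok, Choice} -> From ! {memory, {ok, Choice}};
                error        -> From ! {memory, unknown}
            end,
            loop(Known);
        {learn, Word, Choice} ->
            loop(dict:store(Word, Choice, Known))
    end.


Fifty workers sharing one memory process is exactly the cumulative learning trick described above.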

UPDATE: With some tuning, the script is now able to push out about 600,000 grains of rice per hour, which, at the cited figure of 20,000 grains per person per day, works out to 600,000 × 24 ÷ 20,000 = 720 people fed per day! If one thousand people run this script, it will (allegedly) generate enough to feed 720,000 people per day.

Before you go off on me, disclaimer: Yes, I realize this program subverts the intent of the freerice.com site. I've released this not to "game" freerice.com but simply to show a flaw in their design and have a little fun at the same time. If what they are after is human interaction, this design doesn't mandate it. That's all I'm saying.

Run it for a while and see how many people you can feed!

Prerequisites:

  • Ruby (Linux, OS X, Other)
  • Rubygems
  • gem install mechanize --include-dependencies


Download the code

Saturday, November 10, 2007

Concurrent Erlang: Watch out, API developers!

Continuing my theme of lifting ideas from Dave Thomas' blog posts: in our last episode we built a somewhat broken program to sequentially fetch feed information from YouTube's XML API.

I had some trouble understanding why I wasn't getting #xmlText records back from xmerl_xpath. Thanks to a comment by Ulf Wiger, I now understand what was going wrong.

As many of us do, I was using the shell to play around with ideas before committing them to my program. In the shell, the #xmlText record format is undefined because xmerl.hrl isn't loaded there. The header *was* being included in my program, but all of my experimenting was happening in the shell, not in the compiled code.

I took his advice, used the force, and got my patterns to match #xmlText records.
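For anyone else who gets bitten by this: you can pull the record definitions into the shell by hand with rr/1, after which xmerl_xpath results print and match as #xmlText records. Something like this (assuming Xml is already bound to a parsed document):


> rr(code:lib_dir(xmerl) ++ "/include/xmerl.hrl").
> [#xmlText{value = Name}] = xmerl_xpath:string("//author/name/text()", Xml).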

I also copied Dave Thomas' design pattern for parallel spawning of the fetch process to produce this program which 1) grabs a feed of the most_viewed videos on YouTube, and then 2) grabs in parallel the user profiles for each of those videos.

While I still have only a rudimentary understanding of the language, I at least understand everything that's going on in this program. It's amazing how fast concurrent programs in Erlang can be: the fetch_parallel function in this program runs in about 3 seconds, while the fetch_sequential version takes about 20 seconds.
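If you want to reproduce the comparison yourself (assuming the module below is compiled as youtube.erl; the numbers in the comments are what I saw, and yours will vary with your connection), timer:tc reports a call's wall-clock time in microseconds:


> c(youtube).
> inets:start().
> timer:tc(youtube, fetch_sequential, []).   % ~20 seconds for me
> timer:tc(youtube, fetch_parallel, []).     % ~3 seconds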

Think about what this means for API developers: the implications are scary. In short, they will need a lot more bandwidth and processing capacity to deal with concurrent clients than they presently need for a sequential universe. Most API developers are accustomed to interacting with programs that make a single request, do some processing, and then make another related request.

A world of Erlang-derived, concurrent API clients likely calls for Erlang-derived concurrent API servers. Today's API interactions are timid, one-off requests compared to what's possible in a concurrent API interaction.

Imagine a recursive program designed to spider through API data until it finds the results it's looking for. You could easily write a program that grabs a set of N search results, which in turn generates N concurrent API queries, which in turn generates N^2 concurrent API requests, which in turn generates N^3 requests.

You get the idea. Rather than being simple request & response mechanisms, APIs in fact expose all of the data they contain -- in parallel and all at once. A single concurrent Erlang client can easily create as much simultaneous load as 10,000 individual sequential API consumers do now.
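To make the fan-out concrete, here's a hypothetical sketch -- the module, the depth limit, and extract_links are all invented, and please don't point this at a real API:


-module(spider).
-export([spider/2]).

%% Each page yields N links; each link is followed in a fresh process,
%% so a crawl of depth D fires on the order of N^D requests.
spider(_URL, 0) ->
    done;
spider(URL, Depth) ->
    { ok, {_Status, _Headers, Body} } = http:request(URL),   % requires inets:start()
    lists:foreach(fun(Link) ->
        spawn(fun() -> spider(Link, Depth - 1) end)
    end, extract_links(Body)).

%% Stub -- a real client would parse follow-up URLs out of the response here.
extract_links(_Body) ->
    ["http://example.org/a", "http://example.org/b"].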

API developers should start pondering how they'll respond. Right now, there are no standards for enforcing best practices on most APIs. There's nothing to stop a developer from requesting the same data over and over again from an API, other than things like the Google maps geolocation API limit of 50,000 requests per day. But what about caching and expiring data, refusing repetitive requests, enforcing bandwidth limits or other strategies?
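As one concrete example of "refusing repetitive requests", a server could keep a per-client counter and cut off anyone over a limit. A bare-bones sketch -- the names, numbers, and protocol are all mine, not any real API's:


-module(throttle).
-export([start/1, allow/1, reset/0]).

%% One process owns the counters; the server asks permission per request.
start(Limit) ->
    register(throttle, spawn(fun() -> loop(Limit, dict:new()) end)).

%% Returns ok or denied for a given client identifier.
allow(ClientId) ->
    throttle ! {self(), allow, ClientId},
    receive {throttle, Answer} -> Answer end.

%% Call once per window (say, every minute) to zero the counters.
reset() ->
    throttle ! reset.

loop(Limit, Counts) ->
    receive
        {From, allow, ClientId} ->
            Count = case dict:find(ClientId, Counts) of
                        {ok, N} -> N;
                        error   -> 0
                    end,
            case Count < Limit of
                true ->
                    From ! {throttle, ok},
                    loop(Limit, dict:store(ClientId, Count + 1, Counts));
                false ->
                    From ! {throttle, denied},
                    loop(Limit, Counts)
            end;
        reset ->
            loop(Limit, dict:new())
    end.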

Many people do all of these things in different ways, but we're at the tip of the iceberg in terms of addressing these kinds of issues. A poorly designed sequential API client is one thing; a badly designed concurrent API client is another thing altogether and could constitute a kind of DoS (denial of service) attack.

Start thinking now about how you're going to deal with the guy who polls you every 10 seconds for the latest status of all 142,000 of his users -- in parallel, 15,000 at a time.

And for you would-be API terrorists out there, here's some code:


-module(youtube).
-export([fetch_sequential/0, fetch_parallel/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Grab the most_viewed feed and return the author names as #xmlText records.
get_feed() ->
    { ok, {_Status, _Headers, Body} } = http:request("http://gdata.youtube.com/feeds/standardfeeds/most_viewed"),
    { Xml, _Rest } = xmerl_scan:string(Body),
    xmerl_xpath:string("//author/name/text()", Xml).

%% Fetch a single user's profile feed and pull out the id and published date.
get_user_profile(User) ->
    #xmlText{value = Name} = User,
    URL = "http://gdata.youtube.com/feeds/users/" ++ Name,
    { ok, {_Status, _Headers, Body} } = http:request(URL),
    { Xml, _Rest } = xmerl_scan:string(Body),
    [#xmlText{value = Id}] = xmerl_xpath:string("//id/text()", Xml),
    [#xmlText{value = Published}] = xmerl_xpath:string("//published/text()", Xml),
    { Name, Id, Published }.

fetch_sequential() ->
    inets:start(),      % http:request/1 fails without this
    lists:map(fun get_user_profile/1, get_feed()).

fetch_parallel() ->
    inets:start(),
    Users = get_feed(),
    lists:foreach(fun background_fetch/1, Users),
    gather_results(Users).

%% Spawn one process per user; each mails its result back to the parent.
background_fetch(User) ->
    ParentPID = self(),
    spawn(fun() ->
        ParentPID ! { ok, get_user_profile(User) }
    end).

%% Collect one message per spawned fetch, in whatever order they arrive.
gather_results(Users) ->
    lists:map(fun(_) ->
        receive
            { ok, Anything } -> Anything
        end
    end, Users).

Sunday, November 4, 2007

Erlang Makes My Head Hurt

For those of you who haven't heard about Erlang yet, it is a functional programming language (in the family of Lisp, with syntax borrowed from Prolog) developed quietly over the last 20 years by telecoms giant Ericsson for use in telco switches.

Ericsson has been using it for roughly the last 14 years, and it has several properties that make it particularly relevant to the problems facing developers today. It's one of the few languages that are genuinely good at letting programmers take advantage of multi-core/multi-CPU systems and distribute services across multiple boxes. YAWS, a webserver written in Erlang and the poster child for its efficiency, kicks Apache's tail from a scalability standpoint. That is no small accomplishment for a high-level language.

Today's scaling strategies revolve less around faster clock speeds and more around adding cores. Scaling out to many machines is also important, but power and space considerations are also more of an issue than ever before.

So Erlang is gaining ground because it addresses scalability for this new age of multi-core systems. Today you might have a dual Clovertown Xeon box with 8 cores, but very little software to take advantage of it. Once you get past 2 or 4 cores, that extra capacity provides little to no benefit. Enter a language like Erlang, and suddenly all that power becomes available to the programmer.

Some of my coder buddies (Jay Phillips, Rich Kilmer, and Marcel Molina) have been looking at Erlang for various tasks, and Dave Thomas' blog posts on Erlang have also inspired me to take a look at the language for some of my own work.

I picked up Joe Armstrong's book, Programming Erlang, from the Pragmatic Bookshelf and started reading it on a recent airplane flight.

Today I put together my first Erlang program, based on knowledge gleaned from Dave Thomas' postings and from the book.

This very simple program grabs the top_rated feed of videos from YouTube (an XML RSS feed) and then iterates through the result set to get the profile URL for each user. It is a fairly useless and trivial example, but if I can make this work then there are other things I can do down the line.


-module(youtube).
-export([fetch_each_user/0]).
-include_lib("xmerl/include/xmerl.hrl").

get_feed() ->
    { ok, {_Status, _Headers, Body} } = http:request("http://gdata.youtube.com/feeds/standardfeeds/top_rated"),
    { Xml, _Rest } = xmerl_scan:string(Body),
    xmerl_xpath:string("//author/name/text()", Xml).

get_user_profile(User) ->
    %% Matching the xmlText tuple by position -- see my puzzlement below.
    {_,[A|B],_,[],Name,_} = User,
    URL = "http://gdata.youtube.com/feeds/users/" ++ Name,
    { ok, {_Status, _Headers, Body} } = http:request(URL),
    { Xml, _Rest } = xmerl_scan:string(Body),
    [{_,[C|D],_,[],Id,_}] = xmerl_xpath:string("//id/text()", Xml),
    { Name, Id }.

fetch_each_user() ->
    inets:start(),      % http:request/1 fails without this
    lists:map(fun get_user_profile/1, get_feed()).


I am pretty sure I am doing this All Wrong (tm).

My biggest area of confusion is the pattern matching required to read the results of xmerl_xpath:string parsing. According to Dave Thomas' examples, xmerl_xpath should produce an #xmlText record (or a list of them) that can then be matched with the #xmlText{} syntax.

In practice, with the YouTube API data I used, I see no such #xmlText records. Instead the parse hands me back raw tuples, along the lines of:


> xmerl_xpath:string("//location/text()", Xml).
[{xmlText,[{'yt:location',14},{entry,1}],1,[],"GB",text}]


The only way I can find to match it is something like this:


[{_,[A|B],_,[],Location,_}] = xmerl_xpath:string("//location/text()", Xml)


I am sure I am missing some key step or concept, but that's how we learn new languages -- stumble along until we figure out how to solve the things we want to solve.

There's an incredible amount of functionality packed into this little module. It'll be even more amazing when I figure out my initial questions and then add concurrent processing of the user profile URLs. In theory I can simultaneously process dozens of URL feeds from YouTube and spider their API data as though through a fire hose. Stay tuned.

Meantime if anyone has any suggestions on my current puzzlements I'd love to hear them.

Erlang is a cool language. It doesn't give me the aesthetic fuzzies I get from programming in Ruby, but I do get pretty jazzed up thinking about what should be possible from a performance and economy standpoint. Erlang doesn't allow mutable state: variables are bound once and keep their values, and iteration is generally handled via recursion. That's a big part of how it scales out to so many cores/CPUs/machines so readily. It's kinda weird if you're used to "normal" mutable-state languages.
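A trivial example of what that looks like: summing a list with no loop counter to mutate, just a function calling itself on a smaller list.


-module(demo).
-export([sum/1]).

%% Every call gets fresh, immutable bindings; iteration is recursion.
sum([])    -> 0;
sum([H|T]) -> H + sum(T).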

Whenever I learn a new language (human or computer) I generally have weird and overactive dreams. I attribute this to my brain shuffling things around to accommodate new grammar and semantics.

The last few days have produced particularly vivid dreams.