Saturday, November 10, 2007

Concurrent Erlang: Watch out, API developers!

Continuing my theme of lifting ideas from Dave Thomas' blog posts: in our last episode, we built a somewhat broken program to sequentially fetch feed information from YouTube's XML API.

I had some trouble understanding why I wasn't getting #xmlText records back from xmerl_xpath. Thanks to a comment by Ulf Wiger, I now understand what was going wrong.

As many of us do, I was using the shell to play around with ideas before committing them to my program, and in the shell the record format is not defined because xmerl.hrl is not referenced. The header *was* included in my program, but my shell experiments weren't going through the compiled module.

I took his advice, used the force, and got my patterns to match #xmlText records.

I also copied Dave Thomas' design pattern for parallel spawning of the fetch process to produce this program which 1) grabs a feed of the most_viewed videos on YouTube, and then 2) grabs in parallel the user profiles for each of those videos.

While I still only have a rudimentary understanding of the language, I at least understand everything that's going on in this program. It's amazing how fast concurrent Erlang programs can be: the fetch_parallel function in this program runs in about 3 seconds, while the fetch_sequential version takes about 20.

If you think about what this means for API developers, it has scary implications. In short, they will need a lot more bandwidth and processing capacity to deal with concurrent clients than are presently needed to deal with a sequential universe. Most API developers are accustomed to interacting with programs that make a single request, do some processing, and then make another related request.

A world of Erlang-derived, concurrent API clients likely calls for Erlang-derived concurrent API servers. Today's API interactions are timid, one-off requests compared to what's possible in a concurrent API interaction.

Imagine a recursive program designed to spider through API data until it finds the results it's looking for. You could easily write a program that grabs a set of N search results and issues a concurrent query for each; if each response yields N more leads, those generate N^2 concurrent requests, which in turn generate N^3.

You get the idea. Rather than being simple request & response mechanisms, APIs in fact expose all of the data they contain -- in parallel and all at once. A single concurrent Erlang client can easily create as much simultaneous load as 10,000 individual sequential API consumers do now.
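That recursive spidering takes only a few lines. Here's a sketch, with a hypothetical fake_search/1 standing in for a real API request:

```erlang
-module(fanout).
-export([spider/2]).

%% Spawn one process per result, to a fixed depth. Each level
%% multiplies the number of in-flight "requests" by the fan-out.
spider(_Seed, 0) -> ok;
spider(Seed, Depth) ->
    Results = fake_search(Seed),  %% a real client would hit the API here
    [spawn(fun() -> spider(R, Depth - 1) end) || R <- Results],
    ok.

%% Hypothetical stub returning five leads per query.
fake_search(Seed) ->
    [{Seed, N} || N <- lists:seq(1, 5)].
```

With a fan-out of 5 and a depth of 4, that's 5 + 25 + 125 + 625 requests from one top-level call, all racing each other.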

API developers should start pondering questions like these now. There are no standards for enforcing best practices on most APIs. Nothing stops a developer from requesting the same data over and over again, apart from per-service quotas like the Google Maps geolocation API's limit of 50,000 requests per day. But what about caching and expiring data, refusing repetitive requests, enforcing bandwidth limits, or other strategies?

Many people do all of these things in different ways, but we're at the tip of the iceberg in terms of addressing these kinds of issues. A poorly designed sequential API client is one thing; a badly designed concurrent API client is another thing altogether and could constitute a kind of DoS (denial of service) attack.
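Ironically, Erlang makes the server-side defenses cheap too. Here's a toy per-client throttle as a single process; this is a sketch only, with counts that never reset (a real limiter would expire them on a timer):

```erlang
-module(throttle).
-export([start/1, check/2]).

%% Start a limiter that allows at most Limit requests per client key.
start(Limit) ->
    spawn(fun() -> loop(Limit, dict:new()) end).

%% Ask the limiter whether this client may proceed.
check(Pid, Client) ->
    Pid ! {self(), Client},
    receive {Pid, Answer} -> Answer end.

loop(Limit, Counts) ->
    receive
        {From, Client} ->
            Used = case dict:find(Client, Counts) of
                       {ok, N} -> N;
                       error   -> 0
                   end,
            case Used < Limit of
                true ->
                    From ! {self(), allowed},
                    loop(Limit, dict:update_counter(Client, 1, Counts));
                false ->
                    From ! {self(), refused},
                    loop(Limit, Counts)
            end
    end.
```

A request handler would call throttle:check/2 before doing any work, and answer clients who get refused with an error instead of data.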

Start thinking now about how you're going to deal with the guy who polls you every 10 seconds for the latest status of all 142,000 of his users -- in parallel, 15,000 at a time.
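That "15,000 at a time" pattern is itself a one-function idiom on the client side: parallel map with bounded concurrency. A sketch, assuming the caller supplies the fetch function:

```erlang
-module(batched).
-export([pmap_chunked/3]).

%% Apply Fun to every item, but never run more than BatchSize
%% processes at once: spawn one chunk, collect it, move on.
%% Tagging each reply with the worker's pid preserves input order.
pmap_chunked(_Fun, [], _BatchSize) -> [];
pmap_chunked(Fun, Items, BatchSize) ->
    {Chunk, Rest} = take(BatchSize, Items),
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), Fun(I)} end) || I <- Chunk],
    Results = [receive {Pid, R} -> R end || Pid <- Pids],
    Results ++ pmap_chunked(Fun, Rest, BatchSize).

take(N, List) when length(List) =< N -> {List, []};
take(N, List) -> lists:split(N, List).
```

Point Fun at an HTTP fetch and the 142,000-user poll above is three lines of calling code.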

And for you would-be API terrorists out there, here's some code:


-module(youtube).
-export([fetch_sequential/0, fetch_parallel/0]).
-include_lib("xmerl/include/xmerl.hrl").

get_feed() ->
    {ok, {_Status, _Headers, Body}} =
        http:request("http://gdata.youtube.com/feeds/standardfeeds/most_viewed"),
    {Xml, _Rest} = xmerl_scan:string(Body),
    xmerl_xpath:string("//author/name/text()", Xml).

get_user_profile(User) ->
    #xmlText{value = Name} = User,
    URL = "http://gdata.youtube.com/feeds/users/" ++ Name,
    {ok, {_Status, _Headers, Body}} = http:request(URL),
    {Xml, _Rest} = xmerl_scan:string(Body),
    [#xmlText{value = Id}] = xmerl_xpath:string("//id/text()", Xml),
    [#xmlText{value = Published}] = xmerl_xpath:string("//published/text()", Xml),
    {Name, Id, Published}.

fetch_sequential() ->
    inets:start(),
    lists:map(fun get_user_profile/1, get_feed()).

fetch_parallel() ->
    inets:start(),
    Users = get_feed(),
    lists:foreach(fun background_fetch/1, Users),
    gather_results(Users).

background_fetch(User) ->
    ParentPID = self(),
    spawn(fun() ->
              ParentPID ! {ok, get_user_profile(User)}
          end).

%% Note: replies arrive in completion order, not feed order.
gather_results(Users) ->
    lists:map(fun(_) ->
                  receive
                      {ok, Anything} -> Anything
                  end
              end, Users).

2 comments:

topfunky said...

Hammering a server with requests has always been bad form. My general understanding has been that issuing more than one request per second is rude.

Some services limit API consumers to a certain number of requests per hour to discourage developers from writing programs like this.

Dave Troy said...

Geoffrey - I agree entirely; badly designed code that stresses APIs unnecessarily makes no sense.

That said, a well-designed algorithm that benefits from fast, concurrent access to an API shouldn't be arbitrarily constrained in terms of number of requests per second.

If we need to impose limitations today because of the way that APIs are designed and implemented, that is understandable. But ultimately if Erlang gains popularity for developing client-side algorithms, I'd say it's a pretty good bet that Erlang will gain popularity on the server side as well.

The current Web 2.0 paradigm is not geared towards parallel, concurrent algorithms. Maybe Erlang will lead people down this path.