Wed, 26 Nov 2008

Windows Azure Worker Role to Deal with Spam

As promised, today I added a worker role to asynchronously process comments and attempt to detect spam, and I invite you to test it out!  See the bottom of this post for details.

Design

Here’s a flow diagram I drew on my whiteboard:

[image: whiteboard flow diagram]

The steps are:

  1. A comment comes in via my blog.
  2. The comment gets stored in a Windows Azure table.
  3. A reference to the comment gets stored in a Windows Azure queue.
  4. (Some time later) a worker role picks up the queue item and retrieves the comment from table storage.
  5. The worker talks to TypePad AntiSpam and asks whether the comment is spam or not.
  6. The worker updates the comment table to reflect the result of the spam test.

Note that after step (3), the synchronous portion is done, so the website remains responsive.  (There’s no need to wait for the spam check, which could be slow, even though it has been quite speedy in practice.)  The IsSpam property defaults to false, so the comment shows up right away, providing immediate feedback that comment submission succeeded.
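To make the split between the synchronous and asynchronous halves concrete, here is a minimal in-memory sketch of the six steps.  Everything in it is a hypothetical stand-in (the Comment and CommentPipeline classes, the dictionary playing the table, the Queue<string> playing the queue, and the isSpam delegate playing the spam service); none of it is the Azure storage API.  One simplification to keep in mind: Dequeue removes the item immediately, whereas a real Windows Azure queue only hides a message until it is explicitly deleted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the stored comment entity.
public class Comment
{
    public string PartitionKey;
    public string RowKey;
    public string Author;
    public bool IsSpam; // defaults to false, so a new comment is visible right away
}

// Hypothetical in-memory model of the flow; not the Azure storage API.
public class CommentPipeline
{
    private readonly Dictionary<string, Comment> table = new Dictionary<string, Comment>();
    private readonly Queue<string> queue = new Queue<string>();
    private readonly Func<Comment, bool> isSpam;

    public CommentPipeline(Func<Comment, bool> isSpam)
    {
        this.isSpam = isSpam;
    }

    // Steps 1-3: the synchronous part that runs while the visitor waits.
    public void Submit(Comment c)
    {
        var key = c.PartitionKey + "/" + c.RowKey;
        table[key] = c;      // step 2: store the comment
        queue.Enqueue(key);  // step 3: enqueue a reference to it
    }

    // Steps 4-6: one pass of the worker.  Returns false when there is no work.
    public bool ProcessNext()
    {
        if (queue.Count == 0)
        {
            return false;
        }
        var key = queue.Dequeue();                     // step 4: pick up the reference
        Comment c;
        if (table.TryGetValue(key, out c) && isSpam(c)) // step 5: ask the spam check
        {
            c.IsSpam = true;                           // step 6: record the result
        }
        return true;
    }
}
```

Submit returns as soon as the reference is queued, so the comment is visible (IsSpam is still false) until some later call to ProcessNext flips the flag.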

The big advantage of this architecture is the loose coupling.  Because the spam check is asynchronous, the blog itself can continue to function without it.  That means that if TypePad AntiSpam has downtime (or my worker role has a bug in it), normal use of my blog won’t be disrupted.  It also means that if I later plug in a more sophisticated (and slower) analysis, I don’t have to worry about my comment form responding slowly or my front-end getting bogged down.

I can also scale the roles differently.  I’m using two instances of the web role right now, but there’s no need for more than one instance of the worker role, since fewer than 50 comments come in per hour.

Implementation

In my last post, I described the changes I made to my data model and the blog code.  One additional change is a one-liner to enqueue work when a comment has been stored:


    QueueStorage.Create(StorageAccountInfo.GetDefaultQueueStorageAccountFromConfiguration())
        .GetQueue("commentqueue")
        .PutMessage(new Message(string.Format("{0}/{1}", comment.PartitionKey, comment.RowKey)));
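The message body is just the comment’s two table keys joined by a slash.  Since table storage does not allow “/” in a PartitionKey or RowKey, the separator is unambiguous.  As a rough sketch, the format and its inverse could be kept together in one small helper; CommentMessage and both of its methods are hypothetical names of my choosing, not part of any library:

```csharp
using System;

// Hypothetical helper that keeps the queue message format in one place.
public static class CommentMessage
{
    // Builds the "<PartitionKey>/<RowKey>" body used when enqueuing.
    public static string Format(string partitionKey, string rowKey)
    {
        return string.Format("{0}/{1}", partitionKey, rowKey);
    }

    // Splits a message body back into its keys.  Returns false (rather than
    // throwing) if the body is not exactly two non-empty parts.
    public static bool TryParse(string content, out string partitionKey, out string rowKey)
    {
        partitionKey = null;
        rowKey = null;
        if (content == null)
        {
            return false;
        }
        var split = content.Split('/');
        if (split.Length != 2 || split[0].Length == 0 || split[1].Length == 0)
        {
            return false;
        }
        partitionKey = split[0];
        rowKey = split[1];
        return true;
    }
}
```

A worker using TryParse could discard a malformed message instead of failing with an index-out-of-range exception on it.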

The worker role code is quite simple.  To talk to TypePad AntiSpam, I used the Akismet .Net 2.0 API project on CodePlex, with a minor change to point to TypePad AntiSpam instead.  This is nearly all of the code from the worker (omitting the function that converts my comment object to an AkismetComment):

    public override void Start()
    {
        var q = QueueStorage.Create(StorageAccountInfo.GetDefaultQueueStorageAccountFromConfiguration())
            .GetQueue("commentqueue");
        var akismet = new Akismet("<KEY DELETED>", "http://blog.smarx.com", "blog.smarx/2");
        if (!akismet.VerifyKey())
        {
            throw new ArgumentException("Invalid key.");
        }

        var svc = new BlogDataServiceContext();

        // Poll the queue forever, handling one message per iteration.
        while (true)
        {
            var msg = q.GetMessage();
            if (msg != null)
            {
                // The message body is "<PartitionKey>/<RowKey>" of the stored comment.
                var split = msg.ContentAsString().Split('/');
                var partitionkey = split[0];
                var rowkey = split[1];

                var comment = (from c in svc.BlogCommentTable
                               where c.PartitionKey == partitionkey
                                && c.RowKey == rowkey select c).FirstOrDefault();
                if (comment != null)
                {
                    var akismetComment = GetAkismetComment(comment);
                    // A null akismetComment means the comment is for a non-existent
                    // blog entry; treat that as spam without calling the service.
                    if (akismetComment == null || akismet.CommentCheck(akismetComment))
                    {
                        comment.IsSpam = true;
                        svc.UpdateObject(comment);
                        svc.SaveChanges();
                    }
                }
                // Delete only after processing so the message isn't lost on failure.
                q.DeleteMessage(msg);
            }
            else
            {
                // Queue is empty; wait a second before polling again.
                Thread.Sleep(1000);
            }
        }
    }
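One thing the loop above leaves open is what happens when the spam service call throws, for example during TypePad AntiSpam downtime.  A possible approach, sketched here under my own assumptions, is to wrap the check so the caller can tell “not spam” apart from “could not check” and skip the delete in the second case; a message that is never deleted becomes visible on the queue again after its visibility timeout, so it gets retried later.  SpamCheckGuard and TryCheck are hypothetical names, and the check is passed in as a plain delegate so the sketch does not depend on the Akismet library.

```csharp
using System;

// Hypothetical wrapper around a spam check that may fail.
public static class SpamCheckGuard
{
    // Runs the check.  Returns true if the check completed (isSpam holds the
    // verdict) and false if it threw, in which case isSpam is left as false
    // and the caller should leave the queue message in place for a retry.
    public static bool TryCheck(Func<bool> check, out bool isSpam)
    {
        isSpam = false;
        try
        {
            isSpam = check();
            return true;
        }
        catch (Exception)
        {
            // Swallowed on purpose in this sketch; a real worker would log it.
            return false;
        }
    }
}
```

In the worker, the call might look like SpamCheckGuard.TryCheck(() => akismet.CommentCheck(akismetComment), out isSpam), with q.DeleteMessage(msg) skipped whenever it returns false.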

Try it!

You can test this out yourself.  Post a comment with the author “viagra-test-123” and watch it disappear within a couple of seconds.  (This string is hard-coded in Akismet and TypePad AntiSpam to be a spam indicator.)