Tue, 25 Nov 2008

Adding a Property (Column) in Windows Azure Tables

As you may already understand, a table in Windows Azure storage stores entities, each of which has a number of properties.  (We call this an “entity store.”)  I stress this terminology over the more familiar “rows” and “columns,” because they’re really not rows and columns.  One way they’re different is that in Windows Azure tables, there’s no fixed schema.  A single table can hold a variety of different types of entities, each with its own set of properties.

If you want to, you can use ADO.NET Data Services with a fixed set of strongly-typed .NET classes to treat Windows Azure tables as though they have a strict schema, and that’s typically what I do.  Occasionally, though, it matters that there’s more flexibility underneath.  One common use of that flexibility is to introduce new properties into an existing table.

Today I added a new property in a table for my blog, and I thought I’d share what I did with you.

The problem

As you may remember, I posted a while ago asking what I should do about spam.  My plan for now is to use TypePad AntiSpam.  It uses the same API as Akismet, and it’s free, which sounded like a good price to me.  When I took a look at the Akismet API, I found a few pieces of data I needed to send to AntiSpam that I wasn’t collecting from comments on my blog.  Specifically, the comment check method requires the IP address of the user, the user agent string, and the HTTP_REFERER header.  Everything else I already had.
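To make the data requirement concrete, here’s a minimal sketch of building the form-encoded body for a comment check.  The class and method names are made up for illustration, and the field names (blog, user_ip, user_agent, referrer, comment_author, comment_author_url, comment_content) are as I recall them from the Akismet API documentation, so verify them before relying on this.  It only builds the string; it doesn’t call the service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the blog's real code.
public static class CommentCheckSketch
{
    // Builds an application/x-www-form-urlencoded body.
    // Field names are assumed from the Akismet API docs; empty values are left out.
    public static string BuildBody(string blogUrl, string userIp, string userAgent,
                                   string referrer, string author, string authorUrl,
                                   string content)
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("blog", blogUrl),
            new KeyValuePair<string, string>("user_ip", userIp),
            new KeyValuePair<string, string>("user_agent", userAgent),
            new KeyValuePair<string, string>("referrer", referrer),
            new KeyValuePair<string, string>("comment_author", author),
            new KeyValuePair<string, string>("comment_author_url", authorUrl),
            new KeyValuePair<string, string>("comment_content", content)
        };

        return string.Join("&", fields
            .Where(f => !string.IsNullOrEmpty(f.Value))
            .Select(f => Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value))
            .ToArray());
    }
}
```

The three values I wasn’t storing (user_ip, user_agent, and referrer) are exactly the ones that can only be captured at the time the comment is posted, which is why they need to live on the entity.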

I also don’t want to actually delete the spam.  I just want to mark it as such and stop displaying it.  That means I need an IsSpam boolean property to keep track of whether or not a comment should be displayed.

In code, I needed to migrate my BlogComment entity from this:


public class BlogComment : TableStorageEntity
{
    public string Author { get; set; }
    public string Url { get; set; }
    public string Body { get; set; }
    public DateTime Posted { get; set; }
    public override string PartitionKey { get; set; }
    public override string RowKey { get { return string.Format("{0:d19}", Posted.Ticks); } set { } }

    ...

to this:

public class BlogComment : TableStorageEntity
{
    public string Author { get; set; }
    public string Url { get; set; }
    public string Body { get; set; }
    public DateTime Posted { get; set; }
    public override string PartitionKey { get; set; }
    public override string RowKey { get { return string.Format("{0:d19}", Posted.Ticks); } set { } }
    public string UserIp { get; set; }
    public string UserAgent { get; set; }
    public string Referrer { get; set; }
    public bool IsSpam { get; set; }

    ...

The solution

Actually, I already gave you the solution.  I just changed the class.  I didn’t do anything at all to the table or interact with storage in any way.  This works because Windows Azure tables don’t have a strict schema.  When the ADO.NET Data Services client library retrieves blog comments, it creates a new BlogComment object for each one and sets its properties from whatever came back from the table.  That means that if a property doesn’t get returned from the table, it just doesn’t get set.  It keeps its default value: null for the three new strings, and false for IsSpam.
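Here’s a rough sketch of that behavior.  It is not the client library’s code, just an illustration: a dictionary stands in for the properties returned from the table, and CommentSketch and MaterializerSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for BlogComment (keys omitted).
public class CommentSketch
{
    public string Author { get; set; }
    public string Body { get; set; }
    public string UserIp { get; set; }
    public bool IsSpam { get; set; }
}

public static class MaterializerSketch
{
    // Creates a T and sets only the properties the stored entity actually has.
    public static T Materialize<T>(IDictionary<string, object> stored)
        where T : class, new()
    {
        var entity = new T();
        foreach (var prop in typeof(T).GetProperties())
        {
            object value;
            if (prop.CanWrite && stored.TryGetValue(prop.Name, out value))
            {
                prop.SetValue(entity, value, null);
            }
            // Nothing stored under this name: the property keeps its .NET default.
        }
        return entity;
    }
}
```

Materializing a comment that was saved before the change, with only Author and Body stored, yields an object whose UserIp is null and whose IsSpam is false.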

Multiple schemas

One interesting (and very useful) side effect of this self-imposed schema via .NET objects is that different applications operating on the same set of data can see different schemas.  During this change to support spam detection, I wrote a simple web application I ran locally to let me manually mark comments as spam.  The interesting code looks like this:

    protected void spam_Command(object sender, CommandEventArgs e)
    {
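        // The command argument carries the entity's keys as "PartitionKey/RowKey".
        var split = e.CommandArgument.ToString().Split('/');
        var partitionkey = split[0];
        var rowkey = split[1];

        // Look up the one comment with those keys.
        var svc = new BlogDataServiceContext();
        var comment = (from c in svc.BlogCommentTable
                       where c.PartitionKey == partitionkey
                          && c.RowKey == rowkey
                       select c).Single();

        // Flag it as spam and write the change back to the table.
        comment.IsSpam = true;
        svc.UpdateObject(comment);
        svc.SaveChanges();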
    }

    protected void Page_PreRender(object sender, EventArgs e)
    {
        commentRepeater.DataSource = new BlogDataServiceContext().BlogCommentTable;
        commentRepeater.DataBind();
    }

In this application, I was using the new class (with the added properties).  Simultaneously, my blog was running with the old class (no new properties).  This is not a problem, because properties returned from the table that have no corresponding property in the class just get ignored.  That means my blog can happily keep running, oblivious to the fact that some comments now have IsSpam set to true.  This ability to work with the same data using multiple schemas simultaneously makes it trivial to add a property in a way that’s both backward- and forward-compatible.
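A small sketch of the two views side by side, again with a dictionary standing in for a stored entity.  OldComment, NewComment, and TwoSchemasSketch are hypothetical names, and this only imitates the behavior described above rather than showing how the client library does it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical "old" view of a comment, as the running blog sees it.
public class OldComment
{
    public string Author { get; set; }
    public string Body { get; set; }
}

// Hypothetical "new" view, as the spam-marking tool sees it.
public class NewComment
{
    public string Author { get; set; }
    public string Body { get; set; }
    public bool IsSpam { get; set; }
}

public static class TwoSchemasSketch
{
    // Copies stored values onto whichever properties the target class declares.
    // Stored values with no matching property are skipped.
    public static T Read<T>(IDictionary<string, object> stored)
        where T : class, new()
    {
        var entity = new T();
        foreach (var pair in stored)
        {
            var prop = typeof(T).GetProperty(pair.Key);
            if (prop != null && prop.CanWrite)
            {
                prop.SetValue(entity, pair.Value, null);
            }
        }
        return entity;
    }
}
```

Reading the same stored entity (Author, Body, and IsSpam = true) as an OldComment succeeds and simply drops IsSpam, while reading it as a NewComment picks the flag up.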

Once I’d manually marked everything I thought was spam, I just updated my blog code to understand the new fields and updated my queries to only bring back entities with IsSpam set to false.

Don’t try this at home (localhost)

One caveat: the local development storage actually does have a fixed schema, because under the covers, it’s using SQL Server to simulate Windows Azure storage.  That means none of the above will work unless you’re running against cloud storage.

What’s next?

I’m now collecting all the data I need to give to TypePad AntiSpam to get help classifying spam (and in the meantime, I have a manual mechanism for removing it).

The next step is to create a worker role that will classify spam asynchronously.  I hope to tackle that in the next few days.