Tue, 29 Jun 2010

Pivot, OData, and Windows Azure: Visual Netflix Browsing

The PivotViewer Silverlight control shipped this morning, which means you can now embed a Pivot collection (with great UI) directly in a web page. Pivot is fantastic for sorting, filtering, and browsing large numbers of items.

I’ve put together my own example of using the new PivotViewer control at http://netflixpivot.cloudapp.net. It lets you browse the top ~3,000 movies that Netflix has available to stream online. I really encourage you to click through to the demo… it’s a fantastic way to find a movie to watch.

Technical Overview

The demo is built on Windows Azure and consists of a web role (which serves the web page itself), a worker role (which creates the Pivot collection once every hour or so), and blob storage, which hosts the collection and the Silverlight control (all behind the Windows Azure CDN). The data comes from Netflix’s OData feed.

I only had to write about 500 lines of code to make this all happen, and I suspect that number would go down if I used the Pauthor library (which I didn’t have access to when I wrote this demo).

Creating the Pivot Collection

The Pivot collection is created by a worker role that only has a single instance. It takes more than an hour to process the latest Netflix feed into the form needed for Pivot. I could have parallelized some of this and spread the load across multiple instances, but the feed changes infrequently, so I’m not in any particular rush to get the work done. Using a single instance makes the code very simple, because everything happens locally on a single disk, but I have also built Pivot collections in the past using a large number of instances.

The collection is created on the local disk in NetflixPivotCreator.cs. The first step is loading all the available titles from the OData feed. To generate the NetflixCatalog class, I just right-clicked on the worker role’s references and added a service reference to http://odata.netflix.com/Catalog.

private IEnumerable<Title> GetTopInstantWatchTitles(int howMany)
{
    var context = new NetflixCatalog(new Uri("http://odata.netflix.com/Catalog"));
    DataServiceQueryContinuation<Title> token = null;
    // highest-rated movies that are available to stream
    var response = ((from title in context.Titles
                     where title.Instant.Available && title.Type == "Movie"
                     orderby title.AverageRating descending select title)
                    as DataServiceQuery<Title>)
                   .Expand("Genres,Cast,Directors")
                   .Execute() as QueryOperationResponse<Title>;
    int count = 0;
    var ids = new HashSet<string>(); // skip titles that show up on more than one page
    do
    {
        // after the first page, follow the continuation token to the next one
        if (token != null)
        {
            response = context.Execute<Title>(token);
        }
        foreach (var title in response)
        {
            if (ids.Add(title.Id))
            {
                if (count < howMany)
                {
                    yield return title;
                }
                count++;
            }
        }
        token = response.GetContinuation();
    }
    while (token != null && count < howMany);
}

The next step is to download each title’s box art and create a Deep Zoom image from it. This is a simplified version of that code:

Parallel.ForEach(GetTopInstantWatchTitles(3000),
    new ParallelOptions { MaxDegreeOfParallelism = 16 },
    (title) =>
{
    var boxArtUrl = title.BoxArt.HighDefinitionUrl ?? title.BoxArt.LargeUrl;
    var imagePath = string.Format(@"{0}\images\{1}.jpg", outputDirectory, title.Id.ToHex());
    new WebClient().DownloadFile(boxArtUrl, imagePath);
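    // use the same hex-encoded ID as the image path, so CollectionCreator can find this file later
    new ImageCreator().Create(imagePath, string.Format(@"{0}\output\{1}.xml", outputDirectory, title.Id.ToHex()));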
});

Note the use of the Task Parallel Library, which is an awesome way to make multi-threaded programming easy.

From there, it takes just one more statement to create the full Deep Zoom collection:

new CollectionCreator().Create(
    titles.Select(t => string.Format(@"{0}\output\{1}.xml", outputDirectory, t.Id.ToHex())).ToList(),
    string.Format(@"{0}\output\collection-{1}.dzc", outputDirectory, suffix));

At this point, I’m ready to create the actual Pivot collection (a .cxml file that contains all the details about the movies). Check out the source code in the method CreateCxml to see how this is done. (It’s just XML generation, which would probably be much simpler with the Pauthor library.)
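
For a rough idea of what that kind of XML generation involves, here is a minimal sketch using LINQ to XML. This is not the CreateCxml method from the download: the MovieItem class, the CxmlSketch.Build helper, and the single "Genre" facet are all made up for illustration, and the element and attribute names follow the public CXML schema as best I understand it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the fields pulled from the Netflix feed.
public class MovieItem
{
    public string Name { get; set; }
    public string Url { get; set; }
    public string Description { get; set; }
    public IList<string> Genres { get; set; }
}

public static class CxmlSketch
{
    // Namespace for Pivot collection XML (assumed from the public CXML schema).
    private static readonly XNamespace Ns =
        "http://schemas.microsoft.com/collection/metadata/2009";

    public static XDocument Build(string collectionName, string dzcFileName, IList<MovieItem> movies)
    {
        // One <Item> per movie; Img="#n" points at the nth image in the Deep Zoom collection.
        var items = movies.Select((m, i) =>
            new XElement(Ns + "Item",
                new XAttribute("Id", i),
                new XAttribute("Img", "#" + i),
                new XAttribute("Name", m.Name),
                new XAttribute("Href", m.Url),
                new XElement(Ns + "Description", m.Description),
                new XElement(Ns + "Facets",
                    new XElement(Ns + "Facet",
                        new XAttribute("Name", "Genre"),
                        m.Genres.Select(g =>
                            new XElement(Ns + "String", new XAttribute("Value", g)))))));

        return new XDocument(
            new XElement(Ns + "Collection",
                new XAttribute("Name", collectionName),
                new XAttribute("SchemaVersion", "1.0"),
                // Every facet used by an item is declared once up front.
                new XElement(Ns + "FacetCategories",
                    new XElement(Ns + "FacetCategory",
                        new XAttribute("Name", "Genre"),
                        new XAttribute("Type", "String"))),
                new XElement(Ns + "Items",
                    new XAttribute("ImgBase", dzcFileName),
                    items)));
    }
}
```

Calling CxmlSketch.Build("Netflix", "collection-123.dzc", movies).Save(path) would write the file to disk. The order of the items has to match the order of the images passed to CollectionCreator, since each Img attribute is just an index into the .dzc file.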

Storing the Collection in Blob Storage

Once the collection has been created, the worker role uploads it to blob storage, using some rather mundane code. I’m including it here because it demonstrates a few important details: parallelizing uploads for performance, setting the correct content type on blobs, and setting the cache control header when using the CDN. Note also that the main .cxml file is uploaded last, to ensure that it’s not served to users before all the supporting files have been uploaded.

private void UploadDirectoryRecursive(string path, CloudBlobContainer container)
{
    string cxmlPath = null;

    // use 16 threads to upload
    Parallel.ForEach(EnumerateDirectoryRecursive(path),
        new ParallelOptions { MaxDegreeOfParallelism = 16 },
        (file) =>
    {
        // save collection-#####.cxml for last
        if (Path.GetFileName(file).StartsWith("collection-") && Path.GetExtension(file) == ".cxml")
        {
            cxmlPath = file;
        }
        else
        {
            // upload each file, using the relative path as a blob name
            UploadFile(file, container.GetBlobReference(Path.GetFullPath(file).Substring(path.Length)));
        }
    });

    // finish up with the cxml itself
    if (cxmlPath != null)
    {
        UploadFile(cxmlPath, container.GetBlobReference(Path.GetFullPath(cxmlPath).Substring(path.Length)));
    }
}

private IEnumerable<string> EnumerateDirectoryRecursive(string root)
{
    foreach (var file in Directory.GetFiles(root))
        yield return file;
    foreach (var subdir in Directory.GetDirectories(root))
        foreach (var file in EnumerateDirectoryRecursive(subdir))
            yield return file;
}

private void UploadFile(string filename, CloudBlob blob)
{
    var extension = Path.GetExtension(filename).ToLower();
    if (extension == ".cxml")
    {
        // cache CXML for 30 minutes
        blob.Properties.CacheControl = "max-age=1800";
    }
    else
    {
        // cache everything else (images) for 2 hours
        blob.Properties.CacheControl = "max-age=7200";
    }
    switch (extension)
    {
        case ".xml":
        case ".cxml":
        case ".dzc":
            blob.Properties.ContentType = "application/xml";
            break;
        case ".jpg":
            blob.Properties.ContentType = "image/jpeg";
            break;
    }
    blob.UploadFile(filename);
}

Serving the Collection

Once the collection is done, there’s very little left to do. I subclassed PivotViewer to handle users clicking through to the movie listing on Netflix (either by clicking “View on Netflix” or by double-clicking an item).

public class NetflixPivotControl : PivotViewer
{
    public NetflixPivotControl()
    {
        ItemActionExecuted += new EventHandler<ItemActionEventArgs>(NetflixPivotViewer_ItemActionExecuted);
        ItemDoubleClicked += new EventHandler<ItemEventArgs>(NetflixPivotViewer_ItemDoubleClicked);
    }

    private void BrowseTo(string itemId)
    {
        HtmlPage.Window.Navigate(new Uri(GetItem(itemId).Href));
    }

    private void NetflixPivotViewer_ItemDoubleClicked(object sender, ItemEventArgs e)
    {
        BrowseTo(e.ItemId);
    }

    private void NetflixPivotViewer_ItemActionExecuted(object sender, ItemActionEventArgs e)
    {
        BrowseTo(e.ItemId);
    }

    protected override List<CustomAction> GetCustomActionsForItem(string itemId)
    {
        var list = new List<CustomAction>();
        list.Add(new CustomAction("View on Netflix", null, "View this movie at Netflix", "view"));
        return list;
    }
}

Finally, I wrote an ASP.NET MVC web role that serves up a web page with the Silverlight control embedded. The actual Silverlight application (.xap file) is stored in blob storage just like the rest of the content, so the interesting part of the controller is constructing the proper URL to the CDN version of the blob.

We have to be careful not to get a mismatched collection due to CDN caching (for example, an updated collection with new movies mismatched with cached Deep Zoom images). To avoid this situation, every time a new collection is created, all files involved are suffixed with a reversed timestamp. The ASP.NET MVC controller below references the latest .cxml file (which in turn references the matching Deep Zoom images).

private Uri GetBlobOrCdnUri(CloudBlob blob, string cdnHost)
{
    // always use HTTP to avoid Silverlight cross-protocol issues
    var ub = new UriBuilder(blob.Uri)
    {
        Scheme = "http",
        Port = 80
    };
    if (!string.IsNullOrEmpty(cdnHost))
    {
        ub.Host = cdnHost;
    }
    return ub.Uri;
}

public ActionResult Index()
{
    var blobs = CloudStorageAccount.Parse(RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"))
        .CreateCloudBlobClient();
    var cdnHost = RoleEnvironment.GetConfigurationSettingValue("CdnHost");

    var controlBlob = blobs.GetBlobReference("control/NetflixPivotViewer.xap");
    var collectionBlob = blobs.ListBlobsWithPrefix("collection/collection-").OfType<CloudBlob>()
        .Where(b => b.Uri.AbsolutePath.EndsWith(".cxml")).First();

    ViewData["xapUrl"] = GetBlobOrCdnUri(controlBlob, cdnHost).AbsoluteUri;
    ViewData["collectionUrl"] = GetBlobOrCdnUri(collectionBlob, cdnHost).AbsoluteUri;
    return View();
}
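
The reversed timestamp is what lets the controller grab the newest collection with a simple First(): blob listings come back in lexicographic order by name, so a suffix that gets smaller as time moves forward puts the latest .cxml at the top of the list. Here is a minimal sketch of one way to build such a suffix. The ReversedTimestamp class and the 19-digit format are assumptions for illustration, not necessarily what the downloadable code uses.

```csharp
using System;
using System.Globalization;

public static class ReversedTimestamp
{
    // Later times produce lexicographically smaller strings, so the newest
    // name comes first in an ordinal sort.
    public static string Suffix(DateTime utcNow)
    {
        long reversed = DateTime.MaxValue.Ticks - utcNow.Ticks;
        // Zero-pad to a fixed width so string order matches numeric order.
        return reversed.ToString("d19", CultureInfo.InvariantCulture);
    }
}
```

A blob name would then look something like "collection/collection-" + ReversedTimestamp.Suffix(DateTime.UtcNow) + ".cxml", with the same suffix applied to the matching .dzc file so that a cached copy of an old collection never points at new images.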

Download the Code

You’ve now seen nearly all of the code involved, but you can download the full Visual Studio 2010 solution at http://cdn.blog.smarx.com/files/NetflixPivot_source_updated3.zip. [UPDATE 11/09/2011] Up-to-date code is now available at https://github.com/smarx/NetflixPivot.

If you want to run it, you’ll also need:

And note that the collection takes quite some time to create, so expect to run this for at least an hour before you can see anything.

[UPDATE 2:58pm] The first revision of this code had a couple bugs, notably around the use of the CDN. (I originally didn’t create new blob names for each update to the collection, so mismatches due to caching were possible.) I’ve updated a couple code snippets in the text of this post, and I’ve posted a new version of the source code.

[UPDATE 6/30/2010 4:28pm] I made another small revision to the code to make sure the .cxml file is uploaded last. I updated one code snippet above, and I’ve posted a new version of the source code.

[UPDATE 8/11/2010] I was missing a line in the zip of the source code (my fault for editing without compiling). Fixed. Thanks Ivan!

[UPDATE 11/09/2011] The source code has now been moved to GitHub: https://github.com/smarx/NetflixPivot. I’ll keep it up-to-date there.