A Data Model and Workflow for Crowd-sourced Web Applications

Published: 4 March, 2012

Introduction

I've been working on this data model and workflow specifically for a scientific data platforms in order to simplify contributions from casual users without sacrificing the platform's overall data integrity. This post describes how to build a moderated workflow that tracks every change made by users to the data model, allowing moderators to guide input into your data platform made by common users.

This model puts all versions of an entity on the same playing field to allow all previous versions to be accessed in the same way as the current version. While this increases query complexity, it makes it possible to do full searches against the database for the most recent entity version at a given point in time. So instead of releasing your database in build cycles, you can release the data in real time while maintaining consistent results for a given query. This is a powerful feature for scientific data platforms, allowing replication of down-steam analyses performed on query results without requiring a complete snapshot of the whole database at the time of analysis. As a result, any specific query will always return the same results, despite the underlying data being in a constant state of revision.

This model operates on entities in a CRUD-like fashion, but instead of performing Update operations on entities, a Create operation is used to make a new version of the entity. In this way, CRUD operations are performed on entity versions rather than entities. This allows us to expose only Create operations to basic users. For example, to delete an entity, a basic user would Create a new version having a status = "deleted" (soft delete), and a moderator would then Update the version revisionStatus = "accepted". This model allows for complete version tracking - including blame-ability for all changes.

What We Need

For a real-time crowd sourced data platform, we need to be able to:

  • store entities with complex relationships
  • store multiple “pending” versions for an entity, including all relationships
  • track which users created each entity version
  • track which user reviewed each entity version
  • fetch the most recent accepted version of an entity, including relationships, since a given time
  • fetch a collection of previous versions of an entity, including relationships, since a given time
  • fetch a collection of entities matching search criteria, including relationships, since a given time
  • fetch a collection of most recently accepted version of entities matching search criteria, including relationships, since a given time

Data Model

For this example, I am going to show a simple model for a platform that allows users to create collections of files. Each Post entity can have many File entities.

PostVersion Entity Properties

Name Type Comment
Id Guid ID for the specific Post version
PostId Guid Global ID for the post – all versions share the same global ID
Status Enum {public, deleted} Indicator of the state of the version (used for soft deletion)
AuthoredAt DateTime When the post version was created
Author User User entity that created this post version
AuthorComment String Comment attached to this post version for reviewer to read
ReviewStatus Enum {pending, accepted, rejected} Indicator of the review status of the post version. Pending versions have not yet been reviewed by a moderator
ReviewedAt DateTime When the post version was reviewed
Reviewer User User entity that reviewed this post version
ReviewerComment String Comment made by reviewer for author
IsCurrent Boolean If the entity is the most recent version – this is a hack to make searching faster for the most current versions
Description String Description of the post
Files FileCollection List of File items related to this post

File Entity Properties

Name Type Comment
Id Guid ID for the File entity
PostVersionId Guid ID for the parent post version
Filename String Filename
Description String Description of the File
Data Binary Binary encoded file data

User Entity Properties

Name Type Comment
Id Guid ID for the User entity
Username String Username
DispalyName String The user's name as they would like it to appear to the program.

In this example, you can see that File entities are subordinate to Post entities. Therefore, File entities do not need all the meta data that Post entities do. However, there are many cases where you might want subordinate entities to be managed in the same way as the parent entity. In order to achieve this functionality, you would need to apply all meta data and work flow processes to the subordinate entities you want to be tracked.

Workflow for Model Transitions

Below are details for the common state transitions of the model.

Creating a New Post

  • Users creates a new PostVersion entity
  • A Moderator updates the new PostVersion entity meta data to accept or reject

Updating a Post

  • User creates a new PostVersion entity with an existing Post ID
  • Moderator updates new PostVersion meta data to accept or reject

Deleting a Post

  • User creates a new PostVersion entity with existing Post ID and sets "status" to "deleted"
  • Moderator updates new PostVersion meta data to accept or reject

Fetching Data

These examples show ORM code for fetching data using this model (it is Doctrine 2 ORM DQL code, which should be familiar to anyone using any Hibernate-like query languages). It should be easy enough to convert it to SQL if needed.

The basic idea is to subset versions based on a target date, then select the max accepted version (this gives you the current version at that time point), then filter across current versions based on your selection criteria. This is shown in the figure below (the two subset steps can be combined into one SQL sub-query):

All queries against this model can be thought of as simple sub-setting of entity space, each version being a unique element.

Fetching All Versions of Post

    SELECT * FROM PostVersion JOIN Author, Reviewer, File
    WHERE PostVersion.AuthoredAt < "2011-05-05"
    ORDER BY PostVersion.AuthoredAt DESC
    

(Note: this returns all versions, including rejected versions and soft-deleted versions)

Fetching the Most Recent Accepted Version of each Post

    SELECT PostVersion FROM Application\Model\PostVersion PostVersion
    JOIN PostVersion.Author Author
    JOIN PostVersion.Reviewer Reviewer
    JOIN PostVersion.FilePostVersion.File File
    WHERE PostVersion.ReviewStatus = "accepted"
       AND PostVersion.PostId = 3
    ORDER BY PostVersion.ReviewedAt DESC
    LIMIT 1
    

Fetching the Most Recently Accepted Posts Matching Search

    SELECT PostVersion
    FROM Application\Model\PostVersion PostVersion
    JOIN PostVersion.Author Author
    JOIN PostVersion.Reviewer Reviewer
    JOIN PostVersion.FilePostVersion.File File
    WHERE PostVersion.ReviewedAt = (
            SELECT MAX(PostVersionSubset.ReviewedAt)
            FROM Application\Model\PostVersion PostVersionSubset
            WHERE PostVersion.Id = PostVersionSubset.Id
                AND PostVersionSubset.ReviewStatus = "accepted"
                AND PostVersionSubset.ReviewedAt &lt; "2011-05-05"
            GROUP BY PostVersionSubset.Id
        )
        AND PostVersion.Status = "public"
        AND PostVersion.ReviewStatus = "accepted"
        AND PostVersion.Author.Username = "brady"
        AND (
            PostVersion.File.Description LIKE "%searchTerm%"
            OR PostVersion.File.Description LIKE "%searchTerm"
        )
    ORDER BY PostVersion.AuthoredAt DESC
    

As you can see, the first WHERE clause selects all PostVersion entities reviewed since the given time point ("2011-05-05"). Then additional search criteria can be added to the top level query to search within the subset. Though the sub-query has a performance hit, it allows queries across versions at a particular time point. Since the query results are not expected to change when new data is added, they can be cached without expiration.

Speeding up Fetches on Globally Most Recently Accepted Posts

For fetches of the most recent versions ("since" datetime = now), we can add an indexed field to find the most recent version without a sub-query. This works great for when you don't need the most recent versions at a particular time point and just want to fetch the most-recent versions of all current posts.

    SELECT PostVersion
    FROM Application\Model\PostVersion PostVersion
    JOIN PostVersion.Author Author
    JOIN PostVersion.Reviewer Reviewer
    JOIN PostVersion.FilePostVersion.File File
    WHERE PostVersion.IsCurrent = true
        AND PostVersion.Status = "public"
        AND PostVersion.ReviewStatus = "accepted"
        AND PostVersion.Author.Username = "Brady"
        AND (
            PostVersion.File.Description LIKE "%searchTerm%"
            OR PostVersion.File.Description LIKE "%searchTerm%"
        )
    ORDER BY PostVersion.AuthoredAt DESC
    

The IsCurrent field has to be updated for all versions when a new version is accepted for this to work. In most cases, you will probably want fetches to be fast at the expense of slowed inserts.

Mapping the Workflow to a RESTful Interface

We need to be careful to keep PUT and DELETE operations idempotent. That means having a user PUT multiple times should not create multiple new versions of a post. Since updating an resource in this workflow is actually creating a new version, API users should not have access to PUT operations except on version meta data. Similarly, DELETE operations should be replaced by a version POST having the new deletion status.

Method URI Action
GET /posts? List only the latest post versions matching query
POST /posts Create a new post version with a unique PostId
GET /posts/:id/versions? List all versions for this post matching query
POST /posts/:id/versions Create a new post version (this is what should be used to submit an update to an item – NOT PUT)
PUT /posts/:id/versions/:vid Update meta data fields (ReviewStatus, etc...). Should give permission denied (403) if user is not a moderator.
GET /posts/versions? List all post versions matching search criteria
POST /posts/versions Create a new post version
GET /posts/versions/:id Get data for a single post version
PUT /posts/versions/:id Update meta data fields (ReviewStatus, etc...). Should give permission denied (403) if user is not a moderator.

Examples:

Fetch all versions of a post made by user "brady":

GET /posts/3/versions?Author.Username=brady

Fetch all versions of a particular post made by user "brady":

GET /posts-versions?PostId=3&Author.Username=brady

Fetch all post versions reviewed by user "brady" across all posts:

GET /posts/versions?Author.Username=brady

Conclusion

If you want to add a layer of moderation to your data API in order to open up your database to multiple users, this model and workflow should make things pretty easy. It reduces all "edit" operations to "create version" operations, and enables you to fetch the most recent accepted versions at any time point. I hope this set-up helps you to open your databases up to the masses.

So far this has worked well for me, but I'm sure there could be better ways of doing this kind of thing out there - let me know if you have one!

Comments