Keeping Reputation Data Consistent in Samplestack

February 26, 2015 Data & AI, MarkLogic

In designing Samplestack, a sample MarkLogic application that provides search and updates across Question & Answer content, our team wanted to demonstrate how the database’s built-in capabilities enhance the application developer’s experience. One of the differentiating features of MarkLogic’s enterprise NoSQL database is its ACID transactions, and more specifically its support for multi-document, multi-statement transactions.

It was a no-brainer that we would look for ways to meet requirements and keep the data consistent through the use of transactions where appropriate.

Once we defined the application requirements, we ended up with a scenario that required the database to successfully execute multi-statement transactions.

  • When an answer is selected as the ‘accepted’ answer, parallel updates are required for the content and the user(s):
    • Update the answer to indicate its status as ‘accepted’
    • Increase the reputation of the user with the ‘accepted’ answer
    • Decrease the reputation of the user with the previously ‘accepted’ answer (if applicable)

But how does this apply to YOUR application? What considerations did we take into account to determine this was the best course of action, and how is it implemented? The Application Developer’s Guide gives a great overview of the mechanics of transactions in MarkLogic, but I’m going to provide a little more context. I’ll walk through how we implemented the scenario above while ensuring that user reputation stayed consistent with the state of the Q&A data.

Note: When I say “we” in describing the implementation, I mean my talented engineering colleague Charles Greer, who provided both the brains and the muscle behind the operation.

Document Data Model

Before diving into transactions, I need to explain how the data is modeled in Samplestack, since that played a large role in determining where the reputation and related updates needed to occur upon answer acceptance.

When setting up our data model we thought about types of information we’d capture:

  • Questions
  • Answers
  • Comments
  • User name
  • User location
  • User votes
  • Votes on questions and answers
  • Answer acceptances
  • User reputation
  • Question metadata

We also thought about the most common types of updates users would be making:

  • Asking Questions
  • Voting
  • Answering questions
  • Accepting answers

And the range of queries (searches) we needed to support for end users:

  • Using keywords/full text search
  • By tag
  • By date
  • By user
  • Whether questions were resolved (had accepted answers)

We wanted to denormalize the data where sensible to enhance searchability, but to keep frequent updates scalable and bounded.

Much of the data could be logically grouped into either “Question & Answer” (QnA) content tracking the thread of a conversation and associated metadata (tags, votes on content) or “User” data with specifics on the user’s activity and profile. Users participate in QnA threads, so the user name appeared in both groupings. Including it in the QnA document provided a way of searching for their content updates. User records allowed us to keep fields that might be more frequently changed (user location, user votes) in a separate document so we wouldn’t have to update every QnA thread where the user participated in the case of a vote or a physical move.

One key decision was to leave user reputation out of the QnA document. Reputation could change constantly (when users had their answers accepted and their content voted on), meaning every document containing a user’s reputation would have to be touched during an update. This could translate into thousands of documents for an active user participating in many QnA threads. We did not have an explicit requirement to search or sort documents by reputation, so we chose to normalize reputation and keep it in the user record only. We still wanted to show reputation alongside user names, but we accomplished that with a transform that joins search results with user reputations. Joining user reputation with QnA documents to display one page of search results cost less than performing a join for sort or search across all results.
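The display-time join described above can be sketched in a few lines. This is a minimal in-memory illustration, not the Samplestack transform itself: the class name `ReputationJoinSketch`, the method `decoratePage`, and the idea of keying reputations by display name are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: joins one page of search results with reputations
// at display time. All names here are hypothetical.
final class ReputationJoinSketch {

    // Builds a display label for each owner on the current page.
    // Only owners on this page are looked up, so the cost is bounded by
    // the page size rather than by the total number of matching documents.
    static List<String> decoratePage(List<String> pageOwners,
                                     Map<String, Integer> reputationByDisplayName) {
        List<String> labels = new ArrayList<>();
        for (String owner : pageOwners) {
            Integer reputation = reputationByDisplayName.get(owner);
            // An owner without a known reputation is shown by name only.
            labels.add(reputation == null ? owner : owner + " (" + reputation + ")");
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<String, Integer> reputations = new HashMap<>();
        reputations.put("JoeUser", 50);
        System.out.println(decoratePage(Arrays.asList("JoeUser", "Michael"), reputations));
    }
}
```

Because reputation lives only in the user record, a reputation change touches one document, and the lookup above is repeated only for the handful of owners shown on a page.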

Let’s look at where we landed with our two record types modeled as JSON documents.

User Record

Key fields used for the “Contributor” role in the application (simplified for this walk-through)

{
  "com.marklogic.samplestack.domain.Contributor": {
    "displayName": "JoeUser",
    "userName": "joe@example.com",
    "id": "cf99542d-f024-4478-a6dc-7e723a51b040",
    "location": "San Francisco",
    "reputation": 50,
    "voteCount": 3
  }
}

Question and Answer document

Basic structure of a QnA thread (simplified for this walk-through)

{
  "originalId": "3587258",
  "title": "Blocking until event",
  "text": "In a web worker, I'm firing off a postMessage() and need to wait for the result before continuing execution. Is there anyway of blocking until the onMessage event occurs, short of busy waiting plus something like Peterson's Algorithm?",
  "owner": {
    "displayName": "Michael"
  },
  "accepted": false,
  "acceptedAnswerId": null,
  "answerCount": 1,
  "answers": [
    {
      "text": "Sounds like you need to break up your script into two parts and fire the second part when the message comes back. That is how any asynchronous call works in the JavaScript world. ",
      "owner": {
        "displayName": "epascarello"
      },
      "id": "soa3588111"
    }
  ]
}

This meant that for our anticipated user updates, there were never more than 3 or 4 documents requiring simultaneous database updates. We chose this limit as it made sense based on our project requirements. The key outcome was that it was a known, constrained set of document updates as a basis for future scale and performance.

Considering Transactions

Given our data model, we knew the updates required as a part of accepting an answer would span multiple documents. But what if there was a system failure? Or another user searched the database while an update was in progress? Without transactions there would exist the potential for a user reputation to be inconsistent with the QnA document denoting the accepted answer.

Q: How do I solve this problem? – Mary

A: Look it up in the documentation. – Joe (→ Accepted!)

Joe User Record

Reputation: 0

?!

We wanted to be production-ready for an enterprise environment and knew that having eventually consistent data would not be good enough. If a failure or another query happened mid-update, we did not want to present an ‘unstable’ state where an answer had been accepted but no one received credit. We wanted to either roll back all updates or complete them all at once.

In the User Interface, when the Question Asker selects ‘accept’…

Q: How do I solve this problem? – Mary

A: Look it up in the documentation. – Joe

Upon click, simultaneous updates to both the QnA and User documents must be made:

QnA Document

“accepted”: true

“acceptedAnswerId”: A1

JoeUser

Reputation: 1

We concluded database transactions allowed us to avoid the risks of system failure or mid-update access by another application to the same dataset. With MarkLogic, we could update multiple documents in a single transaction – keeping the reputation consistent with the QnA data.
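To make the all-or-nothing behavior concrete, here is a minimal in-memory sketch of staged writes. It is not how MarkLogic implements transactions; `StagedStoreSketch`, its `Tx` class, and the keys used below are hypothetical names chosen for illustration. Writes are held in a private buffer and become visible to other readers only when the whole buffer is committed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory store; not MarkLogic's implementation.
final class StagedStoreSketch {
    private Map<String, Object> committed = new HashMap<>();

    // Readers outside a transaction only ever see committed values.
    synchronized Object read(String key) {
        return committed.get(key);
    }

    Tx begin() {
        return new Tx();
    }

    final class Tx {
        private final Map<String, Object> staged = new HashMap<>();
        private boolean open = true;

        void write(String key, Object value) {
            if (!open) throw new IllegalStateException("transaction is closed");
            staged.put(key, value);
        }

        // The transaction sees its own pending writes first.
        Object read(String key) {
            return staged.containsKey(key) ? staged.get(key) : StagedStoreSketch.this.read(key);
        }

        // Publishes every staged write in one step, so no reader can
        // observe the answer as accepted without the reputation change.
        void commit() {
            if (!open) throw new IllegalStateException("transaction is closed");
            synchronized (StagedStoreSketch.this) {
                Map<String, Object> next = new HashMap<>(committed);
                next.putAll(staged);
                committed = next;
            }
            open = false;
        }

        // Discards every staged write; committed data is untouched.
        void rollback() {
            staged.clear();
            open = false;
        }
    }
}
```

In this sketch, a transaction that writes both `"qna.accepted"` and `"joe.reputation"` and then fails before `commit()` leaves neither change visible, which is the property the accept-answer scenario depends on.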

The most common example illustrating the need for transactions is debits and credits. As Samplestack demonstrates, data integrity is not only relevant for financial applications. Situations which demand that data meet all validation rules at any given point in time require consistency. Also keep in mind when designing your data model that normalized data does not become inconsistent. For denormalized data you may need transactions to keep redundant or related data synchronized.

Implementing Multi-Statement Transactions

Samplestack is a three-tiered application based on the Reference Architecture. The Java version of the application primarily uses the Java Client API to manage interactions between the application middle tier and the database, including the multi-statement transactions that update reputation.

Let’s walk through a selection of the Samplestack application code to highlight the key steps in successfully executing a transaction upon answer acceptance. Keep in mind that the following code is specific to the Samplestack application and includes references to private functions defined elsewhere in the codebase, so it is not necessarily cut-and-paste for your application.

1. Open a transaction

Transaction transaction = startTransaction(SAMPLESTACK_CONTRIBUTOR);

2. Perform the required updates

This application uses DocumentPatchBuilder to make the document changes.

//QnA Document: Mark the Question as having an accepted answer, note  
//Answer Id for the accepted answer, and update the last activity date.
patchBuilder.replaceValue("/acceptedAnswerId", answerId);
patchBuilder.replaceFragment("/accepted", true);
patchBuilder.replaceValue("/lastActivityDate", ISO8601Formatter.format(new Date()));

//Reputation handling: Loop through all answers to find the currently 
//and previously accepted answers. Increase reputation by 1 for the 
//user with the current accepted answer. Decrease reputation by 1 for 
//the user who previously had the accepted answer (if applicable).
ArrayNode answers = (ArrayNode) qnaDocument.getJson().get("answers");
for (JsonNode answer : answers) {
  String id = answer.get("id").asText();
  // Decrement only when a previously accepted answer exists
  if (!previousAnsweredId.isMissingNode()
      && id.equals(previousAnsweredId.asText())) {
    adjustReputation(answer.get("owner"), -1, transaction);
  }
  if (id.equals(answerId)) {
    adjustReputation(answer.get("owner"), 1, transaction);
  }
}
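The reputation rule in that loop can also be expressed as a small, self-contained function, which makes the edge cases easier to test. This is a sketch under assumptions rather than Samplestack code: `AcceptDeltaSketch`, `reputationDeltas`, and the map of answer id to owner name are hypothetical, and dropping net-zero entries is a design choice of this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: computes who gains and who loses reputation when
// an answer is accepted. All names here are hypothetical.
final class AcceptDeltaSketch {

    // ownerByAnswerId maps each answer id in the thread to its owner.
    // previousAcceptedId may be null when no answer was accepted before.
    static Map<String, Integer> reputationDeltas(Map<String, String> ownerByAnswerId,
                                                 String previousAcceptedId,
                                                 String newAcceptedId) {
        String newOwner = ownerByAnswerId.get(newAcceptedId);
        if (newOwner == null) {
            throw new IllegalArgumentException("unknown answer id: " + newAcceptedId);
        }
        Map<String, Integer> deltas = new HashMap<>();
        String previousOwner =
            previousAcceptedId == null ? null : ownerByAnswerId.get(previousAcceptedId);
        if (previousOwner != null) {
            // The previously credited user loses the point.
            deltas.merge(previousOwner, -1, Integer::sum);
        }
        // The owner of the newly accepted answer gains a point.
        deltas.merge(newOwner, 1, Integer::sum);
        // Drop users whose changes cancel out, for example when the same
        // answer is accepted again.
        deltas.values().removeIf(delta -> delta == 0);
        return deltas;
    }
}
```

Each entry in the returned map corresponds to one user document to update, which keeps the set of documents in the transaction small and known in advance.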

3. Either rollback or commit

// all or nothing – consistency ftw!
try {
  // Transactional updates
  transaction.commit();
  transaction = null;
  return acceptedDocument;
} finally {
  // A non-null transaction here means the commit did not complete
  if (transaction != null) { transaction.rollback(); }
}
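The commit-or-rollback shape can be factored into a reusable helper. The sketch below uses a hypothetical `Tx` interface with only `commit()` and `rollback()`; it stands in for whatever transaction object your client library provides and is not the MarkLogic Java Client API.

```java
// Illustrative only: a generic wrapper that commits on success and
// rolls back on any failure. Tx and Work are hypothetical interfaces.
final class CommitOrRollbackSketch {

    interface Tx {
        void commit();
        void rollback();
    }

    interface Work<T> {
        T apply(Tx tx);
    }

    static <T> T inTransaction(Tx tx, Work<T> work) {
        boolean committed = false;
        try {
            T result = work.apply(tx);
            tx.commit();
            committed = true;
            return result;
        } finally {
            // Reached on every exit path. Roll back unless the commit
            // completed, including when the work or the commit threw.
            if (!committed) {
                tx.rollback();
            }
        }
    }
}
```

A boolean flag plays the role that setting `transaction = null` plays in the Samplestack excerpt: both record that the commit succeeded so the `finally` block knows whether a rollback is still needed.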

One tricky part is making sure to account for error scenarios and to include the rollback. Remember, too, that because this is a multi-statement transaction, updates will not be visible to others until you commit. The updates will, however, be visible to you in real time during the transaction, for example in search. Part of the benefit of performing the update via MarkLogic is that search and other indexes are updated in real time during a transaction. You’ve made the latest information available while keeping reputation consistent.

Armed with this overview of the design and implementation considerations for multi-document, multi-statement transactions, you should be well on your way to maintaining data consistency in your own applications!

Kasey Alderete