Introduction to the UML-to-Entity Services Toolkit: UML Modeling with MarkLogic’s Entity Services

November 27, 2017 Data & AI, MarkLogic

In my previous blog post, I explained why upfront high-level modeling is essential. I recommend using the Unified Modeling Language (UML), as it helps to visually depict your model for greater clarity. UML can feed into MarkLogic’s Entity Services, which is a shockingly low-effort route to model-driven data management in MarkLogic. When I first played with it, I was surprised by how little input I had to provide to reap a treasure chest of outputs.

My UML-to-Entity-Services toolkit transforms a UML data model into a MarkLogic Entity Services model. To use it, you’ll need MarkLogic 9.0-3 or later plus your preferred third-party UML modeling tool. The UML tool you select must support UML 2.x, must be able to export UML models to XML Metadata Interchange (XMI) 2.1, must be able to import UML profiles, and must support stereotypes and tagged values. In the suite of examples featured in the toolkit, I used two such tools: MagicDraw 18.5 and Eclipse Modeling Framework 2.x. The toolkit includes several UML examples that demonstrate the model-driven workflow process.

Let’s use one of these examples – the movie model – to walk through the process. For comparison, refer to the toolkit’s documentation of this example, which both describes the recipe and provides the finished product. In this post, we’ll follow along with the recipe.

Designing the UML Model

The first step is to open your favorite UML editor, create a new UML model, and import into the model the toolkit’s UML profile for MarkLogic Entity Services. The profile is an XMI file. Follow the approach specific to your UML tool to add this profile to your model.

Next draw the movie model. You will be composing a UML class diagram consisting of classes, their attributes, and class relationships. I used MagicDraw, but any UML tool that meets the requirements will suffice. Here is what the final model looks like:

Figure 1: UML class diagram of movie data

At a high level, the model describes two main types of data, movies and contributors. Contributors are of two types: persons (actors, directors, writers, etc) and companies (production companies, special effects companies, etc). There is a many-to-many relationship between movie and contributor, and we express that relationship as role. A contributor performs a role (or perhaps several roles) in a movie; the set of roles for a contributor is that contributor’s filmography. A movie’s cast is the set of roles — director roles, actor roles, writer roles, production company roles, and others — in that movie. A movie also has a set of parental certificates, i.e. the parental ratings per country for the movie. A movie and a person contributor can have user documents. These are user-contributed posts, such as actor biographies and movie plot summaries.

The model has three levels of structure. At the highest level is the package, which describes the overall model and maps to the Entity Services notion of a model. In MagicDraw, the package details are configured in a separate dialog window, shown in Figure 2. We name our package MovieModel and tag it with two properties that are needed by Entity Services: baseUri and version. These tags belong to the esModel stereotype from the custom profile.

Figure 2: Package details with two properties tagged

At the next level are classes. Our model has seven classes: Movie, MovieContributor, PersonContributor, CompanyContributor, UserDocument, ParentalCertificate, and Role. These map to Entity Services entities. Notice that two of the classes are stereotyped:

  • MovieContributor has the stereotype exclude, which instructs the toolkit not to include this class as an entity in the Entity Services model descriptor. It serves a purpose in the UML model, but we don’t need it in Entity Services. More on this shortly.
  • UserDocument has the stereotype xImplHints with the tag reminders. The reminder is “If docText is larger than 1M, store in a separate text document.” The designer is offering implementation advice that if the user document’s text is large, store that text in a separate .txt document rather than in the docText attribute of UserDocument. The toolkit records this fact as a triple in the extended Entity Services model. The toolkit also generates a comment that can be pasted into the model’s XQuery conversion module. The developer responsible for ingestion of user documents sees and is guided by that comment in the module.

Each class contains one or more attributes, which map to Entity Services properties. An attribute has a name, a type, multiplicity, and can be stereotyped with Entity Services configuration. Here are a few examples from the class Movie:

  • movieId is a String of multiplicity [1], indicating that it is a required attribute, with exactly one value expected. We stereotype it as PK to indicate it is the primary key of the class.
  • seriesId is a String of multiplicity [0..1], indicating that it is an optional attribute.
  • countries is a String of multiplicity [0..*], indicating that it is an array of Strings.
  • imdbUserRating is a Real of multiplicity [1], indicating that it is a required floating point value. We stereotype it as rangeIndex; Entity Services will generate an element range index for it, enabling us to run range queries against it.
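
To make the attribute mapping concrete, here is a minimal Python sketch of how a UML attribute's type, multiplicity, and stereotypes could translate into an Entity Services property. It is an illustration only, not the toolkit's actual transformation code (which runs inside MarkLogic). The UmlAttribute class and the function names are hypothetical, and only the String-to-string and Real-to-float type mappings are confirmed by the descriptor excerpt later in this post.

```python
from dataclasses import dataclass, field

# Assumed type mapping; only these two pairs appear in this post's descriptor.
UML_TO_ES_TYPE = {"String": "string", "Real": "float"}


@dataclass
class UmlAttribute:
    """Hypothetical stand-in for one attribute of a UML class."""
    name: str
    uml_type: str
    multiplicity: str                      # e.g. "1", "0..1", "0..*"
    stereotypes: set = field(default_factory=set)


def to_es_property(attr):
    """Return (property definition, is_required) for one UML attribute."""
    # UML allows "*" as shorthand for "0..*".
    mult = "0..*" if attr.multiplicity == "*" else attr.multiplicity
    lower, _, upper = mult.partition("..")
    upper = upper or lower                 # "1" means lower == upper == 1
    scalar = {"datatype": UML_TO_ES_TYPE[attr.uml_type]}
    if upper == "*":
        definition = {"datatype": "array", "items": scalar}
    else:
        definition = scalar
    return definition, lower != "0"


def to_es_entity(attrs):
    """Build an entity definition from a list of UML attributes."""
    entity = {"properties": {}, "required": [], "elementRangeIndex": []}
    for attr in attrs:
        definition, required = to_es_property(attr)
        entity["properties"][attr.name] = definition
        if required:
            entity["required"].append(attr.name)
        if "PK" in attr.stereotypes:
            entity["primaryKey"] = attr.name
        if "rangeIndex" in attr.stereotypes:
            entity["elementRangeIndex"].append(attr.name)
    return entity


movie = to_es_entity([
    UmlAttribute("movieId", "String", "1", {"PK"}),
    UmlAttribute("seriesId", "String", "0..1"),
    UmlAttribute("countries", "String", "0..*"),
    UmlAttribute("imdbUserRating", "Real", "1", {"rangeIndex"}),
])
```

In this sketch a lower bound of 1 puts the attribute in the required list, an upper bound of * turns it into an array, and the PK and rangeIndex stereotypes populate primaryKey and elementRangeIndex.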

Especially interesting in this model are the class relationships:

  • We model the many-to-many relationship between Movie and MovieContributor as a bidirectional association bearing an association class called Role. Let’s understand this in two ways: conceptually and physically. Conceptually, a movie’s cast is the set of its contributors, and for each contributor Role provides further information about the contribution: roleType (e.g., actor, writer, director), roleNames (e.g., the character or characters the contributor played in the movie), and contribClass (person or company). Conversely, a contributor’s filmography is the set of its movies, each described by Role. Physically, the XML document in MarkLogic that represents the movie contains the element cast, which is a list of roles; each role contains roleType, roleNames, contribClass, plus the contribId of the contributor. Conversely, the XML document that represents the contributor contains the element filmography, which is a list of roles, each specifying roleType, roleNames, contribClass, plus the movieId of the movie. Thus in the physical representation, a movie contains its cast, and a contributor contains its filmography. You might have noticed the FK stereotype on cast. It means that a movie’s cast contains its roles, and each role refers by key to the contributor. Had we omitted the FK, the roles in the movie’s cast would further contain the contributor itself! There is also an FK on filmography.
  • MovieContributor is a generalization of PersonContributor and CompanyContributor. Put differently, PersonContributor and CompanyContributor inherit the attributes, including the primary key, contribId, of MovieContributor. Significantly, each also inherits the Role association with Movie; a person has roles, as does a company. As mentioned above, we exclude MovieContributor from the Entity Services model. Thus, in MarkLogic, we will expect to have instances of person and company contributors, but the base class will never be instantiated; it exists solely to model the inherited attributes of its subclasses.
  • PersonContributor and Movie compose UserDocument. Conceptually, this means that a person contributor or movie contains its user documents. But when we map this relationship physically to MarkLogic, we want UserDocument to be its own XML document, not a child of PersonContributor or Movie. We want to maintain user documents separately and, in a later phase, will introduce the notion of User and link a user to his or her documents. Thus, we use the exclude stereotype to exclude Movie’s and PersonContributor’s containment of UserDocument. UserDocument refers to PersonContributor and Movie by reference; notice the FK stereotype. In the later phase, UserDocument will also refer to its user/author by reference.
  • Movie composes ParentalCertificate. This means that a parental certificate is part of the movie record and could not exist without the movie. Thinking ahead, we foresee ParentalCertificate residing in the MarkLogic database, not as its own document, but as a subdocument of Movie.
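
The generalization and exclude behavior described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the toolkit's implementation: the CLASSES dictionary is a hypothetical in-memory stand-in for the UML model, and every attribute name other than contribId is invented for the example.

```python
# Hypothetical in-memory stand-in for part of the UML class model.
# Attribute names other than contribId are invented for illustration.
CLASSES = {
    "MovieContributor": {
        "parent": None,
        "exclude": True,          # stereotyped as exclude in the UML model
        "attributes": ["contribId", "contribName"],
    },
    "PersonContributor": {
        "parent": "MovieContributor",
        "exclude": False,
        "attributes": ["dateOfBirth"],
    },
    "CompanyContributor": {
        "parent": "MovieContributor",
        "exclude": False,
        "attributes": ["countryOfOrigin"],
    },
}


def inherited_attributes(name, classes):
    """Collect attributes from the root of the hierarchy down to `name`."""
    cls = classes[name]
    parent = cls["parent"]
    base = inherited_attributes(parent, classes) if parent else []
    return base + cls["attributes"]


def to_entities(classes):
    """Flatten inheritance, then drop classes stereotyped as exclude."""
    return {
        name: inherited_attributes(name, classes)
        for name, cls in classes.items()
        if not cls["exclude"]
    }


entities = to_entities(CLASSES)
```

The excluded base class contributes its attributes (including the primary key) to each subclass but never appears as an entity in its own right, which matches the intent that MovieContributor is never instantiated in MarkLogic.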

Transforming UML to Entity Services

From the UML tool, export the class diagram to an XMI file. It is now time to transform the XMI to an Entity Services model descriptor. The toolkit provides a gradle-based utility to do this. The basic steps are the following:

  • Create a movie database containing the toolkit’s transformation module. Run gradle includeXMI2ESTransform mlDeploy.
  • Import and transform the movie model: gradle ingestModel.
  • Deploy the model descriptor using the gradle mlgen task. This task generates several artifacts, notably a database index configuration file, an XQuery conversion script, and a TDE template. Examine these artifacts and modify them if necessary. (We discuss the conversion modifications below.)
  • Deploy the artifacts using gradle mlDeployDatabases mlReloadModules mlReloadSchemas.
  • Ingest the sample movie data using gradle ingestMovieData.

The README file in the toolkit explains these steps in detail.

Let’s review the mapping for our movie model. The following code listing is an excerpt of the model descriptor produced by the transformation. (If you compare it to the UML diagram in the previous section, you see how the mapping worked. Refer to the next section for a general reference guide to the mapping.)

{
  "info": {
    "title": "MovieModel", 
    "version": "0.0.1", 
    "baseUri": "http://com.marklogic.es.uml.movie"}, 
  "definitions": {
    "Movie": {
      "properties": {
        "movieId": { "datatype": "string"}, 
        "seriesId": {"datatype": "string"}, 
        "countries": {"datatype": "array", "items": {"datatype": "string"}}, 
        "imdbUserRating": {"datatype": "float"}, 
        "parentalCerts": {"datatype": "array", 
          "items": {"$ref": "#/definitions/ParentalCertificate"}
        }, 
        "cast": {"datatype": "array", 
          "items": {"$ref": "#/definitions/Role"}
        }
      },
      "required": ["movieId", "seriesType", "releaseYear", "runningTime", "imdbUserRating"],
      "primaryKey": "movieId",
      "elementRangeIndex": ["seriesType", "releaseYear", "genres", "runningTime",
        "imdbUserRating"]
    },
    "Role": {
      "properties": {
        "roleType": {"datatype": "string"}, 
        "roleNames": {"datatype": "array", "items": {"datatype": "string"}}, 
        "contribClass": {"datatype": "string"}, 
        "refMovieContributor": {"datatype": "string"}, 
        "refMovie": {"datatype": "string"}
      }, 
      "required": ["roleType", "contribClass"]
    }
  }
}
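
If you want to inspect a descriptor like this outside MarkLogic, a short Python sketch along the following lines can list which properties of an entity point at other entities in the model. The DESCRIPTOR below is a trimmed copy of the excerpt above; the contained_entities helper is my own hypothetical name, not part of Entity Services or the toolkit.

```python
import json

# Trimmed copy of the descriptor excerpt shown above.
DESCRIPTOR = json.loads("""
{
  "info": {"title": "MovieModel", "version": "0.0.1",
           "baseUri": "http://com.marklogic.es.uml.movie"},
  "definitions": {
    "Movie": {
      "properties": {
        "movieId": {"datatype": "string"},
        "countries": {"datatype": "array", "items": {"datatype": "string"}},
        "parentalCerts": {"datatype": "array",
          "items": {"$ref": "#/definitions/ParentalCertificate"}},
        "cast": {"datatype": "array",
          "items": {"$ref": "#/definitions/Role"}}
      },
      "primaryKey": "movieId"
    }
  }
}
""")


def contained_entities(entity):
    """Map property name -> entity name for array items that are local $refs."""
    refs = {}
    for prop, definition in entity["properties"].items():
        ref = definition.get("items", {}).get("$ref", "")
        if ref.startswith("#/definitions/"):
            refs[prop] = ref.rsplit("/", 1)[-1]
    return refs


movie_refs = contained_entities(DESCRIPTOR["definitions"]["Movie"])
```

For the Movie entity this yields parentalCerts pointing at ParentalCertificate and cast pointing at Role, while plain arrays such as countries are ignored.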

The most important artifact that the Entity Services library generates is the conversion module. It is expected that the developer will modify this generated code. We modify the movie conversion module as follows:

  1. We modify the source mappings. The conversion module assumes the source data field names match the property names in the model. We change the source mappings to use the source data field names from our source data.
  2. We modify the conversion of Movie, PersonContributor, and CompanyContributor to find and add roles. In our sample, role data is loaded separately, into its own XML documents. When it comes time to load movie and contributor data, we search MarkLogic for matching roles and add them to the movie or contributor XML.
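
The real conversion module is XQuery, but the logic of the two modifications can be sketched in Python. Treat this purely as an illustration: the source field names (id, rating) are hypothetical, the roles list stands in for a search of role documents in MarkLogic, and convert_movie is not a function from the toolkit.

```python
# Hypothetical source field names mapped to model property names.
SOURCE_TO_MODEL = {"id": "movieId", "rating": "imdbUserRating"}


def convert_movie(source, roles):
    """Rename source fields to model properties, then attach matching roles."""
    # Modification 1: use the source data's own field names.
    movie = {
        SOURCE_TO_MODEL[key]: value
        for key, value in source.items()
        if key in SOURCE_TO_MODEL
    }
    # Modification 2: find previously loaded roles that refer to this movie.
    movie["cast"] = [r for r in roles if r["refMovie"] == movie["movieId"]]
    return movie


# Stand-in for role data that was loaded separately.
roles = [
    {"roleType": "actor", "refMovieContributor": "Billy Wonka",
     "refMovie": "Gut Fellas"},
    {"roleType": "director", "refMovieContributor": "Someone Else",
     "refMovie": "Another Movie"},
]

movie = convert_movie({"id": "Gut Fellas", "rating": 1.8}, roles)
```

The same pattern applies to contributors, with the match made on refMovieContributor and the result stored in filmography.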

The modified conversion module is here.

With these changes in place, we proceed to ingest data. The gradle toolkit provides sample movie data. It shows how to use the gradle MarkLogic Content Pump (MLCP) plugin to ingest data from JSON files to MarkLogic. We use our conversion module as an MLCP transform, mapping the JSON source files to XML envelopes whose structure follows that of the model.

Exploring the Data

We conclude by running a few queries to explore the ingested movie data to verify that it meets the design goals of our UML model. We use the Query Console workspace.

The “Movie Parentals, Cast, Docs” tab has a query to retrieve the details of a movie, its parental certificates, its roles (i.e., cast), and its user documents. Notice that the parental certificates and roles are contained within the movie. For the user documents, we use cts:search() to find user documents that refer to the movie.

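(: Load the movie document by URI. :)
let $movie := fn:doc("/xmi2es/imdb/movie/movies1.xml")
(: Find user documents whose movieDoc element refers to this movie's ID. :)
let $docs := cts:search(fn:doc(), cts:and-query((
  cts:collection-query("movieDoc"),
  cts:element-value-query(xs:QName("movieDoc"), $movie//movieId)
)))
(: Return the movie, its contained certificates and roles, and the referring documents. :)
return ("Movie", $movie, "Parental", $movie//ParentalCertificate, "Cast", $movie//Role, "Docs", $docs)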

Here is an excerpt of the output:

Movie:

<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:instance>
    <es:info>
      <es:title>Movie</es:title>
      <es:version>0.0.1</es:version>
    </es:info>
    <Movie>
      <movieId>Gut Fellas</movieId>
      <seriesType>feature</seriesType>
      <releaseYear>1987</releaseYear>
      <countries datatype="array">USA</countries>
      <countries datatype="array">UK</countries>
      <imdbUserRating>1.8</imdbUserRating>
      <parentalCerts datatype="array">
        <ParentalCertificate>
          <country>Chile</country>
          <currentCertificate>scandalous</currentCertificate>
        </ParentalCertificate>
      </parentalCerts>
      <cast datatype="array">
        <Role>
          <roleType>actor</roleType>
          <roleNames datatype="array">Tony Blair</roleNames>
          <contribClass>person</contribClass>
          <refMovieContributor>Billy Wonka</refMovieContributor>
          <refMovie>Gut Fellas</refMovie>
        </Role>
      </cast>
    </Movie>
  </es:instance>
</es:envelope>

Parental:

<ParentalCertificate xmlns:es="http://marklogic.com/entity-services">
  <country>Chile</country>
  <currentCertificate>scandalous</currentCertificate>
</ParentalCertificate>

Cast:

<Role xmlns:es="http://marklogic.com/entity-services">
  <roleType>actor</roleType>
  <roleNames datatype="array">Tony Blair</roleNames>
  <contribClass>person</contribClass>
  <refMovieContributor>Billy Wonka</refMovieContributor>
  <refMovie>Gut Fellas</refMovie>
</Role>

Docs:

<?xml version="1.0" encoding="UTF-8"?>
<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:instance>
    <es:info>
      <es:title>UserDocument</es:title>
      <es:version>0.0.1</es:version>
    </es:info>
    <UserDocument>
      <docId>92d38bed-275b-4074-9e92-5adcdef175aa</docId>
      <authorId>Happy Cross</authorId>
      <docText>A satire of politics in a post-truth world</docText>
      <docType>plot</docType>
      <docSubType></docSubType>
      <movieDoc>Gut Fellas</movieDoc>
    </UserDocument>
  </es:instance>
</es:envelope>
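
As a rough cross-check of the design goals, the following Python sketch parses trimmed copies of the two envelopes above and confirms the two kinds of relationship: containment (certificates and roles live inside the movie) and reference (the user document points at the movie by key). It uses only the standard library and the element names shown in the output; it does not talk to MarkLogic.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the envelopes shown in the output above.
MOVIE_XML = """<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:instance>
    <Movie>
      <movieId>Gut Fellas</movieId>
      <parentalCerts datatype="array">
        <ParentalCertificate><country>Chile</country></ParentalCertificate>
      </parentalCerts>
      <cast datatype="array">
        <Role><refMovieContributor>Billy Wonka</refMovieContributor></Role>
      </cast>
    </Movie>
  </es:instance>
</es:envelope>"""

USER_DOC_XML = """<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:instance>
    <UserDocument>
      <docType>plot</docType>
      <movieDoc>Gut Fellas</movieDoc>
    </UserDocument>
  </es:instance>
</es:envelope>"""

# Entity elements carry no namespace, so plain tag names are enough here.
movie = ET.fromstring(MOVIE_XML).find(".//Movie")
user_doc = ET.fromstring(USER_DOC_XML).find(".//UserDocument")

# Containment: certificates and roles are descendants of the movie element.
contained = [el.tag for el in movie.iter()
             if el.tag in ("ParentalCertificate", "Role")]

# Reference: the user document stores only the movie's key.
refers_to_movie = user_doc.findtext("movieDoc") == movie.findtext("movieId")
```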

We leverage the TDE template generated when we deployed the model to run SQL queries against our data. The “Company and Filmography SQL” tab has a query to find a company and its filmography. The SQL is a join of CompanyContributor and its contained filmography. Under the covers, there is nothing to join: one document holds everything. Document-structured data is made to look relational!

Finally, the “Person and Bios SQL” tab has a SQL query to show person contributors and their bios. This query joins PersonContributor and UserDocument, which really are separate documents. Recall UserDocument has a reference to PersonContributor.

Next, I walk through how to use the toolkit for UML modeling with the data hub using semantics.

Mike Havey