Enterprise Ready Mortgage Document & Metadata Repositories

March 09, 2016 Data & AI, MarkLogic

This is the third in a series of posts on building mortgage document and metadata repositories that enable firms in the mortgage industry to handle today’s needs and exploit new opportunities.

Mortgage document and metadata repositories are very demanding in terms of data complexity, scalability, security, and data reliability. While many technologies can handle one or two of these demands, few can handle all of them. I call a technology that delivers all of these capabilities Enterprise Ready.

As discussed in the first entry in this series, mortgage document and metadata repositories often contain billions of documents with widely varying, implied schemas, which can require many terabytes of storage. Relational approaches have great difficulty handling these levels of complexity and scale. Newer NoSQL and Hadoop technologies can often handle the scale needed by a modern mortgage document or metadata repository. They are also often better than relational approaches at handling complexity, although querying the wide variety of data types (document, geospatial, etc.) found in a modern repository can require combining multiple technologies, along with extensive design and development work, before queries can be performed. See the second entry in this series, “Building the Repository,” for an in-depth description of a modern mortgage document or metadata repository, or watch the webinar in which we demo it.

The rest of this blog entry will focus on the enterprise functionality needed to build a modern mortgage document or metadata repository.

Enterprise Ready – Security

Security is vital in any system that contains customers’ personally identifiable information (PII). PII is protected by privacy laws, and when it is hacked or exposed, the firm can end up on the front page of major newspapers and other media in a very unflattering light. In recent years there have been many major breaches of customer data, which have been extremely damaging to the firms involved.

A fully secure system requires encryption, user authentication, and document- and cell-level security. Old-school relational approaches can generally handle these kinds of security requirements. However, security and reliability are where many of the new NoSQL and Hadoop-based technologies fall down.
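To make the idea of document-level security concrete, here is a minimal sketch in Python. It is not MarkLogic's API or any product's implementation; the `Document` class, the `visible_documents` function, the role names, and the URIs are all hypothetical and exist only for illustration. The point is simply that each document carries its own list of roles allowed to read it, and a query only ever sees documents the caller's roles permit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    """A stored document plus the roles allowed to read it (hypothetical model)."""
    uri: str
    body: str
    read_roles: frozenset = field(default_factory=frozenset)


def visible_documents(documents, user_roles):
    """Return only the documents that share at least one read role with the user."""
    user_roles = frozenset(user_roles)
    return [doc for doc in documents if doc.read_roles & user_roles]


# Hypothetical repository contents and role names, for illustration only.
REPOSITORY = [
    Document("/loans/1001/note.pdf", "promissory note", frozenset({"servicing", "audit"})),
    Document("/loans/1001/ssn-form.pdf", "borrower PII", frozenset({"compliance"})),
    Document("/loans/1002/appraisal.pdf", "appraisal report", frozenset({"servicing"})),
]

# A user holding only the "servicing" role never sees the PII form.
servicing_view = [doc.uri for doc in visible_documents(REPOSITORY, {"servicing"})]
```

In a real repository this check would be enforced inside the database rather than in application code, so that no query path can bypass it.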

The database with the best security of all the NoSQL databases is MarkLogic. MarkLogic is the only NoSQL database with NIAP Common Criteria certification. It is extensively used by intelligence agencies. In fact, MarkLogic has had government-grade security from the start. For a more in-depth look, please see our MarkLogic Server security page.

Enterprise Ready – Data Integrity

If you are storing billions of documents in a repository, you need to be sure that all the data that is supposed to be there is there. Providing inaccurate responses to subpoenas or customer inquiries can lead to fines and damaged reputations. The key to transactional consistency and data integrity is ACID (Atomicity, Consistency, Isolation, Durability) transactions. With ACID, if a database claims a record is committed, it is.
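The atomicity part of ACID can be illustrated with a small sketch that uses Python's built-in `sqlite3` module as a stand-in for any ACID database. The table layout, the `ingest_batch` helper, and the document URIs are assumptions made up for this example. A batch of documents is ingested in one transaction; if any row fails, none of the batch is kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (uri TEXT PRIMARY KEY, loan_id INTEGER NOT NULL)")


def ingest_batch(connection, rows):
    """Insert all rows in one transaction; on any error, none of them are kept."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with connection:
            connection.executemany(
                "INSERT INTO documents (uri, loan_id) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


good_batch = [("/loans/1001/note.pdf", 1001), ("/loans/1001/deed.pdf", 1001)]
# The second row reuses an existing URI, which violates the primary key.
bad_batch = [("/loans/1002/note.pdf", 1002), ("/loans/1001/note.pdf", 1001)]

first_ok = ingest_batch(conn, good_batch)
second_ok = ingest_batch(conn, bad_batch)
stored = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

After both calls, the repository holds only the two rows from the first batch. The valid first row of the failed batch was rolled back along with the invalid one, so the repository never contains a half-ingested batch.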

MarkLogic has had ACID transactions since it was first released. In fact, it is not even possible to turn ACID off. This is not the case with most other products in the NoSQL and Hadoop space. Instead of ACID-based transactions, they tend to rely on BASE (Basically Available, Soft state, Eventual consistency).

The problem with BASE is that unless you use settings that cripple processing performance, you can lose data. With most settings, a BASE database can claim that a record has been saved while it exists only in the memory of a single machine. If that machine crashes before the data is written to disk, the record can be lost without the database realizing it.
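The failure mode described above can be shown with a toy simulation. This is a deliberately simplified model, not the behavior of any specific product: `ToyNode`, its `ack_after_flush` setting, and the `lost_after_crash` helper are hypothetical names invented for this sketch. The node acknowledges a write either before or after the data reaches its simulated disk, and then crashes.

```python
class ToyNode:
    """A toy storage node: `memory` is wiped by a crash, `disk` survives it."""

    def __init__(self, ack_after_flush):
        self.ack_after_flush = ack_after_flush
        self.memory = {}  # writes accepted but not yet durable
        self.disk = {}    # durable writes

    def flush(self):
        """Move every pending write from memory to disk."""
        self.disk.update(self.memory)
        self.memory.clear()

    def write(self, key, value):
        """Accept a write and return True once the node acknowledges it."""
        self.memory[key] = value
        if self.ack_after_flush:
            self.flush()  # make the write durable before acknowledging
        return True

    def crash_and_restart(self):
        """Simulate a crash: anything that was only in memory is gone."""
        self.memory.clear()

    def read(self, key):
        return self.memory.get(key, self.disk.get(key))


def lost_after_crash(ack_after_flush):
    """Return True if an acknowledged write is missing after a crash."""
    node = ToyNode(ack_after_flush)
    acknowledged = node.write("/loans/1001/note.pdf", "promissory note")
    node.crash_and_restart()
    return acknowledged and node.read("/loans/1001/note.pdf") is None
```

Calling `lost_after_crash(False)` models the acknowledge-first setting and reports the write as lost, while `lost_after_crash(True)` models flushing before acknowledging and keeps it. The trade-off is that the durable setting does more work before each acknowledgment.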

In large clusters with many nodes, as will be the case with large mortgage document and metadata repositories, machines will occasionally crash. BASE-based systems run the risk of data corruption, stale reads, and inconsistent data. Over time that risk approaches certainty, because in a large system with lots of computers, a machine will eventually fail before it writes data to disk and the data will be lost. This affects not just the standard operation of the repository but also high availability and disaster recovery.
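The "near certainty" argument is just the arithmetic of repeated small risks. As a rough sketch, assume each node independently has a small probability p of crashing with unflushed data in any given period; then the chance of at least one such loss across n nodes and t periods is 1 - (1 - p)^(n·t). Both the independence assumption and the parameter values below are illustrative, not measurements from any real system.

```python
def prob_at_least_one_loss(p_per_node_period, nodes, periods):
    """Probability that at least one node loses unflushed data.

    Assumes every node in every period fails independently with the same
    probability, which is a simplification made for illustration.
    """
    trials = nodes * periods
    return 1.0 - (1.0 - p_per_node_period) ** trials


# Illustrative (made-up) inputs: a 0.1% chance per node per day.
small_cluster_month = prob_at_least_one_loss(0.001, nodes=10, periods=30)
large_cluster_year = prob_at_least_one_loss(0.001, nodes=100, periods=365)
```

With these made-up inputs, the small cluster over a month has a modest chance of a loss, while the large cluster over a year is almost certain to see one. Adding machines or running longer can only push the probability up.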

If you want to be sure your queries return accurate results and that you get the same answer when accessing different machines, then you need ACID transactions. Fortunately for our clients, MarkLogic’s data consistency guarantees, high availability, and disaster recovery all operate at the enterprise level of performance needed by a modern mortgage document or metadata repository.

David Kaaret

David Kaaret has worked with major investment banks, mutual funds, and online brokerages for over 15 years in technical and sales roles.

He has helped clients design and build high-performance, cutting-edge database systems and provided guidance on issues including performance, optimal schema design, security, failover, messaging, and master data management.