In a previous post, I discussed the limitations found in many existing mortgage infrastructures, which leave them unable to handle the potentially thousands of different document variants that cause headaches in mortgage origination, processing, or securitization.
To get past these limitations it is necessary to build an infrastructure that includes:
Combining all these into a universal mortgage repository is not a utopian fantasy. In this entry, I will delve into work being done at major banking institutions to show how a next-generation system should work.
The biggest issue in building a repository is the number of different types of documents involved in tracking a mortgage over its possibly 30-year lifespan. Just scratching the surface, there are documents on customer information, checking and savings statements, check images, credit reports, and mortgage applications. The formats of many of these documents vary from state to state, or even from company to company. The versions of the documents used today may look very different from the versions that existed in 1995 or 2005, and it is necessary to store them all. Every time regulations change, new versions of many of the documents will be created.
A large bank that originates and processes millions of mortgages will store many, many thousands of variants of different document types — each of which has a different implied schema.
In some cases a firm does not want a complete document repository, where the actual documents are centrally stored and queried. It may prefer to keep the actual documents with the primary systems that create and maintain them. Instead of a document repository, it maintains a metadata repository. This makes it possible to determine which documents meet a set of search or query parameters and then quickly locate and retrieve them from the primary owner’s systems.
In the past, documents were paper and were stored in filing cabinets. Each primary system would maintain a metadata database that allowed users to query metadata to determine the ID numbers of required documents. Today, scanned images and PDF or Word documents have largely replaced the filing cabinets, making it easier to access the full document. Over time, firms have gradually increased the metadata stored on individual documents, often making it possible to answer some questions with just the metadata.
Metadata repositories can be easier to build than document repositories as it is not necessary to change metadata formats every time there is a small change to the underlying document.
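To make the metadata repository idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class names, the system names ("origination", "statements") and the location strings are all hypothetical and do not come from any bank's system or from MarkLogic. The point is simply that the repository holds descriptive attributes plus a pointer, while the document itself stays with the primary system that owns it.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Describes one document that still lives in its primary system."""
    doc_id: str
    source_system: str   # hypothetical name of the owning primary system
    location: str        # hypothetical pointer into that system
    attributes: dict = field(default_factory=dict)


class MetadataRepository:
    """Holds metadata only; the documents stay where they were created."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def find(self, **criteria):
        """Return records whose attributes match every given criterion."""
        return [
            r for r in self._records
            if all(r.attributes.get(k) == v for k, v in criteria.items())
        ]


repo = MetadataRepository()
repo.register(MetadataRecord(
    "D-1", "origination", "orig://apps/D-1",
    {"loan_id": "L-100", "doc_type": "application", "state": "NY"}))
repo.register(MetadataRecord(
    "D-2", "statements", "stmt://2015/03/D-2",
    {"loan_id": "L-100", "doc_type": "monthly_statement"}))
repo.register(MetadataRecord(
    "D-3", "origination", "orig://apps/D-3",
    {"loan_id": "L-200", "doc_type": "application", "state": "NJ"}))

# Query the metadata, then hand back where each document can be fetched.
hits = repo.find(loan_id="L-100")
locations = [(r.source_system, r.location) for r in hits]
```

Note that a record with a missing attribute (the statement has no "state") is still registered and searchable, which is why small changes to the underlying documents need not force a change to the metadata format.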
However, even with a metadata repository, there is still a lot of work to do. To start with, there are many different business processes involved in the life of a mortgage, including check processing, monthly statement creation, and mortgage origination processing. In fact, an infrastructure may contain 35 or more primary systems. These systems have largely been developed independently of each other, and the metadata varies from system to system. Mergers and acquisitions mean that different metadata approaches coexist in a single firm. These days, mortgages are regularly bought and sold after origination, so documents, and perhaps metadata, created by many different firms may coexist in the same database. All this means that even with a metadata repository there can still be thousands of variants of document types coexisting within a firm.
Trying to pull all these primary or metadata documents together with ETL and a relational database is a recipe for disaster. The large existing inventory of document types, and rates at which new document types and variants of existing types grow, means that any attempt to engage in a systematic modeling and ETL effort will likely never finish.
When discussing the mechanics of building a new repository we will use MarkLogic as the database backing the repository as it has all the needed functionality.
The first requirement in building a document or metadata repository is to just load the data. Instead of defining a schema and then engaging in extensive ETL to force underlying data sources to fit into that schema, you just load the data “as is.”
As data is ingested, MarkLogic’s Universal Index makes it immediately available for searching without any ETL required. Additionally, structured queries can be performed against the existing metadata in the primary PDF, Word and other documents, or against the XML descriptors or JSON tags found in metadata documents, again without any modeling or transformations.
This ability to powerfully access the data on load means that the repository can provide value from day one. Users can search and query, and get faster and more accurate results than before, even before any development begins.
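The load "as is" idea can be sketched in a few lines of Python. To be clear, this is a toy illustration and not MarkLogic’s Universal Index or its API: the class `SchemaFreeIndex`, its methods, and the sample documents are all hypothetical. It assumes JSON-like documents and shows how indexing every word and every path/value pair at load time lets differently shaped documents be searched and queried together with no schema defined up front.

```python
import re
from collections import defaultdict


class SchemaFreeIndex:
    """Toy index: words and (path, value) pairs are indexed at load time."""

    def __init__(self):
        self.docs = {}
        self.words = defaultdict(set)    # word -> ids of docs containing it
        self.fields = defaultdict(set)   # (path, value) -> ids of docs

    def load(self, doc_id, doc):
        """Store the document unchanged and index it as it arrives."""
        self.docs[doc_id] = doc
        for path, value in self._walk(doc, ""):
            self.fields[(path, value)].add(doc_id)
            for word in re.findall(r"\w+", str(value).lower()):
                self.words[word].add(doc_id)

    def _walk(self, node, prefix):
        """Yield (path, leaf value) pairs for nested dicts and lists."""
        if isinstance(node, dict):
            for key, child in node.items():
                yield from self._walk(child, f"{prefix}/{key}")
        elif isinstance(node, list):
            for child in node:
                yield from self._walk(child, prefix)
        else:
            yield prefix, node

    def search(self, text="", where=None):
        """Combine a word search with structured path == value constraints."""
        result = set(self.docs)
        for word in re.findall(r"\w+", text.lower()):
            result &= self.words.get(word, set())
        for path, value in (where or {}).items():
            result &= self.fields.get((path, value), set())
        return sorted(result)


index = SchemaFreeIndex()
# Three documents with three different implied schemas, loaded unchanged.
index.load("a", {"borrower": {"name": "Ann Lee"},
                 "property": {"state": "NY"},
                 "notes": "flood zone review"})
index.load("b", {"applicant": "Bob Ray", "state": "NY",
                 "remarks": ["No flood risk found"]})
index.load("c", {"borrower": {"name": "Cy Orr"},
                 "property": {"state": "NJ"},
                 "notes": "flood insurance required"})

all_flood = index.search("flood")
ny_flood = index.search("flood", where={"/property/state": "NY"})
```

In this sketch the word search finds all three documents regardless of their shape, while adding the structured constraint narrows the result to the one document that has that path and value. Document "b" stores its state under a different path, which is exactly the kind of variation that later, incremental enhancement of the repository would reconcile.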
Once data is loaded the repository can be optimized and improved upon in a variety of ways. There are two key factors to keep in mind during this process. First, optimization can be done in an incremental fashion. Data can be continually loaded and the repository can be searched and queried while repository enhancement is constantly making the system ever more powerful. As a result, time to measurable results can be a fraction of that needed for relational/ETL based projects.
Second, all the techniques discussed in this post can be used simultaneously in a single search or query (searches and queries can be performed together as well).
Some specific approaches to enhance the repository include:
Building a universal mortgage document or metadata repository has been nearly impossible with relational technologies, but major banking institutions are finding it fairly easy to do with MarkLogic. We have not covered it in this blog entry, but all of the capabilities we have discussed come with enterprise-level security, high availability and disaster recovery, and ACID transactions, in a clustered environment that can scale to many billions of documents.
We do a pretty good job, though, of showing the power of a universal mortgage repository in a world where mortgages are constantly changing and where the way mortgage data is accessed grows ever more demanding.
Dave Grant put together a terrific demo that lets you draw polygons in a geographic area and see the risks to your portfolio should there be a flood or plant closings. You can see exactly what I am referring to in this webinar, where we show the very non-uniform metadata from varying documents and how it seamlessly joins using semantic triples.
In future blog posts, I’ll show how many of the issues and challenges caused by today’s legacy infrastructures melt away.
David Kaaret has worked with major investment banks, mutual funds, and online brokerages for over 15 years in technical and sales roles.
He has helped clients design and build high performance and cutting edge database systems and provided guidance on issues including performance, optimal schema design, security, failover, messaging, and master data management.