Data & AI MarkLogic

The Essential Guide to Enterprise Search

bloghero-announcing-data-hub-5-0-and-marklogic-10-now-with-machine-learning

by Stephen Buxton Posted on April 02, 2021

This guide is a primer for enterprise search. It covers the basics about what to look for when choosing an enterprise-search solution and some expert tips for developing it so it meets the unique needs of your organization.

This content will be useful for both those on the business side who want to better understand enterprise search and also for a technical audience such as engineers who want to get a summary of key concepts before diving into the weeds.

What Is Enterprise Search?

It’s more than just googling something. A search is a way to locate information (documents, records, facts) to answer questions about the world. But, enterprise search goes far beyond that.

Enterprise search falls into four categories:

Find: I know there is a document, record or fact out there, and I need to find it.
- Get the latest version of my company’s travel policy documents.
- Show me the minutes from that staff meeting where we discussed enterprise search.
- Show me John Smith’s address and phone number.
- And, a more advanced use case: Show me the name and phone number of all customers in my area affected by the recall of a defective machine part. And, include all customers that bought any product that includes that part after the date the defect started. Also, order the results by year-to-date total order value for the customer so I can see the most valuable customers first.
Discover: I’m researching a specific topic. Help me navigate the information space.
- What are we doing with “machine learning”? I need to prepare for a product meeting this afternoon.
- Which compliance rules apply to the project I’m about to start?
- Which of my company’s manufacturing techniques could be improved?
Monitor: Watch a topic and keep me informed as things change (don’t make me keep doing the same search).
- When I log in in the morning, show me news articles and company memos that are pertinent to my project.
- When there’s a new crime report about burglaries in my precinct, e-mail it to me.
- When a researcher publishes new findings about the aerodynamics of the wing on the airplane I’m working on, send me a notification.
Analytics: I need to obtain a slice of relevant data (based on some set of criteria) to feed into analytics pipelines, including machine learning algorithms and BI tools.
- I want to apply my machine learning model to master some customers, but constrain the application to just a plausible set of candidates.
- I need to pull a random sample of financial trades meeting some criteria to test my machine learning model.
- I want to analyze (min/max/median) sales and maintenance renewal figures for all sales in my region by offering type, and show the results in a grouped vertical bar chart.

What Makes Enterprise Search Great?

When we’re talking about enterprise search, we’re talking about more than a basic indexing tool.

For us, enterprise search means building a first-class search application tailored to your data, with a custom-developed UI, designed as a highly personalized search experience that delivers answers within the context of your enterprise.

For example, one customer we work with is Springer Nature. Their search application, SpringerLink, is a good example since it’s publicly accessible (most of our customers’ search applications are secured behind firewalls). You can see at a glance that it looks more like Amazon.com than like Google. That’s because it’s tailored to their content and designed to drive revenue.

SpringerLink search powered by MarkLogic — SpringerLink

Another example is the BBC Sport website. It surprisingly doesn’t look much like a search application—there’s only a tiny search box in the top right corner. And yet, the site content is powered by semantic search and is highly customized to the needs of its users.

BBC Sport powered by MarkLogic — BBC Sport

A third example is our recently introduced Pharma Research Hub. It’s built on the same underlying technology, the MarkLogic^®Data Hub platform, but is tailored for searching across pharma research data such as genes, compounds, publications, etc.

These examples are very different, and I highlight them here in order to show the power and versatility you can achieve as a search developer.

So, where do you start?

Here’s the summary of steps required on which we will go into greater detail:

Process searches
Process the data
Invest in the UI
Enable notifications
Monitor and improve
Ensure end-to-end security

1. Process searches

When you build a data query application, you get lots of help from users. It’s okay to get users to fill out a form with named fields, and maybe even ask them to write (pseudo-) SQL. However, in a search application, users expect to type in a few words and get a perfect result (like Google). Here are some ways in which a search application can figure out what the user wants, given just those few clues:

Query expansion

For each word or phrase, the application adds words that the target document may contain. Common query expansion techniques include:

Stemming: User types in “mouse,” and the application also searches for words with the same stem, such as “mice.”
Synonyms: User types in “account,” and the application also searches for “ledger” (if the context is banking) or “description” (if the context is journalism).
Broader/narrower terms: User types in “Tylenol,” and the application also searches for “Acetaminophen,” and possibly “pain reliever.”

Semantic search

Leverage your knowledge of the world and your domain by expanding users’ search terms to include related terms.

For example, suppose your users want to search for regulations and safety standards that apply to some product they are about to make. If you type in “cardiac catheter” to a search engine, you may only get back standards that contain that phrase. But a semantic search can leverage the knowledge that a cardiac catheter is an implantable device, and that implantable devices need to work in a sterile environment. This is important so that a search for cardiac catheter returns standards related to implantable devices and other devices that need to work in a sterile environment.

Auto-complete

When the user starts typing, the application offers a list of possible completed words and phrases.

Users like auto-complete because it means less typing. The application likes auto-complete because it yields correctly spelled (indexed) words. Auto-complete is also an opportunity for the application to guide the user (e.g., by ordering alternatives by popularity, recency or importance to the user’s domain or task).

Facets

Add a set of facets to the search screen so users can explicitly narrow the scope of the search. Most users are familiar with facets. For example, you’ve been to amazon.com, where you can scope a search to some brands or a price range.

With facets, the application is asking the user to do some more work; a bit like the form-filling expected of a data query user. But, it gives a much clearer idea of what the user is actually looking for. Faceted searches are great for a more precise search. For insights (I get an overview of my search results, with counts); and, for navigation around a search space.

Suggestions

Did you mean [x]? With suggestions, the application gives users a chance to say more about what they want. For example, if a user types in “Apache,” the application can ask, “Did you mean Apache helicopter, Apache tribe or Apache software foundation?”

Fine-tuning

Developers can fine-tune the ordering of search results using term weights, field weights, document quality and more. It is important to order search results with the most relevant first. “Most relevant” is somewhat subjective—it depends on the user, their role, the task in hand, the search domain and of course, the search terms typed in and the content of the results.

Typically a search index calculates a relevance score for each result, based on TFIDF.

For example, if a user searches for “president,” the search index counts the number of times “president” appears in each result—that is the term frequency (TF). It also counts how many documents in the whole corpus contain the word—that is the document frequency (DF). Unusual words are more important, so the index calculates the inverse document frequency (IDF) of the term, as a measure of its unusualness. (A word that appears in only two documents has an IDF of 1/2, while a document that appears in 1,000 documents has an IDF of 1/100). In this way, when a user searches for “president OR Nixon” he should expect documents with “Nixon” to appear higher than documents with “president.”

In the search application, developers can fine-tune this default score calculation, and hence change the order of results based on the context and use case.

Here are some of the ways to fine-tune a search application:

Apply term weights to some of the terms the user typed in. If you know that some term is more important than the others, because your semantic graph tells you it is important or because the user told you, then you can apply a weight to that term.
Apply field weights to some sections of the document. For example, weight the title so that when a term appears in the title, it contributes more to the score than when the same term appears in the document body. You can do this in the query, or you can set the index to always weight some field.
Document quality is described in more detail below. The idea is that some documents are more important (more authoritative) than others, independent of the search terms.
Note that values (dates and numbers) affect relevance too. For example, you probably want newer results to show higher in the list, all other things being equal, but you do not want the list ordered by date. Depending on your use case and domain, you may want a value to contribute a lot or a little to the relevance score (a search of news articles vs a search of research articles). And, you may even want older results listed first (a search of historical accounts or patent applications).

For each of these techniques the context—the kind of information being searched, and the user’s role, task and interest—is important. As you build out your search application, you can make more use of context to delight your users.

2. Process the data

Just as users want to find just the right documents and data to answer a question, give insights or explore a space, so the data wants to be found. You can make sure the right information is found by the right searches by annotating it.

If your information is sacrosanct—that is, if you need to keep a pristine, unchanged copy—that is okay. You can store annotation data separately along with a link to the original. Better yet, if your information is stored as XML or JSON documents, you can wrap the original in a logical envelope so the annotations are physically in the same place as the original, but the original is preserved intact. This is called the envelope pattern.

Find me! Find me!

You start with the annotations that let the document say which searches should match it. Here the document is saying “if you’re looking for information about dogs (or Nixon or China), find me!” You are leveraging the fact that, with a search application, you have complete control over the data (or at the very least have complete control over the metadata).

You are turning the search paradigm upside down and asking of each document, “tell me more about yourself. In particular, tell me what users would be searching for when they should find you.”

Useful “Find Me” annotations include:

Keywords – This is a list of words that users might enter when searching for this information. This is the most obvious example of the “Find Me!” principle, where the document gets to say which searches should match it. By putting “dogs” in a keywords section of the document metadata, you are saying, “This document is all about dogs, trust me. It does not matter that the body text does not mention that word. If the user is looking for dogs, provide this document!” Back in the day when storage was expensive and indexing was primitive, the only way to do a full-text search was to add keywords manually. Nowadays, keywords provide a backstop—a way to make sure that a record is found when it is supposed to be.
Metadata – Information about the information, such as provenance (where the information came from), author and last changed date.
Classification – Tagging which sets the information belongs in. Classification may also be inferred. For example, you can infer that if a document came from Bob (who works in the lab) and the title includes the name of a drug, then it is likely that the information fits in the categories “lab” and “drugs.”
Entities – Things that are mentioned in free-flowing text, that we know a lot about. For example, in the sentence “Nixon went to China,” “Nixon” refers to a person, the person is probably Richard Nixon and Richard Nixon used to be President of the United States. If you annotate the record containing this sentence appropriately, users can find it when searching for people, for Richard Nixon, for ex-presidents or for politicians.

Other common entity types are companies/organizations, places, things (weapons, drugs, etc.) and patterns such as numbers, dates and credit card numbers. Now you can build rich searches such as “find me all records that talk about a visit by a US president to an Asian country within 1,000 miles of North Korea in the ’70s.”

Note that entities occur in data too. A common data integration problem is that often you do not know what data you have. Entity identification can help figure out which (parts of) records represent a place, person, or thing (weapon, drug, etc.).

So far so good. Your search is shaping up, and you are getting some good results.

But, some searches still do not return the results that, to the user, are “obviously” the best. For example, if a user searches for “Citizens United” in a legal database, the very first result should be “Citizens United v. Federal Election Commission, 558 U.S. 310”—regardless of how many times the phrase “Citizens United” occurs in the document.

You can fine-tune search results to address this challenge in a number of ways:

Add a new keywords field, like super-keywords. In our example, add “Citizens United” to that field, and always search that field first. (You will see later that you can also weight searches that match anything in that field).
Add a quality field, and push high-quality documents to the top of the results.

In any corpus, there are some records that are higher quality than others. In a legal database, the ruling on Citizens United is of higher quality than a discussion about the ruling. You can also say it is more authoritative.

In the world of web search, Larry Page and Sergey Brin invented the PageRank algorithm that was in large part responsible for the success of Google search. PageRank says a web page gets more authoritative as more authoritative pages link to it. Unfortunately, this link analysis does not translate to enterprise search.

Quality may be an aggregation of many things. For example, the source, the author and the document type all impact quality. So, a PDF report written by the head of a large company and published to the “things everyone should know” company wiki page will have a higher quality than an unattributed notepad file posted on a page titled “sandbox.”

In a search application, you have complete control over the quality setting of a document, and you can take into account your knowledge of the search corpus and your users’ roles and tasks as well as the content of the document.

Note that quality is entirely an attribute of a document or record, independent of the user’s search.

Add a dynamic star rating to each document. Ideally, users would give direct, concrete feedback to the application by rating search results from “great” to “useless.” Unfortunately, experience shows that users do not spend the extra click (plus a few brain cycles) to give direct feedback, so manual “star rating” schemes generally fail. But, this is an application! Your application can figure out the results documents that are successful against particular search terms (and which are successful independent of the search). See “Track Everything” below.
Cheat! You can program for “best bets.” When writing code, developers have the freedom to make the application work exactly as they want it to. So, if every search result before 9:00 a.m. should begin with “news of the day,” or if you want every e-commerce search to show the day’s specials first, you can do that!

Search developers often spend a lot of time and effort trying to reverse-engineer a search engine and its indexes to get just the result their users want. With a search application, you can break into the user experience at any point and make it do whatever you want.

You can introduce a manual mapping from anything to anything. For example, whenever the user’s search includes the phrase “Citizens United,” you can make “Citizens United v. Federal Election Commission, 558 U.S. 310” the top result, and add a blurb that says “this is the Supreme Court opinion of Jan 21 2010.”

This is very much user-/domain-/role-/task-specific. If the user is a law student researching a case, they probably want to be pointed directly to the court opinion. But, if the user is a reporter writing a piece on a speech that mentions “Citizens United,” the reporter may prefer the wiki page.

The practice of programmatically assigning a best search result to a search, or part of a search, is sometimes known as “best bets.” Adding best bets will improve the quality of results and improve performance (since a best bet is a simple mapping, not a search).

3. Invest in the UI

The results page of a typical search engine is a list of blue links to documents. But, you can do so much better!

It’s important not to skip this step as it’s an opportunity to make full use of the real estate and visual capabilities in the browser, and the contextual information right at your fingertips. You want to give users insights into the area being searched, and tips on where to go next.

After a search, users want to know two things: What are the top results? What does the overall set of results look like?

Information about each result

Users want to see relevant information about each of the top results in order to know which ones to click on to answer a question. If the information about the result is really good, it may answer the question immediately—task complete.

Here are some ways to present information about each result (each document or record that may answer the question posed by the search):

Snippets

Often, results pages show a snippet below each document link—a few lines from the document text, often with search terms highlighted. The purpose of a snippet is to tell users at-a-glance what the document is and what it’s about, so they know whether it’s worth clicking on it.

Here are some ways to make the snippet richer:

Add metadata – If users can see who wrote the document and when, and possibly how long it is, that’s useful information that helps decide if that’s the document they are looking for.
Look for ways to say more – For example, if you know the author is your CTO, or if you know he has co-authored papers with luminaries in my domain, that makes the document stand out. The challenge is pulling in as much context as possible and displaying it (visualizing it) in a way that users can consume quickly and easily.
Show images and tables – A handful of thumbnails showing the images and tables in a document often gives a rich at-a-glance summary of the content.
Show headings, with links – The headings give another summary of the document and allow users to get right to the appropriate part of the document.
Show any of the “Find Me!” annotations – Keywords, super-keywords, classification, entities. Remember, users will get to glance at a list of search results and decide where to go next. The more information you can give users in that glance the better.
Cards and other per-result visualizations – For some kinds of results, it is more effective to show a card rather than a snippet. A card is generally a rectangle that contains the most effective at-a-glance components for your kind of search result—a picture, a star rating, a headline. You can group cards by type and present them on a carousel, whi h is an easy way to swipe through the group of results. For example, if you google for “Trump news,” users will see three carousels—top stories, twitter and videos—followed by a list of snippets. Cards can be richer than snippets, and they lend themselves to grouping by type. But, they take up a lot of space. Experiment with a mix of cards for commonly accessed, high-relevance, grouped results and use snippets for “the rest.”

Information about the results set

In addition to information about each result, users also want some context. What does the overall result set look like? What else does the application know that can help users with their task?

Facets

Most people are familiar with facets from Amazon.com and other retailing sites, where users can scope a search to some brands, a price range or some attribute. We talked about them above as a way to get users to narrow the scope of searches. Facets, with counts, play another important role in the results page—they give an overview of the entire results set.

Recommendations: More like this

Imagine that a user typed in a search and saw a result that is almost what they want. Or, users find results that they like, but want to explore more. For each result, you can add a recommendations link to find more results like this one. “Like” may be a composite of categories, metadata and entities.

Visualization: Word-clouds and more

A word-cloud is an image that shows the most important words that occur in the results set. More important words are shown in bigger letters. Some word-clouds also use color to represent some facet, such as sentiment or entity type.

Importance is generally the number of results the word appears in, possibly weighted by the number of occurrences and unusualness of the word (a very unusual word that occurs very many times is very important). A word-cloud can give the user an overall view of the results set—when I run this search, this is what the results look like. If it is clickable, it can also prompt the user to search for what she really wants.

You can extend the notion of a word-cloud in many ways:

A cloud of co-occurrences of terms – Discover connections and meaning in the results that are not apparent any other way. For example, discover that “Gatorade” and “flu” occur together more often than any other terms.
A cloud of entities – Show a cloud of all the people, places and things mentioned in the results set. More important (more common) entities are bigger. Show people, places and things in different colors.
A cloud of trending words or entities – Use color or size to show a word or entity that is becoming more common in the data and/or in searches.
A cloud of bucketed number and date values – Leverage your metadata and markup to show common ranges of values, such as published date or star rating.
Data visualizations: Barcharts, pie charts, graphs – The answer to a question is not always a set of documents. The answer to a search may be better delivered as a data visualization using a bar chart or pie chart.
Info panel – Add a panel to the right of the search results that tells the user all about the search context. If a user searches for “multi-model,” the search application will show all the documents that contain that word. But, the search application knows lots of things about “multi-model.” Why not show that to the user in an info panel? Google does this a lot. Often, the information in the info panel answers the question the user meant to ask. At the very least, it gives the results set a baseline set of facts for context.
Semantic graphs – Show the main topic of the search as a semantic graph. While an info panel shows at-a-glance facts about the search, a semantic graph puts the search in context. Ideally, the graph has clickable links, allowing the user to navigate around the search space.

4. Enable notifications

Imagine that for some searches, users need to keep track of new information as it comes in. For example, imagine that an analyst is tracking news about oil rigs in Brazil. The analyst can just search for “oil rigs in Brazil” every day (or every hour) and when something new comes in, the analyst sees it.

Using saved searches is actually a poor user experience. First, the analyst has to keep running the same query over and over just in case something new has arrived. Second, if the analyst checks every morning, then they miss the news by as much as a day. And third, analysts may not see the new item at all since it may be buried deep down in a results set.

In a search application, a user can save a search to run later. Once a search is saved, the user can assign an action to be triggered when that search is satisfied. Taking our “oil rigs in Brazil” example, the user can tell the application to provide an alert (a flag on the screen or an email) whenever a new document arrives that would appear in the results of the “oil rigs in Brazil” search. The alert will appear immediately the document is ingested, with no user action required.

Note that for efficiency, your underlying search engine should have a special index type for alerts. Many search systems have only basic real-time alerting capabilities that fall over at scale.

5. Monitor and improve

What are your users searching for? How does that change over time? Are the searches yielding the right results? How often do searches fail? Are results returning fast enough?

As the builder of the search application, you must know the answers to these questions. Further, you should have a concrete, written-down SLA (Service-Level Agreement). Measure your application against this SLA regularly (and frequently), and increase the SLA wherever you can.

Let’s double-click on those questions:

What are your users searching for?

The baseline for monitoring your search application is to log the text of every search. Someone should eyeball the search log every day. It’s surprising how much information a person can glean by skimming through a list.

For example, the result the user wanted is often obvious to a human reader. You can make that result happen with better query/data processing or with best bets.

“Ah,” I hear you say, “but I have hundreds of users running thousands of searches. I’d need an army of people to monitor all searches manually!” Start monitoring searches as soon as you have some users in alpha testing. As patterns emerge, you can write scripts to help with the load. But, even with a mature set of scripts you should eyeball the most frequent searches every day. In most implementations, the top 20-50 searches represent 80% of user activity.

How do searches change over time?

Look for changes in search terms, such as spikes and new terms.

Spikes: You may see, for example, that suddenly most of your financial services users are searching for internal documents that mention Company X, where Company X is in the news for an earnings report, for fraud or for some other scandal. The first time you notice this kind of spike you can add a “best bet” (see above) where you map any search for Company X directly to a “perfect search” or to a pre-prepared list of documents.

Over time, you may add a script that looks for company names in searches and sends an alert whenever some company spikes.

Eventually, you may decide that a spike occurs every time a company in the financial services sector issues an earnings report. In that case, you should be proactive by creating a best bet the day before the earnings report is due.

New terms: Users may start searching for a new term because a new product has come onto the market, a company has done something to appear on people’s radars or because there’s a new buzzword in town. A simple script over the search log can identify new terms. Once identified, make sure your regime of query processing—synonyms, broader/narrower terms, and so on—includes the new term.

Are searches returning the right results?

A successful search answers the user’s question. It can be tricky to measure success or failure, but here are a few things to look for:

- Track when a user does not click on any result in the list. This may mean that you’ve done a perfect job of drawing a pretty results page – showing snippets and overall results. More likely, the user didn’t like any of the results and tried a different search.

Make sure your search log is indexed by user and time, so you can see the very next search by that user. If the user does two or three searches with no clicks and then abandons the search, that’s really bad. In that case, you should figure out how to improve that first search using the second and third as clues.

- Track which result a user clicked on. If it’s not on the first page of results, tweak your ranking so it appears higher. If it’s on the first page but not at the top, consider tweaking the ranking to push it higher.
- Track how long the user spends after clicking a result. If you’re doing document searches, longer is better. It probably means the user is actually reading the result. A short time spent on a long document is more likely to indicate that it’s the wrong one. In that case, look at ways to improve the snippet.
How often do searches fail?

Even in the best search application, some searches are going to “fail.” Set your SLA aggressively—certainly less than 10 percent. And, add a mechanism for users to give feedback. It’s really hard to get feedback from users most of the time, but they are generally willing to vent about the worst search results.

What about performance? Is the search fast?

When setting realistic goals for performance, you should ask yourself “… compared to what?”

Ideally, you would measure the end-to-end time taken to complete a task and compare it with the time taken to complete the same task without the search application. That’s your absolute baseline. Then compare the task time with your current application and platform to the (estimated) task time with some other application/platform obtained during the evaluation phase of your search application project. It’s important not to evaluate current performance against some vaguely similar application that is doing very different work on very different hardware (such as Google).

And, while it’s always good to know what users want, users’ wishes cannot be the basis of performance goals without reference to the laws of physics (“I want to search an infinite amount of data in a time approaching zero.”). It’s all about trade-offs. Part of the job of search developers is to formulate and communicate the trade-offs between functionality, performance and resource spending.

Note there are at least two important measures here: the time taken for the user to see some response and the time taken to completely draw the results page. Strive to make the initial response very fast but still useful, then you can take longer to tweak the rest.

Continuous improvement

The care and feeding of an enterprise search application is an ongoing effort. While machine learning continues to improve, you still need people to monitor SLAs and tweak the system as queries, data and the world change.

Be on the lookout for ROI on your efforts. Don’t strive for perfection, but look for small, easy changes that will yield big gains against your SLAs.

Follow the same rules of deployment as for any other enterprise application. Start small, keep your stakeholders close and continuously measure quality and performance against an agreed-upon goal (SLA). Continually look for ways to delight the users with more personalization, better results visualization and smarter recommendations.

6. Ensure end-to-end security

Don’t make the mistake of starting out by making all “public” information searchable, and worry about security later. First, you’ll hit security issues sooner than you can possibly imagine. An effective search application will uncover information you never knew was out there hiding in plain sight! Second, you need to be sure that when you do reach that point, the platform you’ve chosen can handle the security you need.

Don’t think about security as a way to prevent access to information. Instead, think about security in the context of how to manage data sharing. The better your security, the more information you can share. It’s very easy to prevent all access to all information. The job of search developers is to enable as much access to as much information as appropriate (but no more!).

Look for:

Role-Based Access Control (RBAC) so you can allow different access (read, write, update) to different users and roles.
Authentication and authorization that fits with your corporate standards and works with your corporate directory(ies).
Fine-grained security: Control access to the database, to a directory, a document and portions of a document (e.g., any user can see the title and abstract of an article, but only subscribers can see the full document).
Access security that is enforced at the database level (not at the application level) so that users cannot bypass security controls.
Encryption at the level of the file system so that even if someone has access to the computer the application is running on, they cannot read the platform’s underlying files.

Another aspect of security is operational security—you must be able to back up and restore information. This means keeping a synchronized copy for disaster recovery and running on a high-availability system so your users won’t be denied access in the event of a failure. And, you should be able to produce a redacted copy of the information for testing or to send to partners.

Why Not Just Use Google?

When we talk to customers, they often say that internal teams have given up on their old internal systems for finding information and often just go to Google. Of course, Google doesn’t handle private data, but it still leads to inquiries around why an organization’s enterprise search isn’t as good as Google. The problem is that enterprise search is completely different from a Google internet search.

How Google is different from enterprise search

Google made a serious play into enterprise search with the GSA (Google Search Appliance) in 2002. The promise of the GSA was the ultimate commoditized enterprise search. Google sends you a black box piece of hardware that you plug into a power supply and your intranet, and it discovers all documents within your organization and makes them available for search via a web page.

In reality, enterprise search is not that simple, and certainly not that homogeneous. Each organization’s needs are different, and enterprise search is very different from internet search in many ways:

Security – In an enterprise, the amount of generally available information that you can find on the internet is relatively small. Most information is proprietary and needs fine-grained security controls.
Relevance – Google’s killer feature was PageRank—a relevance algorithm that calculated how authoritative a web page was based on how many other pages are linked to it (and, recursively, how authoritative those pages were). This worked tremendously well on the internet, which is a set of linked documents (pages). But in the enterprise, there is little or no linking between documents. And, as we’ve seen, calculating the relevance of a result to some search (for some user in some domain doing some task) is hard, and is not the same for all users.

The Google Search Appliance is widely viewed as a failure, and in early 2016, Google announced its End Of Life. At the time, CMSWire reported, “This definitely seals the end of the era of commoditized search. It also highlights the fact that web search and Enterprise Search are a completely different ball game.”

Yes, Google is fast, but …

If your users use Google’s performance as a benchmark for expectations, here are a few responses. Yes, Google is fast, but …

It’s not doing what the Enterprise Search application is doing. It only has to search for one kind of document (a web page). While Google results pages have grown richer over the years—with cards, carousels, info panels and so on—they are still nowhere near as rich as the results page you can build in an Enterprise Search application. And all that richness—rich results snippets, facets, rich info panels and more—is customized for you personally, in your domain, in your role, doing your job. Google doesn’t do that!
The Google application is running on a vast amount of hardware (nobody knows how much, but estimates start at a million computers) on custom software (the whole stack is painstakingly custom-built by engineers with PhDs) with care and feeding from an army of people.

Why is this so hard? Google doesn’t do that!

Another pushback you may hear is, “Why are you telling me I have to process my queries, process my data, track everything, tweak relevance constantly—Google doesn’t do that!”

Actually, Google does do (almost) all those things. They certainly process queries to get the best results. They track everything. (Google’s biggest strength is they have bazillions of queries to feed their relevance tuning efforts). And, they tweak relevance and the presentation of results all the time.

The only thing they don’t do is process the data. Nobody at Google reaches out to the web pages they search and changes them to make them more findable. But, Google doesn’t have to. The content creators tweak their own content for search-engine optimization!

Google sets the bar

There’s no intent here to bash Google. Google internet search is awesome! It’s fast, it’s accurate, the results pages are rich and they are constantly enhancing existing capabilities and adding new ones (such as cards and carousels).

As a search developer, you can certainly use Google as a model. Track their feature set, and as they add new features (such as carousels), find ways to implement those features in your own search application, but with an enterprise twist. You can focus your efforts on providing a better, more personalized experience for your users. While Google optimizes for serving the masses, your search application is tailored for deep, personalized searches.

Next Steps

Now that you understand what makes a great enterprise search experience, it’s time to start building!

If you’re interested in learning more about MarkLogic’s built-in search, click here for an overview.

If you want to tip-toe into more technical content, you can watch a presentation from Mary Holstege about designing search from first principles.

And if you’re ready to get started, we have free training on-demand. Much of our training addresses search in the context of a MarkLogic Data Hub, but there are plenty of training classes specific to building search applications.

Stephen Buxton

Stephen Buxton is the president of BTC, an independent consulting firm. Previously he was the Product Manager for Search and Semantics at MarkLogic, where he was a member of the Product team since 2005.

Stephen is the co-author of "Querying XML" and a contributor to "Database Design", a book in Morgan Kaufman's "Know It All" series.

Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.

Related Tags

MarkLogic

MarkLogic

Semaphore

OpenEdge

DataDirect

Sitefinity

Telerik

Kendo UI

Corticon

DataDirect

MOVEit

Chef

Flowmon

Kemp LoadMaster

WhatsUp Gold

Telerik

Kendo UI

Fiddler

Test Studio

MOVEit

WS_FTP

The Essential Guide to Enterprise Search

What Is Enterprise Search?

What Makes Enterprise Search Great?

1. Process searches

Query expansion

Semantic search

Auto-complete

Facets

Suggestions

Fine-tuning

2. Process the data

Find me! Find me!

3. Invest in the UI

Information about each result

Snippets

Information about the results set

Facets

Recommendations: More like this

Visualization: Word-clouds and more

4. Enable notifications

5. Monitor and improve

Continuous improvement

6. Ensure end-to-end security

Why Not Just Use Google?

How Google is different from enterprise search

Yes, Google is fast, but …

Why is this so hard? Google doesn’t do that!

Google sets the bar

Next Steps

Stephen Buxton

Related Tags

Latest Stories in Your Inbox