Semantic Data Collections

Overview

A Semantic Data Collection is a searchable collection of JSON documents, along with some metadata that determines how to search the documents based on a query. You can upload a Semantic Data Collection changefile to add documents to, or remove documents from, a Semantic Data Collection. The documents are stored in an Elasticsearch 2.1 server, which has a very expressive query language. Once a document is found, it becomes a recognition result mapped to the explicit recognition keys in the document, or to a default recognition key specified in the Semantic Data Collection.

Since the documents stored in a Semantic Data Collection have explicit fields with semantic information, very informative Inline Answers can be created from their associated recognition results.

If you make your Semantic Data Collection public, all Solve for All users will gain the ability to search your data.

Semantic Data Collections can also provide query suggestions while the user is typing in the answer query input field.

To get started creating a Semantic Data Collection quickly, check out the guide to creating Content Template Answer Generators.

How they work

Answer Generators and Triggers may be associated with one or more required Semantic Data Collections. If a Semantic Data Collection is required by any Answer Generator or Trigger that the user has installed, the Semantic Data Collection will be searched. The Semantic Data Collection may specify a template for a query definition written in the Query DSL, or else a default query template will be used. The input query is substituted into the template to form the final query definition.

The results of searching a Semantic Data Collection are then transformed into recognition results by mapping the recognition keys in the matched document(s) to the document itself. If no recognition keys are specified in a matched document, the default recognition key of the Semantic Data Collection is used. The recognition results are also enhanced with the recognition level which is derived from the score of the document.

Properties

In addition to the common plugin properties, Semantic Data Collections have the following properties:

Name: Query Template
Example value:
{
  "multi_match" : {
    "query": "{{query}}",
    "fields": ["name", "symbol"]
  }
}
Description: Used to search the Semantic Data Collection. See the documentation for Query Templates below.

Name: Boost Factor
Example value: 2.0
Description: A factor to multiply the recognition level by. Use this to ensure good matches have a recognition level near 1.0. For example, if you find that the recognition level of good matches is around 0.5 (using ??debug in your search query to inspect the recognition results), you could set the boost to 2.0. This boost is applied before the Removed Query Term Boosts. Defaults to 1.0 if not set.

Name: Removed Query Terms
Example value: chem,chemical,element
Description: A comma-separated list of lower-cased terms to be removed from the answer query before it is passed to the query template.

Name: Removed Query Term Boosts
Example value: 0.1,0.2
Description: A comma-separated list of boosts, one for each of the corresponding removed query terms. If the corresponding query term is found in the query, the boost will be added to the output recognition level. If a query term is found in a query and it has no corresponding boost, the last boost specified before its position will be used. If multiple query terms are found, the maximum of the corresponding boosts will be used. Each boost should be between -1.0 and 1.0. Removed Query Term Boosts are applied after the multiplicative boost. If no boosts are specified, the recognition level is not changed even if removed query terms are found.

Name: Default Recognition Key
Example value: com.solveforall.recognition.science.chemistry.ChemicalElement
Description: The recognition key a matched document will be mapped to, if it does not contain the recognitionKeys property.

Name: Name Field(s)
Example value: name
Description: The name of the field in each document that holds a string that should be similar to the answer query. If found, it will be compared to the answer query to adjust the recognition level of each matching document. You can set multiple fields by separating them with commas. If you set multiple fields, the closest match will be used. Defaults to name if not set.

Name: Suggestion Icon URL
Example value: https://www.data.com/favicon.ico
Description: If a completion suggestion matches a document, the URL of the icon that will be used to decorate the suggestion.
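
The interaction between Boost Factor and Removed Query Term Boosts can be sketched as follows. This is only an illustration of the arithmetic described above, not Solve for All's actual implementation: the function name adjusted_level and its parameters are hypothetical, the query is assumed to be already split into lower-cased terms, and no clamping of the result is modeled because none is documented.

```python
def adjusted_level(level, query_terms, boost_factor=1.0,
                   removed_terms=(), removed_term_boosts=()):
    """Apply the multiplicative boost first, then the additive removed-term boost."""
    level *= boost_factor
    if not removed_term_boosts:
        # No boosts configured: removed terms never change the level.
        return level
    found = []
    for position, term in enumerate(removed_terms):
        if term in query_terms:
            # A term without its own boost reuses the last boost specified before it.
            index = min(position, len(removed_term_boosts) - 1)
            found.append(removed_term_boosts[index])
    if found:
        # When several removed terms are present, only the largest boost is added.
        level += max(found)
    return level
```

With the example values from the properties above (boost factor 2.0, removed terms chem,chemical,element and boosts 0.1,0.2), a raw level of 0.4 for a query containing both chem and element is first doubled and then increased by 0.2, because element has no boost of its own and reuses the 0.2 given for chemical.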

Query Templates

Query templates, written in the Mustache templating language, contain placeholders for the query string, and are used to form query objects in the Elasticsearch Query DSL. See the documentation for template query for more details.

Search Template Parameters

The following parameters are passed to query templates:

Name | Example Value | Description
query | UC Berkeley | The original query, stripped of leading and trailing whitespace, activation codes, and removed terms
englishLowerCasedQuery | uc berkeley | The value of query, but lower-cased in English
localLowerCasedQuery | uc berkeley | The value of query, but lower-cased in the user's local language
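
As a rough illustration of how these parameters relate to each other, the sketch below derives them from a raw query. It is a simplification under several assumptions: template_parameters is a hypothetical name, activation codes are not modeled, terms are split on whitespace (which also collapses repeated spaces), and Python's locale-independent str.lower() stands in for both the English and the local lower-casing.

```python
def template_parameters(raw_query, removed_terms=()):
    """Build the parameters passed to a query template (simplified sketch)."""
    removed = {term.lower() for term in removed_terms}
    # split() discards leading/trailing whitespace; removed terms are matched lower-cased.
    kept = [word for word in raw_query.split() if word.lower() not in removed]
    query = " ".join(kept)
    return {
        "query": query,
        "englishLowerCasedQuery": query.lower(),
        # Stand-in only: real locale-aware lower-casing may differ for some languages.
        "localLowerCasedQuery": query.lower(),
    }
```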

Default Template

If the query template is not specified, Solve for All will use the following template by default:

{
  "common" : {
    "_all" : {
      "query" : "{{query}}"
    }
  }
}

Note that this requires that the _all field remain enabled. This is only used for completely entered queries.

Limitations

To maintain the performance of queries, Solve for All imposes some limitations on query templates. The following properties are not allowed anywhere in a template:

Solve for All is not yet sophisticated enough to determine if these property names are used to identify search methods or as references to fields in documents, so it is best to avoid using these property names in your documents.

Two additional limitations apply:

  • The maximum size of a query template may not exceed 2000 bytes
  • No property name can contain a variable placeholder, for example:
    {
      "match_{{query}}" : {
        "term" : {
           "name" : "Jeff"
        }
      }
    }
    would not be allowed.
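
Before uploading, you may want to check a template against the two additional limitations yourself. The following is a minimal sketch, not the check Solve for All actually runs: template_problems is a hypothetical helper, it assumes the size limit is measured in UTF-8 bytes, and it assumes the template is valid JSON before substitution (which holds when placeholders appear only inside strings). It does not check the disallowed property names.

```python
import json

MAX_TEMPLATE_BYTES = 2000


def template_problems(template_text):
    """Return a list of human-readable problems found in a query template."""
    problems = []
    if len(template_text.encode("utf-8")) > MAX_TEMPLATE_BYTES:
        problems.append("template exceeds 2000 bytes")

    def walk(node):
        # Visit every property name in nested objects and arrays.
        if isinstance(node, dict):
            for key, value in node.items():
                if "{{" in key:
                    problems.append("placeholder in property name: " + key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(template_text))
    return problems
```

An empty list means neither limitation was violated; the disallowed example above yields one entry naming the offending property.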

If you have a good reason to use one or more of the disallowed features above, let us know at feedback@solveforall.com, and we'll consider allowing the feature(s).

Example Flow

As an example, consider the Python Standard Library Semantic Data Collection. Its query template is

{
  "multi_match" : {
    "query": "{{query}}",
    "fields": [
      "name^2",
      "simpleName"
    ]
  }
}

and its default recognition key is com.solveforall.recognition.programming.python.StandardLibrary.

Here is an example document in the collection:

{
  "_id" : "urllib.parse.urlencode",
  "name" : "urllib.parse.urlencode",
  "simpleName" : "urlencode",
  "params" : [
    "query",
    "doseq=False",
    "safe=''",
    "encoding=None",
    "errors=None"
  ],
  "kind" : "function",
  "path" : "urllib.parse.html#urllib.parse.urlencode",
  "summaryHtml" : "<p>Convert a mapping ...</p>"
}

Now, assume the user has the Python Standard Library Answer Generator installed, and enters the answer query urlencode. Here are the processing steps:

  1. Since the Python Standard Library Answer Generator declares that the Python Standard Library Semantic Data Collection is required, it will be searched with the above query template. The query template, when substituted with the answer query urlencode will evaluate to:
    {
      "multi_match" : {
        "query": "urlencode",
        "fields": ["name^2", "simpleName"]
      }
    }
  2. This matches the example document above, and since it does not contain the recognitionKeys property, the default recognition key of the Semantic Data Collection, com.solveforall.recognition.programming.python.StandardLibrary, is used. This recognition result will be output:
    {
      "com.solveforall.recognition.programming.python.StandardLibrary" : [
        {
          "name" : "urllib.parse.urlencode",
          "simpleName" : "urlencode",
          "params" : [
            "query",
            "doseq=False",
            "safe=''",
            "encoding=None",
            "errors=None"
          ],
          "kind" : "function",
          "path" : "urllib.parse.html#urllib.parse.urlencode",
          "summaryHtml" : "<p>Convert a mapping ...</p>",
          "recognitionLevel" : 0.7833126
        }, ...
      ]
    }
    where recognitionLevel is derived from the matching score for the document.
  3. The Python Standard Library Answer Generator can now easily take this information to create an inline answer.
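
The steps above can be approximated in a few lines. This is a sketch under stated assumptions rather than the real pipeline: render_template uses plain string replacement instead of a Mustache engine and handles only the {{query}} placeholder, and to_recognition_results takes hypothetical (document, recognition level) pairs in place of actual Elasticsearch hits.

```python
import json


def render_template(template_text, query):
    """Substitute the answer query into a query template and parse the result."""
    # json.dumps escapes quotes and backslashes; drop the surrounding quotes it adds.
    escaped = json.dumps(query)[1:-1]
    return json.loads(template_text.replace("{{query}}", escaped))


def to_recognition_results(hits, default_key):
    """Map matched documents to recognition keys, as in step 2 above."""
    results = {}
    for document, level in hits:
        # _id and recognitionKeys are removed from recognition results.
        body = {k: v for k, v in document.items()
                if k not in ("_id", "recognitionKeys")}
        body["recognitionLevel"] = level
        # Fall back to the collection's default key when the document has none.
        for key in document.get("recognitionKeys") or [default_key]:
            results.setdefault(key, []).append(dict(body))
    return results
```

Each recognition key receives its own copy of the document, so a document that lists several recognitionKeys appears once under each of them.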

Semantic Data Collection Changefiles

To add, update, or remove documents in a Semantic Data Collection, view the Semantic Data Collection in the developer section, and upload a Semantic Data Collection changefile. Changefiles are JSON documents in the following format:

{
  "metadata" : {
    "settings" : {
      "analysis": {
        "char_filter" : {
          "no_symbol" : {
            "type" : "mapping",
            "mappings" : [":=>", ".=>"]
          }
        },
        "analyzer" : {
          "lower_whitespace" : {
            "type" : "custom",
            "tokenizer": "whitespace",
            "filter" : ["lowercase"],
            "char_filter" : ["no_symbol"]
          }
        }
      }
    },
    "mapping" : {
      "_all" : {
        "enabled" : false
      },
      "properties" : {
        "name" : {
          "type" : "string",
          "index" : "not_analyzed"
        },
        "suggest" : {
          "type" : "completion",
          "analyzer" : "lower_whitespace"
        },
        "lastUpdated" : {
          "type" : "date"
        },
        ...
      }
    }
  },

  "updates" : [
    {
      "name" : "doc1",
      "suggest" : {
        "input" : "doc1",
        "weight" : 5
      },
      "lastUpdated" : "2015-09-15T14:05:56",
      ...
    },
    {
      "_id" : "id_to_insert",
      "name" : "doc2",
      "suggest" : {
        "input" : "doc2",
        "weight" : 3
      },
      "lastUpdated" : "2015-10-04T06:14:44",
      ...
    },
    ...
  ],

  "deletions" : [
     "id_to_delete1", "id_to_delete2", ...
  ]
}

The following sections describe the top-level properties of a changefile document:

Metadata

The metadata property of a changefile document holds metadata that describes the format of the documents and how to search them. The supported metadata properties are:

  • settings
  • mapping

Settings (for advanced users)

The settings property of the metadata object holds a JSON settings document that allows you to change how the query and document fields are analyzed for searching. The only property of the settings document that is supported by Solve for All is analysis. Within the analysis object, you can define analyzers and character filters with the following limitations:

  • All names of analyzers, character filters, and aliases must be 50 characters or less
  • At most 20 analyzers can be defined per Semantic Data Collection
  • At most 20 character filters can be defined per Semantic Data Collection
  • Stopwords must be 40 characters or less and there can be a maximum of 500 stopwords
  • For language analyzers (like english or spanish), stem exclusions must be 40 characters or less and there can be a maximum of 500 stem exclusions
  • For custom analyzers,
    • At most 8 filters can be used
    • At most 8 char filters can be used
    • position_increment_gap cannot be set manually
  • The pattern analyzer type is not available
  • The pattern_replace character filter is not available
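
Some of these limits are easy to check locally before uploading. The sketch below is a hypothetical pre-flight check (settings_problems is not part of any Solve for All tool) that covers the name length, analyzer and character filter counts, custom analyzer limits, and the unavailable pattern types. It does not check stopwords or stem exclusions, and it assumes the settings document uses the analyzer and char_filter keys shown in the changefile example.

```python
def settings_problems(settings):
    """Return a list of problems found in a changefile's settings document."""
    problems = []
    unsupported = set(settings) - {"analysis"}
    if unsupported:
        problems.append("unsupported settings: " + ", ".join(sorted(unsupported)))

    analysis = settings.get("analysis", {})
    analyzers = analysis.get("analyzer", {})
    char_filters = analysis.get("char_filter", {})

    if len(analyzers) > 20:
        problems.append("more than 20 analyzers")
    if len(char_filters) > 20:
        problems.append("more than 20 character filters")
    for name in list(analyzers) + list(char_filters):
        if len(name) > 50:
            problems.append("name longer than 50 characters: " + name)

    for name, spec in analyzers.items():
        kind = spec.get("type")
        if kind == "pattern":
            problems.append(name + ": the pattern analyzer type is not available")
        if kind == "custom":
            if len(spec.get("filter", [])) > 8:
                problems.append(name + ": more than 8 filters")
            if len(spec.get("char_filter", [])) > 8:
                problems.append(name + ": more than 8 char filters")
            if "position_increment_gap" in spec:
                problems.append(name + ": position_increment_gap cannot be set")

    for name, spec in char_filters.items():
        if spec.get("type") == "pattern_replace":
            problems.append(name + ": the pattern_replace character filter is not available")
    return problems
```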

Mapping

The mapping property of the metadata object holds a JSON mapping document specifying how to map field values in the search index. One useful property in a mapping is

{
  ...
  "_all" : {
     "enabled" : false
  },
  ...
}

which prevents creation of the implicit _all field, thus saving space if the search query does not use it. (The amount of space used for Semantic Data Collections is limited to 25 MB per user.) See the description of _all in the Elasticsearch documentation for more details. Again, if you disable _all, be sure to specify a query template, since the default template depends on _all being enabled.

To maintain search performance for all users, Solve for All limits mapping properties to a subset of what is allowed by Elasticsearch. The following field mappings are not allowed:

Also, the following properties of field mapping objects are not allowed:

  • compress
  • compress_threshold
  • doc_values
  • doc_values_format
  • dynamic_date_formats
  • dynamic_templates
  • fielddata
  • index_name
  • ignore_malformed
  • numeric_detection
  • postings_format
  • precision
  • precision_step
  • tree
  • tree_levels

Finally, there are a few more restrictions:

  • The attachment type is not allowed
  • The norms.loading property of field mappings must either not be present, or mapped to "lazy".
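
A mapping can be scanned for the disallowed properties listed above before it is uploaded. This is a minimal sketch with a hypothetical helper name, mapping_problems, and it only covers the restrictions spelled out in this section. Like the template check described earlier, it cannot tell a mapping property from a document field with the same name, so a field called precision or tree would be reported as well.

```python
DISALLOWED_PROPERTIES = frozenset([
    "compress", "compress_threshold", "doc_values", "doc_values_format",
    "dynamic_date_formats", "dynamic_templates", "fielddata", "index_name",
    "ignore_malformed", "numeric_detection", "postings_format", "precision",
    "precision_step", "tree", "tree_levels",
])


def mapping_problems(mapping):
    """Return a list of problems found in a changefile's mapping document."""
    problems = []

    def walk(node, path):
        if not isinstance(node, dict):
            return
        for key, value in node.items():
            here = path + "." + key if path else key
            if key in DISALLOWED_PROPERTIES:
                problems.append(here + ": property is not allowed")
            if key == "type" and value == "attachment":
                problems.append(here + ": the attachment type is not allowed")
            if key == "norms" and isinstance(value, dict):
                # norms.loading must be absent or set to "lazy".
                if value.get("loading", "lazy") != "lazy":
                    problems.append(here + '.loading: must be absent or "lazy"')
            walk(value, here)

    walk(mapping, "")
    return problems
```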

If you have a good reason to use one or more of the disallowed features above, let us know at feedback@solveforall.com, and we'll consider allowing the feature(s).

Updates

The updates property of a changefile document is mapped to an array of documents to be inserted into the index. Documents in the collection may have the following optional properties, which are removed from recognition results:

Name | Example Value | Description
_id | Walter White | The identifier of the document. If another document with the same identifier is processed in a later changefile, that document will be merged with the current one. This identifier can also be used to remove the document in the Deletions section.
recognitionKeys | ["com.solveforall.recognition.tv.breakingbad.Character"] | An array of recognition keys to map the document to. Each recognition key will be mapped to a copy of the document.

Deletions

The deletions property of a changefile document is mapped to an array of identifiers of documents to be deleted from the index. These identifiers come from the _id property of the documents. If a document did not have the _id property, the only way to delete it is to select the Remove existing data option when uploading a changefile, which removes all documents indiscriminately.
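
To make the effect of updates and deletions concrete, here is a toy in-memory model of applying a changefile. It is only an illustration and relies on several assumptions that this page does not confirm: merging is modeled as a shallow field-by-field overwrite, updates are applied before deletions, documents without _id receive a random identifier, and apply_changefile itself is a hypothetical name.

```python
import uuid


def apply_changefile(index, changefile, remove_existing=False):
    """Apply updates and deletions to a dict that maps document id to document."""
    if remove_existing:
        # Models the "Remove existing data" upload option.
        index.clear()
    for document in changefile.get("updates", []):
        fields = {k: v for k, v in document.items() if k != "_id"}
        # Assumption: documents without _id get an identifier you cannot predict.
        doc_id = document.get("_id") or uuid.uuid4().hex
        # Assumption: "merged" means new fields overwrite old ones, others are kept.
        index.setdefault(doc_id, {}).update(fields)
    for doc_id in changefile.get("deletions", []):
        # Unknown identifiers are ignored in this sketch.
        index.pop(doc_id, None)
    return index
```

The sketch also shows why giving every document an _id is worthwhile: documents stored under a generated identifier cannot be named in a later deletions array.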

Upload Formats

To conserve bandwidth, Solve for All requires that a Semantic Data Collection changefile be compressed in one of the following formats before it is uploaded:

Format | Extension | Notes
gzip | .gz | Very common compression algorithm, with compression ratios averaging 4x.
bzip2 | .bz2 | Less common and about 10x slower than gzip, with compression ratios averaging 5x.
ZIP | .zip | The archive must contain a single file, which is the Semantic Data Collection changefile. The worst compression ratio but the fastest compression speed.
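
If you generate changefiles with a script, the compression step can be folded into the write. The following sketch uses Python's standard gzip and json modules; write_changefile is a hypothetical helper, and the top-level layout simply follows the changefile format shown earlier.

```python
import gzip
import json


def write_changefile(path, metadata, updates, deletions=()):
    """Write a gzip-compressed Semantic Data Collection changefile to path."""
    changefile = {
        "metadata": metadata,
        "updates": list(updates),
        "deletions": list(deletions),
    }
    # "wt" opens the gzip stream in text mode so json.dump can write to it directly.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        json.dump(changefile, out)
    return path
```

For example, write_changefile("elements.json.gz", metadata, documents) produces a file with the .gz extension listed in the table above, where metadata and documents are values you have built elsewhere.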

Autocomplete

Solve for All computes a set of Semantic Data Collections that are used for autocompletion when a user is typing in an answer query input box. The eligible set contains all Semantic Data Collections associated with all Answer Generators that are activated by activation codes, activation keywords, or that have triggers set by the user.

With each Semantic Data Collection in this set, Solve for All performs a Completion Suggester request, assuming that the field suggest has been mapped as the completion type. To ensure your Semantic Data Collection can be used for autocompletion, make sure you define such a mapping, and populate each document with the suggest property.

Pro Tip: The suggested query will either be the value of the suggest property, if it is a string, or the value of suggest.output if the suggest property is an object. However, you can tell Solve for All to show different text describing the suggested query by enabling payloads in the mapping, and setting the suggest.payload property to an object that contains a desc. Solve for All will then use the value of suggest.payload.desc to display the suggestion, but when the suggestion is selected, the input box will be filled in with the normal suggestion value. See stock-symbol-populator for an example.
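
The rule in the tip can be written down as a small function. This is a sketch of the described behavior only, with a hypothetical name, suggestion_texts; it returns a pair of the text filled into the input box and the text displayed to the user, or None when the document has no suggest property.

```python
def suggestion_texts(document):
    """Return (fill_in_value, display_text) for a document's suggestion."""
    suggest = document.get("suggest")
    if suggest is None:
        return None
    if isinstance(suggest, str):
        # A plain string is both the suggested query and the displayed text.
        return suggest, suggest
    value = suggest.get("output")
    payload = suggest.get("payload") or {}
    # payload.desc, when present, changes only what is displayed.
    return value, payload.get("desc", value)
```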

Limitations

Since it costs money to store documents, Solve for All imposes the following limits on Semantic Data Collections, per user:

  • The total number of all documents in all Semantic Data Collections put together must not exceed 5,000,000
  • The total space of all indexed documents in all Semantic Data Collections put together must not exceed 200 MB

As noted in Metadata, it may be useful to disable the _all field in the search index to avoid the space penalty for it.

If you need additional storage capacity, please contact us to inquire about pricing. If the data you are storing is very useful to the general public, we may grant you the capacity you need free of charge.

Example Populators

Some examples of projects that create Semantic Data Collection Changefiles can be found in our GitHub repository. All projects are written in Ruby unless noted otherwise.

Next Steps

Each matching document is converted to a recognition result, and all recognition results are reduced into Combined Content Recognition Results, which then are available for Triggers and Answer Generators.

To get started creating a Semantic Data Collection quickly, check out the guide to creating Content Template Answer Generators.