Semantic Data Collections
Overview
A Semantic Data Collection is a searchable collection of JSON documents, along with metadata that determines how the documents are searched for a given query. You can upload a Semantic Data Collection changefile to add documents to, and remove documents from, a Semantic Data Collection. The documents are stored in an Elasticsearch 2.1 server, which has a very expressive query language. Once a document is found, it becomes a recognition result mapped to explicit recognition keys in the document, or to a default recognition key specified in the Semantic Data Collection.
Since the documents stored in a Semantic Data Collection have explicit fields with semantic information, very informative Inline Answers can be created from their associated recognition results.
If you make your Semantic Data Collection public, all Solve for All users will gain the ability to search your data.
Semantic Data Collections can also provide query suggestions while the user is typing in the answer query input field.
To get started creating a Semantic Data Collection quickly, check out the guide to creating Content Template Answer Generators.
How they work
Answer Generators and Triggers may be associated with one or more required Semantic Data Collections. If a Semantic Data Collection is required by any Answer Generator or Trigger that the user has installed, the Semantic Data Collection will be searched. The Semantic Data Collection may specify a template for a query definition written in the Query DSL, or else a default query template will be used. The input query is substituted into the template to form the final query definition.
The results of searching a Semantic Data Collection are then transformed into recognition results by mapping the recognition keys in the matched document(s) to the document itself. If no recognition keys are specified in a matched document, the default recognition key of the Semantic Data Collection is used. The recognition results are also enhanced with the recognition level which is derived from the score of the document.
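As an illustration, the mapping from matched documents to recognition results might be sketched like this in Python. The hit shape mirrors Elasticsearch search responses; the function name and the score-to-level formula are assumptions for illustration, not Solve for All's actual implementation.

```python
# Hypothetical sketch: turn search hits into recognition results.
# The score-to-level formula below is an illustrative assumption.
def to_recognition_results(hits, default_key, boost_factor=1.0):
    """Map Elasticsearch-style hits to {recognition_key: [document, ...]}."""
    results = {}
    for hit in hits:
        doc = dict(hit["_source"])
        # Use the document's own recognition keys, or fall back to the
        # collection's default; the property is removed from the result.
        keys = doc.pop("recognitionKeys", None) or [default_key]
        # Derive the recognition level from the match score (assumed formula).
        doc["recognitionLevel"] = min(1.0, hit["_score"] * boost_factor)
        for key in keys:
            results.setdefault(key, []).append(doc)
    return results
```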
Properties
In addition to the common plugin properties, Semantic Data Collections have the following properties:
Name | Example Value | Description |
---|---|---|
Query Template | { "multi_match" : { "query": "{{query}}", "fields": ["name", "symbol"] } } | Used to search the Semantic Data Collection. See the documentation for Query Templates below. |
Boost Factor | 2.0 | A factor to multiply the recognition level by. Use this to ensure good matches have a recognition level near 1.0. For example, if you find that the recognition level of good matches is around 0.5 (using ??debug in your search query to inspect the recognition results), you could set the boost to 2.0. This boost is applied before the Removed Query Term Boosts. Defaults to 1.0 if not set. |
Removed Query Terms | chem,chemical,element | A comma-separated list of lower-cased terms to be removed from the answer query before it is passed to the query template. |
Removed Query Term Boosts | 0.1,0.2 | A comma-separated list of boosts corresponding to the removed query terms. If a removed query term is found in the query, its boost is added to the output recognition level. If a found term has no corresponding boost, the last boost specified before its position is used. If multiple removed query terms are found, the maximum of the corresponding boosts is used. Each boost should be between -1.0 and 1.0. Removed Query Term Boosts are applied after the multiplicative Boost Factor. If no boosts are specified, the recognition level is not changed even if removed query terms are found. |
Default Recognition Key | com.solveforall.recognition.science.chemistry.ChemicalElement | The recognition key a matched document is mapped to if it does not contain the recognitionKeys property. |
Name Field(s) | name | The name of the field in each document that holds a string that should be similar to the answer query. If found, it is compared to the answer query to adjust the recognition level of each matching document. You can set multiple fields by separating them with commas; if you do, the closest match is used. Defaults to name if not set. |
Suggestion Icon URL | https://www.data.com/favicon.ico | If a completion suggestion matches a document, the URL of the icon used to decorate the suggestion. |
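The interaction between the Boost Factor and the Removed Query Term Boosts can be sketched in Python as follows. This is an illustrative model of the rules described above; the function name and argument shapes are hypothetical, and any clamping of the final level is left unspecified here.

```python
# Hypothetical sketch of how the boosts combine; not Solve for All's code.
def adjusted_recognition_level(score, query_terms, removed_terms,
                               term_boosts, boost_factor=1.0):
    """Apply the multiplicative Boost Factor, then the Removed Query
    Term Boosts, following the rules in the properties table."""
    level = score * boost_factor  # multiplicative Boost Factor first
    if not term_boosts:
        # With no boosts specified, the level is unchanged even if
        # removed query terms are found.
        return level
    found = []
    for i, term in enumerate(removed_terms):
        if term in query_terms:
            # A term past the end of the boost list reuses the last
            # boost specified before its position.
            found.append(term_boosts[min(i, len(term_boosts) - 1)])
    if found:
        # Additive, applied after the multiplicative boost; with multiple
        # matching terms, the maximum of the corresponding boosts is used.
        level += max(found)
    return level
```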
Query Templates
Query templates, written in the Mustache templating language, contain placeholders for the query string and are used to form query objects in the Elasticsearch Query DSL. See the documentation for template queries for more details.
Search Template Parameters
The following parameters are passed to query templates:
Name | Example Value | Description |
---|---|---|
query | UC Berkeley | The original query, stripped of leading and trailing whitespace, activation codes, and removed terms |
englishLowerCasedQuery | uc berkeley | The value of query, but lower-cased in English |
localLowerCasedQuery | uc berkeley | The value of query, but lower-cased in the user's local language |
Default Template
If the query template is not specified, Solve for All uses the following template by default:

{ "common" : { "_all" : { "query" : "{{query}}" } } }

Note that this requires the _all field to remain enabled. The query template is only used for completely entered queries.
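To make the substitution concrete, here is a minimal Python sketch of rendering the default template with an answer query. Real Mustache rendering supports far more than this; only simple {{name}} placeholders from the parameters table are handled, and values are JSON-escaped so the result stays valid JSON.

```python
import json

# Minimal, illustrative substitution of template parameters; not a full
# Mustache implementation.
def render_template(template, params):
    rendered = template
    for name, value in params.items():
        # JSON-escape the value (dropping the surrounding quotes) so the
        # rendered template remains valid JSON.
        rendered = rendered.replace("{{%s}}" % name, json.dumps(value)[1:-1])
    return rendered

# The default template from above, rendered with an example answer query.
template = '{ "common" : { "_all" : { "query" : "{{query}}" } } }'
body = render_template(template, {"query": "UC Berkeley"})
```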
Limitations
To maintain the performance of queries, Solve for All imposes some limitations on query templates. The following properties are not allowed anywhere in a template:
- fuzziness
- has_child
- has_parent
- indices
- match_all
- max_expansions
- mlt
- more_like_this
- more_like_this_field
- query_string
- regexp
- rewrite
- script
- script_score
- simple_query_string
- template
- top_children
- wildcard
Solve for All is not yet sophisticated enough to determine if these property names are used to identify search methods or as references to fields in documents, so it is best to avoid using these property names in your documents.
Two additional limitations apply:

- The size of a query template may not exceed 2000 bytes
- No property name may contain a variable placeholder. For example, { "match_{{query}}" : { "term" : { "name" : "Jeff" } } } would not be allowed.
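A client-side pre-check for these limitations could look like the following Python sketch. The disallowed property names are copied from the list above; the validator itself is hypothetical and not part of Solve for All.

```python
import json

# Disallowed property names, copied from the limitations list above.
DISALLOWED = {
    "fuzziness", "has_child", "has_parent", "indices", "match_all",
    "max_expansions", "mlt", "more_like_this", "more_like_this_field",
    "query_string", "regexp", "rewrite", "script", "script_score",
    "simple_query_string", "template", "top_children", "wildcard",
}

def validate_template(template):
    """Return a list of problems found in a query template string."""
    problems = []
    if len(template.encode("utf-8")) > 2000:
        problems.append("template exceeds 2000 bytes")

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in DISALLOWED:
                    problems.append("disallowed property: " + key)
                if "{{" in key:
                    problems.append("placeholder in property name: " + key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    # The {{query}} placeholder sits inside a JSON string value, so the
    # raw template still parses as JSON.
    walk(json.loads(template))
    return problems
```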
If you have a good reason to use one or more of the disallowed features above, let us know at feedback@solveforall.com, and we'll consider allowing the feature(s).
Example Flow
As an example, consider the Python Standard Library Semantic Data Collection. Its query template is

{ "multi_match" : { "query": "{{query}}", "fields": [ "name^2", "simpleName" ] } }

and its default recognition key is com.solveforall.recognition.programming.python.StandardLibrary.

Here is an example document in the collection:

{
  "_id" : "urllib.parse.urlencode",
  "name" : "urllib.parse.urlencode",
  "simpleName" : "urlencode",
  "params" : [ "query", "doseq=False", "safe=''", "encoding=None", "errors=None" ],
  "kind" : "function",
  "path" : "urllib.parse.html#urllib.parse.urlencode",
  "summaryHtml" : "<p>Convert a mapping ...</p>"
}
Now, assume the user has the Python Standard Library Answer Generator installed and enters the answer query urlencode. Here are the processing steps:

- Since the Python Standard Library Answer Generator declares that the Python Standard Library Semantic Data Collection is required, the collection is searched with the above query template. With the answer query urlencode substituted, the template evaluates to:

  { "multi_match" : { "query": "urlencode", "fields": ["name^2", "simpleName"] } }

- This matches the example document above, and since it does not contain the recognitionKeys property, the default recognition key of the Semantic Data Collection, com.solveforall.recognition.programming.python.StandardLibrary, is used. This recognition result will be output:

  {
    "com.solveforall.recognition.programming.python.StandardLibrary" : [
      {
        "name" : "urllib.parse.urlencode",
        "simpleName" : "urlencode",
        "params" : [ "query", "doseq=False", "safe=''", "encoding=None", "errors=None" ],
        "kind" : "function",
        "path" : "urllib.parse.html#urllib.parse.urlencode",
        "summaryHtml" : "<p>Convert a mapping ...</p>",
        "recognitionLevel" : 0.7833126
      },
      ...
    ]
  }

  where recognitionLevel is derived from the matching score for the document.

- The Python Standard Library Answer Generator can now easily take this information to create an inline answer.
Semantic Data Collection Changefiles
To add, update, or remove documents in a Semantic Data Collection, view the Semantic Data Collection in the developer section and upload a Semantic Data Collection changefile. Changefiles are JSON documents in the following format:
{
  "metadata" : {
    "settings" : {
      "analysis": {
        "char_filter" : {
          "no_symbol" : {
            "type" : "mapping",
            "mappings" : [":=>", ".=>"]
          }
        },
        "analyzer" : {
          "lower_whitespace" : {
            "type" : "custom",
            "tokenizer": "whitespace",
            "filter" : ["lowercase"],
            "char_filter" : ["no_symbol"]
          }
        }
      }
    },
    "mapping" : {
      "_all" : { "enabled" : false },
      "properties" : {
        "name" : { "type" : "string", "index" : "not_analyzed" },
        "suggest" : { "type" : "string", "analyzer" : "lower_whitespace" },
        "lastUpdated" : { "type" : "date" },
        ...
      }
    }
  },
  "updates" : [
    {
      "name" : "doc1",
      "suggest" : { "input" : "doc1", "weight" : 5 },
      "lastUpdated" : "2015-09-15T14:05:56",
      ...
    },
    {
      "_id" : "id_to_insert",
      "name" : "doc2",
      "suggest" : { "input" : "doc2", "weight" : 3 },
      "lastUpdated" : "2015-10-04T06:14:44",
      ...
    },
    ...
  ],
  "deletions" : [ "id_to_delete1", "id_to_delete2", ... ]
}
The following sections describe the top-level properties of a changefile document:
Metadata
The metadata property of a changefile document is mapped to metadata describing the format of the documents and how to search them. The supported metadata properties are:

- settings
- mapping
Settings (For Advanced Users)
The settings property of the metadata object has as its value a JSON settings document that lets you change how the query and document fields are analyzed for searching. The only property of the settings document supported by Solve for All is analysis. Within the analysis object, you can define analyzers and character filters with the following limitations:
- All names of analyzers, character filters, and aliases must be 50 characters or less
- At most 20 analyzers can be defined per Semantic Data Collection
- At most 20 character filters can be defined per Semantic Data Collection
- Stopwords must be 40 characters or less and there can be a maximum of 500 stopwords
- For language analyzers (like english or spanish), stem exclusions must be 40 characters or less and there can be a maximum of 500 stem exclusions
- For custom analyzers:
  - At most 8 filters can be used
  - At most 8 char filters can be used
  - position_increment_gap cannot be set manually
- The pattern analyzer type is not available
- The pattern_replace character filter is not available
Mapping
The mapping property of the metadata object has as its value a JSON mapping document specifying how to map field values in the search index. One useful property in a mapping is

{ ... "_all" : { "enabled" : false }, ... }

which prevents creation of the implicit _all field, thus saving space if the search query does not use it. (The amount of space used for Semantic Data Collections is limited to 25 MB per user.) See the description of _all in the Elasticsearch documentation for more details. Again, if you disable _all, be sure to specify a query template, since the default template depends on _all being enabled.
To maintain search performance for all users, Solve for All limits mappings to a subset of what is allowed by Elasticsearch. The following properties of field mapping objects are not allowed:
- compress
- compress_threshold
- doc_values
- doc_values_format
- dynamic_date_formats
- dynamic_templates
- fielddata
- index_name
- ignore_malformed
- numeric_detection
- postings_format
- precision
- precision_step
- tree
- tree_levels
Finally, there are a few more restrictions:
- The attachment type is not allowed
- The norms.loading property of field mappings must either be absent or set to "lazy"
If you have a good reason to use one or more of the disallowed features above, let us know at feedback@solveforall.com, and we'll consider allowing the feature(s).
Updates
The updates property of a changefile document is mapped to an array of documents to be inserted into the index. Documents in the collection may have the following optional properties, which are removed from recognition results:
Name | Example Value | Description |
---|---|---|
_id | Walter White | The identifier of the document. If another document with the same identifier is processed in a later changefile, that document will be merged with the current one. This identifier can also be used to remove the document in the Deletions section. |
recognitionKeys | ["com.solveforall.recognition.tv.breakingbad.Character"] | An array of recognition keys to map the document to. Each recognition key will be mapped to a copy of the document. |
Deletions
The deletions property of a changefile document is mapped to an array of identifiers of documents to be deleted from the index. These identifiers come from the _id property of the documents. If a document does not have the _id property, the only way to delete it is to select the Remove existing data option when uploading a changefile, which removes all documents indiscriminately.
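Putting the three top-level properties together, a changefile could be assembled like this in Python. All document contents and identifiers here are made-up examples.

```python
import json

# Illustrative changefile with metadata, updates, and deletions; every
# document value below is an invented example.
changefile = {
    "metadata": {
        "mapping": {
            "_all": {"enabled": False},
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"},
            },
        }
    },
    "updates": [
        {
            "_id": "walter.white",  # optional stable identifier
            "name": "Walter White",
            "recognitionKeys": [
                "com.solveforall.recognition.tv.breakingbad.Character"
            ],
        }
    ],
    "deletions": ["id_to_delete1", "id_to_delete2"],
}

payload = json.dumps(changefile)  # serialized changefile, ready to compress
```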
Upload Formats
To conserve bandwidth, Solve for All requires that a Semantic Data Collection changefile be compressed in one of the following formats before it is uploaded:
Format | Extension | Notes |
---|---|---|
gzip | .gz | Very common compression algorithm, compression ratios averaging 4x. |
bzip2 | .bz2 | Less common, about 10x slower than GZIP, compression ratios averaging 5x. |
ZIP | .zip | Archive must contain a single file, which is the Semantic Data Collection changefile. The worst compression ratio but fastest compression speed. |
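For example, a changefile can be gzip-compressed in Python before uploading (the file contents are illustrative):

```python
import gzip
import json

# A tiny changefile, serialized and gzip-compressed for upload.
changefile = {"updates": [{"name": "doc1"}], "deletions": []}
raw = json.dumps(changefile).encode("utf-8")

compressed = gzip.compress(raw)  # save as e.g. changefile.json.gz and upload

# Sanity check: the compressed bytes round-trip back to the original document.
assert json.loads(gzip.decompress(compressed)) == changefile
```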
Autocomplete
Solve for All computes a set of Semantic Data Collections that are used for autocompletion when a user is typing in an answer query input box. The eligible set contains all Semantic Data Collections associated with all Answer Generators that are activated by activation codes, activation keywords, or that have triggers set by the user.
With each Semantic Data Collection in this set, Solve for All performs a Completion Suggester request, assuming that the suggest field has been mapped as the completion type. To ensure your Semantic Data Collection can be used for autocompletion, make sure you define such a mapping and populate each document with the suggest property.
Pro Tip: The suggested query will be either the value of the suggest property, if it is a string, or the value of suggest.output if the suggest property is an object. However, you can tell Solve for All to show different text describing the suggested query by enabling payloads in the mapping and setting the suggest.payload property to an object that contains a desc property. Solve for All will then use the value of suggest.payload.desc to display the suggestion, but when the suggestion is selected, the input box will be filled in with the normal suggestion value. See stock-symbol-populator for an example.
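For instance, a document using these conventions might look like the following, expressed here as a Python literal. All values are made up, and the payload is only honored if payloads are enabled in the completion mapping.

```python
# Hypothetical document illustrating the suggest conventions described above.
doc = {
    "name": "International Business Machines",
    "suggest": {
        "input": ["IBM", "International Business Machines"],  # matched while typing
        "output": "IBM",    # fills the input box when the suggestion is selected
        "weight": 10,       # ranks this suggestion relative to others
        "payload": {
            # Displayed instead of the output text, per the Pro Tip above.
            "desc": "IBM (International Business Machines Corp.)"
        },
    },
}
```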
Limitations
Since it costs money to store documents, Solve for All imposes the following limits on Semantic Data Collections, per user:
- The total number of all documents in all Semantic Data Collections put together must not exceed 5,000,000
- The total space of all indexed documents in all Semantic Data Collections put together must not exceed 200 MB
As noted in Metadata, it may be useful to disable the _all field in the search index to avoid the space penalty for it.
If you need additional storage capacity, please contact us to inquire about pricing. If the data you are storing is very useful to the general public, we may grant you the capacity you need free of charge.
Example Populators
Some examples of projects that create Semantic Data Collection Changefiles can be found in our GitHub repository. All projects are written in Ruby unless noted otherwise.
- wordnet-populator uses the WordNet database of English words
- stock-symbol-populator uses stock information downloaded from NASDAQ
- http-status-code-populator uses HTTP status code specifications from W3C. Written in Python by a non-Python programmer.
- mdn-html-populator: uses HTML element documentation from the Mozilla Developer Network
- css-properties-populator: uses documentation from the Mozilla Developer Network and WebPlatform.org for CSS properties, functions, at-rules, pseudo-classes, pseudo-elements, and data types
- mdn-javascript-doc-populator: uses documentation of the Javascript standard built-in objects from the Mozilla Developer Network
- npm-populator: uses the repository of Node Package Modules (npm)
- bower-populator: uses the Bower repository of packages for web development
- javadoc-populator: uses Javadocs for the Java SDK
- python-library-populator: uses the official Python documentation for classes and functions in the standard library. Ironically, the code is written in Ruby.
- underscore-docs-populator: uses the documentation of Underscore.js functions
- font-awesome-populator: lists the icons available in Font Awesome
- chemical-elements-populator: uses properties of the chemical elements. It's a modification of an existing Python script we shamelessly stole from somewhere.
Next Steps
Each matching document is converted to a recognition result, and all recognition results are reduced into Combined Content Recognition Results, which then are available for Triggers and Answer Generators.
To get started creating a Semantic Data Collection quickly, check out the guide to creating Content Template Answer Generators.