Package com.basistech.rosette.dm
AnnotatedText
The root of the model is the AnnotatedText class.
KnownAttribute
The annotations are represented as objects that inherit from BaseAttribute.
The base attribute is the simplest attribute; all this class provides is a map of extended properties
that are used, as described below, as an extensibility mechanism.
Most attribute classes inherit from Attribute. This class adds a start offset and an end offset.
Thus, attributes that refer to the AnnotatedText as a whole inherit from BaseAttribute, while attributes
that refer to subsequences of text inherit from Attribute.
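The two-level hierarchy described above can be sketched as follows. This is a minimal illustration, not the library's source: SketchBaseAttribute and SketchAttribute are hypothetical stand-ins, and their constructors and accessor names are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for BaseAttribute: it carries only extended properties.
class SketchBaseAttribute {
    private final Map<String, Object> extendedProperties;

    SketchBaseAttribute(Map<String, Object> extendedProperties) {
        // Keep null as-is (logically empty); otherwise take a read-only copy.
        this.extendedProperties = extendedProperties == null
                ? null
                : Collections.unmodifiableMap(new HashMap<>(extendedProperties));
    }

    Map<String, Object> getExtendedProperties() {
        return extendedProperties;
    }
}

// Hypothetical stand-in for Attribute: adds a start offset and an end offset.
class SketchAttribute extends SketchBaseAttribute {
    private final int startOffset;
    private final int endOffset;

    SketchAttribute(int startOffset, int endOffset, Map<String, Object> extendedProperties) {
        super(extendedProperties);
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int getStartOffset() {
        return startOffset;
    }

    int getEndOffset() {
        return endOffset;
    }
}
```

In this sketch, an annotation about the whole text would extend SketchBaseAttribute directly, while one about a span of text would extend SketchAttribute.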
RawData
In some cases, applications of this data model may also need to represent initial raw data.
The RawData class supports that usage. RawData stores a ByteBuffer and a Map<String, List<String>> of
metadata. There is no connection in the code between AnnotatedText and RawData.
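As a rough illustration of the shape described above, the following sketch pairs a ByteBuffer with a Map<String, List<String>> of metadata. SketchRawData and its accessors are hypothetical names chosen for this example; the real RawData API may differ.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for RawData: raw bytes plus multi-valued metadata.
class SketchRawData {
    private final ByteBuffer data;
    private final Map<String, List<String>> metadata;

    SketchRawData(ByteBuffer data, Map<String, List<String>> metadata) {
        // A read-only view prevents writes through this object; it still
        // shares content with the buffer that was passed in.
        this.data = data.asReadOnlyBuffer();
        this.metadata = metadata;
    }

    ByteBuffer getData() {
        // Each caller gets a view with its own position and limit.
        return data.asReadOnlyBuffer();
    }

    Map<String, List<String>> getMetadata() {
        return metadata;
    }
}
```

A caller might wrap the bytes of a fetched document and record each header, such as a content type, as one key with a list of values.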
Immutability
All of the classes in this package are immutable. If a program needs to modify an object, it must construct a new one.
This 'functional' approach avoids any possibility of concurrent access problems. Creating a new AnnotatedText
over all the attributes of an old AnnotatedText plus a new set is not particularly costly compared to whatever
actual NLP task is producing the annotations.
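The copy-instead-of-modify approach can be sketched with a small immutable class. SketchText and its withLabel method are hypothetical and exist only for this illustration; they are not part of the package.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable holder: "modifying" it means building a new instance.
final class SketchText {
    private final String data;
    private final List<String> labels;

    SketchText(String data, List<String> labels) {
        this.data = data;
        // Defensive, read-only copy so later changes by the caller cannot leak in.
        this.labels = Collections.unmodifiableList(new ArrayList<>(labels));
    }

    String getData() {
        return data;
    }

    List<String> getLabels() {
        return labels;
    }

    // Returns a new object with the old labels plus one more; this one is unchanged.
    SketchText withLabel(String label) {
        List<String> copy = new ArrayList<>(labels);
        copy.add(label);
        return new SketchText(data, copy);
    }
}
```

Because no instance ever changes after construction, threads can share instances without synchronization.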
Serializable
The classes in this data model implement java.io.Serializable. Each class has a
serialVersionUID; the ID is derived from the version number of the library in which
a change was made. Serializable support was added in version 2.2.2, so all the classes started
with version 222.
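A minimal sketch of the numbering scheme, assuming a class whose serialized form was last changed in version 2.2.2. SketchLabel and its roundTrip helper are hypothetical names used only here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical serializable class; 222 encodes library version 2.2.2.
class SketchLabel implements Serializable {
    private static final long serialVersionUID = 222L;

    private final String label;

    SketchLabel(String label) {
        this.label = label;
    }

    String getLabel() {
        return label;
    }

    // Serializes to a byte array and reads the object back, for demonstration.
    static SketchLabel roundTrip(SketchLabel original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchLabel) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under this scheme, a later incompatible change made in, say, a hypothetical version 2.3.0 would presumably bump the constant to 230.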
Builders
Because these classes are immutable, every field must be supplied at construction time, so their constructors
take many arguments. Each class has a nested Builder class to avoid this inconvenience; the constructors are
thus not public.
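The nested-builder arrangement might look like the sketch below. SketchToken, its fields, and the builder method names are assumptions made for illustration and are not taken from the library.

```java
// Hypothetical immutable class with a non-public constructor and a nested Builder.
final class SketchToken {
    private final int startOffset;
    private final int endOffset;
    private final String text;
    private final String lemma;

    // Only the Builder calls this constructor.
    private SketchToken(Builder builder) {
        this.startOffset = builder.startOffset;
        this.endOffset = builder.endOffset;
        this.text = builder.text;
        this.lemma = builder.lemma;
    }

    int getStartOffset() { return startOffset; }
    int getEndOffset() { return endOffset; }
    String getText() { return text; }
    String getLemma() { return lemma; }

    static final class Builder {
        private final int startOffset;
        private final int endOffset;
        private final String text;
        private String lemma; // optional; stays null unless set

        Builder(int startOffset, int endOffset, String text) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.text = text;
        }

        // Optional fields get chainable setters instead of constructor arguments.
        Builder lemma(String lemma) {
            this.lemma = lemma;
            return this;
        }

        SketchToken build() {
            return new SketchToken(this);
        }
    }
}
```

A call such as `new SketchToken.Builder(0, 4, "runs").lemma("run").build()` names each optional value at the call site, and unset fields simply remain null.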
Extensibility Model
We could have designed this data model to defer all the binding until runtime -- essentially, a giant collection of maps and arrays. This would have allowed any program at any time to define a new annotation, and would have made it very difficult to encounter a version skew amongst libraries compiled to different versions of the model. Programming to that sort of data model is painful, so we chose to write specific classes for specific annotations.
To mitigate the possible unpleasant consequences resulting from version skew, this model includes an extensibility
mechanism. BaseAttribute contains a Map<String, Object>. This allows programs
that have differing sets of annotations to communicate via JSON. The JsonAnySetter and JsonAnyGetter
annotations cause any otherwise unrecognized items in the JSON object to be mapped to
entries in the map. Entries in the map are serialized as keys in the object. Thus, a program can read in a
serialized AnnotatedText that contains attributes with fields that it does not know about.
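The effect of this mechanism can be imitated by hand with plain maps. The sketch below does not use Jackson; it performs manually the routing that the JsonAnySetter and JsonAnyGetter annotations arrange automatically. SketchEntity, its single known field "type", and the method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical attribute with one known field and a catch-all map.
class SketchEntity {
    private final String type;                            // known to this version
    private final Map<String, Object> extendedProperties; // everything else

    private SketchEntity(String type, Map<String, Object> extendedProperties) {
        this.type = type;
        this.extendedProperties = extendedProperties;
    }

    String getType() { return type; }
    Map<String, Object> getExtendedProperties() { return extendedProperties; }

    // "Any setter" direction: unknown keys land in the map instead of failing.
    static SketchEntity fromJsonObject(Map<String, Object> json) {
        String type = null;
        Map<String, Object> extras = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : json.entrySet()) {
            if ("type".equals(entry.getKey())) {
                type = (String) entry.getValue();
            } else {
                extras.put(entry.getKey(), entry.getValue());
            }
        }
        return new SketchEntity(type, extras);
    }

    // "Any getter" direction: map entries are written back as top-level keys.
    Map<String, Object> toJsonObject() {
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("type", type);
        json.putAll(extendedProperties);
        return json;
    }
}
```

A field written by a newer producer therefore survives a read and rewrite by an older consumer, even though the older code has no named field for it.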
Serialization
All of the classes in this package support JSON serialization and deserialization via Jackson 2.4.x. However, they require some customization to get a correct and efficient representation. This customization is provided in a separate module: adm-json.
Null Values
Logically empty lists, sets, and maps are usually represented by null instead of by actual empty collections.
The fields of any attributes may be null, unless documented otherwise for a specific field.
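Client code therefore needs to tolerate null wherever it reads a collection. One possible guard is sketched below; NullSafe and orEmpty are hypothetical helper names, not part of this package.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper for callers that prefer empty collections over null.
final class NullSafe {
    private NullSafe() {
    }

    // Returns the list itself, or an immutable empty list when it is null.
    static <T> List<T> orEmpty(List<T> list) {
        return list == null ? Collections.<T>emptyList() : list;
    }
}
```

Wrapping a getter's result in `NullSafe.orEmpty(...)` lets a for-each loop run without an explicit null check.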
Class Description
This abstract class provides the canonical mapping from annotating with string input to annotating with AnnotatedText input.
The root of the data model.
Builder class for AnnotatedText objects.
An Annotator annotates text with attributes.
Arabic morphological analysis.
Builder class for ArabicMorphoAnalysis.
Base class for attributes that span a range of text.
Base class for builders for attributes that inherit from Attribute.
Base class for attributes that annotate text.
Base class for builders for the subclasses of BaseAttribute.
A base noun phrase.
Builder for base noun phrase attributes.
Associates a label with a document.
A builder for classifier results.
A reference to a high-level "concept" of a document.
A builder for concepts.
An inter-token dependency from a parser.
Builder for Dependency.
A vector of embeddings for some vector of items in an AnnotatedText.
Builder class for EmbeddingsCollection.
Embeddings for a text.
The embedding name.
A reference to a "real world" entity.
A builder for resolved entities.
Deprecated. Replaced by Mention.
A builder for entity mentions.
Evidence for a relationship mention component, pointing to the exact span in the raw text that implies the existence of this component. The offsets refer to a half-open range of characters (UTF-16 elements). Note that Extents have no properties of their own.
Builder for Extent attributes.
Morphological analysis objects for Chinese and Japanese.
A builder for HanMorphoAnalysis.
A reference to a "keyphrase" of a document.
A builder for keyphrases.
Morphological analysis objects for Korean.
A builder for KoreanMorphoAnalysis.
The results of running language detection on a region of text.
A builder for language detection results.
A single result from language detection.
Builder for detection results.
Layout defines text as spans defined by structured or unstructured regions.
Builder for layout regions.
Layout types.
ListAttribute<Item extends BaseAttribute>: A container for an ordered collection of attributes of a type.
ListAttribute.Builder<Item extends BaseAttribute>: A builder for lists.
MapAttribute<K,V extends BaseAttribute>: A container for a keyed collection of attributes of a type.
MapAttribute.Builder<K,V extends BaseAttribute>
A mention of an entity in the text.
A builder for entity mentions.
A MorphoAnalysis contains all the results of analyzing a word, or something like a word.
Builder for MorphoAnalysis.
A name of something in the world.
Builder for Name.
A container for incoming raw data (bytes).
A Relationship Component: a building block of a relationship mention, such as an argument, predicate or adjunct.
A Relationship Mention describes arguments in a sentence and a predicate that connects them.
Deprecated. Replaced by Entity.
A builder for resolved entities.
A script region.
Builder for script regions.
A Sentence.
Builder for Sentence attributes.
A term with some semantic similarity to an AnnotatedText.
Builder class for SimilarTerm.
Enumeration for part of speech tag sets used in Basis products.
The token.
Builder for tokens.
A translation of the text.
Builder class for TranslatedData.
A list of translations for the tokens.
Builder class for TranslatedTokens.
Builder for immutable TransliterationResults.
Class used for future-proof representation of attributes in JSON that we don't have classes for.