Package com.basistech.rosette.dm


Rosette Data Model. The classes in this package define a data model for annotations over text: a Java (and JSON) representation of a piece of text together with the annotations on that text.

AnnotatedText

The root of the model is the AnnotatedText class.

KnownAttribute

Annotations are represented as objects that inherit from BaseAttribute. BaseAttribute is the simplest attribute: all it provides is a map of extended properties, which serves as the extensibility mechanism described below.

Most attribute classes inherit from Attribute. This class adds a start offset and an end offset. Thus, attributes that refer to the AnnotatedText as a whole inherit from BaseAttribute, while attributes that refer to subsequences of text inherit from Attribute.
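The split between the two base classes can be pictured with a minimal sketch. The class names SketchBaseAttribute and SketchAttribute, the coveredText helper, and the choice of an exclusive end offset are assumptions made for illustration; they are not the library's actual declarations.

```java
import java.util.Map;

// Hypothetical stand-in for BaseAttribute: it carries only extended properties,
// so it suits annotations that describe the text as a whole.
class SketchBaseAttribute {
    private final Map<String, Object> extendedProperties;

    SketchBaseAttribute(Map<String, Object> extendedProperties) {
        this.extendedProperties = extendedProperties; // may be null
    }

    Map<String, Object> getExtendedProperties() {
        return extendedProperties;
    }
}

// Hypothetical stand-in for Attribute: it adds a start and an end offset,
// so it suits annotations that refer to a subsequence of the text.
class SketchAttribute extends SketchBaseAttribute {
    private final int startOffset;
    private final int endOffset;

    SketchAttribute(int startOffset, int endOffset, Map<String, Object> extendedProperties) {
        super(extendedProperties);
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int getStartOffset() {
        return startOffset;
    }

    int getEndOffset() {
        return endOffset;
    }

    // Assumption for this sketch: the end offset is exclusive.
    CharSequence coveredText(CharSequence text) {
        return text.subSequence(startOffset, endOffset);
    }
}
```

An annotation for a single token would extend the offset-bearing class, while a document-level annotation (for example, a detected language for the whole text) would need only the base class.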

RawData

In some cases, applications of this data model may also need to represent initial raw data. The RawData class supports that usage. RawData stores a ByteBuffer and a Map<String, List<String>> of metadata. There is no connection in the code between AnnotatedText and RawData.
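As a rough picture of that shape, the following sketch pairs a byte buffer with multi-valued metadata. SketchRawData and its accessors are hypothetical names, and exposing read-only buffer views is a design choice of this sketch, not a documented behavior of RawData.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for RawData: undecoded bytes plus multi-valued metadata.
final class SketchRawData {
    private final ByteBuffer data;
    private final Map<String, List<String>> metadata;

    SketchRawData(ByteBuffer data, Map<String, List<String>> metadata) {
        // Keep a read-only view so holders of this object cannot write through it.
        this.data = data.asReadOnlyBuffer();
        this.metadata = metadata;
    }

    // Each call returns a fresh read-only view with its own position and limit,
    // so one reader consuming the buffer does not disturb another.
    ByteBuffer getData() {
        return data.asReadOnlyBuffer();
    }

    Map<String, List<String>> getMetadata() {
        return metadata;
    }
}
```

A caller might wrap the bytes of a fetched document with ByteBuffer.wrap and record something like a content type in the metadata map; decoding those bytes into text for an AnnotatedText remains a separate step.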

Immutability

All of the classes in this package are immutable. A program that needs to modify the data must construct new objects. This 'functional' approach avoids any possibility of concurrent access problems. Creating a new AnnotatedText over all the attributes of an old AnnotatedText plus a new set is not particularly costly compared to whatever actual NLP task is producing the annotations.

Serializable

The classes in this data model implement java.io.Serializable. Each class has a serialVersionUID; the ID is derived from the version number of the library in which a change was made. Serializable support was added in version 2.2.2, so all the classes started with version 222.
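The versioning convention can be sketched as follows. Java serialization looks for a field named serialVersionUID; the value 222L follows the convention described above, while SketchSpan and SerializationSketch are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical serializable value class, not one of the package's own classes.
final class SketchSpan implements Serializable {
    // Derived from library version 2.2.2, following the convention in the text.
    private static final long serialVersionUID = 222L;

    private final int startOffset;
    private final int endOffset;

    SketchSpan(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int getStartOffset() {
        return startOffset;
    }

    int getEndOffset() {
        return endOffset;
    }
}

final class SerializationSketch {
    private SerializationSketch() { }

    // Serializes a value to bytes in memory and reads it back as a new object.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If a later release changed the serialized form of such a class, the convention would call for updating the ID to match that release's version number, so that streams written by the old form are rejected rather than misread.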

Builders

Because these classes are immutable, their constructors take many arguments. Each class has a nested Builder class to avoid this inconvenience; the constructors are thus not public.
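A minimal sketch of the pattern, assuming a hypothetical SketchText class with a data field and a token list (neither the class nor its method names are taken from the library):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable class with a nested Builder.
final class SketchText {
    private final CharSequence data;
    private final List<String> tokens;

    // Not public: callers are expected to go through the Builder.
    private SketchText(CharSequence data, List<String> tokens) {
        this.data = data;
        // Defensive copy wrapped as unmodifiable, so the built object cannot change.
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
    }

    CharSequence getData() {
        return data;
    }

    List<String> getTokens() {
        return tokens;
    }

    static final class Builder {
        private CharSequence data;
        private final List<String> tokens = new ArrayList<>();

        Builder() { }

        // Starts from an existing object: "modification" means building a new one.
        Builder(SketchText toCopy) {
            this.data = toCopy.data;
            this.tokens.addAll(toCopy.tokens);
        }

        Builder data(CharSequence data) {
            this.data = data;
            return this;
        }

        Builder addToken(String token) {
            tokens.add(token);
            return this;
        }

        SketchText build() {
            return new SketchText(data, tokens);
        }
    }
}
```

In this sketch, adding an annotation to an existing object means creating a Builder from it, adding the new item, and calling build(); the original object is left untouched and can still be shared safely between threads.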

Extensibility Model

We could have designed this data model to defer all the binding until runtime: essentially, a giant collection of maps and arrays. This would have allowed any program at any time to define a new annotation, and would have made it very difficult to encounter a version skew amongst libraries compiled against different versions of the model. However, programming to that sort of data model is painful, so we chose to write specific classes for specific annotations.

To mitigate the possible unpleasant consequences resulting from version skew, this model includes an extensibility mechanism. BaseAttribute contains a Map<String, Object>. This allows programs that have differing sets of annotations to communicate via JSON. Jackson's JsonAnySetter and JsonAnyGetter annotations cause any items in the JSON object that do not match a declared field to be stored as entries in the map, and entries in the map to be serialized as keys in the object. Thus, a program can read in a serialized AnnotatedText that contains attributes with fields that it does not know about.
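The effect of this mechanism can be imitated in plain Java without Jackson. In the sketch below, an already-parsed JSON object is modeled as a Map; SketchEntity, its single declared field "type", and the fromMap and toMap methods are hypothetical and stand in for the roles the two annotations play.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical attribute with one declared field and a map for everything else.
final class SketchEntity {
    private final String type;
    private final Map<String, Object> extendedProperties;

    private SketchEntity(String type, Map<String, Object> extendedProperties) {
        this.type = type;
        this.extendedProperties = extendedProperties;
    }

    String getType() {
        return type;
    }

    Map<String, Object> getExtendedProperties() {
        return extendedProperties;
    }

    // Plays the role of the any-setter: keys this version does not know about
    // are kept in the extended-properties map instead of being dropped.
    static SketchEntity fromMap(Map<String, Object> json) {
        String type = null;
        Map<String, Object> extended = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : json.entrySet()) {
            if ("type".equals(entry.getKey())) {
                type = (String) entry.getValue();
            } else {
                extended.put(entry.getKey(), entry.getValue());
            }
        }
        // Logically empty maps are represented by null in this sketch.
        return new SketchEntity(type, extended.isEmpty() ? null : extended);
    }

    // Plays the role of the any-getter: map entries are written back out
    // as ordinary keys alongside the declared field.
    Map<String, Object> toMap() {
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("type", type);
        if (extendedProperties != null) {
            json.putAll(extendedProperties);
        }
        return json;
    }
}
```

A program built against an older model can therefore read an object carrying a newer field, such as a confidence score, and write it out again without losing that field.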

Serialization

All of the classes in this package support JSON serialization and deserialization via Jackson 2.4.x. However, they require some customization to get a correct and efficient representation. This customization is provided in a separate module: adm-json.

Null Values

Logically empty lists, sets, and maps are usually represented by null instead of by actual empty collections. The fields of any attributes may be null, unless documented otherwise for a specific field.
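Callers therefore need to guard against null before iterating. One way to do that, sketched here with a hypothetical NullSafe helper that is not part of this package, is to normalize null to an empty collection at the point of use:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for consumers of the model.
final class NullSafe {
    private NullSafe() { }

    // Returns the list itself, or an immutable empty list when it is null.
    static <T> List<T> orEmpty(List<T> list) {
        return list == null ? Collections.<T>emptyList() : list;
    }

    // Returns the map itself, or an immutable empty map when it is null.
    static <K, V> Map<K, V> orEmpty(Map<K, V> map) {
        return map == null ? Collections.<K, V>emptyMap() : map;
    }
}
```

With such a helper, a loop like for (String token : NullSafe.orEmpty(tokens)) runs zero times for a null list instead of throwing a NullPointerException.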