cleanupMode can take the following values:

disabled: keep the original text. Useful if you need to head back to the source later.
inplace: convert newlines and tabs into whitespaces, but not stringified ones; don't trim.
inplace_full: convert newlines and tabs into whitespaces, including stringified ones; don't trim.
shrink: collapse all whitespaces, newlines and tabs into a single whitespace, but not stringified ones; do trim.
shrink_full: collapse all whitespaces, newlines and tabs into a single whitespace, stringified ones too; trim all.
each: convert newlines and tabs to one whitespace each.
each_full: convert newlines and tabs, stringified ones too, to one whitespace each.
delete_full: remove stringified newlines and tabs (replace with nothing).
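The modes above can be illustrated with a small Python sketch. This is a hypothetical, simplified re-implementation for illustration only, not Spark NLP's actual code; "stringified" newlines and tabs are taken to mean the literal two-character sequences `\n` and `\t` appearing in the text.

```python
import re

def cleanup(text: str, mode: str = "disabled") -> str:
    """Illustrative sketch of DocumentAssembler's cleanupMode options."""
    if mode == "disabled":
        # keep the original text untouched
        return text
    if mode == "inplace":
        # real newlines/tabs become spaces; no trimming
        return re.sub(r"[\n\t]", " ", text)
    if mode == "inplace_full":
        # also convert stringified "\n"/"\t"; no trimming
        return re.sub(r"\\n|\\t|[\n\t]", " ", text)
    if mode == "shrink":
        # collapse runs of whitespace/newlines/tabs to one space, then trim
        return re.sub(r"[ \n\t]+", " ", text).strip()
    if mode == "shrink_full":
        # as shrink, but stringified newlines/tabs join the runs too
        return re.sub(r"(?:\\n|\\t|[ \n\t])+", " ", text).strip()
    if mode == "each":
        # every newline/tab becomes exactly one space
        return re.sub(r"[\n\t]", " ", text)
    if mode == "each_full":
        # every newline/tab, stringified ones too, becomes one space
        return re.sub(r"\\n|\\t|[\n\t]", " ", text)
    if mode == "delete_full":
        # drop stringified newlines/tabs entirely
        return re.sub(r"\\n|\\t", "", text)
    raise ValueError(f"unknown cleanupMode: {mode}")
```

For example, `cleanup("a  \n b", "shrink")` collapses the whitespace run to a single space and trims, while `delete_full` removes literal `\n`/`\t` sequences without leaving a space behind.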
cleanupMode to pre-process text
Id column for row reference
Input text column for processing
Metadata for document column
Gets the annotation column name that will be generated
Output Annotator Type: DOCUMENT
Overrides the annotation column name when transforming
Requirement for pipeline transformation validation. It is called on fit()
Required uid for storing the annotator to disk
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Required input and expected output annotator types
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For the possible options, please refer to the parameters section. For more extended examples on document pre-processing, see the Spark NLP Workshop.
Example
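A minimal usage sketch in Python, assuming Spark NLP's Python API (the sparknlp package) and a Spark session with the spark-nlp package on its classpath; the Maven coordinate and version shown are assumptions and should be matched to your installation:

```python
from pyspark.sql import SparkSession
from sparknlp.base import DocumentAssembler

# Assumed setup: a local Spark session with the spark-nlp dependency
# (coordinate/version below are illustrative).
spark = (
    SparkSession.builder
    .appName("document-assembler-example")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")
    .getOrCreate()
)

# A one-row DataFrame with a raw text column named "text"
data = spark.createDataFrame(
    [["Spark NLP  is an open-source\ttext processing library."]]
).toDF("text")

# Read the "text" column, emit DOCUMENT annotations in "document",
# pre-processing the text with the shrink cleanup mode
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

result = document_assembler.transform(data)
result.select("document").show(truncate=False)
```

The resulting "document" column holds DOCUMENT-type annotations that downstream Spark NLP annotators consume as their input.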