Package

com.johnsnowlabs.nlp.annotators

er

package er

Type Members

  1. case class EntityPattern(label: String, patterns: Seq[String], id: Option[String] = None) extends Product with Serializable

  2. class EntityRulerApproach extends AnnotatorApproach[EntityRulerModel] with HasStorage

    Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.

    The extraction resource can be a "JSON", "JSONL" or "CSV" file. The path to the file is passed to setPatternsResource. The file format must be set in the "format" field of the options parameter map and, depending on the file type, additional parameters might need to be set.

    To enable regex extraction, call setEnablePatternRegex(true).

    If the file is in the JSON format, the rule definitions must be given as a list of objects with the fields "id", "label" and "patterns":

    [
      {
        "id": "person-regex",
        "label": "PERSON",
        "patterns": ["\\w+\\s\\w+", "\\w+-\\w+"]
      },
      {
        "id": "locations-words",
        "label": "LOCATION",
        "patterns": ["Winterfell"]
      }
    ]

    The same fields also apply to a file in the JSONL format:

    {"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}
    {"id": "names-with-s", "label": "PERSON", "patterns": ["Stark", "Snow"]}
    {"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
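
    As a rough sketch, a JSON or JSONL file might be wired up the same way as the CSV file in the example on this page. The file paths below are hypothetical, and the "format" values "json" and "jsonl" are assumed by analogy with "csv"; check them against the library version in use. Like the page's own example, the snippet is meant for a spark-shell session with Spark NLP on the classpath.

```scala
import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
import com.johnsnowlabs.nlp.util.io.ReadAs

// Hypothetical path; "json" is an assumed value for the "format" option.
val jsonRuler = new EntityRulerApproach()
  .setInputCols("document", "token")
  .setOutputCol("entities")
  .setPatternsResource(
    path = "patterns.json",
    readAs = ReadAs.TEXT,
    options = Map("format" -> "json")
  )
  // The JSON sample contains regex patterns, so regex extraction is enabled.
  .setEnablePatternRegex(true)

// Hypothetical path; "jsonl" is an assumed value for the "format" option.
val jsonlRuler = new EntityRulerApproach()
  .setInputCols("document", "token")
  .setOutputCol("entities")
  .setPatternsResource(
    path = "patterns.jsonl",
    readAs = ReadAs.TEXT,
    options = Map("format" -> "jsonl")
  )
```

    No "delimiter" option is passed here, since the text above only requires it for CSV files.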

    To use a CSV file, the additional parameter "delimiter" needs to be set. For a pipe-delimited file such as the one below, this can be done with .setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format" -> "csv", "delimiter" -> "\\|")):

    PERSON|Jon
    PERSON|John
    PERSON|John Snow
    LOCATION|Winterfell
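
    To illustrate how such lines relate to the EntityPattern case class listed on this page, here is a minimal, self-contained sketch that splits each line on the delimiter and groups the patterns by label. It is not the library's reader. The object CsvPatternsSketch and its parse method are hypothetical, the EntityPattern defined here is a local stand-in, and treating the delimiter as a regular expression is an assumption suggested by the escaped "\\|".

```scala
// Local stand-in mirroring the EntityPattern signature shown on this page.
case class EntityPattern(label: String, patterns: Seq[String], id: Option[String] = None)

// Hypothetical helper, not part of Spark NLP.
object CsvPatternsSketch {

  // Splits each "LABEL<delimiter>pattern" line and groups the patterns by label,
  // keeping labels in order of first appearance.
  // Assumption: the delimiter is interpreted as a regular expression.
  def parse(lines: Seq[String], delimiter: String): Seq[EntityPattern] = {
    val pairs = lines.map(_.trim).filter(_.nonEmpty).map { line =>
      // Limit of 2: only the first delimiter separates label from pattern.
      val parts = line.split(delimiter, 2)
      require(parts.length == 2, s"Malformed line: $line")
      (parts(0), parts(1))
    }
    pairs.map(_._1).distinct.map { label =>
      EntityPattern(label, pairs.filter(_._1 == label).map(_._2))
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("PERSON|Jon", "PERSON|John", "PERSON|John Snow", "LOCATION|Winterfell")
    parse(lines, "\\|").foreach(println)
  }
}
```

    For the four lines above, this yields one PERSON entry with three patterns and one LOCATION entry with one pattern. The escaping matters under the regex assumption: an unescaped "|" is an empty alternation in a regular expression and would not split on the pipe character.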

    Example

    In this example, the entities file has the form

    PERSON|Jon
    PERSON|John
    PERSON|John Snow
    LOCATION|Winterfell

    where each line contains an entity label and its associated string, delimited by "|".

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
    import com.johnsnowlabs.nlp.util.io.ReadAs
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val entityRuler = new EntityRulerApproach()
      .setInputCols("document", "token")
      .setOutputCol("entities")
      .setPatternsResource(
        path = "src/test/resources/entity-ruler/patterns.csv",
        readAs = ReadAs.TEXT,
        options = Map("format" -> "csv", "delimiter" -> "\\|")
      )
      .setEnablePatternRegex(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      entityRuler
    ))
    
    val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(entities)").show(false)
    +--------------------------------------------------------------------+
    |col                                                                 |
    +--------------------------------------------------------------------+
    |[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []]           |
    |[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
    +--------------------------------------------------------------------+
  3. case class EntityRulerFeatures(patterns: Map[String, String], regexPatterns: Map[String, Seq[String]]) extends Serializable with Product

  4. class EntityRulerModel extends AnnotatorModel[EntityRulerModel] with HasSimpleAnnotate[EntityRulerModel] with HasStorageModel

    Instantiated model of the EntityRulerApproach. For usage and examples, see the documentation of EntityRulerApproach.

  5. class PatternsReadWriter extends PatternsReader with StorageReadWriter[String]

  6. class PatternsReader extends StorageReader[String]

  7. trait ReadablePretrainedEntityRuler extends StorageReadable[EntityRulerModel] with HasPretrained[EntityRulerModel]

  8. class RegexPatternsReadWriter extends RegexPatternsReader with StorageReadWriter[Seq[String]]

  9. class RegexPatternsReader extends StorageReader[Seq[String]]

Value Members

  1. object EntityRulerModel extends ReadablePretrainedEntityRuler with Serializable
