Package org.apache.tika.sax
Class StandardsText
- java.lang.Object
  - org.apache.tika.sax.StandardsText
public class StandardsText extends java.lang.Object
StandardsText relies on regular expressions to extract standard references from text. It finds the standard references by performing the following steps:
- searches for headers;
- searches for patterns that are likely to be standard references (basically, every string mostly composed of uppercase letters followed by alphanumeric characters);
- assigns each potential standard reference an initial score of 0.25;
- increases by 0.25 the score of references which include the name of a known standard organization (StandardOrganizations);
- increases by 0.25 the score of references which include the word Publication or Standard;
- increases by 0.25 the score of references which have been found within "Applicable Documents" and equivalent sections;
- returns the standard references along with scores.
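The scoring steps above can be sketched as follows. This is a minimal illustration of the arithmetic only, not the class's actual implementation: the class name ScoringSketch, the score method, the inApplicableDocuments flag and the three-entry organization list are all hypothetical, and plain substring matching stands in for the regular expressions the real class uses.

```java
import java.util.Arrays;
import java.util.List;

public class ScoringSketch {
    // Hypothetical sample of organization names; the real list is
    // provided by StandardOrganizations.
    static final List<String> KNOWN_ORGANIZATIONS = Arrays.asList("ISO", "IEEE", "ANSI");

    // Scores one candidate reference following the steps described above.
    static double score(String candidate, boolean inApplicableDocuments) {
        double score = 0.25; // every candidate starts at 0.25

        // +0.25 if the candidate names a known standard organization
        for (String organization : KNOWN_ORGANIZATIONS) {
            if (candidate.contains(organization)) {
                score += 0.25;
                break; // the bonus is applied once, not per organization
            }
        }

        // +0.25 if the candidate includes the word Publication or Standard
        if (candidate.contains("Publication") || candidate.contains("Standard")) {
            score += 0.25;
        }

        // +0.25 if found within an "Applicable Documents" (or equivalent) section
        if (inApplicableDocuments) {
            score += 0.25;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ABC 123", false));          // 0.25
        System.out.println(score("ISO 9001", false));         // 0.5
        System.out.println(score("IEEE Standard 754", true)); // 1.0
    }
}
```

Under this scheme a score is always one of 0.25, 0.5, 0.75 or 1.0, which is why thresholds are naturally chosen from those values.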
Constructor Summary
Constructors
Constructor: StandardsText()
Method Summary
Modifier and Type: static java.util.ArrayList<StandardReference>
Method: extractStandardReferences(java.lang.String text, double threshold)
Description: Extracts the standard references found within the given text.
Method Detail
extractStandardReferences
public static java.util.ArrayList<StandardReference> extractStandardReferences(java.lang.String text, double threshold)
Extracts the standard references found within the given text.
Parameters:
text - the text from which the standard references are extracted.
threshold - the minimum score a standard reference must have to be returned. For instance, a threshold of 0.75 means that only the references with a score greater than or equal to 0.75 are returned.
Returns:
the list of standard references extracted from the given text.
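The effect of the threshold parameter can be illustrated with a small self-contained sketch. It does not call Tika: ThresholdSketch, its select method and the sample references with their scores are made up for illustration, and a plain map stands in for the StandardReference objects the real method returns. The only behaviour taken from the description above is that the comparison is inclusive (greater than or equal to).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThresholdSketch {
    // Keeps the references whose score is greater than or equal to the threshold.
    static List<String> select(Map<String, Double> scoredReferences, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scoredReferences.entrySet()) {
            if (entry.getValue() >= threshold) { // inclusive lower bound
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical references and scores, in insertion order.
        Map<String, Double> scored = new LinkedHashMap<>();
        scored.put("ABC 123", 0.25);
        scored.put("ISO 9001", 0.5);
        scored.put("IEEE Standard 754", 0.75);

        System.out.println(select(scored, 0.75)); // [IEEE Standard 754]
        System.out.println(select(scored, 0.5));  // [ISO 9001, IEEE Standard 754]
    }
}
```

A reference scoring exactly 0.75 is kept with a threshold of 0.75, so raising the threshold trades recall for precision in steps of 0.25.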