Class TRv2PorterStemmer

java.lang.Object
org.terrier.terms.StemmerTermPipeline
org.terrier.terms.TRv2PorterStemmer
All Implemented Interfaces:
Stemmer, TermPipeline
Direct Known Subclasses:
TRv2WeakPorterStemmer

public class TRv2PorterStemmer
extends StemmerTermPipeline
This is the Porter stemming algorithm, implemented in Java by Gianni Amati. All comments are Porter's, apart from a few that reflect implementation choices. For Porter's implementation in Java, see PorterStemmer
Porter says "It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked --DEPARTURE-- below. The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!".
This class is not thread safe.
Author:
Gianni Amati, modified into a TermPipeline and (Java) optimised by Craig Macdonald
  • Field Summary

    Fields 
    Modifier and Type Field Description
    protected char[] b
    A buffer for the word to be stemmed.
    protected int j
    A general offset into the string.
    protected int k  
    protected int k0  

    Fields inherited from class org.terrier.terms.StemmerTermPipeline

    next
  • Constructor Summary

    Constructors 
    Constructor Description
    TRv2PorterStemmer​(TermPipeline next)
    Constructs an instance of the TRv2PorterStemmer.
  • Method Summary

    Modifier and Type Method Description
    protected boolean cons​(int i)
    Returns true if and only if b[i] is a consonant.
    protected boolean consonantinstem()  
    protected boolean cvc​(int i)
    Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the final consonant (at position i) is not w, x or y.
    protected void defineBuffer​(java.lang.String s)  
    protected boolean doublec​(int _j)
    Returns true if positions _j and _j-1 contain a double consonant.
    protected boolean ends​(java.lang.String s)
    Returns true if k0,...k ends with the string s.
    protected int m()
    Measures the number of consonant sequences between k0 and j.
    static void main​(java.lang.String[] args)
    main
    protected void setto​(int i1, int i2, java.lang.String str)
    Sets (j+1),...k to the characters in the string str, readjusting k and j.
    java.lang.String stem​(java.lang.String s)
    Returns the stem of a given term.
    protected void step1ab()
    Removes the plurals and -ed or -ing.
    protected void step1c()
    Turns terminal y to i when there is another vowel in the stem.
    protected void step2()
    Maps double suffixes to single ones.
    protected void step3()
    Deals with -ic-, -full, -ness, etc., using a strategy similar to step2.
    protected void step4()
    Takes off -ant, -ence etc., in context <c>vcvc<v>.
    protected void step5()
    Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
    protected boolean vowelinstem()
    Returns TRUE if k0,...j contains a vowel.

    Methods inherited from class org.terrier.terms.StemmerTermPipeline

    processTerm, reset

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • b

      protected char[] b
      A buffer for the word to be stemmed.
    • k

      protected int k
    • k0

      protected int k0
    • j

      protected int j
      A general offset into the string.
  • Constructor Details

  • Method Details

    • cons

      protected boolean cons​(int i)
      Returns true if and only if b[i] is a consonant.
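      The consonant test can be illustrated with a minimal standalone sketch. It assumes the rule from Porter's published algorithm (a, e, i, o, u are vowels; y counts as a consonant at the start of the word or directly after a vowel) and assumes the word starts at index 0. The class name ConsonantSketch and the extra buffer parameter are hypothetical and not part of this class.

```java
// Hypothetical standalone sketch; not the Terrier implementation.
final class ConsonantSketch {

    // Assumed rule (from Porter's published algorithm): a, e, i, o, u are vowels;
    // y is a consonant at index 0 or when the preceding letter is a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        char[] toy = "toy".toCharArray();
        char[] syzygy = "syzygy".toCharArray();
        System.out.println(cons(toy, 2));    // y follows the vowel o, so it is a consonant
        System.out.println(cons(syzygy, 1)); // y follows the consonant s, so it is a vowel
    }
}
```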
    • consonantinstem

      protected boolean consonantinstem()
    • cvc

      protected final boolean cvc​(int i)
      Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the final consonant (at position i) is not w, x or y. This is used when trying to restore an e at the end of a short word. For example:
      • cav(e)
      • lov(e)
      • hop(e)
      • crim(e)
      but snow, box and tray are kept as they are.
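      A minimal sketch of this check, under the assumption that the word starts at index 0 and that y is treated as in Porter's published algorithm. CvcSketch and endsCvc are hypothetical names introduced for illustration only.

```java
// Hypothetical standalone sketch; not the Terrier implementation.
final class CvcSketch {

    // Assumed consonant rule: y is a consonant at index 0 or after a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // True if b[i-2], b[i-1], b[i] are consonant, vowel, consonant
    // and b[i] is not w, x or y.
    static boolean cvc(char[] b, int i) {
        if (i < 2 || !cons(b, i) || cons(b, i - 1) || !cons(b, i - 2)) {
            return false;
        }
        char ch = b[i];
        return ch != 'w' && ch != 'x' && ch != 'y';
    }

    // Convenience wrapper: applies cvc at the last character of the stem.
    static boolean endsCvc(String stem) {
        char[] b = stem.toCharArray();
        return cvc(b, b.length - 1);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cav", "lov", "hop", "crim", "snow", "box", "tray"}) {
            System.out.println(s + " -> " + endsCvc(s));
        }
    }
}
```

      In this sketch the first four stems report true (an e may be restored) and the last three report false.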
    • defineBuffer

      protected final void defineBuffer​(java.lang.String s)
    • doublec

      protected final boolean doublec​(int _j)
      Returns true if positions _j and _j-1 contain a double consonant.
    • ends

      protected final boolean ends​(java.lang.String s)
      Returns true if k0,...k ends with the string s.
    • m

      protected final int m()
      Measures the number of consonant sequences between k0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:
      • <c><v> gives 0
      • <c>vc<v> gives 1
      • <c>vcvc<v> gives 2
      • <c>vcvcvc<v> gives 3
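      The measure can be sketched as counting vowel-to-consonant transitions. The version below is a simplified illustration, not this class's code: it measures a whole string rather than the range k0..j of the shared buffer, and it assumes Porter's published rule for y. MeasureSketch and measure are hypothetical names.

```java
// Hypothetical standalone sketch; not the Terrier implementation.
final class MeasureSketch {

    // Assumed consonant rule: y is a consonant at index 0 or after a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // Counts the "vc" pairs in the pattern <c>(vc)...<v>: each time a
    // consonant directly follows a vowel, one more pair is complete.
    static int measure(String stem) {
        char[] b = stem.toCharArray();
        int count = 0;
        boolean prevVowel = false;
        for (int i = 0; i < b.length; i++) {
            boolean consonant = cons(b, i);
            if (consonant && prevVowel) {
                count++;
            }
            prevVowel = !consonant;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(measure("tree"));     // 0
        System.out.println(measure("trouble"));  // 1
        System.out.println(measure("troubles")); // 2
    }
}
```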
    • setto

      protected final void setto​(int i1, int i2, java.lang.String str)
      Sets (j+1),...k to the characters in the string str, readjusting k and j.
    • stem

      public java.lang.String stem​(java.lang.String s)
      Returns the stem of a given term.
      Parameters:
      s - String the term to be stemmed.
      Returns:
      String the stem of a given term.
    • step1ab

      protected final void step1ab()
      Removes the plurals and -ed or -ing. For example,
      • caresses becomes caress
      • ponies becomes poni
      • ties becomes ti
      • caress becomes caress
      • cats becomes cat
      • feed becomes feed
      • agreed becomes agree
      • disabled becomes disable
      • matting becomes mat
      • mating becomes mate
      • meeting becomes meet
      • milling becomes mill
      • messing becomes mess
      • meetings becomes meet
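      The plural-removal half of this step can be sketched on plain strings. This is a deliberately partial illustration: it covers only the -sses, -ies and -s rules (the first five examples above) and omits the -ed and -ing handling, which also depends on m(), vowelinstem(), doublec() and cvc(). The length guard follows Porter's published code, which leaves words of one or two letters alone. Step1aSketch and stripPlural are hypothetical names.

```java
// Hypothetical standalone sketch of the plural rules only; not the Terrier implementation.
final class Step1aSketch {

    static String stripPlural(String w) {
        if (w.length() <= 2) {
            return w; // assumed guard: very short words are not stemmed
        }
        if (w.endsWith("sses")) {
            return w.substring(0, w.length() - 2); // -sses becomes -ss
        }
        if (w.endsWith("ies")) {
            return w.substring(0, w.length() - 2); // -ies becomes -i
        }
        if (w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1); // drop a single final -s
        }
        return w;                                  // -ss and everything else unchanged
    }

    public static void main(String[] args) {
        for (String s : new String[] {"caresses", "ponies", "ties", "caress", "cats"}) {
            System.out.println(s + " becomes " + stripPlural(s));
        }
    }
}
```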
    • step1c

      protected final void step1c()
      Turns terminal y to i when there is another vowel in the stem.
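      A minimal sketch of this rule on a plain string, assuming the stem is everything before the final y and that y is classified as in Porter's published algorithm. Step1cSketch is a hypothetical name, and the method here returns a new string instead of editing the shared buffer.

```java
// Hypothetical standalone sketch; not the Terrier implementation.
final class Step1cSketch {

    // Assumed consonant rule: y is a consonant at index 0 or after a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // Replaces a final y with i if any earlier position holds a vowel.
    static String step1c(String w) {
        if (!w.endsWith("y")) {
            return w;
        }
        char[] b = w.toCharArray();
        for (int i = 0; i < b.length - 1; i++) {
            if (!cons(b, i)) {
                b[b.length - 1] = 'i';
                return new String(b);
            }
        }
        return w; // no vowel before the final y
    }

    public static void main(String[] args) {
        System.out.println(step1c("happy")); // happi
        System.out.println(step1c("sky"));   // sky
    }
}
```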
    • step2

      protected final void step2()
      Maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.
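      The suffix mapping and its m() > 0 condition can be sketched with a small rule table. Only five suffix pairs, taken from Porter's 1980 paper, are shown; the full rule set of this class is not reproduced here and may differ. Step2Sketch, RULES and measure are hypothetical names, and the measure is computed over the whole string before the suffix.

```java
// Hypothetical standalone sketch with a partial rule table; not the Terrier implementation.
final class Step2Sketch {

    // Longer suffixes come first so that -ization wins over -ation.
    static final String[][] RULES = {
        {"ational", "ate"}, {"tional", "tion"}, {"ization", "ize"},
        {"ation", "ate"}, {"izer", "ize"},
    };

    // Assumed consonant rule: y is a consonant at index 0 or after a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // Number of vowel-to-consonant transitions in the stem.
    static int measure(String stem) {
        char[] b = stem.toCharArray();
        int count = 0;
        boolean prevVowel = false;
        for (int i = 0; i < b.length; i++) {
            boolean consonant = cons(b, i);
            if (consonant && prevVowel) {
                count++;
            }
            prevVowel = !consonant;
        }
        return count;
    }

    // Applies the first matching rule, but only if the part before the suffix has m > 0.
    static String step2(String w) {
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String stem = w.substring(0, w.length() - rule[0].length());
                return measure(stem) > 0 ? stem + rule[1] : w;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step2("relational"));     // relate
        System.out.println(step2("rational"));       // rational (stem "r" has m = 0)
        System.out.println(step2("vietnamization")); // vietnamize
    }
}
```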
    • step3

      protected final void step3()
      Deals with -ic-, -full, -ness, etc., using a strategy similar to step2.
    • step4

      protected final void step4()
      Takes off -ant, -ence etc., in context <c>vcvc<v>.
    • step5

      protected final void step5()
      Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
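      A minimal sketch of the two rules on a plain string. Beyond the description above, it also drops a final -e when the measure is exactly 1 and the stem does not end consonant - vowel - consonant; that extra condition comes from Porter's published algorithm and is an assumption about this class. Step5Sketch, measure and endsCvc are hypothetical names.

```java
// Hypothetical standalone sketch; not the Terrier implementation.
final class Step5Sketch {

    // Assumed consonant rule: y is a consonant at index 0 or after a vowel.
    static boolean cons(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // Number of vowel-to-consonant transitions in the stem.
    static int measure(String stem) {
        char[] b = stem.toCharArray();
        int count = 0;
        boolean prevVowel = false;
        for (int i = 0; i < b.length; i++) {
            boolean consonant = cons(b, i);
            if (consonant && prevVowel) {
                count++;
            }
            prevVowel = !consonant;
        }
        return count;
    }

    // True if the stem ends consonant - vowel - consonant and the last letter is not w, x or y.
    static boolean endsCvc(String stem) {
        char[] b = stem.toCharArray();
        int i = b.length - 1;
        if (i < 2 || !cons(b, i) || cons(b, i - 1) || !cons(b, i - 2)) {
            return false;
        }
        return b[i] != 'w' && b[i] != 'x' && b[i] != 'y';
    }

    static String step5(String w) {
        if (w.endsWith("e")) {
            String stem = w.substring(0, w.length() - 1);
            int a = measure(stem);
            // Assumed extra case (Porter's paper): m == 1 and the stem does not end in cvc.
            if (a > 1 || (a == 1 && !endsCvc(stem))) {
                w = stem;
            }
        }
        if (w.endsWith("ll") && measure(w) > 1) {
            w = w.substring(0, w.length() - 1); // -ll becomes -l
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step5("probate"));  // probat
        System.out.println(step5("rate"));     // rate
        System.out.println(step5("controll")); // control
    }
}
```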
    • vowelinstem

      protected final boolean vowelinstem()
      Returns TRUE if k0,...j contains a vowel.
      Returns:
      true if k0,...,j contains a vowel.
    • main

      public static void main​(java.lang.String[] args)
      main
      Parameters:
      args -