Class SitemapFilter

  • All Implemented Interfaces:
    Configurable

    public class SitemapFilter
    extends URLFilter
    A URLFilter that discards URLs discovered on a page that is not a sitemap when sitemaps have been found for that site. This restricts the crawl to pages listed in the sitemaps without affecting sites that have no sitemaps.
      {
        "class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
        "name": "SitemapFilter"
      }
     
    Will be replaced by MetadataFilter, which can filter based on multiple key values.
    Since:
    1.14
    • Constructor Detail

      • SitemapFilter

        public SitemapFilter()
    • Method Detail

      • filter

        @Nullable
        public String filter(@Nullable URL sourceUrl,
                             @Nullable Metadata sourceMetadata,
                             @NotNull String urlToFilter)
        Description copied from class: URLFilter
        Returns null if the URL is to be removed, or a normalised representation of it, which may be identical to the input URL.
        Specified by:
        filter in class URLFilter
        Parameters:
        sourceUrl - the URL of the page where the URL was found. Can be null.
        sourceMetadata - the metadata collected for the page
        urlToFilter - the URL to be filtered
        Returns:
        null if the URL is to be removed, or a normalised representation of it, which may be identical to the input URL
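
        The decision rule described for this class can be illustrated with a minimal, self-contained sketch. It is not the StormCrawler implementation: the class name SitemapFilterSketch, the use of a plain Map in place of Metadata, and the metadata keys "isSitemap" and "foundSitemap" are all assumptions made for illustration. The sketch only shows the contract stated above: null means the URL is discarded, and any other value is the URL to keep.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the filter logic; NOT the StormCrawler class. */
final class SitemapFilterSketch {

    // Hypothetical metadata keys, chosen for this sketch only.
    static final String IS_SITEMAP_KEY = "isSitemap";
    static final String FOUND_SITEMAP_KEY = "foundSitemap";

    /**
     * Returns null when the URL should be discarded, otherwise the URL unchanged.
     * A plain Map stands in for the crawler's Metadata object (assumption).
     */
    static String filter(Map<String, String> sourceMetadata, String urlToFilter) {
        if (sourceMetadata == null) {
            // Nothing is known about the source page, so keep the URL.
            return urlToFilter;
        }
        // Boolean.parseBoolean(null) is false, so missing keys count as "false".
        boolean isSitemap = Boolean.parseBoolean(sourceMetadata.get(IS_SITEMAP_KEY));
        boolean foundSitemap = Boolean.parseBoolean(sourceMetadata.get(FOUND_SITEMAP_KEY));

        // Discard links from ordinary pages on sites known to have sitemaps.
        if (!isSitemap && foundSitemap) {
            return null;
        }
        return urlToFilter;
    }

    public static void main(String[] args) {
        Map<String, String> ordinaryPage = new HashMap<>();
        ordinaryPage.put(FOUND_SITEMAP_KEY, "true");

        Map<String, String> sitemapPage = new HashMap<>();
        sitemapPage.put(IS_SITEMAP_KEY, "true");
        sitemapPage.put(FOUND_SITEMAP_KEY, "true");

        String url = "https://example.com/page.html";
        System.out.println(filter(ordinaryPage, url));    // discarded
        System.out.println(filter(sitemapPage, url));     // kept
        System.out.println(filter(new HashMap<>(), url)); // kept: no sitemap known
    }
}
```

        In this sketch a site without sitemaps never has the "found" flag set, so its outlinks always pass through, which matches the statement that such sites are unaffected.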