Class SitemapFilter

  • All Implemented Interfaces:

    public class SitemapFilter
    extends URLFilter
    URLFilter which discards URLs discovered on a page that is not a sitemap when sitemaps have been found for that site. This restricts the crawl to pages listed in the sitemaps, without affecting sites that do not have sitemaps.
        {
            "class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
            "name": "SitemapFilter"
        }
    Will be replaced by MetadataFilter to filter based on multiple key values
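
    The rule described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the class's actual source: SitemapFilterSketch is a hypothetical name, a plain Map stands in for the Metadata class, the sourceUrl parameter is omitted, and the metadata keys "isSitemap" and "foundSitemap" are assumptions made for the example.

```java
import java.util.Map;

// Hypothetical stand-in for SitemapFilter, for illustration only.
// A plain Map replaces the Metadata class used by the real filter.
class SitemapFilterSketch {

    // Assumed metadata keys; the real key names may differ.
    static final String IS_SITEMAP_KEY = "isSitemap";
    static final String FOUND_SITEMAP_KEY = "foundSitemap";

    /**
     * Returns null when the URL should be discarded, otherwise the URL unchanged.
     */
    static String filter(Map<String, String> sourceMetadata, String urlToFilter) {
        // Boolean.parseBoolean(null) is false, so a missing key counts as "not set".
        boolean isSitemap =
                Boolean.parseBoolean(sourceMetadata.get(IS_SITEMAP_KEY));
        boolean foundSitemap =
                Boolean.parseBoolean(sourceMetadata.get(FOUND_SITEMAP_KEY));

        // Discard links found on ordinary pages of a site that has sitemaps.
        if (!isSitemap && foundSitemap) {
            return null;
        }
        // Links from a sitemap, or from a site without sitemaps, pass through.
        return urlToFilter;
    }
}
```

    Under these assumptions, a link found on an ordinary page of a site with sitemaps yields null, while the same link found in a sitemap, or on a site where no sitemap was found, is returned unchanged.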
    • Constructor Detail

      • SitemapFilter

        public SitemapFilter()
    • Method Detail

      • filter

        public @Nullable String filter(@Nullable URL sourceUrl,
                                       @Nullable Metadata sourceMetadata,
                                       @NotNull String urlToFilter)
        Description copied from class: URLFilter
        Returns null if the URL is to be removed or a normalised representation which can correspond to the input URL
        Specified by:
        filter in class URLFilter
        Parameters:
        sourceUrl - the URL of the page where the URL was found. Can be null.
        sourceMetadata - the metadata collected for the page
        urlToFilter - the URL to be filtered
        Returns:
        null if the URL is to be removed or a normalised representation which can correspond to the input URL