Class SitemapFilter
java.lang.Object
  com.digitalpebble.stormcrawler.util.AbstractConfigurable
    com.digitalpebble.stormcrawler.filtering.URLFilter
      com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter

All Implemented Interfaces:
Configurable

public class SitemapFilter extends URLFilter
URLFilter which discards URLs discovered in a page that is not a sitemap when sitemaps have been found for that site. This makes it possible to restrict the crawl to pages found in the sitemaps without affecting sites which do not have sitemaps.

Configuration:
{ "class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter", "name": "SitemapFilter" }

Will be replaced by MetadataFilter, which filters based on multiple key values.

Since:
1.14
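
The rule described above can be sketched as follows. This is a minimal, self-contained illustration and not the class's actual source: a plain `Map` stands in for `Metadata`, and the key names `isSitemap` and `foundSitemap` are assumptions, since this page does not say which metadata keys the filter reads.

```java
import java.util.Map;

// Hypothetical stand-in for SitemapFilter; it does not use StormCrawler types.
class SitemapFilterSketch {
    // Assumed metadata keys: the real key names are not given on this page.
    static final String IS_SITEMAP_KEY = "isSitemap";
    static final String FOUND_SITEMAP_KEY = "foundSitemap";

    // Returns null when the URL should be discarded, otherwise the URL unchanged.
    static String filter(Map<String, String> sourceMetadata, String urlToFilter) {
        // Boolean.parseBoolean(null) is false, so missing keys count as "not set".
        boolean isSitemap = Boolean.parseBoolean(sourceMetadata.get(IS_SITEMAP_KEY));
        boolean foundSitemap = Boolean.parseBoolean(sourceMetadata.get(FOUND_SITEMAP_KEY));

        // The source page is an ordinary page on a site that has sitemaps:
        // its outlinks are dropped so the crawl stays within the sitemaps.
        if (!isSitemap && foundSitemap) {
            return null;
        }
        // Either the source is a sitemap, or the site has no sitemaps at all.
        return urlToFilter;
    }
}
```

Under these assumptions, a link found on an ordinary page of a site with sitemaps is removed, while links found in a sitemap, or on a site without sitemaps, pass through unchanged.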
Constructor Summary
Constructor        Description
SitemapFilter()

Method Summary
Modifier and Type: @Nullable String
Method: filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlToFilter)
Description: Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.

Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.digitalpebble.stormcrawler.util.Configurable
configure
Method Detail
filter
@Nullable public String filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlToFilter)
Description copied from class: URLFilter
Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.

Specified by:
filter in class URLFilter

Parameters:
sourceUrl - the URL of the page where the URL was found. Can be null.
sourceMetadata - the metadata collected for the page
urlToFilter - the URL to be filtered

Returns:
null if the URL is to be removed, or a normalised representation which can correspond to the input URL
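
To illustrate the return contract, the sketch below shows how a caller might apply several filters in sequence, stopping as soon as one returns null and otherwise passing the normalised result on to the next. `FilterChainSketch`, `applyAll`, and the use of `UnaryOperator<String>` as a stand-in for a filter are hypothetical and are not part of the StormCrawler API.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper showing how the null-or-normalised contract composes.
class FilterChainSketch {
    // Applies each filter in order; returns null as soon as one rejects the URL.
    static String applyAll(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // URL removed: later filters are not consulted
            }
        }
        return current; // possibly normalised by one or more filters
    }
}
```

With this shape, a filter that only normalises (for example by lower-casing) and a filter that only rejects can be combined freely, because both honour the same contract.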