Interface URLBuffer

  • All Known Implementing Classes:
    AbstractURLBuffer, PriorityURLBuffer, SchedulingURLBuffer, SimpleURLBuffer

    public interface URLBuffer
    Buffers URLs to be processed into separate queues; used by spouts. Guarantees that no URL can be put in the buffer more than once.

    Configured by setting

    urlbuffer.class: "com.digitalpebble.stormcrawler.persistence.SimpleURLBuffer"

    in the configuration.

    Since:
    1.15
    • Field Detail

      • bufferClassParamName

        static final String bufferClassParamName
        Implementation to use for URLBuffer. Must implement the interface URLBuffer.
        See Also:
        Constant Field Values
    • Method Detail

      • createInstance

        @NotNull
        static URLBuffer createInstance(@NotNull Map<String,Object> stormConf)
        Returns a URLBuffer instance based on the configuration.
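
        The following is a minimal sketch of how a configuration-driven factory of this kind could work. It is not the StormCrawler source: BufferSketch, EmptyBufferSketch and BufferFactorySketch are hypothetical stand-ins, and the use of reflection is an assumption. Only the key "urlbuffer.class" comes from the configuration example on this page.

```java
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the StormCrawler types.
interface BufferSketch {
    int size();
}

class EmptyBufferSketch implements BufferSketch {
    @Override
    public int size() {
        return 0;
    }
}

class BufferFactorySketch {
    // Configuration key, as shown in the configuration example on this page.
    static final String CLASS_PARAM = "urlbuffer.class";

    // Reads the class name from the configuration and instantiates it
    // through its no-argument constructor (assumed mechanism).
    static BufferSketch createInstance(Map<String, Object> conf) {
        Object name = conf.get(CLASS_PARAM);
        if (name == null) {
            throw new IllegalArgumentException("Missing " + CLASS_PARAM);
        }
        try {
            Class<?> clazz = Class.forName(name.toString());
            if (!BufferSketch.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        name + " does not implement BufferSketch");
            }
            return (BufferSketch) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + name, e);
        }
    }
}
```

        In this sketch a missing key or a class of the wrong type fails fast with an IllegalArgumentException, so a misconfigured buffer surfaces at start-up rather than on the first call.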
      • add

        boolean add(String URL,
                    Metadata m,
                    String key)
        Stores the URL and its Metadata under a given key.

        Implementations of this method should be synchronised.

        Returns:
        false if the URL was already in the buffer, true if it wasn't and was added
      • add

        default boolean add(String URL,
                            Metadata m)
        Stores the URL and its Metadata using the hostname as the key.

        Implementations of this method should be synchronised.

        Returns:
        false if the URL was already in the buffer, true if it wasn't and was added
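
        The contract of the two add methods (one queue per key, duplicates rejected, hostname as the default key) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: DedupBufferSketch is a hypothetical class, the Metadata argument is left out for brevity, and the hostname is derived with java.net.URI, which may differ from what StormCrawler does.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration; NOT the StormCrawler implementation.
class DedupBufferSketch {
    // One FIFO queue per key, kept in insertion order of the keys.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Every URL currently held, used to reject duplicates.
    private final Set<String> known = new HashSet<>();

    // Returns false if the URL is already buffered, true if it was added.
    synchronized boolean add(String url, String key) {
        if (!known.add(url)) {
            return false;
        }
        queues.computeIfAbsent(key, k -> new ArrayDeque<>()).add(url);
        return true;
    }

    // Overload without a key: the hostname of the URL is used as the key.
    // An empty key is used when no host can be extracted (assumption).
    synchronized boolean add(String url) {
        String host = URI.create(url).getHost();
        return add(url, host == null ? "" : host);
    }

    synchronized int size() {
        return known.size();
    }

    synchronized int numQueues() {
        return queues.size();
    }
}
```

        Adding "https://example.com/a" twice to this sketch returns true the first time and false the second, and URLs from two different hosts end up in two queues.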
      • size

        int size()
        Returns the total number of URLs in the buffer.
      • numQueues

        int numQueues()
        Returns the total number of queues in the buffer.
      • next

        org.apache.storm.tuple.Values next()
        Retrieves the next available URL. Guarantees that the URLs are always perfectly shuffled.

        Implementations of this method should be synchronised.

      • hasNext

        boolean hasNext()
        Implementations of this method should be synchronised.
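
        One way to read the shuffling guarantee of next() is as an interleaving across queues, so that consecutive URLs come from different keys whenever possible. The sketch below shows that round-robin idea together with hasNext(). It rests on assumptions: RoundRobinSketch is a hypothetical class, it returns a plain String instead of org.apache.storm.tuple.Values, and returning null on an empty buffer is a choice made for this example only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; NOT the StormCrawler implementation.
class RoundRobinSketch {
    // One FIFO queue per key; empty queues are never kept in the map.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    synchronized void add(String url, String key) {
        queues.computeIfAbsent(key, k -> new ArrayDeque<>()).add(url);
    }

    synchronized boolean hasNext() {
        return !queues.isEmpty();
    }

    // Takes one URL from the queue at the front, then moves that queue
    // to the back so the next call is served from a different key.
    synchronized String next() {
        Iterator<Map.Entry<String, Deque<String>>> it = queues.entrySet().iterator();
        if (!it.hasNext()) {
            return null; // empty buffer (choice made for this sketch)
        }
        Map.Entry<String, Deque<String>> head = it.next();
        String key = head.getKey();
        Deque<String> queue = head.getValue();
        it.remove();
        String url = queue.poll();
        if (!queue.isEmpty()) {
            queues.put(key, queue); // re-inserted at the end of the order
        }
        return url;
    }

    // Drains the buffer into a list, in the order next() hands URLs out.
    static List<String> drain(RoundRobinSketch buffer) {
        List<String> out = new ArrayList<>();
        while (buffer.hasNext()) {
            out.add(buffer.next());
        }
        return out;
    }
}
```

        With three URLs queued under key "a", one under "b" and two under "c", this sketch alternates between the keys for as long as more than one queue is non-empty.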
      • acked

        default void acked(String url)
        Notifies the buffer that a URL has been successfully processed. Used, e.g., to compute an ideal delay for a host queue.