net.ruippeixotog.scalascraper.browser

Members list

Type members

Classlikes

trait Browser

A client able to retrieve and parse HTML pages from the web and from local resources.

An implementation of Browser can fetch pages via HTTP GET or POST requests, parse the downloaded page and return a net.ruippeixotog.scalascraper.model.Document instance, which can be queried via the scraper DSL or using its methods.

Different net.ruippeixotog.scalascraper.browser.Browser implementations can model pages with different runtime behavior. For example, some browsers may limit themselves to parsing the HTML content of the page without executing any scripts, while others may run JavaScript and produce Document instances with dynamic content. Read the documentation of each implementation for more information on the semantics of its Document and net.ruippeixotog.scalascraper.model.Element implementations.
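A minimal sketch of this workflow, assuming the parseString method of Browser, the JsoupBrowser() factory and the text extractor of the scraper DSL. It parses an in-memory string instead of fetching a URL so that it runs without network access. The helper pageHeading and the sample markup are hypothetical names introduced only for illustration.

```scala
import net.ruippeixotog.scalascraper.browser.{Browser, JsoupBrowser}
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Document

// Hypothetical sample page, used in place of a real URL.
val sampleHtml: String =
  "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

// Hypothetical helper: it depends only on the Browser trait,
// so any implementation can be passed in.
def pageHeading(browser: Browser, html: String): String = {
  val doc: Document = browser.parseString(html)
  doc >> text("h1") // query through the scraper DSL
}

val demoBrowser: Browser = JsoupBrowser()

// Query through the DSL...
val heading: String = pageHeading(demoBrowser, sampleHtml)

// ...or through the methods of Document itself.
val pageTitle: String = demoBrowser.parseString(sampleHtml).title
```

Because pageHeading is written against Browser rather than a concrete class, swapping in another implementation changes only the line that creates demoBrowser.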

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class HtmlUnitBrowser(browserType: BrowserVersion, proxy: Option[ProxyConfig]) extends Browser

A Browser implementation based on HtmlUnit, a GUI-less browser for Java programs. HtmlUnitBrowser thoroughly simulates a web browser, executing JavaScript code in the pages in addition to parsing and modelling their HTML content. It supports several compatibility modes, allowing it to emulate browsers such as Internet Explorer.

Both the net.ruippeixotog.scalascraper.model.Document and the net.ruippeixotog.scalascraper.model.Element instances obtained from HtmlUnitBrowser can be mutated in the background. JavaScript code can change the attributes and content of elements at any time, and those changes are reflected both in queries to the Document and in previously stored references to Elements. The Document instance always represents the current page in the browser's "window". This means the Document's location value can change, together with its root element, in the event of client-side page refreshes or redirections. Element instances, however, belong to a fixed DOM tree and stop being meaningful as soon as they are removed from the DOM or a client-side page reload occurs.
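The difference in runtime behavior can be sketched by parsing the same markup with both implementations. This assumes that the companion objects can be applied without arguments (HtmlUnitBrowser() and JsoupBrowser()) and that parseString is available on both. The page content and the helper firstText are hypothetical. Per the description above, the HtmlUnitBrowser document is expected to reflect the change made by the script, while the jsoup document only sees the markup as written.

```scala
import net.ruippeixotog.scalascraper.browser.{HtmlUnitBrowser, JsoupBrowser}
import net.ruippeixotog.scalascraper.model.Document

// Hypothetical page whose inline script rewrites an element after it is parsed.
val scriptedHtml: String =
  """<html><head><title>Scripted</title></head><body>
    |<p id="msg">static</p>
    |<script>document.getElementById('msg').textContent = 'dynamic';</script>
    |</body></html>""".stripMargin

// Hypothetical helper: text of the first element matching a CSS query, if any.
def firstText(doc: Document, query: String): Option[String] =
  doc.root.select(query).headOption.map(_.text)

// Assumption: both companions offer a no-argument factory.
val htmlUnitDoc: Document = HtmlUnitBrowser().parseString(scriptedHtml)
val jsoupDoc: Document = JsoupBrowser().parseString(scriptedHtml)

// Read at query time, so for HtmlUnitBrowser this is the current DOM state.
val seenByHtmlUnit: Option[String] = firstText(htmlUnitDoc, "#msg")

// jsoup never runs the script, so this is the text from the page source.
val seenByJsoup: Option[String] = firstText(jsoupDoc, "#msg")
```

Since an HtmlUnitBrowser document can keep changing, it is safer to re-run a query when a fresh value is needed than to cache Element references across page reloads.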

Value parameters

browserType

the browser type and version to simulate

proxy

an optional proxy configuration to use

Attributes

Companion
object
Supertypes
trait Browser
class Object
trait Matchable
class Any
object HtmlUnitBrowser

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
HtmlUnitBrowser.type
class JsoupBrowser(val userAgent: String, val proxy: Proxy) extends Browser

A Browser implementation based on jsoup, a Java HTML parser library. JsoupBrowser provides powerful and efficient document querying, but it doesn't run JavaScript in the pages. As such, it is limited to working strictly with the HTML sent in the page source.

Currently, JsoupBrowser does not keep separate cookie stores for different domains and paths. Each request sends all previously set cookies, regardless of the domain they were set on. If you make requests to different domains and do not want this behavior, use a separate JsoupBrowser instance for each domain.

As the documents parsed by JsoupBrowser instances are not changed after loading, Document and Element instances obtained from them are guaranteed to be immutable.
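As a small sketch of what this guarantee means in practice, the following parses a fragment once and queries it twice through the Element methods. It assumes the JsoupBrowser() factory, parseString and Element.select. The markup and value names are hypothetical.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.model.Document

// Hypothetical fragment; jsoup wraps it in <html> and <body> while parsing.
val menuHtml: String =
  """<ul id="menu"><li>Home</li><li>Docs</li><li>About</li></ul>"""

val menuDoc: Document = JsoupBrowser().parseString(menuHtml)

// Select elements with a CSS query and read their text.
val firstRead: List[String] =
  menuDoc.root.select("#menu li").map(_.text).toList

// Nothing changes the parsed tree after loading,
// so repeating the query yields the same items.
val secondRead: List[String] =
  menuDoc.root.select("#menu li").map(_.text).toList
```

This makes documents from JsoupBrowser safe to cache or share, in contrast with documents whose content can change in the background.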

Value parameters

proxy

an optional proxy configuration to use

userAgent

the user agent with which requests should be made

Attributes

Companion
object
Supertypes
trait Browser
class Object
trait Matchable
class Any
object JsoupBrowser

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
JsoupBrowser.type
case class Proxy(host: String, port: Int, proxyType: Type)

A proxy configuration to be used by Browsers.

Value parameters

host

the proxy host

port

the proxy port

proxyType

the protocol used by a proxy (e.g. HTTP, SOCKS)
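A brief sketch of building such a configuration. The page does not list the members of Type, so the names Proxy.HTTP and Proxy.SOCKS are an assumption based on the protocols mentioned above. The host names and ports are placeholders.

```scala
import net.ruippeixotog.scalascraper.browser.Proxy

// Assumption: the companion object exposes the proxy types
// as Proxy.HTTP and Proxy.SOCKS.
val httpProxy: Proxy = Proxy("proxy.example.com", 8080, Proxy.HTTP)

// Proxy is a case class, so a variant can be derived with copy.
val socksProxy: Proxy = httpProxy.copy(port = 1080, proxyType = Proxy.SOCKS)

// Hypothetical helper: a readable "host:port" label for logging.
def describe(proxy: Proxy): String = s"${proxy.host}:${proxy.port}"

val httpLabel: String = describe(httpProxy)
```

Since Proxy is a plain case class, instances compare by value and can be kept in ordinary configuration objects.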

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object Proxy

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
Proxy.type