Web API Reference

Pranay Prakash edited this page Aug 30, 2020 · 36 revisions

To use browse with the Web API, pass the --web argument to browse:

# for an interactive repl
browse --web    

# to run a browse script    
browse --web ./script.browse

To turn off headless mode for the browser, just set a global called headless to false at the start of your code:

set headless false

page ...

A Complete Example

page https://en.wikipedia.org/wiki/:slug {
  @string title `#firstHeading`
  @arr(string) paragraphs `div.mw-parser-output`
  out title paragraphs
  crawl `a`
}
visit https://en.wikipedia.org/wiki/Kevin_Bacon

Let's unpack the example with the API reference below. Also, please check the examples folder in this repo for more ways to use the Web API.

Rules

The following Rules are available inside the browser scope

visit <string: url>

Visit the given url. This opens a new headless browser when evaluated the very first time.

If any handlers have been set up (using page), then the matching RuleSet is evaluated before visit completes.

Note: When used inside a script (not the repl), you must have a handler defined to do anything useful with the browser window. If no handlers match, nothing will happen. However, when used inside the repl, we set up a special page that's available throughout the repl session and is used by visit calls. This lets you easily prototype actions in the repl before porting them over to a script.

Additionally, when run inside a repl, the browser doesn't use headless mode, which makes it easier to prototype.

Example

visit https://windsor.io

page <string: pattern> <RuleSet: handler>

page sets up a handler (the handler RuleSet) for the given pattern. Whenever a page navigation event occurs (visit, click, crawl, etc.), if the URL of the page is matched by any pattern, then the corresponding handler gets executed in that page's context.

Example

page https://windsor.io/:page {
  print $url
  print $page
}

<string: pattern>

The pattern is a string that uses a special url-pattern syntax. When the url of a page matches the pattern, the handler is executed. For matching, the url is stripped of everything after a ? or # (i.e. query parameters and hash targets are ignored)

The simplest example of a url pattern is just a plain url

page https://windsor.io/some/path ...
# this will match 
#   https://windsor.io/some/path
#   https://windsor.io/some/path?key=value
#     If this url triggers the handler execution, 
#        then the variable `$query` is available in the handler, with the value `"key=value"`
#   https://windsor.io/some/path#someHash
#     If this url triggers the handler execution, 
#        then the variable `$hash` is available in the handler, with the value `"someHash"`
# Additionally, in all cases, you can access the entire url using `$url` inside the `handler`
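
As a rough sketch of how these variables might be used, the handler below prints the full url and the query string. It relies only on print and on the $url and $query variables described in the comments above; the exact printed format is not specified here, so no output is shown.

```
# sketch only: reuses the example url from the comments above
page https://windsor.io/some/path {
  print $url
  print $query
}
visit https://windsor.io/some/path?key=value
```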

Patterns can also have named segments using :name syntax. When matching, a named segment consumes all characters until it hits a / (or the end of the string).

page https://www.linkedin.com/in/:username ...
# this will match 
#   https://www.linkedin.com/in/pranaygp
#   https://www.linkedin.com/in/barackobama
#   etc.
# Inside the handler, the `username` can be accessed using the `$username` variable

To make a segment optional, just wrap it in ( and ) (if using implicit strings, escape the ( and ) with \( and \))

page "https://example.com/(:page)" ...
# this will match 
#   https://example.com/       (`$page` is set to `nil`)
#   https://example.com/path   (`$page` is set to `"path"`)

You can also use * as a wildcard to consume anything at all. It's only really useful as the last character in a pattern.

page "https://en.wikipedia.org(/*)" ...
# this will match any page on the English version of Wikipedia
# https://en.wikipedia.org/
# https://en.wikipedia.org/wiki
# https://en.wikipedia.org/wiki/Covid_19
#   The data matched by the wildcard is accessible using the variable `$_`, although 
#   this might change so we recommend using `$url` which is the entire url of the page

<RuleSet: handler>

The handler is executed in a special page scope that has access to the Rules listed below.

Page Rules

@number <string: key> <string: selector>

Get the textContent of the first element that matches the css selector and attempt to parse it as a number. If textContent is null or can't be parsed as a number, innerText will be used. If both textContent and innerText are null or can't be parsed as numbers, this rule will throw an error. (For values that are optional, use @number? to silence the error and use nil as the value.) This data will be stored into a variable called key.

Example

page ... {
  @number pageNum `#page-number`
}

@string <string: key> <string: selector>

Get the textContent of the first element that matches the css selector. If textContent is null or empty, innerText will be used. If both textContent and innerText are null or empty, this rule will throw an error. (For values that are optional, use @string? to silence the error and use nil as the value.) This data will be stored into a variable called key.

Example

page ... {
  @string name `h1.title`
}

@url <string: key> <string: selector>

Get the href of the first element that matches the css selector. If href is null this rule will throw an error. (For values that are optional, use @url? to silence the error and use nil as the value). This data will be stored into a variable called key.

Example

page ... {
  @url website `.product-metrics__stat--website`
}

@arr(string) <string: key> <string: selector>

  • The string option indicates that the @string extractor should be used for each element. Currently this is the only option, so make sure to always pass it, as in @arr(string)

Get the values, using the selected extractor, for every element that matches the css selector. An array with this data will be stored into a variable called key.
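
A minimal sketch, reusing the Wikipedia pattern from the complete example. The selector `div.mw-parser-output p` is an assumption about the page's markup, chosen so that more than one element matches and the array has several entries.

```
# sketch only: the selector is an assumption, not taken from the reference
page https://en.wikipedia.org/wiki/:slug {
  @arr(string) paragraphs `div.mw-parser-output p`
}
```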

out <...string: variables>

Set the output of the page call to a JSON object that includes key-value pairs using the variable names provided. This is the primary rule used for web scraping.
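
As an illustration, the sketch below extracts two values and passes their variable names to out. The selectors are hypothetical, and @url? is used on the assumption that the link may be missing on some pages, in which case website would be nil.

```
# sketch only: both selectors are hypothetical
page ... {
  @string title `h1.title`
  @url? website `a.website`
  out title website
}
```

The resulting JSON object would carry one key per variable name, here title and website.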

click <string: selector>

Click on the first element that matches the css selector. Any valid CSS Selector can be used.

Example

page ... {
  click `ul li.someClass a#someID`
}

config <RuleSet: rules>

config executes a RuleSet with an overloaded implementation of set, which sets the configuration options below for the page.

Config Options

  • output
    • <string>
    • The name of the output file that JSON data should be written to instead of 'stdout'

Example

page https://www.thegazette.co.uk/:ignore/issue/:ignore/page/:page {
  ...
  config {
    set output "/home/foo/bar/" + $page + ".txt"
  }
  ...
}

crawl <string: selector>

For each element that matches the given selector, if the element's href value is non-null, visit is called with that href value as the argument. If the element's href value is null, nothing happens.

Example

crawl `article div header h3.title a`

screenshot <string: path>

Take a screenshot of the current page and save it to a file at path

Example

page ... {
  screenshot "screenshot.png"
}

press <string: key>

Simulate pressing a key on the keyboard. Check this list to see all the possible values you can use for key.

Example

page ... {
  press a
  press Space
}

type [...string: values]

Type all the strings provided as values, separated by " " (space). This is useful when entering information into inputs and forms

Example

page ... {
  click input
  type hello world
}

wait <string|number: value>

If value is a number, then wait for that amount of time in milliseconds

If value is a string, then wait till an element appears which matches the css selector value

  • If the selector already matches something, this resolves instantly
  • If the selector does not appear for 30s (the default timeout), then this function throws an Error

Example

page ... {
  wait 2000
  wait `ul#list li.someClass span > a`
}
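
The interaction rules above can be combined in one handler. The following is a sketch under stated assumptions: the url and the `input#search` selector are hypothetical, and it only chains rules documented on this page (wait, click, type, screenshot).

```
# sketch only: the url and the input#search selector are hypothetical
page https://example.com/search {
  wait `input#search`
  click `input#search`
  type hello world
  screenshot "search-filled.png"
}
visit https://example.com/search
```

Waiting on the selector before clicking avoids acting on an element that has not rendered yet. If the selector never appears, wait throws after the default 30s timeout described above.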