Web API Reference
To use browse with the Web API, simply pass the --web argument to browse
# for an interactive repl
browse --web
# to run a browse script
browse --web ./script.browse
To turn off headless mode for the browser, set a global called headless to false at the start of your code:
set headless false
page ...

page https://en.wikipedia.org/wiki/:slug {
@string title `#firstHeading`
@arr(string) paragraphs `div.mw-parser-output`
out title paragraphs
crawl `a`
}
visit https://en.wikipedia.org/wiki/Kevin_Bacon
Let's unpack the example with the API reference below. Also, please check the examples folder in this repo for more ways to use the Web API.
The following Rules are available inside the browser scope
visit opens the given url. The first time it is evaluated, it launches a new headless browser.
If any handlers have been set up (using page), the RuleSet of the matching handler is evaluated before visit completes.
Note: When used inside a script (not the repl), you must have a handler defined to do anything useful with the browser window. If no handlers match, nothing will happen. Inside the repl, however, we set up a special page that's available throughout the repl session and across visit calls. This lets you easily prototype actions in the repl before porting them over to a script.
Additionally, when run inside a repl, the browser doesn't use headless mode, which makes it easier to prototype.
visit https://windsor.io
page sets up a handler (using the handler property) for the given pattern. Whenever a page navigation event occurs (visit, click, crawl, etc.), if the URL of the page matches any pattern, the corresponding handler is executed in that page's context.
page https://windsor.io/:page {
print $url
print $page
}
The pattern is a string that uses a special url-pattern syntax. When the url of a page matches the pattern, the handler is executed. For matching, the url is stripped of everything after a ? or # (i.e. query parameters and hash targets are ignored)
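The stripping step described above can be sketched in Python. This only illustrates the documented behavior and is not browse's implementation; the helper name `split_for_matching` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def split_for_matching(url):
    """Return (url_used_for_matching, query, hash).

    Hypothetical helper: mirrors the documented rule that everything
    after a `?` or `#` is ignored when a url is compared to a pattern.
    """
    parts = urlsplit(url)
    # Rebuild the url from scheme, host and path only.
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # Empty components become None so "absent" is easy to tell apart.
    return base, parts.query or None, parts.fragment or None
```

Only the first element of the returned tuple would be compared against a pattern; the other two are the parts that matching ignores.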
The simplest example of a url pattern is just a plain url
page https://windsor.io/some/path ...
# this will match
# https://windsor.io/some/path
# https://windsor.io/some/path?key=value
# If this url triggers the handler execution,
# then the variable `$query` is available in the handler, with the value `"key=value"`
# https://windsor.io/some/path#someHash
# If this url triggers the handler execution,
# then the variable `$hash` is available in the handler, with the value `"someHash"`
# Additionally, in all cases, you can access the entire url using `$url` inside the `handler`
Patterns can also have named segments using :name syntax. When matching, a named segment consumes all characters until it hits a / (or the end of the string).
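One way to picture named-segment matching is as a translation of the pattern into a regular expression. The Python sketch below assumes exactly the rule stated above (a segment consumes everything up to the next `/`); the function name `match_pattern` is hypothetical, and the real url-pattern syntax may accept a narrower set of characters.

```python
import re


def match_pattern(pattern, url):
    """Match `url` against a pattern containing `:name` segments.

    Returns a dict of captured segment values, or None if there is no
    match. Sketch only: optional segments and wildcards are not handled.
    """
    names = []
    regex = ""
    pos = 0
    # A segment is a colon followed by an identifier; the `://` after
    # the scheme is not a segment because no letter follows the colon.
    for m in re.finditer(r":([A-Za-z_][A-Za-z0-9_]*)", pattern):
        regex += re.escape(pattern[pos:m.start()]) + "([^/]+)"
        names.append(m.group(1))
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, url)
    if found is None:
        return None
    # If a name is repeated in the pattern, the last value wins here.
    return dict(zip(names, found.groups()))
```

In this sketch, the returned dict plays the role of the variables that become available inside the handler.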
page https://www.linkedin.com/in/:username ...
# this will match
# https://www.linkedin.com/in/pranaygp
# https://www.linkedin.com/in/barackobama
# etc.
# Inside the handler, the `username` can be accessed using the `$username` variable
To make a segment optional, wrap it in ( and ). If using implicit strings, escape the ( and ) as \( and \).
page "https://example.com/(:page)" ...
# this will match
# https://example.com/ (`$page` is set to `nil`)
# https://example.com/path (`$page` is set to `"path"`)
You can also use * as a wildcard to consume anything at all. It's only really useful as the last character in a pattern.
page "https://en.wikipedia.org(/*)" ...
# this will match any page on the english version of wikipedia
# https://en.wikipedia.org/
# https://en.wikipedia.org/wiki
# https://en.wikipedia.org/wiki/Covid_19
# The data matched by the wildcard is accessible using the variable `$_`, although
# this might change so we recommend using `$url` which is the entire url of the page
The handler is executed in a special page scope that has access to the Rules listed below.
@number gets the textContent of the first element that matches the css selector and attempts to parse it as a number. If textContent is null or can't be parsed as a number, innerText is used instead. If both textContent and innerText are null or can't be parsed as numbers, this rule throws an error. (For optional values, use @number? to silence the error and use nil as the value.) The result is stored in the variable named by key.
page ... {
@number pageNum `#page-number`
}
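The fallback order described for @number can be sketched in Python as follows. The function name `extract_number` is hypothetical, and parsing with `float` is an assumption, since the reference does not say how browse parses numbers.

```python
def extract_number(text_content, inner_text, optional=False):
    """Try textContent first, then innerText, as described for @number.

    `optional=True` stands in for the `@number?` form: instead of
    raising, it returns None (nil in browse).
    """
    for candidate in (text_content, inner_text):
        if candidate is None:
            continue
        try:
            # Assumption: a plain float parse decides what "a number" is.
            return float(candidate.strip())
        except ValueError:
            continue
    if optional:
        return None
    raise ValueError("neither textContent nor innerText parsed as a number")
```

The same shape applies to the other extractors, with the parse step replaced by an emptiness or null check.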
@string gets the textContent of the first element that matches the css selector. If textContent is null or empty, innerText is used instead. If both textContent and innerText are null or empty, this rule throws an error. (For optional values, use @string? to silence the error and use nil as the value.) The result is stored in the variable named by key.
page ... {
@string name `h1.title`
}
@url gets the href of the first element that matches the css selector. If href is null, this rule throws an error. (For optional values, use @url? to silence the error and use nil as the value.) The result is stored in the variable named by key.
page ... {
@url website `.product-metrics__stat--website`
}
@arr gets the values, using the selected extractor, for every element that matches the css selector. An array with this data is stored in the variable named by key.
- The string option indicates whether the @string extractor should be used for each element. Currently this is the only option, so make sure to set this to true.
out sets the output of the page call to a JSON object that includes key-value pairs using the variable names provided. This is the primary rule used for web scraping.
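As a rough illustration of the object this rule produces, the Python sketch below builds a JSON object from a set of named variables. The function name `build_output` and the dict standing in for the page scope are hypothetical, and how browse treats a name that was never set is not covered by this reference.

```python
import json


def build_output(scope, names):
    """Serialize the listed variables from `scope` as one JSON object.

    `scope` is a plain dict standing in for the variables defined in a
    handler. A name missing from `scope` raises KeyError in this sketch.
    """
    return json.dumps({name: scope[name] for name in names})
```

Variables that are set in the scope but not listed are left out of the object.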
click performs a click on the first element that matches the css selector. Any valid CSS selector can be used.
page ... {
click `ul li.someClass a#someID`
}
config executes a RuleSet with an overloaded implementation of set, which sets the configuration options below for the page.
- output <string>: The name of the output file that JSON data should be written to instead of stdout
page https://www.thegazette.co.uk/:ignore/issue/:ignore/page/:page {
...
config {
set output "/home/foo/bar/" + $page + ".txt"
}
...
}
crawl looks at each element that matches the given selector. If the element's href value is non-null, visit is called with that href value as the argument. If the element's href value is null, nothing happens.
crawl `article div header h3.title a`
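The per-element behavior of crawl amounts to a filtered loop. In the Python sketch below, `hrefs` stands in for the href values of the matched elements and `visit` for the visit rule; both names are hypothetical.

```python
def crawl(hrefs, visit):
    """Call `visit` once for every non-null href, in document order."""
    for href in hrefs:
        if href is not None:
            # Elements without an href are skipped silently.
            visit(href)
```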
screenshot takes a screenshot of the current page and saves it to a file at path.
page ... {
screenshot "screenshot.png"
}
press simulates pressing a key on the keyboard. Check this list to see all the possible values you can use for key.
page ... {
press a
press Space
}
type enters all the strings provided as values, separated by " " (space). This is useful when entering information into inputs and forms.
page ... {
click input
type hello world
}
If value is a number, wait pauses for that many milliseconds.
If value is a string, wait pauses until an element appears that matches the css selector value.
- If the selector already matches something, this resolves instantly
- If the selector does not appear within 30s (the default timeout), this rule throws an Error
page ... {
wait 2000
wait `ul#list li.someClass span > a`
}
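The two modes of wait can be sketched in Python as below. This is a polling approximation under stated assumptions: the function signature, the `selector_matches` callback, and the 0.1s polling interval are all hypothetical, and the real browser may detect new elements differently. Only the millisecond delay and the 30s default timeout come from the description above.

```python
import time


def wait(value, selector_matches, timeout_s=30.0, poll_s=0.1,
         sleep=time.sleep, clock=time.monotonic):
    """Sleep for `value` ms, or block until `selector_matches(value)`.

    `selector_matches` is a callback that reports whether any element
    currently matches the selector string.
    """
    if isinstance(value, (int, float)):
        sleep(value / 1000.0)  # numbers are milliseconds
        return
    deadline = clock() + timeout_s
    # If the selector already matches, the loop body never runs.
    while not selector_matches(value):
        if clock() >= deadline:
            raise TimeoutError("selector did not appear: %s" % value)
        sleep(poll_s)
```

Passing `sleep` and `clock` as parameters is only a convenience so the sketch can be exercised without real delays.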