Crawler: Login
object
login: { fetchRequest: { url: 'your_url', requestOptions: { ... } } }
login: { browserRequest: { url: 'your_login_page', username: 'login', password: 'password' } }
login: { oauthRequest: { accessTokenRequest: { url: 'url_of_access_token_endpoint', grant_type: 'client_credentials', client_id: 'client-identifier', client_secret: 'client-secret' } } }
About this parameter
This property defines how the crawler acquires a session to access protected content.
The crawler supports multiple ways to authenticate to protected websites, such as:
- Basic auth through an HTTP request
- Basic auth by visiting a login page through a web browser and submitting the login form as a human would
- OAuth 2.0
Basic auth
The crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in a Cookie header when crawling all pages of the website defined in the configuration.
This cookie is only fetched at the beginning of each complete crawl. If it expires, we won’t renew it automatically.
There are two ways the crawler can interact with your login page:
- By sending a direct request with the credentials to your login endpoint, like a standard curl command.
- By emulating a web browser: loading your login page, entering the credentials, and submitting the login form.
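The cookie handling described above can be sketched as follows. This is only an illustration of turning Set-Cookie response values into a Cookie request header, not the crawler's actual code: the helper name `buildCookieHeader` and the sample cookie values are assumptions made for this example.

```javascript
// Hypothetical helper (not part of the crawler API): build the value of a
// Cookie request header from the Set-Cookie values of a login response.
function buildCookieHeader(setCookieValues) {
  return setCookieValues
    // Keep only the leading "name=value" pair and drop attributes
    // such as Path, HttpOnly, or Secure.
    .map((value) => value.split(';')[0].trim())
    // Ignore anything that is not a name=value pair.
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Sample Set-Cookie values, invented for illustration.
const loginResponseCookies = [
  'session=abc123; Path=/; HttpOnly',
  'csrf=xyz; Secure',
];

const cookieHeader = buildCookieHeader(loginResponseCookies);
console.log(cookieHeader); // session=abc123; csrf=xyz
```

Because the cookie is captured once per crawl, a session that expires mid-crawl would keep being sent unchanged in this model.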
OAuth 2.0
The crawler supports OAuth 2.0 Client Credentials Grant flow.
It performs an Access Token Request using the provided credentials, stores the fetched token in an Authorization header, and sends it when crawling all pages of the website that are defined in the configuration.
This token is only fetched at the beginning of each complete crawl. If it expires, it won’t be renewed automatically.
Client authentication is performed by passing the client credentials (client_id / client_secret) in the request body, as described in the OAuth 2.0 RFC (RFC 6749).
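As a rough sketch of that flow, the snippet below builds an Access Token Request with the client credentials in a form-encoded body, then derives an Authorization header from the response. The helper names are hypothetical, and the `Bearer` scheme and the `access_token` response field are assumptions based on common OAuth 2.0 usage; this page does not state what the crawler does internally.

```javascript
// Hypothetical helper: build the Access Token Request for the
// Client Credentials Grant, with the credentials in the request body.
function buildAccessTokenRequest({
  url,
  grant_type,
  client_id,
  client_secret,
  scope,
  extraParameters = {},
}) {
  const params = new URLSearchParams({ grant_type, client_id, client_secret });
  if (scope) params.set('scope', scope);
  // Implementation-specific parameters are added to the same body.
  for (const [name, value] of Object.entries(extraParameters)) {
    params.set(name, value);
  }
  return {
    url,
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: params.toString(),
  };
}

// Assumption: the endpoint answers with a bearer token in `access_token`.
function buildAuthorizationHeader(tokenResponse) {
  return `Bearer ${tokenResponse.access_token}`;
}

const tokenRequest = buildAccessTokenRequest({
  url: 'https://example.com/oauth2/token',
  grant_type: 'client_credentials',
  client_id: 'my-client-id',
  client_secret: 'my-client-secret',
});
console.log(tokenRequest.body);
// grant_type=client_credentials&client_id=my-client-id&client_secret=my-client-secret
```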
The following providers are supported. You can reach out if you need to add others.
Examples
{
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-post',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: 'id=my-id&password=my-password',
        timeout: 5000 // in milliseconds
      }
    }
  }
}
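When the credentials contain characters such as `&`, `=`, or `@`, a hand-written form body is easy to get wrong. The sketch below builds the same `body` string with the standard `URLSearchParams` class so that values are percent-encoded. The variable names are chosen for this example only, and the `id` and `password` field names are taken from the example above; your login endpoint may expect different ones.

```javascript
// Build the form-encoded body instead of concatenating strings by hand.
const loginBody = new URLSearchParams({
  id: 'my-id',
  password: 'my-password',
}).toString();
console.log(loginBody); // id=my-id&password=my-password

// Same login configuration as above, with the generated body.
const crawlerLoginConfig = {
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-post',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: loginBody,
        timeout: 5000, // in milliseconds
      },
    },
  },
};

// Special characters in a value are percent-encoded automatically.
const encodedBody = new URLSearchParams({ password: 'p@ss&word=1' }).toString();
console.log(encodedBody); // password=p%40ss%26word%3D1
```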
{
  login: {
    browserRequest: {
      url: 'https://example.com/secure/login-page',
      username: 'my-id',
      password: 'my-password',
    }
  }
}
{
  login: {
    oauthRequest: {
      accessTokenRequest: {
        url: 'https://example.com/oauth2/token',
        grant_type: 'client_credentials',
        client_id: 'my-client-id',
        client_secret: 'my-client-secret',
        extraParameters: {
          resource: 'https://protected.example.com/'
        }
      }
    }
  }
}
Parameters
fetchRequest
This allows you to manually craft the login request that the crawler sends.
url | type: string | Required | The URL to target.
requestOptions | type: object | This object is passed to our extended version of the request library.
fetchRequest ➔ requestOptions
method | type: string | default: GET | The HTTP method to use.
headers | type: object | default: {} | HTTP headers to pass.
body | type: string | The body of the request.
timeout | type: number | Time to wait before aborting the request (in milliseconds).
requestOptions ➔ headers
Content-Type | type: string
Authorization | type: string
Cookie | type: string
browserRequest
Make the crawler use a web browser to visit your login page and submit the login form as a human would.
url | type: string | Required | The URL of the login page. The HTML elements expected on this page are
username | type: string | Required | The username.
password | type: string | Required | The password.
waitTime | type: object | Optional | Determines the shortest and longest wait time before considering the login done.
browserRequest ➔ waitTime
min | type: number | default: 0 | Optional | If the login ends faster than this minimum execution time, the browser remains open at least this long before returning the cookies.
max | type: number | default: 20000 | Optional | At this maximum execution time threshold, the execution stops and the cookies are returned as is.
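One way to read the two bounds is as a clamp on how long the browser session lasts before cookies are returned. The sketch below models that reading only. The function `effectiveLoginDuration` is hypothetical and not part of the crawler, and it assumes the values are in milliseconds, which the defaults suggest but this page does not state.

```javascript
// Hypothetical model of the waitTime bounds, using the documented defaults.
function effectiveLoginDuration(loginDurationMs, { min = 0, max = 20000 } = {}) {
  // Never return before `min`; never wait longer than `max`.
  return Math.min(Math.max(loginDurationMs, min), max);
}

// A login that finishes in 500 is held open until min.
console.log(effectiveLoginDuration(500, { min: 2000, max: 10000 })); // 2000
// A login still running at max is cut off there.
console.log(effectiveLoginDuration(30000, { min: 2000, max: 10000 })); // 10000

// Example configuration using both bounds (values are illustrative).
const browserLoginConfig = {
  login: {
    browserRequest: {
      url: 'https://example.com/secure/login-page',
      username: 'my-id',
      password: 'my-password',
      waitTime: { min: 2000, max: 10000 },
    },
  },
};
```

Raising `min` may help when the login page sets its cookies after a redirect or a delayed script, at the cost of a slower start to each crawl.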
oauthRequest
Make the crawler use the OAuth 2.0 Client Credentials Grant flow to generate an Authorization header.
accessTokenRequest | type: object | Required | Object containing the parameters needed to perform an Access Token Request.
oauthRequest ➔ accessTokenRequest
url | type: string | Required | The URL of the access token endpoint.
grant_type | type: string | Required | OAuth grant type. Must be “client_credentials”.
client_id | type: string | Required | The client identifier.
client_secret | type: string | Required | The client secret.
scope | type: string
extraParameters | type: object | Object containing implementation-specific parameters that aren’t part of the RFC.
accessTokenRequest ➔ extraParameters
resource | type: string | Required parameter for Azure AD v1.0 implementations.