Type: object
Parameter syntax
login: {
  fetchRequest: {
    url: 'your_url',
    requestOptions: {
      ...
    }
  }
}

login: {
  browserRequest: {
    url: 'your_login_page',
    username: 'login',
    password: 'password',
  }
}

login: {
  oauthRequest: {
    accessTokenRequest: {
      url: 'url_of_access_token_endpoint',
      grant_type: 'client_credentials',
      client_id: 'client-identifier',
      client_secret: 'client-secret',
    }
  }
}

About this parameter

This property defines how the crawler acquires a session to access protected content.

The crawler supports multiple ways to authenticate to protected websites, such as:

  • Basic auth through an HTTP request
  • Basic auth by visiting a login page in a web browser and submitting the login form as a human would
  • OAuth 2.0

Basic auth

The crawler extracts the Set-Cookie response header from the login page, stores that cookie and sends it in a Cookie header when crawling all pages of the website defined in the configuration.

This cookie is only fetched at the beginning of each complete crawl. If it expires, it won’t be renewed automatically.

There are two ways the crawler can interact with your login page:

  • By doing a direct request with the credentials to your login endpoint, like a standard curl command.
  • By emulating a web browser, loading your login page, entering the credentials and validating the login form.
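The cookie handling described above can be illustrated with a minimal sketch. It isn't the crawler's actual implementation: `cookieHeaderFromSetCookie` and the sample cookie values are hypothetical, and the sketch only shows how `Set-Cookie` values from a login response map to the value of a `Cookie` request header.

```javascript
// Hypothetical helper (not part of the crawler API): turn the Set-Cookie
// values of a login response into the value of a Cookie request header.
function cookieHeaderFromSetCookie(setCookieValues) {
  return setCookieValues
    // Keep only the leading "name=value" pair. Attributes such as Path,
    // Expires, or HttpOnly are instructions for the client and aren't sent back.
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Sample Set-Cookie values, invented for illustration.
const loginResponseCookies = [
  'session=abc123; Path=/; HttpOnly',
  'csrf=xyz789; Secure',
];

const cookieHeader = cookieHeaderFromSetCookie(loginResponseCookies);
console.log(cookieHeader); // session=abc123; csrf=xyz789
```

Because the cookie is captured once per crawl, a session that outlives your longest crawl is the safest choice on the website side.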

OAuth 2.0

The crawler supports the OAuth 2.0 Client Credentials Grant flow. It performs an Access Token Request using the provided credentials, stores the fetched token in an Authorization header, and sends that header when crawling all pages of the website defined in the configuration.

This token is only fetched at the beginning of each complete crawl. If it expires, it won’t be renewed automatically.

Client authentication is performed by passing the client credentials (client_id and client_secret) in the request body, as described in section 2.3.1 of RFC 6749.
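As a rough illustration of the Access Token Request described above, the sketch below builds a form-encoded request body from an `accessTokenRequest` object. The helper names `buildAccessTokenBody` and `authorizationHeader` are hypothetical, and the `Bearer` scheme is an assumption: this page only states that the token is sent in an Authorization header.

```javascript
// Hypothetical helper (not part of the crawler API): build the form-encoded
// body of an Access Token Request from an accessTokenRequest object.
function buildAccessTokenBody(accessTokenRequest) {
  const {
    grant_type,
    client_id,
    client_secret,
    scope,
    extraParameters = {},
  } = accessTokenRequest;

  // Client credentials travel in the request body.
  const params = new URLSearchParams({ grant_type, client_id, client_secret });
  if (scope) params.set('scope', scope);

  // Implementation-specific parameters, such as `resource`, are appended as is.
  for (const [name, value] of Object.entries(extraParameters)) {
    params.set(name, value);
  }
  return params.toString();
}

// Assumption: the fetched token is sent as a Bearer token.
function authorizationHeader(accessToken) {
  return `Bearer ${accessToken}`;
}

const tokenRequestBody = buildAccessTokenBody({
  url: 'https://example.com/oauth2/token',
  grant_type: 'client_credentials',
  client_id: 'my-client-id',
  client_secret: 'my-client-secret',
});
console.log(tokenRequestBody);
// grant_type=client_credentials&client_id=my-client-id&client_secret=my-client-secret
```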

The following providers are supported. You can reach out if you need to add others.

Examples

{
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-post',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: 'id=my-id&password=my-password',
        timeout: 5000 // in milliseconds
      }
    }
  }
}

  {
    login: {
      browserRequest: {
        url: 'https://example.com/secure/login-page',
        username: 'my-id',
        password: 'my-password',
      }
    }
  }

  {
    login: {
      oauthRequest: {
        accessTokenRequest: {
          url: 'https://example.com/oauth2/token',
          grant_type: 'client_credentials',
          client_id: 'my-client-id',
          client_secret: 'my-client-secret',
          extraParameters: {
            resource: 'https://protected.example.com/'
          }
        }
      }
    }
  }

Parameters

fetchRequest

This allows you to manually craft the login request that the crawler sends.

url
type: string
Required

The URL to target.

requestOptions
type: object

This object is passed to our extended version of the request library.

fetchRequest ➔ requestOptions

method
type: string
default: GET

The HTTP method to use.

headers
type: object
default: {}

HTTP headers to pass.

body
type: string

The body of the request.

timeout
type: number

Time to wait before aborting the request (in milliseconds).
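When the body is form-encoded, credentials that contain reserved characters such as & or = need to be percent-encoded. The sketch below is one way to build the body with the standard URLSearchParams class. The field names `id` and `password` and the credential values are placeholders that depend on your login endpoint.

```javascript
// Placeholder credentials. The field names depend on your login endpoint.
const credentials = { id: 'my-id', password: 'p&ss=word' };

// URLSearchParams percent-encodes reserved characters such as "&" and "=".
const loginBody = new URLSearchParams(credentials).toString();
console.log(loginBody); // id=my-id&password=p%26ss%3Dword

const config = {
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-post',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: loginBody,
        timeout: 5000, // in milliseconds
      },
    },
  },
};
```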

requestOptions ➔ headers

Content-Type
type: string
Authorization
type: string

browserRequest

Make the crawler use a web browser to visit your login page and submit the login form as a human would.

url
type: string
Required

The URL of the login page. The HTML elements expected on this page are input[type=text] or input[type=email] for the username and input[type=password] for the password.

username
type: string
Required

The username to enter in the login form.

password
type: string
Required

The password to enter in the login form.

waitTime
type: object
Optional

Determines the shortest and longest wait time before considering the login done.

browserRequest ➔ waitTime

min
type: number
default: 0
Optional

If the login ends faster than this minimum execution time, the browser remains open at least this long before returning the cookies.

max
type: number
default: 20000
Optional

Once this maximum execution time is reached, execution stops and the cookies are returned as is.
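One way to read min and max together is as a clamp on the time the browser stays open. The function below is a hypothetical model based only on the descriptions above, not the crawler's implementation. It uses the documented defaults (0 and 20000) and treats all values as being in the same unit as waitTime.

```javascript
// Hypothetical model of waitTime: how long the browser stays open before
// the cookies are returned, given how long the login itself took.
function timeUntilCookiesReturned(loginDuration, { min = 0, max = 20000 } = {}) {
  // Never return before `min`, and never wait longer than `max`.
  return Math.min(max, Math.max(min, loginDuration));
}

console.log(timeUntilCookiesReturned(1500, { min: 3000 }));  // 3000
console.log(timeUntilCookiesReturned(8000, { min: 3000 }));  // 8000
console.log(timeUntilCookiesReturned(45000, { min: 3000 })); // 20000
```

Raising min can help when the login page sets its cookies asynchronously after the form is submitted, at the cost of a slower login step.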

oauthRequest

Make the crawler use the OAuth 2.0 Client Credentials Grant flow to generate an Authorization header.

accessTokenRequest
type: object
Required

Object containing the parameters needed to perform an Access Token Request.

oauthRequest ➔ accessTokenRequest

url
type: string
Required

The URL of the access token endpoint.

grant_type
type: string
Required

OAuth grant type. Must be “client_credentials”.

client_id
type: string
Required

The client identifier.

client_secret
type: string
Required

The client secret.

scope
type: string

The scope of the access request.

extraParameters
type: object

Object containing implementation-specific parameters that aren’t part of the RFC.

accessTokenRequest ➔ extraParameters

resource
type: string

Required parameter for Azure AD v1.0 implementations.
