
feat: Add adaptive context helpers #964

Merged: 87 commits, Feb 18, 2025
Commits
d8b2438
WIP
Pijukatel Dec 24, 2024
c12acda
More feasible version of composition.
Pijukatel Dec 31, 2024
623d341
Pass properly all kwargs to subcrawlers
Pijukatel Dec 31, 2024
548349e
Add run decision logic from JS version
Pijukatel Jan 1, 2025
29d510a
Statistics class change led to too many ripple changes.
Pijukatel Jan 1, 2025
04eefd9
Handle sub crawlers loggers
Pijukatel Jan 1, 2025
33efed3
Statistics change to be usable without ignores.
Pijukatel Jan 2, 2025
202aceb
Align use_state with JS implementation.
Pijukatel Jan 2, 2025
d474a94
Remove "fake generics" from Statistics.
Pijukatel Jan 2, 2025
e95ff19
Align result comparator with JS implementation.
Pijukatel Jan 2, 2025
2408d85
Add doc strings.
Pijukatel Jan 2, 2025
d345259
use_state through RequestHandlerRunResult
Pijukatel Jan 3, 2025
bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel Jan 6, 2025
63f278a
Add basic delegation test.
Pijukatel Jan 6, 2025
e190788
Add context test.
Pijukatel Jan 6, 2025
0ecb137
Add tests for use_state and predictor.
Pijukatel Jan 6, 2025
b73c702
Remove unintended edit.
Pijukatel Jan 6, 2025
5c79d0d
Add tests for statistics.
Pijukatel Jan 7, 2025
957915a
Add test for error handling and committing correct results.
Pijukatel Jan 7, 2025
f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel Jan 7, 2025
5256af2
Lint
Pijukatel Jan 7, 2025
2fd7aae
Remove BasicCrawler modifications.
Pijukatel Jan 8, 2025
714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel Jan 8, 2025
b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel Jan 8, 2025
ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel Jan 8, 2025
3b05228
Keep predictor related functions on predictor_state
Pijukatel Jan 8, 2025
dc06490
Unify pre-nav hooks.
Pijukatel Jan 8, 2025
0766b7a
Simplify pre-nav hook common context.
Pijukatel Jan 9, 2025
4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel Jan 9, 2025
bd72a84
Update tests to remove bs references.
Pijukatel Jan 9, 2025
c964d44
Revert accidental Lint edits to website/*.py
Pijukatel Jan 10, 2025
a471395
Review comments.
Pijukatel Jan 14, 2025
dbf6310
Sub crawler timeout handling + test
Pijukatel Jan 15, 2025
8ee8f99
Simplify prenav hooks
janbuchar Jan 15, 2025
c291d3e
Simplify context manager handling
janbuchar Jan 15, 2025
2c23238
Review comments - _run_request_handler + timeouts
Pijukatel Jan 15, 2025
b2a29c1
Statistics.
Pijukatel Jan 15, 2025
0786f87
Make statistics generic again!
Pijukatel Jan 15, 2025
4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 15, 2025
1234ea7
Mock requests in tests.
Pijukatel Jan 16, 2025
e85ec9f
Improve error readability.
Pijukatel Jan 17, 2025
6e8635a
Mock both static and browser requests in tests.
Pijukatel Jan 17, 2025
0e9146a
Create proper example code
Pijukatel Jan 17, 2025
a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 17, 2025
44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel Jan 20, 2025
f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel Jan 21, 2025
64d9e54
WIP
Pijukatel Jan 21, 2025
08fc81f
Update options typed dicts
Pijukatel Jan 21, 2025
9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel Jan 21, 2025
9a14569
Make crawl_one_with private.
Pijukatel Jan 21, 2025
fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel Jan 22, 2025
4221219
Review comments
Pijukatel Jan 22, 2025
565d36b
Remove _run_subcrawler_pipeline
Pijukatel Jan 22, 2025
4bd8251
Remove Orphans
Pijukatel Jan 22, 2025
a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 22, 2025
fc95132
Move SubCrawlerRun to where it is used
Pijukatel Jan 22, 2025
4f316b5
Use custom _TestInput dataclass
Pijukatel Jan 22, 2025
949c4ff
Review comments
Pijukatel Jan 23, 2025
14d6f4f
WIP
Pijukatel Jan 24, 2025
56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel Jan 27, 2025
781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel Jan 27, 2025
a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 27, 2025
23f1494
Context helper for bs + test.
Pijukatel Jan 27, 2025
2ae55ac
parsel wip
Pijukatel Jan 27, 2025
b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel Jan 28, 2025
0b7bfca
Draft of context helpers both for Parsel and BeautifulSoup
Pijukatel Jan 28, 2025
8bce425
Use context result map for handling request handler results
Pijukatel Jan 29, 2025
5f8c26c
Review comments based comments
Pijukatel Jan 30, 2025
e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 30, 2025
aacb90a
Integrate RenderingTypePredictor
Pijukatel Jan 30, 2025
1970276
Finalize tests for context helpers.
Pijukatel Jan 29, 2025
dc286be
Parser with two generics variant
Pijukatel Jan 30, 2025
e164fa1
Wrapping parser only for adaptive crawler variant
Pijukatel Jan 30, 2025
e65d883
Revert "Wrapping parser only for adaptive crawler variant"
Pijukatel Jan 30, 2025
ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel Jan 30, 2025
2bf43f6
Finalize exports and re exports
Pijukatel Jan 30, 2025
9de0b5c
Docs (guide, examples, docstrings)
Pijukatel Jan 30, 2025
8ffa499
Merge branch 'adaptive-PwCrawler' into adaptive-context-helpers
Pijukatel Jan 31, 2025
e133712
Docs after merge from main adaptive-crawler branch
Pijukatel Jan 31, 2025
cdb3181
Merge remote-tracking branch 'origin/master' into adaptive-context-he…
Pijukatel Feb 7, 2025
4effa19
Docs, docstrings and comments edits.
Pijukatel Feb 7, 2025
410dcff
Review comments
Pijukatel Feb 10, 2025
109847d
TODO: Add test for autoparsing
Pijukatel Feb 10, 2025
39bb7f8
Review comments
Pijukatel Feb 11, 2025
3c5a2ae
Merge remote-tracking branch 'origin/master' into adaptive-context-he…
Pijukatel Feb 11, 2025
3aff36f
Review comments
Pijukatel Feb 13, 2025
760f272
Update type hint to not break docusaurus
Pijukatel Feb 13, 2025
34 changes: 19 additions & 15 deletions docs/examples/code/adaptive_playwright_crawler.py
@@ -1,4 +1,5 @@
import asyncio
from datetime import timedelta

from playwright.async_api import Route

@@ -10,40 +11,43 @@


async def main() -> None:
# Crawler created by following factory method will use `beautifulsoup`
# for parsing static content.
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False}
)

@crawler.router.handler(label='label')
@crawler.router.default_handler
async def request_handler_for_label(
context: AdaptivePlaywrightCrawlingContext,
) -> None:
# Do some processing using `page`
some_locator = context.page.locator('div').first
await some_locator.wait_for()
# Do stuff with locator...
context.log.info(f'Playwright processing of: {context.request.url} ...')

@crawler.router.default_handler
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
context.log.info(f'User handler processing: {context.request.url} ...')
# Do some processing using `parsed_content`
context.log.info(context.parsed_content.title)

# Locate element h2 within 5 seconds
h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
# Do stuff with element found by the selector
context.log.info(h2)

# Find more links and enqueue them.
await context.enqueue_links()
await context.push_data({'Top crawler Url': context.request.url})
# Save some data.
await context.push_data({'Visited url': context.request.url})

@crawler.pre_navigation_hook
async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
"""Hook executed both in static sub crawler and playwright sub crawler."""
# Trying to access context.page in this hook would raise `AdaptiveContextError`
# for pages crawled without playwright.
"""Hook executed both in static sub crawler and playwright sub crawler.

Trying to access `context.page` in this hook would raise `AdaptiveContextError`
for pages crawled without playwright."""
context.log.info(f'pre navigation hook for: {context.request.url} ...')

@crawler.pre_navigation_hook(playwright_only=True)
async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
"""Hook executed only in playwright sub crawler."""
"""Hook executed only in playwright sub crawler.

It is safe to access `page` object.
"""

async def some_routing_function(route: Route) -> None:
await route.continue_()
20 changes: 20 additions & 0 deletions docs/examples/playwright_crawler_adaptive.mdx
@@ -0,0 +1,20 @@
---
id: adaptive-playwright-crawler
title: AdaptivePlaywrightCrawler
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import AdaptivePlaywrightCrawlerExample from '!!raw-loader!./code/adaptive_playwright_crawler.py';

This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit.

A [pre-navigation hook](/python/docs/guides/adaptive-playwright-crawler#page-configuration-with-pre-navigation-hooks) can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks are executed for pages crawled by both the HTTP-based sub crawler and the playwright-based sub crawler. Use `playwright_only=True` to mark hooks that should be executed only by the playwright sub crawler.

For a more detailed description, please see the [AdaptivePlaywrightCrawler guide](/python/docs/guides/adaptive-playwright-crawler 'AdaptivePlaywrightCrawler guide').

<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerExample}
</CodeBlock>
@@ -0,0 +1,13 @@
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.router.default_handler
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
# Locate element h2 within 5 seconds
h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
# Do stuff with element found by the selector
context.log.info(h2)
@@ -0,0 +1,10 @@
from crawlee.crawlers import AdaptivePlaywrightCrawler

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
# Arguments relevant only for PlaywrightCrawler
playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
# Arguments relevant only for BeautifulSoupCrawler
static_crawler_specific_kwargs={'additional_http_error_status_codes': [204]},
# Common arguments relevant to all crawlers
max_crawl_depth=5,
)
@@ -0,0 +1,10 @@
from crawlee.crawlers import AdaptivePlaywrightCrawler

crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
# Arguments relevant only for PlaywrightCrawler
playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
# Arguments relevant only for ParselCrawler
static_crawler_specific_kwargs={'additional_http_error_status_codes': [204]},
# Common arguments relevant to all crawlers
max_crawl_depth=5,
)
@@ -0,0 +1,61 @@
from crawlee import Request
from crawlee._types import RequestHandlerRunResult
from crawlee.crawlers import (
AdaptivePlaywrightCrawler,
RenderingType,
RenderingTypePrediction,
RenderingTypePredictor,
)


class CustomRenderingTypePredictor(RenderingTypePredictor):
def __init__(self) -> None:
self._learning_data = list[tuple[Request, RenderingType]]()

def predict(self, request: Request) -> RenderingTypePrediction:
# Some custom logic that produces some `RenderingTypePrediction`
# based on the `request` input.
rendering_type: RenderingType = (
'static' if 'abc' in request.url else 'client only'
)

return RenderingTypePrediction(
# Recommends `static` rendering type -> HTTP-based sub crawler will be used.
rendering_type=rendering_type,
# Recommends that both sub crawlers should run with 20% chance. When both sub
# crawlers are running, the predictor can compare results and learn.
# High number means that predictor is not very confident about the
# `rendering_type`, low number means that predictor is very confident.
detection_probability_recommendation=0.2,
)

def store_result(self, request: Request, rendering_type: RenderingType) -> None:
# This function allows predictor to store new learning data and retrain itself
# if needed. `request` is input for prediction and `rendering_type` is the correct
# prediction.
self._learning_data.append((request, rendering_type))
# retrain


def result_checker(result: RequestHandlerRunResult) -> bool:
# Some function that inspects produced `result` and returns `True` if the result
# is correct.
return bool(result) # Check something on result


def result_comparator(
result_1: RequestHandlerRunResult, result_2: RequestHandlerRunResult
) -> bool:
# Some function that inspects two results and returns `True` if they are
# considered equivalent. It is used when comparing results produced by HTTP-based
# sub crawler and playwright based sub crawler.
return (
result_1.push_data_calls == result_2.push_data_calls
) # For example compare `push_data` calls.


crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
rendering_type_predictor=CustomRenderingTypePredictor(),
result_checker=result_checker,
result_comparator=result_comparator,
)
@@ -0,0 +1,29 @@
from playwright.async_api import Route

from crawlee.crawlers import (
AdaptivePlaywrightCrawler,
AdaptivePlaywrightPreNavCrawlingContext,
)

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.pre_navigation_hook
async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
"""Hook executed both in static sub crawler and playwright sub crawler.

Trying to access `context.page` in this hook would raise `AdaptiveContextError`
for pages crawled without playwright."""

context.log.info(f'pre navigation hook for: {context.request.url}')


@crawler.pre_navigation_hook(playwright_only=True)
async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
"""Hook executed only in playwright sub crawler."""

async def some_routing_function(route: Route) -> None:
await route.continue_()

await context.page.route('*/**', some_routing_function)
context.log.info(f'Playwright only pre navigation hook for: {context.request.url}')
96 changes: 96 additions & 0 deletions docs/guides/playwright_crawler_adaptive.mdx
@@ -0,0 +1,96 @@
---
id: adaptive-playwright-crawler
title: AdaptivePlaywrightCrawler
description: How to use the AdaptivePlaywrightCrawler.
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

import AdaptivePlaywrightCrawlerInitBeautifulSoup from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_beautifulsoup.py';
import AdaptivePlaywrightCrawlerInitParsel from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_parsel.py';
import AdaptivePlaywrightCrawlerInitPrediction from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_prediction.py';
import AdaptivePlaywrightCrawlerHandler from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_handler.py';
import AdaptivePlaywrightCrawlerPreNavHooks from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_pre_nav_hooks.py';



An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit.

Detection is based on the <ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink>, with <ApiLink to="class/DefaultRenderingTypePredictor">`DefaultRenderingTypePredictor`</ApiLink> as the default implementation. It predicts which crawling method should be used and learns from already crawled pages.
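As a rough sketch of this learning loop (illustrative only; the toy class below is not `DefaultRenderingTypePredictor`, and all names in it are invented), a predictor might key its guesses by domain and grow more confident as observed results accumulate:

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


class ToyRenderingTypePredictor:
    """Illustrative only: guesses per domain and learns from stored results."""

    def __init__(self) -> None:
        self._seen = defaultdict(Counter)

    def predict(self, url: str) -> tuple[str, float]:
        counts = self._seen[urlparse(url).netloc]
        if not counts:
            # Nothing learned yet: guess 'client only' (the safe option) and
            # recommend detection probability 1.0 so both sub crawlers run.
            return 'client only', 1.0
        rendering_type, hits = counts.most_common(1)[0]
        # More observations -> more confidence -> lower detection probability.
        return rendering_type, 1.0 / (1 + hits)

    def store_result(self, url: str, rendering_type: str) -> None:
        # Called with the rendering type that turned out to be correct.
        self._seen[urlparse(url).netloc][rendering_type] += 1


predictor = ToyRenderingTypePredictor()
predictor.store_result('https://example.com/a', 'static')
predictor.store_result('https://example.com/b', 'static')
rendering_type, detection_probability = predictor.predict('https://example.com/c')
```

With two `'static'` observations for `example.com`, the toy predictor recommends `'static'` with a detection probability of 1/3 for further pages on that domain.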

## When to use AdaptivePlaywrightCrawler

Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, but for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.
Review comment (Collaborator): This is one use case, the other is that you can use it when you just want to perform selector-based data extraction and don't want to investigate the target website to find out the rendering type 🙂

Reply (Author): Added.


Another use case is performing selector-based data extraction without prior knowledge of whether the selector exists in the static page or is added dynamically by code executed in the browser.

## Request handler and adaptive context helpers

The request handler for <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> works with a special context type - <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink>. This context is sometimes created by the HTTP-based sub crawler and sometimes by the playwright-based sub crawler. Due to its dynamic nature, you can't always access the [page](https://playwright.dev/python/docs/api/class-page) object. To overcome this limitation, the context offers helper methods that can be called regardless of how it was created.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> accepts a `css` selector as its first argument and a timeout as its second. The function tries to locate the selector and returns once it is found, within the timeout. In practice, this means that if the HTTP-based sub crawler was used, the function finds the selector only if it is part of the static content. If it is not, the adaptive crawler falls back to the playwright sub crawler and tries to locate the selector within the timeout using playwright.
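The fallback behaviour can be pictured with a minimal sketch (a simplified stand-in, not the crawlee implementation; the function name, its signature, and the substring matching are invented for illustration):

```python
def locate_with_fallback(static_html: str, selector_tag: str, rendered_html: str) -> str:
    """Return which sub crawler satisfied the selector: 'static' or 'client only'.

    Illustrative stand-in for the `wait_for_selector` fallback behaviour.
    """
    # Cheap path: the element is already in the static content, so the
    # HTTP-based sub crawler is enough.
    if selector_tag in static_html:
        return 'static'
    # Fallback: re-crawl with playwright and wait for the selector in the
    # rendered page (real timeout handling is omitted in this sketch).
    if selector_tag in rendered_html:
        return 'client only'
    raise TimeoutError(f'{selector_tag} not found within timeout')


static_only = locate_with_fallback('<h2>News</h2>', '<h2>', '<h2>News</h2>')
needs_browser = locate_with_fallback('<div></div>', '<h2>', '<h2>News</h2>')
```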

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink> accepts a `css` selector as its first argument and a timeout as its second. It acts like `wait_for_selector`, but also returns the first matching element if one is found. The return type is determined by the HTTP-based sub crawler in use. For example, it will be `Selector` for <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and `Tag` for <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_all">`query_selector_all`</ApiLink> is the same as <ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink>, but returns all matching elements.
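The difference between the two query helpers is just "first match" versus "all matches". A plain-Python stand-in (hedged sketch only; the real helpers match CSS selectors and return parser-specific types such as parsel's `Selector` or BeautifulSoup's `Tag`, not strings) could look like:

```python
import re


def query_selector_all_sketch(html: str, tag: str) -> list:
    # Stand-in for CSS matching: grab the text of every <tag>...</tag> pair.
    return re.findall(rf'<{tag}>(.*?)</{tag}>', html)


def query_selector_one_sketch(html: str, tag: str) -> 'str | None':
    # Same matching, but only the first hit is returned.
    matches = query_selector_all_sketch(html, tag)
    return matches[0] if matches else None


HTML = '<h2>First</h2><p>text</p><h2>Second</h2>'
first = query_selector_one_sketch(HTML, 'h2')
all_matches = query_selector_all_sketch(HTML, 'h2')
```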

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#parse_with_static_parser">`parse_with_static_parser`</ApiLink> re-parses the whole page. The return type is determined by the HTTP-based sub crawler in use. It has two optional arguments, `selector` and `timeout`; if they are passed, the function first calls <ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> and only then parses. This can be used in scenarios where a specific element signals that the page is complete.
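The "wait for a signal element, then re-parse everything" pattern could be sketched like this (illustrative only; the function and its trivial string-based parsing are invented, while the real helper delegates to the configured static parser and handles timeouts):

```python
def parse_when_ready(html: str, signal: str) -> dict:
    """Sketch of `parse_with_static_parser` with a `selector` argument:
    parse the whole page only once a signalling element is present."""
    if signal not in html:
        # The real helper would wait (falling back to playwright if needed)
        # until the selector appears or the timeout expires.
        raise TimeoutError(f'{signal} never appeared')
    # The page is complete, so re-parse the full document. A trivial
    # "parser" stands in here for the configured static parser.
    title = html.split('<title>')[1].split('</title>')[0]
    return {'title': title}


parsed = parse_when_ready('<title>Done</title><div id="loaded"></div>', 'id="loaded"')
```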

See the following example of how to create a request handler and use the context helpers:

<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerHandler}
</CodeBlock>


## Crawler configuration
To use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, it is recommended to use one of the prepared factory methods that create the crawler with a specific HTTP-based sub crawler variant: <ApiLink to="class/AdaptivePlaywrightCrawler#with_beautifulsoup_static_parser">`AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser`</ApiLink> or <ApiLink to="class/AdaptivePlaywrightCrawler#with_parsel_static_parser">`AdaptivePlaywrightCrawler.with_parsel_static_parser`</ApiLink>.

<ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is internally composed of two sub crawlers, and you can configure both of them in detail. For the configuration options of the sub crawlers, refer to their pages: <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.

In the following example, you can see how to create and configure an <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> with two different HTTP-based sub crawlers:


<Tabs>
<TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler" default>
<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerInitBeautifulSoup}
</CodeBlock>
</TabItem>
<TabItem value="ParselCrawler" label="ParselCrawler">
<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerInitParsel}
</CodeBlock>
</TabItem>
</Tabs>

### Prediction related arguments

To control which pages are crawled by which method, you can use the following arguments:

<ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink> - A class that recommends which sub crawler should be used for a specific URL. From time to time, the predictor also recommends running both sub crawlers for a page, to verify that its recommendation was correct. The predictor should be able to learn from previous results and gradually give more reliable recommendations.

`result_checker` - A function that checks the result created by crawling a page. By default, it always returns `True`.

`result_comparator` - A function that compares two results (the HTTP-based sub crawler result and the playwright-based sub crawler result) and returns `True` if they are considered the same. By default, it compares the `push_data` calls made by each sub crawler. This function is used by the `rendering_type_predictor` to evaluate whether the HTTP-based sub crawler produces the same results as the playwright-based sub crawler.

See the following example of how to pass the prediction-related arguments:

<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerInitPrediction}
</CodeBlock>



## Page configuration with pre-navigation hooks
In some use cases, you may need to configure the [page](https://playwright.dev/python/docs/api/class-page) before it navigates to the target URL. For instance, you might set navigation timeouts or manipulate other page-level settings. For such cases, you can use the <ApiLink to="class/AdaptivePlaywrightCrawler#pre_navigation_hook">`pre_navigation_hook`</ApiLink> method of the <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. This method is called before the page navigates to the target URL and allows you to configure the page instance. Due to the dynamic nature of <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, a hook may be executed by either the HTTP-based sub crawler or the playwright-based sub crawler. Using the [page](https://playwright.dev/python/docs/api/class-page) object in a hook executed by the HTTP-based sub crawler raises an exception. To avoid this, pass the optional argument `playwright_only=True` when registering the hook.

See the following example of how to register pre-navigation hooks:

<CodeBlock className="language-python">
{AdaptivePlaywrightCrawlerPreNavHooks}
</CodeBlock>