
feat: Add adaptive context helpers #964

Merged
merged 87 commits into from
Feb 18, 2025
Merged

Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Feb 7, 2025

Description

Add adaptive context helpers and documentation for AdaptivePlaywrightCrawler.

Issues

Add method to BasicCrawler to handle just one request.
Add statistics

TODO:
Make mypy happy about statistics.
Wrap existing statistics from init in adaptive statistics.
Silence sub-crawler statistics and loggers in general. (Set level to error?)
Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.
Handle use state.
Pre-navigation hooks delegation to sub crawler hooks.
Statistics were marked as generics, but in reality were not.
Hardcoding state_model to make it explicit and clear.
WIP KVS handling. Currently it does not go through Result handler.
Fix wrong id for predictor_state persistence.
Add test for pre nav hook
Add test for statistics in crawler init
…nternals of sub crawlers)

Cleanup commit results.
Add it in adaptive crawler instead at the cost of accessing many private members.
Use different url for unit tests.
(By temporal wrapper context.)
This adds some complexity, but adds more flexibility.
Add one sanity check test for parsel variant.
Update some doc strings.
@github-actions github-actions bot added this to the 107th sprint - Tooling team milestone Feb 7, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Feb 7, 2025
@Pijukatel Pijukatel marked this pull request as ready for review February 7, 2025 07:50
@Pijukatel Pijukatel requested review from vdusek and janbuchar February 7, 2025 07:51
Comment on lines 11 to 12
This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
Collaborator
An explanation of what AdaptivePlaywrightCrawler is doing is missing. It seems like you assume the readers already know what it is.

Collaborator
@janbuchar janbuchar Feb 7, 2025
This is just an example though... A link to the appropriate guide would be nice.

edit - I see it's at the bottom of the example.

This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.

A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only by the Playwright sub-crawler.
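The `playwright_only` flag described above can be sketched as a small hook registry. This is a simplified illustration with hypothetical names, not the actual crawlee implementation:

```python
import asyncio
from typing import Awaitable, Callable

Hook = Callable[[dict], Awaitable[None]]

class HookRegistry:
    """Toy registry of pre-navigation hooks with a playwright_only flag."""

    def __init__(self) -> None:
        self._hooks: list[tuple[Hook, bool]] = []

    def pre_navigation_hook(self, hook: Hook, *, playwright_only: bool = False) -> None:
        self._hooks.append((hook, playwright_only))

    async def run_hooks(self, context: dict, *, using_playwright: bool) -> None:
        # Hooks marked playwright_only are skipped for the HTTP-based sub-crawler.
        for hook, playwright_only in self._hooks:
            if playwright_only and not using_playwright:
                continue
            await hook(context)
```

The adaptive crawler would call `run_hooks` once per request, with `using_playwright` reflecting which sub-crawler was chosen for that request.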
Collaborator
pre-navigation hook - maybe link to API docs?

Contributor Author
In this rare case I think it would not be suitable to add the link to API docs. AdaptivePlaywrightCrawler#pre_navigation_hook is showing the parametrized decorator doc page and even though it is correct, I think it would be confusing to regular users.
For example return type is Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None] :-)

So I would suggest to show link to the guide page that shows how to use hooks and describes them in more detail:
...python/docs/guides/adaptive-playwright-crawler#page-configuration-with-pre-navigation-hooks

Comment on lines 49 to 50
parsed_content: content where the page element will be located
selector: css selector
Collaborator
Use sentences please.

Contributor Author
Fixed

Comment on lines 93 to 96
"""Adaptive crawler that uses both specific implementation of `AbstractHttpCrawler` and `PlaywrightCrawler`.

It tries to detect whether it is sufficient to crawl without browser (which is faster) or if
`PlaywrightCrawler` should be used (in case previous method did not work as expected for specific url.).
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it
may be possible.
Collaborator
This is an important docstring, and it is really not very helpful if I do not already know what the class should be.

I would imagine something like:

  • The first line - one sentence description of the class without referencing other things.
  • Multiline paragraph description, with other references.

Check the others for inspiration.

Contributor Author
Updated

Comment on lines 85 to 86
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.
Collaborator
Sentences

Contributor Author
Fixed

Comment on lines 98 to 99
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.
Collaborator
Sentences

Contributor Author
Fixed

timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
`TStaticSelectResult` which is result of used static parser `select` method.
Collaborator
Don't be redundant; the type is already in the annotation and will be rendered anyway.

Contributor Author
Fixed

Comment on lines 125 to 130
Args:
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
`TStaticParseResult` which is result of used static parser `parse_text` method.
Collaborator
same as above

Contributor Author
Fixed


A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only by the Playwright sub-crawler.

For more detailed description please see [AdaptivePlaywrightCrawler guide](/python/docs/guides/adaptive-playwright-crawler 'Network payload investigation.')
Collaborator
Not sure what the 'Network payload investigation.' is?

Contributor Author
Neither do I :-)
Fixed.

Comment on lines 20 to 21
An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
Collaborator
Suggested change:
- An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
- It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
+ An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+ It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit.

Contributor Author
Updated.


## When to use AdaptivePlaywrightCrawler

Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, but for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.
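The adaptive choice between the browser and plain HTTP can be illustrated with a toy rendering-type predictor. This is purely illustrative; the real predictor in crawlee is more sophisticated, and the class and method names here are hypothetical:

```python
from urllib.parse import urlparse

class RenderingTypePredictor:
    """Remembers, per hostname, whether HTTP-only crawling produced usable results."""

    def __init__(self) -> None:
        self._needs_browser: dict[str, bool] = {}

    def predict(self, url: str) -> str:
        host = urlparse(url).hostname or ''
        # Default to the browser until we learn that HTTP-only is sufficient.
        if self._needs_browser.get(host, True):
            return 'playwright'
        return 'http'

    def store_result(self, url: str, *, http_was_sufficient: bool) -> None:
        host = urlparse(url).hostname or ''
        self._needs_browser[host] = not http_was_sufficient
```

Each crawled request feeds its outcome back via `store_result`, so later requests to the same site can skip the browser when plain HTTP worked before.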
Collaborator
This is one use case, the other is that you can use it when you just want to perform selector-based data extraction and don't want to investigate the target website to find out the rendering type 🙂

Contributor Author
Added.

@@ -73,17 +76,76 @@ def response(self) -> Response:
raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
return self._response

async def wait_for_selector(self, selector: str, timeout: timedelta = timedelta(seconds=5)) -> None:
"""Locate element by css selector a return once it is found.
Collaborator
I think this does not return anything... right?

Contributor Author
It says just "return", which is a subtle but important difference from "return it" — still, a docstring is bad if it can be misinterpreted.
Rephrased to be more explicit and prevent misinterpretation.

Comment on lines 104 to 106
static_content = await self._static_parser.select(self.parsed_content, selector)
if static_content is not None:
return static_content
Collaborator
I'm not sure about this. If the element changes dynamically, this method will always return the version from the initial HTTP response. That might surprise users.

Contributor Author
In a scenario where the element exists in the static page but will change dynamically in the future, the adaptive crawler has no chance of seeing this future change, so it will pick the static value from the present. I am having a hard time imagining the scenario for this.

The only solution that comes to my mind is keeping an optional attribute "latest_parsed_content" that would default to parsed_content but would be updated each time someone explicitly re-parsed the content, or you would have to re-parse the page at the beginning of each call to make sure the select runs on the most up-to-date version. So there could be quite a performance penalty for this.

I was under the impression that this is the reason for having the parse_with_static_parser context helper.

Contributor Author
Updated with automatic re-parsing
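The automatic re-parsing approach settled on above could look roughly like this sketch, which re-parses the live page content before each select when a browser page is available. All class names and the fake parser/page are hypothetical stand-ins, not the actual crawlee internals:

```python
import asyncio

class FakeParser:
    """Stand-in static parser: 'parsing' is the identity, 'select' a dict lookup."""
    def parse(self, raw):
        return raw
    def select(self, content, selector):
        return content.get(selector)

class FakePage:
    """Stand-in browser page whose DOM can change between calls."""
    def __init__(self, dom):
        self.dom = dom
    async def content(self):
        return self.dom

class AdaptiveSelectHelper:
    """Selects elements, re-parsing the latest content when a browser page exists."""
    def __init__(self, static_parser, initial_content, page=None):
        self._static_parser = static_parser
        self._parsed_content = initial_content
        self._page = page  # None when crawling HTTP-only

    async def query_selector(self, selector):
        if self._page is not None:
            # Re-parse the current DOM so dynamic changes are not missed,
            # at the cost of a re-parse on every call.
            self._parsed_content = self._static_parser.parse(await self._page.content())
        return self._static_parser.select(self._parsed_content, selector)
```

With a browser page attached, the helper sees DOM updates; in HTTP-only mode it keeps serving the initially parsed content.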

)
if parsed_selector is not None:
return parsed_selector
raise AdaptiveContextError('Used selector is not a valid static selector')
Collaborator
Can't this mean that the selector simply did not match?

Contributor Author
@Pijukatel Pijukatel Feb 10, 2025
If the selector did not match, then you would get playwright._impl._errors.TimeoutError even before, when waiting for it.

This AdaptiveContextError happens in a very rare case:
Waiting for the selector is done through Playwright.
Parsing is done through the static parser, which then uses the selector as well.

So in theory it can happen that the user picks a selector supported by Playwright but not by the static parser. Playwright will find the selector and wait for it; the static parser will manage to parse the page, but it will fail to run select with such an unsupported selector.

Playwright-supported selectors and static-parser-supported selectors are two distinct sets with some overlap.

So you will get this error only if you picked an invalid static-parser selector that is a valid Playwright selector which actually matched something.
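The "two distinct selector sets with some overlap" situation can be sketched by modeling the static parser's capabilities as a capability check. The hint list is hypothetical, though Playwright pseudo-classes such as `:has-text()` are a real example of selectors a CSS-only static parser cannot evaluate:

```python
# Hypothetical markers of selectors only the browser engine understands.
PLAYWRIGHT_ONLY_HINTS = (':has-text(', ':visible', 'text=')

class AdaptiveContextError(Exception):
    pass

def static_parser_supports(selector: str) -> bool:
    """Stand-in for 'the static parser can evaluate this selector'."""
    return not any(hint in selector for hint in PLAYWRIGHT_ONLY_HINTS)

def select_with_static_parser(selector: str) -> str:
    # Assume Playwright already waited for the selector and matched something.
    if not static_parser_supports(selector):
        # The browser found the element, but the static parser cannot even
        # evaluate the selector -> the rare error case described above.
        raise AdaptiveContextError('Used selector is not a valid static selector')
    return f'parsed <{selector}>'
```

A plain CSS selector succeeds on both engines, while a Playwright-only selector triggers the error even though the browser matched it.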

Collaborator
I think we're not completely aligned on how the method should work. The JS version waits for the element when it's running in the browser and delegates to the static parser otherwise. If the static parser doesn't find anything, that's a valid result. It should be perfectly possible to make two branches here based on self._page is None

I'm sorry if I caused a misunderstanding how this should work.

Contributor Author
@Pijukatel Pijukatel Feb 13, 2025
There are the following possible scenarios:

  1. static_parser does not find the selector and the browser does not find the selector -> will raise playwright._impl._errors.TimeoutError
  2. static_parser finds the selector immediately (whether the browser finds it as well is then irrelevant, so two scenarios collapse into one) -> will return the parsed selector
  3. static_parser does not find the selector immediately and the browser finds it after a while. static_parser finds it after the wait time as well -> will return the parsed selector
  4. static_parser does not find the selector immediately and the browser finds it after a while. static_parser does not find it even after the wait time -> will raise AdaptiveContextError('Used selector is not a valid static selector'), because the element actually exists in the page and was found by the browser using the selector, but it was not found using static_parser, because it does not support such a selector.

The branch where self._page is None and the static parser does not find anything is "hidden" from the user, as it will raise AdaptiveContextError; this forces the AdaptiveCrawler to use Playwright instead, which leads to scenario 1, 3 or 4.
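The four scenarios can be traced through a simplified version of the waiting logic. The exception classes and callables below are fakes standing in for Playwright and the static parser, not the real implementation:

```python
import asyncio

class PlaywrightTimeoutError(Exception):
    """Stand-in for playwright._impl._errors.TimeoutError."""

class AdaptiveContextError(Exception):
    pass

async def wait_for_selector_sketch(static_select, page_wait, selector):
    # Scenario 2: the static parser finds the selector immediately.
    if (result := static_select(selector)) is not None:
        return result
    # Scenario 1: neither engine finds it -> page_wait raises a timeout here.
    await page_wait(selector)
    # Scenario 3: after the browser found it, the static parser finds it too.
    if (result := static_select(selector)) is not None:
        return result
    # Scenario 4: the browser matched, but the static parser cannot.
    raise AdaptiveContextError('Used selector is not a valid static selector')
```

Plugging in fakes for `static_select` and `page_wait` exercises each of the four outcomes.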

Collaborator
@vdusek vdusek left a comment
LGTM

@janbuchar
Collaborator

@TC-MO could you please check the guide for style issues and general readability? I'm heavily biased since I already understand the feature 🙂

@janbuchar janbuchar requested a review from TC-MO February 12, 2025 12:37
Collaborator
@janbuchar janbuchar left a comment
Nice, I have a couple of nitpicks and one more serious comment.


@Pijukatel Pijukatel requested a review from janbuchar February 13, 2025 16:22
@@ -32,8 +32,8 @@ def is_matching_selector(self, parsed_content: Tag, selector: str) -> bool:
return parsed_content.select_one(selector) is not None

@override
async def select(self, parsed_content: Tag, selector: str) -> Tag | None:
return parsed_content.select_one(selector)
async def select(self, parsed_content: Tag, selector: str) -> tuple[Tag, ...]:
Collaborator
this is breaking - are you sure we want to go through with it? why not expose the one/all variants while we're at it?

Contributor Author
@Pijukatel Pijukatel Feb 14, 2025
It is not breaking; this method is added in this PR. I am not sure about the benefit of adding one/all variants to the Parser. BeautifulSoup has it; Parsel does not have it on the Selector level. It returns a SelectorList, and the one/all choice is done on the next level, in data extraction: SelectorList.get() / SelectorList.getall().

In the end it is just a convenience method, so I think it is fine to have it only on the final user-facing context.
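The design trade-off discussed here, returning all matches as a tuple and leaving the one/all choice to the caller, can be shown with a minimal toy parser (illustrative only; the dict stands in for real parsed HTML):

```python
class TupleSelectParser:
    """Toy parser whose select always returns a tuple of matches."""

    def __init__(self, document: dict):
        # 'document' maps selectors to matched texts, standing in for real HTML.
        self._document = document

    def select(self, selector: str) -> tuple:
        return tuple(self._document.get(selector, []))

parser = TupleSelectParser({'a.link': ['first', 'second']})
matches = parser.select('a.link')
first = matches[0] if matches else None  # the "one" variant, chosen by the caller
all_matches = list(matches)              # the "all" variant
```

An empty tuple is a valid "no match" result, so callers never need to distinguish `None` from a missing element.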

Collaborator
Right, sorry! In the longer run, I'd like to see a unified interface for picking an element by selector on all crawling context variants. I'm not sure if this is the right PR to do this, but we should be careful not to shoot ourselves in the foot here... so that we can design something uniform later.

@Pijukatel Pijukatel merged commit e248f17 into master Feb 18, 2025
23 checks passed
@Pijukatel Pijukatel deleted the adaptive-context-helpers branch February 18, 2025 07:56
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Feb 19, 2025
### Description

Add adaptive context helpers and documentation for
AdaptivePlaywrightCrawler.

### Issues

- Closes: apify#249

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>
Successfully merging this pull request may close these issues.

Adaptive playwright crawler