
feat: Add adaptive context helpers #964

Merged
merged 87 commits into from
Feb 18, 2025
Merged

Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Feb 7, 2025

Description

Add adaptive context helpers and documentation for AdaptivePlaywrightCrawler.

Issues

Add method to BasicCrawler to handle just one request.
Add statistics

TODO:
Make mypy happy about statistics.
Wrap existing statistics from init in adaptive statistics.
Silence sub-crawler statistics and loggers in general. (Set level to error?)
Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.
Handle use state.
Pre-navigation hooks delegation to sub crawler hooks.
Statistics were marked as generics, but in reality were not.
Hardcoding state_model to make it explicit and clear.
WIP KVS handling. Currently it does not go through Result handler.
Fix wrong id for predictor_state persistence.
Add test for pre nav hook
Add test for statistics in crawler init
…nternals of sub crawlers)

Cleanup commit results.
Add it in adaptive crawler instead at the cost of accessing many private members.
Use different url for unit tests.
(By temporal wrapper context.)
This adds some complexity, but adds more flexibility.
Add one sanity check test for parsel variant.
Update some doc strings.
@github-actions github-actions bot added this to the 107th sprint - Tooling team milestone Feb 7, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Feb 7, 2025
@Pijukatel Pijukatel marked this pull request as ready for review February 7, 2025 07:50
@Pijukatel Pijukatel requested review from vdusek and janbuchar February 7, 2025 07:51
Comment on lines 11 to 12
This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
Collaborator
An explanation of what AdaptivePlaywrightCrawler is doing is missing. It seems like you assume the readers already know what it is.

Collaborator
@janbuchar janbuchar Feb 7, 2025
This is just an example though... A link to the appropriate guide would be nice.

edit - I see it's at the bottom of the example.

This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.

A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only by the Playwright sub-crawler.
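The `playwright_only` flag described above can be sketched as a small hook registry. This is a simplified illustration with hypothetical names, not the actual crawlee implementation:

```python
import asyncio
from typing import Awaitable, Callable

Hook = Callable[[dict], Awaitable[None]]

class HookRegistry:
    """Toy registry of pre-navigation hooks with a playwright_only flag."""

    def __init__(self) -> None:
        self._hooks: list[tuple[Hook, bool]] = []

    def pre_navigation_hook(self, hook: Hook, *, playwright_only: bool = False) -> None:
        self._hooks.append((hook, playwright_only))

    async def run_hooks(self, context: dict, *, using_playwright: bool) -> None:
        # Hooks marked playwright_only are skipped for the HTTP-based sub-crawler.
        for hook, playwright_only in self._hooks:
            if playwright_only and not using_playwright:
                continue
            await hook(context)
```

The adaptive crawler would call `run_hooks` once per request, with `using_playwright` reflecting which sub-crawler was chosen for that request.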
Collaborator
pre-navigation hook - maybe link to API docs?

Contributor Author
In this rare case I think it would not be suitable to add the link to API docs. AdaptivePlaywrightCrawler#pre_navigation_hook is showing the parametrized decorator doc page and even though it is correct, I think it would be confusing to regular users.
For example return type is Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None] :-)

So I would suggest to show link to the guide page that shows how to use hooks and describes them in more detail:
...python/docs/guides/adaptive-playwright-crawler#page-configuration-with-pre-navigation-hooks

Comment on lines 49 to 50
parsed_content: content where the page element will be located
selector: css selector
Collaborator
Use sentences please.

Contributor Author
Fixed

Comment on lines 93 to 96
"""Adaptive crawler that uses both specific implementation of `AbstractHttpCrawler` and `PlaywrightCrawler`.

It tries to detect whether it is sufficient to crawl without browser (which is faster) or if
`PlaywrightCrawler` should be used (in case previous method did not work as expected for specific url.).
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it
may be possible.
Collaborator
This is an important docstring, and it is really not very helpful if I do not already know what the class should be.

I would imagine something like:

  • The first line - one sentence description of the class without referencing other things.
  • Multiline paragraph description, with other references.

Check the others for inspiration.

Contributor Author
Updated

Comment on lines 85 to 86
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.
Collaborator
Sentences

Contributor Author
Fixed

Comment on lines 98 to 99
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.
Collaborator
Sentences

Contributor Author
Fixed

timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
`TStaticSelectResult` which is result of used static parser `select` method.
Collaborator
Don't be redundant; the type is already in the annotation and will be rendered anyway.

Contributor Author
Fixed

Comment on lines 125 to 130
Args:
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
`TStaticParseResult` which is result of used static parser `parse_text` method.
Collaborator
same as above

Contributor Author
Fixed


A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only by the Playwright sub-crawler.

For more detailed description please see [AdaptivePlaywrightCrawler guide](/python/docs/guides/adaptive-playwright-crawler 'Network payload investigation.')
Collaborator
Not sure what the 'Network payload investigation.' is?

Contributor Author
Neither do I :-)
Fixed.

Comment on lines 20 to 21
An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
Collaborator
Suggested change:
- An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
- It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
+ An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+ It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit.

Contributor Author
Updated.


## When to use AdaptivePlaywrightCrawler

Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, but for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.
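The adaptive choice between the browser and plain HTTP can be illustrated with a toy rendering-type predictor. This is purely illustrative; the real predictor in crawlee is more sophisticated, and the class and method names here are hypothetical:

```python
from urllib.parse import urlparse

class RenderingTypePredictor:
    """Remembers, per hostname, whether HTTP-only crawling produced usable results."""

    def __init__(self) -> None:
        self._needs_browser: dict[str, bool] = {}

    def predict(self, url: str) -> str:
        host = urlparse(url).hostname or ''
        # Default to the browser until we learn that HTTP-only is sufficient.
        if self._needs_browser.get(host, True):
            return 'playwright'
        return 'http'

    def store_result(self, url: str, *, http_was_sufficient: bool) -> None:
        host = urlparse(url).hostname or ''
        self._needs_browser[host] = not http_was_sufficient
```

Each crawled request feeds its outcome back via `store_result`, so later requests to the same site can skip the browser when plain HTTP worked before.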
Collaborator
This is one use case, the other is that you can use it when you just want to perform selector-based data extraction and don't want to investigate the target website to find out the rendering type 🙂

Contributor Author
Added.

@@ -73,17 +76,76 @@ def response(self) -> Response:
raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
return self._response

async def wait_for_selector(self, selector: str, timeout: timedelta = timedelta(seconds=5)) -> None:
"""Locate element by css selector a return once it is found.
Collaborator
I think this does not return anything... right?

Contributor Author
It says just "return", which is a subtle but important difference from "return it" — still, a docstring is bad if it can be misinterpreted.
Rephrased to be more explicit and prevent misinterpretation.

Comment on lines 104 to 106
static_content = await self._static_parser.select(self.parsed_content, selector)
if static_content is not None:
return static_content
Collaborator
I'm not sure about this. If the element changes dynamically, this method will always return the version from the initial HTTP response. That might surprise users.

Contributor Author
In a scenario where the element exists in the static page but will change dynamically in the future, the adaptive crawler has no chance of seeing this future change, so it will pick the static value from the present. I am having a hard time imagining the scenario for this.

The only solution that comes to my mind is keeping an optional attribute "latest_parsed_content" that would default to parsed_content but would be updated each time someone explicitly re-parsed the content, or you would have to re-parse the page at the beginning of each call to make sure the select runs on the most up-to-date version. So there could be quite a performance penalty for this.

I was under the impression that this is the reason for having the parse_with_static_parser context helper.

Contributor Author
Updated with automatic re-parsing
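The automatic re-parsing approach settled on above could look roughly like this sketch, which re-parses the live page content before each select when a browser page is available. All class names and the fake parser/page are hypothetical stand-ins, not the actual crawlee internals:

```python
import asyncio

class FakeParser:
    """Stand-in static parser: 'parsing' is the identity, 'select' a dict lookup."""
    def parse(self, raw):
        return raw
    def select(self, content, selector):
        return content.get(selector)

class FakePage:
    """Stand-in browser page whose DOM can change between calls."""
    def __init__(self, dom):
        self.dom = dom
    async def content(self):
        return self.dom

class AdaptiveSelectHelper:
    """Selects elements, re-parsing the latest content when a browser page exists."""
    def __init__(self, static_parser, initial_content, page=None):
        self._static_parser = static_parser
        self._parsed_content = initial_content
        self._page = page  # None when crawling HTTP-only

    async def query_selector(self, selector):
        if self._page is not None:
            # Re-parse the current DOM so dynamic changes are not missed,
            # at the cost of a re-parse on every call.
            self._parsed_content = self._static_parser.parse(await self._page.content())
        return self._static_parser.select(self._parsed_content, selector)
```

With a browser page attached, the helper sees DOM updates; in HTTP-only mode it keeps serving the initially parsed content.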

)
if parsed_selector is not None:
return parsed_selector
raise AdaptiveContextError('Used selector is not a valid static selector')
Collaborator
Can't this mean that the selector simply did not match?

Contributor Author
@Pijukatel Pijukatel Feb 10, 2025
If the selector did not match, then you would get playwright._impl._errors.TimeoutError even before, when waiting for it.

This AdaptiveContextError happens in a very rare case:
Waiting for the selector is done through Playwright.
Parsing is done through the static parser, which then uses the selector as well.

So in theory it can happen that the user picks a selector supported by Playwright but not by the static parser. Playwright will find the selector and wait for it; the static parser will manage to parse the page, but it will fail to run select with such an unsupported selector.

Playwright-supported selectors and static-parser-supported selectors are two distinct sets with some overlap.

So you will get this error only if you picked an invalid static-parser selector that is a valid Playwright selector which actually matched something.
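The "two distinct selector sets with some overlap" situation can be sketched by modeling the static parser's capabilities as a capability check. The hint list is hypothetical, though Playwright pseudo-classes such as `:has-text()` are a real example of selectors a CSS-only static parser cannot evaluate:

```python
# Hypothetical markers of selectors only the browser engine understands.
PLAYWRIGHT_ONLY_HINTS = (':has-text(', ':visible', 'text=')

class AdaptiveContextError(Exception):
    pass

def static_parser_supports(selector: str) -> bool:
    """Stand-in for 'the static parser can evaluate this selector'."""
    return not any(hint in selector for hint in PLAYWRIGHT_ONLY_HINTS)

def select_with_static_parser(selector: str) -> str:
    # Assume Playwright already waited for the selector and matched something.
    if not static_parser_supports(selector):
        # The browser found the element, but the static parser cannot even
        # evaluate the selector -> the rare error case described above.
        raise AdaptiveContextError('Used selector is not a valid static selector')
    return f'parsed <{selector}>'
```

A plain CSS selector succeeds on both engines, while a Playwright-only selector triggers the error even though the browser matched it.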

Collaborator
I think we're not completely aligned on how the method should work. The JS version waits for the element when it's running in the browser and delegates to the static parser otherwise. If the static parser doesn't find anything, that's a valid result. It should be perfectly possible to make two branches here based on self._page is None

I'm sorry if I caused a misunderstanding how this should work.

Contributor Author
@Pijukatel Pijukatel Feb 13, 2025
There are the following possible scenarios:

  1. static_parser does not find the selector and the browser does not find the selector -> will raise playwright._impl._errors.TimeoutError
  2. static_parser finds the selector immediately (whether the browser finds it as well is then irrelevant, so two scenarios collapse into one) -> will return the parsed selector
  3. static_parser does not find the selector immediately and the browser finds it after a while. static_parser finds it after the wait time as well -> will return the parsed selector
  4. static_parser does not find the selector immediately and the browser finds it after a while. static_parser does not find it even after the wait time -> will raise AdaptiveContextError('Used selector is not a valid static selector'), because the element actually exists in the page and was found by the browser using the selector, but it was not found using static_parser, because it does not support such a selector.

The branch where self._page is None and the static parser does not find anything is "hidden" from the user, as it will raise AdaptiveContextError; this forces the AdaptiveCrawler to use Playwright instead, which leads to scenario 1, 3 or 4.
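The four scenarios can be traced through a simplified version of the waiting logic. The exception classes and callables below are fakes standing in for Playwright and the static parser, not the real implementation:

```python
import asyncio

class PlaywrightTimeoutError(Exception):
    """Stand-in for playwright._impl._errors.TimeoutError."""

class AdaptiveContextError(Exception):
    pass

async def wait_for_selector_sketch(static_select, page_wait, selector):
    # Scenario 2: the static parser finds the selector immediately.
    if (result := static_select(selector)) is not None:
        return result
    # Scenario 1: neither engine finds it -> page_wait raises a timeout here.
    await page_wait(selector)
    # Scenario 3: after the browser found it, the static parser finds it too.
    if (result := static_select(selector)) is not None:
        return result
    # Scenario 4: the browser matched, but the static parser cannot.
    raise AdaptiveContextError('Used selector is not a valid static selector')
```

Plugging in fakes for `static_select` and `page_wait` exercises each of the four outcomes.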

Collaborator
@vdusek vdusek left a comment
LGTM

@janbuchar
Collaborator

@TC-MO could you please check the guide for style issues and general readability? I'm heavily biased since I already understand the feature 🙂

@janbuchar janbuchar requested a review from TC-MO February 12, 2025 12:37
Collaborator
@janbuchar janbuchar left a comment
Nice, I have a couple of nitpicks and one more serious comment.


@Pijukatel Pijukatel requested a review from janbuchar February 13, 2025 16:22
@@ -32,8 +32,8 @@ def is_matching_selector(self, parsed_content: Tag, selector: str) -> bool:
return parsed_content.select_one(selector) is not None

@override
async def select(self, parsed_content: Tag, selector: str) -> Tag | None:
return parsed_content.select_one(selector)
async def select(self, parsed_content: Tag, selector: str) -> tuple[Tag, ...]:
Collaborator
this is breaking - are you sure we want to go through with it? why not expose the one/all variants while we're at it?

Contributor Author
@Pijukatel Pijukatel Feb 14, 2025
It is not breaking; this method is added in this PR. I am not sure about the benefit of adding one/all variants to the Parser. BeautifulSoup has it; Parsel does not have it on the Selector level. It returns a SelectorList, and the one/all choice is done on the next level, in data extraction: SelectorList.get() / SelectorList.getall().

In the end it is just a convenience method, so I think it is fine to have it only on the final user-facing context.
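The design trade-off discussed here, returning all matches as a tuple and leaving the one/all choice to the caller, can be shown with a minimal toy parser (illustrative only; the dict stands in for real parsed HTML):

```python
class TupleSelectParser:
    """Toy parser whose select always returns a tuple of matches."""

    def __init__(self, document: dict):
        # 'document' maps selectors to matched texts, standing in for real HTML.
        self._document = document

    def select(self, selector: str) -> tuple:
        return tuple(self._document.get(selector, []))

parser = TupleSelectParser({'a.link': ['first', 'second']})
matches = parser.select('a.link')
first = matches[0] if matches else None  # the "one" variant, chosen by the caller
all_matches = list(matches)              # the "all" variant
```

An empty tuple is a valid "no match" result, so callers never need to distinguish `None` from a missing element.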

Collaborator
Right, sorry! In the longer run, I'd like to see a unified interface for picking an element by selector on all crawling context variants. I'm not sure if this is the right PR to do this, but we should be careful not to shoot ourselves in the foot here... so that we can design something uniform later.

@Pijukatel Pijukatel merged commit e248f17 into master Feb 18, 2025
23 checks passed
@Pijukatel Pijukatel deleted the adaptive-context-helpers branch February 18, 2025 07:56
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Feb 19, 2025
### Description

Add adaptive context helpers and documentation for
AdaptivePlaywrightCrawler.

### Issues

- Closes: apify#249

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>
Successfully merging this pull request may close these issues.

Adaptive playwright crawler