feat: Add adaptive context helpers #964
Conversation
Add method to BasicCrawler to handle just one request.
Add statistics. TODO: Make mypy happy about statistics. Wrap existing statistics from init in adaptive statistics. Silence sub-crawler statistics and loggers in general. (Set level to error?)
Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.
Handle use state.
Pre-navigation hooks delegation to sub crawler hooks.
Statistics were marked as generics, but in reality were not. Hardcoding state_model to make it explicit and clear.
WIP KVS handling. Currently it does not go through Result handler.
This reverts commit d345259.
Fix wrong id for predictor_state persistence.
Add test for pre-nav hook. Add test for statistics in crawler init.
…nternals of sub crawlers) Cleanup commit results.
Add it in adaptive crawler instead at the cost of accessing many private members.
Use different url for unit tests.
(By temporal wrapper context.)
This adds some complexity, but adds more flexibility.
Add one sanity check test for parsel variant. Update some doc strings.
This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses a more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
There is missing explanation what AdaptivePlaywrightCrawler
is doing. It seems like you assume the readers already know what it is.
This is just an example... A link to the appropriate guide would be nice though.
edit - I see it's at the bottom of the example.
A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only for the Playwright sub-crawler.
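To illustrate the `playwright_only` dispatch described above, here is a minimal plain-Python sketch. The names (`HookRegistry`, `run_hooks`) are hypothetical, not the actual crawlee internals; they only show the idea of hooks being tagged and conditionally skipped:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HookRegistry:
    """Sketch: hooks are stored as (hook, playwright_only) pairs in registration order."""

    _hooks: list = field(default_factory=list)

    def pre_navigation_hook(self, hook: Callable[[], None], *, playwright_only: bool = False) -> None:
        """Register a hook; playwright_only hooks run only in the browser sub-crawler."""
        self._hooks.append((hook, playwright_only))

    def run_hooks(self, *, using_playwright: bool) -> None:
        """Execute hooks for one navigation of the chosen sub-crawler."""
        for hook, playwright_only in self._hooks:
            if playwright_only and not using_playwright:
                continue  # the HTTP-based sub-crawler skips Playwright-only hooks
            hook()
```

With this shape, a hook registered with `playwright_only=True` simply never fires for the HTTP-based sub-crawler, while common hooks fire for both.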
pre-navigation hook - maybe link to API docs?
In this rare case I think it would not be suitable to add the link to API docs. `AdaptivePlaywrightCrawler#pre_navigation_hook` shows the parametrized decorator doc page, and even though it is correct, I think it would be confusing to regular users.
For example, the return type is `Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]` :-)
So I would suggest to show link to the guide page that shows how to use hooks and describes them in more detail:
...python/docs/guides/adaptive-playwright-crawler#page-configuration-with-pre-navigation-hooks
parsed_content: content where the page element will be located
selector: css selector
Use sentences please.
Fixed
"""Adaptive crawler that uses both specific implementation of `AbstractHttpCrawler` and `PlaywrightCrawler`.

It tries to detect whether it is sufficient to crawl without browser (which is faster) or if
`PlaywrightCrawler` should be used (in case previous method did not work as expected for specific url.).
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it
may be possible.
This is an important docstring, and it is really not very helpful if I do not already know what it should be.
I would imagine something like:
- The first line - one sentence description of the class without referencing other things.
- Multiline paragraph description, with other references.
Check the others for inspiration.
Updated
selector: css selector to be used to locate specific element on page.
timeout: timeout that defines how long the function wait for the selector to appear.
Sentences
Fixed
timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
    `TStaticSelectResult` which is result of used static parser `select` method.
Don't be redundant, the type is already in the annotation and will be rendered anyway.
Fixed
Args:
    selector: css selector to be used to locate specific element on page.
    timeout: timeout that defines how long the function wait for the selector to appear.

Returns:
    `TStaticParseResult` which is result of used static parser `parse_text` method.
same as above
Fixed
A **pre-navigation hook** can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks will be executed both for pages crawled by the HTTP-based sub-crawler and by the Playwright-based sub-crawler. Use `playwright_only=True` to mark hooks that should be executed only for the Playwright sub-crawler.
For more detailed description please see [AdaptivePlaywrightCrawler guide](/python/docs/guides/adaptive-playwright-crawler 'Network payload investigation.')
Not sure what the 'Network payload investigation.' is?
Neither do I :-)
Fixed.
An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
Suggested change:
```diff
- An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
- It uses more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible.
+ An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of HTTP-based crawler such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+ It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit.
```
Updated.
## When to use AdaptivePlaywrightCrawler
Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, but for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.
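The cost/reliability trade-off can be sketched in plain Python as follows. All names and the 0.5 threshold are hypothetical; the real crawler additionally keeps per-domain statistics to train its rendering-type prediction:

```python
def crawl_adaptively(url, http_crawl, browser_crawl, predict_browser_probability, threshold=0.5):
    """Sketch of adaptive dispatch: try the cheap HTTP path first unless the
    predictor expects the page to need browser rendering.

    http_crawl / browser_crawl take a URL and return a result (http_crawl may
    raise on failure); predict_browser_probability maps a URL to an estimated
    probability that a browser is required.
    """
    if predict_browser_probability(url) < threshold:
        try:
            return http_crawl(url)  # fast and cheap path
        except Exception:
            pass  # static crawl failed; fall back to the browser
    return browser_crawl(url)  # slower but more reliable path
```

This is also why the second use case mentioned below works: with such a fallback in place, you can do selector-based extraction without investigating the target website's rendering type up front.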
This is one use case, the other is that you can use it when you just want to perform selector-based data extraction and don't want to investigate the target website to find out the rendering type 🙂
Added.
@@ -73,17 +76,76 @@ def response(self) -> Response:
        raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
        return self._response

    async def wait_for_selector(self, selector: str, timeout: timedelta = timedelta(seconds=5)) -> None:
        """Locate element by css selector a return once it is found.
I think this does not return anything... right?
It says just `return`, which is a subtle but important difference from `return it`, but a docstring is bad if you can misinterpret it.
Rephrased to be more explicit and prevent misinterpretation.
static_content = await self._static_parser.select(self.parsed_content, selector)
if static_content is not None:
    return static_content
I'm not sure about this. If the element changes dynamically, this method will always return the version from the initial HTTP response. That might surprise users.
In a scenario where an element exists in the static page but will change dynamically in the future, the adaptive crawler has no chance of seeing this future change, so it will pick the static value from the present. I am having a hard time imagining the scenario for this.
The only solution that comes to my mind is keeping an optional attribute `latest_parsed_content` that would default to `parsed_content` but would be updated each time someone explicitly re-parsed the content; or you would have to re-parse the page at the beginning of each call to make sure it is doing the select on the most up-to-date version. So there could be quite a performance penalty for this.
I was under the impression that this is the reason for having the `parse_with_static_parser` context helper.
Updated with automatic re-parsing.
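The automatic re-parsing discussed above could be sketched like this. The function name and signatures are hypothetical, not the real crawlee code; the point is that when a live browser page is available, its current HTML is re-parsed before each select, so results reflect dynamic changes:

```python
def select_with_reparse(selector, initial_parsed, static_select, get_live_html=None, parse=None):
    """Sketch: run a static select against the freshest content available.

    When a live browser page exists (get_live_html is not None), its current
    HTML is re-parsed first, so dynamically changed elements are visible.
    Otherwise the parse of the initial HTTP response is reused.
    """
    if get_live_html is not None and parse is not None:
        parsed = parse(get_live_html())  # re-parse the live DOM on every call
    else:
        parsed = initial_parsed  # HTTP-only crawl: the initial response is all we have
    return static_select(parsed, selector)
```

The re-parse on every call is exactly the performance trade-off mentioned above: correctness for dynamically changing pages at the cost of extra parsing work.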
)
if parsed_selector is not None:
    return parsed_selector
raise AdaptiveContextError('Used selector is not a valid static selector')
Can't this mean that the selector simply did not match?
If the selector did not match, then you will get `playwright._impl._errors.TimeoutError` even before, when waiting for it.
This `AdaptiveContextError` happens in a very rare case: waiting for the selector is done through Playwright, while parsing is done through the static parser, which then uses the selector as well. So in theory it can happen that the user picks a selector supported by Playwright but not by the static parser. Playwright will find the selector and wait for it, and the static parser will manage to parse the page, but it will fail to do `select` with such an unsupported selector.
Playwright-supported selectors and static-parser-supported selectors are two distinct sets with some overlap. So you will get this error only if you picked a selector that is invalid for the static parser but valid for Playwright and actually matched something.
I think we're not completely aligned on how the method should work. The JS version waits for the element when it's running in the browser and delegates to the static parser otherwise. If the static parser doesn't find anything, that's a valid result. It should be perfectly possible to make two branches here based on `self._page is None`.
I'm sorry if I caused a misunderstanding of how this should work.
There are the following possible scenarios:

- The static parser does not find the selector and the browser does not find the selector -> will raise `playwright._impl._errors.TimeoutError`.
- The static parser finds the selector immediately (whether the browser finds it as well is irrelevant then, so two scenarios collapse into one) -> will return the parsed selector.
- The static parser does not find the selector immediately, the browser finds it after a while, and the static parser finds it after the wait time as well -> will return the parsed selector.
- The static parser does not find the selector immediately, the browser finds it after a while, but the static parser does not find it even after the wait time -> will raise `AdaptiveContextError('Used selector is not a valid static selector')`, because the element actually exists in the page and was found by the browser using the selector, but it was not found by the static parser, which does not support such a selector.

The branch where `self._page is None` and the static parser does not find anything is "hidden" from the user, as it will raise `AdaptiveContextError`; this forces the AdaptiveCrawler to use Playwright instead, which leads to scenario 1, 3 or 4.
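The scenarios boil down to a small amount of control flow. Here is a plain-Python sketch with hypothetical helper signatures, not the real `AdaptivePlaywrightCrawlingContext` code:

```python
class AdaptiveContextError(Exception):
    """Raised when the static parser cannot handle an otherwise matching selector."""


def query_selector(static_select, wait_in_browser, selector):
    """Sketch of the branching described in the scenarios above.

    static_select(selector) -> matched content or None (static parser).
    wait_in_browser(selector) -> blocks until the element appears in the
    live page; raises a timeout error if it never does (scenario 1).
    """
    # Scenario 2: the static parser finds the element immediately.
    content = static_select(selector)
    if content is not None:
        return content
    # Wait for the element in the browser (may raise a timeout error).
    wait_in_browser(selector)
    # Scenario 3: re-try the static parser after the wait.
    content = static_select(selector)
    if content is not None:
        return content
    # Scenario 4: the browser found the element, but the selector is not
    # supported by the static parser.
    raise AdaptiveContextError('Used selector is not a valid static selector')
```

Note how the error is reachable only after the browser wait succeeded, which matches the "valid Playwright selector, invalid static-parser selector" case described above.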
LGTM
@TC-MO could you please check the guide for style issues and general readability? I'm heavily biased since I already understand the feature 🙂
Nice, I have a couple of nitpicks and one more serious comment.
@@ -32,8 +32,8 @@ def is_matching_selector(self, parsed_content: Tag, selector: str) -> bool:
        return parsed_content.select_one(selector) is not None

    @override
    async def select(self, parsed_content: Tag, selector: str) -> Tag | None:
        return parsed_content.select_one(selector)
    async def select(self, parsed_content: Tag, selector: str) -> tuple[Tag, ...]:
this is breaking - are you sure we want to go through with it? why not expose the one/all variants while we're at it?
It is not breaking, this method is added in this PR. I am not sure about the benefit of adding one/all variants to `Parser`. `BeautifulSoup` has it; `Parsel` does not have it on the `Selector` level. It returns a `SelectorList`, and the one/all choice is done on the next level, in data extraction: `SelectorList.get()` / `SelectorList.getall()`.
In the end it is just a convenience method, so I think it is fine to have it only on the final user-facing context.
Right, sorry! In the longer run, I'd like to see a unified interface for picking an element by selector on all crawling context variants. I'm not sure if this is the right PR to do this, but we should be careful not to shoot ourselves in the foot here... so that we can design something uniform later.
### Description

Add adaptive context helpers and documentation for AdaptivePlaywrightCrawler.

### Issues

- Closes: apify#249

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>