feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio #390

Merged on Feb 13, 2025 (16 commits)


@vdusek vdusek commented Feb 6, 2025

Description

  • Apify (asyncio) and Scrapy (Twisted) now run on a single event loop.
    • nest-asyncio has been completely removed.
    • This change also appears to have improved performance.
  • The ApifyScheduler, which is synchronous, now executes asyncio coroutines (communication with RQ) in a separate thread with its own asyncio event loop.
  • The logging setup had to be adjusted, and it was moved to a dedicated file in the SDK.
  • The try-import functionality for optional dependencies from Crawlee was added to the scrapy subpackage.
  • A new integration test for Scrapy Actor has been added.
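
To illustrate the try-import idea for optional dependencies mentioned above, here is a minimal sketch. It is not the Crawlee or SDK implementation: `try_import`, `load_optional`, and the install hint string are hypothetical names chosen for this example only.

```python
import importlib
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def try_import(install_hint: str) -> Iterator[None]:
    """Re-raise ImportError with an install hint (hypothetical helper)."""
    try:
        yield
    except ImportError as exc:
        # exc.name holds the module that could not be imported.
        raise ImportError(
            f"Optional dependency {exc.name!r} is missing. "
            f"Install it with: pip install '{install_hint}'"
        ) from exc


def load_optional(module_name: str, install_hint: str):
    """Import a module, turning a failure into an actionable error message."""
    with try_import(install_hint):
        return importlib.import_module(module_name)
```

With such a helper, importing a subpackage without its extra installed fails with a message that tells the user what to install, rather than a bare `ModuleNotFoundError`.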

Issues

Tests

  • A new integration test for Scrapy Actor has been added.
  • And of course, it was tested manually using the Actor from guides/templates.

Next steps

Follow-up issues

@github-actions github-actions bot added this to the 107th sprint - Tooling team milestone Feb 6, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 6, 2025
@vdusek vdusek changed the title fix: fix Scrapy integration fix: Unify Apify and Scrapy to use single event loop & remove nest-asyncio Feb 6, 2025
@vdusek vdusek had a problem deploying to fork-pr-integration-tests February 6, 2025 15:35 — with GitHub Actions Error
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Feb 7, 2025
@vdusek vdusek requested review from janbuchar and Pijukatel February 7, 2025 11:41
@vdusek vdusek marked this pull request as ready for review February 7, 2025 11:41

@honzajavorek honzajavorek left a comment

Very nice beautiful! I scanned the changes and they look like the stuff I've been through for the past two weeks. I didn't spot anything suspicious except two things where I commented.

This wasn't a thorough review of all the lines, so I'm just "commenting", not approving or disapproving.

@vdusek vdusek requested a review from Pijukatel February 10, 2025 13:48

@janbuchar janbuchar left a comment

Solid job! I didn't manage to read the whole thing yet, here's a bunch of random comments.

@janbuchar janbuchar self-requested a review February 10, 2025 17:43

Pijukatel commented Feb 11, 2025

Please merge from latest master to your branch and double check that the code in examples is not longer than 90.


vdusek commented Feb 11, 2025

> Please merge from latest master to your branch and double check that the code in examples is not longer than 90.

@Pijukatel So please provide the same PR as for the Crawlee (apify/crawlee-python#973) here and to the Apify Python client as well 🙂.

@Pijukatel

> > Please merge from latest master to your branch and double check that the code in examples is not longer than 90.
>
> @Pijukatel So please provide the same PR as for the Crawlee (apify/crawlee-python#973) here and to the Apify Python client as well 🙂.

Haha, I see this is different repo. I will prepare the PR for this repo as well and merge it after you to not block you.

@vdusek vdusek changed the title fix: Unify Apify and Scrapy to use single event loop & remove nest-asyncio feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio Feb 12, 2025

@janbuchar janbuchar left a comment

Two documentation-related notes, otherwise it looks good!

@@ -270,8 +270,8 @@ async def finalize() -> None:
    self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in IPython')
elif os.getenv('PYTEST_CURRENT_TEST', default=False): # noqa: PLW1508
    self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in an unit test')
elif hasattr(asyncio, '_nest_patched'):
    self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in a nested event loop')
elif os.getenv('SCRAPY_SETTINGS_MODULE'):

Please mention that we now depend on this to be present in the upgrading guide, just to be sure.

@vdusek vdusek Feb 12, 2025

I am not sure about the upgrading guide mentioning minor versions. Is it okay if I add it to the Scrapy guide doc page?

Edit: it is there
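
For readers of this thread, the dependency on `SCRAPY_SETTINGS_MODULE` can be sketched as follows. This is only an illustration of the pattern in the diff above, not the SDK code: `reason_to_skip_exit` and the returned reason strings are hypothetical, and only the two environment variable names come from the diff.

```python
from typing import Mapping, Optional


def reason_to_skip_exit(env: Mapping[str, str]) -> Optional[str]:
    """Return why sys.exit should be skipped, or None if it should be called.

    Hypothetical helper: it mirrors only the environment-variable checks
    visible in the reviewed diff, with made-up reason strings.
    """
    if env.get("PYTEST_CURRENT_TEST"):
        return "running in a unit test"
    if env.get("SCRAPY_SETTINGS_MODULE"):
        # Scrapy sets this variable; the Actor relies on it being present.
        return "running with Scrapy"
    return None
```

In real use the mapping would be `os.environ`. Passing it in as a parameter keeps the check easy to test, and it makes explicit that an unset `SCRAPY_SETTINGS_MODULE` changes the exit behaviour.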


The fastest way to start using Scrapy in Apify Actors is by leveraging the [Scrapy Actor template](https://apify.com/templates/categories/python). This template provides a pre-configured structure and setup necessary to integrate Scrapy into your Actors seamlessly. It includes: setting up the Scrapy settings, `asyncio` reactor, Actor logger, and item pipeline as necessary to make Scrapy spiders run in Actors and save their outputs in Apify datasets.
## Key integration components

It would be awesome if we could have a nice diagram here 🙂 In the ideal case, it would show how the asyncio event loop and twisted reactor interact. We can definitely postpone that though.

Contributor Author

The point is that asyncio and Twisted share the same event loop. I'm not familiar with the details of exactly how asyncio and Twisted interact. Other than that, we have a separate thread with an asyncio event loop for the Scheduler's synchronous calls.
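
The "separate thread with its own event loop" part can be sketched with the standard library alone. This is a rough illustration under assumptions, not the SDK's implementation: `BackgroundLoop`, `run`, and `fetch_next_request` are hypothetical names, and the coroutine merely stands in for a request-queue call.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


class BackgroundLoop:
    """Runs an asyncio event loop in a background thread (hypothetical helper)."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, T], timeout: float = 60.0) -> T:
        """Block the calling synchronous code until the coroutine finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        """Stop the loop, wait for the thread, and release loop resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_next_request() -> str:
    # Stand-in for an asynchronous request-queue call.
    await asyncio.sleep(0.01)
    return "https://example.com"
```

A synchronous caller, such as a scheduler method, would call `BackgroundLoop().run(fetch_next_request())` and receive the result without touching the main event loop. The main loop stays free for the reactor, while the blocking wait happens against a loop that only this thread drives.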

We gave it two weeks and made it work, but that doesn't mean we understand the sorcery enough to draw diagrams 😅 But I think it's something like...

diagram

(It's black on transparent SVG, so you probably won't see it well in dark mode)

I don't know if I chose the best diagram for this, maybe https://swimlanes.io/ would be better

@janbuchar janbuchar self-requested a review February 13, 2025 09:06
@vdusek vdusek merged commit 96949be into master Feb 13, 2025
23 of 26 checks passed
@vdusek vdusek deleted the fixing-scrapy branch February 13, 2025 09:07
    logger.exception('Exception occurred while shutting down.')

finally:
    logger.debug(f'{self.__class__.__name__} closed successfully.')

Successfully 🤔 I think inside finally we shouldn't say that word.
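
One way to address this, sketched here as an assumption rather than as the change that was actually made: log success in an `else` branch, which runs only when no exception was raised, and keep `finally` for a neutral message. The function `close_resource` and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("shutdown_sketch")


def close_resource(close: Callable[[], object]) -> bool:
    """Call close() and report whether it succeeded (hypothetical helper)."""
    try:
        close()
    except Exception:
        logger.exception("Exception occurred while shutting down.")
        return False
    else:
        # Runs only if close() raised nothing, so "successfully" is accurate.
        logger.debug("Closed successfully.")
        return True
    finally:
        # Runs in both cases, so the wording stays neutral.
        logger.debug("Shutdown attempt finished.")
```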
