Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate logs from apify logger in Scrapy-Apify integration #391

Open
vdusek opened this issue Feb 6, 2025 · 0 comments
Open

Duplicate logs from apify logger in Scrapy-Apify integration #391

vdusek opened this issue Feb 6, 2025 · 0 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Copy link
Contributor

vdusek commented Feb 6, 2025

Description

  • The Apify logger (apify) is producing duplicate log entries when running a Scrapy-Actor integration.

Log

$ apify run
[apify] INFO  Initializing Actor...
[apify] INFO  Initializing Actor... ({"message": "Initializing Actor..."})
[apify] INFO  System info ({"apify_sdk_version": "2.2.1", "apify_client_version": "1.8.1", "crawlee_version": "0.5.2", "python_version": "3.13.1", "os": "linux"})
[apify] INFO  System info ({"apify_sdk_version": "2.2.1", "apify_client_version": "1.8.1", "crawlee_version": "0.5.2", "python_version": "3.13.1", "os": "linux", "message": "System info"})
[scrapy.addons] INFO  Enabled addons:
[] ({"crawler": "<scrapy.crawler.Crawler object at 0x7fd081db0ec0>"})
[scrapy.middleware] INFO  Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7fd081db0ec0>"})
[scrapy.crawler] INFO  Overridden settings:
{'BOT_NAME': 'titlebot',
 'DEPTH_LIMIT': 1,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'src.spiders',
 'ROBOTSTXT_OBEY': True,
 'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
 'SPIDER_MODULES': ['src.spiders'],
 'TELNETCONSOLE_ENABLED': False,
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
[apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False.
[apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False. ({"message": "ApifyHttpProxyMiddleware is not going to be used. Actor input field \"proxyConfiguration.useApifyProxy\" is set to False."})
[scrapy.middleware] INFO  Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'src.middlewares.TitleDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7fd081db0ec0>"})
[scrapy.middleware] INFO  Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'src.middlewares.TitleSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7fd081db0ec0>"})
[scrapy.middleware] INFO  Enabled item pipelines:
['src.pipelines.TitleItemPipeline',
 'apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7fd081db0ec0>"})
[scrapy.core.engine] INFO  Spider opened ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})

...

[title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/upgrading/upgrading-to-v3>... ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[scrapy.core.engine] INFO  Closing spider (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[scrapy.statscollectors] INFO  Dumping Scrapy stats:
{'downloader/request_bytes': 13234,
 'downloader/request_count': 49,
 'downloader/request_method_count/GET': 49,
 'downloader/response_bytes': 1384477,
 'downloader/response_count': 49,
 'downloader/response_status_count/200': 49,
 'elapsed_time_seconds': 3.552904,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 2, 6, 15, 53, 14, 193680, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 8748919,
 'httpcompression/response_count': 49,
 'item_scraped_count': 48,
 'items_per_minute': None,
 'log_count/INFO': 58,
 'memusage/max': 111607808,
 'memusage/startup': 111607808,
 'offsite/domains': 10,
 'offsite/filtered': 20,
 'request_depth_max': 1,
 'response_received_count': 49,
 'responses_per_minute': None,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'start_time': datetime.datetime(2025, 2, 6, 15, 53, 10, 640776, tzinfo=datetime.timezone.utc),
 'urllength/request_ignored_count': 1} ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7fd081db12b0>"})
[apify] INFO  Exiting Actor ({"exit_code": 0})
[apify] INFO  Exiting Actor ({"exit_code": 0, "message": "Exiting Actor"})
@vdusek vdusek added bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. labels Feb 6, 2025
@vdusek vdusek added this to the 108th sprint - Tooling team milestone Feb 10, 2025
vdusek added a commit that referenced this issue Feb 13, 2025
…asyncio` (#390)

### Description

- Apify (asyncio) and Scrapy (Twisted) now run on a single event loop.
  - `nest-asyncio` has been completely removed.
  - It seems that this change also improved the performance.
- The `ApifyScheduler`, which is synchronous, now executes asyncio
coroutines (communication with RQ) in a separate thread with its own
asyncio event loop.
- Logging setup has to be adjusted and I moved to a dedicated file in
the SDK.
- The try-import functionality for optional dependecies from Crawlee was
added to `scrapy` subpackage.
- A new integration test for Scrapy Actor has been added.

### Issues

- Closes: #148
- Closes: #176
- Closes: #392
- Relates: apify/actor-templates#303
- This issue will be closed once the corresponding PR in
`actor-templates` is merged.

### Tests

- A new integration test for Scrapy Actor has been added.
- And of course, it was tested manually using the Actor from
guides/templates.

### Next steps

- Update Scrapy Actor template in `actor-templates`.
- Update [Actor Scrapy Books
Example](https://github.com/apify/actor-scrapy-books-example).
- Add HTTP cache storage for KVS, @honzajavorek will provide his
implementation.

### Follow-up issues

- There are still a few issues to be resolved.
- #391
- #395
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

1 participant