The python tornado integration causes the ioloop to stall #191

Closed
krish203 opened this issue Jun 29, 2020 · 4 comments
Labels
awaiting feedback: Awaiting a response from a customer. Will be automatically closed after approximately 2 weeks.

Comments

@krish203

krish203 commented Jun 29, 2020

Description

Recently, we have been facing an issue where a large number of TIME_WAIT connections causes the tornado ioloop to stall.

Issue

When the application starts, it initially runs fine. Once an error is reported, however, a large number of connections to the IP 35.186.205.6 start accumulating in the TIME_WAIT state. A netstat command showed ~6300 connections in this state. This eventually caused the ioloop of the tornado application to lock up, causing timeouts in other parts of the application where we use the tornado Async HTTP Client.
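The count described above can also be taken from Python by parsing netstat-style output. This is a minimal sketch, assuming the usual Linux `netstat -an` column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name `count_time_wait` and the sample rows are made up for illustration and are not output from the reporter's machine.

```python
def count_time_wait(netstat_output, remote_ip):
    """Count TCP rows in TIME_WAIT whose foreign address is remote_ip."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local, foreign, state, ...
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        host = foreign.rsplit(":", 1)[0]  # strip the port
        if host == remote_ip and state == "TIME_WAIT":
            count += 1
    return count


# Illustrative rows only (hypothetical local addresses and ports).
SAMPLE = """\
tcp        0      0 10.0.0.5:51234          35.186.205.6:443        TIME_WAIT   -
tcp        0      0 10.0.0.5:51235          35.186.205.6:443        ESTABLISHED 1234/python
tcp        0      0 10.0.0.5:51236          35.186.205.6:443        TIME_WAIT   -
tcp        0      0 10.0.0.5:51237          93.184.216.34:443       TIME_WAIT   -
"""

time_wait_count = count_time_wait(SAMPLE, "35.186.205.6")
print(time_wait_count)
```

Sampling this count periodically while errors are being reported would show whether the number of TIME_WAIT sockets tracks the error rate.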

Environment

Library versions:

  • python version: 2.7.17
  • bugsnag-python version: 3.6.1
  • integration version(s)
    • tornado: 5.1.1
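  • Logger snippet:
# Imports assumed; they were not shown in the original snippet.
import logging
from bugsnag.handlers import BugsnagHandler

# Body of the function that builds the application logger.
log_format = ('time="%(asctime)-15s" module=%(module)s '
              'level=%(levelname)-7s %(message)s')
logging.basicConfig(format=log_format, level=logging.INFO)
logger_obj = logging.getLogger(__name__)
handler = BugsnagHandler()
handler.setLevel(logging.ERROR)  # only ERROR and above reach Bugsnag
logger_obj.addHandler(handler)
return logger_obj
  • Command used to count the TIME_WAIT connections:
netstat -anp | grep '35.186.205.6' | grep 'TIME_WAIT' | wc -l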
  • Are bugsnag.tornado.BugsnagRequestHandler and BugsnagHandler asynchronous, or do they need to be called from an Executor instance?
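Regarding the Executor question: whether BugsnagHandler blocks is not established in this thread. If a logging handler does turn out to block, one general way to keep it off the calling thread is the standard library's `QueueHandler`/`QueueListener` pair. The sketch below assumes Python 3.5+ (the reporter runs 2.7.17, where these classes are not in the standard library); `SlowHandler` and `build_logger` are hypothetical names, and `SlowHandler` is only a stand-in for a handler that does network I/O.

```python
import logging
import logging.handlers
import queue
import time


class SlowHandler(logging.Handler):
    """Hypothetical stand-in for a handler that blocks on network I/O."""

    def __init__(self):
        super().__init__()
        self.delivered = []

    def emit(self, record):
        time.sleep(0.05)  # simulate a slow delivery
        self.delivered.append(record.getMessage())


def build_logger(name, blocking_handler):
    """Attach blocking_handler behind a queue so logging calls return quickly."""
    log_queue = queue.Queue(-1)  # unbounded queue
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener runs blocking_handler on its own worker thread.
    listener = logging.handlers.QueueListener(
        log_queue, blocking_handler, respect_handler_level=True)
    listener.start()
    return logger, listener


slow = SlowHandler()
slow.setLevel(logging.ERROR)
logger, listener = build_logger("queue-demo", slow)
for i in range(5):
    logger.error("failure %d", i)  # only enqueues; does not sleep
listener.stop()  # waits for the worker to drain the queue
print(slow.delivered)
```

With this arrangement the thread that logs (for example the one running the ioloop) only pays the cost of a queue put, while delivery happens on the listener's thread.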
@steve-nester-uk
Contributor

@krish203 "TIME_WAIT indicates that the local endpoint (your side) has closed the connection. The connection is being kept around so that any delayed packets can be matched to the connection and handled appropriately. The connections will be removed when they time out within four minutes."

As this is a function of how TCP works, I do not believe this is caused by Bugsnag's notifier; rather, it is a result of the high number of errors. I can think of two possible solutions: one would be to set SO_REUSEADDR to reuse the TIME_WAIT connections, and another would be to load balance the application so that fewer requests are handled on each server. Do you think either of these would help?
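For reference, setting SO_REUSEADDR on a socket that the application creates itself looks like the sketch below. It uses only the standard `socket` module; the helper name `make_reusable_socket` is hypothetical. The thread does not say how the Bugsnag notifier or the tornado HTTP client create their sockets, so whether this option can be applied to those connections is an open assumption.

```python
import socket


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind()/connect() to have any effect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


sock = make_reusable_socket()
# getsockopt returns a non-zero integer when the option is enabled.
reuse_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
sock.close()
print(reuse_enabled)
```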

Also, just to clarify: you mentioned that ~6300 connections were made. I assume this is because 6300 errors occurred in the app, each of which Bugsnag reported to notify.bugsnag.com. Does that sound correct?

steve-nester-uk added the awaiting feedback label Jul 15, 2020
@steve-nester-uk
Contributor

steve-nester-uk commented Jul 20, 2020

Closing the issue for now as we haven’t heard back. If you wish to follow up, please comment below and we’ll reopen the issue.

@krish203
Author

krish203 commented Aug 10, 2020

Sorry, I was on parental leave and could not follow up.

I did try to reuse the TIME_WAIT connections, but that did not help. However, when I remove the BugsnagHandler from the logger, the ioloop stalling problem goes away. Is there a way to debug the bugsnag library internally to figure out what might be going on?
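One way to start investigating without touching the library's internals is to measure how long each handler call takes on the calling thread. The sketch below wraps any `logging.Handler` and records the duration of each call; `TimingHandler` and `SleepyHandler` are hypothetical names, and `SleepyHandler` merely simulates a slow handler. It is written to run on both Python 2.7 and Python 3.

```python
import logging
import time


class TimingHandler(logging.Handler):
    """Wraps another handler and records how long each call takes."""

    def __init__(self, wrapped):
        logging.Handler.__init__(self, level=wrapped.level)
        self.wrapped = wrapped
        self.durations = []  # seconds spent in the wrapped handler per record

    def emit(self, record):
        start = time.time()
        try:
            # handle() applies the wrapped handler's filters, then emits.
            self.wrapped.handle(record)
        finally:
            self.durations.append(time.time() - start)


class SleepyHandler(logging.Handler):
    """Hypothetical stand-in for a handler that blocks for a while."""

    def emit(self, record):
        time.sleep(0.02)


timing = TimingHandler(SleepyHandler())
demo_logger = logging.getLogger("timing-demo")
demo_logger.propagate = False
demo_logger.addHandler(timing)
demo_logger.error("boom")
slowest = max(timing.durations)
print("slowest handler call: %.3fs" % slowest)
```

If the recorded durations are consistently large when the wrapped handler is the real one, that would suggest the handler is doing blocking work on the thread that logs; if they stay near zero, the stall more likely originates elsewhere.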

"~6300 connections were made, I assume this is because 6300 errors occurred in the app for which Bugsnag raised each to notify.bugsnag.com, does that sound correct?"

Yes that's correct.

snmaynard reopened this Aug 11, 2020
@steve-nester-uk
Contributor

@krish203 Thanks for the additional information. As far as I can tell everything is behaving as expected. The fact that so many errors are occurring per second is causing the TCP connections in TIME_WAIT to stall your app. My best suggestion would be to load balance the web app across additional servers to reduce load.

Closing for now as I cannot see any evidence that the Bugsnag notifier has an issue, but please comment and we can reopen should you have reason to believe that not to be the case.
