The python tornado integration causes the ioloop to stall #191

krish203 · 2020-06-29T18:16:09Z

Description

Recently, we have been facing an issue where a large number of TIME_WAIT connections causes the tornado ioloop to stall.

Issue

When the application starts, it initially runs fine, but when an error gets reported, a large number of connections to the IP 35.186.205.6 start accumulating in the TIME_WAIT state. A netstat command showed ~6300 connections in this state. This eventually caused the tornado application's ioloop to lock up, causing timeouts in other parts of the application where we use the tornado async HTTP client.

Environment

Library versions:

python version: 2.7.17
bugsnag-python version: 3.6.1
integration version(s)
- tornado: 5.1.1
Logger snippet:

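import logging

from bugsnag.handlers import BugsnagHandler


def get_logger():  # function name assumed; the original showed only the body
    log_format = ('time="%(asctime)-15s" module=%(module)s '
                  'level=%(levelname)-7s %(message)s')
    logging.basicConfig(format=log_format, level=logging.INFO)
    logger_obj = logging.getLogger(__name__)
    # Send only ERROR and above to Bugsnag
    handler = BugsnagHandler()
    handler.setLevel(logging.ERROR)
    logger_obj.addHandler(handler)
    return logger_obj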

We mainly ran into this issue: SimpleAsyncHTTPClient randomly blocking on "ValueError: IOStream is not idle; cannot convert to SSL" (tornadoweb/tornado#2546).
Output of the command below was ~6300:

netstat -anp | grep '35.186.205.6' | grep 'TIME_WAIT' | wc -l
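For anyone who wants to track this count from a script rather than a shell pipeline, here is a minimal sketch that does the same filtering in Python. It assumes the usual Linux `netstat -an` column layout (protocol, queues, local address, foreign address, state); the function name and the sample output are made up for illustration and are not taken from the affected host.

```python
def count_time_wait(netstat_output, remote_ip):
    """Count TIME_WAIT connections to remote_ip in `netstat -an` style output."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local addr, foreign addr, state
        if len(fields) < 6:
            continue
        foreign_addr, state = fields[4], fields[5]
        # Strip the ":port" suffix before comparing the address
        if state == "TIME_WAIT" and foreign_addr.rsplit(":", 1)[0] == remote_ip:
            count += 1
    return count


# Hypothetical sample output, for illustration only.
SAMPLE = """\
tcp        0      0 10.0.0.5:51234     35.186.205.6:443    TIME_WAIT   -
tcp        0      0 10.0.0.5:51235     35.186.205.6:443    TIME_WAIT   -
tcp        0      0 10.0.0.5:51236     35.186.205.6:443    ESTABLISHED 1234/python
tcp        0      0 10.0.0.5:51237     93.184.216.34:443   TIME_WAIT   -
"""

print(count_time_wait(SAMPLE, "35.186.205.6"))
```

In practice the input would come from running netstat (for example via `subprocess`) at intervals, which makes it easier to see whether the count grows with the error rate.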

Are bugsnag.tornado.BugsnagRequestHandler and BugsnagHandler asynchronous, or do they need to be called from an Executor instance?
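One generic way to keep any logging handler off the ioloop thread, regardless of how it delivers internally, is to route records through a queue. The sketch below uses the standard library's `QueueHandler` and `QueueListener`, which need Python 3.5+ for `respect_handler_level` and are not in the 2.7 standard library used in this report. `attach_via_queue` and `CollectingHandler` are hypothetical names; `CollectingHandler` is only a stand-in for `BugsnagHandler`, and nothing here establishes that `BugsnagHandler` actually blocks.

```python
import logging
import logging.handlers
import queue


def attach_via_queue(logger, slow_handler):
    """Route records through a queue so slow_handler.emit runs on a listener thread."""
    log_queue = queue.Queue(-1)  # unbounded queue
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(
        log_queue, slow_handler, respect_handler_level=True)
    listener.start()
    return listener


class CollectingHandler(logging.Handler):
    """Stand-in for a real handler; it only records the messages it receives."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


demo_logger = logging.getLogger("queue_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

collector = CollectingHandler()
listener = attach_via_queue(demo_logger, collector)
demo_logger.info("ignored by the ERROR-level handler")
demo_logger.error("something failed")
listener.stop()  # processes everything still queued, then joins the thread

print(collector.messages)
```

The calling thread only pays for a queue put; the handler's own work happens on the listener thread.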


steve-nester-uk · 2020-07-14T16:33:36Z

@krish203 "TIME_WAIT indicates that the local endpoint (your side) has closed the connection. The connection is being kept around so that any delayed packets can be matched to the connection and handled appropriately. The connections will be removed when they time out within four minutes."

As this is a function of how TCP works, I do not believe this is caused by Bugsnag's notifier; rather, it is a result of a high number of errors. I can think of two possible solutions: one would be to set SO_REUSEADDR to reuse the TIME_WAIT connections, and another would be to load balance the application so that fewer requests are handled on each server. Do you think either of these would help?
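For reference, this is how SO_REUSEADDR is set on a socket in Python. It is only a sketch of the option itself: the helper name is hypothetical, and the option primarily governs binding to a local address, so whether it helps with outbound connections opened by an HTTP client depends on the platform and on whether the client library exposes its sockets at all.

```python
import socket


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to have any effect on address reuse
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


sock = make_reusable_socket()
# getsockopt returns a non-zero integer when the option is enabled
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
sock.close()
print(enabled)
```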

Also, just to clarify, you mentioned ~6300 connections were made, I assume this is because 6300 errors occurred in the app for which Bugsnag raised each to notify.bugsnag.com, does that sound correct?

steve-nester-uk · 2020-07-20T12:24:01Z

Closing the issue for now as we haven’t heard back. If you wish to follow up, please comment below and we’ll reopen the issue.

krish203 · 2020-08-10T22:44:04Z

Sorry, I was on parental leave and could not follow up.

I did try to reuse the TIME_WAIT connections, but that did not help. For whatever reason, when I remove the BugsnagHandler from the logger, the ioloop stalling problem goes away. Is there a way to debug the bugsnag library internally to figure out what might be going on?
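One library-agnostic way to check whether a handler is what holds up the calling thread is to wrap it and time each `emit`. This is a minimal sketch under that assumption: `TimedHandler` and `SleepyHandler` are hypothetical names, `SleepyHandler` merely simulates a slow handler, and in a real test the wrapped object would be the `BugsnagHandler` instance. It targets Python 3 (`time.monotonic`).

```python
import logging
import sys
import time


class TimedHandler(logging.Handler):
    """Wraps another handler and records how long each emit() call takes."""

    def __init__(self, inner, slow_threshold=0.05):
        super().__init__(level=inner.level)
        self.inner = inner
        self.slow_threshold = slow_threshold  # seconds
        self.durations = []

    def emit(self, record):
        start = time.monotonic()
        try:
            self.inner.handle(record)
        finally:
            elapsed = time.monotonic() - start
            self.durations.append(elapsed)
            if elapsed > self.slow_threshold:
                # Write to stderr directly to avoid logging from inside a handler
                sys.stderr.write("slow handler call: %.3fs\n" % elapsed)


class SleepyHandler(logging.Handler):
    """Stand-in that simulates a handler blocking for 100 ms."""

    def emit(self, record):
        time.sleep(0.1)


timed = TimedHandler(SleepyHandler(level=logging.ERROR))
demo_logger = logging.getLogger("timing_demo")
demo_logger.propagate = False
demo_logger.addHandler(timed)
demo_logger.error("boom")

print(len(timed.durations), timed.durations[0] >= 0.09)
```

If the recorded durations stay near zero under load, the stall is more likely elsewhere (for example in connection handling) than in the handler call itself.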

"~6300 connections were made, I assume this is because 6300 errors occurred in the app for which Bugsnag raised each to notify.bugsnag.com, does that sound correct?"

Yes that's correct.

steve-nester-uk · 2020-08-17T18:15:14Z

@krish203 Thanks for the additional information. As far as I can tell everything is behaving as expected. The fact that so many errors are occurring per second is causing the TCP connections in TIME_WAIT to stall your app. My best suggestion would be to load balance the web app across additional servers to reduce load.

Closing for now as I cannot see any evidence that the Bugsnag notifier has an issue, but please comment and we can reopen should you have reason to believe that not to be the case.

steve-nester-uk added the "awaiting feedback" label (Awaiting a response from a customer. Will be automatically closed after approximately 2 weeks.) Jul 15, 2020

steve-nester-uk closed this as completed Jul 20, 2020

snmaynard reopened this Aug 11, 2020

steve-nester-uk closed this as completed Aug 17, 2020
