
Attempt to write logs with no transports #586

Closed
yinzara opened this issue Sep 18, 2019 · 29 comments
Assignees
Labels
api: logging Issues related to the googleapis/nodejs-logging API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@yinzara

yinzara commented Sep 18, 2019

This seems to be the issue reported as part of #381; however, it is not caused by the same problem. It is a serious enough problem that I have to put monitoring in place to kill my services when I see these log messages.

This is occurring in GKE and only occurs after the service is running for a period of time.

I'm running @google-cloud/logging version 5.2.0 and google-gax 1.1.5.

The log I'm posting below came from a service that started at 6pm PST; then, at around 20 minutes past midnight, I got one log message:

"(node:34) UnhandledPromiseRejectionWarning: Error: 
    at Http2CallStream.call.on (/tsg/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:96:45)
    at Http2CallStream.emit (events.js:203:15)
    at process.nextTick (/tsg/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)
    at process._tickCallback (internal/process/next_tick.js:61:11)
" 

Then

[winston] Attempt to write logs with no transports {"warning":{"name":"UnhandledPromiseRejectionWarning"},"level":"warn","message":"Error: \n    at Http2CallStream.call.on (/tsg/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:96:45)\n    at Http2CallStream.emit (events.js:203:15)\n    at process.nextTick (/tsg/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)\n    at process._tickCallback (internal/process/next_tick.js:61:11)"}

Then I just get repeated messages about writing logs with no transports for a while, and then it goes completely silent.

@bcoe bcoe added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 18, 2019
@bcoe
Contributor

bcoe commented Sep 18, 2019

@yinzara does the client begin logging again after this, or is it necessary to redeploy the service?

@yinzara
Author

yinzara commented Sep 18, 2019

It stops logging completely and I must delete the pod

@bcoe
Contributor

bcoe commented Sep 19, 2019

@yinzara that sounds extremely frustrating, would you be able to provide the code that you use to configure logging?

@alexander-fenster
Contributor

This is frustrating indeed. @yinzara What's in your npm ls? I'm specifically interested in the version of @grpc/grpc-js. We've had some fixes there recently related to the proper handling of connection resets, so if it's older than v0.5.3, I'd try to update it and give the latest version a try.

@alexander-fenster
Contributor

@murgatroid99 this is FYI for now, we see some rejection coming from @grpc/grpc-js/build/src/client.js:96, it's this code:

        call.on('status', (status) => {
            /* We assume that call emits status after it emits end, and that it
             * accounts for any cancelWithStatus calls up until it emits status.
             * Therefore, considering the above event handlers, status.code should be
             * OK if and only if we have a non-null responseMessage */
            if (status.code === constants_1.Status.OK) {
                callback(null, responseMessage);
            }
            else {
                const error = Object.assign(new Error(status.details), status);
                callback(error);
            }
        });

but there seems to be nothing in error details. Does this ring any bells for you?

@yinzara
Author

yinzara commented Sep 19, 2019

We are using @grpc/grpc-js version 0.4.3 so I'll update to latest and see if that makes any difference.

To give a little more color: we are running an Istio + Kubernetes setup, and this problem only happens when we don't use the GCloud Compute metadata application credentials and instead provide a GOOGLE_APPLICATION_CREDENTIALS key (so that we can be more fine-grained in our access control of the service accounts).

Initially we saw this problem happen nearly all the time on application start up.

What we found was that Istio's sidecar was starting at the same time as the application. Depending on which code executed first, the logger would start up, go retrieve the access token from the googleapis auth endpoint using the private key, and instantly fail because the sidecar wasn't started yet and refused all outbound requests. Somehow, the failure to retrieve the access token caused Winston to remove the transport (I have no idea why), or at least to start reporting that there were no transports.

We were able to alleviate the issue by starting Winston with only a console transport and then use the gtoken library to retrieve the access token (which I saw was what gax was using behind the scenes). If it failed, we would retry on a backoff. If it succeeded, we would remove the console transport and add the Google Cloud Winston transport.
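The startup workaround described above can be sketched roughly as follows. This is a hedged illustration, not the author's exact code: `fetchAccessToken`, `consoleTransport`, and `LoggingWinston` usage are placeholders/assumptions; the token fetch stands in for a wrapper around gtoken's getToken(), and the transport swap uses Winston's standard add/remove API.

```javascript
// Sketch of the described workaround: start with only a console transport,
// retry the token fetch on an exponential backoff, and only swap in the
// Cloud transport once the token fetch has succeeded.
async function retryWithBackoff(fn, { retries = 5, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(); // e.g. a wrapper around gtoken's getToken()
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the last retry
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise(resolve => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}

// Usage sketch (assumptions: `logger` was created with only `consoleTransport`,
// `fetchAccessToken` wraps gtoken, `LoggingWinston` comes from
// @google-cloud/logging-winston):
//
// retryWithBackoff(fetchAccessToken)
//   .then(() => {
//     logger.remove(consoleTransport);
//     logger.add(new LoggingWinston());
//   })
//   .catch(err => console.error('giving up on Cloud Logging:', err));
```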

This solved our problem at startup; however, it seems this problem occurs as part of the process to renew/create the access token as well. Most of the time the renewal is successful, but in the case where it's not, the library somehow gets into this state of non-function.

We had the exact same problem with GCP tracing as well (causing us to abandon it too). If we started tracing with the application (as is required), it would immediately try and fail to get an access token and then cease to function (never retrying).

@filipsuk

We are facing a similar issue on Google Compute managed group instances.

"@google-cloud/logging-winston": "1.0.0"

[winston] Attempt to write logs with no transports {"code":2,"details":"Metadata key \"a\u0000\u0000\u0000\" contains illegal characters","metadata":{"internalRepr":{}},"note":"Exception occurred in retry method that was not classified as transient","level":"error","message":"Unhandled Rejection happened: Metadata key \"a\u0000\u0000\u0000\" contains illegal charactersMetadata key \"a\u0000\u0000\u0000\" contains illegal characters","stack":"Error: Metadata key \"a\u0000\u0000\u0000\" contains illegal characters\n at Http2CallStream.call.on (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:102:45)\n at Http2CallStream.emit (events.js:203:15)\n at Http2CallStream.endCall (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:74:18)\n at ClientHttp2Stream.stream.on (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:205:30)\n at ClientHttp2Stream.emit (events.js:198:13)\n at emit (internal/http2/core.js:236:8)\n at process._tickCallback (internal/process/next_tick.js:63:19)"}

@gae123

gae123 commented Sep 29, 2019

We have hit this issue too. What is very annoying is that it can take up to 24 hours after a new deployment for it to kick in.

The very first error in the logs is this. Then we see other error messages about trying to log with no transport.
[screenshot]

We definitely do not log 1M messages.

Our stack is:

GKE: 1.14.3-gke.11
nodejs: v10.16.2
"winston": "3.2.1"
"@google-cloud/logging-winston":"1.1.1",

I just opened case 20792133 with GCP support referencing this github issue.

@bcoe I have seen the service stop logging a couple of times, but other times it keeps logging. In both cases it starts leaking memory. See the graph below; the knee of the graph is exactly when the first such message appears. A few hours later the service runs out of memory and crashes.

@alexander-fenster I see grpc-js once at the top level at version 0.5.4 and once embedded in another library at version 0.3.6

"name": "@google-cloud/logging",
"version": "5.2.0",

[screenshot]

(I tried upgrading to logging-winston 2.0 and hit a TypeScript issue complaining about "Cannot find name 'Long'"; I opened GCP case 20788496 for this one.)

@gae123

gae123 commented Sep 29, 2019

Since my previous post:

  1. Figured out why I could not upgrade to the latest @google-cloud/logging-winston (see here). Rebuilt and deployed; the issue has not happened again, but it hasn't been 24 hours yet...
  2. @alexander-fenster here is a more detailed `yarn why`:
yarn why v1.17.3
[1/4] Why do we have the module "@grpc/grpc-js"...?
[2/4] Initialising dependency graph...
[3/4] Finding dependency...
[4/4] Calculating file sizes...
=> Found "@grpc/[email protected]"
info Reasons this module exists
   - "@google-cloud#logging-winston#@google-cloud#logging#google-gax" depends on it
   - Hoisted from "@google-cloud#logging-winston#@google-cloud#logging#google-gax#@grpc#grpc-js"
info Disk size without dependencies: "340KB"
info Disk size with unique dependencies: "448KB"
info Disk size with transitive dependencies: "448KB"
info Number of shared dependencies: 1
=> Found "@google-cloud/common-grpc#@grpc/[email protected]"
info This module exists because "@google-cloud#logging-winston#@google-cloud#logging#@google-cloud#common-grpc" depends on it.
info Disk size without dependencies: "460KB"
info Disk size with unique dependencies: "568KB"
info Disk size with transitive dependencies: "568KB"
info Number of shared dependencies: 1
Done in 0.26s.

@bcoe
Contributor

bcoe commented Sep 30, 2019

@gae123 I'm keeping a close eye, let me know if you see this issue again early this week.

@gae123

gae123 commented Sep 30, 2019

Issue did happen with version @google-cloud/logging-winston 2.1.1 too

[screenshot]

The logs:
[screenshot]

@yinzara
Author

yinzara commented Oct 3, 2019

I have updated to the following library versions and added the following as "resolutions" to my package.json to ensure it's using these versions. When I updated to @grpc/grpc-js 0.6.x or higher, everything stopped working, so I've stayed at 0.5.4.

{
  "dependencies": {
    "@google-cloud/common": "^2.2.2",
    "@google-cloud/common-grpc": "^1.0.5",
    "@google-cloud/logging": "^5.3.1",
    "@google-cloud/logging-winston": "^2.1.0",
    "@google-cloud/profiler": "^3.0.0",
    "@google-cloud/trace-agent": "^4.2.0",
    "@grpc/grpc-js": "^0.5.4",
    "@grpc/proto-loader": "^0.5.2",
    "winston": "^3.2.1"
  },
  "resolutions": {
    "@google-cloud/common": "2.2.2",
    "@google-cloud/common-grpc": "1.0.5",
    "@google-cloud/logging": "5.3.1",
    "@grpc/grpc-js": "0.5.4",
    "@grpc/proto-loader": "0.5.2",
    "gaxios": "2.0.1",
    "grpc": "1.23.3",
    "google-auth-library": "5.2.2",
    "google-gax": "1.6.2",
    "gtoken": "4.0.0"
  }
}

Everything seems great, and the service works with no issues for 6 to 12 hours. Then I see the following error and it stops functioning altogether. I think we need to escalate this or get someone to respond. This is a crippling issue.

[winston] Attempt to write logs with no transports
[winston] Attempt to write logs with no transports
[winston] Attempt to write logs with no transports

E 2019-10-02T18:49:40.684740085Z    'Exception occurred in retry method that was not classified as transient' }
 
E 2019-10-02T18:49:40.684735977Z   note:
 
metadata: Metadata { options: undefined, internalRepr: Map {} },
details: '',
code: 1,
Error from log { Error: at Http2CallStream.call.on (/tsg/node_modules/@grpc/grpc-js/build/src/client.js:96:45) at Http2CallStream.emit (events.js:203:15) at process.nextTick (/tsg/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22) at process._tickCallback (internal/process/next_tick.js:61:11)

@alexander-fenster
Contributor

@yinzara Re: grpc-js v0.6.x not working, we have fixed an issue that might've caused that just today (grpc/grpc-node#1062). May I bother you to try v0.6.4?

@gae123

gae123 commented Oct 3, 2019

This is killing us, any update? As I said the company has opened a GCP support case as well but they are not going anywhere with Google support (20792133).

@bcoe bcoe added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 3, 2019
@alexander-fenster
Contributor

Trying to repro that on the latest @grpc/grpc-js - will try hitting the size limit first and see if the transport error occurs then. Will keep you updated.

@alexander-fenster alexander-fenster self-assigned this Oct 3, 2019
@gae123

gae123 commented Oct 3, 2019

Thanks for your attention to this.

If you can suggest a workaround, that would help too. We are not married to Winston; if another library would not cause the issue, we can change our logging layer (we use something internally that logs to the console locally and via Winston when running on GCP). But if we do that and still hit the limitation at the Google logging layer, we should probably wait for a fix.

Like I said, I do not think it is code under our control that generates these huge log messages. Does this reflect buffering of multiple messages, or do you think some library is really trying to log such a big message?

@alexander-fenster
Contributor

alexander-fenster commented Oct 3, 2019

I have a repro. Looking into it...

const winston = require('winston');
const {LoggingWinston} = require('@google-cloud/logging-winston');
const loggingWinston = new LoggingWinston();

// Create a Winston logger that streams to Stackdriver Logging
// Logs will be written to: "projects/YOUR_PROJECT_ID/logs/winston_log"
const logger = winston.createLogger({
  level: 'info',
  transports: [
    loggingWinston,
  ],
});

function doLog(size, count) {
  let string = `[W] This is a long log message. Its length is ${size}K. `;
  while (string.length < size * 1024) {
    string += 'x';
  }
  
  for (let i = 0; i < count; ++i) {
    const text = `[${i}] ${string}`;
    logger.info(text);
  }
}

doLog(1200, 1);
setTimeout(() => { doLog(1, 1000); }, 1000);

With this, after I get an expected unhandled rejection for a huge log message, I start getting Winston's messages about having no transport:

(node:47459) UnhandledPromiseRejectionWarning: Error: Log entry with size 1.17M exceeds maximum size of 256.0K
    at Http2CallStream.<anonymous> (/Users/fenster/logging-repro/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:96:45)
    at Http2CallStream.emit (events.js:201:15)
    at /Users/fenster/logging-repro/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22
    at processTicksAndRejections (internal/process/task_queues.js:82:9)
(node:47459) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:47459) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
[winston] Attempt to write logs with no transports {"message":"[0] [W] This is a long log message. Its length is 1K. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","level":"info"}

Some preliminary thoughts:

  1. I suspect the long message is still coming from your code, or maybe somehow from Winston. We do have request bundling, but the bundling does not group log messages into one entry, and the error says that an entry size exceeds the limit.

  2. I could not reproduce any problem using @google-cloud/logging directly, without Winston. The Attempt to write logs with no transports error comes from Winston. It's unclear yet whether it's Winston's fault or our layer's (@google-cloud/logging-winston). I'm looking into this now.

Please hold on...

@alexander-fenster
Contributor

alexander-fenster commented Oct 3, 2019

Seems like the logger gets deleted from Winston loggers after throwing an exception:

console.log('Logger pipes: ', logger._readableState.pipes); // logs the `LoggingWinston` object
doLog(1200, 1);
setTimeout(() => {
  console.log('Logger pipes: ', logger._readableState.pipes); // logs `null`
  doLog(1, 1);
}, 1000);

I'm guessing we might need to catch the exception somewhere in the Winston wrapper. Looking...

@yinzara
Author

yinzara commented Oct 3, 2019

Right now I'm using 0.6.4 in my running containers and haven't had any log messages drop yet, but your most recent comment about Winston dropping transports when it gets an exception looks like a smoking gun.

I'm never writing large log messages, so an error from that is unlikely. HOWEVER, we have seen that this problem mostly occurs intermittently and only after a period of time. The one thing that does happen intermittently after a period of time is the refresh of the access token. If it refreshed and threw an error, maybe it's removing the transports as you describe.

@alexander-fenster
Contributor

@yinzara There was an issue in @grpc/grpc-js before version 0.6.x that sometimes caused the first network request after a long period of silence to fail (going a little bit into details, an RST TCP packet was sometimes not handled correctly). This is fixed in 0.6.x, which may explain why you're seeing less intermittent failures than before.

That said, it's entirely possible to get some kind of exception from the underlying logging library (basically, this @google-cloud/logging). The network is full of possible exceptions; a long message is just one of them. So I'm going to keep looking for the right place to tune this behavior so that the transport does not get removed after a random exception is thrown.

[by this time, I found that this library correctly communicated the error by calling the callback(err) as expected, so seems like an unhandled promise rejection that causes transport removal happens somewhere either in @google-cloud/logging-winston or one of its dependencies, but I'm not going to move this issue there until I find the exact place]

@yinzara
Author

yinzara commented Oct 4, 2019

The "unhandled promise rejection" can be fixed by adding an event listener for the "error" event on the logger after you create it. However, as you've seen, ANY error from a transport causes the pipes to become null.

logger.on('error', error => console.error(error));

I have a feeling this issue may be related:
winstonjs/winston#1364

@yinzara
Author

yinzara commented Oct 4, 2019

I can say the upgrade to @grpc/grpc-js 0.6.4 was definitely bad. I started getting consistent "No connection established" errors in other parts of my app that use Google services.

@alexander-fenster
Contributor

@yinzara Thank you for your comments!

Re: @grpc/grpc-js v0.6.4, may I ask you to submit an issue in their repository? https://github.com/grpc/grpc-node/issues Include all the information you have, maybe a stack, exact error messages, etc. I feel it's not something directly related to this problem.

@gae123

gae123 commented Oct 5, 2019

Yesterday evening we yanked out the Google Winston logging library and used "@google-cloud/logging": "5.3.1" directly.

I now see another failure; what happened after that is documented in issue #596.

@yinzara
Author

yinzara commented Oct 5, 2019

I believe we have the original cause of this issue solved.

See winstonjs/winston#1364 .

Turns out it's actually Winston. However, @gae123, I think your issue is different and equally concerning.

@bcoe
Contributor

bcoe commented Oct 7, 2019

CC: @soldair, any thoughts about this issue with Winston?

@bcoe
Contributor

bcoe commented Oct 10, 2019

@yinzara I'm going to go ahead and close this, since it sounds like you've worked around the issue; if you bump into similar issues please feel free to reopen.

@gae123, let's keep our conversation centered around #596 👍

@bcoe bcoe closed this as completed Oct 10, 2019
@google-cloud-label-sync google-cloud-label-sync bot added the api: logging Issues related to the googleapis/nodejs-logging API. label Jan 31, 2020
@maxfrigge

I am having the same issue.

Do I understand correctly that dropping Winston is the solution?

I wouldn't mind, to be honest. But this also makes me think that, on Cloud Run, logging to stdout would probably be the most stable solution.

Would there even be a benefit of using the logging API vs stdout on Cloud Run?

@yinzara
Author

yinzara commented Jul 9, 2020

The problem with logging to stdout is errors. Because stdout treats each line as a different log message, you can never actually read an error that spans more than one line, nor does it get properly handled by Stackdriver Error Reporting. You also lose the ability to change logging levels and formats dynamically.
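(As an aside, and my own assumption rather than part of the fix here: the multi-line problem with stdout can be mitigated by emitting one JSON object per line, since the Cloud Logging agent parses each such line as a single structured entry, keeping a stack trace in one message field. A minimal sketch:)

```javascript
// Structured single-line JSON logging to stdout. "severity" and "message"
// are fields the Cloud Logging agent recognizes; JSON.stringify escapes
// embedded newlines, so a multi-line stack stays on one physical line.
function formatEntry(severity, message, extra = {}) {
  return JSON.stringify({ severity, message, ...extra });
}

function logEntry(severity, message, extra) {
  process.stdout.write(formatEntry(severity, message, extra) + '\n');
}

logEntry('ERROR', new Error('boom').stack, { service: 'demo' });
```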

If you just update to the latest version of Winston it has my fix in it (i.e. version 3.3.3 or higher).

If you don't want to upgrade, the actual solution to this problem for most is to just add the following snippet:

log.on("error", error => {
  if (!log.transports.includes(winstonTransport)) {
    log.add(winstonTransport)
  }
})

where "log" is the Winston logger instance and "winstonTransport" is a reference to an instance of the LoggingWinston class from @google-cloud/logging-winston.
