-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to write logs with no transports #586
Comments
@yinzara does the client begin logging again after this, or is it necessary to redeploy the service? |
It stops logging completely and I must delete the pod |
@yinzara that sounds extremely frustrating, would you be able to provide the code that you use to configure logging? |
This is frustrating indeed. @yinzara What's in your |
@murgatroid99 this is FYI for now, we see some rejection coming from @grpc/grpc-js/build/src/client.js:96, it's this code: call.on('status', (status) => {
/* We assume that call emits status after it emits end, and that it
* accounts for any cancelWithStatus calls up until it emits status.
* Therefore, considering the above event handlers, status.code should be
* OK if and only if we have a non-null responseMessage */
if (status.code === constants_1.Status.OK) {
callback(null, responseMessage);
}
else {
const error = Object.assign(new Error(status.details), status);
callback(error);
}
}); but there seems to be nothing in error details. Does this ring any bells for you? |
We are using @grpc/grpc-js version 0.4.3 so I'll update to latest and see if that makes any difference. To give a little bit more color, we are running a Istio w/ Kubernetes solution and this problem only happens when we don't use the GCloud Compute metadata application credentials and instead provide a GOOGLE_APPLICATION_CREDENTIALS key (so that we can be more fine grained in our access control of the service accounts). Initially we saw this problem happen nearly all the time on application start up. What we found out was happening was Istio's sidecar was starting at the same time as the application and depending on which code executed first, the log would start and go to retrieve the access token from the googleapis auth endpoint using the private key and would instantly fail because the sidecar wasn't started yet and would refuse all outbound requests. Somehow, the failure to retrieve the access token would cause Winston to remove the transport (I have no idea why) or at least start reporting the error that there were no transports. We were able to alleviate the issue by starting Winston with only a console transport and then use the This solved our problem for start up, however it seems like this problem occurs as part of the process to renew/create the access token as well. Most of the time the renewal is successful, however in the case where it's not, somehow the library gets in to this state of non-function. We had the exact same problem with GCP tracing as well (causing us to abandon using it as well). If we started tracing with the application (as is required), it would immediate try and fail to get an access token and then cease to function (never retrying). |
We are facing a similar issue on Google Compute managed group instances.
|
We have hit this issue too. What is very annoying is that it can take up to 24 hours from a new deployment to kick in. The very first error in the logs is this. Then we see other error messages about trying to log with no transport. We definitely do not log 1M messages. Our stack is: GKE: 1.14.3-gke.11 I just opened case 20792133 with GCP support referencing this github issue. @bcoe I have seen the service stop logging a couple of times, but other times it keeps logging. In both cases it starts leaking memory. See the graph below, the knee of the graph is just exactly when the first such message appears. A few hours later the service runs out of memory and crashes. @alexander-fenster I see grpc-js once at the top level at version 0.5.4 and once embedded in another library at version 0.3.6 "name": "@google-cloud/logging", (I tried upgrading to logging-winston 2.0 and I hit a typescirpt issue complaining about 'Cannot find name 'Long' opened GCP case 20788496 for this one) |
Since my previous post:
|
@gae123 I'm keeping a close eye, let me know if you see this issue again early this week. |
I have updated to the following library versions and added the following as "resolutions" to my package.json to ensure it's using these versions. When I updated to @grpc/grpc-js 0.6.x or higher, everything stops working so I've stayed at 0.5.4. {
"dependencies": {
"@google-cloud/common": "^2.2.2",
"@google-cloud/common-grpc": "^1.0.5",
"@google-cloud/logging": "^5.3.1",
"@google-cloud/logging-winston": "^2.1.0",
"@google-cloud/profiler": "^3.0.0",
"@google-cloud/trace-agent": "^4.2.0",
"@grpc/grpc-js": "^0.5.4",
"@grpc/proto-loader": "^0.5.2",
"winston": "^3.2.1"
},
"resolutions": {
"@google-cloud/common": "2.2.2",
"@google-cloud/common-grpc": "1.0.5",
"@google-cloud/logging": "5.3.1",
"@grpc/grpc-js": "0.5.4",
"@grpc/proto-loader": "0.5.2",
"gaxios": "2.0.1",
"grpc": "1.23.3",
"google-auth-library": "5.2.2",
"google-gax": "1.6.2",
"gtoken": "4.0.0"
}
} Everything seems great and the service will work with no issues for 6 to 12 hours. Then I see the following error and it stops functioning all together. I think we need to upgrade this or get someone to respond. This is a crippling issue.
|
@yinzara Re: grpc-js v0.6.x not working, we have fixed an issue that might've caused that just today (grpc/grpc-node#1062). May I bother you to try v0.6.4? |
This is killing us, any update? As I said the company has opened a GCP support case as well but they are not going anywhere with Google support (20792133). |
Trying to repro that on the latest |
Thanks for your attention to this. If you can suggest a work around that would help too. We are not married to winston, if another library would not cause the issue we can try to change our logging layer, we use something internally that uses console locally and winston when running on GCP. But if we do that and we still hit the limitation at the google logging layer we should probably wait for a fix. Like I said I do not think it is the code under our control that generates these huge log messages. Does this reflect buffering of multiple messages or indeed you think some library is trying to log such a big message? |
I have a repro. Looking into it... const winston = require('winston');
const {LoggingWinston} = require('@google-cloud/logging-winston');
const loggingWinston = new LoggingWinston();
// Create a Winston logger that streams to Stackdriver Logging
// Logs will be written to: "projects/YOUR_PROJECT_ID/logs/winston_log"
const logger = winston.createLogger({
level: 'info',
transports: [
loggingWinston,
],
});
function doLog(size, count) {
let string = `[W] This is a long log message. Its length is ${size}K. `;
while (string.length < size * 1024) {
string += 'x';
}
for (let i = 0; i < count; ++i) {
const text = `[${i}] ${string}`;
logger.info(text);
}
}
doLog(1200, 1);
setTimeout(() => { doLog(1, 1000); }, 1000); With this, after I get an expected unhandled rejection for a huge log message, I start getting Winston's messages about having no transport:
Some preliminary thoughts:
Please hold on... |
Seems like the logger gets deleted from Winston loggers after throwing an exception: console.log('Logger pipes: ', logger._readableState.pipes); // logs `LoggerWinston` object
doLog(1200, 1);
setTimeout(() => {
console.log('Logger pipes: ', logger._readableState.pipes); // logs `null`
doLog(1, 1);
}, 1000); I'm guessing we might need to catch exception somewhere in the Winston wrapper. Looking... |
Right now i'm using 0.6.4 in my running containers and haven't had any log messages drop yet but your most recent comment about winston dropping transports when it gets an exception seems like we may have a smoking gun. I'm never writing large log messages so it throwing an error from that is unlikely HOWEVER we have seen that this problem mostly occurs intermittently and only after a period in time. The one thing that does happen intermittently and after a period in time is the refresh of the access token. If it refreshed and threw an error, maybe it's removing the transports as you describe. |
@yinzara There was an issue in Having that said, it's not impossible to get any kind of exception from the underline logging library (basically, this [by this time, I found that this library correctly communicated the error by calling the |
The "unhandled promise rejection" can be fixed by adding an event listener on the "error" event to the logger after you create it. However as you've seen ANY error from a transport causes the pipes to become null.
I have a feeling this issue may be related: |
I can say the upgrade to @grpc/grpc-js 0.6.4 was definitely bad. I started getting consistent "No connection established Error: No connection established", errors in other parts of my app that use Google Services. |
@yinzara Thank you for your comments! Re: |
Yesterday evening we yanked out the google winston log library and used directly I now see another failure, we document what happened after that in issue #596 |
I think I believe we have the original cause of this issue solved. See winstonjs/winston#1364 . Turns out it's actually Winston. However @gae123 I think your issue is different and equally as concerning. |
CC: @soldair, any thoughts about this issue with Winston? |
I am having the same issue. Do I understand correctly that dropping Winston is the solution? I wouldn't mind to be honest. But this is also makes me think that, in Cloud Run logging to stdout would probably be the most stable solution. Would there even be a benefit of using the logging API vs stdout on Cloud Run? |
The problem with logging to stdout is errors. Because stdout considers each line a different log message, you can never actually read an error with more than one line nor does it get properly handled by StackDriver Errors. You also lose the ability to change logging levels and formats dynamically. If you just update to the latest version of Winston it has my fix in it (i.e. version 3.3.3 or higher). If you don't want to upgrade, the actual solution to this problem for most is to just add the following snippet:
where "log" is the winston log instance and "winstonTransport" is a reference to an instance of the LoggingWinston class from @google-cloud/logging-winston |
This seems to be the issue reported as part of #381 however is not caused by the same problem and a serious problem that I have to put in monitoring to kill my services when I see these log messages.
This is occurring in GKE and only occurs after the service is running for a period of time.
I'm running
@google-cloud/logging
version 5.2.0 andgoogle-gax
1.1.5.The log I'm posting below came from a service that started a 6pm PST then at around 20 minutes past midnight, I get one log message:
Then
then I just get repeated messages about writing logs with no transports for a while then it just goes completely silent
The text was updated successfully, but these errors were encountered: