
[Bug] 0.27.1 introduced a bug when two queries are sent one after another #1210

Closed
MaximillionMai opened this issue May 17, 2020 · 34 comments
Labels: bug (Generally incorrect behavior), networking-stack

@MaximillionMai

MaximillionMai commented May 17, 2020

For this function in ApolloClient:

```swift
@discardableResult public func fetch<Query: GraphQLQuery>(query: Query,
                                                          cachePolicy: CachePolicy = .returnCacheDataElseFetch,
                                                          context: UnsafeMutableRawPointer? = nil,
                                                          queue: DispatchQueue = DispatchQueue.main,
                                                          resultHandler: GraphQLResultHandler<Query.Data>? = nil) -> Cancellable
```

[Bug] 0.27.1 introduced a bug when calling the function above one call after another with different queries: one of the two queries never calls its resultHandler callback. Sometimes the one that does call the callback comes back with this error: dataForRequestNotFound.

This bug does not reproduce in 0.24.1. I will check whether it reproduces in 0.25.1 and report back.
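
For reference, a minimal sketch of the call pattern being described; the query types and endpoint URL here are placeholders, not real generated code:

```swift
import Apollo
import Foundation

// Hypothetical codegen'd query types and a placeholder endpoint.
let client = ApolloClient(url: URL(string: "https://example.com/graphql")!)

// Two fetches issued back to back with different queries.
client.fetch(query: HeroQuery()) { result in
  print("hero: \(result)")        // on 0.27.1, one of these handlers sometimes never fires
}
client.fetch(query: EpisodeListQuery()) { result in
  print("episodes: \(result)")    // or it comes back with .dataForRequestNotFound
}
```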

@designatednerd
Contributor

@MaximillionMai Can you confirm which version this breaks on for you? That will help me do a comparison between that version and the previous version to figure out what might have changed here.

@yuriiik

yuriiik commented May 19, 2020

Hi @designatednerd! We're experiencing the same issue. Testing shows that it reproduces in 0.27.1 and does NOT reproduce in 0.24.1 and 0.25.1.

@sfoulston
Contributor

I have just noticed this too, in v0.27.1. In my case it's when calling watch(...) simultaneously for two different queries. I receive a dataForRequestNotFound error occasionally for one of the queries.

@sneakyness

We've been seeing this occasionally too; we haven't had time to look into it in more depth because we were assuming it was the dogshit wifi we have currently. It's just a series of normal fetches in our case.

@designatednerd
Contributor

Looks like the problem is here - the data object that's been dequeued for the task which has just finished is nil, which is weird, since we create an empty data object for each task before we start it.

This was introduced in 0.27.0 so anything prior to that will not be affected.

Thought: Are any of you calling cancel directly on the Cancellable returned from URLSessionClient?
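
To make the moving parts concrete, here's a simplified sketch of the bookkeeping pattern being described. This is not the actual URLSessionClient source, just the shape of it, with names assumed:

```swift
import Foundation

// Illustrative only: per-task response data keyed by the task's Int identifier.
final class TaskDataStore {
  private var datas: [Int: Data] = [:]
  private let lock = NSLock()

  // An empty Data is stored for every task before it starts.
  func register(_ task: URLSessionTask) {
    lock.lock(); defer { lock.unlock() }
    datas[task.taskIdentifier] = Data()
  }

  // Bytes are appended as the URLSession delegate delivers them.
  func append(_ chunk: Data, for task: URLSessionTask) {
    lock.lock(); defer { lock.unlock() }
    datas[task.taskIdentifier]?.append(chunk)
  }

  // On completion the data is dequeued; getting nil back here is what
  // surfaces as dataForRequestNotFound.
  func dequeueData(for task: URLSessionTask) -> Data? {
    lock.lock(); defer { lock.unlock() }
    return datas.removeValue(forKey: task.taskIdentifier)
  }
}
```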

@sneakyness

Nope! The only place I reliably see it is when I'm calling a series of fetch requests one after another. At one point I kick off a request in parallel for each of the objects received, but the request that fails isn't always one of those.

@designatednerd
Contributor

@sneakyness Got a rough guess on how many requests at once you've got going before you see this issue?

@designatednerd
Contributor

Also, that error has an associated value that should have the originalRequest from the task that we couldn't pull data for - are you able to tell if there's any pattern to what's failing vs what's working?

@sneakyness

sneakyness commented May 19, 2020

2-3 requests completed serially (each one must finish before the next can start), then 1+ fetches for one type while also making a single request of another. The 1+ is commonly 4-10+; making more does not seem to change how sporadic it is. I'll start logging that now, but in my experience it's always one of the last two.

@designatednerd
Contributor

Thanks. Woof, sporadic bugs are my faaaavorite to troubleshoot 😆

@sneakyness

Unfortunately I'm in the middle of an API model refactor; if I'm lucky, I should be able to automate firing this process over and over to try to reproduce it by the end of the week.

@designatednerd
Contributor

Just pushed up a branch where I added a test that checks if 300 identical URL requests hitting the client at the same time (from different threads) create different IDs, and it passes.

Throwing some breakpoints in and doing some po's, it looks like it does create one instance per ID of the data in the self.datas property for URLSessionClient, and none of them are nil. That puts a kibosh on my theory of "Maybe different requests are winding up with the same ID if they're being sent at the same time, causing the data to be deallocated when the first request finishes."

The only place the data for a particular task ID gets removed is either in clearTask or clearAllTasks - clearAllTasks only gets hit when the session invalidates, which would give you a different error. clearTask only gets hit either a) If you cancel via URLSessionClient's cancel(task:) method or b) If the task actually completes.

I've added this level of detail because I'm basically stumped. Clearly, something's causing the data for a particular URLSessionTask identifier key to get nulled out. What the hell is doing that? If anyone's got thoughts, would love to hear them.
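
For context, the test being described is roughly this shape; this is a sketch only that hits URLSession directly, and the URL, count, and names are assumptions rather than the actual test on the branch:

```swift
import XCTest

final class ConcurrentTaskIdentifierTests: XCTestCase {
  func testConcurrentlyCreatedTasksGetDistinctIdentifiers() {
    let session = URLSession(configuration: .default)
    let request = URLRequest(url: URL(string: "https://www.example.com")!)
    let lock = NSLock()
    var identifiers = [Int]()
    let group = DispatchGroup()

    // Create 300 tasks for the same request from different threads at once.
    for _ in 0..<300 {
      group.enter()
      DispatchQueue.global().async {
        let task = session.dataTask(with: request)
        lock.lock()
        identifiers.append(task.taskIdentifier)
        lock.unlock()
        group.leave()
      }
    }
    XCTAssertEqual(group.wait(timeout: .now() + 10), .success)

    // If every task got a unique identifier, these counts match.
    XCTAssertEqual(Set(identifiers).count, identifiers.count)
  }
}
```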

@sneakyness

I'm using DispatchQueue.concurrentPerform; I'm not super sure how different that is from .userInitiated.

I'm not doing this specifically for the network requests, there's a decent amount of objects being initialized and sorted, just happened to be a clean spot to snag an id I needed for the request.

Actually I've just realized that it might get called multiple times per index. Do you think that's causing it? 🤡

@designatednerd
Contributor

> Actually I've just realized that it might get called multiple times per index

I'm not sure what you mean by this.

@sneakyness

If it's parallel can't it rarely call the same index twice? Especially in cases where the number of tasks is >8 (afaik 8 is the max number of threads it'll kick off)

@designatednerd
Contributor

I still am missing something - what is "it" that could be calling the same index multiple times in this context - do you mean the URLSession creating tasks with the same identifier, or something else?

@sneakyness

concurrentPerform, so it might be calling the same fetch multiple times, but that doesn't explain why it would fail with "dataForRequestNotFound" instead of eventually just succeeding

@designatednerd
Contributor

Ah! That makes more sense. I updated the test to use concurrentPerform and yep, multiple tasks with the same ID are getting generated.
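
For anyone following along, a self-contained sketch of what the concurrentPerform variant of the same check looks like; the URL and iteration count are placeholders:

```swift
import Foundation

let session = URLSession(configuration: .default)
let request = URLRequest(url: URL(string: "https://www.example.com")!)
let lock = NSLock()
var identifiers = [Int]()

// concurrentPerform runs the iterations in parallel and blocks until all finish.
DispatchQueue.concurrentPerform(iterations: 300) { _ in
  let task = session.dataTask(with: request)
  lock.lock()
  identifiers.append(task.taskIdentifier)
  lock.unlock()
}

// Any duplicate identifiers show up as a gap between these two counts.
print("created \(identifiers.count) tasks, \(Set(identifiers).count) distinct identifiers")
```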

@designatednerd
Contributor

(basically if multiple tasks with the same ID are getting generated, then my "I will unique this based on the task identifier" strategy falls apart entirely)

@sneakyness

sneakyness commented May 19, 2020

My case may be different from others in that the first request was fine. If we know for sure this is the cause, I can do whatever to make sure it's only called once.

If there were an implementation of fetch/watch/etc., like a fetchOnce, that either returned a request if there's no task ID, or did nothing/emitted something that said this was the case, would that be a good bandaid for the others' cases pending a bigger change?

@designatednerd
Contributor

The first request would probably be fine in what you were describing because it's not trying to make multiple requests at the same time; it's once you start making multiple concurrent requests that the ID incrementer fails. Kind of wish they'd gone for a UUID instead of an auto-incrementing int for that identifier...

I'll think on that - for now I need to look at what to do about the fact that the ID isn't unique if it's hit concurrently repeatedly, because I thought Apple was handling that and they ain't. 🙃

@designatednerd
Contributor

👋 Can I ask a couple of you folks to try out the fix/multifetch branch to see if the changes there make a difference? I've updated the dictionary to key off the URLSessionTask rather than its identifier, so this should stop the issue where something gets ripped out of the dictionary too early. The tests are passing, but I'd love to get confirmation this is working in the real world before I PR it.
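
In the abstract, the change looks roughly like the following; this is a sketch of the approach with assumed names, not the code on the branch:

```swift
import Foundation

// Sketch: key per-task state off the URLSessionTask object itself rather than
// its Int taskIdentifier. URLSessionTask is an NSObject, so distinct task
// objects hash as distinct keys even if they report the same identifier.
final class TaskDataStore {
  private var datas: [URLSessionTask: Data] = [:]
  private let lock = NSLock()

  func register(_ task: URLSessionTask) {
    lock.lock(); defer { lock.unlock() }
    datas[task] = Data()
  }

  func dequeueData(for task: URLSessionTask) -> Data? {
    lock.lock(); defer { lock.unlock() }
    return datas.removeValue(forKey: task)
  }
}
```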

@designatednerd added the bug (Generally incorrect behavior) label on May 21, 2020
@yuriiik

yuriiik commented May 22, 2020

Just tested the fix/multifetch branch. Unfortunately the issue still persists. In our case we just make two identical serial calls to the fetch function on the same queue, and for one of those invocations the resultHandler is never called back. It happens roughly 4 out of 5 times; sometimes it works as expected.

@RolandasRazma
Contributor

This sounds like a completion handler being overwritten somewhere, and a new call being created without cancelling the old one.

@designatednerd
Contributor

designatednerd commented May 22, 2020

@RolandasRazma That's definitely what was happening when using the task identifier as the key, since the task identifier would not get incremented if the session was getting hit quickly enough, and everything for the particular key would get hammered by the next task.

@yuriiik - Are you making other calls at the same time, or just the two identical calls? Also, thank you for testing, it's a bit frustrating that the concurrentPerform test didn't uncover this 😞

@yuriiik

yuriiik commented May 22, 2020

@designatednerd – just two identical calls, one after another.

@sneakyness

sneakyness commented May 22, 2020

The var is different in mine but otherwise they are the same call. Applied branch, saw the bug once, set a breakpoint, never saw it again 👻

The process of triggering the request was streamlined, I was able to run it on repeat long enough to forget about it and still got nothing!

@designatednerd
Contributor

@sneakyness Did it continue to work fine after you removed the breakpoint? I'm thinking it might be a Heisenbug (in that it changes when observed, probably because the breakpoint slows things up enough to stop whatever was happening concurrently before) if it goes back to not working. If it randomly broke once and then worked forevermore, I have no goddamned idea 🙃

@yuriiik Something occurred to me - can you confirm that you're not using Query Watchers? There's a bug in the current version where we're super-aggressively cancelling in-flight fetches if a cache result comes in, which is fixed with #1156 (merged but not yet shipped). If you are using query watchers, do you mind pulling master and letting me know if that fixes the issue?

@sneakyness

That lines up with what I figured was going on earlier: the only time it was happening was either after the backend had been rebuilt and the query did a cold run, or when we were in the "best worst zone" of wifi coverage, which caused a similar amount of delay.

This weekend I've got some nebulous and grand refactoring planned for the backend; I suspect once the schema tidies up, this will disappear. If I manage to cause it again after that, I'll let you know. I'm inclined to say that this error is a "smell" and that ultimately it's best to find a way to get the result you want with fewer requests, even if it shouldn't happen.

@MaximillionMai
Author

@designatednerd Sorry to get back to you so late! It was indeed introduced in version 0.27.1.
For me, it repros when I make two different query calls one right after another.

@yuriiik

yuriiik commented May 24, 2020

@designatednerd we're not using Query Watchers, so I don't think that's the case.

@designatednerd
Contributor

OK, I think I got it. Basically: Concurrency is hard. 🙃

Please see #1227.

@designatednerd added this to the Next Release milestone on May 27, 2020
@designatednerd
Contributor

#1227 was merged and is now available as part of 0.28.0. If you are still experiencing any problems around this, please open a new issue indicating what version you are using.

Thanks for uncovering this!

@yuriiik

yuriiik commented Jun 10, 2020

We can no longer reproduce the issue on 0.28.0. @designatednerd thank you so much!
