This doesn't work out in general, I've found. If I use tasks in CachedHttpRepository, I hit the issue that it doesn't stream responses: it downloads the whole file first, then sends it. So I'm experimenting with nginx as a caching layer and just using plain HttpRepository to stream. But with that, some of the clients that use the task above end up getting dropped-connection errors; I guess they don't like trying to read from an already-exhausted stream. However, the biggest benefit of using the task was avoiding repeated requests for the very slow main project index anyway, so I just changed my SkippingRepository to do this:
... and that gets rid of the thundering herd.
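The snippet above was elided, but the idea described (let concurrent callers share one in-flight fetch of the slow index instead of each issuing their own request) can be sketched roughly like this. The class body, method names, and counter are all assumptions for illustration, not the project's actual code:

```python
import asyncio


class SkippingRepository:
    """Hypothetical sketch: share a single task for the slow project index."""

    def __init__(self):
        self._index_task = None   # the one shared in-flight (or completed) task
        self.fetch_count = 0      # instrumentation for this sketch only

    async def _fetch_index(self):
        # Stand-in for the real, very slow upstream request for /simple/.
        self.fetch_count += 1
        await asyncio.sleep(0)
        return {"project-a": "/simple/project-a/"}

    async def get_index(self):
        # No await between the check and the assignment, so on a single
        # event loop only the first caller creates the task; everyone
        # else awaits the same one, and at most one upstream request
        # is ever in flight.
        if self._index_task is None:
            self._index_task = asyncio.ensure_future(self._fetch_index())
        return await self._index_task


async def main():
    repo = SkippingRepository()
    results = await asyncio.gather(*(repo.get_index() for _ in range(50)))
    return repo.fetch_count, results[0]
```

One caveat with caching the task itself: a failed fetch stays cached too, so a real implementation would probably want to clear `_index_task` on error (or on expiry) so the index can be retried.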
Running some tests, I noticed that we often had many concurrent requests in flight for the same URL: they were hitting CachedHttpRepository while the original request was still in flight, so they never hit the cache.
This isn't ideal; eventually I did this (which may not be the best idea either, but it works):
... renaming the original _fetch_simple_page to _fetch_simple_page_task. It doesn't really matter here whether I remove items from the queue by the same task or by the same page URL, or whether there are multiple items in the queue with the same page_url; that kind of thing may happen in edge cases. The main thing is to not have 50 requests for /simple/ in flight.
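The elided change can't be reproduced here, but the coalescing behaviour it describes (concurrent requests for the same page_url share one task instead of each going upstream) might look something like the following sketch. The function names are taken from the comment above; the bodies, the dict-based bookkeeping, and the counter are assumptions:

```python
import asyncio

_in_flight: dict[str, asyncio.Task] = {}  # page_url -> the one shared fetch task
_fetch_calls = 0                          # instrumentation for this sketch only


async def _fetch_simple_page_task(page_url: str) -> str:
    # Stand-in for the real upstream fetch (the renamed original function).
    global _fetch_calls
    _fetch_calls += 1
    await asyncio.sleep(0)
    return f"<page for {page_url}>"


async def _fetch_simple_page(page_url: str) -> str:
    # Coalesce concurrent requests for the same URL onto a single task.
    # The done-callback drops the entry once the fetch settles, so a
    # later request re-fetches instead of reusing a stale/failed result.
    task = _in_flight.get(page_url)
    if task is None:
        task = asyncio.ensure_future(_fetch_simple_page_task(page_url))
        _in_flight[page_url] = task
        task.add_done_callback(lambda _t: _in_flight.pop(page_url, None))
    return await task


async def demo():
    pages = await asyncio.gather(
        *(_fetch_simple_page("/simple/") for _ in range(50))
    )
    return _fetch_calls, pages[0]
```

With this shape, 50 concurrent callers for /simple/ produce exactly one upstream request, which is the property the comment is after.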
It would probably make more sense to handle this by wrapping the session, so the deduplication applies to all outgoing requests, and also by throttling the requests that make it into the task behind a semaphore.
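The semaphore half of that suggestion could be sketched as a thin wrapper around the session. The `ThrottledSession` name and the `get`-only interface are assumptions here (a real aiohttp-style session has many more methods), but the shape of the throttle is the point:

```python
import asyncio


class ThrottledSession:
    """Hypothetical wrapper: cap concurrent outgoing requests with a semaphore."""

    def __init__(self, inner, max_concurrency: int = 10):
        self._inner = inner                          # the wrapped session
        self._sem = asyncio.Semaphore(max_concurrency)

    async def get(self, url: str):
        # At most `max_concurrency` calls are inside the inner session
        # at once; the rest queue up on the semaphore.
        async with self._sem:
            return await self._inner.get(url)


class _FakeSession:
    """Test double that records peak concurrency instead of doing I/O."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def get(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0)   # yield so other tasks can interleave
        self.active -= 1
        return url


async def demo_throttle():
    inner = _FakeSession()
    session = ThrottledSession(inner, max_concurrency=5)
    await asyncio.gather(*(session.get(f"/simple/p{i}/") for i in range(50)))
    return inner.peak
```

Putting the throttle at the session layer means every repository that shares the session is covered, rather than each code path remembering to limit itself.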