Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using the future package to aid in determining symbols and packages to export #2

Open
richcalaway opened this issue Aug 1, 2019 · 8 comments

Comments

@richcalaway
Copy link
Contributor

Background

Original implementation

In April 2018, in an unreleased version 1.4.6 of foreach, Rich Calaway introduced a new feature that used Henrik Bengtsson's future package (if available) to determine the appropriate symbols and packages to export to the foreach workers. In the initial implementation, future was used if it was installed and its namespace could be loaded via requireNamespace. Henrik responded with the following comments:

COMMENT #1: Surprised users

When future::getGlobalsAndPackages() will be used or not might
surprise users. Your proposal to use it when

requireNamespace("future", quietly=TRUE) == TRUE

risks confusing the end users and whoever helps troubleshooting issues
related to globals. They might get one result one day, and another
result the next day, just because the 'future' package happened to be
installed in between. The same if they run on different systems. If
someone runs into issues related to globals, the troubleshooting
counter questions have to involve: "Do you have the 'future' package
installed?" (unless they share a proper sessionInfo() that is).

One slightly less confusing condition would be to condition it on:

("future" %in% loadedNamespaces())

I've seen that style used in some packages. The user has to enable
the "feature" by making sure a package is loaded. However, that is
still not transparent to the end user since the 'future' package might
be loaded as a side effect by some other package. For instance, it
may be loaded when using registerDoSeq(), but then not when they retry
with registerDoParallel().

It might be better to let the user explicitly control this via an R option, e.g.

useFuture <- getOption("foreach.globalsAs", default = "future")

or default = "foreach". A middle ground, during your migration, is to
use something like:

useFuture <- getOption("foreach.globalsAs", default = NULL)
if (is.null(useFuture)) {
useFuture <- requireNamespace("future", quietly=TRUE)
if (useFuture) {
warning('foreach() will identify globals and packages using the
future package because that package is installed. To suppress this
warning set options(foreach.globalsAs = "future"). To use the
traditional approach of foreach for identifying globals, use
options(foreach.globalsAs = "foreach").')
} else {
warning('foreach() will identify globals and packages using the
foreach package because the alternative based on the future package
requires that future is installed. To suppress this warning set
options(foreach.globalsAs = "foreach"). To use future for identifying
globals, install that package.')
}
}

COMMENT #2: Maximum total size of globals

future::getGlobalsAndPackages() has a built-in protection for
exporting too large amounts of globals. It's currently set to 500
MiB, and currently only controlled via option
'future.globals.maxSize'. To minimize the surprise here, you might
wanna temporarily set this to +Inf to disable this check/assertion,
e.g.

gp <- local({
oopts <- options(future.globals.maxSize = +Inf)
on.exit(options(oopts))
future::getGlobalsAndPackages(ex, envir = env)
})

FYI, in the next release of the future package, you'll be able to do
future::getGlobalsAndPackages(..., maxSize = +Inf).

COMMENT #3: future+globals vs foreach (<= 1.4.5)

I've identified one use case where some people might say that the
'globals' package is too conservative. The case is when a variable is
global or local conditionally on some other variable/state, e.g.

{
if (runif(1) < 1/2) y <- 0
y
}

In future+globals, the above will NOT pick up 'y' as a global variable
(*), whereas in foreach (<= 1.4.5), it is identified as a global
variable (because you are using a more liberal approach).
Unfortunately, I don't this there is an easy solution to handle the
above ambiguous case, where a variable is global or local depending on
the run-time state, without adding significant processing overhead.
OTH, I think this is an oversight by the developer (or at least a
deliberate hack) and I don't mind "forcing" code to break in these
ambiguous cases since it's quite easy to make it non-ambiguous. So
far I've only identified one case of this in the wild:

example("avNNet", package = "caret", run.dontrun = TRUE)

so I think you shouldn't expect much reports on that.

(*) The main reason for future+globals failing to find 'y' here is
because it is a side-effect of its "ordered" code inspection for the
purpose of identifying 'x' as a global in { x <- x + 1 }. I'm
tracking this over at futureverse/globals#31

First revised implementation

Rich revised the implementation in foreach 1.5.0/1.5.1, adding a check for a global option foreachGlobals to see if the user preferred the original foreach functionality (which found some global symbols but did not search for additional package exports), and expanding the symbol search to include the union of those found by future's getGlobalsAndPackages function and those found by foreach's original method.

This revision addressed issue #3 and part of #1, but did not address the size of globals problem.

Henrik responded with the following additional comments (and Rich responded with >RBC inline comments):

  • DOCUMENTATION: A user who sets options(foreachGlobals = "foreach")
    may wonder how to undo that. From the code, it looks like one can do
    this by unsetting the option, i.e. options(foreachGlobals = NULL). It
    would help to document that too.

RBC --Agreed

  • DOCUMENTATION: It says "Beginning with foreach 0.5.0, foreach will
    use the future package, if available, to automatically detect
    needed packages." Without seeing the code, the term "available" is a
    bit ambiguous to the user. Maybe write "..., if it is installed and
    can be loaded, ..." instead. And clarify further by "If the future
    package is not installed, then the identification of globals will be
    done as if options(foreachGlobals = "foreach") was set."

RBC I'll work on this.

  • SURPRISE FACTOR: The silent, conditional requireNamespace("future")
    behavior may still be confusing, surprising, ... and makes it hard to
    troubleshoot. It's likely that there'll be comments like "weird, it
    works for me" kind of discussions.

RBC -- Agreed, but that's part of the point--I want to get the benefit of future as seamlessly as possible.
RBC If users need to set an option, most users won't.

  • CLEANUP: To avoid loading 'future' when not needed, you might wanna
    swap the order to if (!identical(getOption("foreachGlobals"),
    "foreach") && requireNamespace("future", quietly=TRUE)){){ ... }

RBC: Good idea!

  • PREDICTABILITY: You could introduce options(foreachGlobals =
    "future+foreach"), which if set, will produce an error if the 'future'
    package is not installed. OTH, not sure what the current default
    should be named - maybe "future+foreach-or-foreach"?
  • doFuture: If you could support/standardize on options(foreachGlobals
    = "future+foreach") and possible also options(foreachGlobals =
    "future"), then I could rely on that in doFuture rather than adding
    another option.

RBC: Yes, I think I can do this.

  • ROBUSTNESS: I've also played around with the idea of something like
    options(foreachGlobals = "manual") which will disable the automatic
    identification of globals and rely solely on arguments .export and
    '.packages`.

RBC: Another good idea.

Henrik responded to Rich's responses with the following additional comments:

I still argue its a bad idea that the behavior is conditioned on
whether you have a package installed or not. What is even worse is
that the dependent package (here 'future') may be installed at a
random time because it is installed together with some other package.
All of a sudden the behavior changes and the user has no idea why
because they did not change anything.

BTW, do you have a better place than an email thread where this can be
discussed and be properly documented? There's a great risk that
things are getting lost in nested email threads. I would love if you
would bring foreach to GitHub where the issue tracker provides an
excellent communication channel. I believe there's valueable feedback
from other developers that would reach you if you'd be on GitHub.
Also, I have two suggestions that I think would improve foreach a lot-
but I'll wait with those for now.

The second of these final comments is addressed with this issue; foreach has indeed been brought to GitHub and this issue is the first to be set into the issue tracker.

Current Development

Rich has handed off maintenance of foreach to Hong Ooi, but is leaving one final pass at the future integration as a branch richcala/foreachFutureTake3. This addresses most of the remaining issues--including the surprise factor. In this implementation, if the user is not using Microsoft R Open, the future option is added only if one of the foreachGlobals options "future+foreach", "foreach+future", or "future" is explicitly set. All of those act the same, however--the union of future and foreach previously implemented.

@hongooi73
Copy link
Contributor

@HenrikBengtsson do you think there's still a need for different "master" and "richcala/foreachFutureTake3" branches? Just trying to rationalise the structure.

@HenrikBengtsson
Copy link

Yes, I think for a bit longer. The long story...

My suggestions:

  1. For now, don't drop branches unless you know for sure you don't want them. They add to the clutter but otherwise just pointers in git. If you buy into the below, you can just rename/reorganize into that model. That will clarify things.

  2. Look into the https://nvie.com/posts/a-successful-git-branching-model/. The gist is that master always reflects you latest release (=what's on CRAN). You never commit to master - only merge into it. Continuous development and bug fixes goes in to the develop branch. Major updates like the recent 'local' and the 'globals-by-future' subprojects can be developed in 'feature/local' branches. When these are "closed" you merge them back into the develop branch (="to become the next release").

  3. There's a shell tool git flow (https://github.com/petervanderdoes/gitflow-avh) that automates all of the above branching model.

I've used above branching model and tools for several years now and I can't imagine doing something else. It really cleared things up for me when I found it.

@HenrikBengtsson
Copy link

Also, lately I've made develop my default branch on GitHub so that people can see what's new and when they fork and create PRs that's the branch they work on.

@hongooi73
Copy link
Contributor

hongooi73 commented Feb 19, 2020

Yeah, it's just a bit complicated here because master isn't really a master; it's another staging branch that contains some of the work done to interop with future. The rest of the work is in richcala/foreachFutureTake3.

I think the reason for this structure is because master corresponds to the version of foreach loaded on the Azure DSVM image, but that's actually pretty old now.

@hongooi73
Copy link
Contributor

I've now done something like what you suggested and made the new "staging" branch the default.

@HenrikBengtsson
Copy link

Another thing you could do is git tag 1.4.8 and git push --tags to tag commits corresponding to the different CRAN releases. (That way you can also install_github("org/repos@version"))

@hongooi73
Copy link
Contributor

I've renamed master to mro/1.5.2 to reflect what it actually is. I'm not too worried about tracking releases, since they're always available from CRAN itself.

@HenrikBengtsson
Copy link

You want to have a master branch, which I suggest should be the latest release (see above). Some tools expect a master branch, e.g. remotes:: install_github().

Not critical, but tagging CRAN releases help devs/users when troubleshooting among other things, e.g. makes it easy to git diff what changed etc. The git flow tool does this automatically.

Tags can be kept for ever. Most branches tend to come and go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants