
All metalearner coefficients are zero, predictions will all be equal to 0 #155

Open
DavidMarguerit opened this issue Jan 14, 2025 · 3 comments

@DavidMarguerit

I am using SuperLearner to predict an outcome with a Random Forest algorithm. However, the Random Forest predicts only 0, and I don't understand how to fix this.

Here is a reproducible example:

#Training dataset
y <- c(-4.605170, 9.181019, -4.605170, -4.605170, 5.998099, -4.605170, -4.605170, -4.605170, -4.605170, -4.605170, -4.605170, -4.605170, -4.605170, -4.605170, 8.788880, -4.605170, 7.259213, -4.605170, -4.605170, -4.605170, -4.605170, 8.851838, 8.182144, -4.605170, -4.605170, -4.605170, 8.824345, -4.605170, -4.605170, 8.824345, -4.605170, -4.605170, -4.605170, 9.195547, 8.214720, 8.374350, 6.971533)

weightML <- c(14.95239, 18.55120, 18.55120, 19.70231, 14.95239, 14.95239, 18.55120, 14.95239, 18.55120, 18.55120, 18.55120, 14.95239, 18.55120, 15.73830, 18.55120, 18.55120, 19.70231, 15.73830, 14.95239, 15.73830, 14.95239, 14.95239, 15.73830, 18.55120, 18.55120, 14.95239, 14.95239, 14.95239, 14.95239, 15.73830, 14.95239, 14.95239, 14.95239, 14.95239, 18.55120, 19.70231, 14.95239)

train_x <- data.frame(matrix(, nrow = length(y), ncol = 0))
train_x$x1 <- sample(100, size = nrow(df), replace = TRUE)
train_x$x2 <- sample(100, size = nrow(df), replace = TRUE)
train_x$x3 <- sample(100, size = nrow(df), replace = TRUE)
train_x$x4 <- sample(100, size = nrow(df), replace = TRUE)

#Test dataset
test_x <- data.frame(matrix(, nrow = length(y), ncol = 0))
test_x$x1 <- sample(100, size = nrow(df), replace = TRUE)
test_x$x2 <- sample(100, size = nrow(df), replace = TRUE)
test_x$x3 <- sample(100, size = nrow(df), replace = TRUE)
test_x$x4 <- sample(100, size = nrow(df), replace = TRUE)

# RF
rf <- SuperLearner(Y = y, X = train_x, family = gaussian(), SL.library = "SL.ranger", obsWeights = weightML)
predict(rf, test_x, onlySL = TRUE)$pred

This code returns the following output:

> rf <- SuperLearner(Y = y, X = train_x, family = gaussian(), SL.library = "SL.ranger", obsWeights = weightML)
Warning messages:
1: All algorithms have zero weight 
2: All metalearner coefficients are zero, predictions will all be equal to 0 
> predict(rf, test_x, onlySL = TRUE)$pred
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Warning message:
All metalearner coefficients are zero, predictions will all be equal to 0 

Any idea why it predicts only 0 and how I can fix the issue?

I noticed that if I replace y with y + 2, the prediction works. For instance:

#Same y, weightML, train_x, and test_x as above, with the outcome shifted by 2
y <- y + 2

# RF
rf <- SuperLearner(Y = y, X = train_x, family = gaussian(), SL.library = "SL.ranger", obsWeights = weightML)
predict(rf, test_x, onlySL = TRUE)$pred
@ecpolley (Owner)

To make this a reproducible example, you should set the random seed (and you need to define df, which the sample() calls reference but which is never created):

library(SuperLearner)
set.seed(42)
df <- data.frame(y, weightML)

But this result isn't unexpected. There is no information in the X variables, so predicting Y = 0 for everyone is better than using the ranger predictions, which are not informative. Since the (weighted) mean of Y is close to 0, even adding SL.mean to the candidate library is unlikely to help much; that is also why shifting the mean value of Y gives some weight to the ranger predictions. (If you add SL.mean to the candidate library here, it will get weight 1, again because the X variables are not informative.)
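To see this concretely, here is a minimal sketch (assuming the corrected setup above, i.e. df defined and a seed set) that adds SL.mean to the candidate library and inspects the fitted object's CV risks and metalearner weights:

# Sketch: compare SL.ranger against the SL.mean benchmark
# (y, train_x, and weightML as in the corrected example above)
sl <- SuperLearner(Y = y, X = train_x, family = gaussian(),
                   SL.library = c("SL.mean", "SL.ranger"),
                   obsWeights = weightML)
sl$cvRisk  # cross-validated risk for each candidate algorithm
sl$coef    # metalearner weights; SL.mean should get essentially all the weight here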

@DavidMarguerit (Author)

Thank you for your answer; it helps me better understand what is happening.

You are correct that in my example x1, x2, x3, and x4 are uninformative, since they are random. However, in my real data they are informative: Y measures log hourly wages, and x1, x2, x3, and x4 are confounders for age, working experience, household composition, and education, respectively. I am sure these confounders matter for wages, yet I get the same warning message and output as in my example.

@ecpolley (Owner)

What are you using as the library of candidate algorithms? You may want to try expanding the candidates, and you could look at the CV risk estimates relative to SL.mean to confirm your assumption that the variables/algorithms are informative.
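For example, one hedged sketch (the extra candidate SL.glm and the choice of V = 5 are illustrative, not recommendations): CV.SuperLearner adds an external cross-validation layer, and summary() reports each candidate's CV risk alongside SL.mean:

# Sketch: external CV over a broader candidate library, with SL.mean as benchmark
cv_sl <- CV.SuperLearner(Y = y, X = train_x, family = gaussian(),
                         SL.library = c("SL.mean", "SL.glm", "SL.ranger"),
                         obsWeights = weightML, V = 5)
summary(cv_sl)  # candidates should beat SL.mean if the X variables are informative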
