melt with measure.vars=list of length=1 should return integer in variable column #5209

Closed
tdhock opened this issue Oct 11, 2021 · 6 comments · Fixed by #5247

@tdhock
Member

tdhock commented Oct 11, 2021

melt() docs say that variable column should be an integer when measure.vars is a list,

variable.name: name (default ''variable'') of output column containing
          information about which input column(s) were melted. 
...
          If 'measure.vars' is a list of integer/character
          vectors, then each entry of this column contains an integer
          indicating an index/position in each of those vectors. 

In most cases that is true, for example

> library(data.table)
> (iris.row <- data.table(iris[1,]))
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
> melt(iris.row, measure.vars = list(Length=c("Petal.Length","Sepal.Length"), Width=c("Petal.Width","Sepal.Width")))
   Species variable Length Width
1:  setosa        1    1.4   0.2
2:  setosa        2    5.1   3.5
> melt(iris.row, measure.vars = list(Length="Petal.Length",Width="Petal.Width"))
   Sepal.Length Sepal.Width Species variable Length Width
1:          5.1         3.5  setosa        1    1.4   0.2

However, if measure.vars is a list of length 1, then the variable column contains the character column name instead:

> melt(iris.row, measure.vars = list(Length="Petal.Length"))
   Sepal.Length Sepal.Width Petal.Width Species     variable Length
1:          5.1         3.5         0.2  setosa Petal.Length    1.4

This is not a big deal (probably not many users melt only one column), but it is inconsistent with the documentation, so I will work on a fix.

@SamuelAllain

+1 I have been surprised by this behaviour too

@r2evans
Contributor

r2evans commented Dec 4, 2024

I've seen the docs for that but had not recognized the real meaning of it until looking at your example measure.vars = list(Length="Petal.Length"). Is there an example of where changing the variable's values to integer indices is meaningful? I've internally assumed that if I needed an index instead of a string, I'd set it with match or factor/levels.

@tdhock
Member Author

tdhock commented Dec 4, 2024

Is there an example of where changing the variable's values to integer indices is meaningful?

not sure how to interpret "meaningful" in this context, could you please clarify?

my goal was to increase consistency.

@r2evans
Contributor

r2evans commented Dec 4, 2024

Sure, I apologize for lack of clarity. What is the justification for having this seemingly inconsistent difference? ("Inconsistent" between 1 and 2+ args, not commenting on documentation-vs-execution, that's different.)

To me, having (starkly) different behavior between 1 arg and 2+ args is counter-intuitive and requires callers of data.table::melt to guard against it with extra code. Granted, that code is not cosmic.

To me it seems more intuitive and consistent to always behave the same, whether always-strings or always-integers or always-factors, with arguments (e.g., variable.factor=) that support differing paths.

I'm late to the discussion, in a sense, and I'm not arguing change for change's sake, so I'm "curious" in case you know the history of why this explicit behavior was chosen (or did it just fall into place this way).

@tdhock
Member Author

tdhock commented Dec 4, 2024

ok, so I understand that you are curious about the inconsistency in the output type between two usages.
The first usage below is when measure.vars is a character vector, in which case the output variable column holds the name of the melted column (as a factor by default):

> melt(iris.row, measure.vars = c("Petal.Length","Sepal.Length"))
   Sepal.Width Petal.Width Species     variable value
         <num>       <num>  <fctr>       <fctr> <num>
1:         3.5         0.2  setosa Petal.Length   1.4
2:         3.5         0.2  setosa Sepal.Length   5.1

The second usage below is when measure.vars is a list, in which case the output variable is an "integer indicating an index/position in each of those vectors." (quote from ?melt)

> melt(iris.row, measure.vars = list(Length=c("Petal.Length","Sepal.Length"),Width=c("Petal.Width","Sepal.Width")))
   Species variable Length Width
    <fctr>   <fctr>  <num> <num>
1:  setosa        1    1.4   0.2
2:  setosa        2    5.1   3.5

you would have to ask @arunsrinivasan about the history of why he made this choice, but I think it makes sense. (an integer index is necessary to identify which column was melted)
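As a rough illustration of how that index can be used, the sketch below maps each position in the variable column back to the original column names. The helper vectors length.cols and width.cols, and the new columns Length.col and Width.col, are hypothetical names introduced only for this example; this is one possible approach, not something melt() does for you.

```r
library(data.table)

iris.row <- data.table(iris[1, ])

# Hypothetical helper vectors, named here only for illustration.
length.cols <- c("Petal.Length", "Sepal.Length")
width.cols <- c("Petal.Width", "Sepal.Width")

long <- melt(iris.row, measure.vars = list(Length = length.cols, Width = width.cols))

# variable holds a position (1, 2, ...) into each measure.vars entry.
# Going through as.character() makes the lookup work whether the
# column is stored as a factor or as an integer.
idx <- as.integer(as.character(long$variable))

# Recover the name of the input column that each melted value came from.
long[, Length.col := length.cols[idx]]
long[, Width.col := width.cols[idx]]
long
```

Because a single row of the result combines values from several input columns, one name per row is not enough, which is why the lookup is done separately for each output column.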

Rather than using measure.vars=list(...) (above), more recently I implemented support for measure.vars=measure(...) (see below) which is almost always preferable, because we get more informative "variable" columns (part below).

> melt(iris.row, measure.vars = measure(part, value.name, sep="."))
   Species   part Length Width
    <fctr> <char>  <num> <num>
1:  setosa  Sepal    5.1   3.5
2:  setosa  Petal    1.4   0.2

@r2evans
Contributor

r2evans commented Dec 4, 2024

I understand that that's what the docs say, and I see how measure(.) adds a lot of value here. My question is why it's useful to have different behaviors? From a "simple/consistent" approach mindset, I don't think it's wrong to hope that

melt(iris.row, measure.vars = list(Length=c("Petal.Length","Sepal.Length"))) # ignoring the warning
melt(iris.row, measure.vars = list(Length=c("Petal.Length","Sepal.Length"), Width=c("Petal.Width","Sepal.Width")))

would both produce a column named variable that contains a factor with levels derived from the column names. In my use-cases, I can't think of a time when I don't care which column name is paired with each row; I recognize my use-cases are just mine.

I think the point of my question is the history of it (as you suggested, from arunsrinivasan) to know if it "just happened this way" or if there is a specific mindset/use-case where having diverging behavior makes sense. (Backwards compatibility is always valuable, of course.)

Thanks tdhock!
