Allow the argument type.convert of tstrsplit to accept a named list. #5094

Closed
Kamgang-B opened this issue Aug 4, 2021 · 3 comments · Fixed by #5099

@Kamgang-B
Contributor

Kamgang-B commented Aug 4, 2021

This is a feature request rather than a bug report.
When using tstrsplit, neither type.convert=FALSE nor type.convert=TRUE
always returns the desired output: we may want different columns of the result to have different types.
Here is an illustrative example:

library(data.table)
options(datatable.print.class=TRUE)

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))

                          x
                     <char>
1: 00531725   Male 2021 Neg
2: 07640613 Female 2020 Pos

Splitting the variable x:

cols <- c("personID", "gender", "year", "covidTest")

dt[, tstrsplit(x, split=" ", names=cols, type.convert=FALSE)]

   personID gender   year covidTest
     <char> <char> <char>    <char>
1: 00531725   Male   2021       Neg
2: 07640613 Female   2020       Pos

Setting type.convert to TRUE:

dt[, tstrsplit(x, split=" ", names=cols, type.convert=TRUE)]

   personID gender  year covidTest
      <int> <char> <int>    <char>
1:   531725   Male  2021       Neg
2:  7640613 Female  2020       Pos

All columns are of type character when type.convert=FALSE. When it is set to TRUE, the columns gender and covidTest stay character while personID and year are converted to integer. In practice, the desired type for gender and covidTest is quite likely factor, while personID should stay character (especially because removing the leading zeros leads to IDs that are likely not valid).

My suggestion is to allow the type.convert argument to accept a named list in which each name is a conversion function and each value (integer positions or column names) specifies the columns to apply that function to.
Using the same example, we would obtain something like:

dt[, tstrsplit(x, split=" ", names=cols, type.convert=list(as.character=1, as.factor=c(2, 4), as.integer=3))]

   personID gender   year covidTest
     <char> <fctr>  <int>    <fctr>
1: 00531725   Male   2021       Neg
2: 07640613 Female   2020       Pos
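
Until something like this exists, a similar result can be approximated with the current API. The following is only a rough sketch, not the proposed implementation: it splits with type.convert=FALSE and then applies each conversion function to its columns. The names spec, res, f and target are made up for illustration, and the sketch handles integer positions only.

```r
library(data.table)

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
cols <- c("personID", "gender", "year", "covidTest")

# Hypothetical spec in the proposed shape: names are conversion functions,
# values are the column positions each function applies to.
spec <- list(as.character = 1, as.factor = c(2, 4), as.integer = 3)

# Split without automatic conversion, so every column starts as character.
res <- dt[, tstrsplit(x, split = " ", names = cols, type.convert = FALSE)]

# Convert the selected columns by reference, one conversion function at a time.
for (fn in names(spec)) {
  f <- match.fun(fn)            # look the function up by its name
  target <- cols[spec[[fn]]]    # translate positions into column names
  res[, (target) := lapply(.SD, f), .SDcols = target]
}

sapply(res, class)
```

With this, personID keeps its leading zeros because it is never passed through a numeric conversion.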

The proposal is closely related to the built-in function strcapture (package utils), which splits a variable into several columns by extracting capture groups. However, it is usually quite slow, and having to type the same thing several times (like factor() below) makes it less attractive.
Further, the conversion function name must start with the prefix as. (like as.factor, etc.).

strcapture(pattern="(\\d+) ([a-zA-Z]+) (\\d+) ([a-zA-Z]+)",
           x=dt$x,
           proto=list(personID=character(), gender=factor(), year=integer(), covidTest=factor()))

  personID gender year covidTest
1 00531725   Male 2021       Neg
2 07640613 Female 2020       Pos
@MichaelChirico
Member

Sounds pretty reasonable to me!

Why don't you try filing a PR, if you don't mind? Happy to review and help you get unstuck if you're struggling.

@Kamgang-B
Contributor Author

OK, thank you! I will file a PR.

@Kamgang-B
Contributor Author

@MichaelChirico I filed a PR. It is my very first PR. Could you check it out and let me know if I should modify/add/remove something? I followed the rules specified in the data.table Contributing guide.

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 13, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023