R – How to Replace String Values with dplyr

dplyr, r, stringr

I have a dataframe in R:

df <- data.frame(CRP = c("10", "2", "3", NA, "<4", ">5"))

I want to create a numeric column that converts these strings to numeric values. Additionally, whenever a string starts with "<" or ">", the code should read the number that follows and impute the value with the median of the observations below or above that number. For example, "<4" should be replaced with median(c(2,3)). If there are no values below or above the threshold, the value should be NA. Existing NAs should remain as they are.

Desired output:

df = data.frame(c(10, 2, 3, NA, median(c(2, 3)), median(10)))

I have not managed to find a solution that avoids pre-specifying functions. I also want to avoid for loops. Ideally, I would like to apply this to a data frame with millions of rows in which multiple columns have this issue.

Best Answer

library(dplyr)
library(stringr)
library(purrr)

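# Observed numeric values; strings such as "<4" coerce to NA and are dropped
nums <- na.omit(suppressWarnings(as.numeric(df$CRP)))

df |>
  mutate(imputed = map2_dbl(CRP, str_detect(CRP, "<|>"), \(x, gt_lt) {
    # gt_lt is NA for a missing CRP, so %in% TRUE sends those rows to the else branch
    if (gt_lt %in% TRUE) {
      # e.g. "<4" becomes median(nums[nums<4])
      eval(str2expression(str_glue("median(nums[nums{x}])")))
    } else {
      as.numeric(x)
    }
  }))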
#    CRP imputed
# 1   10    10.0
# 2    2     2.0
# 3    3     3.0
# 4 <NA>      NA
# 5   <4     2.5
# 6   >5    10.0

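This uses each CRP string that contains "<" or ">" to construct and evaluate an expression like median(nums[nums<4]), where nums is c(10, 2, 3). When no value satisfies the comparison, median() of the empty vector returns NA, which matches the requirement. Depending on your data structure, it might be cleaner to assign this anonymous function to a name and define nums inside it.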
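For larger data, or when several columns need the same treatment, one possible variation is to parse the threshold directly instead of evaluating a constructed expression. The sketch below is an illustration rather than part of the original answer: the helper name impute_censored is hypothetical, and it assumes every censored entry is exactly "<" or ">" followed by a number. It uses base R only and computes each median once per distinct censored string.

```r
# Hypothetical helper (not from the original answer): converts a character
# vector such as c("10", "<4", NA) to numeric, imputing "<k" / ">k" entries
# with the median of the observed values strictly below / above k.
impute_censored <- function(x) {
  x <- trimws(as.character(x))
  censored <- !is.na(x) & grepl("^[<>]", x)

  # Plain entries are converted directly; NA stays NA.
  out <- rep(NA_real_, length(x))
  out[!censored] <- suppressWarnings(as.numeric(x[!censored]))
  nums <- out[!is.na(out)]  # pool of observed values used for the medians

  # One median per distinct censored string, not one per row.
  keys <- unique(x[censored])
  meds <- vapply(keys, function(key) {
    k <- as.numeric(substring(key, 2))
    pool <- if (startsWith(key, "<")) nums[nums < k] else nums[nums > k]
    median(pool)  # NA when no observation qualifies
  }, numeric(1))

  # vapply names its result by key, so this is a lookup by string.
  out[censored] <- meds[x[censored]]
  out
}

df <- data.frame(CRP = c("10", "2", "3", NA, "<4", ">5"))

# Apply to every affected column; here only CRP (assumed column list).
cols <- c("CRP")
df[cols] <- lapply(df[cols], impute_censored)
df$CRP
# [1] 10.0  2.0  3.0   NA  2.5 10.0
```

Because the helper takes and returns a plain vector, it should also work inside dplyr, for example df |> mutate(across(c(CRP), impute_censored)), if you prefer to stay in a pipeline.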