
I have data like this:

key_data <- data.frame(orig_letter=c("A","A","A","A","C","C","F","B","B","B"), new_letter=c("Z","R","P","S","H","K","W","V","L","X"))
                        
orig_data <- data.frame(rownum=c(1:6000000),colA=rep(c("A","B","C","D","E","F"),1000000), colB=paste0(rep(c("A","B","C","D","E","F"),1000000),": ","moretext"), valC=sample(c(1:1000),6000000,replace = T))

I want to use key_data as a reference lookup and replace locations of the orig_letter in BOTH colA and colB of orig_data creating a dataset to be appended to orig_data.

I can do this in a loop as below but it is pretty inefficient across big data like I have (in reality my data are more than the sample).

Is there a better/more efficient way to do this in data.table/dplyr/tidyverse or another way other than an inefficient loop?

Thanks.

Example desired data created by slow loop:

library(data.table)
library(foreach)

replace_cols <- c("colA", "colB")
looplist <- list()
foreach(irow=1:nrow(key_data))%do%{
  append_data <- orig_data[orig_data$colA==key_data$orig_letter[irow],] 
  append_data[replace_cols] <- lapply(append_data[replace_cols], gsub, pattern = key_data$orig_letter[irow], replacement = key_data$new_letter[irow], fixed=T)
  looplist[[length(looplist)+1]]<-append_data
}

desired_data <- rbindlist(looplist)
  • Your orig_data <- ... code references itself to define how many rows to use, so it doesn't run as is (without pre-existing data.frame). Commented Jul 26, 2024 at 0:20
  • Oops my b @JonSpring fixed now Commented Jul 26, 2024 at 0:29
  • Do you need to use gsub here? It seems that sub would also work for this situation, and should be quicker. Commented Jul 26, 2024 at 8:48

1 Answer


I am not sure whether I fully understand your desired result, but here are my assumptions based upon which I propose a solution:

  1. colA and colB start with the same key (as in your example). If this is not the case, be aware that there will be many more rows: each row starting with C would then produce 4 rows ([H | H: moretext], [H | K: moretext], [K | H: moretext], [K | K: moretext]), each row starting with A would produce 16 rows, and so on.
  2. For each key there are 0 to n possible replacement keys (in your case D and E have no replacements, A has 4, B has 3, C has 2 and F has 1).
  3. If no replacement is defined, the row is left unaltered. In your example we would therefore expect [4 (A) + 3 (B) + 2 (C) + 1 (D) + 1 (E) + 1 (F)] * 1000000 = 12000000 rows in the final result. Yours shows only 10000000 rows because rows starting with D or E are silently dropped; if this is indeed desired, simply add nomatch = NULL.
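
As a quick sanity check of the row counts in assumption 3, the arithmetic can be reproduced from key_data alone. This is only a sketch: data_letters and cycles are names I made up for illustration, and they hard-code the six-letter cycle and the 1,000,000 repetitions from your example.

```r
# key_data as given in the question
key_data <- data.frame(
  orig_letter = c("A", "A", "A", "A", "C", "C", "F", "B", "B", "B"),
  new_letter  = c("Z", "R", "P", "S", "H", "K", "W", "V", "L", "X")
)

# Hypothetical helpers: one cycle of colA and how often it repeats
data_letters <- c("A", "B", "C", "D", "E", "F")
cycles <- 1000000

# Number of replacements defined per letter (0 for D and E)
n_repl <- as.vector(table(factor(key_data$orig_letter, levels = data_letters)))

# Unmatched letters contribute one unaltered row each ...
rows_keep_unmatched <- sum(pmax(n_repl, 1L)) * cycles
# ... or no row at all if they are dropped (as in the loop)
rows_drop_unmatched <- sum(n_repl) * cycles

format(c(rows_keep_unmatched, rows_drop_unmatched), big.mark = ",")
```

The first number corresponds to keeping the D and E rows, the second to dropping them.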

Under these assumptions you can use the following data.table solution:

library(data.table)
set.seed(20240726)
key_data <- data.frame(orig_letter = c("A", "A", "A", "A", "C", "C", "F", 
                                       "B", "B", "B"), 
                       new_letter = c("Z", "R", "P", "S", "H", "K", "W", 
                                      "V", "L", "X"))

orig_data <- data.frame(rownum = c(1:6000000), 
                        colA = rep(c("A", "B", "C", "D", "E", "F"), 1000000), 
                        colB = paste0(rep(c("A", "B", "C", "D", "E", "F"), 
                                          1000000), ": ", "moretext"), 
                        valC = sample(c(1:1000), 6000000, replace = TRUE))

setDT(orig_data)
setDT(key_data, key = "orig_letter")

start <- Sys.time()
key_data[
  orig_data,
  on = c(orig_letter = "colA"), allow.cartesian = TRUE
  #, nomatch = NULL ## add this if unmatched lookups should be discarded
  ][,
    .(rownum, 
      colA = fcoalesce(new_letter, orig_letter),
      colB = `substr<-`(colB, 1L, 1L, fcoalesce(new_letter, orig_letter)), 
      valC)]

#           rownum colA        colB valC
#        1:       1    Z Z: moretext  282
#        2:       1    R R: moretext  282
#        3:       1    P P: moretext  282
#        4:       1    S S: moretext  282
#        5:       2    V V: moretext  401
#       ---                              
# 11999996: 5999997    H H: moretext  529
# 11999997: 5999997    K K: moretext  529
# 11999998: 5999998    D D: moretext   34
# 11999999: 5999999    E E: moretext   83
# 12000000: 6000000    W W: moretext  644

cat("\nFinalized in: ", as.character(hms::as_hms(round(Sys.time() - start, 0))))
# Finalized in:  00:00:04
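
If the goal is what the loop produces (only the replaced rows, which are then appended to orig_data), a variant with nomatch = NULL might look like the sketch below. It runs on a tiny made-up sample rather than the 6,000,000 rows, and append_data and combined are just illustrative names; I am assuming that "appended" means stacking the new rows under the original ones.

```r
library(data.table)

# Tiny illustrative sample (not the data from the question)
key_data <- data.table(orig_letter = c("A", "A", "C"),
                       new_letter  = c("Z", "R", "H"))
orig_data <- data.table(rownum = 1:4,
                        colA   = c("A", "B", "C", "D"),
                        colB   = paste0(c("A", "B", "C", "D"), ": moretext"),
                        valC   = c(10L, 20L, 30L, 40L))

# Inner join: rows of orig_data without a replacement are discarded
append_data <- key_data[
  orig_data,
  on = c(orig_letter = "colA"), nomatch = NULL, allow.cartesian = TRUE
  ][,
    .(rownum,
      colA = new_letter,
      # overwrite the first character of colB with the new letter
      colB = `substr<-`(colB, 1L, 1L, new_letter),
      valC)]

# Stack the replaced rows under the original rows
combined <- rbindlist(list(orig_data, append_data), use.names = TRUE)
combined
```

Because unmatched rows are dropped by the join, fcoalesce is no longer needed here: new_letter is never NA in the result.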