How can I append duplicated groups of a dataset with changes to existing data in R efficiently?

Question

I have data like this:

key_data <- data.frame(orig_letter=c("A","A","A","A","C","C","F","B","B","B"), new_letter=c("Z","R","P","S","H","K","W","V","L","X"))
                        
orig_data <- data.frame(rownum=c(1:6000000),colA=rep(c("A","B","C","D","E","F"),1000000), colB=paste0(rep(c("A","B","C","D","E","F"),1000000),": ","moretext"), valC=sample(c(1:1000),6000000,replace = T))

I want to use key_data as a lookup table: for every row of orig_data whose colA matches an orig_letter, replace that letter with the corresponding new_letter in BOTH colA and colB. The resulting rows form a new dataset to be appended to orig_data.

I can do this with a loop, as shown below, but it is quite inefficient on data as large as mine (my real data are larger than this sample).

Is there a better/more efficient way to do this in data.table/dplyr/tidyverse or another way other than an inefficient loop?

Thanks.

Example desired data created by slow loop:

library(data.table)
library(foreach)

replace_cols <- c("colA", "colB")
looplist <- list()
foreach(irow=1:nrow(key_data))%do%{
  append_data <- orig_data[orig_data$colA==key_data$orig_letter[irow],] 
  append_data[replace_cols] <- lapply(append_data[replace_cols], gsub, pattern = key_data$orig_letter[irow], replacement = key_data$new_letter[irow], fixed=T)
  looplist[[length(looplist)+1]]<-append_data
}

desired_data <- rbindlist(looplist)

Your orig_data <- ... code references itself to define how many rows to use, so it doesn't run as is (without a pre-existing data.frame).
– Jon Spring, Commented Jul 26, 2024 at 0:20
Do you need to use gsub here? It seems that sub would also work for this situation, and should be quicker.
– Edward, Commented Jul 26, 2024 at 8:48
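
To illustrate the sub versus gsub point from the comment above, here is a minimal sketch. The string x is made up for illustration and is not from the question's data; with the question's "A: moretext" values both functions would give the same result, because the key occurs only once.

```r
# Hypothetical value where the key letter occurs more than once
x <- "A: data for group A"

# sub() replaces only the first match, i.e. the leading key
first_only <- sub("A", "Z", x, fixed = TRUE)

# gsub() replaces every match, including later occurrences in the text
all_matches <- gsub("A", "Z", x, fixed = TRUE)

first_only   # "Z: data for group A"
all_matches  # "Z: data for group Z"
```

If the free text in colB can contain the key letter again, sub is likely the safer choice as well as the quicker one.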

thothal · Accepted Answer · 2024-07-26 13:35:07Z

I am not sure I fully understand your desired result, so here are the assumptions on which my solution is based:

colA and colB start with the same key (as in your example). If this is not the case, be aware that there will be many more rows: a row starting with C would then create 4 rows for each occurrence ([H | H: moretext], [H | K: moretext], [K | H: moretext], [K | K: moretext]), A would create 16 rows, and so on.
For each key there are 0 to n possible replacement keys (in your case D and E have no replacements, A has 4, B has 3, C has 2 and F has 1).
If no replacement is defined, the row is left unaltered. That is, in your example we would expect [4 (A) + 3 (B) + 2 (C) + 1 (D) + 1 (E) + 1 (F)] * 1000000 = 12000000 rows in the final result. Yours shows only 10000000 rows, because rows starting with D or E are silently dropped; if this is indeed desired, simply add nomatch = NULL.
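
The expected row counts can be checked from key_data alone. This is only a sketch of the arithmetic, assuming each of the six letters occurs 1,000,000 times in orig_data as in the example; the helper names (letters_in_data, n_per_letter, n_repl) are illustrative.

```r
# Lookup table from the question
key_data <- data.frame(
  orig_letter = c("A", "A", "A", "A", "C", "C", "F", "B", "B", "B"),
  new_letter  = c("Z", "R", "P", "S", "H", "K", "W", "V", "L", "X")
)
letters_in_data <- c("A", "B", "C", "D", "E", "F")
n_per_letter <- 1000000  # assumed number of rows per letter in orig_data

# Replacements per letter (0 for letters without a lookup entry: D and E)
n_repl <- as.integer(table(factor(key_data$orig_letter,
                                  levels = letters_in_data)))

# Keeping unmatched rows: every letter yields at least one output row
rows_keep_unmatched <- sum(pmax(n_repl, 1L)) * n_per_letter

# Dropping unmatched rows (nomatch = NULL): only replaced rows remain
rows_drop_unmatched <- sum(n_repl) * n_per_letter

stopifnot(rows_keep_unmatched == 12000000,
          rows_drop_unmatched == 10000000)
```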

Under these assumptions you can use the following data.table solution:

library(data.table)
set.seed(20240726)
key_data <- data.frame(orig_letter = c("A", "A", "A", "A", "C", "C", "F", 
                                       "B", "B", "B"), 
                       new_letter = c("Z", "R", "P", "S", "H", "K", "W", 
                                      "V", "L", "X"))

orig_data <- data.frame(rownum = c(1:6000000), 
                        colA = rep(c("A", "B", "C", "D", "E", "F"), 1000000), 
                        colB = paste0(rep(c("A", "B", "C", "D", "E", "F"), 
                                          1000000), ": ", "moretext"), 
                        valC = sample(c(1:1000), 6000000, replace = TRUE))

setDT(orig_data)
setDT(key_data, key = "orig_letter")

start <- Sys.time()
key_data[
  orig_data,
  on = c(orig_letter = "colA"), allow.cartesian = TRUE
  #, nomatch = NULL ## add this if unmatched lookups should be discarded
  ][,
    .(rownum, 
      colA = fcoalesce(new_letter, orig_letter),
      colB = `substr<-`(colB, 1L, 1L, fcoalesce(new_letter, orig_letter)), 
      valC)]

#           rownum colA        colB valC
#        1:       1    Z Z: moretext  282
#        2:       1    R R: moretext  282
#        3:       1    P P: moretext  282
#        4:       1    S S: moretext  282
#        5:       2    V V: moretext  401
#       ---                              
# 11999996: 5999997    H H: moretext  529
# 11999997: 5999997    K K: moretext  529
# 11999998: 5999998    D D: moretext   34
# 11999999: 5999999    E E: moretext   83
# 12000000: 6000000    W W: moretext  644

cat("\nFinalized in: ", as.character(hms::as_hms(round(Sys.time() - start, 0))))
# Finalized in:  00:00:04
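
To see what the join does on a scale that can be checked by eye, here is a small sketch using made-up stand-ins for the data (key_small and orig_small are hypothetical, not from the question). It uses nomatch = NULL, so only replaced rows are produced, and then appends them to the original rows with rbindlist, which is one possible reading of the "append" step in the question.

```r
library(data.table)

# Tiny stand-ins for the question's data (6 rows instead of 6,000,000)
key_small <- data.table(
  orig_letter = c("A", "A", "C", "F"),
  new_letter  = c("Z", "R", "H", "W")
)
orig_small <- data.table(
  rownum = 1:6,
  colA   = c("A", "B", "C", "D", "E", "F"),
  colB   = paste0(c("A", "B", "C", "D", "E", "F"), ": moretext"),
  valC   = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# Same join as above; nomatch = NULL drops rows without a replacement (B, D, E)
new_rows <- key_small[
  orig_small,
  on = c(orig_letter = "colA"), allow.cartesian = TRUE, nomatch = NULL
][,
  .(rownum,
    colA = new_letter,
    colB = `substr<-`(colB, 1L, 1L, new_letter),
    valC)]

# Append the modified copies to the untouched original rows
combined <- rbindlist(list(orig_small, new_rows))

new_rows$colA   # "Z" "R" "H" "W"
nrow(combined)  # 6 original + 4 new = 10
```

Because the unmatched rows are dropped here, fcoalesce is not needed: new_letter is never NA in new_rows.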
