I have data like this:
key_data <- data.frame(
  orig_letter = c("A","A","A","A","C","C","F","B","B","B"),
  new_letter  = c("Z","R","P","S","H","K","W","V","L","X")
)
orig_data <- data.frame(
  rownum = 1:6000000,
  colA   = rep(c("A","B","C","D","E","F"), 1000000),
  colB   = paste0(rep(c("A","B","C","D","E","F"), 1000000), ": ", "moretext"),
  valC   = sample(1:1000, 6000000, replace = TRUE)
)
I want to use key_data as a lookup table and replace each occurrence of orig_letter in BOTH colA and colB of orig_data, producing a new set of rows to be appended to orig_data. Note the lookup is one-to-many: "A", for instance, maps to four new letters, so every colA == "A" row should yield four appended rows (e.g. row 1, (1, "A", "A: moretext", ...), becomes (1, "Z", "Z: moretext", ...), (1, "R", "R: moretext", ...), and so on).
I can do this with a loop, as below, but it is pretty inefficient on big data like mine (my real data are even larger than this sample).
Is there a better/more efficient way to do this in data.table/dplyr/tidyverse, or some other approach that avoids the loop?
Thanks.
Example of the desired data, created by the slow loop:
library(data.table)
library(foreach)

replace_cols <- c("colA", "colB")
looplist <- list()
foreach(irow = 1:nrow(key_data)) %do% {
  # rows of orig_data whose colA matches this key row's orig_letter
  append_data <- orig_data[orig_data$colA == key_data$orig_letter[irow], ]
  # swap the letter in both columns (fixed = TRUE for a literal match)
  append_data[replace_cols] <- lapply(
    append_data[replace_cols], gsub,
    pattern     = key_data$orig_letter[irow],
    replacement = key_data$new_letter[irow],
    fixed       = TRUE
  )
  looplist[[length(looplist) + 1]] <- append_data
}
desired_data <- rbindlist(looplist)
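I suspect the right shape is a one-to-many join followed by a grouped substitution, something like the sketch below. It is untested at scale and just my guess at how to express this without a loop; allow.cartesian = TRUE is my assumption for letting each orig_letter match several key rows:

library(data.table)

dt  <- as.data.table(orig_data)
key <- as.data.table(key_data)

# One merge replaces the whole loop: each matching orig_data row is
# duplicated once per key row for its letter.
appended <- merge(dt, key, by.x = "colA", by.y = "orig_letter",
                  allow.cartesian = TRUE)

# Substitute once per (old letter, new letter) pair; gsub() is not
# vectorised over pattern/replacement, hence the per-group scalar.
# (sub() would presumably also do, since each field holds the letter once.)
appended[, colB := gsub(colA[1], new_letter[1], colB, fixed = TRUE),
         by = .(colA, new_letter)]
appended[, colA := new_letter]
appended[, new_letter := NULL]

If something like this is the idiomatic approach (or the dplyr equivalent with inner_join() and a grouped mutate()), confirmation would be great. I realise merge() sorts by the join column and moves it first, so setcolorder()/setorder() would presumably be needed to match the loop's output exactly.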