
Fastest way to replace NAs in a large data.table

I have a large data.table, with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.

I see two options:
1: Convert to a data.frame, and use something like the standard replacement sketched below
2: Some kind of cool data.table subsetting command

I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.
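For concreteness, here is roughly what I have in mind for option 1 (dt just stands in for my real table):

df <- as.data.frame(dt)    # drop to a plain data.frame
df[is.na(df)] <- 0         # matrix-style replacement of every NA
dt <- as.data.table(df)    # convert back to a data.table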

2 Answers

vishaljlf39
Here's the simplest one I could come up with:

dt[is.na(dt)] <- 0

It's efficient, and there's no need to write functions or other glue code.
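A quick toy illustration (the small table and column names are made up just to show the behavior):

library(data.table)
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, NA))
dt[is.na(dt)] <- 0   # is.na(dt) gives a logical matrix over the whole table
dt                   # a is now 1, 0, 3 and b is 0, 5, 0

Keep in mind that is.na(dt) allocates a full logical matrix the size of the table, so at ~200k rows by 200 columns this uses noticeably more memory than a set()-based loop, even though it is the shortest to write.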
sandhya6gczb
Dedicated functions (nafill and setnafill) for this purpose are available in the data.table package (version >= 1.12.4).

They process columns in parallel, which nicely addresses the previously posted benchmarks. Below are their timings against the fastest approach so far, plus a scaled-up case, on a 40-core machine.

library(data.table)

# helper: build an nrow x ncol data.table with a given proportion of NAs
create_dt <- function(nrow = 5, ncol = 5, propNA = 0.5) {
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow * ncol), propNA * nrow * ncol)] <- NA
  data.table(matrix(v, ncol = ncol))
}

# reference approach: replace NAs column by column with set(), by reference
f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
#[1] 200000 200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
# user system elapsed
# 0.193 0.062 0.254
system.time(setnafill(dt2, fill=0))
# user system elapsed
# 0.633 0.000 0.020 ## setDTthreads(1) elapsed: 0.149
all.equal(dt1, dt2)
#[1] TRUE

set.seed(1)
dt1 = create_dt(2e7, 200, 0.1)
dim(dt1)
#[1] 20000000 200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
# user system elapsed
# 22.997 18.179 41.496
system.time(setnafill(dt2, fill=0))
# user system elapsed
# 39.604 36.805 3.798
all.equal(dt1, dt2)
#[1] TRUE
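For completeness, a minimal usage sketch of the two functions (the toy table and column names are invented; note they are designed for numeric-type columns):

library(data.table)
dt <- data.table(x = c(1, NA, 3), y = c(NA, 2, NA))
nafill(dt$x, fill = 0)               # returns a filled copy of the vector (dt is untouched)
setnafill(dt, fill = 0, cols = "y")  # fills only column y, in place by reference
setnafill(dt, fill = 0)              # fills every column in place; default type is "const"

The key difference is that nafill returns a new object while setnafill updates the data.table by reference, which is why setnafill is the one benchmarked above.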
