Signup/Sign In
Ask Question
Not satisfied by the Answer? Still looking for a better solution?

Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

I have the following 2 data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])


I need to discover the row a1 has that a2 doesn't.

Is there a built function for this kind of operation?

(p.s: I composed an solution for it, I am simply curious if someone already made a more crafted code)

Here is my answer:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
by

3 Answers

espadacoder11
This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
sandhya6gczb
sqldf provides a nice solution

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT
FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons

require(dplyr)
anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2
pankajshivnani123
t is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

missing values in included_a1 will note which rows are missing in a1. similarly for a2.

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.

Login / Signup to Answer the Question.