I'm using dcast to transpose the following table:

date               event          user_id
25-07-2020         Create          3455
25-07-2020         Visit           3567
25-07-2020         Visit           3567
25-07-2020         Add             3567
25-07-2020         Add             3678
25-07-2020         Add             3678
25-07-2020         Create          3567
24-07-2020         Edit            3871

I want the events as columns, with each cell holding a count of user_id:

dae_summ <- dcast(ahoy_events, date ~ event, value.var="user_id")

But I'm not getting unique user IDs: the same user_id is counted several times. What can I do so that a user ID is counted only once for a given date and event?

6
Abhi, Jul 25, 2020 at 22:05

4 answers

Best answer

We could use uniqueN from data.table as the aggregation function:

library(data.table)
dcast(setDT(ahoy_events), date ~ event, value.var = "user_id", fun.aggregate = uniqueN)
#         date Add Create Edit Visit
#1: 24-07-2020   0      0    1     0
#2: 25-07-2020   2      2    0     1
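
For readers who cannot load data.table, the same per-cell count can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the sample data from this thread and uses tapply with length(unique(x)), which yields the same number as uniqueN for a plain vector.

```r
# Sample data copied from this thread
ahoy_events <- data.frame(
  date = c(rep("25-07-2020", 7), "24-07-2020"),
  event = c("Create", "Visit", "Visit", "Add", "Add", "Add", "Create", "Edit"),
  user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 3678L, 3567L, 3871L)
)

# One cell per (date, event) pair, holding the number of distinct user_ids
counts <- with(
  ahoy_events,
  tapply(user_id, list(date, event), function(x) length(unique(x)))
)

# Combinations that never occur come back as NA; report them as 0
counts[is.na(counts)] <- 0L
counts
```

The result is a matrix with dates as row names and events as column names, rather than a data frame.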

Or use pivot_wider from tidyr with values_fn set to n_distinct:

library(tidyr)
library(dplyr)
ahoy_events %>%
   pivot_wider(names_from = event, values_from = user_id, 
      values_fn = list(user_id = n_distinct), values_fill = list(user_id = 0))
# A tibble: 2 x 5
#   date       Create Visit   Add  Edit
#  <chr>       <int> <int> <int> <int>
#1 25-07-2020      2     1     2     0
#2 24-07-2020      0     0     0     1

Data

ahoy_events <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
"25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
"Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
-8L))
3
akrun, Jul 25, 2020 at 19:35

With the reshape2 package, you can use the following:

library(reshape2)

Data:

zz <- "date               event          user_id
       25-07-2020         Create          3455
       25-07-2020         Visit           3567
       25-07-2020         Visit           3567
       25-07-2020         Add             3567
       25-07-2020         Add             3678
       25-07-2020         Add             3678
       25-07-2020         Create          3567
       24-07-2020         Edit            3871"
data <- read.table(text=zz, header = TRUE)

Code:

# Cast by date (not user_id) and count distinct user_ids per date/event cell
dcast(data, date ~ event, value.var = "user_id",
      fun.aggregate = function(x) length(unique(x)))

Output:

date         Add     Create      Edit      Visit
<fctr>       <int>   <int>       <int>     <int>
24-07-2020   0       0           1         0
25-07-2020   2       2           0         1

Created on 2020-07-25 by the reprex package (v0.3.0)

1
Eric Fletcher, Jul 25, 2020 at 19:41

A base R option using reshape:

out <- replace(
  u <- reshape(
    unique(transform(ahoy_events, user_id = ave(user_id, event, date, FUN = function(x) length(unique(x))))),
    direction = "wide",
    idvar = "date",
    timevar = "event"
  ),
  is.na(u),
  0
)

which gives

> out
        date user_id.Create user_id.Visit user_id.Add user_id.Edit
1 25-07-2020              2             1           2            0
8 24-07-2020              0             0           0            1
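
The nested call above can be hard to follow, so here is the same idea unpacked into separate steps. This is a sketch for illustration only; the intermediate names counted and wide are hypothetical and do not appear in the original answer.

```r
# Sample data copied from this thread
ahoy_events <- data.frame(
  date = c(rep("25-07-2020", 7), "24-07-2020"),
  event = c("Create", "Visit", "Visit", "Add", "Add", "Add", "Create", "Edit"),
  user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 3678L, 3567L, 3871L)
)

# Step 1: overwrite user_id with the number of distinct users
# in each (event, date) group
counted <- transform(
  ahoy_events,
  user_id = ave(user_id, event, date, FUN = function(x) length(unique(x)))
)

# Step 2: rows within a group are now identical, so keep one per group
counted <- unique(counted)

# Step 3: spread the events into columns, one row per date
wide <- reshape(counted, direction = "wide", idvar = "date", timevar = "event")

# Step 4: date/event combinations with no rows are NA; report them as 0
wide[is.na(wide)] <- 0
wide
```

The row names 1 and 8 in the printed result are carried over from the first original row of each date.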

Data

ahoy_events <- structure(list(date = c(
  "25-07-2020", "25-07-2020", "25-07-2020",
  "25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c(
  "Create", "Visit", "Visit", "Add", "Add", "Add",
  "Create", "Edit"
), user_id = c(
  3455L, 3567L, 3567L, 3567L, 3678L,
  3678L, 3567L, 3871L
)), class = "data.frame", row.names = c(
  NA,
  -8L
))
2
ThomasIsCoding, Jul 25, 2020 at 21:51

You can try:

library(reshape2)

#Data
df <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
"25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
"Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
-8L))

#New code
dae_summ <- dcast(df, date ~ event,  value.var="user_id",fun.aggregate = function(x) length(unique(x)))

        date Add Create Edit Visit
1 24-07-2020   0      0    1     0
2 25-07-2020   2      2    0     1

Your code produces this:

        date Add Create Edit Visit
1 24-07-2020   0      0    1     0
2 25-07-2020   3      2    0     2

So there is a difference.
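
The difference arises because, when several rows fall into one cell and no fun.aggregate is given, dcast falls back to length, which counts rows rather than distinct users. The base R sketch below reproduces both sets of numbers without reshape2, as an illustration only; it assumes the same sample data, and the names row_counts and distinct_counts are hypothetical.

```r
# Sample data copied from this thread
df <- data.frame(
  date = c(rep("25-07-2020", 7), "24-07-2020"),
  event = c("Create", "Visit", "Visit", "Add", "Add", "Add", "Create", "Edit"),
  user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 3678L, 3567L, 3871L)
)

# Row counts per date/event: what a length-based aggregate reports
row_counts <- with(df, table(date, event))

# Distinct-user counts: drop repeated (date, event, user_id) rows first
distinct_counts <- with(unique(df), table(date, event))

row_counts
distinct_counts
```

On 25-07-2020, user 3678 has two Add rows and user 3567 has two Visit rows, which is exactly where the two tables disagree.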

2
Duck, Jul 25, 2020 at 19:14