Posted 2019-12-14Ran hour read (About 6899 words)

[Taes' R #3] 분석을 위한 데이터셋의 형태 변환과 결합에 대하여

1. 지난 이야기

지난 글에서는 데이터가 표현될 수 있는 다양한 형태들에 대해 살펴보았다. 이 글에서는 그 중 하나였던 Wide/Long Form 전환의 필요성에 대해 다룰 예정이다. 원활한 이해를 위해 업무로 수행했던 분석 과제 중 하나를 매우 간소화해서 예제로 제시하고, 이를 함께 해결해보고자 한다.

2. 분석 과제 정의

이 글에서 우리는 VOD Contents를 고객에게 제공하는 Provider로서, 고객의 VOD 시청 로그를 이용하여 해당 고객의 다음 달 서비스 이탈 여부를 예측해보고자 한다. 각 고객에 대해 이탈 여부를 예측하고자 하므로, 모형의 학습에 이용할 최종 트레이닝 데이터 세트의 하나의 행에는 하나의 고객이 표현되어야 한다. 다시 말해, 우리의 분석 단위(Unit of Analysis)는 고객 개인이며, 하나의 행으로 표현된 각 고객의 여러 특징 정보들은 각각의 개별적인 컬럼으로 표현되도록 할 것이다. 즉, 우리가 모델링을 위해 이용할 데이터는 최종적으로 다음과 같은 형태가 되어야 한다.

USER_ID	$Y$	$X_1$	$X_2$	…	$X_m$
USER₁	YES	…	…	…	…
USER₂	NO	…	…	…	…
…	…	…	…	…	…
USER_n	YES	…	…	…	…

데이터의 각 행을 유니크하게 식별할 수 있는 Unique Key는 USER_ID라는 사실을 기억해두고 다음으로 넘어가보도록 하자.

3. 가상 데이터의 생성

이제 우리가 활용할 가상의 VOD 시청 기록 데이터를 생성해볼 차례이다. 순서대로 유저, 컨텐츠, 유저의 컨텐츠 시청 로그 순으로 정의할 것이며, tidyverse 라이브러리를 이용할 예정이다.

1 2	library(tidyverse) set.seed(19)

(1) 유저 데이터

슬프게도 우리는 당장 내일 망해도 이상할 게 없을 정도로 소규모의 VOD Contents Provider이다. 따라서 우리의 고객은 단 100명 뿐이고, 샘플 데이터의 간결함을 위해 우리가 고객에 대해 가지고 있는 정보는 고객 개개인을 유일하게 식별할 수 있는 user_id와 성별을 나타내는 gender, 그리고 해당 고객의 서비스 이탈 여부를 나타내는 target 뿐이라고 가정한다.

n_user = 100
target_v = readLines(url("https://www.dropbox.com/s/o7dbkdl7wwo0wh3/target.txt?raw=1"))
# target_v = sample(c("CHURN", "STAY"), n_user, replace=TRUE)

users = tibble(
    user_id = paste0("user_", str_pad(1:n_user, 2, "left", "0")),
    gender = factor(sample(c("F", "M"), n_user, replace=TRUE)),
    target = factor(target_v, levels=c("CHURN", "STAY"))
)

head(users)

A tibble: 6 × 3
user_id	gender	target
<chr>	<fct>	<fct>
user_01	F	CHURN
user_02	M	STAY
user_03	M	STAY
user_04	F	CHURN
user_05	F	STAY
user_06	M	STAY

(2) 컨텐츠 데이터

다음으로는 우리가 보유한 VOD 컨텐츠에 대해 정의해보자. 우리가 서비스하는 컨텐츠는 다음과 같은 특징들을 지니고 있다.

보유한 VOD 컨텐츠의 개수는 총 20개이다.
모든 VOD 컨텐츠는 다음 3개 채널 중 하나에 속한다.
- ch_a, ch_b, ch_c
모든 VOD 컨텐츠는 다음 4개 장르 중 하나에 속한다.
- drama, horror, comedy, romance
모든 VOD 컨텐츠는 0, 1000, 5000원 중 하나의 가격에 제공된다.
전체 VOD 마켓의 가격별 컨텐츠 비율은 대략적으로 다음과 같다.

가격	비율
`0원`	60%
`1000원`	30%
`5000원`	10%
합계	100%

n_content = 20
channels = paste0("ch_", letters[1:3])
genres = c("drama", "horror", "comedy", "romance")
prices = c(0, 1000, 5000)

contents = tibble(
    content_id = paste0("content_", 1:n_content),
    content_channel = channels[ceiling(runif(n_content, 0, length(channels)))],
    content_genre = genres[ceiling(runif(n_content, 0, length(genres)))],
    content_price = sample(prices, n_content, prob=c(0.6, 0.3, 0.1), replace=TRUE)
)

head(contents)

A tibble: 6 × 4
content_id	content_channel	content_genre	content_price
<chr>	<chr>	<chr>	<dbl>
content_1	ch_c	romance	0
content_2	ch_b	romance	0
content_3	ch_a	comedy	1000
content_4	ch_c	romance	0
content_5	ch_a	horror	1000
content_6	ch_c	comedy	0

우리가 방금 생성한 contents 데이터셋은 content_id 컬럼을 이용하여 각 행을 유일하게 식별할 수 있다는 사실을 쉽게 확인할 수 있다. 다시 말해, contents 데이터셋의 Unique Key는 content_id이다.

(3) 시청 로그 데이터

마지막으로, 우리의 소중한 고객들이 컨텐츠를 시청한 로그 데이터는 다음과 같은 특징들을 가지고 있다.

보유한 시청 로그는 총 3000개이며, 고객마다 시청한 VOD 컨텐츠 개수는 상이하다.
고객이 컨텐츠를 한 번 시청할 때마다, 각 고객별 시퀀스 번호를 신규 생성하여 하나의 레코드(행)로 기록한다.
컨텐츠 1회 시청 시간은 평균 30분 정도이며, 일반적으로 10분 정도 수준의 편차가 존재한다.
고객이 컨텐츠 시청을 시작한 시간대에 대한 정보를 기록하고 있다.

n_log = 3000
logs = users[ceiling(runif(n_log, 0, n_user)),"user_id"] %>%
    group_by(user_id) %>%
    mutate(seq = row_number()) %>%
    ungroup() %>%
    mutate(
        watching_time = floor(runif(n_log, 0, 24)),
        watching_duration = ceiling(rnorm(n_log, 30, 10)),
        content_id = pull(contents[ceiling(runif(n_log, 0, n_content)),"content_id"])
    ) %>%
    arrange(user_id, seq)

1	head(logs)

A tibble: 6 × 5
user_id	seq	watching_time	watching_duration	content_id
<chr>	<int>	<dbl>	<dbl>	<chr>
user_01	1	20	25	content_9
user_01	2	21	35	content_19
user_01	3	0	29	content_17
user_01	4	12	24	content_16
user_01	5	5	10	content_19
user_01	6	20	31	content_19

logs 데이터셋의 경우, 우선 user_id를 통해 users 데이터셋과, content_id를 통해 contents 데이터셋과 결합할 수 있으리라는 점을 쉽게 살펴볼 수 있다. 다만 users 데이터셋과는 달리 각 행은 user_id 하나만으로는 유일하게 식별되지 않으며, seq 컬럼이 추가된 user_id + seq가 Unique Key가 된다는 사실 또한 확인할 수 있다.

4. 최종 데이터의 형태

이제 우리가 어떤 데이터를 보유하고 있는지 모두 살펴보았다. 지금부터는 이 세 개의 데이터를 어떻게 조합해서 모델 학습에 이용할 최종 데이터셋을 생성할 지에 대해 생각해보자. 우선 시청 로그와 컨텐츠 데이터를 먼저 결합해보자. 시청 로그 데이터를 기준으로 우측에 컨텐츠 정보를 추가하는 형태가 될 것이고, 결합을 위한 Join Key는 각 데이터셋의 content_id 컬럼이 될 것이다. 결합이 완료된 데이터셋의 Unique Key는 contents 데이터셋과 동일하게 user_id + seq로 유지된다.

logs_with_contents = logs %>%
    left_join(contents, by = c("content_id"))

head(logs_with_contents)

A tibble: 6 × 8
user_id	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_01	1	20	25	content_9	ch_c	drama	0
user_01	2	21	35	content_19	ch_b	drama	0
user_01	3	0	29	content_17	ch_a	drama	0
user_01	4	12	24	content_16	ch_a	horror	0
user_01	5	5	10	content_19	ch_b	drama	0
user_01	6	20	31	content_19	ch_b	drama	0

이제 남은 문제는 유저 데이터와 시청 로그 데이터를 어떻게 결합할 것인지에 대한 부분이다. 둘 중 어느 데이터가 기준 데이터(Left Join을 가정할 때 Left에 위치하게 될 데이터)가 되어야 할 지 먼저 생각해보자. 앞서 2. 분석 과제 정의에서 우리의 분석 단위는 고객 개인이며, 따라서 학습에 이용할 최종 데이터셋에서 각 고객은 하나의 행으로 표현되어야 한다는 부분을 이미 언급했다. 즉, 최종 데이터셋의 Unique Key는 user_id가 되어야 할 것이다.

만약 단순하게 user_id를 기준으로 두 데이터셋(users와 logs_with_contents)를 결합하면 어떻게 될까? logs_with_contents 데이터의 각 고객은 여러 행에 걸쳐서 표현되고 있으므로, 두 데이터셋을 단순 결합할 경우 결과 데이터셋 또한 한 명의 고객의 데이터가 여러 행에 걸쳐서 나타나게 될 것이다. 다르게 표현하자면, 두 데이터셋의 Unique Key의 관계가 1:N이기 때문에, 1에 해당하는 user_id를 Unique Key로 유지한 상태로 결합하기 위해서는 N에 해당하는 데이터셋의 변환이 필요하다는 이야기이다.

따라서 우리가 원하는 최종 데이터 형태를 얻기 위해서는 users를 기준으로 데이터를 결합하되, N에 해당하는 logs_with_contents의 데이터는 다음의 세 가지 방법 중 하나를 통해 변환되어야만 한다.

이해를 돕기 위해 두 데이터셋에서 user_id가 ‘user_08’에 해당하는 데이터를 예시로 살펴보면서 각 방법에 대해서 살펴보도록 하자.

5. 결합을 위한 데이터의 변환

users 데이터셋에서 user_id가 ‘user_08’인 행은 단 하나뿐인 반면, logs_with_contents 데이터셋에서 user_id가 ‘user_08’인 행은 총 29개이다. 지금 우리의 관심사는 이 스물 아홉개의 행을 users의 하나의 행과 결합하기 위해서 어떻게 선택 혹은 요약해서 하나의 행으로 표현할 것인지에 대한 것이다.

1 2	users %>% filter(user_id=='user_08')

A tibble: 1 × 3
user_id	gender	target
<chr>	<fct>	<fct>
user_08	F	STAY

1 2	logs_with_contents %>% filter(user_id=='user_08')

A tibble: 29 × 8
user_id	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_08	1	7	30	content_12	ch_a	drama	0
user_08	2	10	29	content_2	ch_b	romance	0
user_08	3	15	28	content_6	ch_c	comedy	0
user_08	4	8	40	content_18	ch_c	horror	0
user_08	5	21	18	content_2	ch_b	romance	0
user_08	6	19	22	content_18	ch_c	horror	0
user_08	7	9	22	content_8	ch_a	comedy	1000
user_08	8	11	31	content_18	ch_c	horror	0
user_08	9	2	36	content_13	ch_a	comedy	0
user_08	10	5	18	content_2	ch_b	romance	0
user_08	11	17	39	content_3	ch_a	comedy	1000
user_08	12	3	41	content_7	ch_b	romance	1000
user_08	13	0	37	content_18	ch_c	horror	0
user_08	14	5	39	content_4	ch_c	romance	0
user_08	15	2	23	content_20	ch_b	romance	0
user_08	16	21	31	content_12	ch_a	drama	0
user_08	17	4	33	content_2	ch_b	romance	0
user_08	18	22	16	content_20	ch_b	romance	0
user_08	19	17	36	content_9	ch_c	drama	0
user_08	20	13	11	content_12	ch_a	drama	0
user_08	21	2	35	content_15	ch_c	horror	1000
user_08	22	2	23	content_10	ch_c	romance	0
user_08	23	15	18	content_5	ch_a	horror	1000
user_08	24	13	46	content_20	ch_b	romance	0
user_08	25	16	35	content_20	ch_b	romance	0
user_08	26	0	35	content_19	ch_b	drama	0
user_08	27	9	50	content_15	ch_c	horror	1000
user_08	28	3	16	content_1	ch_c	romance	0
user_08	29	7	45	content_12	ch_a	drama	0

(1) 첫 번째 방법: 특정 행의 선택

가장 쉽고 간단한 방법은 스물 아홉개의 행 중에서 하나의 행만을 선택해서 결합하는 것이다. 이를테면, 전체 시청 로그 중에서 시청 시간이 가장 긴 행을 대표행으로 선택하거나, 가장 최근의 시청 로그에 해당하는 행을 대표행으로 선택하고 나머지 행을 모두 버리는 방법을 생각해 볼 수 있다.

1
2
3

# users 데이터셋에서 user_id가 'user_08'인 행만 추출 
(row_u8 = users %>% 
    filter(user_id=='user_08'))

A tibble: 1 × 3
user_id	gender	target
<chr>	<fct>	<fct>
user_08	F	STAY

# user_08의 컨텐츠 시청 로그 중, 시청 시간(watching_duration)이 가장 긴 행만 추출
(method_1_select1 = logs_with_contents %>%
    filter(user_id=='user_08') %>%
    filter(watching_duration==max(watching_duration)) %>%
    filter(seq==min(seq)))

# 결합
row_u8 %>%
    left_join(method_1_select1, by='user_id')

A tibble: 1 × 8
user_id	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_08	27	9	50	content_15	ch_c	horror	1000

A tibble: 1 × 10
user_id	gender	target	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<fct>	<fct>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_08	F	STAY	27	9	50	content_15	ch_c	horror	1000

# user_08의 컨텐츠 시청 로그 중, 가장 최근에 시청한 컨텐츠에 대한 행만 추출
(method_1_select2 = logs_with_contents %>%
    filter(user_id=='user_08') %>%
    filter(seq==max(seq)))

# 결합
row_u8 %>%
    left_join(method_1_select2, by='user_id')

A tibble: 1 × 8
user_id	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_08	29	7	45	content_12	ch_a	drama	0

A tibble: 1 × 10
user_id	gender	target	seq	watching_time	watching_duration	content_id	content_channel	content_genre	content_price
<chr>	<fct>	<fct>	<int>	<dbl>	<dbl>	<chr>	<chr>	<chr>	<dbl>
user_08	F	STAY	29	7	45	content_12	ch_a	drama	0

사실 이러한 첫번째 방법은 지금 우리가 가진 데이터에서는 크게 의미가 없는 방법일 수도 있다. 하지만 예를 들어 한명의 고객이 여러 개의 계정/프로필을 가질 수 있고, 그 중에서 대표 계정/프로필에 대한 데이터만 선택해서 결합할 필요가 있다거나 하는 경우에는 매우 유용하게 활용될 수 있는 방법이다.

(2) 두 번째 방법: 분석 단위를 기준으로 집계

두 번째로는 최종 분석 단위 user_id를 기준으로 데이터를 집계(Aggregation/Summary)하는 변환에 대해서 살펴볼 것이다. 예를 들어, logs_with_contents 데이터로부터 user_id별로 총 시청 시간을 집계한 후에 users 데이터셋과 결합해보도록 하자.

우선 user_08의 29개의 시청 로그는 다음과 같이 하나의 행으로 요약될 수 있다.

(total_watching_duration_u8 = logs_with_contents %>% 
    filter(user_id=='user_08') %>%
    group_by(user_id) %>%
    summarise(total_watching_duration = sum(watching_duration)))

A tibble: 1 × 2
user_id	total_watching_duration
<chr>	<dbl>
user_08	883

users의 user_08에 대한 행도 하나이므로, 요약된 데이터와 다음과 같이 쉽게 결합할 수 있다.

1 2	row_u8 %>% left_join(total_watching_duration_u8, by='user_id')

A tibble: 1 × 4
user_id	gender	target	total_watching_duration
<chr>	<fct>	<fct>	<dbl>
user_08	F	STAY	883

‘user_08’만을 필터링하지 않은 전체 고객에 대한 집계와 결합 또한 다음 코드를 통해 어렵지 않게 수행할 수 있다.

# 각 고객별 총 시청 시간 집계
total_watching_duration = logs_with_contents %>%
    group_by(user_id) %>%
    summarise(total_watching_duration = sum(watching_duration))

head(total_watching_duration)

A tibble: 6 × 2
user_id	total_watching_duration
<chr>	<dbl>
user_01	684
user_02	980
user_03	1100
user_04	1155
user_05	810
user_06	704

최종 분석 단위를 기준으로 집계된 데이터의 Unique Key는 당연히 최종 분석 단위 user_id 가 된다. 따라서 users 데이터셋과 total_watching_duration 데이터셋의 Unique Key는 모두 user_id로 1:1 관계이므로, 다음과 같이 쉽게 결합할 수 있다.

# 결합
users %>%
    left_join(total_watching_duration, by='user_id') %>%
    head()

A tibble: 6 × 4
user_id	gender	target	total_watching_duration
<chr>	<fct>	<fct>	<dbl>
user_01	F	CHURN	684
user_02	M	STAY	980
user_03	M	STAY	1100
user_04	F	CHURN	1155
user_05	F	STAY	810
user_06	M	STAY	704

(3) 세 번째 방법: 분석 단위보다 세부 단위를 기준으로 집계 후, 결과 행의 개별 컬럼화

우리는 앞서 살펴본 것처럼 각 고객별 총 시청 시간을 집계해서 변수로 활용할 수도 있지만, 채널별 시청 시간을 집계해서 조금 더 세부적인 정보를 나타내는 변수로 활용하고 싶을 수도 있다. 이 때 집계의 기준 그룹은 기존과 달리 user_id가 아니라 user_id + content_channel이 되고, 집계 결과는 다음 코드를 통해 확인해 볼 수 있다.

watching_duration_by_channel = logs_with_contents %>%
    group_by(user_id, content_channel) %>% 
    summarise(total_watching_duration = sum(watching_duration))

head(watching_duration_by_channel, 12)

`summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.

A grouped_df: 12 × 3
user_id	content_channel	total_watching_duration
<chr>	<chr>	<dbl>
user_01	ch_a	159
user_01	ch_b	330
user_01	ch_c	195
user_02	ch_a	371
user_02	ch_b	189
user_02	ch_c	420
user_03	ch_a	412
user_03	ch_b	267
user_03	ch_c	421
user_04	ch_a	212
user_04	ch_b	353
user_04	ch_c	590

(2)에서와 달리 집계 결과의 Unique Key는 user_id + content_channel로 우리의 분석 단위 user_id와 일치하지 않는다. 다시 말해, 각 유저별로 3개의 채널에 대한 행을 가지고 있기 때문에 users와 watching_duration_by_channel의 관계는 여전히 1:n이며, 우리는 각 유저의 채널a, 채널b, 채널c 시청 시간을 별도의 컬럼들로 분리시킴으로써 하나의 레코드(행)에 하나의 유저에 대한 정보만 표현될 수 있도록 형태를 변경해볼 것이다. 즉, 현재 1차로 집계한 데이터를 보다 Wide한 형태로 변환할 것이다.

watching_duration_by_channel_wider = watching_duration_by_channel %>%
    pivot_wider(
        names_from=c(content_channel),
        names_prefix="watching_duration_",
        values_from = total_watching_duration
    )

head(watching_duration_by_channel_wider, 4)

A grouped_df: 4 × 4
user_id	watching_duration_ch_a	watching_duration_ch_b	watching_duration_ch_c
<chr>	<dbl>	<dbl>	<dbl>
user_01	159	330	195
user_02	371	189	420
user_03	412	267	421
user_04	212	353	590

이전 집계 결과에서는 watching_duration이라는 하나의 컬럼에서 3개 행으로 표현되던 채널별 시청 시간이, 각 채널별로 별도 컬럼으로 표현되도록 형태가 변환된 것을 확인해 볼 수 있다. (3 rows * 1 column)의 형태가 (1 row * 3 columns)으로 변환되면서, 각 user_id의 시청 정보는 하나의 행에 표현되고 있다. 따라서 변환된 집계 결과의 Unique Key는 user_id가 되며, 이전과 마찬가지로 users 데이터셋과 쉽게 결합할 수 있게 되었다.

1
2
3

users %>%
    left_join(watching_duration_by_channel_wider, by="user_id") %>%
    head(4)

A tibble: 4 × 6
user_id	gender	target	watching_duration_ch_a	watching_duration_ch_b	watching_duration_ch_c
<chr>	<fct>	<fct>	<dbl>	<dbl>	<dbl>
user_01	F	CHURN	159	330	195
user_02	M	STAY	371	189	420
user_03	M	STAY	412	267	421
user_04	F	CHURN	212	353	590

마지막으로 조금만 더 깊이 들어가보자. 채널별로 시청 시간만을 집계했던 조금 전과는 달리, 그룹별로 여러 개의 요약값을 집계하면 데이터는 어떤 형태가 될까? 예를 들어, 장르별로 ‘시청 시간’, ‘시청 횟수’, ‘결제액 합계’, ‘최고가 결제액’의 네 가지 수치를 동시에 집계할 경우의 데이터 형태에 대해서 살펴보도록 하자.

앞서 살펴보았던 케이스와의 차이점은 다음과 같이 정리해 볼 수 있다.

유저별 채널별 시청 시간 합계
- 집계 그룹:
  - user_id + content_channel
- 요약값:
  - 시청 시간 합계 sum(watching_duration)
유저별 장르별 시청 시간 합계, 시청 횟수, 결제액 합계, 최고가 결제액
- 집계 그룹:
  - user_id + conetent_genre
- 요약값:
  - 시청 시간 합계 sum(watching_duration)
  - 시청 횟수 count(content_id)
  - 결제액 합계 sum(content_price)
  - 최고가 결제액 max(content_price

기존과 동일한 방법으로 logs_with_contents 데이터셋을 집계한 결과는 다음과 같다.

agg_by_genre = logs_with_contents %>%
    group_by(user_id, content_genre) %>% 
    summarise(
        duration_sum = sum(watching_duration),
        watching_cnt = n(),
        price_sum = sum(content_price),
        price_max = max(content_price)
    )

head(agg_by_genre, 12)

`summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.

A grouped_df: 12 × 6
user_id	content_genre	duration_sum	watching_cnt	price_sum	price_max
<chr>	<chr>	<dbl>	<int>	<dbl>	<dbl>
user_01	comedy	30	1	1000	1000
user_01	drama	341	11	0	0
user_01	horror	70	3	2000	1000
user_01	romance	243	10	5000	1000
user_02	comedy	217	8	3000	1000
user_02	drama	126	4	0	0
user_02	horror	276	9	4000	1000
user_02	romance	361	13	5000	1000
user_03	comedy	360	10	2000	1000
user_03	drama	141	5	0	0
user_03	horror	334	10	6000	1000
user_03	romance	265	10	2000	1000

출력된 결과와 이전 집계 결과와의 차이점에 대해서 하나씩 살펴보도록 하자. 우선, 컨텐츠의 채널이 ch_a, ch_b, ch_c의 3개 값을 가질 수 있으므로 각 고객의 시청 로그가 3개 행에 걸쳐서 표현되었던 이전 집계 결과와 달리, 장르는 comedy, drama, horror, romance 4개의 값을 가질 수 있으므로 각 고객의 시청로그가 4개 행에 걸쳐서 표현되고 있다는 사실을 쉽게 확인해 볼 수 있다. 다음으로는, 집계된 수치에 해당하는 값이 total_watching_duration 하나 뿐이었던 이전과 달리, 이번 집계 결과에서는 duration_sum, watching_cnt, price_sum, price_max의 네 개의 수치가 집계되었다는 점 또한 확인할 수 있다.

데이터의 Shape 관점에서는 다음과 같이 정리해 볼 수 있다.

Type	Form	Row	Column	Total
채널별 집계 결과	Long	3	1	3
형태 변환 후	Wide	1	3	3

상단의 표는 우리가 Long Form의 형태로 표현된 채널별 집계 결과의 각 행을 별도의 컬럼으로 표현함으로써 Wide Form으로 변환할 수 있었다는 것을 의미한다. 한편, 장르별 집계 결과와 우리가 형태 변환을 통해서 얻고자 하는 데이터의 Shape은 다음과 같이 정리해 볼 수 있을 것이다.

Type	Form	Row	Column	Total
장르별 집계 결과	-	4	4	16
형태 변환 후	Wide	1	16	16

이전과 달리 집계 결과가 완전한 Long Form이 아니기 때문에, 이전과 동일한 방법으로 바로 Wide Form으로 변환하기 보다는, 일차적으로 Long Form으로 변환 후에 Wide Form으로 재변환하는 방식으로 데이터의 형태를 바꿔볼 것이다. 이를 정리하자면 다음과 같다.

Type	Form	Row	Column	Total
장르별 집계 결과	-	4	4	16
1차 변환 후	Long	16	1	16
2차 변환 후	Wide	1	16	16

Long Form으로 데이터를 표현하기 위한 1차 변환은 다음과 같이 수행할 수 있다. 한명의 고객의 데이터가 최대 16개의 행에 걸쳐서 표현된다는 점에 유의해서 살펴보도록 하자.

agg_by_genre_longer = agg_by_genre %>%
    pivot_longer(cols = c(duration_sum, price_sum, price_max, watching_cnt))

agg_by_genre_longer %>% head(48)

A grouped_df: 48 × 4
user_id	content_genre	name	value
<chr>	<chr>	<chr>	<dbl>
user_01	comedy	duration_sum	30
user_01	comedy	price_sum	1000
user_01	comedy	price_max	1000
user_01	comedy	watching_cnt	1
user_01	drama	duration_sum	341
user_01	drama	price_sum	0
user_01	drama	price_max	0
user_01	drama	watching_cnt	11
user_01	horror	duration_sum	70
user_01	horror	price_sum	2000
user_01	horror	price_max	1000
user_01	horror	watching_cnt	3
user_01	romance	duration_sum	243
user_01	romance	price_sum	5000
user_01	romance	price_max	1000
user_01	romance	watching_cnt	10
user_02	comedy	duration_sum	217
user_02	comedy	price_sum	3000
user_02	comedy	price_max	1000
user_02	comedy	watching_cnt	8
user_02	drama	duration_sum	126
user_02	drama	price_sum	0
user_02	drama	price_max	0
user_02	drama	watching_cnt	4
user_02	horror	duration_sum	276
user_02	horror	price_sum	4000
user_02	horror	price_max	1000
user_02	horror	watching_cnt	9
user_02	romance	duration_sum	361
user_02	romance	price_sum	5000
user_02	romance	price_max	1000
user_02	romance	watching_cnt	13
user_03	comedy	duration_sum	360
user_03	comedy	price_sum	2000
user_03	comedy	price_max	1000
user_03	comedy	watching_cnt	10
user_03	drama	duration_sum	141
user_03	drama	price_sum	0
user_03	drama	price_max	0
user_03	drama	watching_cnt	5
user_03	horror	duration_sum	334
user_03	horror	price_sum	6000
user_03	horror	price_max	1000
user_03	horror	watching_cnt	10
user_03	romance	duration_sum	265
user_03	romance	price_sum	2000
user_03	romance	price_max	1000
user_03	romance	watching_cnt	10

Long Form으로 1차 변환된 집계 결과를 다시 Wide Form으로 최종 변환하는 과정은 다음 코드를 통해서 수행할 수 있다.

agg_by_genre_wider = agg_by_genre_longer %>%
  pivot_wider(names_from = c(content_genre, name))

head(agg_by_genre_wider, 3)

A grouped_df: 3 × 17
user_id	comedy_duration_sum	comedy_price_sum	comedy_price_max	comedy_watching_cnt	drama_duration_sum	drama_price_sum	drama_price_max	drama_watching_cnt	horror_duration_sum	horror_price_sum	horror_price_max	horror_watching_cnt	romance_duration_sum	romance_price_sum	romance_price_max	romance_watching_cnt
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
user_01	30	1000	1000	1	341	0	0	11	70	2000	1000	3	243	5000	1000	10
user_02	217	3000	1000	8	126	0	0	4	276	4000	1000	9	361	5000	1000	13
user_03	360	2000	1000	10	141	0	0	5	334	6000	1000	10	265	2000	1000	10

드디어 우리가 원했던대로 각 고객의 정보가 하나의 행에 표현될 수 있도록 데이터의 형태가 변환되었다. 최종 변환된 집계 결과의 Unique Key는 user_id가 되며, 앞서 살펴보았던 케이스들과 마찬가지로 users 데이터셋과 쉽게 결합함으로써 분석을 위한 최종 데이터셋을 생성할 수 있다.

data_for_analysis = users %>%
    left_join(agg_by_genre_wider, by='user_id')

head(data_for_analysis, 10)

A tibble: 10 × 19
user_id	gender	target	comedy_duration_sum	comedy_price_sum	comedy_price_max	comedy_watching_cnt	drama_duration_sum	drama_price_sum	drama_price_max	drama_watching_cnt	horror_duration_sum	horror_price_sum	horror_price_max	horror_watching_cnt	romance_duration_sum	romance_price_sum	romance_price_max	romance_watching_cnt
<chr>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
user_01	F	CHURN	30	1000	1000	1	341	0	0	11	70	2000	1000	3	243	5000	1000	10
user_02	M	STAY	217	3000	1000	8	126	0	0	4	276	4000	1000	9	361	5000	1000	13
user_03	M	STAY	360	2000	1000	10	141	0	0	5	334	6000	1000	10	265	2000	1000	10
user_04	F	CHURN	278	2000	1000	9	261	0	0	9	145	3000	1000	5	471	4000	1000	16
user_05	F	STAY	318	5000	1000	10	126	0	0	4	189	3000	1000	6	177	1000	1000	6
user_06	M	STAY	150	3000	1000	4	77	0	0	2	283	4000	1000	9	194	3000	1000	8
user_07	F	CHURN	305	4000	1000	10	171	0	0	5	229	2000	1000	8	432	4000	1000	13
user_08	F	STAY	125	2000	1000	4	188	0	0	6	233	3000	1000	7	337	1000	1000	12
user_09	F	STAY	202	2000	1000	6	200	0	0	6	267	6000	1000	8	389	4000	1000	14
user_10	F	STAY	334	5000	1000	10	154	0	0	5	198	3000	1000	6	340	5000	1000	11

물론 이러한 형태의 데이터 형태 변환이 익숙하다면, 초기 집계 결과의 형태에서 Long Form으로의 중간 전환을 건너뛰고 바로 Wide Form으로 전환하는 것도 가능하다. 이를 위한 코드는 다음과 같으며, 함수의 각 파라미터의 의미에 대해서는 글을 읽는 우리 모두가 형태 변환에 보다 익숙해지기를 바라는 마음으로 직접 고민해 볼 부분으로 남겨놓는다.

agg_by_genre %>%
    pivot_wider(
        names_from=c(content_genre),
        values_from = c(duration_sum, price_sum, price_max, watching_cnt)
    ) %>%
    head(10)

A grouped_df: 10 × 17
user_id	duration_sum_comedy	duration_sum_drama	duration_sum_horror	duration_sum_romance	price_sum_comedy	price_sum_drama	price_sum_horror	price_sum_romance	price_max_comedy	price_max_drama	price_max_horror	price_max_romance	watching_cnt_comedy	watching_cnt_drama	watching_cnt_horror	watching_cnt_romance
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<int>	<int>	<int>	<int>
user_01	30	341	70	243	1000	0	2000	5000	1000	0	1000	1000	1	11	3	10
user_02	217	126	276	361	3000	0	4000	5000	1000	0	1000	1000	8	4	9	13
user_03	360	141	334	265	2000	0	6000	2000	1000	0	1000	1000	10	5	10	10
user_04	278	261	145	471	2000	0	3000	4000	1000	0	1000	1000	9	9	5	16
user_05	318	126	189	177	5000	0	3000	1000	1000	0	1000	1000	10	4	6	6
user_06	150	77	283	194	3000	0	4000	3000	1000	0	1000	1000	4	2	9	8
user_07	305	171	229	432	4000	0	2000	4000	1000	0	1000	1000	10	5	8	13
user_08	125	188	233	337	2000	0	3000	1000	1000	0	1000	1000	4	6	7	12
user_09	202	200	267	389	2000	0	6000	4000	1000	0	1000	1000	6	6	8	14
user_10	334	154	198	340	5000	0	3000	5000	1000	0	1000	1000	10	5	6	11

6. 간단한 모델링

다음은 생성한 데이터를 이용한 간단한 모델링과 결과 확인을 위한 샘플 코드이다. 랜덤하게 제너레이팅한 부분이 많은 데이터셋이라서 성능이 만족할만한 수준으로 나오지는 않지만, 모델링을 위한 글이 아니므로 세부적인 내용에 대해서는 다루지 않을 예정이다. 사실 모델링에 대한 내용을 제외해도 이미 한번에 읽기 부담스러울만큼 충분히 긴 글이기도 하다.

for (pkg in c("caret", "e1071", "randomForest"))
    if (!pkg %in% installed.packages()[,1])
        install.packages(pkg)

library(caret)

Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

1
2
3

# Imputation for Missing Values
data_imp = data_for_analysis %>%
    mutate_if(is.numeric, ~replace(., is.na(.), 0))

idx_tr = createDataPartition(pull(data_imp, target), p = 0.7, list=FALSE)
data_tr = select(data_imp[idx_tr, ], -user_id)
data_te = select(data_imp[-idx_tr, ], -user_id)

actual = pull(data_te, target)

1
2
3

fit_glm = glm(target ~ ., data=data_tr, family=binomial)
pred_prob_glm = predict(fit_glm, newdata=data_te, type='response')
pred_glm = as.factor(ifelse(pred_prob_glm > 0.5, levels(actual)[2], levels(actual)[1]))

1
2
3

fit_rf = train(target ~ ., data = data_tr, method = "rf", verbose = F)
pred_prob_rf = predict(fit_rf, newdata=data_te, type='prob')
pred_rf = as.factor(ifelse(pred_prob_rf[,2] > 0.5, "STAY", "CHURN"))

1	confusionMatrix(pred_glm, actual)

Confusion Matrix and Statistics

          Reference
Prediction CHURN STAY
     CHURN     5    3
     STAY      6   15
                                          
               Accuracy : 0.6897          
                 95% CI : (0.4917, 0.8472)
    No Information Rate : 0.6207          
    P-Value [Acc > NIR] : 0.287           
                                          
                  Kappa : 0.304           
                                          
 Mcnemar's Test P-Value : 0.505           
                                          
            Sensitivity : 0.4545          
            Specificity : 0.8333          
         Pos Pred Value : 0.6250          
         Neg Pred Value : 0.7143          
             Prevalence : 0.3793          
         Detection Rate : 0.1724          
   Detection Prevalence : 0.2759          
      Balanced Accuracy : 0.6439          
                                          
       'Positive' Class : CHURN

1	confusionMatrix(pred_rf, actual)

Confusion Matrix and Statistics

          Reference
Prediction CHURN STAY
     CHURN     6    1
     STAY      5   17
                                          
               Accuracy : 0.7931          
                 95% CI : (0.6028, 0.9201)
    No Information Rate : 0.6207          
    P-Value [Acc > NIR] : 0.03859         
                                          
                  Kappa : 0.5272          
                                          
 Mcnemar's Test P-Value : 0.22067         
                                          
            Sensitivity : 0.5455          
            Specificity : 0.9444          
         Pos Pred Value : 0.8571          
         Neg Pred Value : 0.7727          
             Prevalence : 0.3793          
         Detection Rate : 0.2069          
   Detection Prevalence : 0.2414          
      Balanced Accuracy : 0.7449          
                                          
       'Positive' Class : CHURN

7. Wrap-Up

이번 글에서는 원장성 데이터(마스터)와 이력성 데이터(히스토리)를 가지고 있을 때, 분석을 위한 최종 테이터셋을 생성하는 과정에 대해서 살펴보았다. 그 중에서도 특히 원장성 데이터와 결합하기 위해 이력성 데이터를 선택/집계하는 세 가지 방법에 대해 다루었으며, 각 방법을 데이터셋 간의 관계와 데이터의 형태 변환 측면에서 자세히 살펴보고자 했다.

특히 세 가지 방법 중에서 마지막에 해당하는 방법을 다루면서는, 지난 글에서 살펴보았던 데이터의 Long/Wide Form 변환에 대해 조금 더 자세히 설명하고자 했다. 이번 글을 통해서 지난 글에서 설명이 다소 부족했던 형태 변환의 필요성과 구체적인 변환 방법에 대한 실마리를 얻을 수 있었다면 더할 나위 없는 보람이 되겠다.

적지 않은 내용을 다소 많은 관점에서 다루면서 글이 다소 복잡해진 느낌이 들긴 하지만, 누군가에게는 도움이 될 수 있는 내용이기를 바라면서 글을 마친다.

[Taes' R #3] 분석을 위한 데이터셋의 형태 변환과 결합에 대하여

https://taes.me/R03/

Author

Taeseung Hahn

Posted on

2019-12-14

Updated on

2023-07-23

[Taes' R #3] 분석을 위한 데이터셋의 형태 변환과 결합에 대하여

1. 지난 이야기

2. 분석 과제 정의

3. 가상 데이터의 생성

(1) 유저 데이터

(2) 컨텐츠 데이터

(3) 시청 로그 데이터

4. 최종 데이터의 형태

5. 결합을 위한 데이터의 변환

(1) 첫 번째 방법: 특정 행의 선택

(2) 두 번째 방법: 분석 단위를 기준으로 집계

(3) 세 번째 방법: 분석 단위보다 세부 단위를 기준으로 집계 후, 결과 행의 개별 컬럼화

6. 간단한 모델링

7. Wrap-Up

Author

Posted on

Updated on

Licensed under

Comments

Catalogue

Categories

Recents