RFM
library(ggplot2)
library(tidyverse)
options(scipen = 9)
commerce <- read.csv("data.csv")
str(commerce)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
We change the data type for the date and the custumerID, fo other variables seem correct.
commerce$InvoiceDate <- lubridate::mdy_hm(commerce$InvoiceDate)
commerce$InvoiceDate <- as.Date(commerce$InvoiceDate, format = "%m/%d/%Y %H:%M")
commerce$CustomerID <- as.character(commerce$CustomerID)
We should get rid of units that are worth less than zero, and is not in store, as well as no customers without and ID.
commerce <- commerce %>%
filter(UnitPrice > 0 & Quantity > 0 & !is.na(CustomerID))
skimr::skim(commerce)
| Name | commerce |
| Number of rows | 397884 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| InvoiceNo | 0 | 1 | 6 | 6 | 0 | 18532 | 0 |
| StockCode | 0 | 1 | 1 | 12 | 0 | 3665 | 0 |
| Description | 0 | 1 | 6 | 35 | 0 | 3877 | 0 |
| CustomerID | 0 | 1 | 5 | 5 | 0 | 4338 | 0 |
| Country | 0 | 1 | 3 | 20 | 0 | 37 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| InvoiceDate | 0 | 1 | 2010-12-01 | 2011-12-09 | 2011-07-31 | 305 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Quantity | 0 | 1 | 12.99 | 179.33 | 1 | 2.00 | 6.00 | 12.00 | 80995.00 | ▇▁▁▁▁ |
| UnitPrice | 0 | 1 | 3.12 | 22.10 | 0 | 1.25 | 1.95 | 3.75 | 8142.75 | ▇▁▁▁▁ |
DataExplorer::plot_intro(commerce)

commerce <- commerce %>%
group_by(CustomerID) %>%
mutate(spending = UnitPrice * Quantity) %>%
ungroup()
commerce %>%
arrange(desc(spending)) %>%
reactable::reactable(compact = T)