A/B Test: Real-Time Bidding for Display Advertising
Real-Time Bidding (RTB)¶
Before going in depth into the main topic, which is A/B testing, a quick primer in case you have asked yourself what Real-Time Bidding is: it is a form of programmatic ad buying.
It refers to the buying and selling of ads in real time, on a per-impression basis, in an instant auction. This is usually facilitated by a supply-side platform or an ad exchange.
Web pages or mobile apps submit bid requests to potential advertisers on these exchanges. If an advertiser sees a bid request for a user they want to target, on a web page or mobile app they want to serve on, they submit a bid for that advertising placement. If they submit the highest (or only) bid, they win the placement and serve an ad.
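As a rough illustration of the auction step (not any exchange's actual API; the bid values and field names below are made up), the winning bid is simply the highest one received before the auction closes:
# Toy illustration of resolving an RTB auction: pick the highest bid.
# Field names and values are hypothetical.
bids = [
    {"advertiser": "A", "bid_cpm": 2.10},
    {"advertiser": "B", "bid_cpm": 3.45},
    {"advertiser": "C", "bid_cpm": 1.80},
]
if bids:
    winner = max(bids, key=lambda b: b["bid_cpm"])
    print(f"Placement won by {winner['advertiser']} at {winner['bid_cpm']} CPM")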
As an example of a marketplace product, DoubleClick Ad Exchange (developed by Google) allows buying and selling display advertising space. Amazon Web Services also offers platform solutions for making bidding decisions based on available input data (such as user information).
Problem Description¶
During a campaign, two distinct ad designs were run. We'd like to look at the campaign's results to answer whether the new ad increased conversions. To analyse the outcome, some users were assigned to a "test" group that received the new ad during the campaign.
Through A/B testing, distributions related to the bid requests and the targeted users are examined to gain insight.
Data Sets information¶
There are two files available as follows:
user_attributes.csv is a csv file of the target users' attributes
| Variable | Description |
|---|---|
| user_id | unique identifier for each user |
| attributes | text blob of the user's attributes |
| age | user's age in years |
| gender | user's inferred gender |
| location | user's state of residence |
| test | binary variable indicating that the user belongs to the test group |
bid_requests.csv is a csv file of bid requests for the targeted users. Each row represents one (bid) request.
| Variable | Description |
|---|---|
| timestamp | time the bid request was received |
| user_id | unique identifier for each user |
| bid | binary variable indicating whether a bid was made for the request |
| win | binary variable indicating whether the winning bid was submitted and an ad was served |
| conversion | binary variable indicating whether the user converted after seeing the ad |
Load files & merge dataframes¶
By way of summary, these are the steps we will walk through for data preparation:
- Load files. User attributes are given as a JSON blob in a single column, so we split them into separate columns
- Merge both dataframes on user_id
- Discretize age variable to group numeric values into discrete bins for easier grouping later
# get python packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import timedelta
import json
import squarify
sns.set(color_codes = True)
#from matplotlib import style
#style.use('dark_background')
%matplotlib inline
import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go
import plotly.figure_factory as ff
from numpy import mean
# Load the csv file with User Attributes into Pandas
user_attributes = pd.read_csv('user_attributes.csv')
# Check first five rows of the User Attributes table
user_attributes.head()
| user_id | attributes | |
|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | {"age": 33, "gender": "F", "location": "FL", "... |
| 1 | 000053b1e684c9e7ea73727b2238ce18 | {"age": 26, "gender": "M", "location": "AL", "... |
| 2 | 00029153d12ae1c9abe59c17ff2e0895 | {"age": 29, "gender": "F", "location": "AR", "... |
| 3 | 0002ac0d783338cfeab0b2bdbd872cda | {"age": 29, "gender": "M", "location": "SC", "... |
| 4 | 0004d0b59e19461ff126e3a08a814c33 | {"age": 27, "gender": "F", "location": "AR", "... |
# Split json string in attributes into a separate dataframe with
# four separate columns: age, gender, location and test
attributes = user_attributes.attributes.apply(json.loads) \
                            .apply(pd.json_normalize) \
                            .pipe(lambda x: pd.concat(x.values))
attributes.head()
| age | gender | location | test | |
|---|---|---|---|---|
| 0 | 33 | F | FL | 1 |
| 0 | 26 | M | AL | 1 |
| 0 | 29 | F | AR | 1 |
| 0 | 29 | M | SC | 0 |
| 0 | 27 | F | AR | 1 |
# Combine user_id with parsed json attributes columns into one main table.
# Reset indexes first in both tables since they are different.
user_attributes.reset_index(drop=True, inplace=True)
attributes.reset_index(drop=True, inplace=True)
uatt = pd.concat([user_attributes['user_id'],attributes], axis=1)
uatt.head()
| user_id | age | gender | location | test | |
|---|---|---|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 |
| 1 | 000053b1e684c9e7ea73727b2238ce18 | 26 | M | AL | 1 |
| 2 | 00029153d12ae1c9abe59c17ff2e0895 | 29 | F | AR | 1 |
| 3 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 |
| 4 | 0004d0b59e19461ff126e3a08a814c33 | 27 | F | AR | 1 |
# Load the csv file with Bid Requests into Pandas
bid_requests = pd.read_csv('bid_requests.csv')
# Check first five rows of the Bid Requests table
bid_requests.head()
| timestamp | user_id | bid | win | conversion | |
|---|---|---|---|---|---|
| 0 | 2017-01-01 00:00:01 | be7485be5b6eb3690efcbc9e95e8f15a | 0 | 0 | 0 |
| 1 | 2017-01-01 00:00:02 | 26c5dca2512a4c7fe8810bd04191b1b3 | 0 | 0 | 0 |
| 2 | 2017-01-01 00:00:05 | 2121376a323507c01c5e92c39ae8ccd4 | 0 | 0 | 0 |
| 3 | 2017-01-01 00:00:12 | fa6a0925d911185338b0acc93c66dc92 | 0 | 0 | 0 |
| 4 | 2017-01-01 00:00:13 | 4299f209da83da82b711f1d631cc607b | 1 | 0 | 0 |
# Merge user_id table with bid_requests table
# Convert timestamp from object to datetime format
data = uatt.merge(bid_requests, on='user_id')
data['timestamp'] = pd.to_datetime(data['timestamp'])
data.head()
| user_id | age | gender | location | test | timestamp | bid | win | conversion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-01 13:43:34 | 1 | 0 | 0 |
| 1 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 03:59:18 | 0 | 0 | 0 |
| 2 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 17:41:50 | 1 | 1 | 0 |
| 3 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-07 04:02:42 | 1 | 1 | 0 |
| 4 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-08 09:05:01 | 1 | 0 | 0 |
# add age bins to the data for easier grouping later
data['age_bins'] = pd.cut(x=data['age'], bins=[17, 21, 25, 29, 33])
display(data.head())
| user_id | age | gender | location | test | timestamp | bid | win | conversion | age_bins | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-01 13:43:34 | 1 | 0 | 0 | (29, 33] |
| 1 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 03:59:18 | 0 | 0 | 0 | (29, 33] |
| 2 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 17:41:50 | 1 | 1 | 0 | (29, 33] |
| 3 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-07 04:02:42 | 1 | 1 | 0 | (29, 33] |
| 4 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-08 09:05:01 | 1 | 0 | 0 | (29, 33] |
Recency - Frequency - Conversion (RFC)¶
RFM is a simple technique for segmenting customers. RFM stands for:
- Recency. The number of days since the user's most recent bid request. To calculate it, we take a snapshot date just after the last timestamp in the data set; the date difference shows how recent the user's last activity was.
- Frequency. The number of times an ad was served to the user, i.e. the number of winning bids.
- Monetary value. In this scenario, M represents conversion history. Since no monetary (revenue) value was given, we calculate a Conversion Frequency instead: the number of conversions per user.
Before we start with the data exploration, we set up two versions of data (test and baseline).
#Break data into two categories: Base and New Ad to analyze them separately
data_test = data[data['test'] == 1].reset_index(drop=True)
display(data_test.head())
display(data_test.shape)
data_base = data[data['test'] == 0].reset_index(drop=True)
display(data_base.head())
display(data_base.shape)
| user_id | age | gender | location | test | timestamp | bid | win | conversion | age_bins | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-01 13:43:34 | 1 | 0 | 0 | (29, 33] |
| 1 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 03:59:18 | 0 | 0 | 0 | (29, 33] |
| 2 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 17:41:50 | 1 | 1 | 0 | (29, 33] |
| 3 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-07 04:02:42 | 1 | 1 | 0 | (29, 33] |
| 4 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-08 09:05:01 | 1 | 0 | 0 | (29, 33] |
(337893, 10)
| user_id | age | gender | location | test | timestamp | bid | win | conversion | age_bins | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-08 00:30:17 | 0 | 0 | 0 | (25, 29] |
| 1 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-09 21:17:16 | 0 | 0 | 0 | (25, 29] |
| 2 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-13 06:12:09 | 0 | 0 | 0 | (25, 29] |
| 3 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-14 03:37:56 | 1 | 0 | 0 | (25, 29] |
| 4 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-15 06:12:19 | 0 | 0 | 0 | (25, 29] |
(262107, 10)
# Initial data process table for Base data
# Create snapshot date
snapshot_date = data_base['timestamp'].max() + timedelta(hours=8)
display(snapshot_date)
# Grouping by user id
data_process_base = data_base.groupby(['user_id']).agg({'timestamp': lambda x: (snapshot_date - x.max()).days,
'win': 'sum',
'conversion': 'sum'})
# Rename the columns
data_process_base.rename(columns={'timestamp': 'Recency',
'win': 'Frequency',
'conversion': 'Conversion Frequency'}, inplace=True)
display(data_process_base.head())
display('{:,} rows; {:,} columns'
.format(data_process_base.shape[0], data_process_base.shape[1]))
Timestamp('2017-01-23 21:19:09')
| Recency | Frequency | Conversion Frequency | |
|---|---|---|---|
| user_id | |||
| 0002ac0d783338cfeab0b2bdbd872cda | 3 | 0 | 0 |
| 00063cb5da1826febf178b669eea3250 | 1 | 4 | 0 |
| 0006aabe0ba47a35c0b0bf6596f85159 | 0 | 3 | 0 |
| 0007789b118e4710fc0e7c8758a6532a | 0 | 2 | 0 |
| 000c076c390a4c357313fca29e390ece | 16 | 0 | 0 |
'27,608 rows; 3 columns'
# Initial data process table for Test data
# Create snapshot date
snapshot_date = data_test['timestamp'].max() + timedelta(hours=8)
display(snapshot_date)
# Grouping by user id
data_process_test = data_test.groupby(['user_id']).agg({'timestamp': lambda x: (snapshot_date - x.max()).days,
'win': 'sum',
'conversion': 'sum'})
# Rename the columns
data_process_test.rename(columns={'timestamp': 'Recency',
'win': 'Frequency',
'conversion': 'Conversion Frequency'}, inplace=True)
display(data_process_test.head())
display('{:,} rows; {:,} columns'
.format(data_process_test.shape[0], data_process_test.shape[1]))
Timestamp('2017-01-23 21:19:28')
| Recency | Frequency | Conversion Frequency | |
|---|---|---|---|
| user_id | |||
| 00003e3b9e5336685200ae85d21b4f5e | 6 | 4 | 0 |
| 000053b1e684c9e7ea73727b2238ce18 | 8 | 0 | 0 |
| 00029153d12ae1c9abe59c17ff2e0895 | 0 | 0 | 0 |
| 00053f5e11d1fe4e49a221165b39abc9 | 3 | 2 | 0 |
| 0006dd05ea1e999ddaa041a7091b7b36 | 6 | 0 | 0 |
'35,529 rows; 3 columns'
# Add process table to the main data table for each category: Base and Test
total_data_base = data_base.merge(data_process_base, on='user_id')
display(total_data_base.head())
total_data_test = data_test.merge(data_process_test, on='user_id')
display(total_data_test.head())
| user_id | age | gender | location | test | timestamp | bid | win | conversion | age_bins | Recency | Frequency | Conversion Frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-08 00:30:17 | 0 | 0 | 0 | (25, 29] | 3 | 0 | 0 |
| 1 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-09 21:17:16 | 0 | 0 | 0 | (25, 29] | 3 | 0 | 0 |
| 2 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-13 06:12:09 | 0 | 0 | 0 | (25, 29] | 3 | 0 | 0 |
| 3 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-14 03:37:56 | 1 | 0 | 0 | (25, 29] | 3 | 0 | 0 |
| 4 | 0002ac0d783338cfeab0b2bdbd872cda | 29 | M | SC | 0 | 2017-01-15 06:12:19 | 0 | 0 | 0 | (25, 29] | 3 | 0 | 0 |
| user_id | age | gender | location | test | timestamp | bid | win | conversion | age_bins | Recency | Frequency | Conversion Frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-01 13:43:34 | 1 | 0 | 0 | (29, 33] | 6 | 4 | 0 |
| 1 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 03:59:18 | 0 | 0 | 0 | (29, 33] | 6 | 4 | 0 |
| 2 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-04 17:41:50 | 1 | 1 | 0 | (29, 33] | 6 | 4 | 0 |
| 3 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-07 04:02:42 | 1 | 1 | 0 | (29, 33] | 6 | 4 | 0 |
| 4 | 00003e3b9e5336685200ae85d21b4f5e | 33 | F | FL | 1 | 2017-01-08 09:05:01 | 1 | 0 | 0 | (29, 33] | 6 | 4 | 0 |
Exploratory Data Analysis¶
The share of records in the different states can affect the resulting conversion numbers. From the results obtained below, we can see that the test data has proportionally more records in TN, AL, NC and MS.
for i in list(data_base.columns[data_base.dtypes =='object']):
print(data_base[i].value_counts(normalize = True)*100)
print()
fbb43b2b2560e178cd84f7eaf9709740 0.018313
d0fda0cf702231b3963aedb610256146 0.017932
c7d67339105519c3dae1c323022dbd04 0.017550
37e9b839eeb8b2d3c484e37c548dccae 0.017550
796f13adf0167d7019717e1b71291ad7 0.017169
...
ddc1ff560673ec24600a5f837f0cb9bd 0.000382
da35a24fb3674802565f8bd5243a94d4 0.000382
673a52f787cc434410519234173933fc 0.000382
a19fbff55f49a023807bfe3ea9a9946e 0.000382
2596a5e119b63ccc73b675925f6aa228 0.000382
Name: user_id, Length: 27608, dtype: float64
M 50.107017
F 49.892983
Name: gender, dtype: float64
AR 11.745966
SC 11.615867
KY 11.449523
VA 11.324383
FL 11.186653
LA 11.012678
GA 10.900129
NC 5.386350
MS 5.286009
AL 5.134926
TN 4.957517
Name: location, dtype: float64
for i in list(data_test.columns[data_test.dtypes =='object']):
print(data_test[i].value_counts(normalize = True)*100)
print()
c270380928379b118113b0e0ce1a3f2f 0.014798
6aa1c1a875889044ebf43dfe844c7bc1 0.013614
ef8f94395be9fd78b7d0aeecf7864a03 0.013022
3c841cee5b2497ea9617f7e630b8ead1 0.013022
f6a4305cef95d570863561961ff7b118 0.012726
...
7b4d0bb1ea88165e7da2c307bea11d6e 0.000296
0043b79674f3725261aca28ec94e5fdc 0.000296
3279e1d4c83720ed7e4b99f98b26feb1 0.000296
5fc4698a9539a70b368c5aa9736c49eb 0.000296
dfaf5c7e7dcea4cc0a11e5961b2d6de6 0.000296
Name: user_id, Length: 35529, dtype: float64
F 50.045725
M 49.954275
Name: gender, dtype: float64
TN 12.477027
AL 12.261278
NC 12.123956
MS 11.932180
KY 7.461534
GA 7.415069
LA 7.329243
SC 7.273012
VA 7.267093
FL 7.262358
AR 7.197249
Name: location, dtype: float64
Bid requests, wins and resulting conversions¶
We start by plotting, for each state, the mean bid, win and conversion counts, broken down by gender and age bin.
fig, ax = plt.subplots(1, 2,figsize=(21,8), sharey=True)
fig.suptitle('Average bid count per gender in each state')
order_statebin = pd.value_counts(data['location']).sort_index().index
ax1 = sns.barplot(data_test['location'],total_data_test['bid'], hue=data_test['gender'], order = order_statebin, palette ='inferno', ax=ax[1]).set_title('New Ad');
ax2 = sns.barplot(data_base['location'],total_data_base['bid'], hue=data_base['gender'], order = order_statebin, palette ='inferno', ax=ax[0]).set_title('Baseline');
fig, ax = plt.subplots(1, 2,figsize=(21,8), sharey=True)
fig.suptitle('Average win count per gender in each state')
order_statebin = pd.value_counts(data['location']).sort_index().index
ax1 = sns.barplot(data_test['location'],total_data_test['win'], hue=data_test['gender'], order = order_statebin, palette ='inferno', ax=ax[1]).set_title('New Ad');
ax2 = sns.barplot(data_base['location'],total_data_base['win'], hue=data_base['gender'], order = order_statebin, palette ='inferno', ax=ax[0]).set_title('Baseline');
fig, ax = plt.subplots(1, 2,figsize=(21,8), sharey=True)
fig.suptitle('Average conversion count per gender in each state')
order_statebin = pd.value_counts(data['location']).sort_index().index
ax1 = sns.barplot(data_test['location'],total_data_test['Conversion Frequency'], hue=data_test['gender'], order = order_statebin, palette ='inferno', ax=ax[1]).set_title('New Ad');
ax2 = sns.barplot(data_base['location'],total_data_base['Conversion Frequency'], hue=data_base['gender'], order = order_statebin, palette ='inferno', ax=ax[0]).set_title('Baseline');
Observations from the three pairs of plots above:
- Bid requests and wins are similar across the baseline and new ad campaigns, so conversions can be compared across the groups without normalizing conversion data by bid and win counts (a sketch of such a normalization follows below for completeness). The Conversion Frequency plot shows that all states benefited from the new ad campaign. In AR, FL, GA, KY, MS, SC, TN and VA, female conversions increased slightly more than male conversions.
Females seem to have higher conversions overall.
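If one did want to normalize, a minimal sketch of a per-state, per-gender conversion rate (conversions divided by wins, using the merged data frame built earlier) could look like this:
# Sketch: conversion rate per served ad (conversion / win) by group, state and gender.
# Assumes the merged `data` frame from the data-preparation step above.
conv_per_win = (
    data.groupby(['test', 'location', 'gender'])[['conversion', 'win']]
        .sum()
        .assign(conv_rate=lambda d: d['conversion'] / d['win'].replace(0, np.nan))
        .reset_index()
)
display(conv_per_win.head(10))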
fig, ax = plt.subplots(1, 2,figsize=(34,8),sharey=True)
fig.suptitle('Average bid count per age bracket in each state')
order_statebin = pd.value_counts(data['location']).sort_index().index
ax1 = sns.barplot(data_test['location'],total_data_test['bid'], hue=data['age_bins'], order = order_statebin, palette ='viridis', ax=ax[1]).set_title('New Ad');
ax2 = sns.barplot(data_base['location'],total_data_base['bid'], hue=data['age_bins'], order = order_statebin, palette ='viridis', ax=ax[0]).set_title('Baseline');
fig, ax = plt.subplots(1, 2,figsize=(34,8),sharey=True)
fig.suptitle('Average conversion count per age bracket in each state')
order_statebin = pd.value_counts(data['location']).sort_index().index
ax1 = sns.barplot(data_test['location'],total_data_test['Conversion Frequency'], hue=data['age_bins'], order = order_statebin, palette ='viridis', ax=ax[1]).set_title('New Ad');
ax2 = sns.barplot(data_base['location'],total_data_base['Conversion Frequency'], hue=data['age_bins'], order = order_statebin, palette ='viridis', ax=ax[0]).set_title('Baseline');
Observations from the above two plots:
- AL - Age bracket [29-33] experienced the highest conversions
- AR - Age brackets [21-25] and [29-33] experienced the highest conversions
- FL - Age brackets [18-21] and [21-25] experienced the highest conversions
- GA - Age brackets [18-21] and [29-33] experienced the highest conversions
- KY - Age bracket [21-25] experienced the highest conversions
- LA - Age bracket [25-29] experienced the highest conversions. Note that conversions in the youngest bracket [18-21] decreased.
- MS - Age bracket [18-21] experienced the highest conversions.
- NC - Age bracket [21-25] experienced the highest conversions.
- SC - Age bracket [18-21] experienced the highest conversions.
- TN - Experienced a uniform increase in conversions across age brackets.
- VA - Age bracket [21-25] experienced the highest conversions.
Based on these observations, most conversions were observed in the population under 25 years of age.
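To put a number behind the age observation, a quick sketch (using the age_bins column created earlier; this summary is an assumption about how one might tabulate it, not output shown above):
# Sketch: share of bid requests that converted, by age bin, for baseline (0) and test (1).
conv_by_age = (
    data.groupby(['test', 'age_bins'])['conversion']
        .mean()
        .unstack('test')
)
display(conv_by_age)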
RFC Distribution for Base and Test¶
Our aim is to observe the distribution of RFC. For this purpose, we divide the underlying data into equal-sized buckets based on sample quantiles (4 quartiles). For the resulting bins we use score labels for recency and frequency.
Further calculations can be done if we want to examine the behavior in more detail.
# --Calculate Recency and Frequency groups for Base--
# Create labels for Recency and Frequency
r_labels = range(4, 0, -1); f_labels = range(1, 5)
# Assign these labels to 4 equal percentile groups
r_groups = pd.qcut(data_process_base['Recency'], q=4, labels=r_labels)
f_groups = pd.qcut(data_process_base['Frequency'], q=4, labels=f_labels)
# Create new columns R and F
data_process_base = data_process_base.assign(R = r_groups.values, F = f_groups.values)
display(data_process_base.head())
#---------------------------------------------------------------------------------------
# --Calculate Recency and Frequency groups for Test--
# Create labels for Recency and Frequency
r_labels = range(4, 0, -1); f_labels = range(1, 5)
# Assign these labels to 4 equal percentile groups
r_groups = pd.qcut(data_process_test['Recency'], q=4, labels=r_labels)
f_groups = pd.qcut(data_process_test['Frequency'], q=4, labels=f_labels)
# Create new columns R and F
data_process_test = data_process_test.assign(R = r_groups.values, F = f_groups.values)
display(data_process_test.head())
| Recency | Frequency | Conversion Frequency | R | F | |
|---|---|---|---|---|---|
| user_id | |||||
| 0002ac0d783338cfeab0b2bdbd872cda | 3 | 0 | 0 | 2 | 1 |
| 00063cb5da1826febf178b669eea3250 | 1 | 4 | 0 | 4 | 4 |
| 0006aabe0ba47a35c0b0bf6596f85159 | 0 | 3 | 0 | 4 | 3 |
| 0007789b118e4710fc0e7c8758a6532a | 0 | 2 | 0 | 4 | 2 |
| 000c076c390a4c357313fca29e390ece | 16 | 0 | 0 | 1 | 1 |
| Recency | Frequency | Conversion Frequency | R | F | |
|---|---|---|---|---|---|
| user_id | |||||
| 00003e3b9e5336685200ae85d21b4f5e | 6 | 4 | 0 | 1 | 4 |
| 000053b1e684c9e7ea73727b2238ce18 | 8 | 0 | 0 | 1 | 1 |
| 00029153d12ae1c9abe59c17ff2e0895 | 0 | 0 | 0 | 4 | 1 |
| 00053f5e11d1fe4e49a221165b39abc9 | 3 | 2 | 0 | 2 | 2 |
| 0006dd05ea1e999ddaa041a7091b7b36 | 6 | 0 | 0 | 1 | 1 |
# Plot RFC (Recency, Frequency, Conversion Frequency) distributions
plt.figure(figsize=(21,8))
# Plot distribution of R
plt.subplot(3, 2, 1); sns.distplot(data_process_base['Recency']).set_title('Base Recency')
# Plot distribution of F
plt.subplot(3, 2, 3); sns.distplot(data_process_base['Frequency']).set_title('Base Frequency')
# Plot distribution of C
plt.subplot(3, 2, 5); sns.distplot(data_process_base['Conversion Frequency']).set_title('Base Conversion Frequency')
# Plot distribution of R
plt.subplot(3, 2, 2); sns.distplot(data_process_test['Recency']).set_title('Test Recency')
# Plot distribution of F
plt.subplot(3, 2, 4); sns.distplot(data_process_test['Frequency']).set_title('Test Frequency')
# Plot distribution of C
plt.subplot(3, 2, 6); sns.distplot(data_process_test['Conversion Frequency']).set_title('Test Conversion Frequency')
# Show the plot
plt.show()
Observations from the six plots above:
- Recency, Frequency and Conversion Frequency have similar distributions for the base and test groups.
- Conversion Frequency is examined later in more detail.
Most users have a recency within the first two days; these users could be used to build a retargeting schedule (a sketch of pulling such a list follows below).
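A minimal sketch of pulling such a retargeting list from the processed test table (the two-day cutoff is an assumption taken from the observation above):
# Sketch: test-group users whose last bid request was at most 2 days before the snapshot date.
retarget_candidates = data_process_test[data_process_test['Recency'] <= 2].index
print('{:,} test users with recency <= 2 days'.format(len(retarget_candidates)))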
RFC Segmentation¶
We use the sum of the scores to create segments. We'll divide users into groups based on the sum of their R and F scores.
# Concat RFC quartile values to create RFC Segments for Base
def join_RFC(x): return str(x['R']) + str(x['F'])
data_process_base['RFC_Segment_Concat_Base'] = data_process_base.apply(join_RFC, axis=1)
rfc_base = data_process_base
#select only rows where conversion happened
rfc_base = rfc_base[rfc_base['Conversion Frequency'] != 0].reset_index(drop=True)
display(rfc_base.head())
#-----------------------------------------------------------------------------------------
# Concat RFC quartile values to create RFC Segments for Test
def join_RFC(x): return str(x['R']) + str(x['F'])
data_process_test['RFC_Segment_Concat_Test'] = data_process_test.apply(join_RFC, axis=1)
rfc_test = data_process_test
#select only rows where conversion happened
rfc_test = rfc_test[rfc_test['Conversion Frequency'] != 0].reset_index(drop=True)
display(rfc_test.head())
| Recency | Frequency | Conversion Frequency | R | F | RFC_Segment_Concat_Base | |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 4 | 3 | 43 |
| 1 | 7 | 2 | 1 | 1 | 2 | 12 |
| 2 | 5 | 3 | 1 | 1 | 3 | 13 |
| 3 | 1 | 5 | 1 | 4 | 4 | 44 |
| 4 | 2 | 1 | 1 | 3 | 1 | 31 |
| Recency | Frequency | Conversion Frequency | R | F | RFC_Segment_Concat_Test | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 4 | 1 | 41 |
| 1 | 1 | 5 | 1 | 4 | 4 | 44 |
| 2 | 3 | 3 | 1 | 2 | 3 | 23 |
| 3 | 6 | 3 | 1 | 1 | 3 | 13 |
| 4 | 0 | 6 | 1 | 4 | 4 | 44 |
# Count num of unique segments for Base
rfc_count_unique_base = rfc_base.groupby('RFC_Segment_Concat_Base')['RFC_Segment_Concat_Base'].nunique()
display(rfc_count_unique_base.sum())
# Count num of unique segments for Test
rfc_count_unique_test = rfc_test.groupby('RFC_Segment_Concat_Test')['RFC_Segment_Concat_Test'].nunique()
display(rfc_count_unique_test.sum())
16
16
# Calculate RF_Score for Base
rfc_base['RFC_Score'] = rfc_base[['R','F']].sum(axis=1)
display(rfc_base['RFC_Score'].head())
# Calculate RF_Score for Test
rfc_test['RFC_Score'] = rfc_test[['R','F']].sum(axis=1)
display(rfc_test['RFC_Score'].head())
0    7.0
1    3.0
2    4.0
3    8.0
4    4.0
Name: RFC_Score, dtype: float64
0    5.0
1    8.0
2    5.0
3    4.0
4    8.0
Name: RFC_Score, dtype: float64
# Define RFC_level function
def rfc_level(df):
if df['RFC_Score'] >= 6:
return 'High Conversions'
elif ((df['RFC_Score'] >= 5) and (df['RFC_Score'] < 6)):
return 'Medium Conversions'
elif ((df['RFC_Score'] >= 4) and (df['RFC_Score'] < 5)):
return 'Lower Conversions'
elif ((df['RFC_Score'] >= 3) and (df['RFC_Score'] < 4)):
return 'Need Slight Retargeting Modification'
elif ((df['RFC_Score'] >= 2) and (df['RFC_Score'] < 3)):
return 'Need Medium Retargeting Modification'
elif ((df['RFC_Score'] >= 1) and (df['RFC_Score'] < 2)):
return 'Need High Retargeting Modification'
else:
return 'Require Activation'
# Create a new variable RFC_Level in Base
rfc_base['RFC_Level'] = rfc_base.apply(rfc_level, axis=1)
# Print the header with top 5 rows to the console
display(rfc_base.head())
# Create a new variable RFC_Level in Test
rfc_test['RFC_Level'] = rfc_test.apply(rfc_level, axis=1)
# Print the header with top 5 rows to the console
display(rfc_test.head())
| Recency | Frequency | Conversion Frequency | R | F | RFC_Segment_Concat_Base | RFC_Score | RFC_Level | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 4 | 3 | 43 | 7.0 | High Conversions |
| 1 | 7 | 2 | 1 | 1 | 2 | 12 | 3.0 | Need Slight Retargeting Modification |
| 2 | 5 | 3 | 1 | 1 | 3 | 13 | 4.0 | Lower Conversions |
| 3 | 1 | 5 | 1 | 4 | 4 | 44 | 8.0 | High Conversions |
| 4 | 2 | 1 | 1 | 3 | 1 | 31 | 4.0 | Lower Conversions |
| Recency | Frequency | Conversion Frequency | R | F | RFC_Segment_Concat_Test | RFC_Score | RFC_Level | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 4 | 1 | 41 | 5.0 | Medium Conversions |
| 1 | 1 | 5 | 1 | 4 | 4 | 44 | 8.0 | High Conversions |
| 2 | 3 | 3 | 1 | 2 | 3 | 23 | 5.0 | Medium Conversions |
| 3 | 6 | 3 | 1 | 1 | 3 | 13 | 4.0 | Lower Conversions |
| 4 | 0 | 6 | 1 | 4 | 4 | 44 | 8.0 | High Conversions |
# Calculate average values for each RFC_Level, and return a size of each segment for Base
rfc_level_agg_base = rfc_base.groupby('RFC_Level').agg({
'Recency': 'mean',
'Frequency': 'mean',
'Conversion Frequency': 'mean',
}).round(1)
# Print the aggregated dataset
display(rfc_level_agg_base)
#---------------------------------------------------------------------------------
# Calculate average values for each RFC_Level, and return a size of each segment for Test
rfc_level_agg_test = rfc_test.groupby('RFC_Level').agg({
'Recency': 'mean',
'Frequency': 'mean',
'Conversion Frequency': 'mean',
}).round(1)
# Print the aggregated dataset
display(rfc_level_agg_test)
| Recency | Frequency | Conversion Frequency | |
|---|---|---|---|
| RFC_Level | |||
| High Conversions | 0.9 | 4.8 | 1.1 |
| Lower Conversions | 4.3 | 2.1 | 1.0 |
| Medium Conversions | 2.8 | 2.5 | 1.1 |
| Need Medium Retargeting Modification | 8.2 | 1.0 | 1.0 |
| Need Slight Retargeting Modification | 6.0 | 1.7 | 1.0 |
| Recency | Frequency | Conversion Frequency | |
|---|---|---|---|
| RFC_Level | |||
| High Conversions | 0.9 | 4.8 | 1.1 |
| Lower Conversions | 4.3 | 2.1 | 1.0 |
| Medium Conversions | 2.8 | 2.5 | 1.1 |
| Need Medium Retargeting Modification | 7.8 | 1.0 | 1.0 |
| Need Slight Retargeting Modification | 5.8 | 1.6 | 1.0 |
Observations from the two plots below:
- The new ad campaign moved users from the medium-conversion segment into the high-conversion segment.
# A tree plot for Base
# Labels are taken from the aggregated index so they stay aligned with the sizes.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(13, 8)
squarify.plot(sizes=rfc_level_agg_base['Frequency'],
              label=list(rfc_level_agg_base.index), alpha=.6)
plt.title("RFC Segments Base",fontsize=18,fontweight="bold")
plt.axis('off')
plt.show()
# A tree plot for Test
# Labels are taken from the aggregated index so they stay aligned with the sizes.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(13, 8)
squarify.plot(sizes=rfc_level_agg_test['Frequency'],
              label=list(rfc_level_agg_test.index), alpha=.6)
plt.title("RFC Segments Test",fontsize=18,fontweight="bold")
plt.axis('off')
plt.show()
Checking the conversion rate for the New Ad and Base groups:
- The conversion rate seems very small. A desired rate is usually around 0.02; in this case it is 0.00035.
This is a first sign that the new design should not be adopted. Further A/B testing below shows whether that holds.
A/B Testing¶
ab_summary = data.pivot_table(values='conversion', index='test', aggfunc=np.sum)
# add additional columns to the pivot table
ab_summary['total'] = data.pivot_table(values='conversion', index='test', aggfunc=lambda x: len(x))
ab_summary['rate'] = data.pivot_table(values='conversion', index='test')
display(ab_summary)
| conversion | total | rate | |
|---|---|---|---|
| test | |||
| 0 | 2723 | 262107 | 0.010389 |
| 1 | 3392 | 337893 | 0.010039 |
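Since conversion is a binary outcome, an equivalent cross-check is a two-proportion z-test. Assuming statsmodels is available (it is not imported above), a sketch would be:
# Sketch: two-proportion z-test on conversion counts, using the ab_summary table above.
from statsmodels.stats.proportion import proportions_ztest
counts = ab_summary['conversion'].values   # conversions per group (base, test)
nobs = ab_summary['total'].values          # bid requests per group
z_stat, p_value = proportions_ztest(count=counts, nobs=nobs)
print('z = {:.3f}, p = {:.4f}'.format(z_stat, p_value))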
test_results = data[data.test == 1].conversion
base_results = data[data.test == 0].conversion
hist_data = [test_results, base_results]
group_labels = ['test', 'control']
# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot(hist_data, group_labels, bin_size=.5,
curve_type='normal',show_rug=False)
fig.layout = go.Layout(
title='Test vs Control',
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
# Plot!
pyoff.iplot(fig)
Checking the density plot for the test and base groups:
- There seems to be a slight difference between the two groups
- We check this further below via scipy
from scipy import stats
test_result = stats.ttest_ind(test_results, base_results)
print(test_result)
Ttest_indResult(statistic=-1.3396111452321071, pvalue=0.18037230587087172)
Hypotheses¶
Null Hypothesis: there is no significant difference between the test and base groups.
- The t-statistic represents the difference between the averages of the test and base groups in units of standard error. A larger absolute t-statistic means a bigger difference (a manual computation of the statistic is sketched below).
- The p-value measures the probability of observing a difference at least this large if the null hypothesis were true. If it is < 5%, the result is considered statistically significant. We check whether the result is significant below.
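For reference, a sketch of the pooled (equal-variance) t-statistic that stats.ttest_ind computes by default:
# Sketch: manual pooled-variance t-statistic for the two conversion samples.
n1, n2 = len(test_results), len(base_results)
m1, m2 = test_results.mean(), base_results.mean()
s1, s2 = test_results.var(ddof=1), base_results.var(ddof=1)
# Pooled variance and standard error of the difference in means
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual = (m1 - m2) / se
print('t = {:.4f}'.format(t_manual))   # should match the statistic printed above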
def eval_test(test_results,base_results):
test_result = stats.ttest_ind(test_results, base_results)
if test_result[1] < 0.05:
print('result is significant')
else:
print('result is not significant')
eval_test(test_results, base_results)
result is not significant
The result is not statistically significant on the full data set; with two large, heterogeneous groups, a pooled test can mask segment-level effects. A next step is to break the data down by state and test the significance again, as sketched below.
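A sketch of that per-state breakdown, reusing the same t-test within each state (note that the smaller per-state samples have less statistical power):
# Sketch: test conversion differences between test and baseline within each state.
for state in sorted(data['location'].unique()):
    state_data = data[data['location'] == state]
    t_conv = state_data[state_data.test == 1].conversion
    b_conv = state_data[state_data.test == 0].conversion
    stat, pval = stats.ttest_ind(t_conv, b_conv)
    flag = 'significant' if pval < 0.05 else 'not significant'
    print('{}: t = {:.2f}, p = {:.4f} ({})'.format(state, stat, pval, flag))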
Conclusions¶
From the distributions related to requests and the targeted users, it seems that the new ad campaign had a higher conversion rate among females under the age of 25.
As for the effectiveness test: while the visual plots show an increase in conversions for the new ad campaign compared to the baseline, a t-test on the whole baseline data versus the test data showed that the difference was not statistically significant.