Kaggle - GA Customer Revenue prediction - EDA

21 minute read

Published:

There is an ongoing competition in Kaggle. I want to share the exploration of the data as learned from other competitor’s kernel. Of course I will not include the detailed feature engineering and modeling here and now.

Overview

What to predict

Mission of this competition is to predict natural log of the sum of all transactions per user

Data

  • fullVisitorId- User ID
  • channelGrouping - Channel
  • date - date
  • device - device
  • geoNetwork - User’s geography info
  • socialEngagementType - Either “Socially Engaged” or “Not Socially Engaged”
  • totals - Aggregate values across the session
  • trafficSource - Traffic Source from which the session originated
  • visitId - An identifier for this session
  • visitNumber - The session number for this user
  • visitStartTime - The timestamp
  • hits - A record of all page visits
  • customDimensions - User-level or session-level custom dimensions that are set for a session
  • totals - High-level aggregate data.

Evaluation

Using RMSE to measure the difference between the predictions and the fare. \(RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2}\)

So here, the lower RMSE means better prediction.

EDA

I want to share that I learned and practiced after study SRK and Andrew Lukyanenko ‘s kernel. Thanks for their open kernel for study and let me start the practice.

Import python library to be used.

import os
import json
import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
import numpy as np
import pandas as pd
import json
import bq_helper
from pandas.io.json import json_normalize
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import lightgbm as lgb

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import TimeSeriesSplit, KFold
from sklearn.metrics import mean_squared_error

Let’s see the train data. There are JSON format in several columns like device, hits, etc.

train1=pd.read_csv("../input/train_v2.csv",nrows=200)
train1.head()
channelGroupingcustomDimensionsdatedevicefullVisitorIdgeoNetworkhitssocialEngagementTypetotalstrafficSourcevisitIdvisitNumbervisitStartTime
0Organic Search[{'index': '4', 'value': 'EMEA'}]20171016{"browser": "Firefox", "browserVersion": "not ...3162355547410993243{"continent": "Europe", "subContinent": "Weste...[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged{"visits": "1", "hits": "1", "pageviews": "1",...{"campaign": "(not set)", "source": "google", ...150819845011508198450
1Referral[{'index': '4', 'value': 'North America'}]20171016{"browser": "Chrome", "browserVersion": "not a...8934116514970143966{"continent": "Americas", "subContinent": "Nor...[{'hitNumber': '1', 'time': '0', 'hour': '10',...Not Socially Engaged{"visits": "1", "hits": "2", "pageviews": "2",...{"referralPath": "/a/google.com/transportation...150817630761508176307
2Direct[{'index': '4', 'value': 'North America'}]20171016{"browser": "Chrome", "browserVersion": "not a...7992466427990357681{"continent": "Americas", "subContinent": "Nor...[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged{"visits": "1", "hits": "2", "pageviews": "2",...{"campaign": "(not set)", "source": "(direct)"...150820161311508201613
3Organic Search[{'index': '4', 'value': 'EMEA'}]20171016{"browser": "Chrome", "browserVersion": "not a...9075655783635761930{"continent": "Asia", "subContinent": "Western...[{'hitNumber': '1', 'time': '0', 'hour': '9', ...Not Socially Engaged{"visits": "1", "hits": "2", "pageviews": "2",...{"campaign": "(not set)", "source": "google", ...150816985111508169851
4Organic Search[{'index': '4', 'value': 'Central America'}]20171016{"browser": "Chrome", "browserVersion": "not a...6960673291025684308{"continent": "Americas", "subContinent": "Cen...[{'hitNumber': '1', 'time': '0', 'hour': '14',...Not Socially Engaged{"visits": "1", "hits": "2", "pageviews": "2",...{"campaign": "(not set)", "source": "google", ...150819055211508190552

Julian Peller opened his kernel to deal with parsing JSON as below.

# https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields

def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
%%time
train = load_df("../input/train_v2.csv",nrows=20000)
test = load_df("../input/test_v2.csv")
Loaded train_v2.csv. Shape: (20000, 59)
Loaded test_v2.csv. Shape: (401589, 59)
CPU times: user 2min 10s, sys: 11.8 s, total: 2min 22s
Wall time: 2min 24s

After parsing, data looks as below.

train.head()
channelGroupingcustomDimensionsdatefullVisitorIdhitssocialEngagementTypevisitIdvisitNumbervisitStartTimedevice.browserdevice.browserSizedevice.browserVersiondevice.deviceCategorydevice.flashVersiondevice.isMobiledevice.languagedevice.mobileDeviceBrandingdevice.mobileDeviceInfodevice.mobileDeviceMarketingNamedevice.mobileDeviceModeldevice.mobileInputSelectordevice.operatingSystemdevice.operatingSystemVersiondevice.screenColorsdevice.screenResolutiongeoNetwork.citygeoNetwork.cityIdgeoNetwork.continentgeoNetwork.countrygeoNetwork.latitudegeoNetwork.longitudegeoNetwork.metrogeoNetwork.networkDomaingeoNetwork.networkLocationgeoNetwork.regiongeoNetwork.subContinenttotals.bouncestotals.hitstotals.newVisitstotals.pageviewstotals.sessionQualityDimtotals.timeOnSitetotals.totalTransactionRevenuetotals.transactionRevenuetotals.transactionstotals.visitstrafficSource.adContenttrafficSource.adwordsClickInfo.adNetworkTypetrafficSource.adwordsClickInfo.criteriaParameterstrafficSource.adwordsClickInfo.gclIdtrafficSource.adwordsClickInfo.isVideoAdtrafficSource.adwordsClickInfo.pagetrafficSource.adwordsClickInfo.slottrafficSource.campaigntrafficSource.isTrueDirecttrafficSource.keywordtrafficSource.mediumtrafficSource.referralPathtrafficSource.source
0Organic Search[{'index': '4', 'value': 'EMEA'}]201710163162355547410993243[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged150819845011508198450Firefoxnot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetEuropeGermanynot available in demo datasetnot available in demo datasetnot available in demo dataset(not set)not available in demo datasetnot available in demo datasetWestern Europe11111NaNNaNNaNNaN1NaNNaNnot available in demo datasetNaNNaNNaNNaN(not set)NaNwater bottleorganicNaNgoogle
1Referral[{'index': '4', 'value': 'North America'}]201710168934116514970143966[{'hitNumber': '1', 'time': '0', 'hour': '10',...Not Socially Engaged150817630761508176307Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetChrome OSnot available in demo datasetnot available in demo datasetnot available in demo datasetCupertinonot available in demo datasetAmericasUnited Statesnot available in demo datasetnot available in demo datasetSan Francisco-Oakland-San Jose CA(not set)not available in demo datasetCaliforniaNorthern AmericaNaN2NaN2228NaNNaNNaN1NaNNaNnot available in demo datasetNaNNaNNaNNaN(not set)NaNNaNreferral/a/google.com/transportation/mtv-services/bike...sites.google.com
2Direct[{'index': '4', 'value': 'North America'}]201710167992466427990357681[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged150820161311508201613Chromenot available in demo datasetnot available in demo datasetmobilenot available in demo datasetTruenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAndroidnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAmericasUnited Statesnot available in demo datasetnot available in demo datasetnot available in demo datasetwindjammercable.netnot available in demo datasetnot available in demo datasetNorthern AmericaNaN212138NaNNaNNaN1NaNNaNnot available in demo datasetNaNNaNNaNNaN(not set)TrueNaN(none)NaN(direct)
3Organic Search[{'index': '4', 'value': 'EMEA'}]201710169075655783635761930[{'hitNumber': '1', 'time': '0', 'hour': '9', ...Not Socially Engaged150816985111508169851Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAsiaTurkeynot available in demo datasetnot available in demo datasetnot available in demo datasetunknown.unknownnot available in demo datasetnot available in demo datasetWestern AsiaNaN21211NaNNaNNaN1NaNNaNnot available in demo datasetNaNNaNNaNNaN(not set)NaN(not provided)organicNaNgoogle
4Organic Search[{'index': '4', 'value': 'Central America'}]201710166960673291025684308[{'hitNumber': '1', 'time': '0', 'hour': '14',...Not Socially Engaged150819055211508190552Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAmericasMexiconot available in demo datasetnot available in demo datasetnot available in demo datasetprod-infinitum.com.mxnot available in demo datasetnot available in demo datasetCentral AmericaNaN212152NaNNaNNaN1NaNNaNnot available in demo datasetNaNNaNNaNNaN(not set)NaN(not provided)organicNaNgoogle

Adjust NA of ‘trafficSource.adwordsClickInfo.isVideoAd’ to True

Adjust NA of ‘trafficSource.isTrueDirect’ to False

Then set the date format as yyyy-mm-dd

# some data processing
train['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True, inplace=True)
test['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True, inplace=True)
train['trafficSource.isTrueDirect'].fillna(False, inplace=True)
test['trafficSource.isTrueDirect'].fillna(False, inplace=True)

train['date'] = pd.to_datetime(train['date'].apply(lambda x: str(x)[:4] + '-' + str(x)[4:6] + '-' + str(x)[6:]))
test['date'] = pd.to_datetime(test['date'].apply(lambda x: str(x)[:4] + '-' + str(x)[4:6] + '-' + str(x)[6:]))
train.head()
channelGroupingcustomDimensionsdatefullVisitorIdhitssocialEngagementTypevisitIdvisitNumbervisitStartTimedevice.browserdevice.browserSizedevice.browserVersiondevice.deviceCategorydevice.flashVersiondevice.isMobiledevice.languagedevice.mobileDeviceBrandingdevice.mobileDeviceInfodevice.mobileDeviceMarketingNamedevice.mobileDeviceModeldevice.mobileInputSelectordevice.operatingSystemdevice.operatingSystemVersiondevice.screenColorsdevice.screenResolutiongeoNetwork.citygeoNetwork.cityIdgeoNetwork.continentgeoNetwork.countrygeoNetwork.latitudegeoNetwork.longitudegeoNetwork.metrogeoNetwork.networkDomaingeoNetwork.networkLocationgeoNetwork.regiongeoNetwork.subContinenttotals.bouncestotals.hitstotals.newVisitstotals.pageviewstotals.sessionQualityDimtotals.timeOnSitetotals.totalTransactionRevenuetotals.transactionRevenuetotals.transactionstotals.visitstrafficSource.adContenttrafficSource.adwordsClickInfo.adNetworkTypetrafficSource.adwordsClickInfo.criteriaParameterstrafficSource.adwordsClickInfo.gclIdtrafficSource.adwordsClickInfo.isVideoAdtrafficSource.adwordsClickInfo.pagetrafficSource.adwordsClickInfo.slottrafficSource.campaigntrafficSource.isTrueDirecttrafficSource.keywordtrafficSource.mediumtrafficSource.referralPathtrafficSource.source
0Organic Search[{'index': '4', 'value': 'EMEA'}]2017-10-163162355547410993243[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged150819845011508198450Firefoxnot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetEuropeGermanynot available in demo datasetnot available in demo datasetnot available in demo dataset(not set)not available in demo datasetnot available in demo datasetWestern Europe11111NaNNaNNaNNaN1NaNNaNnot available in demo datasetNaNTrueNaNNaN(not set)Falsewater bottleorganicNaNgoogle
1Referral[{'index': '4', 'value': 'North America'}]2017-10-168934116514970143966[{'hitNumber': '1', 'time': '0', 'hour': '10',...Not Socially Engaged150817630761508176307Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetChrome OSnot available in demo datasetnot available in demo datasetnot available in demo datasetCupertinonot available in demo datasetAmericasUnited Statesnot available in demo datasetnot available in demo datasetSan Francisco-Oakland-San Jose CA(not set)not available in demo datasetCaliforniaNorthern AmericaNaN2NaN2228NaNNaNNaN1NaNNaNnot available in demo datasetNaNTrueNaNNaN(not set)FalseNaNreferral/a/google.com/transportation/mtv-services/bike...sites.google.com
2Direct[{'index': '4', 'value': 'North America'}]2017-10-167992466427990357681[{'hitNumber': '1', 'time': '0', 'hour': '17',...Not Socially Engaged150820161311508201613Chromenot available in demo datasetnot available in demo datasetmobilenot available in demo datasetTruenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAndroidnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAmericasUnited Statesnot available in demo datasetnot available in demo datasetnot available in demo datasetwindjammercable.netnot available in demo datasetnot available in demo datasetNorthern AmericaNaN212138NaNNaNNaN1NaNNaNnot available in demo datasetNaNTrueNaNNaN(not set)TrueNaN(none)NaN(direct)
3Organic Search[{'index': '4', 'value': 'EMEA'}]2017-10-169075655783635761930[{'hitNumber': '1', 'time': '0', 'hour': '9', ...Not Socially Engaged150816985111508169851Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAsiaTurkeynot available in demo datasetnot available in demo datasetnot available in demo datasetunknown.unknownnot available in demo datasetnot available in demo datasetWestern AsiaNaN21211NaNNaNNaN1NaNNaNnot available in demo datasetNaNTrueNaNNaN(not set)False(not provided)organicNaNgoogle
4Organic Search[{'index': '4', 'value': 'Central America'}]2017-10-166960673291025684308[{'hitNumber': '1', 'time': '0', 'hour': '14',...Not Socially Engaged150819055211508190552Chromenot available in demo datasetnot available in demo datasetdesktopnot available in demo datasetFalsenot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetWindowsnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetnot available in demo datasetAmericasMexiconot available in demo datasetnot available in demo datasetnot available in demo datasetprod-infinitum.com.mxnot available in demo datasetnot available in demo datasetCentral AmericaNaN212152NaNNaNNaN1NaNNaNnot available in demo datasetNaNTrueNaNNaN(not set)False(not provided)organicNaNgoogle

Data Exploration

At first, drop some useless columns that has only one variables like ‘socialEngagementType’

The number of unique variable in ‘socialEngagementType’ is 1, as below.

train['socialEngagementType'].nunique(dropna=False)
1
cols_to_drop = [col for col in train.columns if train[col].nunique(dropna=False) == 1]
train.drop(cols_to_drop, axis=1, inplace=True)
test.drop([col for col in cols_to_drop if col in test.columns], axis=1, inplace=True)

print(f'Dropped {len(cols_to_drop)} columns.')
Dropped 19 columns.

Change some variables into float data

for col in ['visitNumber', 'totals.hits', 'totals.pageviews', 'totals.transactionRevenue']:
    train[col] = train[col].astype(float)

Drop some features

train.drop(['customDimensions', 'hits', 'trafficSource.referralPath', 'trafficSource.source', 'totals.totalTransactionRevenue'], axis=1, inplace=True)
test.drop(['customDimensions', 'hits', 'trafficSource.referralPath', 'trafficSource.source', 'totals.totalTransactionRevenue'], axis=1, inplace=True)

This data seems moe simple and looks better.

train.head()
channelGroupingdatefullVisitorIdvisitIdvisitNumbervisitStartTimedevice.browserdevice.deviceCategorydevice.isMobiledevice.operatingSystemgeoNetwork.citygeoNetwork.continentgeoNetwork.countrygeoNetwork.metrogeoNetwork.networkDomaingeoNetwork.regiongeoNetwork.subContinenttotals.bouncestotals.hitstotals.newVisitstotals.pageviewstotals.sessionQualityDimtotals.timeOnSitetotals.transactionRevenuetotals.transactionstrafficSource.adContenttrafficSource.adwordsClickInfo.adNetworkTypetrafficSource.adwordsClickInfo.gclIdtrafficSource.adwordsClickInfo.isVideoAdtrafficSource.adwordsClickInfo.pagetrafficSource.adwordsClickInfo.slottrafficSource.campaigntrafficSource.isTrueDirecttrafficSource.keywordtrafficSource.medium
0Organic Search2017-10-16316235554741099324315081984501.01508198450FirefoxdesktopFalseWindowsnot available in demo datasetEuropeGermanynot available in demo dataset(not set)not available in demo datasetWestern Europe11.011.01NaNNaNNaNNaNNaNNaNTrueNaNNaN(not set)Falsewater bottleorganic
1Referral2017-10-16893411651497014396615081763076.01508176307ChromedesktopFalseChrome OSCupertinoAmericasUnited StatesSan Francisco-Oakland-San Jose CA(not set)CaliforniaNorthern AmericaNaN2.0NaN2.0228NaNNaNNaNNaNNaNTrueNaNNaN(not set)FalseNaNreferral
2Direct2017-10-16799246642799035768115082016131.01508201613ChromemobileTrueAndroidnot available in demo datasetAmericasUnited Statesnot available in demo datasetwindjammercable.netnot available in demo datasetNorthern AmericaNaN2.012.0138NaNNaNNaNNaNNaNTrueNaNNaN(not set)TrueNaN(none)
3Organic Search2017-10-16907565578363576193015081698511.01508169851ChromedesktopFalseWindowsnot available in demo datasetAsiaTurkeynot available in demo datasetunknown.unknownnot available in demo datasetWestern AsiaNaN2.012.011NaNNaNNaNNaNNaNTrueNaNNaN(not set)False(not provided)organic
4Organic Search2017-10-16696067329102568430815081905521.01508190552ChromedesktopFalseWindowsnot available in demo datasetAmericasMexiconot available in demo datasetprod-infinitum.com.mxnot available in demo datasetCentral AmericaNaN2.012.0152NaNNaNNaNNaNNaNTrueNaNNaN(not set)False(not provided)organic

Let’s see the revenue of each visitor.

gdf = train.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()
gdf
fullVisitorIdtotals.transactionRevenue
000002454373746753680.0
100005932557970397680.0
200007509293155233530.0
300011917661793926570.0
400036318403341890250.0
500043744018452040550.0
.........
18381999767892094019330.0
18382999809035805811210.0
1838399992500199526217380.0

18384 rows × 2 columns

Make a graph with the table above, sorting with logarithm of revenue. The graph shows very low percent of visitors make almost revenue.

plt.figure(figsize=(8,6))
plt.scatter(range(gdf.shape[0]), np.sort(np.log1p(gdf["totals.transactionRevenue"].values)))
plt.xlabel('index', fontsize=12)
plt.ylabel('TransactionRevenue', fontsize=12)
plt.show()

png

pd.notnull(train["totals.transactionRevenue"])
0        False
1        False
2        False
3        False
         ...  
19997    False
19998    False
19999    False
Name: totals.transactionRevenue, Length: 20000, dtype: bool

After calculating the ratio of real users, it shows only 1.03% of users make revenue through transaction

print("Number of unique customers with non-zero revenue : ", (gdf["totals.transactionRevenue"]>0).sum(), "and the ratio is : ", round(100*(gdf["totals.transactionRevenue"]>0).sum() / gdf.shape[0],2),"%")
Number of unique customers with non-zero revenue :  189 and the ratio is :  1.03 %

Let’s look at how many users visit how many times. It shows about 92% are unique users and only 8% users are revisited.

print("Number of unique visitors in train set : ",train.fullVisitorId.nunique(), " the ratio is : ",round(100*train.fullVisitorId.nunique()/train.shape[0],2),"%")
Number of unique visitors in train set :  18384  the ratio is :  91.92 %

Let’s explore data by device. For example, groupping the train data by “browser” with size, count and mean, then the table is as below.

train.groupby('device.browser')['totals.transactionRevenue'].agg(['size', 'count', 'mean'])
sizecountmean
device.browser
ADM10NaN
Amazon Silk200NaN
Android Browser110NaN
Android Webview3000NaN
BlackBerry40NaN
Chrome139781711.167348e+08
Coc Coc140NaN
Edge26731.348333e+07
Firefox82921.349000e+07
Internet Explorer53014.200000e+07
MRCHROME50NaN
Maxthon20NaN
Mozilla Compatible Agent50NaN
Nintendo Browser20NaN
Nokia Browser20NaN
Opera1340NaN
Opera Mini1910NaN
Playstation Vita Browser10NaN
Puffin30NaN
Safari3340156.022200e+07
Safari (in-app)13911.697000e+07
Samsung Internet1370NaN
SeaMonkey10NaN
UC Browser480NaN
YaBrowser360NaN

Using same logic to browser, device and OS, draw graphs as below.

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

def horizontal_bar_chart(cnt_srs, color):
    trace = go.Bar(
        y=cnt_srs.index[::-1],
        x=cnt_srs.values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace

# Device Browser
cnt_srs = train.groupby('device.browser')['totals.transactionRevenue'].agg(['size', 'count', 'mean'])
cnt_srs.columns = ["count", "count of non-zero revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by="count", ascending=False)
trace1 = horizontal_bar_chart(cnt_srs["count"].head(10), 'rgba(50, 171, 96, 0.6)')
trace2 = horizontal_bar_chart(cnt_srs["count of non-zero revenue"].head(10), 'rgba(50, 171, 96, 0.6)')
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(10), 'rgba(50, 171, 96, 0.6)')

# Device Category
cnt_srs = train.groupby('device.deviceCategory')['totals.transactionRevenue'].agg(['size', 'count', 'mean'])
cnt_srs.columns = ["count", "count of non-zero revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by="count", ascending=False)
trace4 = horizontal_bar_chart(cnt_srs["count"].head(10), 'rgba(71, 58, 131, 0.8)')
trace5 = horizontal_bar_chart(cnt_srs["count of non-zero revenue"].head(10), 'rgba(71, 58, 131, 0.8)')
trace6 = horizontal_bar_chart(cnt_srs["mean"].head(10), 'rgba(71, 58, 131, 0.8)')

# Operating system
cnt_srs = train.groupby('device.operatingSystem')['totals.transactionRevenue'].agg(['size', 'count', 'mean'])
cnt_srs.columns = ["count", "count of non-zero revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by="count", ascending=False)
trace7 = horizontal_bar_chart(cnt_srs["count"].head(10), 'rgba(246, 78, 139, 0.6)')
trace8 = horizontal_bar_chart(cnt_srs["count of non-zero revenue"].head(10),'rgba(246, 78, 139, 0.6)')
trace9 = horizontal_bar_chart(cnt_srs["mean"].head(10),'rgba(246, 78, 139, 0.6)')

# Creating two subplots
fig = tools.make_subplots(rows=3, cols=3, vertical_spacing=0.04, 
                          subplot_titles=["Device Browser - Count", "Device Browser - Non-zero Revenue Count", "Device Browser - Mean Revenue",
                                          "Device Category - Count",  "Device Category - Non-zero Revenue Count", "Device Category - Mean Revenue", 
                                          "Device OS - Count", "Device OS - Non-zero Revenue Count", "Device OS - Mean Revenue"])

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)
fig.append_trace(trace7, 3, 1)
fig.append_trace(trace8, 3, 2)
fig.append_trace(trace9, 3, 3)

fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title="Device Plots")
py.iplot(fig, filename='device-plots')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]
[ (2,1) x4,y4 ]  [ (2,2) x5,y5 ]  [ (2,3) x6,y6 ]
[ (3,1) x7,y7 ]  [ (3,2) x8,y8 ]  [ (3,3) x9,y9 ]

png

Below graphs show the continent dependancy.

# Continent
cnt_srs = train.groupby('geoNetwork.continent')['totals.transactionRevenue'].agg(['size', 'count', 'mean'])
cnt_srs.columns = ["count", "count of non-zero revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by="count", ascending=False)
trace1 = horizontal_bar_chart(cnt_srs["count"].head(10), 'rgba(58, 71, 80, 0.6)')
trace2 = horizontal_bar_chart(cnt_srs["count of non-zero revenue"].head(10), 'rgba(58, 71, 80, 0.6)')
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(10), 'rgba(58, 71, 80, 0.6)')

# Creating subplots
fig = tools.make_subplots(rows=1, cols=3, vertical_spacing=0.08, horizontal_spacing=0.15, 
                          subplot_titles=["Continent - Count", "Continent - Non-zero Revenue Count", "Continent - Mean Revenue"])

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)

fig['layout'].update(height=500, width=1200, paper_bgcolor='rgb(233,233,233)', title="Geography Plots")
py.iplot(fig, filename='geo-plots')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]

png

Until now is the part of exploratory data analysis for GA revenue prediction, which I learned from SRK and Andrew Lukyanenko ‘s kernel