MTArt

Exploratory Data Analysis & Visualization- Final project

Julien Maudet & Franck Ngamkan

Which NYC subway station should an emerging artist eager to share his work go to?



Motivation

First look at this video of Oriel Ceballos selling art in the subway:

and this photo:

Context and Inspiration

We had the idea of conducting such a project, that may sound a bit odd in the beginning, when we met Oriel Ceballos at an art show in Harlem earlier this year. After a successful career as a professor, he decided to take an early exit and started a life as a full time artist, collector and curator. More info on him can be found on his Instagram page: https://www.instagram.com/or1el/?hl=fr

In order to broaden his audience, engage with people and sell his artworks, Oriel regularly - several days a week - goes to a subway station in either Manhattan or Brooklyn, displays his artworks and paints live.

What about his station selection process? He just tries stations with traffic and where there is enough space to display the pieces. However, it rang a bell in our data science-sensitized ears.

The next step for us is to gather data about subway stations that are relevant to our use case and try to come out with a way for artists to optimally select the subway station that best suits their requirements!!

How to make it into a Data visualization use case?

In order to go from our envy and inspiration to a data visualization task, we needed to gather datasets. But before gathering datasets, we needed to know what kind of data we were looking for. In particular, what features of subway stations were relevant to our analysis.

Here are our hypothesis on the features that matter, and that are not too complicated to access:

Is there a lot of traffic in the station?
How easy and convenient is the access to the station?
Are the people commuting here interested in Art? 
Are they wealthy?

We are not stating that the best station is the station with most traffic, in a very arty place, and with very rich people. The point here is to be able to discuss those variables in order to find the best match between an artist and a subway station.

Data Sources

We got our data from different sources: MTA turnstile data, NYC Open Data platform, and by crunching some information manually.

Open Data

Traffic data

In this task, we started from the great work by Henri Dwyer, that can be found here: https://henri.io/posts/new-york-subway-traffic-data-part-1.html. The original data is here: http://web.mta.info/developers/turnstile.html

In its final format, for each subway station, it includes the mean daily traffic, as well as the daily traffic for 6 consecutive days in April 2017.

Art galleries

In order to link a subway station with an appeal for art, we decided to count the number of art galleries in a radius of 0.2 miles around the subway station. This would be a great indicator of the artiness of the zone.

We found the data to do so on NYC OpenData: https://data.cityofnewyork.us/Recreation/New-York-City-Art-Galleries/tgyc-r5jh/data

The dataset includes all art galleries in New York, many information on the galleries such as name, telephone.. and the GPS coordinates.

Details of subway stations

For each station, we needed the GPS coordinates, to link them to the art galleries and the neighborhood, the name of the station and the type of Entrance.

We found these information on NYC OpenData: https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49/data

Manual Data collection

NYC Neighborhoods : Median income & coordinates

In order to have insights on the wealth of commuters at each station, one indicator is the median income in the neighborhood of the station. We found this information on this website: http://statisticalatlas.com/county-subdivision/New-York/New-York-County/Manhattan/Household-Income#figure/neighborhood and scrapped manually.

We then collected the GPS coordinates of the centroid of each neighborhood using Google Maps, in order to link each station to the neighborhood of the centroid it is closest to.

Data Preprocessing

Data Preparation

We've had to go through the following preprocessing steps:

Prepare all GPS coordinates to the same format
Transform the traffic data in a usable format. From turnstile events to daily traffic, per station
Format all Subway station names - Entity recognition problem - Mapping between datasets
Deduplicate the subway stations, in the case where there are different entrances and entrance types

These steps are not in the following report as they don't include visualization but are available on the github repository.

Below is a snapshot of the different datasets in their preprocessed format, before merging.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import StandardScaler
import math
import operator
from geopy.distance import vincenty
import distance
import json
import colorlover as cl
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Income - Neighborhood

In [3]:
nei = pd.read_csv('data/neighborhoods.csv', sep=';')
nei['X'] = nei['X'].apply(lambda x: "-{:.6f}".format(x))
nei['Y'] = nei['Y'].apply(lambda x: "{:.6f}".format(x))

meancome = np.mean(nei['Median Income'])
nei.index = nei['Neighborhood']
nei.head()
Out[3]:
Neighborhood Median Income Y X
Neighborhood
Tribeca Tribeca 170.5 40.718649 -74.008769
Battery Park Battery Park 153.5 40.705372 -74.014410
Carnegie Hill Carnegie Hill 150.5 40.783498 -73.955369
N Sutton Area N Sutton Area 136.3 40.757554 -73.962410
Financial Dist Financial Dist 121.1 40.710181 -74.010209

Art Galleries

In [4]:
gal = pd.read_csv('data/galleries_untouched.csv', sep=',')
In [5]:
gal['Y'] = gal['the_geom'].apply(lambda x: x.split(' ')[2].strip(')')[:9])
gal['X'] = gal['the_geom'].apply(lambda x: x.split(' ')[1].strip('(')[:10])
del gal['the_geom']
del gal['ADDRESS2']
gal = gal[['NAME','Y','X','TEL','URL','ADDRESS1','CITY','ZIP']]
In [6]:
gal.head()
Out[6]:
NAME Y X TEL URL ADDRESS1 CITY ZIP
0 O'reilly William & Co Ltd 40.773800 -73.962730 (212) 396-1822 http://www.nyc.com/arts__attractions/oreilly_w... 52 E 76th St New York 10021
1 Organization of Independent Artists - Gallery 402 40.716468 -74.009385 (212) 219-9213 http://www.nonprofitgallery.com/main/usa/ny/oi... 19 Hudson St. New York 10013
2 Owen Gallery 40.774000 -73.964351 (212) 879-2415 http://www.owengallery.com/about-us 19 E 75th St New York 10021
3 P P O W Gallerie 40.749585 -74.003892 (212) 647-1044 http://www.ppowgallery.com/ 511 W 25th St New York 10001
4 P P O W Inc 40.722907 -74.001763 (212) 941-8642 http://www.nyc.com/arts__attractions/p_p_o_w_i... 476 Broome St New York 10013

Subway Stations

In [7]:
sta = pd.read_csv('data/stations_coord_entrance.csv', sep=';')
In [8]:
types={'Stair':0,'Door':1,'Walkway':2,'Ramp': 3, 'Easement':4, 'Escalator':5, 'Elevator': 6}
sta['Entrance'] = sta['Entrance'].map(types)
types_rev={'0':'Stair','1':'Door','2':'Walkway','3':'Ramp','4':'Easement','5':'Escalator','6':'Elevator'}
types_col={'Stair':'r','Door':'g','Walkway':'b','Ramp': 'y', 'Easement':'b', 'Escalator':'r', 'Elevator': 'g'}

ent = {nam:[] for nam in sta['Name']}
for k in range(len(sta)):
    ent[sta['Name'][k]].append(sta['Entrance'][k])
ent = {sta: types_rev[str(max(ent[sta]))] for sta in ent.keys()}

del sta['Entrance']
sta = sta.groupby('Name').mean()
sta['Name'] = sta.index
sta.index = range(len(sta))

sta['Entrance'] = sta['Name'].apply(lambda x: ent[x])
sta['X'] = sta['X'].apply(lambda x: "{:.6f}".format(x))
sta['Y'] = sta['Y'].apply(lambda x: "{:.6f}".format(x))

sta = sta[['Name','Y','X','Entrance']]
sta.head()
Out[8]:
Name Y X Entrance
0 103rd St 40.784001 -73.935003 Stair
1 104th St-102nd St 40.695178 -73.844330 Stair
2 104th St-Oxford Av 40.681711 -73.837683 Stair
3 110th St 40.795020 -73.944250 Stair
4 110th St-Central Park North 40.799075 -73.951822 Stair

Subway Traffic

For each station, we have the traffic (sum of entries and exits per day) for 6 consecutive days:

April 8th 2017
April 9th 2017
April 10th 2017
April 11th 2017
April 12th 2017
April 13th 2017
In [9]:
sta_traffic = pd.read_csv('data/station_traffic.csv')
In [10]:
#Map those station names to the ones in the DataFrame sta, that has all information on stations
sta_traffic['Name'] = sta_traffic['Name'].apply(lambda x:str.lower(x))
In [11]:
with open('data/map_station_names.json','r') as f:
    map_names = json.load(f)
map_names_rev = {str(bad):str(good) for good,bad in map_names.iteritems()}
In [12]:
map_names_normal = {orig_name: orig_name.lower() for orig_name in sta['Name']}
map_names_normal_rev = {low:orig for orig, low in map_names_normal.iteritems()}

map_final = {bad: map_names_normal_rev[good] for bad, good in map_names_rev.iteritems()}
sta_traffic['Name'] = sta_traffic['Name'].map(map_final)
In [13]:
sta_traffic = sta_traffic.dropna()
sta_traffic['traffic_mean'] = (sta_traffic['traffic_april8']+sta_traffic['traffic_april9']+sta_traffic['traffic_april10']+sta_traffic['traffic_april11']+sta_traffic['traffic_april12']+sta_traffic['traffic_april13'])/6
In [14]:
sta_traffic.head()
Out[14]:
Name traffic_april8 traffic_april9 traffic_april10 traffic_april11 traffic_april12 traffic_april13 traffic_mean
0 Cypress Av 3910.75 3012.75 5537.00 5536.5000 5432.00 5703.00 4855.333333
1 5th Av-53rd St 12825.00 10323.00 51630.50 53881.5000 55479.00 56423.50 40093.750000
3 Hunts Point Av 10648.25 8327.25 17013.50 17663.0734 17202.00 17638.25 14748.720567
4 Sutter Av 5609.50 4687.25 7360.75 7524.5000 7049.00 7471.50 6617.083333
5 7th Av 14372.00 12064.75 17708.75 18591.2500 19231.75 19542.75 16918.541667

Merging the different datasets

The final dataset is a dataset where for each subway station, we have all required information:

Name
Number of art galleries
Median Income
Traffic
Entrance Type

Basically, we started with the dataset where each subway station is described and went through the following steps:

Compute the number of art galleries within 0.2 miles of the station, using GPS Coordinates of the galleries

Assign a neighborhood and a median income, using GPS Coordinates of the neighborhood centroids

Join the obtained dataset with the dataset containing the traffic data, using a mapping between two different formats for the names of the stations, that we computed using Stemming and Levenstein Distance 

Note that some stations don't have a neighborhood as we focused on neighborhoods near Manhattan. We only kept those stations, afterwards. We also filtered out the stations that have less than two galleries around, in order to make the visualizations more readable and because those stations are not very interesting, based on our assumptions above.

In [15]:
#This function computes the distance, in miles, between two GPS points.
def dist(x1, x2, y1, y2):
    dist = vincenty((x1, y1), (x2, y2)).miles
    return dist
In [16]:
sta_nbgal={sta_name: 0 for sta_name in sta['Name']}
sta_neighborhood={sta_name: '' for sta_name in sta['Name']}
In [17]:
for k in range(len(sta)):
    x1 = sta['X'][k]
    y1 = sta['Y'][k]
    station = sta['Name'][k]
    for j in range(len(gal)):
        dz = dist(x1, gal['X'][j], y1, gal['Y'][j])
        if dz < 0.2:
            sta_nbgal[station] += 1
    min_dz = 999
    for l in range(len(nei)):
        dz = dist(x1, nei['X'][l], y1, nei['Y'][l])
        if dz < min_dz and dz < 4:
            min_dz = dz
            sta_neighborhood[station]=nei['Neighborhood'][l]
In [18]:
sta['Nb_gal'] = sta['Name'].map(sta_nbgal)
sta['Neighborhood'] = sta['Name'].map(sta_neighborhood)

Below is a snapshot of the dataset of the Subway stations, before merging the traffic data, but after computing the number of galleries and joining with the neighborhood dataset

In [19]:
sta = sta.join(nei, on='Neighborhood', how='left', lsuffix='', rsuffix='_nei')
sta = sta[['Name','Y','X','Entrance','Nb_gal','Neighborhood','Median Income']]
sta.head()
Out[19]:
Name Y X Entrance Nb_gal Neighborhood Median Income
0 103rd St 40.784001 -73.935003 Stair 0 East Harlem 28.5
1 104th St-102nd St 40.695178 -73.844330 Stair 0 NaN
2 104th St-Oxford Av 40.681711 -73.837683 Stair 0 NaN
3 110th St 40.795020 -73.944250 Stair 1 East Harlem 28.5
4 110th St-Central Park North 40.799075 -73.951822 Stair 1 Harlem 38.8
In [20]:
sta_traffic.index = sta_traffic['Name']
In [21]:
sta = sta.join(sta_traffic, on='Name', how='left', lsuffix='', rsuffix='_nei')
In [22]:
sta = sta.dropna()
del sta['Name_nei']
In [23]:
sta.index = sta['Name']
sta = sta[sta['Nb_gal']>2]

Final dataset, ready for the visualization tasks!

Here is a snapshot of the final dataset

In [24]:
sta.head()
Out[24]:
Name Y X Entrance Nb_gal Neighborhood Median Income traffic_april8 traffic_april9 traffic_april10 traffic_april11 traffic_april12 traffic_april13 traffic_mean
Name
14th St 14th St 40.739460 -73.999947 Easement 12 Chelsea 101.6 29834.50 24008.75 33009.50 35592.50 34184.000000 35659.714056 32048.160676
14th St-Union Square 14th St-Union Square 40.734673 -73.989951 Easement 23 Gramercy 100.1 48145.00 39143.50 55874.50 58529.00 58056.943045 57881.766546 52938.451599
157th St 157th St 40.834041 -73.944890 Stair 3 Hamilton Hts 35.9 13347.00 12104.00 17161.25 17529.75 17269.750000 17751.750000 15860.583333
18th St 18th St 40.741040 -73.997871 Stair 21 Chelsea 101.6 13178.75 11724.25 18181.00 18410.75 18407.750000 19273.000000 16529.250000
23rd St 23rd St 40.742316 -73.991510 Easement 31 Garment Dist 118.9 3044.50 2382.00 4469.75 4592.50 4524.250000 4706.000000 3953.166667

Exploratory Data Analysis

Interactive plots using plotly, please play with it!

In [25]:
import plotly.plotly as py
import cufflinks as cf
import plotly.graph_objs as go
from plotly.graph_objs import *
In [26]:
import plotly
plotly.offline.init_notebook_mode()
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

First set of simple plots, in order to understand the different variables

Type of Entrance
Density in Art Galleries
Median Income of the neighborhood
Traffic

Type of Entrance

In [27]:
data = [go.Bar(
            x=sta['Entrance'].value_counts().index,
            y=list(sta['Entrance'].value_counts()),
            marker=dict(color='rgb(62,57,193)')
    )]
layout = go.Layout(
    title="Type of Entrance - Frequency"
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Most stations have stairs, which can be a problem if the artworks are very heavy or large for instance. An artist may want to select a station that has an elevator.

Density in Art Galleries

In [28]:
data = [go.Bar(
            y=sta['Nb_gal'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['Nb_gal'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['galleries around<br><b>'+sta['Neighborhood'][sta['Nb_gal'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=170,
        r=30,
        b=100,
        t=100,
        pad=4
    ),
    title="Number of galleries around each Subway Station",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Number of art galeries within 0.2 miles'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

We find reassuring insights: 68th St-Hunter College, Lexington Av for instance are in the Upper East Side, in the Museum area - MET, Guggenheim... - where the art gallery density is indeed very high.

Spring St, Canal St, Prince St are in very arty areas downtown Manhattan, it is thus a good thing to find them at the top of our ranking.

Mean Traffic per station

In [29]:
data = [go.Bar(
            y=sta['traffic_mean'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['traffic_mean'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['<b>'+sta['Neighborhood'][sta['traffic_mean'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=900,
    height=700,
    margin=go.Margin(
        l=210,
        r=0,
        b=100,
        t=100,
        pad=4
    ),
    title="Mean daily traffic for each Subway Station",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean daily traffic'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

There is not much surprise on the Traffic either. Grand Central, 34th St and 42nd St are known to be the main train stations in Manhattan, with a massive daily traffic. We still see that the slope is pretty high at the top of the ranking, meaning that there are a few stations with a massive traffic and a lot of stations with a more homogeneous traffic, around 50k people per day.

Median Income per station

In [30]:
data = [go.Bar(
            y=sta['Median Income'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['Median Income'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['<b>'+sta['Neighborhood'][sta['Median Income'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=900,
    height=700,
    margin=go.Margin(
        l=210,
        r=0,
        b=100,
        t=100,
        pad=4
    ),
    title="Median Income around each Subway Station (k$)",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Median Income'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

The horizontal bars here are obviously grouped by neighborhood. This is due to the way we computed the median income for each station, as we assigned the income of its neighborhood to each station.

We find the stations in the richest neighborhoods on top (Chambers St in Tribeca, Whitehall St in Battery Park, Lexington Av in N. Sutton Area...)

Combining the variables

In the following charts, we combine the different variables, in order to gain insights on the subway stations to pick.

Number of art galleries vs Traffic

In this first scatter plot,

dot: a subway station
y coordinate: number of art galleries around
x coordinate: mean traffic
In [31]:
data = [go.Scatter(
        x = sta['traffic_mean'], 
        y = sta['Nb_gal'], 
        text = sta['Name'], 
        mode = 'markers', 
        name = 'Subway station',
        marker = dict(
            size = 13,
            color = 'rgb(62,57,193)'
        )
    )]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)


# IPython notebook
# py.iplot(fig, filename='pandas/multiple-scatter')

Based on this plot, we can combine our observations on the traffic and the density of art galleries. Typically the subway stations are polarized along the axis. There are no stations with a massive traffic as well as a large number of galleries. There is mostly a trade off between an arty station and a station with a lot of traffic, except for a few stations such as Canal St or West 4. We will go deeper in the analysis in the charts below.

However, we have absolute values for both features, which is not optimal for our task of selecting the subway station that would match with an artist, based on their criteria.

By scaling the variables Number of art galleries and Traffic (retrieveing the mean and dividing by the standard deviation), we will be able to see what stations are more arty than the majority, and which ones have more traffic than the mean!

Number of art galleries vs Traffic - Scaled

In the following plot, we have scaled the data. As a consequence of that, the dot in the upper right part of the graph have more traffic and more galleries than the majority of stations, those in the upper left part have more galleries but less traffic, those in the lower left part have less galleries and less traffic and those in the lower right part have more traffic but less galleries.

These four groups have been assigned different colors

In [32]:
sta['traffic_scaled'] = sta['traffic_mean']
sta['traffic_scaled'] = sta['traffic_scaled'].apply(lambda x: (x-np.mean(sta['traffic_mean']))/np.std(sta['traffic_mean']))

sta['nbgal_scaled'] = sta['Nb_gal']
sta['nbgal_scaled'] = sta['nbgal_scaled'].apply(lambda x: (x-np.mean(sta['Nb_gal']))/np.std(sta['Nb_gal']))
In [33]:
sta['quad'] = 0
for k in range(len(sta['quad'])):
    if sta['traffic_scaled'][k]<0:
        if sta['nbgal_scaled'][k]<0:
            sta['quad'][k]=1
        else:
            sta['quad'][k]=2
    else:
        if sta['nbgal_scaled'][k]<0:
            sta['quad'][k]=3
        else:
            sta['quad'][k]=4
In [34]:
data = [go.Scatter(
        x = sta[sta['quad']==1]['traffic_scaled'],y = sta[sta['quad']==1]['nbgal_scaled'],
        text = sta[sta['quad']==1]['Name'],mode = 'markers', 
        name = 'Low art / Low traffic',marker = dict(size = 13,color = 'rgb(255,215,0)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==2]['traffic_scaled'],y = sta[sta['quad']==2]['nbgal_scaled'],
        text = sta[sta['quad']==2]['Name'],mode = 'markers', 
        name = 'High art / Low traffic',marker = dict(size = 13,color = 'rgb(34,139,34)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==3]['traffic_scaled'],y = sta[sta['quad']==3]['nbgal_scaled'],
        text = sta[sta['quad']==3]['Name'],mode = 'markers', 
        name = 'Low art / High traffic',marker = dict(size = 13,color = 'rgb(240,128,128)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==4]['traffic_scaled'],y = sta[sta['quad']==4]['nbgal_scaled'],
        text = sta[sta['quad']==4]['Name'],mode = 'markers', 
        name = 'High art / High traffic',marker = dict(size = 13,color = 'rgb(100,149,237)',opacity=0.9)),
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

As described above and in the legend of the plot, we have defined four groups of art galleries.

Below is an example of how to read this plot:

An artist who doesn't want to engage with arty people but wants to reach the larger audience as possible would probably pick a station in the red group, such as 42nd St.

An artist who wants a lot of traffic as well as engaging with arty people would try a station in the blue group, such as West 4 or Canal St.

An artist who wants to be in a rather calm station - low traffic - but with arty people may want to go to Spring St or Lexington Av!

And an artist who wants neither traffic nor arty commuters may pick a station in the yellow group!

Yet, this plot doesn't talk about the mean income of people living near the station, and thus likely to commute through the station. We will add this feature in the next plot.

Number of art galleries vs Traffic vs Income

In this plot, the size of the dot is correlated to the median income of the neighborhood the station is located in.

In [35]:
def bin_income(x):
    if x<50:
        return 11
    elif 50<x<100:
        return 15
    elif 100<x<150:
        return 20
    else:
        return 27
In [36]:
sta['income_binned'] = sta['Median Income'] 
sta['income_binned'] = sta['income_binned'].apply(lambda x: bin_income(x))
In [37]:
data = [go.Scatter(
        x = sta[sta['quad']==1]['traffic_scaled'],y = sta[sta['quad']==1]['nbgal_scaled'],
        text = sta[sta['quad']==1]['Name'],mode = 'markers',name = 'Low art / Low traffic',
        marker = dict(size = sta[sta['quad']==1]['income_binned'],color = 'rgb(255,215,0)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==2]['traffic_scaled'],y = sta[sta['quad']==2]['nbgal_scaled'],
        text = sta[sta['quad']==2]['Name'],mode = 'markers',name = 'High art / Low traffic',
        marker = dict(size = sta[sta['quad']==2]['income_binned'],color = 'rgb(34,139,34)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==3]['traffic_scaled'],y = sta[sta['quad']==3]['nbgal_scaled'],
        text = sta[sta['quad']==3]['Name'],mode = 'markers',name = 'Low art / High traffic',
        marker = dict(size = sta[sta['quad']==3]['income_binned'],color = 'rgb(240,128,128)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==4]['traffic_scaled'],y = sta[sta['quad']==4]['nbgal_scaled'],
        text = sta[sta['quad']==4]['Name'],mode = 'markers',name = 'High art / High traffic',
        marker = dict(size = sta[sta['quad']==4]['income_binned'],color = 'rgb(100,149,237)',opacity=0.9)),
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

We can now complete our analysis!

Let's say the artist really wants to sell his artworks and not only display it.

If he finds himself in the blue group, he may prefer West 4th over Canal St.

If he chose the red group, he may stay away from Bedford Ave and go to 59th Columbus Circle or 42nd st.

If the green group took his preference, Lexington Av would be a better option than Prince St for instance!

Number of art galleries vs Traffic vs Entrance type

We replace the feature 'Income' by the feature 'Entrance' to maintain a great readability of the graph. As there are 5 types of Entrance, we assign a color to each type of Entrance, as descirbed in the legend.

In [38]:
data = [go.Scatter(
        x = sta[sta['Entrance']=='Stair']['traffic_scaled'],y = sta[sta['Entrance']=='Stair']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Stair']['Name'],mode = 'markers',name = 'Stair',
        marker = dict(size = 15,color = 'rgb(255,192,203)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Door']['traffic_scaled'],y = sta[sta['Entrance']=='Door']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Door']['Name'],mode = 'markers',name = 'Door',
        marker = dict(size = 15,color = 'rgb(0,0,205)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Easement']['traffic_scaled'],y = sta[sta['Entrance']=='Easement']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Easement']['Name'],mode = 'markers',name = 'Easement',
        marker = dict(size = 15,color = 'rgb(138,43,226)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Escalator']['traffic_scaled'],y = sta[sta['Entrance']=='Escalator']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Escalator']['Name'],mode = 'markers',name = 'Escalator',
        marker = dict(size = 15,color = 'rgb(139,0,139)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Elevator']['traffic_scaled'],y = sta[sta['Entrance']=='Elevator']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Elevator']['Name'],mode = 'markers',name = 'Elevator',
        marker = dict(size = 15,color = 'rgb(255,20,147)',opacity=0.9))
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations - Entrance Type",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

If the artist says "I only want to showcase my art, not sell it! But it is very heavy and I have broken my ankle, I may need an elevator!"

This artist should look for the Light Pink dots, in the upper left corner, as he doesn't particularly want to sell.

We would recommend him to try Lexington Av or 34 St Hudson Yards!

What next?

In the nearest future, build a recommandation algorithm, that would recommand a subway station to an artist, based on its preferences. We would then add this recommandation system to this web app :)