MTArt¶

Exploratory Data Analysis & Visualization - Final project¶

Julien Maudet & Franck Ngamkan

Link to the source code https://github.com/julmaud/MTArt ¶

Motivation¶

First, look at this video of Oriel Ceballos selling art in the subway:

A video of my #subway #gallery. FYI: Come see my Poetic Justice Exhibit this Monday, April 17th from 7pm-10pm @daddygreenspizzabk 352 Malcolm X BLVD, Brooklyn, NY 11233. I am the curator for this amazing exhibition, which will feature talented artists, musicians, and poets. Hope to see you there. #or1el#artist#nyc#nycsubway#nycartist#nycblogger#nycart#nyc❤️#nyctattoo#nyclife#nycphotography#subwayart#subwaysketch#subwaypeople#artwork#artfido#art#artlife#artstudio

A post shared by Oriel Ceballos (@or1el) on April 15, 2017 at 10:32 PDT

and this photo:

My original #GetOut 5"x7 painting was pricey, but after watching the movie, this young queen was compelled to buy it. I was happy to learn that it's a birthday gift for her cousin who loves art. We've got good people out here! I'm glad we met. Follow @or1el for more #dopeness and to attend my forthcoming Poetic Justice Exhibit on April 17th from 7pm-10pm @daddygreenspizzabk. 🎨👑✈️️

A post shared by Oriel Ceballos (@or1el) on April 7, 2017 at 7:36 PDT

Context and Inspiration¶

The idea for this project, which may sound a bit odd at first, came when we met Oriel Ceballos at an art show in Harlem earlier this year. After a successful career as a professor, he decided to take an early exit and started a new life as a full-time artist, collector and curator. More information about him can be found on his Instagram page: https://www.instagram.com/or1el/?hl=fr

In order to broaden his audience, engage with people and sell his artworks, Oriel regularly - several days a week - goes to a subway station in either Manhattan or Brooklyn, displays his artworks and paints live.

How does he select a station? He simply tries stations that have traffic and enough space to display the pieces. To our data-science-trained ears, that sounded like a problem worth investigating.

The next step for us is to gather data about subway stations that is relevant to our use case, and to come up with a way for artists to select the subway station that best suits their requirements.

How to make it into a Data visualization use case?¶

To go from this inspiration to a data visualization task, we needed to gather datasets. Before gathering them, we needed to know what kind of data we were looking for, and in particular which features of subway stations were relevant to our analysis.

Here are our hypotheses about the features that matter and that are not too complicated to access:

Is there a lot of traffic in the station?
How easy and convenient is the access to the station?
Are the people commuting here interested in Art? 
Are they wealthy?

We are not stating that the best station is the station with most traffic, in a very arty place, and with very rich people. The point here is to be able to discuss those variables in order to find the best match between an artist and a subway station.

Data Sources¶

We got our data from different sources: MTA turnstile data, the NYC Open Data platform, and some information collected manually.

Open Data¶

Traffic data¶

For this task, we started from the great work by Henri Dwyer, which can be found here: https://henri.io/posts/new-york-subway-traffic-data-part-1.html. The original data is here: http://web.mta.info/developers/turnstile.html

In its final format, for each subway station, it includes the mean daily traffic, as well as the daily traffic for 6 consecutive days in April 2017.

Art galleries¶

In order to link a subway station with an appeal for art, we decided to count the number of art galleries within a radius of 0.2 miles around the subway station. We use this count as an indicator of the artiness of the area.
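The radius count described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: it uses the haversine great-circle formula instead of geopy, and the helper names and coordinates are made up for the example.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; inputs in decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def count_within(lat, lon, gal_lats, gal_lons, radius_miles=0.2):
    """Number of galleries strictly within radius_miles of the point (lat, lon)."""
    d = haversine_miles(lat, lon, np.asarray(gal_lats), np.asarray(gal_lons))
    return int((d < radius_miles).sum())

# Made-up coordinates: two galleries close to the station, one far away.
station_lat, station_lon = 40.7400, -74.0000
gal_lats = [40.7410, 40.7395, 40.7600]
gal_lons = [-74.0005, -73.9990, -73.9800]
n_close = count_within(station_lat, station_lon, gal_lats, gal_lons)
```

Computing all distances for one station in a single vectorized call avoids a Python loop over galleries, which matters once every station is compared against every gallery.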

We found the data to do so on NYC OpenData: https://data.cityofnewyork.us/Recreation/New-York-City-Art-Galleries/tgyc-r5jh/data

The dataset includes all art galleries in New York, with information on each gallery such as its name, telephone number and GPS coordinates.

Details of subway stations¶

For each station, we needed the name of the station, the type of entrance, and the GPS coordinates, which we use to link stations to art galleries and neighborhoods.

We found this information on NYC OpenData: https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49/data

Manual Data collection¶

NYC Neighborhoods : Median income & coordinates¶

To gain insight into the wealth of commuters at each station, we use the median income in the neighborhood of the station as an indicator. We found this information on this website: http://statisticalatlas.com/county-subdivision/New-York/New-York-County/Manhattan/Household-Income#figure/neighborhood and scraped it manually.

We then collected the GPS coordinates of the centroid of each neighborhood using Google Maps, in order to link each station to the neighborhood of the centroid it is closest to.
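The nearest-centroid assignment can be sketched as below. This is an illustrative version, not the notebook's code: it swaps in an equirectangular (flat-earth) approximation, which is reasonable at city scale, and the function names are hypothetical. The two centroids are taken from the neighborhood table, and the 4-mile cutoff mirrors the one used later in the merging code.

```python
import math

def approx_miles(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in miles (fine at city scale)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 3958.8 * math.hypot(dx, dy)

def nearest_neighborhood(lat, lon, centroids, max_miles=4.0):
    """Name of the closest centroid, or None if no centroid is within max_miles."""
    best_name, best_dist = None, max_miles
    for name, (c_lat, c_lon) in centroids.items():
        d = approx_miles(lat, lon, c_lat, c_lon)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

centroids = {
    'Tribeca': (40.718649, -74.008769),
    'Carnegie Hill': (40.783498, -73.955369),
}
downtown = nearest_neighborhood(40.7200, -74.0100, centroids)  # made-up point near Tribeca
far_away = nearest_neighborhood(41.5000, -74.0000, centroids)  # made-up point far from both
```

Returning None beyond the cutoff keeps stations that are far from every listed neighborhood from being assigned an unrelated median income.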

Data Preprocessing¶

Data Preparation¶

We had to go through the following preprocessing steps:

Convert all GPS coordinates to the same format
Transform the traffic data into a usable format: from turnstile events to daily traffic per station
Format all subway station names - entity resolution problem - mapping between datasets
Deduplicate the subway stations, in the case where there are different entrances and entrance types

These steps are not included in this report, as they don't involve visualization, but they are available in the GitHub repository.
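As an illustration of the second step, here is a minimal sketch of turning turnstile readings into daily traffic. It assumes that each turnstile reports cumulative entry and exit counters, and it uses made-up data and hypothetical column names; the actual preprocessing in the repository may differ.

```python
import pandas as pd

# Toy turnstile log: cumulative counters, one reading per turnstile and day.
events = pd.DataFrame({
    'station':   ['A', 'A', 'A', 'A', 'A', 'A'],
    'turnstile': ['t1', 't1', 't1', 't2', 't2', 't2'],
    'date':      ['2017-04-08', '2017-04-09', '2017-04-10'] * 2,
    'entries':   [1000, 1300, 1700, 500, 550, 650],
    'exits':     [800, 1000, 1250, 400, 430, 500],
})

events = events.sort_values(['station', 'turnstile', 'date'])
# The difference between consecutive readings of one turnstile is the
# traffic over that interval (attributed here to the later reading's date).
deltas = events.groupby(['station', 'turnstile'])[['entries', 'exits']].diff()
# A counter reset would appear as a negative difference; mask those out.
deltas = deltas.where(deltas >= 0)
events['traffic'] = deltas['entries'] + deltas['exits']

# Sum entries + exits over all turnstiles of a station, per day.
daily = (events.dropna(subset=['traffic'])
               .groupby(['station', 'date'])['traffic'].sum())
```

The first reading of each turnstile has no previous value to compare with, so it yields no traffic figure and is dropped before aggregating.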

Below is a snapshot of the different datasets in their preprocessed format, before merging.

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import StandardScaler
import math
import operator
from geopy.distance import vincenty
import distance
import json
import colorlover as cl
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Income - Neighborhood¶

nei = pd.read_csv('data/neighborhoods.csv', sep=';')
nei['X'] = nei['X'].apply(lambda x: "-{:.6f}".format(x))
nei['Y'] = nei['Y'].apply(lambda x: "{:.6f}".format(x))

meancome = np.mean(nei['Median Income'])
nei.index = nei['Neighborhood']
nei.head()

Art Galleries¶

gal = pd.read_csv('data/galleries_untouched.csv', sep=',')

gal['Y'] = gal['the_geom'].apply(lambda x: x.split(' ')[2].strip(')')[:9])
gal['X'] = gal['the_geom'].apply(lambda x: x.split(' ')[1].strip('(')[:10])
del gal['the_geom']
del gal['ADDRESS2']
gal = gal[['NAME','Y','X','TEL','URL','ADDRESS1','CITY','ZIP']]

gal.head()

Subway Stations¶

sta = pd.read_csv('data/stations_coord_entrance.csv', sep=';')

# Encode entrance types as integers; a higher code is treated as easier access
types={'Stair':0,'Door':1,'Walkway':2,'Ramp': 3, 'Easement':4, 'Escalator':5, 'Elevator': 6}
sta['Entrance'] = sta['Entrance'].map(types)
types_rev={'0':'Stair','1':'Door','2':'Walkway','3':'Ramp','4':'Easement','5':'Escalator','6':'Elevator'}
types_col={'Stair':'r','Door':'g','Walkway':'b','Ramp': 'y', 'Easement':'b', 'Escalator':'r', 'Elevator': 'g'}

# Collect the entrance codes of each station, then keep the highest one per station
ent = {nam:[] for nam in sta['Name']}
for k in range(len(sta)):
    ent[sta['Name'][k]].append(sta['Entrance'][k])
ent = {name: types_rev[str(max(ent[name]))] for name in ent.keys()}

del sta['Entrance']
sta = sta.groupby('Name').mean()
sta['Name'] = sta.index
sta.index = range(len(sta))

sta['Entrance'] = sta['Name'].apply(lambda x: ent[x])
sta['X'] = sta['X'].apply(lambda x: "{:.6f}".format(x))
sta['Y'] = sta['Y'].apply(lambda x: "{:.6f}".format(x))

sta = sta[['Name','Y','X','Entrance']]
sta.head()

Subway Traffic¶

For each station, we have the traffic (sum of entries and exits per day) for 6 consecutive days:

April 8th 2017
April 9th 2017
April 10th 2017
April 11th 2017
April 12th 2017
April 13th 2017

sta_traffic = pd.read_csv('data/station_traffic.csv')

#Map those station names to the ones in the DataFrame sta, that has all information on stations
sta_traffic['Name'] = sta_traffic['Name'].apply(lambda x:str.lower(x))

with open('data/map_station_names.json','r') as f:
    map_names = json.load(f)
# .items() works in both Python 2 and Python 3 (.iteritems() is Python 2 only)
map_names_rev = {str(bad):str(good) for good,bad in map_names.items()}

map_names_normal = {orig_name: orig_name.lower() for orig_name in sta['Name']}
map_names_normal_rev = {low:orig for orig, low in map_names_normal.items()}

map_final = {bad: map_names_normal_rev[good] for bad, good in map_names_rev.items()}
sta_traffic['Name'] = sta_traffic['Name'].map(map_final)

sta_traffic = sta_traffic.dropna()
sta_traffic['traffic_mean'] = (sta_traffic['traffic_april8']+sta_traffic['traffic_april9']+sta_traffic['traffic_april10']+sta_traffic['traffic_april11']+sta_traffic['traffic_april12']+sta_traffic['traffic_april13'])/6

sta_traffic.head()

Merging the different datasets¶

The final dataset contains, for each subway station, all the required information:

Name
Number of art galleries
Median Income
Traffic
Entrance Type

Basically, we started with the dataset where each subway station is described and went through the following steps:

Compute the number of art galleries within 0.2 miles of the station, using GPS Coordinates of the galleries

Assign a neighborhood and a median income, using GPS Coordinates of the neighborhood centroids

Join the resulting dataset with the dataset containing the traffic data, using a mapping between the two different formats of station names, which we computed using stemming and the Levenshtein distance
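The name-mapping step is not shown in this report. The sketch below illustrates the idea under simple assumptions: a crude normalization stands in for the stemming step, and each name is matched to the candidate with the smallest Levenshtein distance. The helper names and the example names are illustrative only.

```python
import re

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def normalize(name):
    """Lowercase, unify separators and drop ordinal suffixes ("103rd" -> "103")."""
    name = name.lower().replace('-', ' ')
    name = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', name)
    return ' '.join(name.split())

def best_match(name, candidates):
    """Candidate whose normalized form is closest to the normalized name."""
    return min(candidates, key=lambda c: levenshtein(normalize(name), normalize(c)))

candidates = ['103rd St', '14th St-Union Square', 'Hunts Point Av']
match = best_match('14 ST-UNION SQ', candidates)
```

In practice, a match found this way should be checked by hand (or rejected above some distance threshold), since the closest candidate is not necessarily the right station.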

Note that some stations don't have a neighborhood, as we focused on neighborhoods in or near Manhattan. We dropped those stations and kept only the ones with an assigned neighborhood. We also filtered out the stations that have two galleries or fewer nearby, in order to make the visualizations more readable and because those stations are not very interesting, based on our assumptions above.

#This function computes the distance, in miles, between two GPS points.
#x is the longitude and y the latitude; geopy expects (latitude, longitude) pairs.
def dist(x1, x2, y1, y2):
    return vincenty((y1, x1), (y2, x2)).miles

sta_nbgal={sta_name: 0 for sta_name in sta['Name']}
sta_neighborhood={sta_name: '' for sta_name in sta['Name']}

for k in range(len(sta)):
    x1 = sta['X'][k]
    y1 = sta['Y'][k]
    station = sta['Name'][k]
    for j in range(len(gal)):
        dz = dist(x1, gal['X'][j], y1, gal['Y'][j])
        if dz < 0.2:
            sta_nbgal[station] += 1
    min_dz = 999
    for l in range(len(nei)):
        dz = dist(x1, nei['X'][l], y1, nei['Y'][l])
        if dz < min_dz and dz < 4:
            min_dz = dz
            sta_neighborhood[station]=nei['Neighborhood'][l]

sta['Nb_gal'] = sta['Name'].map(sta_nbgal)
sta['Neighborhood'] = sta['Name'].map(sta_neighborhood)

Below is a snapshot of the subway station dataset after computing the number of galleries and joining with the neighborhood dataset, but before merging the traffic data.

sta = sta.join(nei, on='Neighborhood', how='left', lsuffix='', rsuffix='_nei')
sta = sta[['Name','Y','X','Entrance','Nb_gal','Neighborhood','Median Income']]
sta.head()

sta_traffic.index = sta_traffic['Name']

sta = sta.join(sta_traffic, on='Name', how='left', lsuffix='', rsuffix='_nei')

sta = sta.dropna()
del sta['Name_nei']

sta.index = sta['Name']
sta = sta[sta['Nb_gal']>2]

Final dataset, ready for the visualization tasks!¶

Here is a snapshot of the final dataset

sta.head()

Exploratory Data Analysis¶

Interactive plots using plotly, please play with it!¶

import plotly.plotly as py
import cufflinks as cf
import plotly.graph_objs as go
from plotly.graph_objs import *

import plotly
plotly.offline.init_notebook_mode()
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

First set of simple plots, in order to understand the different variables¶

Type of Entrance
Density in Art Galleries
Median Income of the neighborhood
Traffic

Type of Entrance¶

data = [go.Bar(
            x=sta['Entrance'].value_counts().index,
            y=list(sta['Entrance'].value_counts()),
            marker=dict(color='rgb(62,57,193)')
    )]
layout = go.Layout(
    title="Type of Entrance - Frequency"
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Most stations have stairs, which can be a problem if the artworks are very heavy or large for instance. An artist may want to select a station that has an elevator.

Density in Art Galleries¶

data = [go.Bar(
            y=sta['Nb_gal'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['Nb_gal'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['galleries around<br><b>'+sta['Neighborhood'][sta['Nb_gal'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=170,
        r=30,
        b=100,
        t=100,
        pad=4
    ),
    title="Number of galleries around each Subway Station",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Number of art galleries within 0.2 miles'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

The results are reassuring: 68th St-Hunter College and Lexington Av, for instance, are on the Upper East Side, in the museum area (the Met, the Guggenheim...), where the art gallery density is indeed very high.

Spring St, Canal St and Prince St are in very arty areas of downtown Manhattan, so it makes sense to find them at the top of our ranking.

Mean Traffic per station¶

data = [go.Bar(
            y=sta['traffic_mean'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['traffic_mean'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['<b>'+sta['Neighborhood'][sta['traffic_mean'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=900,
    height=700,
    margin=go.Margin(
        l=210,
        r=0,
        b=100,
        t=100,
        pad=4
    ),
    title="Mean daily traffic for each Subway Station",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean daily traffic'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

There is not much surprise in the traffic either. Grand Central, 34th St and 42nd St are known to be among the main stations in Manhattan, with massive daily traffic. The slope is steep at the top of the ranking, meaning that a few stations have massive traffic while many stations have more homogeneous traffic, around 50k people per day.

Median Income per station¶

data = [go.Bar(
            y=sta['Median Income'].sort_values(ascending=False)[:30].index[::-1],
            x=sta['Median Income'].sort_values(ascending=False)[:30][::-1],
            marker=dict(color='rgb(62,57,193)'),
            text= ['<b>'+sta['Neighborhood'][sta['Median Income'].sort_values(ascending=False)[:30].index[::-1][k]]+'</b>' for k in range(30)],
            orientation = 'h'
    )]
layout = go.Layout(
    autosize=False,
    width=900,
    height=700,
    margin=go.Margin(
        l=210,
        r=0,
        b=100,
        t=100,
        pad=4
    ),
    title="Median Income around each Subway Station (k$)",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        autotick=False,
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Median Income'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

The horizontal bars here are obviously grouped by neighborhood. This is due to the way we computed the median income for each station, as we assigned the income of its neighborhood to each station.

We find the stations in the richest neighborhoods on top (Chambers St in Tribeca, Whitehall St in Battery Park, Lexington Av in N. Sutton Area...)

Combining the variables¶

In the following charts, we combine the different variables, in order to gain insights on the subway stations to pick.

Number of art galleries vs Traffic¶

In this first scatter plot,

dot: a subway station
y coordinate: number of art galleries around
x coordinate: mean traffic

data = [go.Scatter(
        x = sta['traffic_mean'], 
        y = sta['Nb_gal'], 
        text = sta['Name'], 
        mode = 'markers', 
        name = 'Subway station',
        marker = dict(
            size = 13,
            color = 'rgb(62,57,193)'
        )
    )]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    bargap=0.4
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)


# IPython notebook
# py.iplot(fig, filename='pandas/multiple-scatter')

Based on this plot, we can combine our observations on the traffic and the density of art galleries. Typically, the subway stations are polarized along the axes: almost no station combines massive traffic with a large number of galleries. There is mostly a trade-off between an arty station and a station with a lot of traffic, except for a few stations such as Canal St or West 4. We will go deeper into the analysis in the charts below.

However, we have absolute values for both features, which is not optimal for our task of selecting the subway station that would match with an artist, based on their criteria.

By scaling the variables Number of art galleries and Traffic (subtracting the mean and dividing by the standard deviation), we will be able to see which stations are artier than average and which ones have more traffic than the mean.
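As a small worked example of this scaling, with made-up traffic values: after standardization the values have mean 0 and standard deviation 1, so a positive value means "above the mean". The helper name `zscore` is only for illustration.

```python
import numpy as np

def zscore(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

traffic = [10000, 20000, 30000, 100000]  # made-up mean daily traffic
z = zscore(traffic)
above_mean = z > 0  # True only for stations busier than the mean (40000 here)
```

Note that with a skewed distribution like this one, a single very busy station pulls the mean up, so most stations end up below zero: "above the mean" is not the same as "above most stations".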

Number of art galleries vs Traffic - Scaled¶

In the following plot, we have scaled the data. As a consequence, the dots in the upper right part of the graph have more traffic and more galleries than the average station, those in the upper left part have more galleries but less traffic, those in the lower left part have fewer galleries and less traffic, and those in the lower right part have more traffic but fewer galleries.

These four groups have been assigned different colors.

# Standardize (z-score): subtract the mean, divide by the population standard deviation
sta['traffic_scaled'] = (sta['traffic_mean'] - sta['traffic_mean'].mean()) / sta['traffic_mean'].std(ddof=0)

sta['nbgal_scaled'] = (sta['Nb_gal'] - sta['Nb_gal'].mean()) / sta['Nb_gal'].std(ddof=0)

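# Assign each station to a quadrant based on the sign of its scaled values:
# 1 = low art / low traffic, 2 = high art / low traffic,
# 3 = low art / high traffic, 4 = high art / high traffic
# (vectorized, which avoids chained assignment on the DataFrame)
low_traffic = sta['traffic_scaled'] < 0
low_art = sta['nbgal_scaled'] < 0
sta['quad'] = np.select(
    [low_traffic & low_art, low_traffic & ~low_art, ~low_traffic & low_art],
    [1, 2, 3],
    default=4)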

data = [go.Scatter(
        x = sta[sta['quad']==1]['traffic_scaled'],y = sta[sta['quad']==1]['nbgal_scaled'],
        text = sta[sta['quad']==1]['Name'],mode = 'markers', 
        name = 'Low art / Low traffic',marker = dict(size = 13,color = 'rgb(255,215,0)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==2]['traffic_scaled'],y = sta[sta['quad']==2]['nbgal_scaled'],
        text = sta[sta['quad']==2]['Name'],mode = 'markers', 
        name = 'High art / Low traffic',marker = dict(size = 13,color = 'rgb(34,139,34)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==3]['traffic_scaled'],y = sta[sta['quad']==3]['nbgal_scaled'],
        text = sta[sta['quad']==3]['Name'],mode = 'markers', 
        name = 'Low art / High traffic',marker = dict(size = 13,color = 'rgb(240,128,128)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==4]['traffic_scaled'],y = sta[sta['quad']==4]['nbgal_scaled'],
        text = sta[sta['quad']==4]['Name'],mode = 'markers', 
        name = 'High art / High traffic',marker = dict(size = 13,color = 'rgb(100,149,237)',opacity=0.9)),
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

As described above and in the legend of the plot, we have defined four groups of subway stations.

Below is an example of how to read this plot:

An artist who doesn't want to engage with arty people but wants to reach the largest possible audience would probably pick a station in the red group, such as 42nd St.

An artist who wants a lot of traffic as well as engaging with arty people would try a station in the blue group, such as West 4 or Canal St.

An artist who wants to be in a rather calm station - low traffic - but with arty people may want to go to Spring St or Lexington Av!

And an artist who wants neither traffic nor arty commuters may pick a station in the yellow group!

Yet, this plot says nothing about the median income of the people living near the station, who are thus likely to commute through it. We will add this feature in the next plot.

Number of art galleries vs Traffic vs Income¶

In this plot, the size of each dot reflects the median income (binned into four levels) of the neighborhood the station is located in.

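def bin_income(x):
    # Map the median income (in k$) to a marker size; bins are [low, high),
    # so boundary values such as exactly 50 or 100 fall in the right bin
    if x<50:
        return 11
    elif x<100:
        return 15
    elif x<150:
        return 20
    else:
        return 27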

sta['income_binned'] = sta['Median Income'] 
sta['income_binned'] = sta['income_binned'].apply(lambda x: bin_income(x))
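For reference, the same kind of binning can be sketched with `pd.cut`, here with left-closed bins [low, high) and a few median income values (in k$). This is an alternative formulation for illustration, not the code used to produce the plots.

```python
import numpy as np
import pandas as pd

income = pd.Series([28.5, 50.0, 101.6, 170.5])  # median income in k$
sizes = pd.cut(income,
               bins=[-np.inf, 50, 100, 150, np.inf],
               right=False,              # intervals are [low, high)
               labels=[11, 15, 20, 27]   # marker size for each bin
               ).astype(int)
```

Declaring the bin edges in one list makes the boundaries explicit and easy to change, at the cost of one extra conversion from categorical labels to integers.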

data = [go.Scatter(
        x = sta[sta['quad']==1]['traffic_scaled'],y = sta[sta['quad']==1]['nbgal_scaled'],
        text = sta[sta['quad']==1]['Name'],mode = 'markers',name = 'Low art / Low traffic',
        marker = dict(size = sta[sta['quad']==1]['income_binned'],color = 'rgb(255,215,0)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==2]['traffic_scaled'],y = sta[sta['quad']==2]['nbgal_scaled'],
        text = sta[sta['quad']==2]['Name'],mode = 'markers',name = 'High art / Low traffic',
        marker = dict(size = sta[sta['quad']==2]['income_binned'],color = 'rgb(34,139,34)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==3]['traffic_scaled'],y = sta[sta['quad']==3]['nbgal_scaled'],
        text = sta[sta['quad']==3]['Name'],mode = 'markers',name = 'Low art / High traffic',
        marker = dict(size = sta[sta['quad']==3]['income_binned'],color = 'rgb(240,128,128)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['quad']==4]['traffic_scaled'],y = sta[sta['quad']==4]['nbgal_scaled'],
        text = sta[sta['quad']==4]['Name'],mode = 'markers',name = 'High art / High traffic',
        marker = dict(size = sta[sta['quad']==4]['income_binned'],color = 'rgb(100,149,237)',opacity=0.9)),
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

We can now complete our analysis!

Let's say the artist really wants to sell his artworks and not only display them.

If he finds himself in the blue group, he may prefer West 4th over Canal St.

If he chose the red group, he may stay away from Bedford Ave and go to 59th Columbus Circle or 42nd St.

If he prefers the green group, Lexington Av would be a better option than Prince St, for instance.

Number of art galleries vs Traffic vs Entrance type¶

We replace the feature 'Income' with the feature 'Entrance' to keep the graph readable. As 5 types of entrance are plotted, we assign a color to each type, as described in the legend.

data = [go.Scatter(
        x = sta[sta['Entrance']=='Stair']['traffic_scaled'],y = sta[sta['Entrance']=='Stair']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Stair']['Name'],mode = 'markers',name = 'Stair',
        marker = dict(size = 15,color = 'rgb(255,192,203)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Door']['traffic_scaled'],y = sta[sta['Entrance']=='Door']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Door']['Name'],mode = 'markers',name = 'Door',
        marker = dict(size = 15,color = 'rgb(0,0,205)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Easement']['traffic_scaled'],y = sta[sta['Entrance']=='Easement']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Easement']['Name'],mode = 'markers',name = 'Easement',
        marker = dict(size = 15,color = 'rgb(138,43,226)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Escalator']['traffic_scaled'],y = sta[sta['Entrance']=='Escalator']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Escalator']['Name'],mode = 'markers',name = 'Escalator',
        marker = dict(size = 15,color = 'rgb(139,0,139)',opacity=0.9)),
        go.Scatter(
        x = sta[sta['Entrance']=='Elevator']['traffic_scaled'],y = sta[sta['Entrance']=='Elevator']['nbgal_scaled'],
        text = sta[sta['Entrance']=='Elevator']['Name'],mode = 'markers',name = 'Elevator',
        marker = dict(size = 15,color = 'rgb(255,20,147)',opacity=0.9))
       ]
layout = go.Layout(
    hovermode="closest", 
    autosize=False,
    width=1000,
    height=700,
    margin=go.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title="Scatter plot of the Subway Stations - Entrance Type",
    yaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showticklabels=True,
        ticks='outside',
        title='Density of art Galleries'
    ),
    xaxis=dict(
        titlefont=dict(
            family='Arial, sans-serif',
            size=18,
            color='lightgrey'
        ),
        showgrid=True,
        showticklabels=True,
        ticks='outside',
        title='Mean Traffic'
    ),
    showlegend=True
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

If the artist says "I only want to showcase my art, not sell it! But it is very heavy and I have broken my ankle, I may need an elevator!"¶

This artist should look for the deep pink dots (Elevator) in the upper left corner, as he doesn't particularly want to sell.

We would recommend that he try Lexington Av or 34 St Hudson Yards!

What next?¶

In the near future, we plan to build a recommendation algorithm that would recommend a subway station to an artist, based on their preferences. We would then add this recommendation system to this web app :)
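One simple form such a recommendation could take is a weighted score over standardized features. The sketch below is only an illustration of that idea, with made-up stations, made-up values and a hypothetical `recommend` function; it is not an implemented part of the project.

```python
import pandas as pd

# Toy, made-up station features, assumed to be standardized already.
stations = pd.DataFrame({
    'traffic': [2.0, -0.5, 0.3],
    'art':     [-0.8, 1.5, 0.9],
    'income':  [0.1, 0.7, -0.2],
}, index=['Station A', 'Station B', 'Station C'])

def recommend(stations, weights, top=1):
    """Rank stations by a weighted sum of their standardized features."""
    # Features missing from the weights get a weight of 0.
    w = pd.Series(weights).reindex(stations.columns).fillna(0.0)
    scores = stations.mul(w, axis=1).sum(axis=1)
    return scores.sort_values(ascending=False).head(top)

# An artist who only cares about audience size.
best = recommend(stations, {'traffic': 1.0, 'art': 0.0, 'income': 0.0})
# An artist who wants arty commuters and a rather calm station.
arty = recommend(stations, {'traffic': -0.5, 'art': 1.0})
```

The weights play the role of the artist's preferences: a negative weight expresses that the artist would rather avoid a feature, such as heavy traffic.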

Output of nei.head() (median income in k$, per neighborhood):
	Neighborhood	Median Income	Y	X
Neighborhood
Tribeca	Tribeca	170.5	40.718649	-74.008769
Battery Park	Battery Park	153.5	40.705372	-74.014410
Carnegie Hill	Carnegie Hill	150.5	40.783498	-73.955369
N Sutton Area	N Sutton Area	136.3	40.757554	-73.962410
Financial Dist	Financial Dist	121.1	40.710181	-74.010209

Output of gal.head():
	NAME	Y	X	TEL	URL	ADDRESS1	CITY	ZIP
0	O'reilly William & Co Ltd	40.773800	-73.962730	(212) 396-1822	http://www.nyc.com/arts__attractions/oreilly_w...	52 E 76th St	New York	10021
1	Organization of Independent Artists - Gallery 402	40.716468	-74.009385	(212) 219-9213	http://www.nonprofitgallery.com/main/usa/ny/oi...	19 Hudson St.	New York	10013
2	Owen Gallery	40.774000	-73.964351	(212) 879-2415	http://www.owengallery.com/about-us	19 E 75th St	New York	10021
3	P P O W Gallerie	40.749585	-74.003892	(212) 647-1044	http://www.ppowgallery.com/	511 W 25th St	New York	10001
4	P P O W Inc	40.722907	-74.001763	(212) 941-8642	http://www.nyc.com/arts__attractions/p_p_o_w_i...	476 Broome St	New York	10013

Output of sta.head() after deduplicating the entrances:
	Name	Y	X	Entrance
0	103rd St	40.784001	-73.935003	Stair
1	104th St-102nd St	40.695178	-73.844330	Stair
2	104th St-Oxford Av	40.681711	-73.837683	Stair
3	110th St	40.795020	-73.944250	Stair
4	110th St-Central Park North	40.799075	-73.951822	Stair

Output of sta_traffic.head():
	Name	traffic_april8	traffic_april9	traffic_april10	traffic_april11	traffic_april12	traffic_april13	traffic_mean
0	Cypress Av	3910.75	3012.75	5537.00	5536.5000	5432.00	5703.00	4855.333333
1	5th Av-53rd St	12825.00	10323.00	51630.50	53881.5000	55479.00	56423.50	40093.750000
3	Hunts Point Av	10648.25	8327.25	17013.50	17663.0734	17202.00	17638.25	14748.720567
4	Sutter Av	5609.50	4687.25	7360.75	7524.5000	7049.00	7471.50	6617.083333
5	7th Av	14372.00	12064.75	17708.75	18591.2500	19231.75	19542.75	16918.541667

Output of sta.head() after adding the number of galleries and the neighborhood:
	Name	Y	X	Entrance	Nb_gal	Neighborhood	Median Income
0	103rd St	40.784001	-73.935003	Stair	0	East Harlem	28.5
1	104th St-102nd St	40.695178	-73.844330	Stair	0		NaN
2	104th St-Oxford Av	40.681711	-73.837683	Stair	0		NaN
3	110th St	40.795020	-73.944250	Stair	1	East Harlem	28.5
4	110th St-Central Park North	40.799075	-73.951822	Stair	1	Harlem	38.8

Output of sta.head() for the final dataset:
	Name	Y	X	Entrance	Nb_gal	Neighborhood	Median Income	traffic_april8	traffic_april9	traffic_april10	traffic_april11	traffic_april12	traffic_april13	traffic_mean
Name
14th St	14th St	40.739460	-73.999947	Easement	12	Chelsea	101.6	29834.50	24008.75	33009.50	35592.50	34184.000000	35659.714056	32048.160676
14th St-Union Square	14th St-Union Square	40.734673	-73.989951	Easement	23	Gramercy	100.1	48145.00	39143.50	55874.50	58529.00	58056.943045	57881.766546	52938.451599
157th St	157th St	40.834041	-73.944890	Stair	3	Hamilton Hts	35.9	13347.00	12104.00	17161.25	17529.75	17269.750000	17751.750000	15860.583333
18th St	18th St	40.741040	-73.997871	Stair	21	Chelsea	101.6	13178.75	11724.25	18181.00	18410.75	18407.750000	19273.000000	16529.250000
23rd St	23rd St	40.742316	-73.991510	Easement	31	Garment Dist	118.9	3044.50	2382.00	4469.75	4592.50	4524.250000	4706.000000	3953.166667