Python Cheat Sheet for Data Science

February 12, 2019

Scientific Computing Libraries

Pandas - Data structures & tools
NumPy - Arrays and matrices
SciPy - Integrals, differential equations, optimization

Visualization Libraries

Matplotlib - plots & graphs
Seaborn - heat maps, time series, violin plots

Algorithmic Libraries

Scikit-learn - Machine Learning, regression
Statsmodels - Explore data, estimate statistical models, perform tests

Data Collection and Exporting

Importing CSV data with no header row. Omit the header parameter if the data has a header row.

import pandas as pd
url = "https://someurl.com/thedata.data"
df = pd.read_csv(url, header=None)
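A small sketch of how the header parameter changes the result. It reads from in-memory strings rather than a real URL, and the sample values and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a downloaded file.
raw = "3,alfa-romero,13495\n1,audi,13950\n"

# header=None: the first line is kept as data and columns are numbered 0..n-1.
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

# Column names can be assigned afterwards (these names are assumptions).
df_no_header.columns = ["symboling", "make", "price"]

# Default behavior: the first line is consumed as the header row.
df_with_header = pd.read_csv(io.StringIO("make,price\naudi,13950\n"))
```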

Exporting CSV data

path = r"C:\myfile.csv"
df.to_csv(path)

Other formats and function calls

CSV: read_csv() | to_csv()
JSON: read_json() | to_json()
Excel: read_excel() | to_excel()
SQL: read_sql() | to_sql()
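The read/write pairs mirror each other, which a round trip makes visible. This is a minimal sketch using in-memory text instead of files; the two-row table is invented for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# to_csv() with no path returns the CSV text; index=False leaves out the row index.
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# to_json() with no path returns a JSON string; orient="records" is a list of row objects.
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")
```

Without index=False, to_csv() also writes the row index as an extra unnamed first column.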

Describing Data

Common pandas data types

object (strings) : “Hello”
int64 : 1,2,3,4,5
float64 : 2.12, 3.14, 5.00
datetime64 : 2019-05-13

Check data types

df.dtypes

Set data types

df["bore"] = df["bore"].astype("int")
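astype() returns a new Series rather than changing the column in place, so the result has to be assigned back. A short sketch with made-up values (the "doors" column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"bore": ["3.47", "2.68", "3.19"], "doors": [2.0, 4.0, 4.0]})

# Numbers stored as strings become floats.
df["bore"] = df["bore"].astype("float64")

# Whole-number floats become integers.
df["doors"] = df["doors"].astype("int64")
```

Converting to an integer type raises an error if the column still contains NaN, so missing values usually need handling first.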

Return statistical summary

df.describe()

Return statistical summary of all columns, including non-numeric ones

df.describe(include="all")

Return first 10 rows of the dataframe

df.head(10)

Modifying Data

Add 1 to every value in a column

df["dollars"]=df["dollars"]+1

Converting Data

Convert mpg to L/100km

df["city-mpg"]=235/df["city-mpg"]
df.rename(columns={"city-mpg":"city-L/100km"},inplace=True)
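A worked instance of the conversion on two invented values. The constant 235 is a rounded form of roughly 235.21, the factor for US gallons.

```python
import pandas as pd

df = pd.DataFrame({"city-mpg": [21.0, 47.0]})

# L/100km = 235 / mpg, applied to the whole column at once.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename so the column label matches the new unit.
df = df.rename(columns={"city-mpg": "city-L/100km"})
```

Here 47 mpg becomes 5.0 L/100km; higher mpg always maps to a lower L/100km figure.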

Missing Data

Drop rows with a missing price (axis=0 drops rows, axis=1 drops columns)

df.dropna(subset=["price"],axis=0, inplace=True)

Replace NaN with average

import numpy as np
avg = df["bore"].mean(axis=0)
df["bore"] = df["bore"].replace(np.nan, avg)

Replace NaN in the column peak-rpm with 5

df["peak-rpm"] = df["peak-rpm"].replace(np.nan, 5)

Alternatively, leave the missing values in place and handle them during analysis.
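The three approaches can be combined on one small table. This sketch uses invented values and swaps in fillna(), which is an alternative to replace(np.nan, ...) for the same job; results are assigned back instead of relying on inplace=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0, 17450.0],
    "bore": [3.0, 3.2, np.nan, 4.0],
    "peak-rpm": [5000.0, 5500.0, np.nan, 5800.0],
})

# Drop rows whose price is missing, then renumber the index.
df = df.dropna(subset=["price"], axis=0).reset_index(drop=True)

# Fill the remaining gaps: column mean for bore, a fixed value for peak-rpm.
df["bore"] = df["bore"].fillna(df["bore"].mean())
df["peak-rpm"] = df["peak-rpm"].fillna(5)
```

The mean is computed after the drop, so it reflects only the rows that remain.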

Data Normalization

Simple feature scaling

df["length"]= df["length"]/df["length"].max()

Min-max

df["length"] = (df["length"] - df["length"].min()) / (
    df["length"].max() - df["length"].min())

Z-score

df["length"]= (df["length"]-df["length"].mean())/df["length"].std()
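The three methods give different ranges, which is easiest to see side by side. A minimal sketch on three made-up lengths:

```python
import pandas as pd

length = pd.Series([150.0, 175.0, 200.0])

# Simple feature scaling: largest value becomes 1.
simple = length / length.max()

# Min-max: smallest value becomes 0, largest becomes 1.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: mean 0, standard deviation 1 (pandas std() uses the sample formula, ddof=1).
z_score = (length - length.mean()) / length.std()
```

For this input the results are [0.75, 0.875, 1.0], [0.0, 0.5, 1.0], and [-1.0, 0.0, 1.0] respectively.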

Binning

Split prices into 3 equal-width bins: low, medium, and high

bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
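A runnable sketch of the same binning on four invented prices, showing where each value lands:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [5000, 12000, 20000, 35000]})

# Four equally spaced edges define three equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)
```

The edges here are 5000, 15000, 25000, and 35000, so the prices fall into Low, Low, Medium, and High.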

One Hot Encoding

Convert categorical values to dummy variables (0 or 1)

pd.get_dummies(df['fuel'])
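get_dummies() returns a separate dataframe, so the indicator columns usually need to be joined back. A minimal sketch with made-up fuel values; dtype=int is passed because recent pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One indicator column per category, as 0/1 integers.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the indicator columns to the original table.
df = pd.concat([df, dummies], axis=1)
```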

Exploratory Data Analysis

Descriptive Statistics
GroupBy
Pearson Correlation
Correlation Heatmaps
ANOVA

Descriptive Statistics

Count, mean, standard deviation, min/max, and quartiles

df.describe()

Summarize categorical data

drive_wheels_counts=df["drive-wheels"].value_counts()

Box plot

import seaborn as sns
sns.boxplot(x="body-style", y="price", data=df)

Scatter plot: predictor is on the x-axis and target is on the y-axis

import matplotlib.pyplot as plt
x = df["engine-size"]
y = df["price"]
plt.scatter(x, y)
plt.title("Engine Size vs Price")
plt.xlabel("Engine Size")
plt.ylabel("Price")

GroupBy

Group by categories, showing the mean of each grouping

df_test = df[['drive-wheels','body-style','price']]
df_grp = df_test.groupby(['drive-wheels','body-style'],as_index=False).mean()

pivot table from above grouping

df_pivot = df_grp.pivot(index='drive-wheels',columns='body-style')
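The grouping and pivot steps can be traced on a four-row table. The values are invented, and values="price" is added here as an assumption to keep the pivoted column labels flat.

```python
import pandas as pd

df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "sedan", "wagon"],
    "price": [10000.0, 12000.0, 20000.0, 24000.0],
})

# Mean price for each (drive-wheels, body-style) pair.
df_test = df[["drive-wheels", "body-style", "price"]]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()

# Rows become drive-wheels, columns become body-style.
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style", values="price")
```

Combinations that never occur in the data, such as fwd wagons here, show up as NaN in the pivot table.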

heatmap from above pivot table

plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()

Pearson Correlation

Correlation between two features

sns.regplot(x="engine-size",y="price", data=df)
plt.ylim(0,)

Correlation coefficient:

close to +1: Strong positive linear relationship
close to -1: Strong negative linear relationship
close to 0: No linear relationship

P-value:

Less than 0.001 : Strong evidence that the correlation is significant
Less than 0.05 : Moderate evidence
Less than 0.1 : Weak evidence
Greater than 0.1 : No evidence
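The thresholds above are rules of thumb and can be written as a small helper. The function name and return labels are hypothetical, chosen only for this sketch.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the rule-of-thumb labels listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"
```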

Coefficient and p-value

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
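A self-contained sketch of the coefficient and p-value calculation using SciPy's pearsonr. The five data points are invented and nearly linear, so the coefficient comes out close to +1.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [7000, 9500, 11000, 16000, 21000],
})

# pearsonr returns the correlation coefficient and its two-sided p-value.
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
```

pearsonr does not accept NaN values, so missing data should be dropped or filled beforehand.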

Correlation heatmap

TBD
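As a starting point for this placeholder, a correlation heatmap is normally drawn from the pairwise correlation matrix of the numeric columns. The sketch below only computes that matrix, on invented data; how to plot it (for example with plt.pcolor or seaborn's heatmap) is left open.

```python
import pandas as pd

df = pd.DataFrame({
    "engine-size": [90, 110, 130, 180],
    "price": [7000, 9000, 12000, 20000],
    "highway-mpg": [40, 35, 30, 22],
    "make": ["honda", "toyota", "audi", "bmw"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
```

The result is a square table with 1.0 on the diagonal; each cell would become one colored tile in the heatmap.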

ANOVA - Analysis of Variance

F-test score: variation between the sample group means divided by the variation within the sample groups.

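p-value: probability of seeing an F-test score at least this large if all group means were equal. A small value suggests the group means differ.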

ANOVA using scipy

df_anova = df[["make", "price"]]
grouped_anova = df_anova.groupby("make")
anova_results = stats.f_oneway(
    grouped_anova.get_group("honda")["price"],
    grouped_anova.get_group("subaru")["price"],
)
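A runnable sketch of the same test on invented prices. The two groups have similar means, so the F-test score is small and the p-value is large.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "make": ["honda", "honda", "honda", "subaru", "subaru", "subaru"],
    "price": [7000, 7500, 8000, 7200, 7600, 8100],
})

# Group by a single column name so get_group() takes a plain string key.
grouped = df[["make", "price"]].groupby("make")

# f_oneway returns the F statistic and the p-value.
f_val, p_val = stats.f_oneway(
    grouped.get_group("honda")["price"],
    grouped.get_group("subaru")["price"],
)
```

f_oneway accepts any number of groups, so more makes can be compared by passing additional price series.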