Python Cheat Sheet for Data Science
February 12, 2019
Scientific Computing Libraries
- Pandas - Data structures & tools
- NumPy - Arrays and matrices
- SciPy - Integrals, differential equations, optimization
Visualization Libraries
- Matplotlib - plots & graphs
- Seaborn - heat maps, time series, violin plots
Algorithmic Libraries
- Scikit-learn - Machine Learning, regression
- Statsmodels - Explore data, estimate statistical models, perform tests
Data Collection and Exporting
Importing CSV data with no header row. Omit the header parameter if the data has a header row.
import pandas as pd
url = "https://someurl.com/thedata.data"
df = pd.read_csv(url,header = None)
Exporting CSV data (the raw string keeps the backslash in the Windows path literal)
path = r"C:\myfile.csv"
df.to_csv(path)
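As a quick check of the two calls above, the sketch below round-trips a tiny headerless CSV through an in-memory buffer instead of a URL or a file path. The sample values and the column names are made up for illustration.

```python
import io

import pandas as pd

# Two rows of made-up data with no header row (illustrative values only).
raw = "3.47,2.68,13495\n2.68,3.47,16500\n"

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), header=None)

# Without a header, pandas numbers the columns 0, 1, 2; give them names.
df.columns = ["bore", "stroke", "price"]

# index=False keeps the row index out of the exported text.
out = io.StringIO()
df.to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer to to_csv writes the same text to disk.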
Other formats and function calls
- CSV: read_csv() | to_csv()
- JSON: read_json() | to_json()
- Excel: read_excel() | to_excel()
- SQL: read_sql() | to_sql()
Describing Data
Common pandas data types
- object (strings) : “Hello”
- int64 : 1,2,3,4,5
- float64 : 2.12, 3.14, 5.00
- datetime64 : 2019-05-13
Check data types
df.dtypes
Set data types (astype returns a new Series, so assign it back)
df["bore"] = df["bore"].astype("int64")
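A minimal sketch of checking and setting a type, assuming a column that was read in as text. The peak-rpm values here are invented for the example.

```python
import pandas as pd

# Made-up values stored as text, as can happen after reading a messy file.
df = pd.DataFrame({"peak-rpm": ["5000", "5500", "4800"]})

# astype returns a new Series, so assign the result back to the column.
df["peak-rpm"] = df["peak-rpm"].astype("int64")
print(df.dtypes)
```

Note that converting to an integer type fails if the column still contains NaN, so missing values need to be handled first.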
Return statistical summary
df.describe()
Return statistical summary of all columns, including non-numeric ones
df.describe(include="all")
Return first 10 rows of dataframe
df.head(10)
Modifying Data
Add 1 to every value in a column
df["dollars"]=df["dollars"]+1
Converting Data
Convert mpg to L/100km
df["city-mpg"]=235/df["city-mpg"]
df.rename(columns={"city-mpg":"city-L/100km"},inplace=True)
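The conversion above can be traced on a few made-up mileage figures. This is only a sketch: the values are chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up fuel economy figures in miles per gallon.
df = pd.DataFrame({"city-mpg": [21.0, 47.0, 23.5]})

# L/100km = 235 / mpg, using the rounded constant from the line above.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename the column so its name matches the new unit.
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
print(df)
```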
Missing Data
Drop rows where price is missing (axis=0 drops rows, axis=1 drops columns)
df.dropna(subset=["price"],axis=0, inplace=True)
Replace NaN with average
import numpy as np
avg = df["bore"].mean(axis=0)
df["bore"] = df["bore"].replace(np.nan, avg)
Replace NaN in the column peak-rpm with 5
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, 5)
Or simply keep the missing data in the analysis.
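The drop and replace strategies can be combined on one table. The sketch below uses invented values and swaps in fillna, which does the same job here as replacing np.nan.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing price and one missing bore.
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0, 13950.0],
    "bore": [3.47, 2.68, np.nan, 3.19],
})

# Drop rows that have no price, then renumber the remaining rows.
df = df.dropna(subset=["price"]).reset_index(drop=True)

# Fill the missing bore with the mean of the values that are left.
avg = df["bore"].mean()
df["bore"] = df["bore"].fillna(avg)
print(df)
```

Assigning the result back to the column, rather than calling inplace=True on a selected column, avoids depending on whether the selection is a view or a copy.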
Data Normalization
Simple feature scaling
df["length"]= df["length"]/df["length"].max()
Min-max
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())
Z-score
df["length"]= (df["length"]-df["length"].mean())/df["length"].std()
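To compare the three methods side by side, the sketch below applies each one to the same made-up lengths and keeps the results in separate variables instead of overwriting the column.

```python
import pandas as pd

# Made-up car lengths (illustrative values only).
length = pd.Series([150.0, 170.0, 180.0, 200.0], name="length")

# Simple feature scaling: divide by the maximum, so the largest value is 1.
simple = length / length.max()

# Min-max: the smallest value maps to 0 and the largest to 1.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: subtract the mean and divide by the sample standard deviation.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

pandas computes std() with ddof=1 (the sample standard deviation), so the z-scores can differ slightly from tools that divide by n.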
Binning
Put prices into 3 equal-width bins - low, medium, and high
bins = np.linspace(min(df["price"]),max(df["price"]),4)
group_names = ["Low","Medium","High"]
df["price-binned"]=pd.cut(df["price"],bins,labels=group_names,include_lowest=True)
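A small worked example of the binning above, with made-up prices chosen so the bin edges come out as round numbers.

```python
import numpy as np
import pandas as pd

# Made-up prices; min is 5000 and max is 35000.
df = pd.DataFrame({"price": [5000, 9000, 15000, 21000, 35000]})

# 4 equally spaced edges define 3 equal-width bins:
# 5000, 15000, 25000, 35000.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# Bins are closed on the right, so 15000 falls in "Low".
# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
print(df)
```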
One Hot Encoding
Convert categorical values to dummy variables (0 or 1)
pd.get_dummies(df['fuel'])
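A minimal sketch of one hot encoding on a made-up fuel column. Recent pandas versions return True/False columns by default, so dtype=int is passed here to get the 0/1 values described above.

```python
import pandas as pd

# Made-up categorical column.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns to the original table.
df = pd.concat([df, dummies], axis=1)
print(df)
```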
Exploratory Data Analysis
- Descriptive Statistics
- GroupBy
- Pearson Correlation
- Correlation Heatmaps
- ANOVA
Descriptive Statistics
count, mean, standard deviation, quartiles, min and max
df.describe()
summarize categorical data
drive_wheels_counts=df["drive-wheels"].value_counts()
box-plot
import seaborn as sns
sns.boxplot(x="body-style", y="price", data=df)
scatter-plot : predictor is x-axis and target is y-axis
import matplotlib.pyplot as plt
x = df["engine-size"]
y = df["price"]
plt.scatter(x, y)
plt.title("Engine Size vs Price")
plt.xlabel("Engine Size")
plt.ylabel("Price")
GroupBy
group by categories showing mean of groupings
df_test = df[['drive-wheels','body-style','price']]
df_grp = df_test.groupby(['drive-wheels','body-style'],as_index=False).mean()
pivot table from above grouping
df_pivot = df_grp.pivot(index='drive-wheels',columns='body-style')
heatmap from above pivot table
plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()
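The grouping and pivot steps above can be traced on a small made-up table. Plotting is left out so the sketch only needs pandas.

```python
import pandas as pd

# Made-up cars (illustrative values only).
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "hatchback"],
    "price": [10000.0, 8000.0, 20000.0, 24000.0, 15000.0],
})

# Double brackets select several columns and return a DataFrame.
df_test = df[["drive-wheels", "body-style", "price"]]

# Mean price for each (drive-wheels, body-style) pair.
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()

# Rows become drive-wheels, columns become body-style.
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
print(df_pivot)
```

Because no values argument is given, the pivoted columns form a two-level index, with "price" on top and the body style underneath.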
Pearson Correlation
Correlation between two features
sns.regplot(x="engine-size",y="price", data=df)
plt.ylim(0,)
Correlation coefficient:
- close to +1: Strong positive linear relationship
- close to -1: Strong negative linear relationship
- close to 0: No linear relationship
P-value (certainty that the correlation is statistically significant):
- Less than 0.001 : Strong certainty
- Less than 0.05 : Moderate certainty
- Less than 0.1 : Weak certainty
- Greater than 0.1 : No Certainty
coefficient and p-value
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
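A short sketch of reading the coefficient and p-value together, assuming SciPy is installed. The horsepower and price figures are made up and deliberately close to linear.

```python
import pandas as pd
from scipy import stats

# Made-up values where price rises with horsepower.
df = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [7000, 9500, 12000, 18000, 30000],
})

# pearsonr returns the correlation coefficient and its two-sided p-value.
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print(f"coefficient = {pearson_coef:.3f}, p-value = {p_value:.4f}")

# pandas gives the same coefficient, but without a p-value.
pandas_coef = df["horsepower"].corr(df["price"])
```

With only five points the p-value should be read with care; the thresholds in the list above are rules of thumb.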
correlation heatmap
TBD
ANOVA - Analysis of Variance
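F-test score: variation between the sample group means divided by the variation within the sample groups. A larger score means the group means differ more relative to the spread inside each group.
p-value: how statistically significant the F-test score is. A small p-value is evidence against the group means being equal.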
ANOVA using scipy
from scipy import stats
df_anova = df[["make", "price"]]
grouped_anova = df_anova.groupby("make")
anova_results = stats.f_oneway(
    grouped_anova.get_group("honda")["price"],
    grouped_anova.get_group("subaru")["price"])
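To see how the F-test score and p-value behave, the sketch below runs the same test on two made-up comparisons: two makes with similar prices, and two with very different prices. The prices and the third make (jaguar) are invented for the example.

```python
import pandas as pd
from scipy import stats

# Made-up prices: honda and subaru are similar, jaguar is far higher.
df = pd.DataFrame({
    "make": ["honda"] * 4 + ["subaru"] * 4 + ["jaguar"] * 4,
    "price": [7000, 7500, 8000, 8500,
              7200, 7700, 8100, 8600,
              32000, 35000, 36000, 34000],
})

# Grouping by a single column name lets get_group take a plain string.
grouped = df[["make", "price"]].groupby("make")

# Similar group means: small F score, large p-value.
similar = stats.f_oneway(grouped.get_group("honda")["price"],
                         grouped.get_group("subaru")["price"])

# Very different group means: large F score, tiny p-value.
different = stats.f_oneway(grouped.get_group("honda")["price"],
                           grouped.get_group("jaguar")["price"])

print("honda vs subaru:", similar.statistic, similar.pvalue)
print("honda vs jaguar:", different.statistic, different.pvalue)
```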