Pandas and Stats (#5)
s2t2 authored Jun 28, 2024
1 parent b847664 commit c1df91a
Showing 22 changed files with 9,428 additions and 9 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -3,6 +3,9 @@

.jupyter_cache

# HTML files recursive:


# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
41 changes: 32 additions & 9 deletions docs/_quarto.yml
@@ -200,16 +200,39 @@ website:
#
# PANDAS PACKAGE OVERVIEW
#
#- section:
# href: notes/pandas/overview.ipynb
# text: "Pandas Package Overview"
# contents:
# - section:
# href: notes/pandas/dataframes.qmd
# text: "Dataframes"


- section:
href: notes/pandas/overview.qmd
text: "Pandas Package Overview"
contents:
- section:
href: notes/pandas/dataframes.qmd
text: "Dataframes"
#- section:
# href: notes/pandas/dataframes.qmd
# text: "Dataframes"
- section:
href: notes/pandas/grouping-pivoting.qmd
text: "Grouping and Pivoting"
- section:
href: notes/pandas/shift-methods.qmd
        text: "Shift-based Methods" # "Growth and Cumulative Growth"
- section:
href: notes/pandas/moving-averages.qmd
text: "Moving Averages"
- section:
href: notes/pandas/joining-merging.ipynb
text: "Joining and Merging"

- section:
href: notes/applied-stats/overview.qmd
text: "Applied Statistics"
contents:
- section:
href: notes/applied-stats/basic-tests.ipynb
text: "Statistical Tests"
- section:
href: notes/applied-stats/correlation.ipynb
text: "Correlation Analysis"



Binary file added docs/images/joins-inner-outer.jpeg
1,462 changes: 1,462 additions & 0 deletions docs/notes/applied-stats/basic-tests.ipynb

Large diffs are not rendered by default.

205 changes: 205 additions & 0 deletions docs/notes/applied-stats/basic_tests.py
@@ -0,0 +1,205 @@
# -*- coding: utf-8 -*-
"""Basic Statistics Overview
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1A-RKDqX_l3C87eFt73m2jkmLiAt-mg9V
# Basic Summary Statistics
"""

from pandas import read_csv

df = read_csv("https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-indicators.csv")
df.head()

print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())

"""https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
We can use the describe method to quickly see the basic summary statistics for each column:
"""

df.describe()
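"""As a small extension (not in the original notebook), `describe` also accepts a `percentiles` parameter to customize which quantiles it reports. A minimal sketch, using a tiny stand-in DataFrame rather than the remote CSV:"""

```python
# Sketch (assumption: a small stand-in DataFrame, since the original notebook
# uses the remote monthly-indicators CSV)
from pandas import DataFrame

demo_df = DataFrame({"fed": [0.25, 0.5, 1.0, 2.5, 4.75, 5.25]})

# request the 10th, 50th, and 90th percentiles instead of the default quartiles
summary = demo_df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```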

"""As you may be aware, we can calculate these individually, using `Series` aggregations:"""

# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
# https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html

series = df["fed"]

print("COUNT:", len(series))
print("MEAN:", series.mean().round(6))
print("STD:", series.std().round(6))
print("-------------")
print("MIN:", series.min())
print("25TH:", series.quantile(.25))
print("MED:", series.median())
print("75TH:", series.quantile(.75))
print("MAX:", series.max())

series.describe() # for comparison

"""## Distribution Plots
Let's view some distribution plots of the federal funds rate, to tell a story about the summary statistics for this indicator.
"""

import plotly.express as px

px.box(df, x="fed", orientation="h", points="all", title="Distribution of Federal Funds Rate (Monthly)", hover_data=["timestamp"],)

# https://plotly.com/python-api-reference/generated/plotly.express.violin.html

#px.violin(df, y="fed", points="all", box=True, title="Distribution of Federal Funds Rate (Monthly)", hover_data=["timestamp"])
px.violin(df, x="fed", orientation="h", points="all", box=True, title="Distribution of Federal Funds Rate (Monthly)", hover_data=["timestamp"])

# https://plotly.com/python-api-reference/generated/plotly.express.histogram.html
px.histogram(df, x="fed", #nbins=12,
title="Distribution of Federal Funds Rate (Monthly)", height=350)

"""Looks like the recent higher funds rates are potential outliers. It is hard to say for sure whether this data is normally distributed, or whether it is too skewed by these outliers.
# Statistical Tests with `Scipy`
We can use the Scipy package to perform basic statistical tests.
https://pypi.org/project/scipy/
## Normality Tests
We can conduct a normality test to see if a given distribution is normally distributed.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
> This function tests the null hypothesis that a sample comes from a normal distribution.
>
> If the p-value is "small" - that is, if there is a low probability of sampling data from a normally distributed population that produces such an extreme value of the statistic - this may be taken as evidence against the null hypothesis in favor of the alternative: the weights were not drawn from a normal distribution.
"""

from scipy.stats import normaltest

x = df["fed"]

result = normaltest(x)
print(result)

"""Interpreting the results.
https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistics/basic-statistics/how-to/normality-test/interpret-the-results/key-results/
> To determine whether the data do not follow a normal distribution, compare the p-value to the significance level. Usually, a significance level (denoted as α or alpha) of 0.05 works well. A significance level of 0.05 indicates a 5% risk of concluding that the data do not follow a normal distribution when the data do follow a normal distribution.
>
> P-value ≤ α: The data do not follow a normal distribution (Reject H0)
> If the p-value is less than or equal to the significance level, the decision is to reject the null hypothesis and conclude that your data do not follow a normal distribution.
>
> P-value > α: You cannot conclude that the data do not follow a normal distribution (Fail to reject H0). If the p-value is larger than the significance level, the decision is to fail to reject the null hypothesis. You do not have enough evidence to conclude that your data do not follow a normal distribution.
"""

if result.pvalue <= 0.05:
    print("REJECT (NOT NORMAL)")
else:
    print("NOT ABLE TO REJECT (COULD BE NORMAL)")

"""Looks like the federal funds rate does not have a normal distribution (as of when this notebook was run, on June 28th 2024).
How about the market?
"""

x = df["spy"]

result = normaltest(x)
print(result)

if result.pvalue <= 0.05:
    print("REJECT (NOT NORMAL)")
else:
    print("NOT ABLE TO REJECT (COULD BE NORMAL)")
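"""As a hedged aside (not in the original notebook), Scipy also provides the Shapiro-Wilk normality test, which is often preferred for smaller samples. A minimal sketch on synthetic data, using the same decision rule as above:"""

```python
# Sketch (assumption: synthetic, roughly normal data stands in for df["fed"])
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
sample = rng.normal(loc=2.5, scale=0.5, size=100)  # synthetic normal draws

stat, pvalue = shapiro(sample)
print(f"statistic={stat:.4f}, p-value={pvalue:.4f}")

# same decision rule as above: a small p-value is evidence against normality
if pvalue <= 0.05:
    print("REJECT (NOT NORMAL)")
else:
    print("NOT ABLE TO REJECT (COULD BE NORMAL)")
```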

"""## T-Tests
https://www.investopedia.com/terms/t/t-test.asp
> A t-test is an inferential statistic used to determine if there is a significant difference between the means of two groups and how they are related. T-tests are used when the data sets follow a normal distribution and have unknown variances, like the data set recorded from flipping a coin 100 times.
### T-Test Considerations
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/#sec-2title
In order to conduct a T-Test, the data needs to be normally distributed. So the examples below may not be the most methodologically sound. However, they should provide code examples you can adapt for other use cases in the future.
### 2 Sample T-Test
A two sample T-test is used to determine whether the means of two independent samples are statistically different.
Let's split the most recent years' rates from the rest, and see if the recent rates are statistically different.
"""

#cutoff_date = "2022-06-01" # you can choose a different one if you'd like
cutoff_date = "2022-10-01"

rates_recent = df[df["timestamp"] >= cutoff_date]["fed"]
print(len(rates_recent))
print(rates_recent)

rates_historic = df[df["timestamp"] < cutoff_date]["fed"]
print(len(rates_historic))
print(rates_historic)

"""https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
> Calculate the T-test for the means of two independent samples of scores.
>
> This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
>
> The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the probability of observing as or more extreme values assuming the null hypothesis, that the samples are drawn from populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means.
"""

print(rates_recent.var())
print(rates_historic.var())

from scipy.stats import ttest_ind

result = ttest_ind(rates_recent, rates_historic)
print(result)

if result.pvalue <= 0.05:
    print("REJECT (MEANS NOT THE SAME)")
else:
    print("NOT ABLE TO REJECT (MEANS COULD BE THE SAME)")
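"""The two variances printed above can differ substantially, which strains the equal-variance assumption noted in the Scipy docs. As a hedged sketch (not in the original notebook), passing `equal_var=False` to `ttest_ind` performs Welch's t-test, which drops that assumption. Shown here on synthetic samples:"""

```python
# Sketch (assumption: synthetic samples stand in for rates_recent / rates_historic)
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=5.0, scale=0.2, size=20)   # e.g. recent, higher rates
sample_b = rng.normal(loc=1.5, scale=1.5, size=200)  # e.g. historic, more spread out

# equal_var=False performs Welch's t-test (no equal-variance assumption)
result = ttest_ind(sample_a, sample_b, equal_var=False)
print(result)

if result.pvalue <= 0.05:
    print("REJECT (MEANS NOT THE SAME)")
else:
    print("NOT ABLE TO REJECT (MEANS COULD BE THE SAME)")
```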

"""### 1 Sample T-Test
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html
> Calculate the T-test for the mean of ONE group of scores.
>
> This is a test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean, popmean.
>
> Under certain assumptions about the population from which a sample is drawn, the confidence interval with confidence level 95% is expected to contain the true population mean in 95% of sample replications.
Suppose we wish to test the null hypothesis that the mean of the fed funds rates is equal to 2.5%.
"""

from scipy.stats import ttest_1samp

x = df["fed"]
print(x.mean())

popmean = 2.5 # for example
result = ttest_1samp(x, popmean=popmean)
print(result)

if result.pvalue <= 0.05:
    print("REJECT (MEAN NOT EQUAL TO POPMEAN)")
else:
    print("NOT ABLE TO REJECT (MEAN COULD BE EQUAL TO POPMEAN)")

ci = result.confidence_interval(confidence_level=0.95)
print(ci)
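"""As a hedged cross-check (not in the original notebook), the same 95% interval can be computed by hand from the t distribution, as mean +/- t* times the standard error of the mean. A minimal sketch on synthetic data:"""

```python
# Sketch (assumption: synthetic data stands in for df["fed"])
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(1)
sample = rng.normal(loc=2.5, scale=1.0, size=50)

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
t_star = t.ppf(0.975, df=n - 1)        # two-sided 95% critical value

manual_low, manual_high = mean - t_star * sem, mean + t_star * sem
print("manual CI:", (manual_low, manual_high))

# compare against the interval scipy computes from the same sample
ci = ttest_1samp(sample, popmean=2.5).confidence_interval(confidence_level=0.95)
print("scipy CI: ", ci)
```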
