diff --git a/docs/_quarto.yml b/docs/_quarto.yml index 24aa486..1d32c16 100644 --- a/docs/_quarto.yml +++ b/docs/_quarto.yml @@ -231,7 +231,7 @@ website: href: notes/applied-stats/basic-tests.ipynb text: "Statistical Tests" - section: - href: notes/applied-stats/correlation.ipynb + href: notes/applied-stats/correlation-2.ipynb text: "Correlation Analysis" diff --git a/docs/notes/applied-stats/correlation-2.ipynb b/docs/notes/applied-stats/correlation-2.ipynb new file mode 100644 index 0000000..86d9fda --- /dev/null +++ b/docs/notes/applied-stats/correlation-2.ipynb @@ -0,0 +1,1424 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Correlation" + ], + "metadata": { + "id": "IXoPRAfSXiNb" + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "Let's revisit our dataset of economic indicators.\n", + "\n", + "We will focus on correlation, and on determining which of these indicators may be positively or negatively correlated with each other. This will allow us to answer questions like, \"is gold a good hedge against inflation?\"."
+ ], + "metadata": { + "id": "WJDXvggrpqav" + } + }, + { + "cell_type": "code", + "source": [ + "from pandas import read_csv\n", + "\n", + "df = read_csv(\"https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-indicators.csv\")\n", + "df.head()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "TRqYcMkMURas", + "outputId": "e1f9b690-2483-42cf-dee7-5f91fdbb6e28" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " timestamp cpi fed spy gld\n", + "0 2024-05-01 314.069 5.33 525.6718 215.30\n", + "1 2024-04-01 313.548 5.33 500.3636 211.87\n", + "2 2024-03-01 312.332 5.33 521.3857 205.72\n", + "3 2024-02-01 310.326 5.33 504.8645 189.31\n", + "4 2024-01-01 308.417 5.33 479.8240 188.45" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "df", + "summary": "{\n \"name\": \"df\",\n \"rows\": 234,\n \"fields\": [\n {\n \"column\": \"timestamp\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 234,\n \"samples\": [\n \"2018-08-01\",\n \"2007-03-01\",\n \"2009-05-01\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cpi\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 30.29922027086648,\n \"min\": 190.3,\n \"max\": 314.069,\n \"num_unique_values\": 231,\n \"samples\": [\n 196.8,\n 252.038,\n 307.026\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fed\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.8796570098480136,\n \"min\": 0.05,\n \"max\": 5.33,\n \"num_unique_values\": 106,\n \"samples\": [\n 3.04,\n 3.08,\n 4.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"spy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 122.94739811506473,\n \"min\": 55.1488,\n \"max\": 525.6718,\n \"num_unique_values\": 233,\n \"samples\": [\n 213.6153,\n 92.7153,\n 81.5622\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gld\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 40.170234378613976,\n \"min\": 41.65,\n \"max\": 215.3,\n \"num_unique_values\": 229,\n \"samples\": [\n 56.7,\n 115.15,\n 180.02\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "source": [ + "print(len(df))\n", + "print(df[\"timestamp\"].min(), \"...\", df[\"timestamp\"].max())" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "w8UKMi9WUasL", + "outputId": "709b7098-a42c-40ee-e12d-ad7a4789afed" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": 
"stdout", + "text": [ + "234\n", + "2004-12-01 ... 2024-05-01\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The primary reason why we fetched all these different datasets and merged them together, is so we can explore the correlation between them." + ], + "metadata": { + "id": "WvOcYANIOceH" + } + }, + { + "cell_type": "markdown", + "source": [ + "**Correlation** is a measure of how two datasets are related to eachother.\n" + ], + "metadata": { + "id": "y7JKi-_ohF37" + } + }, + { + "cell_type": "markdown", + "source": [ + "https://www.investopedia.com/terms/c/correlation.asp\n", + "\n", + "\n", + "\n", + "\n", + "> Investment managers, traders, and analysts find it very important to calculate correlation because the risk reduction benefits of diversification rely on this statistic." + ], + "metadata": { + "id": "W_K7rK_TVDHr" + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's take a quick detour to make a scaled version of this data, to make it easier to plot all these different series on a graph, so we can perhaps start to get a sense of how their movements might correlate (in an unofficial way)." + ], + "metadata": { + "id": "4AxAx5WzSojE" + } + }, + { + "cell_type": "code", + "source": [ + "scaled_df = df.copy()\n", + "scaled_df.index = df[\"timestamp\"] # save the ts for charting, knowing we will remove it\n", + "scaled_df.drop(columns=[\"timestamp\"], inplace=True) # remove the ts column, in preparation to operate on all numeric columns\n", + "scaled_df = scaled_df / scaled_df.max() # dividing all numeric col values by their column's max. 
there are many alternative methods for scaling the data\n", + "scaled_df.head()\n", + "\n", + "import plotly.express as px\n", + "px.line(scaled_df, y=[\"cpi\", \"fed\", \"spy\", \"gld\",\n", + " #\"btc\"\n", + " ],\n", + " title=\"Scaled data over time\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 542 + }, + "id": "jYW5YzrbRDks", + "outputId": "0a055557-f6c0-4161-fd91-1951281df652" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "\n", + "" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Looks like the [...] has been moving [upward/downward] at a time when [...] has been moving [upward/downward]. We might start to suspect they are correlated in a [pos/neg] way.\n", + "\n", + "> NOTE: correlation does not imply causation!" + ], + "metadata": { + "id": "DfJbRoy6TO_T" + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's now perform tests for correlation in more official / formal ways.\n", + "\n" + ], + "metadata": { + "id": "FbVefIAfUjKx" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Correlation Considerations" + ], + "metadata": { + "id": "02Y6JN77j0ti" + } + }, + { + "cell_type": "markdown", + "source": [ + "Certain methods for calculating correlation may depend on the normality of our data's distribution, or the sample size, so we should keep these in mind as we determine if we are able to calculate correlation, and which method to use.\n" + ], + "metadata": { + "id": "b3iuo-AIjDqZ" + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "https://www.investopedia.com/terms/n/nonparametric-method.asp\n", + "\n", + "\n", + "> The nonparametric method refers to a type of statistic that does not make any assumptions about the characteristics of the sample (its parameters) or whether the observed data is quantitative or qualitative.\n", + ">\n", + "> Nonparametric statistics can include certain descriptive statistics, statistical models, inference, and statistical tests. The model structure of nonparametric methods is not specified a priori but is instead determined from data.\n", + ">\n", + "> Common nonparametric tests include Chi-Square, Wilcoxon rank-sum test, Kruskal-Wallis test, and Spearman's rank-order correlation.\n", + ">\n", + "> In contrast, well-known statistical methods such as ANOVA, Pearson's correlation, t-test, and others do make assumptions about the data being analyzed. 
One of the most common parametric assumptions is that population data have a \"normal distribution.\"\n" + ], + "metadata": { + "id": "OMFsf-9-iqS9" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Correlation with `scipy`" + ], + "metadata": { + "id": "Tmhw-IlgWWVY" + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "We can always calculate correlation between two lists of numbers, using the `pearsonr` and `spearmanr` functions from the `scipy` package.\n", + "\n", + "One difference between these two correlation methods is that Spearman is more robust to (i.e. less affected by) outliers. Also being nonparametric, the Spearman method does not assume our data is normally distributed.\n" + ], + "metadata": { + "id": "Z6WfgjKSWjyX" + } + }, + { + "cell_type": "markdown", + "source": [ + "https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html\n", + "\n", + "> Pearson correlation coefficient and p-value for testing non-correlation.\n", + ">\n", + "> The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.\n", + ">\n", + "> This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets." 
+ ], + "metadata": { + "id": "Qm0kJFOiQeM1" + } + }, + { + "cell_type": "markdown", + "source": [ + "https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html\n", + "\n", + "> Calculate a Spearman correlation coefficient with associated p-value.\n", + ">\n", + "> The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.\n", + ">\n", + "> The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. Although calculation of the p-value does not make strong assumptions about the distributions underlying the samples, it is only accurate for very large samples (>500 observations). For smaller sample sizes, consider a permutation test instead (see docs for examples)." 
+ ], + "metadata": { + "id": "JB0RbCo1UqtD" + } + }, + { + "cell_type": "code", + "source": [ + "from scipy.stats import pearsonr\n", + "\n", + "x = df[\"fed\"]\n", + "y = df[\"spy\"]\n", + "\n", + "result = pearsonr(x, y)\n", + "print(result)" + ], + "metadata": { + "id": "Nz5EsO5QWY76", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a1457efb-7f6e-4c57-8e3e-2bc5a5ffc46e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "PearsonRResult(statistic=0.17282057382978896, pvalue=0.008062179433931187)\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "from scipy.stats import spearmanr\n", + "\n", + "x = df[\"fed\"]\n", + "y = df[\"spy\"]\n", + "\n", + "result = spearmanr(x, y)\n", + "print(result)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fKnyEs4aQguI", + "outputId": "1b8af722-2acc-487b-ac5d-9abd0d21c914" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "SignificanceResult(statistic=0.005936198901328186, pvalue=0.9280322090398303)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Correlation Matrix with `pandas`" + ], + "metadata": { + "id": "eqvrD4BYWUMs" + } + }, + { + "cell_type": "markdown", + "source": [ + "OK sure we can calculate correlation between two sets of data. But what if we wanted to calculate correlation between many different data sets? 
We could perhaps set up a loop, but there is an easier way.\n", + "\n", + "If we have a pandas dataframe, we can use its `corr()` method to produce a \"correlation matrix\", which shows us the \"pairwise correlation of columns\", in other words, the correlation of each column with respect to each other column.\n", + "\n", + "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html\n", + "\n" + ], + "metadata": { + "id": "QeUeVEI4VBwB" + } + }, + { + "cell_type": "code", + "source": [ + "#df.corr(method=\"pearson\") # method is pearson by default\n", + "df.corr(method=\"pearson\", numeric_only=True) # numeric_only=True skips the non-numeric timestamp column" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + }, + "id": "weE3kA5NcjR5", + "outputId": "5e4ab037-5b93-4e72-eb1e-448d3325d6e7" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " cpi fed spy gld\n", + "cpi 1.000000 0.078102 0.949065 0.823717\n", + "fed 0.078102 1.000000 0.172821 -0.263213\n", + "spy 0.949065 0.172821 1.000000 0.719160\n", + "gld 0.823717 -0.263213 0.719160 1.000000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"df\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"cpi\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.4295147697055192,\n \"min\": 0.07810235316943387,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.07810235316943387,\n 0.8237168740513675,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fed\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5358342284057406,\n \"min\": -0.263212680762633,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 1.0,\n -0.263212680762633,\n 0.07810235316943387\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"spy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.37854882497720416,\n \"min\": 0.1728205738297885,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.1728205738297885,\n 0.7191601809353221,\n 0.9490654041714747\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gld\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5673812225119731,\n \"min\": -0.263212680762633,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n -0.263212680762633,\n 1.0,\n 0.8237168740513675\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 42 + } + ] + }, + { + "cell_type": "code", + "source": [ + "#df.corr(method=\"spearman\")\n", + "df.corr(method=\"spearman\", numeric_only=True) # numeric_only to suppress warning" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + }, + "id": "0LxOOqZ1dknJ", + "outputId": "56265273-7931-4c60-fa48-ea2729b94188" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " cpi fed spy gld\n", + "cpi 1.000000 -0.102732 0.953588 
0.790661\n", + "fed -0.102732 1.000000 0.005936 -0.308626\n", + "spy 0.953588 0.005936 1.000000 0.714306\n", + "gld 0.790661 -0.308626 0.714306 1.000000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"df\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"cpi\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5165997176052513,\n \"min\": -0.10273218370450073,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n -0.10273218370450073,\n 0.7906610391016119,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fed\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5823683713456621,\n \"min\": -0.3086264001658012,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 1.0,\n -0.3086264001658012,\n -0.10273218370450073\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"spy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.4590773818179508,\n \"min\": 0.005936198901328188,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.005936198901328188,\n 0.7143055161417022,\n 0.9535876162464759\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gld\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5844227878347225,\n \"min\": -0.3086264001658012,\n \"max\": 1.0,\n \"num_unique_values\": 4,\n \"samples\": [\n -0.3086264001658012,\n 1.0,\n 0.7906610391016119\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 43 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "We may begin to notice the diagonal of 1s. This is because each dataset is perfectly positively correlated with itself.\n", + "\n", + "We may also start to notice the symmetry of values mirrored across the diagonal. In other words, the value in column 1, row 4 is the same as the value in column 4, row 1."
+ ], + "metadata": { + "id": "PXqd16CIzQpN" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Plotting Correlation Matrix" + ], + "metadata": { + "id": "_eCIjPrWVm11" + } + }, + { + "cell_type": "markdown", + "source": [ + "It may not be easy to quickly interpret the rest of the values in the correlation matrix, but if we plot it with colors as a \"heat map\" then we will be able to use color to more easily interpret the data and tell a story." + ], + "metadata": { + "id": "CnbhJJ5gVdvq" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Correlation Heatmap with `plotly`" + ], + "metadata": { + "id": "asvEa2V9fEZm" + } + }, + { + "cell_type": "markdown", + "source": [ + "https://plotly.com/python-api-reference/generated/plotly.express.imshow.html" + ], + "metadata": { + "id": "Mb8wY3D82iak" + } + }, + { + "cell_type": "code", + "source": [ + "# https://plotly.com/python/heatmaps/\n", + "# https://plotly.com/python-api-reference/generated/plotly.express.imshow.html\n", + "import plotly.express as px\n", + "\n", + "cor_mat = df.corr(method=\"spearman\", numeric_only=True) # numeric_only=True skips the non-numeric timestamp column\n", + "\n", + "title = \"Spearman Correlation between Economic Indicators\"\n", + "fig = px.imshow(cor_mat,\n", + " height=600, # title=title,\n", + " text_auto=\".2f\", # round to two decimal places\n", + " color_continuous_scale=\"Blues\",\n", + " color_continuous_midpoint=0, # set color midpoint at zero because the correlation coefficient ranges from -1 to 1 (see correlation notes)\n", + " labels={\"x\": \"Indicator\", \"y\": \"Indicator\"}\n", + ")\n", + "fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'}) # https://stackoverflow.com/questions/64571789/center-plotly-title-by-default\n", + "fig.show()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 617 + }, + "id": "z41ZxGpMe2S8", + "outputId": "ff96425e-f5cc-448f-b4d7-deccac060c4e" + }, + "execution_count": null, + "outputs": [ +
{ + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "\n", + "" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "What stories can we tell with the correlation heatmap? Which indicators are most positively correlated? Which are most negatively correlated?\n", + "\n", + "Is gold a hedge against inflation, or is there another indicator which may be a better hedge?\n" + ], + "metadata": { + "id": "_pL3AkGotQXk" + } + } + ] +} \ No newline at end of file diff --git a/docs/notes/applied-stats/overview.qmd b/docs/notes/applied-stats/overview.qmd index 8e49f2e..964d623 100644 --- a/docs/notes/applied-stats/overview.qmd +++ b/docs/notes/applied-stats/overview.qmd @@ -2,4 +2,4 @@ + [Statistical Tests](./basic-tests.ipynb) - + [Correlation](./correlation.ipynb) + + [Correlation](./correlation-2.ipynb)