Fama-Macbeth Regression in Python

Mehvish Ashiq Oct 10, 2023
  1. Fama-Macbeth Regression and Its Importance
  2. Steps to Implement the Fama-Macbeth Regression in Python
  3. Use LinearModels to Implement the Fama-Macbeth Regression in Python
  4. Alternative Approach to Implement the Fama-Macbeth Regression in Python
Fama-Macbeth Regression in Python

Today’s article educates about Fama-Macbeth regression, its importance, and its implementation.

Fama-Macbeth Regression and Its Importance

In asset pricing theories, we use risk factors to describe asset returns. These risk factors can be microeconomic or macroeconomic.

The microeconomic risk factors include firm size and the firm’s different financial metrics, while macroeconomic risk factors are consumer inflation and unemployment.

The Fama-Macbeth regression is a two-step regression model used to test the asset pricing models. It is a practical approach to measure how correctly these risk factors describe portfolio or asset returns.

This model is useful for determining the risk premium associated with exposure to these factors.

Now, the point is, why do we call Fama-Macbeth, a two-step regression model? Let’s find out these steps below.

  1. This step is about regressing every asset’s return against one or multiple risk factors using the time-series approach. We get the return exposure to every factor known as betas, factor loadings, or factor exposure.
  2. This step is about regressing all assets’ returns against the asset betas acquired in the previous step (step 1) using a cross-section approach. Here, we get a risk premium for every factor.

The projected premium over time for the unit exposure to every risk factor is then calculated by averaging coefficients once for every element, according to Fama and Macbeth.

Steps to Implement the Fama-Macbeth Regression in Python

According to the update to reflect library circumstances for Fama-Macbeth as of Fall 2018, fama_macbeth has been eradicated from the Python pandas module for a while now.

So, how can we implement Fama-Macbeth if we are working with Python? We have the following two approaches that we will learn one by one in this tutorial.

  1. We can use the fama_macbeth function in LinearModels if we use Python version 3.
  2. If we are not required to use LinearModels, or we are using Python version 2, then most probably the best case would be to write our implementation for Fama-Macbeth.

Use LinearModels to Implement the Fama-Macbeth Regression in Python

  • Import the necessary modules and libraries.
    import pandas as pd
    import numpy as np
    from pathlib import Path
    from linearmodels.asset_pricing import LinearFactorModel
    from statsmodels.api import OLS, add_constant
    import matplotlib.pyplot as plt
    import pandas_datareader.data as web
    import seaborn as sns
    import warnings
    
    warnings.filterwarnings("ignore")
    sns.set_style("whitegrid")
    

    First, we import all the modules and libraries we will need to implement Fama-Macbeth using LinearModels. A brief description of all of them is given below:

    • We import pandas for working with data frames and numpy for arrays.
    • The pathlib makes a path to the specified file by placing this specific script in the Path object.
    • We import LinearFactorModel from linearmodels.asset_pricing. The linear factor models relates the return on the asset to values of a restricted/limited number of factors, with a relationship that is described by a linear equation.
    • Next, we import OLS to evaluate the linear regression model and add_constant to add a column of ones to the array. You can learn more about statsmodels here.
    • After that, we import pandas_datareader to access the latest remote data to use with pandas. It works for various pandas versions.
    • We import matplotlib and seaborn libraries for data plotting and visualization purposes.
    • We import Warnings to use its filterwarnings() method, which ignores warnings.
    • Lastly, we use the set_style() method from the seaborn module, which sets the parameters that control plots’ general style.
  • Access remote risk factor & research portfolio dataset.
    ff_research_data_five_factor = "F-F_Research_Data_5_Factors_2x3"
    ff_factor_dataset = web.DataReader(
        ff_research_data_five_factor, "famafrench", start="2010", end="2017-12"
    )[0]
    

    Here, we are using the updated risk factor & research portfolio dataset (the five Fama-French factors) available on their website to return to a monthly frequency which we get for 2010-2017 as given above.

    We use DataReader() to extract data from the specified internet resource into the pandas data frame, which is ff_factor_dataset in our code fence. The DataReader() supports various sources, for instance, Tiingo, IEX, Alpha Vantage, and more that you can read on this page.

    print("OUTPUT for info(): \n")
    print(ff_factor_dataset.info())
    

    Next, we use df.info() and df.describe() methods where info() prints the data frame’s information which includes the number of columns, data types of columns, column labels, range index, memory usage, and the number of cells in every column (non-null values).

    You can see the output produced by info() below.

    OUTPUT:

    OUTPUT for info():
    
    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 96 entries, 2010-01 to 2017-12
    Freq: M
    Data columns (total 6 columns):
    #   Column Non-Null Count Dtype
    ---  ------  --------------  -----
    0   Mkt-RF 96 non-null     float64
    1   SMB     96 non-null     float64
    2   HML     96 non-null     float64
    3   RMW     96 non-null     float64
    4   CMA     96 non-null     float64
    5   RF      96 non-null     float64
    dtypes: float64(6)
    memory usage: 5.2 KB
    None
    

    Next, we use the describe() method as follows.

    print("OUTPUT for describe(): \n")
    print(ff_factor_dataset.describe())
    

    The describe() method shows the data frame’s statistical summary; we can also use this method for Python series. This statistical summary contains mean, median, count, standard deviation, columns’ percentile values, and min-max.

    You can find the output of the describe() method below.

    OUTPUT:

    OUTPUT for describe():
    
    		Mkt-RF        SMB        HML        RMW        CMA         RF
    count 96.000000 96.000000 96.000000 96.000000 96.000000 96.000000
    mean    1.158438   0.060000  -0.049271   0.129896   0.047708   0.012604
    std     3.580012   2.300292   2.202912   1.581930   1.413033   0.022583
    min    -7.890000  -4.580000  -4.700000  -3.880000  -3.240000   0.000000
    25%    -0.917500  -1.670000  -1.665000  -1.075000  -0.952500   0.000000
    50%     1.235000   0.200000  -0.275000   0.210000   0.010000   0.000000
    75%     3.197500   1.582500   1.205000   1.235000   0.930000   0.010000
    max    11.350000   7.040000   8.190000   3.480000   3.690000   0.090000
    
  • Access 17 industry portfolios and subtract risk factor rate.
    ff_industry_portfolio = "17_Industry_Portfolios"
    ff_industry_portfolio_dataset = web.DataReader(
        ff_industry_portfolio, "famafrench", start="2010", end="2017-12"
    )[0]
    ff_industry_portfolio_dataset = ff_industry_portfolio_dataset.sub(
        ff_factor_dataset.RF, axis=0
    )
    

    Here, we use DataReader() to access 17 industry portfolios or assets at a monthly frequency and subtract the risk-free rate (RF) from the data frame returned by DataReader(). Why? It is because the factor model works with the excess returns.

    Now, we will use the info() method to get information about the chained data frame.

    print(ff_industry_portfolio_dataset.info())
    

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 96 entries, 2010-01 to 2017-12
    Freq: M
    Data columns (total 17 columns):
    #   Column Non-Null Count Dtype
    ---  ------  --------------  -----
    0   Food    96 non-null     float64
    1   Mines   96 non-null     float64
    2   Oil     96 non-null     float64
    3   Clths   96 non-null     float64
    4   Durbl   96 non-null     float64
    5   Chems   96 non-null     float64
    6   Cnsum   96 non-null     float64
    7   Cnstr   96 non-null     float64
    8   Steel   96 non-null     float64
    9   FabPr   96 non-null     float64
    10 Machn   96 non-null     float64
    11 Cars    96 non-null     float64
    12 Trans   96 non-null     float64
    13 Utils   96 non-null     float64
    14 Rtail   96 non-null     float64
    15 Finan   96 non-null     float64
    16 Other   96 non-null     float64
    dtypes: float64(17)
    memory usage: 13.5 KB
    None
    

    Similarly, we describe this data frame using the describe() method.

    print(ff_industry_portfolio_dataset.describe())
    

    OUTPUT:

    		 Food       Mines      Oil        Clths      Durbl      Chems  \
    count 96.000000 96.000000 96.000000 96.000000 96.000000 96.000000
    mean    1.046771   0.202917   0.597187   1.395833   1.151458   1.305000
    std     2.800555   7.904401   5.480938   5.024408   5.163951   5.594161
    min    -5.170000 -24.380000 -11.680000 -10.000000 -13.160000 -17.390000
    25%    -0.785000  -5.840000  -3.117500  -1.865000  -2.100000  -1.445000
    50%     0.920000  -0.435000   0.985000   1.160000   1.225000   1.435000
    75%     3.187500   5.727500   4.152500   3.857500   4.160000   4.442500
    max     6.670000 21.940000 15.940000 17.190000 16.610000 18.370000
    
    		 Cnsum      Cnstr      Steel      FabPr      Machn      Cars   \
    count 96.000000 96.000000 96.000000 96.000000 96.000000 96.000000
    mean    1.186979   1.735521   0.559167   1.350521   1.217708   1.279479
    std     3.142989   5.243314   7.389679   4.694408   4.798098   5.719351
    min    -7.150000 -14.160000 -20.490000 -11.960000  -9.070000 -11.650000
    25%    -0.855000  -2.410000  -4.395000  -1.447500  -2.062500  -1.245000
    50%     1.465000   2.175000   0.660000   1.485000   1.525000   0.635000
    75%     3.302500   5.557500   4.212500   3.837500   4.580000   4.802500
    max     8.260000 15.510000 21.350000 17.660000 14.750000 20.860000
    
    		 Trans      Utils      Rtail      Finan      Other
    count 96.000000 96.000000 96.000000 96.000000 96.000000
    mean    1.463750   0.896458   1.233958   1.248646   1.290938
    std     4.143005   3.233107   3.512518   4.839150   3.697608
    min    -8.560000  -6.990000  -9.180000 -11.140000  -7.890000
    25%    -0.810000  -0.737500  -0.952500  -1.462500  -1.090000
    50%     1.480000   1.240000   0.865000   1.910000   1.660000
    75%     4.242500   2.965000   3.370000   4.100000   3.485000
    max    12.980000   7.840000 12.440000 13.410000 10.770000
    
  • Compute the excess returns.

    Before moving toward the calculation of excess returns, we have to perform a few more steps.

    data_store = Path("./data/assets.h5")
    
    wiki_prices_df = pd.read_csv(
        "./dataset/wiki_prices.csv",
        parse_dates=["date"],
        index_col=["date", "ticker"],
        infer_datetime_format=True,
    ).sort_index()
    
    us_equities_data_df = pd.read_csv("./dataset/us_equities_data.csv")
    
    with pd.HDFStore(data_store) as hdf_store:
        hdf_store.put("quandl/wiki/prices", wiki_prices_df)
    
    with pd.HDFStore(data_store) as hdf_store:
        hdf_store.put("us_equities/stocks", us_equities_data_df.set_index("ticker"))
    

    First, we create an assets.h5 file in a data folder which resides in the current directory. Next, we use the read_csv() method to read the wiki_prices.csv and us_equities_data.csv files in the dataset folder from the same directory.

    After that, we use HDFStore() as given above to store data in HDF5 format.

    with pd.HDFStore("./data/assets.h5") as hdf_store:
        prices = hdf_store["/quandl/wiki/prices"].adj_close.unstack().loc["2010":"2017"]
        equities = hdf_store["/us_equities/stocks"].drop_duplicates()
    
    sectors = equities.filter(prices.columns, axis=0).sector.to_dict()
    prices = prices.filter(sectors.keys()).dropna(how="all", axis=1)
    
    returns_df = prices.resample("M").last().pct_change().mul(100).to_period("M")
    returns_df = returns.dropna(how="all").dropna(axis=1)
    print(returns_df.info())
    

    In the above code fence, we read the /quandl/wiki/prices & /us_equities/stocks that we just stored in the assets.h5 file and saved them in the prices and equities variables.

    Then, we apply some filters on prices and equities, resample, and remove missing values. Finally, we print the information of a data frame returns_df using the info() method; you can see the output below.

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 95 entries, 2010-02 to 2017-12
    Freq: M
    Columns: 1986 entries, A to ZUMZ
    dtypes: float64(1986)
    memory usage: 1.4 MB
    None
    

    Now, execute the following code to align the data.

    ff_factor_dataset = ff_factor_dataset.loc[returns_df.index]
    ff_industry_portfolio_dataset = ff_industry_portfolio_dataset.loc[returns_df.index]
    print(ff_factor_dataset.describe())
    

    OUTPUT:

    		Mkt-RF        SMB        HML        RMW        CMA         RF
    count 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000
    mean    1.206000   0.057053  -0.054316   0.144632   0.043368   0.012737
    std     3.568382   2.312313   2.214041   1.583685   1.419886   0.022665
    min    -7.890000  -4.580000  -4.700000  -3.880000  -3.240000   0.000000
    25%    -0.565000  -1.680000  -1.670000  -0.880000  -0.965000   0.000000
    50%     1.290000   0.160000  -0.310000   0.270000   0.010000   0.000000
    75%     3.265000   1.605000   1.220000   1.240000   0.940000   0.010000
    max    11.350000   7.040000   8.190000   3.480000   3.690000   0.090000
    

    Now, we are ready to calculate the excess returns.

    excess_returns_df = returns_df.sub(ff_factor_dataset.RF, axis=0)
    excess_returns_df = excess_returns_df.clip(
        lower=np.percentile(excess_returns_df, 1),
        upper=np.percentile(excess_returns_df, 99),
    )
    excess_returns_df.info()
    

    In the above code block, we subtract the risk factor from the ff_factor_dataset and save the returned data frame in excess_returns_df.

    Next, we use the .clip() method that trims the values at the specified input threshold(s). Lastly, use info() to print the information of the excess_returns_df data frame.

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 95 entries, 2010-02 to 2017-12
    Freq: M
    Columns: 1986 entries, A to ZUMZ
    dtypes: float64(1986)
    memory usage: 1.4 MB
    

    Before moving towards the first stage of Fama-Macbeth regression, we use the .drop() method, which drops the specified label from columns or rows. Note that axis=1 means a drop from a column, and axis=0 means a drop from rows.

    ff_factor_dataset = ff_factor_dataset.drop("RF", axis=1)
    print(ff_factor_dataset.info())
    

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 95 entries, 2010-02 to 2017-12
    Freq: M
    Data columns (total 5 columns):
    #   Column Non-Null Count Dtype
    ---  ------  --------------  -----
    0   Mkt-RF 95 non-null     float64
    1   SMB     95 non-null     float64
    2   HML     95 non-null     float64
    3   RMW     95 non-null     float64
    4   CMA     95 non-null     float64
    dtypes: float64(5)
    memory usage: 4.5 KB
    None
    
  • Implement the Fama-Macbeth regression step 1: Factor Exposures.
    betas = []
    for industry in ff_industry_portfolio_dataset:
        step_one = OLS(
            endog=ff_industry_portfolio_dataset.loc[ff_factor_dataset.index, industry],
            exog=add_constant(ff_factor_dataset),
        ).fit()
        betas.append(step_one.params.drop("const"))
    
    betas = pd.DataFrame(
        betas,
        columns=ff_factor_dataset.columns,
        index=ff_industry_portfolio_dataset.columns,
    )
    print(betas.info())
    

    The code snippet above implements the first step of Fama-Macbeth regression and accesses the 17-factor loading estimates. Here, we use OLS() to evaluate the linear regression model and add_constant() to add a column of ones to the array.

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    Index: 17 entries, Food to Other
    Data columns (total 5 columns):
    #   Column Non-Null Count Dtype
    ---  ------  --------------  -----
    0   Mkt-RF 17 non-null     float64
    1   SMB     17 non-null     float64
    2   HML     17 non-null     float64
    3   RMW     17 non-null     float64
    4   CMA     17 non-null     float64
    dtypes: float64(5)
    memory usage: 1.3+ KB
    None
    
  • Implement the Fama-Macbeth regression step 2: Risk Premia.
    lambdas = []
    for period in ff_industry_portfolio_dataset.index:
        step_two = OLS(
            endog=ff_industry_portfolio_dataset.loc[period, betas.index], exog=betas
        ).fit()
        lambdas.append(step_two.params)
    
    lambdas = pd.DataFrame(
        lambdas, index=ff_industry_portfolio_dataset.index, columns=betas.columns.tolist()
    )
    print(lambdas.info())
    

    In the second step, we run ninety-six regressions of period returns for the portfolio’s cross-section on factor loadings.

    OUTPUT:

    <class 'pandas.core.frame.DataFrame'>
    PeriodIndex: 95 entries, 2010-02 to 2017-12
    Freq: M
    Data columns (total 5 columns):
    #   Column Non-Null Count Dtype
    ---  ------  --------------  -----
    0   Mkt-RF 95 non-null     float64
    1   SMB     95 non-null     float64
    2   HML     95 non-null     float64
    3   RMW     95 non-null     float64
    4   CMA     95 non-null     float64
    dtypes: float64(5)
    memory usage: 6.5 KB
    None
    

    We can visualize the results as follows:

    window = 24  # here 24 is the number of months
    axis1 = plt.subplot2grid((1, 3), (0, 0))
    axis2 = plt.subplot2grid((1, 3), (0, 1), colspan=2)
    lambdas.mean().sort_values().plot.barh(ax=axis1)
    lambdas.rolling(window).mean().dropna().plot(
        lw=1, figsize=(14, 5), sharey=True, ax=axis2
    )
    sns.despine()
    plt.tight_layout()
    

    OUTPUT:

    fama macbeth regression in python - results

  • Implement the Fama-Macbeth regression with the LinearModels module.
    model = LinearFactorModel(
        portfolios=ff_industry_portfolio_dataset, factors=ff_factor_dataset
    )
    result = model.fit()
    print(result)
    

    Here, we use the LinearModels to implement the two-step Fama-Macbeth procedure that produces the following output. We can use print(result.full_summary) instead of print(result) to get the full summary.

    OUTPUT:

    					LinearFactorModel Estimation Summary
    ================================================================================
    No. Test Portfolios:                 17   R-squared:                      0.6879
    No. Factors:                          5   J-statistic:                    15.619
    No. Observations:                    95   P-value                         0.2093
    Date:                  Mon, Oct 24 2022   Distribution:                 chi2(12)
    Time:                          20:53:52
    Cov. Estimator:                  robust
    
    						  Risk Premia Estimates
    ==============================================================================
    		  Parameter Std. Err.     T-stat    P-value    Lower CI    Upper CI
    ------------------------------------------------------------------------------
    Mkt-RF         1.2355     0.4098     3.0152     0.0026      0.4324      2.0386
    SMB            0.0214     0.8687     0.0246     0.9804     -1.6813      1.7240
    HML           -1.1140     0.6213    -1.7931     0.0730     -2.3317      0.1037
    RMW           -0.2768     0.8133    -0.3403     0.7336     -1.8708      1.3172
    CMA           -0.5078     0.5666    -0.8962     0.3701     -1.6183      0.6027
    ==============================================================================
    
    Covariance estimator:
    HeteroskedasticCovariance
    See full_summary for complete results
    

Alternative Approach to Implement the Fama-Macbeth Regression in Python

We can use this approach if we don’t want to use LinearModels or using Python version 2.

  • Import the modules and libraries.
    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as sm
    

    We import pandas to work with data frames, numpy to play with arrays, and statsmodels.formula.api, which is a convenience interface to specify models via formula string and data frames.

  • Read and query dataset.

    Assume that we have Fama-French industry assets/portfolios in the panel as given below (we have also calculated a few variables, for instance, past beta and returns to utilize as our x variables)

    data_df = pd.read_csv("industry_data.csv", parse_dates=["caldt"])
    data_df.query("caldt == '1995-07-01'")
    

    OUTPUT:

    	industry      caldt    ret    beta r12to2 r36to13
    18432     Aero 1995-07-01   6.26 0.9696 0.2755   0.3466
    18433    Agric 1995-07-01   3.37 1.0412 0.1260   0.0581
    18434    Autos 1995-07-01   2.42 1.0274 0.0293   0.2902
    18435    Banks 1995-07-01   4.82 1.4985 0.1659   0.2951
    
  • Use groupby() to compute cross-sectional regression model month-by-month.
    def ols_coefficient(x, formula):
        return sm.ols(formula, data=x).fit().params
    
    
    gamma_df = data_df.groupby("caldt").apply(
        ols_coefficient, "ret ~ 1 + beta + r12to2 + r36to13"
    )
    gamma_df.head()
    

    Here, we use groupby because Fama-Macbeth involves calculating the exact cross-sectional regression model month-by-month. We can create a function that accepts a data frame (which is returned by groupby) and a pasty formula; it will then fit the model and return parameter estimates.

    OUTPUT:

    		  Intercept      beta     r12to2   r36to13
    caldt
    1963-07-01  -1.497012 -0.765721   4.379128 -1.918083
    1963-08-01 11.144169 -6.506291   5.961584 -2.598048
    1963-09-01  -2.330966 -0.741550 10.508617 -4.377293
    1963-10-01   0.441941 1.127567   5.478114 -2.057173
    1963-11-01   3.380485 -4.792643   3.660940 -1.210426
    
  • Compute mean and standard error on mean.
    def fm_summary(p):
        s = p.describe().T
        s["std_error"] = s["std"] / np.sqrt(s["count"])
        s["tstat"] = s["mean"] / s["std_error"]
        return s[["mean", "std_error", "tstat"]]
    
    
    fm_summary(gamma_df)
    

    Next, we calculate the mean, t-test (you can use any statistics you want), and standard error on the mean, which is something as above.

    OUTPUT:

    			 mean std_error     tstat
    Intercept 0.754904   0.177291 4.258000
    beta      -0.012176   0.202629 -0.060092
    r12to2     1.794548   0.356069 5.039896
    r36to13    0.237873   0.186680 1.274230
    
  • Improve speed and use the fama_macbeth function.
    def ols_np(dataset, y_var, x_var):
        gamma_df, _, _, _ = np.linalg.lstsq(dataset[x_var], dataset[y_var], rcond=None)
        return pd.Series(gamma_df)
    

    This step is important if we are concerned about efficiency. If so, we can switch from statsmodels to the numpy.linalg.lstsq.

    To perform ols estimations, we can write a function similar to the above. Note that we are not doing anything to check these matrices’ ranks.

    If you are using the older pandas version, then the following should work. Let’s have an example using the fama_macbeth in pandas.

    print(data_df)
    fm = pd.fama_macbeth(y=data_df["y"], x=data_df[["x"]])
    print(fm)
    

    Here, observe the structure below. The fama_macbeth wants the x-var and y-var to have a multi-index with date as the first variable and stock/firm/entity id as the second variable in the index.

    OUTPUT:

    			  y    x
    date       id
    2012-01-01 1   0.1 0.4
    		 2   0.3 0.6
    		 3   0.4 0.2
    		 4   0.0 1.2
    2012-02-01 1   0.2 0.7
    		 2   0.4 0.5
    		 3   0.2 0.1
    		 4   0.1 0.0
    2012-03-01 1   0.4 0.8
    		 2   0.6 0.1
    		 3   0.7 0.6
    		 4   0.4 -0.1
    
    
    ----------------------Summary of the Fama-Macbeth Analysis-------------------------
    
    Formula: Y ~ x + intercept
    # betas :   3
    
    ----------------------Summary of the Estimated Coefficients------------------------
       Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
    		(x)       -0.0227        0.1276         -0.18       -0.2728        0.2273
    (intercept)        0.3531        0.0842          4.19        0.1881        0.5181
    
    --------------------------------End of the Summary---------------------------------
    

    Note that we are calling fm.summary by just printing the fm as we did in the above code fence. Additionally, the fama_macbeth automatically adds the intercept and x-var must be a data frame.

    We can do as follows if we don’t want to have intercept.

    fm = pd.fama_macbeth(y=data_df["y"], x=data_df[["x"]], intercept=False)
    
Mehvish Ashiq avatar Mehvish Ashiq avatar

Mehvish Ashiq is a former Java Programmer and a Data Science enthusiast who leverages her expertise to help others to learn and grow by creating interesting, useful, and reader-friendly content in Computer Programming, Data Science, and Technology.

LinkedIn GitHub Facebook

Related Article - Python Regression