Get Pandas DataFrame Column Headers as a List

Aliaksei Yursha Dec 10, 2020

Pandas is an open-source package for data analysis in Python. pandas.DataFrame is the primary Pandas data structure. It is a two-dimensional tabular data structure with labeled axes (rows and columns).

A widespread use case is to get a list of column headers from a DataFrame object.

We will reuse the DataFrame object defined below in all other code examples of this tutorial.

>>> import pandas
>>> cities = {
...   'name': ['New York', 'Los Angeles', 'Chicago'],
...   'population': [8601186, 4057841, 2679044],
...   'state': ['NY', 'CA', 'IL'],
... }
>>> data_frame = pandas.DataFrame(cities)

One way to get the DataFrame column names is to iterate over the DataFrame object itself. Iterating over a DataFrame yields its column names in the order in which the columns were defined.

>>> for column in data_frame:
...   print(column)
...
name
population
state

To convert an iterable into a list, call Python's built-in list function on it.

>>> list(data_frame)
['name', 'population', 'state']

However, this method is comparatively slow.

>>> from timeit import timeit
>>> timeit(lambda: list(data_frame))
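
When reading the timings in this tutorial, keep in mind that timeit runs the callable 1,000,000 times by default and returns the total elapsed time in seconds, not the time per call. The following is a minimal sketch of converting that total into a per-call figure. The run count of 10,000 and the names `runs`, `total_seconds`, and `per_call` are illustrative choices, not part of the original example, and the printed number will differ from machine to machine.

```python
from timeit import timeit

import pandas

# Same example data as in the tutorial.
cities = {
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
}
data_frame = pandas.DataFrame(cities)

# Illustrative run count; timeit defaults to number=1_000_000.
runs = 10_000

# timeit returns the total time in seconds for all runs combined.
total_seconds = timeit(lambda: list(data_frame), number=runs)

# Divide by the run count to get the average time of a single call.
per_call = total_seconds / runs
print(f"{per_call * 1e6:.2f} microseconds per call")
```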

We can also access the column labels through the DataFrame.columns attribute and convert them to a list.

>>> list(data_frame.columns)
['name', 'population', 'state']

Alternatively, we can call the DataFrame.columns.tolist() method to achieve the same thing.

>>> data_frame.columns.tolist()
['name', 'population', 'state']
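
As a quick sanity check, the three approaches shown so far can be compared directly. This is a small sketch using the tutorial's DataFrame. The variable names `from_frame`, `from_columns`, and `from_tolist` are introduced here only for illustration.

```python
import pandas

# Same example data as in the tutorial.
cities = {
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
}
data_frame = pandas.DataFrame(cities)

from_frame = list(data_frame)               # iterate over the DataFrame itself
from_columns = list(data_frame.columns)     # iterate over the columns attribute
from_tolist = data_frame.columns.tolist()   # ask the columns object for a list

# All three produce the same plain Python list.
print(from_frame == from_columns == from_tolist)

# The list is a copy: appending to it does not add a column to the DataFrame.
from_tolist.append('country')
print(list(data_frame.columns))
```

Because the result is an ordinary list, it can be modified freely without affecting the DataFrame.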

Neither of these methods is much faster.

>>> timeit(lambda: list(data_frame.columns))
>>> timeit(lambda: data_frame.columns.tolist())

Things change considerably when we go one level further, to the DataFrame.columns.values attribute. As with the DataFrame object and the DataFrame.columns attribute, we can use it to get a sequence of DataFrame column names.

>>> list(data_frame.columns.values)
['name', 'population', 'state']

In our measurements, this approach is 5 to 6 times faster than the previous methods.

>>> timeit(lambda: list(data_frame.columns.values))

Still, the best runtime is achieved by calling tolist() directly on DataFrame.columns.values, which uses the array's own conversion method rather than Python's list function.

>>> data_frame.columns.values.tolist()
['name', 'population', 'state']
>>> timeit(lambda: data_frame.columns.values.tolist())

In our measurements, this approach is more than ten times faster than iterating directly over the DataFrame object. Most engineers will be curious about the reasons behind such a discrepancy in performance.
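
To reproduce the comparison on your own machine, the five approaches can be timed side by side. The sketch below is one possible way to do it and is not the benchmark used for the figures quoted in this tutorial. The `candidates` and `results` dictionaries, the run count of 10,000, and the choice of taking the best of three repeats are all assumptions made for illustration. Absolute numbers, and possibly the ratios, depend on the hardware and on the installed versions of Python, Pandas, and NumPy.

```python
from timeit import repeat

import pandas

# Same example data as in the tutorial.
cities = {
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
}
data_frame = pandas.DataFrame(cities)

# Each entry maps a readable label to the callable being timed.
candidates = {
    'list(df)': lambda: list(data_frame),
    'list(df.columns)': lambda: list(data_frame.columns),
    'df.columns.tolist()': lambda: data_frame.columns.tolist(),
    'list(df.columns.values)': lambda: list(data_frame.columns.values),
    'df.columns.values.tolist()': lambda: data_frame.columns.values.tolist(),
}

runs = 10_000  # illustrative; smaller than timeit's default of 1_000_000
results = {}
for label, func in candidates.items():
    # Take the best of three repeats to reduce noise from other processes.
    best_total = min(repeat(func, number=runs, repeat=3))
    results[label] = best_total / runs  # average seconds per call

# Print the approaches from fastest to slowest.
for label, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{label:<28} {seconds * 1e6:8.2f} microseconds per call")
```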

The answer lies in the data type of the DataFrame.columns.values attribute. It's a NumPy array. NumPy is a Python package for scientific computing, and its maintainers optimize it heavily for performance.
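
The types involved can be inspected directly. The sketch below prints them rather than asserting a specific result for `.values`, because what that attribute returns may depend on the installed Pandas version and on the dtype of the column labels. It also uses `Index.to_numpy()`, which the Pandas documentation recommends over `.values` when a NumPy array is needed. The variable names `columns` and `as_array` are illustrative.

```python
import numpy
import pandas

# Same example data as in the tutorial.
cities = {
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
}
data_frame = pandas.DataFrame(cities)

columns = data_frame.columns

# The columns attribute is a Pandas-level object (an Index).
print(type(columns).__name__)

# The .values attribute exposes the underlying array of labels.
print(type(columns.values).__name__)

# to_numpy() explicitly requests a NumPy array of the labels.
as_array = columns.to_numpy()
print(isinstance(as_array, numpy.ndarray))

# The array's own tolist() method converts it to a plain Python list.
print(as_array.tolist())
```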

Pandas is built on top of NumPy and provides convenient high-level abstractions. Those abstractions add overhead to each call, so performing operations directly on the lower-level NumPy data structures is often faster than performing similar operations on the higher-level Pandas data structures, especially for small objects such as a list of column names.
