Common operations in Pandas

This is another “do not forget” post.

If you munge data in Python then you really should be using Pandas. These are some notes about common operations.

Hierachical index

Given a set of data like this (refered to as Table A):

  Bid Mid Offer Variable
1994-01-03 16:00:00 0.9980 0.9981 0.9981 ARS
1994-01-03 16:00:00 12.2190 12.2215 12.2240 ATS
1994-01-03 16:00:00 0.6835 0.6838 0.6840 AUD
1994-01-03 16:00:00 39.0000 39.0000 39.0000 BDT
1994-01-03 16:00:00 36.2060 36.2425 36.2790 BEF

It often makes more sense to represent it as a hierachical index (referred to as Table B), e.g.

    Bid Mid Offer
index Variable      
1994-01-03 16:00:00 ARS 0.9980 0.9981 0.9981
  ATS 12.2190 12.2215 12.2240
  AUD 0.6835 0.6838 0.6840
  BDT 39.0000 39.0000 39.0000
  BEF 36.2060 36.2425 36.2790

This can be done for a pandas.DataFrame df like so

df.reset_index().set_index(['index', 'Variable'])

Subset of columns and a pivot

Consider again Table A above loaded as df, if we wish to filter out a subset of values in the Variable column, and consider the Mid values only for then we can do the following to have a continuous time series:

g10_set = [
'USD', 'EUR', 'JPY', 'CHF', 'GBP', 'AUD', 'CAD', 'SEK',
'NOK', 'NZD'
]

df = df[df['Variable'].isin(g10_set)]
df = df.reset_index().pivot(
    index='index', columns='Variable', values='Mid'
)

Which results in Table C:

Variable AUD CAD CHF EUR GBP JPY NOK  
index                
1994-01-03 16:00:00 0.6838 1.3143 1.4827 NaN 1.4805 112.395 7.5251  
1994-01-04 16:00:00 0.6848 1.3164 1.4861 NaN 1.4819 113.155 7.5259  
1994-01-05 16:00:00 0.6873 1.3153 1.4800 NaN 1.4864 112.760 7.5020  
1994-01-06 16:00:00 0.6843 1.3220 1.4745 NaN 1.4880 112.555 7.4903  
1994-01-07 16:00:00 0.6868 1.3225 1.4705 NaN 1.4905 111.825 7.4865  

Working with the index

To extract indexed data points which are common to df1 from df2, you can do the following:

df2 = df2.ix[df1.index]

To extract ranges of index from df2, the DataFrame may be sliced:

# extract ranges of index
df2.ix[df1.index[0]:df1.index[-1]]

# and similarly specific values may also be accesed, pandas will coerce
# these to pd.Timestamp
df2.ix['2013-12-31 16:00:00']

Other bits and bobs

  • Find row where value of column is max: df['AUD'].argmax()
  • Find indexes of null values pd.isnull(df).any(1).nonzero() which is utilising the nonzero numpy object, i.e. return indicies of elements which are nonzero.