---
title: pandas Practice - Creating and Working with a MultiIndex, Part 2
date: 2019-01-24 20:17:55
tags: [pandas]
toc: true
xiongzhang: true
xiongzhang_images: [main.jpg]

---

<span></span>
<!-- more -->

### Using slicers

You can use slices to select from a MultiIndex. `slice` is a Python built-in (strictly speaking a class rather than a function), and it works like this:

In [10]:
alist = list('abcdefg' * 3)
selector = slice(1, 6, 2)
alist[selector]

['b', 'd', 'f']
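
To make the link with ordinary slicing syntax explicit, here is a small sketch (the variable names are only illustrative) showing that a `slice` object is what Python builds for the `start:stop:step` notation, and that `slice(None)` plays the role of a bare `:`.

```python
letters = list('abcdefg')

# letters[1:6:2] and letters[slice(1, 6, 2)] are the same operation
by_syntax = letters[1:6:2]
by_object = letters[slice(1, 6, 2)]

# slice(None) corresponds to a bare ":" and selects everything
everything = letters[slice(None)]

print(by_syntax, by_object, everything == letters)
```

This is why `slice(None)` shows up in MultiIndex selections: it stands for "take every label on this level".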

We can use `slice` objects to select from a MultiIndex.

First, create a DataFrame:

In [11]:
import pandas as pd
import numpy as np
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('D', 2)])


micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])


dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

dfmi.head()

| | | | lvl0 | a | a | b | b |
|---|---|---|---|---|---|---|---|
| | | | lvl1 | bar | foo | bah | foo |
| A0 | B0 | C0 | D0 | 1 | 0 | 3 | 2 |
| A0 | B0 | C0 | D1 | 5 | 4 | 7 | 6 |
| A0 | B0 | C1 | D0 | 9 | 8 | 11 | 10 |
| A0 | B0 | C1 | D1 | 13 | 12 | 15 | 14 |
| A0 | B0 | C2 | D0 | 17 | 16 | 19 | 18 |


Now select the rows whose first index level is A1, A2 or A3, whose second level is unrestricted, and whose third level is C1 or C3:

In [13]:
dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]

| | | | lvl0 | a | a | b | b |
|---|---|---|---|---|---|---|---|
| | | | lvl1 | bar | foo | bah | foo |
| A1 | B0 | C1 | D0 | 73 | 72 | 75 | 74 |
| A1 | B0 | C1 | D1 | 77 | 76 | 79 | 78 |
| A1 | B0 | C3 | D0 | 89 | 88 | 91 | 90 |
| A1 | B0 | C3 | D1 | 93 | 92 | 95 | 94 |
| A1 | B1 | C1 | D0 | 105 | 104 | 107 | 106 |
| A1 | B1 | C1 | D1 | 109 | 108 | 111 | 110 |
| A1 | B1 | C3 | D0 | 121 | 120 | 123 | 122 |
| A1 | B1 | C3 | D1 | 125 | 124 | 127 | 126 |
| A2 | B0 | C1 | D0 | 137 | 136 | 139 | 138 |
| A2 | B0 | C1 | D1 | 141 | 140 | 143 | 142 |
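
Two details of this selection are easy to miss: label slices in `.loc` include both endpoints, and `slice(None)` keeps every label on its level. The following is a minimal sketch that checks both on a frame with the same row index but, as a simplifying assumption, plain columns `x` and `y`.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    return [f"{prefix}{i}" for i in range(n)]


index = pd.MultiIndex.from_product(
    [mklbl('A', 4), mklbl('B', 2), mklbl('C', 4), mklbl('D', 2)])
frame = pd.DataFrame(np.arange(len(index) * 2).reshape(len(index), 2),
                     index=index, columns=['x', 'y']).sort_index()

selected = frame.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]

# Label slices include both endpoints, so A1, A2 and A3 are all present
first_level = sorted(selected.index.get_level_values(0).unique())
third_level = sorted(selected.index.get_level_values(2).unique())

# 3 first-level labels x 2 x 2 x 2 combinations on the other levels
print(first_level, third_level, len(selected))
```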


You can also use the `pandas.IndexSlice` object to make the same selection with a more natural slice syntax:

In [15]:
idx = pd.IndexSlice

dfmi.loc[idx['A1': 'A3', :, ['C1', 'C3']], :]

| | | | lvl0 | a | a | b | b |
|---|---|---|---|---|---|---|---|
| | | | lvl1 | bar | foo | bah | foo |
| A1 | B0 | C1 | D0 | 73 | 72 | 75 | 74 |
| A1 | B0 | C1 | D1 | 77 | 76 | 79 | 78 |
| A1 | B0 | C3 | D0 | 89 | 88 | 91 | 90 |
| A1 | B0 | C3 | D1 | 93 | 92 | 95 | 94 |
| A1 | B1 | C1 | D0 | 105 | 104 | 107 | 106 |
| A1 | B1 | C1 | D1 | 109 | 108 | 111 | 110 |
| A1 | B1 | C3 | D0 | 121 | 120 | 123 | 122 |
| A1 | B1 | C3 | D1 | 125 | 124 | 127 | 126 |
| A2 | B0 | C1 | D0 | 137 | 136 | 139 | 138 |
| A2 | B0 | C1 | D1 | 141 | 140 | 143 | 142 |
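
The two spellings give the same result because indexing `pd.IndexSlice` simply returns the key it receives. A short sketch of that equivalence (the names `key` and `manual` are illustrative):

```python
import pandas as pd

idx = pd.IndexSlice

# Indexing IndexSlice hands back the key it was given, here a tuple
key = idx['A1':'A3', :, ['C1', 'C3']]

# The same key written out with explicit slice objects
manual = (slice('A1', 'A3'), slice(None), ['C1', 'C3'])

print(key == manual)
```

So `idx[...]` is purely a convenience for writing `a:b` and `:` where plain tuples would require `slice(a, b)` and `slice(None)`.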


Continuing the example above, we can additionally keep only the columns whose second level is `foo`:

In [17]:
dfmi.loc[idx['A1': 'A3', :, ['C1', 'C3']], idx[:, 'foo']]

| | | | lvl0 | a | b |
|---|---|---|---|---|---|
| | | | lvl1 | foo | foo |
| A1 | B0 | C1 | D0 | 72 | 74 |
| A1 | B0 | C1 | D1 | 76 | 78 |
| A1 | B0 | C3 | D0 | 88 | 90 |
| A1 | B0 | C3 | D1 | 92 | 94 |
| A1 | B1 | C1 | D0 | 104 | 106 |
| A1 | B1 | C1 | D1 | 108 | 110 |
| A1 | B1 | C3 | D0 | 120 | 122 |
| A1 | B1 | C3 | D1 | 124 | 126 |
| A2 | B0 | C1 | D0 | 136 | 138 |
| A2 | B0 | C1 | D1 | 140 | 142 |
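
For a selection on a single column level, `DataFrame.xs` is an alternative to `IndexSlice`. The sketch below uses a small two-row frame (an assumption made to keep the example short) with the same column MultiIndex as above and compares both approaches.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('a', 'bar'), ('a', 'foo'), ('b', 'bah'), ('b', 'foo')],
    names=['lvl0', 'lvl1'])
frame = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

idx = pd.IndexSlice
with_slicer = frame.loc[:, idx[:, 'foo']]

# drop_level=False keeps the 'lvl1' level so both results have the same columns
with_xs = frame.xs('foo', level='lvl1', axis=1, drop_level=False)

print(with_slicer.equals(with_xs))
```

Without `drop_level=False`, `xs` removes the selected level from the result, which is often what you want when only one label remains on it.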


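We can also combine a boolean mask with `IndexSlice`. Below we select the rows where both `foo` columns are less than 100 and the third index level is C1 or C2: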

In [20]:
mask = (dfmi[('a', 'foo')] < 100) & (dfmi[('b', 'foo')] < 100)

dfmi.loc[idx[mask, :, ['C1', 'C2']], idx[:, 'foo']]

| | | | lvl0 | a | b |
|---|---|---|---|---|---|
| | | | lvl1 | foo | foo |
| A0 | B0 | C1 | D0 | 8 | 10 |
| A0 | B0 | C1 | D1 | 12 | 14 |
| A0 | B0 | C2 | D0 | 16 | 18 |
| A0 | B0 | C2 | D1 | 20 | 22 |
| A0 | B1 | C1 | D0 | 40 | 42 |
| A0 | B1 | C1 | D1 | 44 | 46 |
| A0 | B1 | C2 | D0 | 48 | 50 |
| A0 | B1 | C2 | D1 | 52 | 54 |
| A1 | B0 | C1 | D0 | 72 | 74 |
| A1 | B0 | C1 | D1 | 76 | 78 |
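
The mask holds one boolean per row and is combined with the label conditions on the other levels. As a rough illustration, the sketch below (on a small two-level frame invented for this example) shows that the single `.loc` call matches a two-step version: filter by value first, then select by label.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['A0', 'A1'], ['C0', 'C1', 'C2']])
frame = pd.DataFrame({'foo': np.arange(6) * 10}, index=index)

idx = pd.IndexSlice
mask = frame['foo'] < 40  # one boolean per row

# Mask and label selection in a single .loc call ...
combined = frame.loc[idx[mask, ['C1', 'C2']], :]

# ... or as two separate steps: filter by value, then select by label
two_step = frame[mask].loc[idx[:, ['C1', 'C2']], :]

print(combined)
print(two_step)
```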


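### Aggregating by index level and aligning data

With a MultiIndex we can aggregate data by one of its levels, for example to compute a sum or a mean. First, create a DataFrame (the `codes` argument was called `labels` before pandas 0.24):

In [23]:
midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
                     codes=[[1, 1, 0, 0], [1, 0, 1, 0]])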


df = pd.DataFrame(np.random.randn(4,2), index=midx)

df

| | | 0 | 1 |
|---|---|---|---|
| one | y | -0.634407 | 0.272985 |
| one | x | -0.546991 | 0.001771 |
| zero | y | 1.801089 | -1.132311 |
| zero | x | 0.2131 | 2.339203 |


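Compute the mean for each value of the first index level (older pandas versions also accepted `df.mean(level=0)`, which was deprecated in pandas 1.3):

In [24]:
df2 = df.groupby(level=0, sort=False).mean()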
df2

| | 0 | 1 |
|---|---|---|
| one | -0.590699 | 0.137378 |
| zero | 1.007094 | 0.603446 |


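If we want to replace every original value with its group mean, we can broadcast the means back to the shape and index of the original data: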

In [28]:
df3 = df2.reindex(df.index, level=0)
df3

| | | 0 | 1 |
|---|---|---|---|
| one | y | -0.590699 | 0.137378 |
| one | x | -0.590699 | 0.137378 |
| zero | y | 1.007094 | 0.603446 |
| zero | x | 1.007094 | 0.603446 |
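
Another way to get group means in the original shape is `groupby(...).transform`, which is not used in this post but does the aggregation and the broadcast in one step. A minimal sketch with fixed numbers (chosen here so the result is easy to verify by hand):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('one', 'y'), ('one', 'x'), ('zero', 'y'), ('zero', 'x')])
frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                     index=index)

# Every row receives the mean of its level-0 group; shape and index are kept
group_means = frame.groupby(level=0).transform('mean')

print(group_means)
```

Here the two `one` rows become `[2.0, 3.0]` and the two `zero` rows become `[6.0, 7.0]`.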


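The step above is a data alignment: the index of `df2` is matched against the first level of the index of `df`, that is, against [one, zero]. What happens if we do not specify the level?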

In [30]:
df4 = df2.reindex(df.index)
df4

| | | 0 | 1 |
|---|---|---|---|
| one | y | NaN | NaN |
| one | x | NaN | NaN |
| zero | y | NaN | NaN |
| zero | x | NaN | NaN |


We can align the data in a more intuitive way:

In [33]:
df_a, df2_a = df.align(df2, level=0)
df2_a

| | | 0 | 1 |
|---|---|---|---|
| one | y | -0.590699 | 0.137378 |
| one | x | -0.590699 | 0.137378 |
| zero | y | 1.007094 | 0.603446 |
| zero | x | 1.007094 | 0.603446 |


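Note that `align` returns aligned versions of both `df` and `df2`, which is why there are two return values. The original objects are not modified in place.

### Swapping the levels of a MultiIndex

The examples speak for themselves. Compare the index before and after the swap: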
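
To illustrate what the two return values look like, here is a small sketch with fixed data (the names `frame`, `means` and `centered` are made up for this example). It also shows one typical use of the aligned pair: subtracting the group mean from every row.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('one', 'y'), ('one', 'x'), ('zero', 'y'), ('zero', 'x')])
frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                     index=index)
means = frame.groupby(level=0).mean()

frame_a, means_a = frame.align(means, level=0)

# The inputs keep their original shapes; only the returned objects are aligned
print(means.shape, means_a.shape)

# Once aligned, the two frames can be combined row by row
centered = frame_a - means_a
print(centered)
```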

In [34]:
df

| | | 0 | 1 |
|---|---|---|---|
| one | y | -0.634407 | 0.272985 |
| one | x | -0.546991 | 0.001771 |
| zero | y | 1.801089 | -1.132311 |
| zero | x | 0.2131 | 2.339203 |


In [35]:
df.swaplevel(0, 1, axis=0)

| | | 0 | 1 |
|---|---|---|---|
| y | one | -0.634407 | 0.272985 |
| x | one | -0.546991 | 0.001771 |
| y | zero | 1.801089 | -1.132311 |
| x | zero | 0.2131 | 2.339203 |


Alternatively, `reorder_levels` achieves the same result, and it can rearrange all levels of the index in a single call:

In [37]:
df.reorder_levels([1, 0], axis=0)

| | | 0 | 1 |
|---|---|---|---|
| y | one | -0.634407 | 0.272985 |
| x | one | -0.546991 | 0.001771 |
| y | zero | 1.801089 | -1.132311 |
| x | zero | 0.2131 | 2.339203 |
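
With only two levels the two methods are interchangeable. The difference shows up with three or more levels, as in this sketch (the level names `outer`, `middle` and `inner` are hypothetical):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['one', 'zero'], ['x', 'y'], [1, 2]],
    names=['outer', 'middle', 'inner'])
s3 = pd.Series(range(8), index=index)

# swaplevel exchanges exactly two levels
swapped = s3.swaplevel('outer', 'inner')

# reorder_levels takes the complete new order and can move all levels at once
rotated = s3.reorder_levels(['inner', 'outer', 'middle'])

print(list(swapped.index.names), list(rotated.index.names))
```

Both methods only relabel the index; the values stay in their original row order, so a `sort_index()` call is usually the next step.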


### Sorting

We can sort by the index with `sort_index`.

In [43]:
import random

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
random.shuffle(tuples)
index = pd.MultiIndex.from_tuples(tuples)
s = pd.Series(np.random.randn(8), index=index)
s

baz  one    0.035299
foo  one   -1.021257
baz  two   -0.225705
foo  two   -0.369259
bar  one   -0.681788
     two    0.873609
qux  two    0.325956
     one   -1.330222
dtype: float64

By default, `sort_index` sorts level by level, starting with level 0:

In [44]:
s.sort_index()

bar  one   -0.681788
     two    0.873609
baz  one    0.035299
     two   -0.225705
foo  one   -1.021257
     two   -0.369259
qux  one   -1.330222
     two    0.325956
dtype: float64

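We can also choose which level to sort by first. The remaining levels are then used to break ties, because `sort_remaining` defaults to `True`: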

In [45]:
s.sort_index(level=1)

bar  one   -0.681788
baz  one    0.035299
foo  one   -1.021257
qux  one   -1.330222
bar  two    0.873609
baz  two   -0.225705
foo  two   -0.369259
qux  two    0.325956
dtype: float64
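
`sort_index` also accepts a list of levels together with one sort direction per level. A minimal sketch, using a small hand-made Series rather than the random one above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('baz', 'one'), ('bar', 'two'), ('baz', 'two'), ('bar', 'one')])
s_demo = pd.Series([1, 2, 3, 4], index=index)

# Level 0 ascending, level 1 descending
mixed = s_demo.sort_index(level=[0, 1], ascending=[True, False])

print(mixed)
```

The result lists `bar` before `baz`, and within each of them `two` before `one`.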

If the MultiIndex has its `names` attribute set, we can pass a level name instead of a level number:

In [46]:
s.index.names=['a', 'b']
s.sort_index(level='b')

a    b  
bar  one   -0.681788
baz  one    0.035299
foo  one   -1.021257
qux  one   -1.330222
bar  two    0.873609
baz  two   -0.225705
foo  two   -0.369259
qux  two    0.325956
dtype: float64


Besides sorting the row index, we can also sort `DataFrame.columns`. First, a look at the data:

In [47]:
dft = df.T
dft

| | one | one | zero | zero |
|---|---|---|---|---|
| | y | x | y | x |
| 0 | -0.634407 | -0.546991 | 1.801089 | 0.2131 |
| 1 | 0.272985 | 0.001771 | -1.132311 | 2.339203 |


In [48]:
dft.sort_index(level=1, axis=1)

| | one | zero | one | zero |
|---|---|---|---|---|
| | x | x | y | y |
| 0 | -0.546991 | 0.2131 | -0.634407 | 1.801089 |
| 1 | 0.001771 | 2.339203 | 0.272985 | -1.132311 |


A sorted index has one clear advantage: you can select data with slices. If the index is not sorted, slicing may raise an error:

In [62]:
s.loc[('baz', 'one' ): ('bar', 'one')]

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (0)'

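We can check whether the index is sorted with `is_monotonic_increasing` (older pandas versions also offered `MultiIndex.is_lexsorted()`, which was deprecated in pandas 1.3):

In [53]:
s.index.is_monotonic_increasing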

False

In [61]:
ss = s.sort_index()
ss.loc[('bar', 'one' ): ('baz', 'one')]

a    b  
bar  one   -0.681788
     two    0.873609
baz  one    0.035299
dtype: float64
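
The whole check-sort-slice sequence can be sketched on a small deterministic Series (the data below is invented for this example, so the behaviour does not depend on a random shuffle):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('baz', 'one'), ('foo', 'one'), ('bar', 'one'), ('bar', 'two')])
s_unsorted = pd.Series([1, 2, 3, 4], index=index)

sorted_before = s_unsorted.index.is_monotonic_increasing

try:
    s_unsorted.loc[('bar', 'one'):('baz', 'one')]
except pd.errors.UnsortedIndexError as exc:
    # Slicing with full tuples needs a lexsorted index
    print('slicing failed:', exc)

# After sorting, the same slice works and includes both endpoints
s_sorted = s_unsorted.sort_index()
sorted_after = s_sorted.index.is_monotonic_increasing
window = s_sorted.loc[('bar', 'one'):('baz', 'one')]

print(sorted_before, sorted_after)
print(window)
```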