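代码收藏家 technical tutorial, 2025-02-11

A systematic introduction to Python pandas

Preface: I really poured my heart and soul into this one 😂

Gripes about pandas: the features are complicated, the parameters are complicated, and the whole thing is messy. Some methods share a name yet do completely different things. The documentation is badly scattered, and reading it feels like eating *. Automatic data alignment is a double-edged sword: it is never stated clearly which methods use it, so many methods can behave unpredictably, yet it plays an important role in grouping, aggregation, and multi-level indexes.

Advice for beginners: stick to the official usage examples as closely as you can. Do not improvise with broadcasting or other mechanisms in an attempt to shorten code or show off, because that is likely to introduce unpredictable behavior. As for all those axis parameters, looking them up each time is the practical choice.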

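pandas data structure basics

Series: similar to a one-dimensional array

Series.index returns the labels, Series.array the data, and Series.shape the shape. If the Series is a single column taken from a DataFrame, Series.name is the column name. The index attribute can be reassigned as a whole, but a single element of the index cannot be modified directly.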
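
The attributes above can be tried out with a small sketch like the following. The Series contents and the variable names are made up for illustration only.

```python
import pandas as pd

# A small labelled Series; the values and the name "demo" are illustrative
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="demo")

labels = list(s.index)   # the labels
shape = s.shape          # a one-element tuple
data = s.array           # pandas extension array wrapping the data

# Replacing the whole index is allowed ...
s.index = ["x", "y", "z"]

# ... but an Index is immutable, so assigning to one element raises TypeError
try:
    s.index[0] = "q"
    item_assignment_failed = False
except TypeError:
    item_assignment_failed = True
```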

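Creating a Series

pd.Series(data, index): data can be a NumPy array, a dict, a scalar, and so on. If index is omitted, a default integer index is created. If data is a dict and index is given, the values are pulled out of data in index order, and labels missing from data become NaN. The index may contain duplicates, but an operation that does not support duplicate index values will raise an error.

import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])  # created from a NumPy array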
# a    0.469112
# b   -0.282863
# c   -1.509059
# d   -1.135632
# e    1.212112
# dtype: float64
s.index
# Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
pd.Series(np.random.randn(5))  # without index, a default integer index is created
# 0   -0.173215
# 1    0.119209
# 2   -1.044236
# 3   -0.861849
# 4   -2.104569
# dtype: float64
pd.Series({"b": 1, "a": 0, "c": 2})  # created from a dict
# b    1
# a    0
# c    2
# dtype: int64 
pd.Series({"b": 1, "a": 0, "c": 2}, index=['a', 'b', 'c', 'd'])  # with index given, values are pulled out in index order
# a    0.0
# b    1.0
# c    2.0
# d    NaN
# dtype: float64 
pd.Series(5.0, index=range(3))  # created from a scalar
# 0    5.0
# 1    5.0
# 2    5.0
# dtype: float64

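DataFrame: similar to a two-dimensional array

DataFrame.index returns the row labels, DataFrame.columns the column labels, DataFrame.values the data, and DataFrame.shape the shape. The index or columns attribute can be reassigned as a whole, but a single element cannot be modified directly (for example, df.index[0]=1 raises an error).

Creating a DataFrame

pd.DataFrame(data, index, columns): data can be a Series, a dict whose values are Series, a two-dimensional NumPy array, a list of dicts, and so on. If index or columns is omitted, a default integer index is created. If index or columns is given, the data is pulled out in that order. Both index and columns may contain duplicates, but an operation that does not support duplicate labels will raise an error.

When built from a dict, the column labels come from the dict keys and the row labels are the union of the indexes of the dict values. In other words, input is column-wise: one key-value pair is one column. When built from a list of dicts, the column labels are the union of the keys of the dicts in the list, and the row labels are usually the default integer index. In other words, input is row-wise: one dict is one row. See also the orient parameter of pd.DataFrame.from_dict(data, orient='columns'), which controls whether input is read by row or by column.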
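
As a small sketch of the orient parameter mentioned above (the dict contents and the column names "A" and "B" are made up for illustration):

```python
import pandas as pd

data = {"k1": [1, 2], "k2": [3, 4]}

# orient="columns" (the default): each dict key becomes a column label
by_column = pd.DataFrame.from_dict(data, orient="columns")

# orient="index": each dict key becomes a row label;
# column labels can then be supplied separately
by_row = pd.DataFrame.from_dict(data, orient="index", columns=["A", "B"])
```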

d = {  # created from a dict whose values are Series
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
pd.DataFrame(d)  # the resulting index is the union of the indexes of the columns (Series)
#    one  two
# a  1.0  1.0
# b  2.0  2.0
# c  3.0  3.0
# d  NaN  4.0 
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])  # data is pulled out by the given index and columns
#    two three
# d  4.0   NaN
# b  2.0   NaN
# a  1.0   NaN 
pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0, 2.0]})  # created from a dict whose values are NumPy arrays or lists
#    one  two
# 0  1.0  4.0
# 1  2.0  3.0
# 2  3.0  2.0 
pd.DataFrame([{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}])  # created from a list of dicts
#    a   b     c
# 0  1   2   NaN
# 1  5  10  20.0

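Data types: `.dtype`

The categories are: date and time (datetime64, period), intervals (interval), sparse (Sparse), categorical (category), numeric (Int8/16/..., Float32/64), string (object by default, or string), and boolean (boolean).

Use the dtype='...' parameter to set the type at creation time, .astype() to convert, and .infer_objects() to infer the type hidden inside an object column and convert to it.

Use pd.to_numeric() to force conversion to a numeric type and pd.to_datetime() to force conversion to a datetime type. These are top-level pandas functions rather than Series methods. They take an errors='raise' parameter: by default, any conversion failure raises an error. With errors='coerce', elements that fail to convert become NA (NaN or NaT) instead of raising. For a DataFrame, use apply to run the function on each column.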
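
A minimal sketch of the coercing conversions described above. The sample strings are made up, and the variable names are only illustrative.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "oops"])
numbers = pd.to_numeric(raw, errors="coerce")  # the unparseable "oops" becomes NaN

dates = pd.to_datetime(pd.Series(["2025-02-11", "not a date"]), errors="coerce")  # bad entry becomes NaT

# For a DataFrame, apply the converter column by column;
# extra keyword arguments are forwarded to pd.to_numeric
frame = pd.DataFrame({"a": ["1", "x"], "b": ["3", "4"]})
converted = frame.apply(pd.to_numeric, errors="coerce")
```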

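Avoid code of the form df.loc[:, [...]] = df.loc[:, [...]].astype(...), because .loc tries to fit the right-hand side into the existing dtypes, so the type change has no effect. Assigning through df[[...]] = ... overwrites the dtypes instead.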

dft = pd.DataFrame(
{
"A": np.random.rand(3),
"B": 1,
"C": "foo",
"D": pd.Timestamp("20010102"),
"E": pd.Series([1.0] * 3).astype("float32"),
"F": False,
"G": pd.Series([1] * 3, dtype="int8"),
}
)
#           A  B    C          D    E      F  G
# 0  0.035962  1  foo 2001-01-02  1.0  False  1
# 1  0.701379  1  foo 2001-01-02  1.0  False  1
# 2  0.281885  1  foo 2001-01-02  1.0  False  1 
dft.dtypes
# A          float64
# B            int64
# C           object
# D    datetime64[s]
# E          float32
# F             bool
# G             int8
# dtype: object 
dft['A'] = dft['A'].astype('int')  # use astype to change the dtype: column A becomes int
#    A  B    C          D    E      F  G
# 0  0  1  foo 2001-01-02  1.0  False  1
# 1  0  1  foo 2001-01-02  1.0  False  1
# 2  0  1  foo 2001-01-02  1.0  False  1
dft.A.dtypes 
# dtype('int64') 
dft.astype({'A':'int', 'F':'O'}).dtypes  # pass a dict to convert several columns at once ('O' means object)
# A            int64
# B            int64
# C           object
# D    datetime64[s]
# E          float32
# F           object
# G             int8

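Computing with NumPy functions

NumPy arithmetic and functions are supported, but the data is automatically aligned by label first; see the section on automatic data alignment.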
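
A small sketch of what "aligned by label" means in practice. The two Series are made up for illustration; note that they hold the same labels in a different order.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

exp = np.exp(s1)   # a NumPy function applied to a Series returns a labelled Series
total = s1 + s2    # values are matched by label, not by position
```

Here total.loc["a"] is 1 + 30 rather than 1 + 10, because "a" in s1 is paired with "a" in s2.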

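Extracting values or converting to NumPy

Series.values is no longer recommended for extracting values, because it is unclear whether it returns a pandas extension array or a NumPy array. Prefer Series.array or Series.to_numpy(): the former returns a pandas extension array without copying the data, while the latter always returns a NumPy array, which may require copying or converting the values. For DataFrame, .values seems to always return a NumPy array, but for clarity DataFrame.to_numpy() is now recommended as well. (This part of the notes was written late, so many places still use .values; those are left unchanged.)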
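
The difference between the two accessors can be checked with a sketch like this one (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
as_extension = s.array      # a pandas extension array
as_numpy = s.to_numpy()     # always a numpy.ndarray

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
matrix = df.to_numpy()      # int and float columns are upcast to one common dtype
```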

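Index: the basic object that stores the axis labels of every pandas object

Creating an Index:

Pass a list or another sequence to pd.Index(...).

Modifying an Index:

The elements of a pd.Index are immutable, so either use df.rename(), or copy the labels out, modify the copy, and assign it back to the Index. See the section on renaming columns/rows. (Index.rename() is a different method: it changes the name of the Index itself.)

Common operations:

a.union(b) returns the union of two indexes; a.intersection(b) returns the intersection; a.difference(b) returns the difference, that is, the elements in a but not in b ({x | x ∈ A and x ∉ B}); a.symmetric_difference(b) returns the symmetric difference, that is, the elements that appear in only one of a and b.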
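
The four set operations can be sketched on two small indexes (the labels are made up):

```python
import pandas as pd

a = pd.Index(["a", "b", "c"])
b = pd.Index(["b", "c", "d"])

union = a.union(b)                         # labels in a or b
intersection = a.intersection(b)           # labels in both
difference = a.difference(b)               # labels in a but not in b
sym_diff = a.symmetric_difference(b)       # labels in exactly one of a and b
```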

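df.reset_index() moves the index into a column; pass drop=True to discard the index instead. Use df.columns = pd.RangeIndex(df.columns.size) to reset the column labels to integers (the same pattern with df.index and df.index.size resets the row labels).

Automatic data alignment

When a label-based operation is performed on pandas objects, alignment is triggered so that the different pieces of data end up with the same shape and the same order of row and column labels.

Operations that trigger alignment: arithmetic (`+-*/`), assignment through `.loc`, merging or concatenating pandas objects, `reindex`, `align`, `where`, and others.

Because .loc aligns before assigning, df.loc[:, ['B', 'A']] = df[['A', 'B']] cannot swap column data: by the time the assignment happens, the right-hand side has already been reordered to match the original frame. To swap columns, strip the column labels from the right-hand side.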

df = pd.DataFrame(
np.arange(12).reshape(3,4), 
index=['row'+str(x) for x in range(1,4)],
columns=['col'+x for x in list('ABCD')])
#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11  
df.loc[:, ['colA', 'colB']] = df[['colB', 'colA']]  # the columns are NOT swapped
#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11 
df.loc[:, ['colA', 'colB']] = df[['colB', 'colA']].values  # the columns are swapped; .to_numpy() works too
#       colA  colB  colC  colD
# row1     1     0     2     3
# row2     5     4     6     7
# row3     9     8    10    11

Operations that do not trigger alignment: selection with `[]`, assignment through `.iloc`, and any case where the operand has no labels (`.to_numpy()` or `.values`).

#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11
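df.iloc[:, [0, 1]] = df[['colB', 'colA']]  # .iloc assigns by position, so alignment is suppressed
# Alternative (run one or the other, not both): plain [] selection on both sides also suppresses alignment
# df[['colB', 'colA']] = df[['colA', 'colB']]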
#       colA  colB  colC  colD
# row1     1     0     2     3
# row2     5     4     6     7
# row3     9     8    10    11

Loading and saving data on disk (CSV only)

Importing CSV: `pd.read_csv()`

read_csv(file, sep=',', [delimiter], header='infer', [names], [index_col], skipinitialspace=False, [skiprows], [na_values], keep_default_na=True, quotechar='"', [quoting], [comment])

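The parameters of this function fall into a few groups:

Basics:

sep sets the field separator; delimiter is an alias of sep; skiprows skips a number of rows; comment sets the comment character so that comment lines are skipped; skipinitialspace controls whether whitespace after the separator is ignored.

Row and column labels:

names sets the column labels.

header='infer': the default behavior is somewhat involved. If names is given, it behaves like header=None and treats the first valid line of the file as data. If names is not given, it behaves like header=0 and treats the first valid line (position 0) as the column labels. header can also be an integer or a list of integers, meaning that a particular line, or several lines (nested labels), is used as the column labels.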

from io import StringIO
s = StringIO(  # read_csv consumes the buffer, so recreate s before every read_csv call below
'''
col1, col2, col3
1, 2, 3
4, 5, 6
''')
pd.read_csv(s)  # names not given: the first valid line is used as the column labels
#    col1   col2   col3
# 0     1      2      3
# 1     4      5      6
pd.read_csv(s, names=['new1', 'new2', 'new3']) 
#    new1   new2   new3
# 0  col1   col2   col3  # the first valid line was treated as data
# 1     1      2      3
# 2     4      5      6 
pd.read_csv(s, names=['new1', 'new2', 'new3'], header=0)
#    new1  new2  new3
# 0     1     2     3
# 1     4     5     6

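index_col sets the column used as the row labels. If it is not given and the header line has as many fields as the data lines, the default integer row labels are used; if the data lines have more fields, the first column becomes the row index. It can also be set explicitly with a column label or an integer position, and a list produces nested labels. With index_col=False, integer row labels are always used.

s = StringIO(  # read_csv consumes the buffer, so recreate s before every read_csv call below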
'''
col1, col2, col3
r1, 1, 2, 3
r2, 4, 5, 6
''')
pd.read_csv(s)
#     col1   col2   col3  # the data lines have more fields, so the first column becomes the row index
# r1     1      2      3
# r2     4      5      6 
pd.read_csv(s, index_col=1)  # use the second column as the row index
#   col1   col2   col3
# 1   r1      2      3
# 4   r2      5      6 
pd.read_csv(s, index_col=False)  # no column is used as the row index, so the last data column is surplus and gets truncated
#   col1   col2   col3
# 0   r1      1      2
# 1   r2      4      5

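NA recognition:

na_values sets the strings to be recognized as NA or NaN. It works together with keep_default_na below.

keep_default_na=True: when this is True and na_values is given, na_values is appended to the default NA strings. When it is False, the default NA strings are no longer recognized as NA or NaN; if na_values is not given either, nothing is recognized as NA.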

s = StringIO(
'''
col1, col2, col3
1, na, 3
nan, 5, 6
''')
df = pd.read_csv(s)
#    col1  col2   col3
# 0   1.0    na      3
# 1   NaN     5      6 
df.iloc[1,0]  # np.float64(nan)
df.iloc[0,1]  # ' na'  # not NaN
df = pd.read_csv(s, na_values=' na')  # now ' na' is recognized as NA too, but note the leading space
#    col1   col2   col3
# 0   1.0    NaN      3
# 1   NaN    5.0      6 
df = pd.read_csv(s, skipinitialspace=True, na_values='na')  # the space before na is ignored
#    col1  col2  col3
# 0   1.0   NaN     3
# 1   NaN   5.0     6

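Quote handling: by default, a quote character only matters when it is the very first character of a field (with no leading whitespace); a quote in the middle of a field has no effect. With skipinitialspace=True, a quote acts as a quote when it is the first non-whitespace character of the field.

quotechar sets which character opens and closes a quoted item; separators inside a quoted item are ignored.

quoting: it accepts several values, but when reading CSV the values 0, 1, and 2 appear to behave identically, so only the value 3 (csv.QUOTE_NONE) needs attention. Values 0, 1, and 2 process quotes, while 3 does not. When a file has double quotes at the start of fields that may be unpaired, use 3 to disable quote handling.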

s = StringIO(
'''
col1, col2, col3
1,"str1", 3
4, str2, 6
''')
pd.read_csv(s)
#    col1   col2   col3
# 0     1   str1      3
# 1     4   str2      6
pd.read_csv(s, quoting=3)  # quotes are not processed
#    col1    col2   col3
# 0     1  "str1"      3
# 1     4    str2      6 
s = StringIO(
'''
col1, col2, col3
1, "str1, 3
4, str2, 6
''') 
pd.read_csv(s, skipinitialspace=True)  # raises an error
pd.read_csv(s, skipinitialspace=True, quoting=3)  # quotes are not processed
#    col1   col2  col3
# 0     1  "str1     3
# 1     4   str2     6

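Exporting CSV: `df.to_csv()`

to_csv(file, sep=',', na_rep='', header=True, index=True)

sep is the field separator of the output file; na_rep is the string used for missing values; header controls whether the column labels are written; index controls whether the row labels are written.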

s = pd.DataFrame({'col1': [1,2,np.nan], 'col2': [4,5.1,6]})
#    col1  col2
# 0   1.0   4.0
# 1   2.0   5.1
# 2   NaN   6.0 
s.to_csv('test.csv')
# ,col1,col2
# 0,1.0,4.0
# 1,2.0,5.1
# 2,,6.0    # the missing value is written as an empty field
s.to_csv('test.csv', na_rep='nan', header=False)  # set the missing-value string and omit the column labels
# 0,1.0,4.0
# 1,2.0,5.1
# 2,nan,6.0
s.to_csv('test.csv', index=False)  # omit the row labels
# col1,col2
# 1.0,4.0
# 2.0,5.1
# ,6.0

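Indexing, data selection, and modifying elements/columns

.head() and .tail()

Selecting and modifying in place with `[]`

For a Series, `[]` supports labels, positional slices, and boolean arrays. Selecting a single element by integer position with `[]` on a non-integer index is deprecated in recent pandas versions, so `.iloc` is the safer choice for that. For a DataFrame, `[]` supports labels (columns), integer slices (rows), a boolean Series (row filtering), and a boolean DataFrame (same as .where); a single number is not allowed. Pass a list to select several elements or columns.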

s = pd.Series(np.arange(3,6), index=list('abc'))
# a    3
# b    4
# c    5
s.iloc[1]  # by position (plain s[1] on a non-integer index is deprecated in recent pandas)
s['b']
# 4 
s[s >= 4]
# b    4
# c    5

df = pd.DataFrame(
np.arange(12).reshape(3,4), 
index=['row'+str(x) for x in range(1,4)],
columns=['col'+x for x in list('ABCD')]) 
#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11
df['colB'] 
# row1    1
# row2    5
# row3    9
# Name: colB 
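df[1]  # KeyError: a single number is not allowed unless it happens to be a column label
df['colB'].iloc[1]  # to pick a row, index the returned Series again (the returned object must be a Series)
# 5
df[['colB', 'colA']]  # pass a list to select several columns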
#       colB  colA
# row1     1     0
# row2     5     4
# row3     9     8 
df[1:3]  # row slice
#       colA  colB  colC  colD
# row2     4     5     6     7
# row3     8     9    10    11
df[::-1]
#       colA  colB  colC  colD
# row3     8     9    10    11
# row2     4     5     6     7
# row1     0     1     2     3  
df[df['colA'] > 0]  # boolean Series
#       colA  colB  colC  colD
# row2     4     5     6     7
# row3     8     9    10    11 
df[df > 5]  # boolean DataFrame, same as df.where(df > 5)
#       colA  colB  colC  colD
# row1   NaN   NaN   NaN   NaN
# row2   NaN   NaN   6.0   7.0
# row3   8.0   9.0  10.0  11.0

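Assignment

df[['B', 'A']] = df[['A', 'B']] selects the columns directly and therefore swaps them; see the section on automatic data alignment.

Note: avoid code of the form df['col1'][...]=xxx (chained assignment) whenever possible. The assignment may or may not take effect, depending on whether df['col1'] returns a copy or a view of df, so it is hard to predict whether df was actually modified.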
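
One common way to avoid chained assignment is to do the row selection and the column selection in a single `.loc` call. The following is a minimal sketch with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# Instead of the chained form  df["col1"][df["col2"] > 4] = 0,
# select rows and column in one .loc call so the write targets df itself
df.loc[df["col2"] > 4, "col1"] = 0
```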

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11
df['A']=1  # modify a column
#     A  B   C   D
# r1  1  1   2   3
# r2  1  5   6   7
# r3  1  9  10  11 
df[1:2]=1  # modify rows through a slice (applied on top of the change above)
#     A  B   C   D
# r1  1  1   2   3
# r2  1  1   1   1
# r3  1  9  10  11

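In addition, assigning to a label that does not exist enlarges the Series or DataFrame. For a DataFrame, a string or number is interpreted as a column label and so adds a column; rows cannot be added by passing a slice.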

s = pd.Series([1, 2, 3], index=list('abc'))
s['e'] = 4
# a    1
# b    2
# c    3
# e    4 

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11
df[5]=1  # adds a new column labelled 5
#     A  B   C   D  5
# r1  0  1   2   3  1
# r2  4  5   6   7  1
# r3  8  9  10  11  1

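Accessing the index of a Series or the columns of a DataFrame as attributes

Attribute access works only when the label is a valid Python identifier and does not clash with an existing method of the object; for example, s.1 and s.min are not allowed.

Attribute access can modify an existing element of a Series or an existing column of a DataFrame. It cannot add a new element or column, because that creates a new attribute on the object rather than a new element or column.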
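
The pitfall described above can be reproduced with a small sketch. The column name "new_col" is hypothetical; pandas is expected to emit a UserWarning on the attribute assignment, which is silenced here so the sketch runs quietly.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the warning pandas raises for this pattern
    df.new_col = [10, 20]             # sets a plain Python attribute, not a column

created_by_attribute = "new_col" in df.columns

df["new_col"] = [10, 20]              # [] (or .loc) is what actually adds the column
created_by_brackets = "new_col" in df.columns
```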

s = pd.Series([1, 2, 3], index=list('abc'))
s.b
# 2
df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
df.A
# r1    0
# r2    4
# r3    8
# Name: A

Assignment

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
df.A = [11,12,13]
#      A  B   C   D
# r1  11  1   2   3
# r2  12  5   6   7
# r3  13  9  10  11 
df.A = 1
#     A  B   C   D
# r1  1  1   2   3
# r2  1  5   6   7
# r3  1  9  10  11

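`.loc[]`: select or modify data mainly by row and column labels

Because the values are written after the columns are aligned, it cannot be used to swap columns.

It supports a single label, a list of labels, a label slice (both ends inclusive), and a boolean array. For a DataFrame, if only one dimension is given, it selects rows. (Combine boolean conditions with | & ~.)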

s = pd.Series(np.arange(5), index=list('abcde'))
s.loc['c']
# 2
s.loc['b':'d']  # both ends inclusive
# b    1
# c    2
# d    3
s.loc['b':'d'] = 10  # modify by assignment
# a     0
# b    10
# c    10
# d    10
# e     4

df = pd.DataFrame(np.arange(12).reshape(4, 3),
index=list('abcd'),
columns=list('ABC'))
#    A   B   C
# a  0   1   2
# b  3   4   5
# c  6   7   8
# d  9  10  11
df.loc['a']  # selects a row by default
# A    0
# B    1
# C    2
# Name: a 
df.loc[['a', 'c', 'd']]
#    A   B   C
# a  0   1   2
# c  6   7   8
# d  9  10  11 
df.loc[['a', 'c'], 'B':]
#    B  C
# a  1  2
# c  7  8 
df.loc[:, df.loc['b'] > 3]  # select the columns where row b > 3
#     B   C
# a   1   2
# b   4   5
# c   7   8
# d  10  11 
df.loc[:, df.loc['b'].map(lambda x: x%4>0)]  # use the map method of a Series (here, row b) for a more complex filter
#    A   C
# a  0   2
# b  3   5
# c  6   8
# d  9  11

Assignment

df = pd.DataFrame(np.arange(12).reshape(4, 3),
index=list('abcd'),
columns=list('ABC'))
#    A   B   C
# a  0   1   2
# b  3   4   5
# c  6   7   8
# d  9  10  11                  
df.loc['a':'c', ['A', 'C']] = 40
#     A   B   C
# a  40   1  40
# b  40   4  40
# c  40   7  40
# d   9  10  11

In addition, assigning to a label that does not exist enlarges the Series or DataFrame.

s = pd.Series([1, 2, 3], index=list('abc'))
s.loc['e'] = 4
# a    1
# b    2
# c    3
# e    4 
df = pd.DataFrame(np.arange(6).reshape(3,2), index=list('abc'), columns=list('AB'))
#    A  B
# a  0  1
# b  2  3
# c  4  5
df.loc['e','D'] = 10 
#      A    B     D
# a  0.0  1.0   NaN
# b  2.0  3.0   NaN
# c  4.0  5.0   NaN
# e  NaN  NaN  10.0 
df.loc[:,'C'] = 0  # starting again from the original 3x2 df: adds column C
#    A  B  C
# a  0  1  0
# b  2  3  0
# c  4  5  0

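`.iloc[]`: select or modify data mainly by integer row and column positions

It supports a single integer, a list of integers, an integer slice (start inclusive, end exclusive), and a boolean array. For a DataFrame, if only one dimension is given, it selects rows.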

s = pd.Series(np.random.randn(4), index=list(range(0, 8, 2)))
# 0    1.094079
# 2    0.391677
# 4    1.678891
# 6   -0.444310 
s.iloc[:2]  # start inclusive, end exclusive
# 0    1.094079
# 2    0.391677 
s.iloc[2]
# 1.678891 

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD')) 
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11
df.iloc[:2]  # selects rows by default
#     A  B  C  D
# r1  0  1  2  3
# r2  4  5  6  7 
df.iloc[[0,2], [0,1,3]]  # lists of integers
#     A  B   D
# r1  0  1   3
# r3  8  9  11 
df.iloc[:, df.iloc[1]>5]  # raises an error: .iloc does not accept a label-indexed boolean Series as a mask
df.iloc[:, (df.iloc[1]>5).values]  # stripping the labels makes it work
#      C   D
# r1   2   3
# r2   6   7
# r3  10  11

Assignment

df = pd.DataFrame(np.arange(12).reshape(4, 3),
index=list('abcd'),
columns=list('ABC'))
#    A   B   C
# a  0   1   2
# b  3   4   5
# c  6   7   8
# d  9  10  11
df.iloc[:,2] = df.loc[:,'B']
#    A   B   C
# a  0   1   1
# b  3   4   4
# c  6   7   7
# d  9  10  10

iloc cannot assign to a position beyond the existing range; in other words, iloc cannot be used to enlarge a DataFrame.
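
A short sketch of this limit, contrasted with `.loc`, which does enlarge. The frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})   # valid positions are 0 and 1

try:
    df.iloc[2] = 0                 # position 2 does not exist yet
    iloc_raised = False
except IndexError:
    iloc_raised = True

df.loc[2] = 0                      # .loc adds a new row labelled 2
```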

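Indexing rows and columns by label and by integer position at the same time (combining `.loc` and `.iloc`)

With .loc, labels are used as usual, but integer positions must be turned into labels through df.index[...] or df.columns[...].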

df = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]},
index=list('abc'))
#    A  B
# a  1  4
# b  2  5
# c  3  6
df.loc[df.index[[0, 2]], 'A']
# a    1
# c    3
# Name: A 
df.loc[['a','b'], df.columns[1]]
# a    4
# b    5
# Name: B

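With .iloc, integer positions are used as usual, but labels must be turned into positions through df.index.get_loc(...), df.columns.get_loc(...), or .get_indexer([...]). get_loc converts a single label only; get_indexer converts several labels.

df.iloc[[0, 2], df.columns.get_loc('A')]  # get_loc converts a single label only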
# a    1
# c    3
# Name: A 
df.iloc[df.index.get_indexer(['a']), [1,0]]   # get_indexer accepts several labels, but even a single label must be wrapped in a list
#    B  A
# a  4  1

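Everything above extracts a subset. `.where(condition, other=nan)` takes a boolean object `condition`, keeps the shape of the original object, and replaces the elements where the condition is False with `other` (NA by default). The logic resembles `if-else`.

For df.where(condition, other, axis), condition should have the same shape as df (a mismatch ought to raise an error, but in practice it may return an unpredictable DataFrame). other can be a scalar, a Series, or a DataFrame. A scalar is straightforward, and NA is a scalar. If other is a DataFrame, it should have the same shape as df, and its elements are matched to those of df through automatic data alignment. If other is a Series, set the axis parameter so that other is matched against each column or each row of df through automatic data alignment. Note: because alignment is applied, the labels of other must agree with the labels of df for the matching to work.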

s = pd.Series(np.arange(3,6), index=list('abc'))
# a    3
# b    4
# c    5
s[s>3]  # returns only the elements that satisfy the condition
# b    4
# c    5 
s.where(s>3)  # where keeps the shape of the original object
# a    NaN
# b    4.0
# c    5.0 

df = pd.DataFrame(
    np.arange(12).reshape(3,4), 
    index=['row'+str(x) for x in range(1,4)],
    columns=['col'+x for x in list('ABCD')]) 
#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11
df.where(df>5, other=15)  # other is a scalar
#       colA  colB  colC  colD
# row1    15    15    15    15
# row2    15    15     6     7
# row3     8     9    10    11 
df.where(df>5, other=df-5)  # other is a DataFrame with the same shape as df and the same label order
#       colA  colB  colC  colD
# row1    -5    -4    -3    -2
# row2    -1     0     6     7
# row3     8     9    10    11 
other_df = pd.DataFrame(
    np.arange(12,24).reshape(3,4), 
    index=['row'+str(x) for x in range(3,0,-1)],
    columns=['col'+x for x in list('ACBD')]) 
#       colA  colC  colB  colD
# row3    12    13    14    15
# row2    16    17    18    19
# row1    20    21    22    23
df.where(df>5, other=other_df)  # other_df is aligned automatically
#       colA  colB  colC  colD
# row1    20    22    21    23
# row2    16    18     6     7
# row3     8     9    10    11
df.where(df>5, other=other_df.values)  # strip the labels of other_df to suppress alignment
#       colA  colB  colC  colD
# row1    12    13    14    15
# row2    16    17     6     7
# row3     8     9    10    11
df.where(  # set a replacement value (other) per column; the index must match, otherwise NA is assigned
    df%3==0, 
    other=pd.Series(['c1','c2','c3','c4'], index = df.columns), 
    axis='columns'
) 
#      colA colB colC colD
# row1    0   c2   c3    3
# row2   c1   c2    6   c4
# row3   c1    9   c3   c4 
df.where(  # set a replacement value (other) per row; the index must match, otherwise NA is assigned
    df%3==0, 
    other=pd.Series(['r1','r2','r3'], index=df.index), 
    axis='index'
)
#      colA colB colC colD
# row1    0   r1   r1    3
# row2   r2   r2    6   r2
# row3   r3    9   r3   r3

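`.where()` can only replace the values where the boolean object is False, whereas `np.where(condition, x, y)` assigns x where it is True and y where it is False. There is also `np.select()` for more complex decisions.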

df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})
#   col1 col2
# 0    A    Z
# 1    B    Z
# 2    B    X
# 3    C    Y 
df['color'] = np.where(df['col2'] == 'Z', 'green', 'red') 
#   col1 col2  color
# 0    A    Z  green
# 1    B    Z  green
# 2    B    X    red
# 3    C    Y    red 
conditions = [
(df['col2'] == 'Z') & (df['col1'] == 'A'),
(df['col2'] == 'Z') & (df['col1'] == 'B'),
(df['col1'] == 'B')
]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
#   col1 col2   color
# 0    A    Z  yellow
# 1    B    Z    blue
# 2    B    X  purple
# 3    C    Y   black

`DataFrame.query('')`: row selection (filtering) on a DataFrame with a simpler syntax

df = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'))
#           a         b         c
# 0  0.699926  0.193127  0.735952
# 1  0.150201  0.380380  0.241509
# 2  0.914392  0.015050  0.216910
# 3  0.210421  0.244153  0.488899
# 4  0.289512  0.974012  0.985863
# 5  0.364995  0.126003  0.996186
df[(df['a'] < df['b']) & (df['b'] < df['c'])] 
df.query('(a < b) & (b < c)')
df.query('a < b < c')  # these three are equivalent
#           a         b         c
# 3  0.210421  0.244153  0.488899
# 4  0.289512  0.974012  0.985863

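Inserting and deleting rows/columns

Inserting columns

df['new']=... or df.loc[:,'new']=... appends the column at the end, modifying df in place.

df.insert(location, new column name, values) inserts at a given position (integer position only), modifying df in place.

df = df.assign() differs from the two above: assign does not modify in place but returns the modified df, so the result has to be assigned. This is also what lets it build pipelines like dplyr in R (called method chaining in pandas).

Note: in a method chain, when assign operates on an already modified df, code of the form df['A']+df['B'] must be rewritten as lambda dfx: dfx['A']+dfx['B']. The name df still refers to the frame before the earlier steps of the chain, so df['A'] does not give column A of the current intermediate frame. The anonymous function receives that intermediate frame as dfx.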
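
A minimal sketch of positional insertion (the frame and the new column "B" are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "C": [5, 6]})

# Insert a new column "B" at integer position 1, between A and C; df is modified in place
df.insert(1, "B", [3, 4])
```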

df = pd.DataFrame(np.arange(12).reshape(3, 4).T,
index=list('abcd'),
columns=list('ABC'))
#    A  B   C
# a  0  4   8
# b  1  5   9
# c  2  6  10
# d  3  7  11
df.assign(D = df['A'] + df['B']) 
#    A  B   C   D
# a  0  4   8   4
# b  1  5   9   6
# c  2  6  10   8
# d  3  7  11  10 
(  # the outer parentheses are required, otherwise the line cannot break before .assign
df.query('0 < A < 3')
.assign(
D = lambda dfx: dfx.A + dfx.B,
E = lambda dfx: dfx.A / dfx.D  # several columns can be created at once, and later ones can use the ones just created
)
)

Deleting columns

del df['A']

dropped_column = df.pop('A')

`.drop([...], axis=0)` removes labels from the given axis
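
The two in-place deletions above differ only in what they hand back, as this small sketch with made-up data shows:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

del df["A"]             # removes column A in place and returns nothing
popped = df.pop("B")    # removes column B in place and returns it as a Series
```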

df = pd.DataFrame(
np.arange(12).reshape(3,4), 
index=['row'+str(x) for x in range(1,4)],
columns=['col'+x for x in list('ABCD')]) 
#       colA  colB  colC  colD
# row1     0     1     2     3
# row2     4     5     6     7
# row3     8     9    10    11
df.drop('row1')
#       colA  colB  colC  colD
# row2     4     5     6     7
# row3     8     9    10    11
df.drop(['colB','colC'],axis='columns')
#       colA  colD
# row1     0     3
# row2     4     7
# row3     8    11

Renaming columns/rows

Renaming by a known integer position: change the label indirectly through its position

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11
temp = df.columns.to_list()  # Index elements are immutable, so copy the labels into a temporary list
temp[2] = 'NEW'              # edit the copy, then replace the original labels
df.columns = temp
#     A  B  NEW   D
# r1  0  1    2   3
# r2  4  5    6   7
# r3  8  9   10  11 
temp = list('ABCD')  # start again from the original labels
temp[2],temp[3] = temp[3], temp[2]  # swap the column labels without changing the data
df.columns = temp 
#     A  B   D   C
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11

Renaming by label name: `df.rename({old: new, ...} or function, axis)` or `df.index/columns=df.index.to_series().replace({old: new, ...})`

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11
df.rename({'C':'NEW', 'D':'NEW2'}, axis='columns')  # rename returns a new frame; df itself is unchanged
#     A  B  NEW  NEW2
# r1  0  1    2     3
# r2  0  5    6     7
# r3  8  9   10    11 
df.rename(lambda x: 'c'+x, axis='columns')
#     cA  cB  cC  cD
# r1   0   1   2   3
# r2   4   5   6   7
# r3   8   9  10  11 
df.rename(str.upper, axis='index')
#     A  B   C   D
# R1  0  1   2   3
# R2  4  5   6   7
# R3  8  9  10  11 
df.columns = df.columns.to_series().replace({'C':'D', 'D':'C'})  # use the replace method of Series (modifies df in place)
#     A  B   D   C
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11

`df.set_index()` turns a column into the index; `df.reset_index()` turns the index into a column, and with `drop=True` it discards the index instead.

df = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
'b': ['one', 'two', 'one', 'two'],
'c': ['z', 'y', 'x', 'w'],
'd': [1., 2., 3, 4]},
index=list('ABCD'))
#      a    b  c    d
# A  bar  one  z  1.0
# B  bar  two  y  2.0
# C  foo  one  x  3.0
# D  foo  two  w  4.0 
df.set_index('c')  # replaces the existing index (returns a new frame; df itself is unchanged)
#      a    b    d
# c               
# z  bar  one  1.0
# y  bar  two  2.0
# x  foo  one  3.0
# w  foo  two  4.0 
df.set_index('c', drop=False)  # use drop=False to keep column c
#      a    b  c    d
# c                  
# z  bar  one  z  1.0
# y  bar  two  y  2.0
# x  foo  one  x  3.0
# w  foo  two  w  4.0 
df.reset_index()  # turn the index into a column; with drop=True the index is discarded instead
#   index    a    b  c    d
# 0     A  bar  one  z  1.0
# 1     B  bar  two  y  2.0
# 2     C  foo  one  x  3.0
# 3     D  foo  two  w  4.0

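Applying custom functions to pandas objects: `pipe`, `apply`, `agg`, `transform`, and `map`

apply, agg, and transform are implemented by iterating over the object, so a custom function that mutates the pandas object can invalidate the iterator and cause unexpected behavior. The fix is simply not to mutate the object currently being iterated: call df.copy() at the top of the custom function, make the changes on the copy, and return the result.

`pipe()` passes the whole DataFrame or Series to the function. It is used to build pipelines like dplyr in R; pandas calls this "method chaining".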

df = pd.DataFrame({'name_age':['A_13', 'B_15', 'C_22']})
#   name_age
# 0     A_13
# 1     B_15
# 2     C_22 
def split_name_age(df):
    df['name'] = df.name_age.str.split('_').str.get(0)
    df['age'] = df.name_age.str.split('_').str.get(1)
    return df
def check_adult(df):
    df.age = df.age.astype('int32')
    df['adult'] = df.age >= 18
    return df
split_name_age(df)  # split the name_age column
#   name_age name age
# 0     A_13    A  13
# 1     B_15    B  15
# 2     C_22    C  22 
check_adult(split_name_age(df))  # take the df returned by the first function and check for adulthood
#   name_age name  age  adult
# 0     A_13    A   13  False
# 1     B_15    B   15  False
# 2     C_22    C   22   True 
df.pipe(split_name_age).pipe(check_adult)  # build a pipeline-like method chain with pipe

`apply()` passes each column or each row of the DataFrame to the function as a Series, one at a time.

np_df = np.random.randn(4,3)
np_df[[0,2], [0,1]] = np.nan
df = pd.DataFrame(np_df,index=list('abcd'),columns=list('ABC'))
#           A         B         C
# a       NaN  0.854297 -0.169923
# b  0.080411  1.481673  0.172472
# c -2.248234       NaN -0.190191
# d -1.826412  0.539858  1.415044
df.apply(np.mean)  # axis defaults to 0/'index': the rows are collapsed
df.apply(lambda x: np.mean(x))  # equivalent
# A   -1.331412
# B    0.958609
# C    0.306850 
df.apply(np.mean, axis='columns')  # the columns are collapsed
# a    0.342187
# b    0.578185
# c   -1.219213
# d    0.042830 

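# apply can also pass arguments to the custom function
# for  def subtract_and_divide(x, sub, divide=1): ...
# use  df.apply(subtract_and_divide, args=(5,), divide=3)  to pass them
# args=(...) carries positional arguments; the trailing divide=3 goes straight to the custom function as a keyword argument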
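
The comment above leaves the function body elided. As a runnable sketch, assume the body computes (x - sub) / divide; that body and the sample frame are assumptions made for illustration.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    # Assumed body: subtract a constant, then divide
    return (x - sub) / divide

df = pd.DataFrame({"A": [5.0, 8.0], "B": [11.0, 14.0]})

# args=(5,) fills the positional parameter sub; divide=3 is forwarded as a keyword argument
result = df.apply(subtract_and_divide, args=(5,), divide=3)
```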

`agg()` is an alias of `aggregate()`. Like `apply()`, it passes each column or each row to the function as a Series, but it can pass them to several functions at once.

np_df = np.random.randn(4,3)
np_df[[0,2], [0,1]] = np.nan
df = pd.DataFrame(np_df,index=list('abcd'),columns=list('ABC'))
#           A         B         C
# a       NaN  0.854297 -0.169923
# b  0.080411  1.481673  0.172472
# c -2.248234       NaN -0.190191
# d -1.826412  0.539858  1.415044
df.agg(['mean', 'sum'])
#              A         B         C
# mean -1.331412  0.958609  0.306850
# sum  -3.994235  2.875827  1.227402 
df.agg(["sum", lambda x: x.mean(), lambda x: sum(x > 0)])  # passing lambda functions produces <lambda> rows
#                  A         B         C
# sum      -3.994235  2.875827  1.227402
# <lambda> -1.331412  0.958609  0.306850
# <lambda>  1.000000  3.000000  2.000000 
df.agg({"A": ["mean", "min"], "B": "sum"})  # use a dict to choose which functions apply to which columns 
#              A         B
# mean -1.331412       NaN
# min  -2.248234       NaN
# sum        NaN  2.875827

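`map` passes the function one value at a time (element-wise) and expects one value back. It also has a special feature: replacing values through a key-value mapping. (DataFrame.map is available in recent pandas versions; older versions call it applymap.)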

np_df = np.random.randn(4,3)
np_df[[0,2], [0,1]] = np.nan
df = pd.DataFrame(np_df,index=list('abcd'),columns=list('ABC'))
#           A         B         C
# a       NaN -0.984173  0.294658
# b -0.750238  0.180998 -0.338067
# c  0.008502       NaN  0.285375
# d  0.209068  1.512674  0.639443 
df.map(lambda x: len(str(x)))
#     A   B   C
# a   3  18  18
# b  19  18  19
# c  19   3  19
# d  19  18  18 

s = pd.Series(
    ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
) 
s.map({"six": 6.0, "seven": 7.0})  # special feature: replace values through a key-value mapping
# a    6.0
# b    7.0
# c    6.0
# d    7.0
# e    6.0

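`transform()` passes each column or each row of the DataFrame to the function as a Series, one at a time, but the returned Series must have the same shape (length) as the input. Like agg, it accepts several functions; each function produces its own sub-column or sub-row.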
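
A minimal sketch of both forms of transform. The data is made up and kept non-negative so that the square root is defined.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 4.0, 9.0], "B": [0.0, 1.0, 2.0]})

# One function: each column comes in as a Series and a Series of the same length goes out
centered = df.transform(lambda col: col - col.mean())

# Several functions: every original column gets one sub-column per function
multi = df.transform([np.sqrt, np.exp])
```

The result of the second call has two-level column labels, with the original column on the outer level and the function name on the inner level.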

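Descriptive statistics and aggregation

Most descriptive statistics functions (methods) are aggregations. They aggregate along `axis="index"` or `0` by default (rows are collapsed, so the rows within each column are aggregated), and the `skipna=True` parameter controls whether NA is ignored.

The main ones are: count (number of non-NA values), sum, mean, median (arithmetic median), min, max, mode, abs (absolute value, not an aggregation), std (corrected sample standard deviation), var (unbiased variance), and so on.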

np_df = np.random.randn(4,3)
np_df[[0,2], [0,1]] = np.nan
df = pd.DataFrame(np_df,index=list('abcd'),columns=list('ABC'))
#           A         B         C
# a       NaN -0.984173  0.294658
# b -0.750238  0.180998 -0.338067
# c  0.008502       NaN  0.285375
# d  0.209068  1.512674  0.639443 
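df.mean()  # mean of each column: rows are collapsed (the outputs in this block come from a different random draw than the frame printed above)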
# A   -0.274622
# B    0.870803
# C   -0.042969 
df.mean(axis='columns')  # columns are collapsed: one mean per row
# a    0.546119
# b    0.363779
# c   -0.013569
# d   -0.453625 
df.mean(skipna=False)  # do not ignore NA
# A         NaN
# B         NaN
# C    0.753553

Use `describe()` to produce summary statistics automatically.

series = pd.Series(np.random.randn(1000))
series[::2] = np.nan
series.describe()
# count    500.000000
# mean      -0.021292
# std        1.015906
# min       -2.683763
# 25%       -0.699070
# 50%       -0.069718
# 75%        0.714483
# max        3.160915
series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])  # choose the percentiles to include in the output
# count    500.000000
# mean      -0.021292
# std        1.015906
# min       -2.683763
# 5%        -1.645423
# 25%       -0.699070
# 50%       -0.069718
# 75%        0.714483
# 95%        1.711409
# max        3.160915 
s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])  # non-numeric Series
s.describe()
# count     9
# unique    4
# top       a
# freq      5

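`value_counts()` counts how often each element of a Series occurs; for a DataFrame it counts how often each combination of values in the given columns occurs.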

data = np.random.randint(0, 7, size=50)
# array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
#        2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
#        6, 2, 6, 1, 5, 4]) 
pd.Series(data).value_counts()
# 6    10
# 2    10
# 4     9
# 3     8
# 5     8
# 0     3
# 1     2 
data = {"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]}
#    a  b
# 0  1  x
# 1  2  x
# 2  3  y
# 3  4  y
pd.DataFrame(data).value_counts() 
# a  b
# 1  x    1
# 2  x    1
# 3  y    1
# 4  y    1

`df.info()` prints the column names, non-null counts, and dtypes of df

int_values = [1, 2, 3, 4, 5]
text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
df = pd.DataFrame({"int_col": int_values, "text_col": text_values, "float_col": float_values})
#    int_col text_col  float_col
# 0        1    alpha       0.00
# 1        2     beta       0.25
# 2        3    gamma       0.50
# 3        4    delta       0.75
# 4        5  epsilon       1.00   
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 5 entries, 0 to 4
# Data columns (total 3 columns):
#  #   Column     Non-Null Count  Dtype  
# ---  ------     --------------  -----  
#  0   int_col    5 non-null      int64  
#  1   text_col   5 non-null      object 
#  2   float_col  5 non-null      float64
# dtypes: float64(1), int64(1), object(1)
# memory usage: 252.0+ bytes

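GroupBy: grouping and aggregating a DataFrame

To group and aggregate, first create a GroupBy object with df.groupby(), then call methods on the GroupBy object to produce the result.

Grouping

Grouping needs a mapping from labels to groups. To create a GroupBy object, set the by parameter of .groupby() to one of: a Series with the same length as the labels; a function, which is called on each index label and returns the group name; a dict or Series (index: label, value: group name) that maps labels to group names; for a DataFrame, the label of a column to group by; or a list of any of the above.

Other parameters of .groupby(): dropna=True ignores NA rows by default (NA is not treated as a group key); as_index=True uses the grouping variables (keys) as the index by default, and with False the grouping variables are kept as columns.

Use gb.groups to get the members of every group, and gb.get_group() to get the members of one group.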

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": np.random.randn(8),
    "D": np.random.randn(8),}
)
#      A      B         C         D
# 0  foo    one -1.004903  0.589395
# 1  bar    one -0.736420 -0.090586
# 2  foo    two -0.542409 -0.779030
# 3  bar  three -0.387050  0.501419
# 4  foo    two -0.141339  0.673433
# 5  bar    two  1.385044  0.583636
# 6  foo    one  0.958883  0.532114
# 7  foo  three  1.780702  1.219029
df.groupby('A')  # group by variable A
df.groupby(df['A'])  # equivalent to the line above
df.groupby(['A', 'B']).sum()  # group by several variables and aggregate
#                   C         D
# A   B                        
# bar one   -0.736420 -0.090586
#     three -0.387050  0.501419
#     two    1.385044  0.583636
# foo one   -0.046020  1.121509
#     three  1.780702  1.219029
#     two   -0.683747 -0.105596
df2 = df.copy()
df2.loc[4,'A'] = np.nan
df2.groupby(['A', 'B'], dropna=False, as_index=False).sum()  # treat NA as a group key and keep the grouping variables as columns
#      A      B         C         D
# 0  bar    one -0.736420 -0.090586
# 1  bar  three -0.387050  0.501419
# 2  bar    two  1.385044  0.583636
# 3  foo    one -0.046020  1.121509
# 4  foo  three  1.780702  1.219029
# 5  foo    two -0.542409 -0.779030
# 6  NaN    two -0.141339  0.673433  # <---
df.groupby(['A', 'B']).groups  # get the members of every group
# { ('bar', 'one'): [1], 
#   ('bar', 'three'): [3], 
#   ('bar', 'two'): [5], 
#   ('foo', 'one'): [0, 6], 
#   ('foo', 'three'): [7], 
#   ('foo', 'two'): [2, 4]}  
df.groupby(['A', 'B']).get_group(('foo', 'one'))  # get the members of one group
#      A    B         C         D
# 0  foo  one -1.004903  0.589395
# 6  foo  one  0.958883  0.532114
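
The examples above group by column labels only. As a minimal sketch of the other two kinds of by argument, the snippet below groups by a function and by a dict; the fruit data and the group names fruit_a/fruit_b are made up for illustration. A function passed as by is called once per index label.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=["apple", "avocado", "banana", "blueberry"],
)

# by = function: pandas calls it on each index label and uses the
# return value as the group name (here, the first letter).
by_func = df.groupby(lambda label: label[0]).sum()

# by = dict: maps each index label to a (hypothetical) group name.
mapping = {"apple": "fruit_a", "avocado": "fruit_a",
           "banana": "fruit_b", "blueberry": "fruit_b"}
by_dict = df.groupby(mapping).sum()

print(by_func)
print(by_dict)
```

Both calls produce two groups whose sums are 3 and 7.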

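Aggregation

Use gb['A'], gb.A, or gb[['A','B']] to restrict subsequent operations to the selected columns.

The aggregation functions are similar to those introduced earlier. You can also use `gb.describe()` to generate a summary, or `gb.agg()`. To change the column names in the output of agg, either rename the columns with rename or specify the names directly in agg.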

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": np.random.randn(8),
    "D": np.random.randn(8),}
)
#      A      B         C         D
# 0  foo    one -1.004903  0.589395
# 1  bar    one -0.736420 -0.090586
# 2  foo    two -0.542409 -0.779030
# 3  bar  three -0.387050  0.501419
# 4  foo    two -0.141339  0.673433
# 5  bar    two  1.385044  0.583636
# 6  foo    one  0.958883  0.532114
# 7  foo  three  1.780702  1.219029 
df.groupby(['A', 'B'])['C'].sum()  # aggregate column C only (returns a Series)
df['C'].groupby([df['A'], df['B']]).sum()  # same result: group by a list of Series, each as long as the data
# A    B    
# bar  one     -0.736420
#      three   -0.387050
#      two      1.385044
# foo  one     -0.046020
#      three    1.780702
#      two     -0.683747
# Name: C, dtype: float64  
df.groupby(['A', 'B']).agg(['sum', 'std'])  # aggregate with agg
#                   C                   D          
#                 sum       std       sum       std
# A   B                                            
# bar one   -0.736420       NaN -0.090586       NaN
#     three -0.387050       NaN  0.501419       NaN
#     two    1.385044       NaN  0.583636       NaN
# foo one   -0.046020  1.388607  1.121509  0.040504
#     three  1.780702       NaN  1.219029       NaN
#     two   -0.683747  0.283599 -0.105596  1.027046 
df.groupby('A')[['C','D']].agg(['sum', 'std']).rename({'sum':'new1'}, axis='columns')  # rename afterwards
df.groupby('A')[['C', 'D']].agg(  # or name the output columns directly in agg
    mean_C=pd.NamedAgg(column='C', aggfunc='mean'),
    std_D=pd.NamedAgg(column='D', aggfunc='std')
)
#        mean_C     std_D
# A                      
# bar  0.087191  0.367833
# foo  0.210187  0.737897

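Use `gb.transform()` to transform data group by group: each column within a group is passed to the function as a Series, and the function is expected to return a Series of the same length.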

df = pd.DataFrame({
    'group':np.random.choice(list('ABCD'), 25),
    'value': np.random.normal(0.5, 2, 25)
})
#    group     value
# 0      C -1.650943
# 1      C -1.318024
# 2      B  2.662567
# 3      A -0.009293
# 4      C  0.190648
# ...
# 20     B  0.670990
# 21     D -0.508632
# 22     B  0.561280
# 23     A  1.970783
# 24     B  0.451692
df.groupby('group').agg(['mean', 'std'])  # mean and standard deviation of the raw data
#           value          
#            mean       std
# group                    
# A      0.922288  2.175179
# B      0.006243  2.285984
# C      0.597175  1.850434
# D      0.995618  2.127331 
df.groupby('group').transform(lambda x: (x-x.mean())/x.std()).groupby(df.group).agg(['mean', 'std'])
# group by group, standardize within each group, then group and aggregate again
((df - df.groupby('group').transform('mean')) / df.groupby('group').transform('std')).groupby(df.group).value.agg(['mean', 'std'])  # same result, but faster
#               value     
#                mean  std
# group                   
# A      6.938894e-18  1.0
# B      9.251859e-18  1.0
# C     -3.700743e-17  1.0
# D      0.000000e+00  1.0 
df['value'].transform(lambda x: (x - x.mean())/x.std()).groupby(df.group).agg(['mean', 'std'])
# result of standardizing all values together (no grouping); compare with the previous command to see what its first groupby does
#            mean       std
# group                    
# A      0.196193  1.061747
# B     -0.250946  1.115833
# C      0.037499  0.903233
# D      0.231987  1.038392 

# fill missing values with the column mean using transform
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
df.iloc[1,0] = df.iloc[4,1] = df.iloc[7,2] = np.nan
#           a         b         c
# 0 -1.361766  1.085768 -0.572732
# 1       NaN  0.275264  0.580841
# 2  0.697151  0.192106  0.259210
# 3 -1.012415 -0.356137  0.603221
# 4 -1.687601       NaN  0.310273
# 5 -1.259403 -0.410519  0.765802
# 6  1.119156  2.039054 -1.098849
# 7  1.079027  1.300669       NaN
# 8  0.061345  0.233087 -0.238461
# 9 -0.047450 -0.518641  0.756355
df.mean()
# a   -0.267995
# b    0.426739
# c    0.151740
df.transform(lambda x: x.fillna(x.mean()))
#           a         b         c
# 0 -1.361766  1.085768 -0.572732
# 1 -0.267995  0.275264  0.580841
# 2  0.697151  0.192106  0.259210
# 3 -1.012415 -0.356137  0.603221
# 4 -1.687601  0.426739  0.310273
# 5 -1.259403 -0.410519  0.765802
# 6  1.119156  2.039054 -1.098849
# 7  1.079027  1.300669  0.151740
# 8  0.061345  0.233087 -0.238461
# 9 -0.047450 -0.518641  0.756355
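
The example above fills each column with the mean of the whole column. A group-wise variant, filling each missing value with the mean of its own group, might look like the sketch below; the column names group and value and the data are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# transform returns a Series aligned with the original rows, so each NaN
# is replaced by the mean of its own group (A -> 2.0, B -> 10.0).
filled = df.groupby("group")["value"].transform(lambda x: x.fillna(x.mean()))
print(filled.tolist())
```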

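Use `gb.filter()` to filter whole groups (not individual rows). Do not confuse it with `df/Series.filter`, which selects labels rather than data (for example, column labels ending in e). `gb.filter()` passes each group to the function as a DataFrame and expects a boolean scalar in return.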

df = pd.DataFrame({
    'group':list('AAABBBB'),
    'value': np.random.randint(10,size=7)
})
#   group  value
# 0     A      1
# 1     A      3
# 2     A      4
# 3     B      3
# 4     B      7
# 5     B      1
# 6     B      9 
df.groupby('group').filter(lambda x: x['value'].mean()>4)  # keep groups whose mean is > 4
#   group  value
# 3     B      3
# 4     B      7
# 5     B      1
# 6     B      9
df.groupby('group').filter(lambda x: len(x['value'])<4)  # keep groups with fewer than 4 members
#   group  value
# 0     A      1
# 1     A      3
# 2     A      4
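
For contrast, here is a small sketch of the label-based `df.filter` mentioned above. It never looks at the data, only at the labels; the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

# Keep the columns whose label ends in "e"; the values play no role.
ends_with_e = df.filter(regex="e$", axis="columns")

# Keep columns by exact label.
by_name = df.filter(items=["two"])

print(list(ends_with_e.columns))
print(list(by_name.columns))
```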

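Use `gb.apply()` and `gb.pipe()` for more complex processing. `gb.apply()` passes the DataFrame to the function group by group; with `include_groups=False` (which will become the default in a future version), the grouping columns are excluded from the DataFrame passed to the function. `gb.pipe()` passes the entire GroupBy object to the function.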

# pipe()
df = pd.DataFrame({
    "Store": np.random.choice(["Store_1", "Store_2"], 15),  # store
    "Product": np.random.choice(["Product_1", "Product_2"], 15),  # product
    "Revenue": (np.random.random(15) * 50 + 10).round(2),  # total revenue
    "Quantity": np.random.randint(1, 10, size=15),  # quantity sold
}) 
#       Store    Product  Revenue  Quantity
# 0   Store_1  Product_2    28.77         6
# 1   Store_2  Product_1    45.77         8
# 2   Store_2  Product_2    42.58         4
# 3   Store_2  Product_2    12.98         7
# 4   Store_1  Product_2    52.98         9
# 5   Store_1  Product_1    24.35         4
# 6   Store_2  Product_1    14.76         9
# 7   Store_2  Product_2    37.62         4
# 8   Store_1  Product_1    48.43         1
# 9   Store_1  Product_2    35.89         6
# 10  Store_2  Product_2    36.87         8
# 11  Store_2  Product_1    35.04         1
# 12  Store_1  Product_1    12.79         6
# 13  Store_1  Product_2    26.34         8
# 14  Store_1  Product_1    18.25         8 
(
df.groupby(['Store', 'Product'])
    .pipe(lambda gb: pd.Series(gb.Revenue.sum() / gb.Quantity.sum(), name='Price'))  # gb is the whole GroupBy object
    .round(2)
    .reset_index()
    .pivot(columns='Product', index='Store', values='Price')
) 
# Product  Product_1  Product_2
# Store                        
# Store_1       5.46       4.96
# Store_2       5.31       5.65 

# apply()
def t(df):
    return pd.DataFrame({
        'Rev_original': df.Revenue,
        'Rev_demeaned': df.Revenue - df.Revenue.mean()
    })

(
    df.groupby(['Store', 'Product'])
    .apply(lambda df: t(df), include_groups=False)
    .round(2)
    .reset_index()
    .set_index('level_2')  # after reset_index the original index is a column named 'level_2'; make it the index again
) 
#            Store    Product  Rev_original  Rev_demeaned
# level_2                                                
# 5        Store_1  Product_1         24.35         -1.60
# 8        Store_1  Product_1         48.43         22.48
# 12       Store_1  Product_1         12.79        -13.16
# 14       Store_1  Product_1         18.25         -7.70
# 0        Store_1  Product_2         28.77         -7.22
# 4        Store_1  Product_2         52.98         16.98
# 9        Store_1  Product_2         35.89         -0.10
# 13       Store_1  Product_2         26.34         -9.65
# 1        Store_2  Product_1         45.77         13.91
# 6        Store_2  Product_1         14.76        -17.10
# 11       Store_2  Product_1         35.04          3.18
# 2        Store_2  Product_2         42.58         10.07
# 3        Store_2  Product_2         12.98        -19.53
# 7        Store_2  Product_2         37.62          5.11
# 10       Store_2  Product_2         36.87          4.36 
df['Rev_demeaned'] = (df.Revenue - df.groupby(['Store', 'Product']).Revenue.transform('mean')).round(2)  # more efficient than apply; prefer methods other than apply when possible

`gb.head()/tail()/nth()`

df = pd.DataFrame({
    'group':list('AAABBBB'),
    'value': np.random.randint(10,size=7)
})
#   group  value
# 0     A      3
# 1     A      7
# 2     A      8
# 3     B      5
# 4     B      9
# 5     B      0
# 6     B      5
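df.sort_values('value', ascending=False).groupby('group').head(2).sort_index()  # select the two largest values in each group
# .nth(n) selects the n-th row of each group (a list is also accepted); .tail(n) selects the last n rows of each group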
#   group  value
# 1     A      7
# 2     A      8
# 4     B      9
# 6     B      5
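
Only head() is shown above, so here is a short sketch of tail() and nth() on the same kind of data. The exact index of the nth() result differs between pandas versions, so the example reads only the values.

```python
import pandas as pd

df = pd.DataFrame({"group": list("AAABBBB"), "value": [3, 7, 8, 5, 9, 0, 5]})

# Last row of each group, keeping the original row labels.
last_row = df.groupby("group").tail(1)

# First row (n = 0) of each group; only the values are inspected here.
first_values = df.groupby("group").nth(0)["value"].tolist()

print(last_row)
print(first_values)
```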

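Unifying row/column labels: reindexing with `reindex()`, aligning with `align()`, and `isin()`

`reindex()` requires the user to supply a reference to align to; the reference itself is not modified

reindex() reorders existing data to match the given labels, inserting NA where the data lacks a label. It raises an error if the object contains duplicate labels.

For a DataFrame, use the index and columns parameters to specify row and column labels respectively. Alternatively, pass the labels together with axis="index"/0 or axis="columns"/1 for the same effect.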

s = pd.Series(np.random.randn(3), index=["a", "b", "c"])
# a   -0.385797
# b    1.253458
# c    0.064259 
s.reindex(["c", "b", "a", "d"])  # labels missing from the data get NA
# c    0.064259
# b    1.253458
# a   -0.385797
# d         NaN 

df = pd.DataFrame(np.arange(12).reshape(3,4),index=['r1','r2','r3'],columns=list('ABCD'))
#     A  B   C   D
# r1  0  1   2   3
# r2  4  5   6   7
# r3  8  9  10  11 
df.reindex(index=['r2','r3','r4'], columns=['A','C','E'])  # specify row and column labels with the index and columns parameters
#       A     C   E
# r2  4.0   6.0 NaN
# r3  8.0  10.0 NaN
# r4  NaN   NaN NaN 
df.reindex(['r2','r3','r4'], axis="index")  # same effect with axis="index" or "columns"
#       A    B     C     D
# r2  4.0  5.0   6.0   7.0
# r3  8.0  9.0  10.0  11.0
# r4  NaN  NaN   NaN   NaN
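
Two details of reindex() can be sketched briefly: the fill_value parameter replaces the default NA for missing labels, and reindexing an object with duplicate labels fails. The data is made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Use 0 instead of NA for labels that do not exist in s.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0)
print(filled.tolist())

# An object with duplicate labels cannot be reindexed.
dup = pd.Series([1, 2], index=["a", "a"])
try:
    dup.reindex(["a", "b"])
    raised = False
except ValueError:
    raised = True
print(raised)
```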

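`align()` takes no reference; it aligns two objects with each other without concatenating or merging them

The syntax is Series.align(Series2) or Df.align(Df2); it returns a tuple containing the aligned Series or DataFrames.

The join parameter accepts 'outer' (default), 'left', 'right', and 'inner', for an outer join (union), left join, right join, and inner join (intersection) respectively.

For code like s1.align(s2), element 0 of the returned tuple is the aligned s1 and element 1 is the aligned s2. The labels of the returned objects may have been sorted, so their order can differ from the original.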

s1 = pd.Series(np.random.randn(4), index=list('ABCD'))
# A    0.234767
# B    0.198953
# C    1.276027
# D   -0.240349
s2 = pd.Series(np.random.randn(4), index=list('ACBE'))
# A    0.063577
# C   -0.774306
# B    0.254649
# E    1.500261 
s1.align(s2)  # outer join by default
# (A    0.234767      A    0.063577
#  B    0.198953      B    0.254649
#  C    1.276027      C   -0.774306
#  D   -0.240349      D         NaN
#  E         NaN      E    1.500261
#  dtype: float64,    dtype: float64)
s1.align(s2, join='inner')  # intersection
# (A    0.234767      A    0.063577
#  B    0.198953      B    0.254649
#  C    1.276027      C   -0.774306
#  dtype: float64,    dtype: float64) 
s1.align(s2, join='left')  # left join: the labels of the left object are the reference
# (A    0.234767      A    0.063577
#  B    0.198953      B    0.254649
#  C    1.276027      C   -0.774306
#  D   -0.240349      D         NaN
#  dtype: float64,    dtype: float64)

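For a DataFrame, join applies to both row and column labels by default. Use the axis parameter to align on one axis only: axis=0 aligns row labels only, and axis=1 aligns column labels only.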

df1 = pd.DataFrame(np.arange(6).reshape(3,2), index=['r1','r2','r3'], columns=list('AB'))
#     A  B
# r1  0  1
# r2  2  3
# r3  4  5
df2 = pd.DataFrame(np.arange(10,18).reshape(2,4), index=['r1','r2'], columns=list('ABCD'))
#      A   B   C   D
# r1  10  11  12  13
# r2  14  15  16  17
df1.align(df2)
# (    A  B   C   D            A     B     C     D
#  r1  0  1 NaN NaN     r1  10.0  11.0  12.0  13.0
#  r2  2  3 NaN NaN     r2  14.0  15.0  16.0  17.0
#  r3  4  5 NaN NaN,    r3   NaN   NaN   NaN   NaN)
df1.align(df2, join='right')  # right join: the labels of the right object are the reference
# (    A  B   C   D          A   B   C   D
#  r1  0  1 NaN NaN     r1  10  11  12  13
#  r2  2  3 NaN NaN,    r2  14  15  16  17)
df1.align(df2, join='left', axis=0)  # align row labels only
# (    A  B            A     B     C     D
#  r1  0  1     r1  10.0  11.0  12.0  13.0
#  r2  2  3     r2  14.0  15.0  16.0  17.0
#  r3  4  5,    r3   NaN   NaN   NaN   NaN)
df1.align(df2, join='left', axis=1)  # align column labels only
# (    A  B         A   B
#  r1  0  1    r1  10  11
#  r2  2  3    r2  14  15)
#  r3  4  5,

`isin()` is not designed specifically for labels. Given a reference, it checks each target label for membership in the reference and returns a boolean array.

s = pd.Series(np.arange(3,6), index=list('abc'))
s.index.isin(list('adb'))
# array([ True,  True, False])
df = pd.DataFrame(np.arange(12).reshape(4, 3),
    index=list('abcd'),
    columns=list('ABC'))
df.columns.isin(list('ACD')) 
# array([ True, False,  True])

.isin(values) goes through every element of the object, checks whether it is contained in the given values, and returns a boolean Series or DataFrame.

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
# 4    0
# 3    1
# 2    2
# 1    3
# 0    4 
s.isin([2, 4, 6])
# 4    False
# 3    False
# 2     True
# 1    False
# 0     True 
s[s.isin([2, 4, 6])]  # select elements contained in values
# 2    2
# 0    4 
s[s.index.isin([2, 4, 6])]  # also works on labels
# 4    0
# 2    2 
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
				   'ids2': ['a', 'n', 'c', 'n']})
#    vals ids ids2
# 0     1   a    a
# 1     2   b    n
# 2     3   f    c
# 3     4   n    n 
df.isin(['a','b',1,3])  # test every element of df
#     vals    ids   ids2
# 0   True   True   True
# 1  False   True  False
# 2   True  False  False
# 3  False  False  False
df.isin({'ids':['a','b'], 'vals': [1,3]})  # per-column tests, e.g. whether elements of ids are 'a' or 'b'
#     vals    ids   ids2
# 0   True   True  False
# 1  False   True  False
# 2   True  False  False
# 3  False  False  False

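Removing duplicate rows from a DataFrame: `.duplicated()` returns a boolean Series marking duplicates, `drop_duplicates()` returns the DataFrame with duplicates removed

Use the keep parameter to choose which duplicate to keep: 'first' (default) keeps the first occurrence; 'last' keeps the last occurrence; False keeps none of the duplicated rows.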

df = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                   'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                   'c': np.random.randn(7)})
#        a  b         c
# 0    one  x -0.473516
# 1    one  y -1.933758
# 2    two  x -1.495605
# 3    two  y -0.004867
# 4    two  x -1.429344
# 5  three  x  0.996961
# 6   four  x -0.213066 
df.duplicated('a')  # duplicates in column 'a'
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# 5    False
# 6    False 
df.duplicated('a', keep='last')  # keep the last occurrence of each duplicate (marked False)
# 0     True
# 1    False
# 2     True
# 3     True
# 4    False
# 5    False
# 6    False 
df.duplicated(['a','b'])  # pass a list of column labels to judge duplicates on their combination (all columns if omitted)
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# 5    False
# 6    False 
df[~df.duplicated(['a','b'])]
df.drop_duplicates(['a','b'])  # the two lines are equivalent; both return the df with duplicates removed
#        a  b         c
# 0    one  x -0.473516
# 1    one  y -1.933758
# 2    two  x -1.495605
# 3    two  y -0.004867
# 5  three  x  0.996961
# 6   four  x -0.213066
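
The examples above cover keep='first' and keep='last'. A minimal sketch of keep=False, which marks (or drops) every member of a duplicated set, could look like this; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": ["one", "one", "two", "three"], "b": [1, 2, 3, 4]})

# keep=False flags all rows that have a duplicate, not just the later ones.
mask = df.duplicated("a", keep=False)

# Dropping with keep=False leaves only the rows whose key is unique.
unique_only = df.drop_duplicates("a", keep=False)

print(mask.tolist())
print(unique_only)
```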

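pandas accessors

Accessors provide collections of methods that simplify operations on particular data types.

Use pd.Series._accessors to list the accessors supported by Series: dt, cat, sparse, str, used for datetime data, categorical data, sparse data, and string data respectively. Data of an unsupported type must be converted to a supported type first, otherwise an error is raised.

dt: datetime data (not covered here)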

s = pd.Series(pd.to_datetime(['2025-01-01', '2025-07-15']))  # strings must be converted to timestamps first
# 0   2025-01-01
# 1   2025-07-15
# dtype: datetime64[ns]
s.dt.day
# 0     1
# 1    15
# dtype: int32 
s.dt.is_month_start  # whether the date is the first day of a month
# 0     True
# 1    False
# dtype: bool

cat: categorical data (not covered here)

s = pd.Series(['low', 'medium', 'high', 'medium'], dtype='category')
# 0       low
# 1    medium
# 2      high
# 3    medium
# dtype: category
# Categories (3, object): ['high', 'low', 'medium'] 
s.cat.categories  # get the categories
# Index(['high', 'low', 'medium'], dtype='object') 
s.cat.ordered  # whether the categories are ordered
# False
s.cat.codes  # integer code of each element's category
# 0    1
# 1    2
# 2    0
# 3    2 
s.cat.reorder_categories(['low', 'medium', 'high'], ordered=True)  # set the category order and mark the Series as ordered
# 0       low
# 1    medium
# 2      high
# 3    medium
# dtype: category
# Categories (3, object): ['low' < 'medium' < 'high']

str: string data (see below)

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan])
# 0       A
# 1       B
# 2       C
# 3    Aaba
# 4    Baca
# 5     NaN
# dtype: object
s.str.upper()
# 0       A
# 1       B
# 2       C
# 3    AABA
# 4    BACA
# 5     NaN
# dtype: object
s.str.len()
# 0    1.0
# 1    1.0
# 2    1.0
# 3    4.0
# 4    4.0
# 5    NaN
# dtype: float64

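Use pd.DataFrame._accessors to list the accessors supported by DataFrame: sparse

Text processing with the `Series/Index.str` accessor

str provides a set of methods for string data (object or string dtype); these methods operate on every element of the Series and generally return a Series.

For regular expressions, see Python re

Common methods

lower() converts to lowercase; upper() converts to uppercase; len() returns the number of characters; strip() removes whitespace from both ends of each string, while lstrip() and rstrip() remove it from the start and the end respectively.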

s = pd.Series(["A", "Aaba", "Baca", np.nan, "dog", "cat"])
# 0       A
# 1    Aaba
# 2    Baca
# 3     NaN
# 4     dog
# 5     cat
# dtype: object
s.str.lower()  # to lowercase
# 0       a
# 1    aaba
# 2    baca
# 3     NaN
# 4     dog
# 5     cat 
s.str.len()  # number of characters
# 0    1.0
# 1    4.0
# 2    4.0
# 3    NaN
# 4    3.0
# 5    3.0

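Splitting strings and extracting substrings: `split() rsplit()`

split() splits from the start of the string; rsplit() splits from the end.

Parameters: n=-1 limits the number of splits (all splits by default); expand=False controls whether the resulting lists are expanded into separate columns; regex=None controls whether the pattern is a regular expression (by default a single-character pattern is treated literally and a longer pattern is treated as a regex).

split() turns each string into a list (one list per element); use .str.get(n) or .str[n] to select an element from those lists. Note that applying .str[n] directly to a Series of strings treats each string as a sequence and selects a character from it.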

s = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"])
# 0    a_b_c
# 1    c_d_e
# 2      NaN
# 3    f_g_h
# dtype: object
s.str.split('_')  # split on '_'
# 0    [a, b, c]
# 1    [c, d, e]
# 2          NaN
# 3    [f, g, h] 
s.str.split('_').str.get(0)  # select element 0
s.str.split('_').str[0]  # equivalent to the line above
# 0      a
# 1      c
# 2    NaN
# 3      f 
s.str.split('_', expand=True)  # expand the split lists into columns
#      0    1    2
# 0    a    b    c
# 1    c    d    e
# 2  NaN  NaN  NaN
# 3    f    g    h 
s.str.split('_', expand=True, n=1)  # at most 1 split
#      0    1
# 0    a  b_c
# 1    c  d_e
# 2  NaN  NaN
# 3    f  g_h 
s.str.rsplit('_', expand=True, n=1)  # split from the right
#      0    1
# 0  a_b    c
# 1  c_d    e
# 2  NaN  NaN
# 3  f_g    h
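
The regex parameter and the direct use of .str[n] are described above but not shown. A small sketch, assuming a pandas version recent enough to accept the regex argument of split(); the data is made up for illustration.

```python
import pandas as pd

s = pd.Series(["a_b-c", "d-e_f"])

# Split on either "_" or "-" by passing a regular expression.
parts = s.str.split(r"[_-]", regex=True, expand=True)

# .str[n] applied directly to strings picks the n-th character.
first_char = s.str[0]

print(parts)
print(first_char.tolist())
```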

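String replacement: `replace(pat, repl)`

Do not confuse it with Series.replace().

Parameters: repl is the replacement string or a function (it receives a re.Match object and is expected to return a string); n=-1 is the number of replacements (all by default); case controls case sensitivity; flags are regex flags; regex=False controls whether the pattern is a regular expression.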

s = pd.Series(["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"])
# 0       A
# 1       B
# 2       C
# 3    Aaba
# 4    Baca
# 5        
# 6     NaN
# 7    CABA
# 8     dog
# 9     cat
s.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)  # replace a leading "any character + a", or "dog", with "XX-XX ", case-insensitively
# 0           A
# 1           B
# 2           C
# 3    XX-XX ba
# 4    XX-XX ca
# 5            
# 6         NaN
# 7    XX-XX BA
# 8      XX-XX 
# 9     XX-XX t 
import re  # needed for the re.Match annotation
def t(x:re.Match):
    return x.group(0) + '#'
s.str.replace('^.a', t, regex=True)  # pass a function
# 0        A
# 1        B
# 2        C
# 3    Aa#ba
# 4    Ba#ca
# 5         
# 6      NaN
# 7     CABA
# 8      dog
# 9     ca#t
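
To make the difference from Series.replace() concrete, here is a minimal sketch on made-up data: Series.replace() with a plain string swaps whole values that are equal to it, while .str.replace() works on substrings inside each value.

```python
import pandas as pd

s = pd.Series(["cat", "category"])

# Series.replace: only a value exactly equal to "cat" is replaced.
whole = s.replace("cat", "dog")

# .str.replace: every occurrence of the substring "cat" is replaced.
sub = s.str.replace("cat", "dog", regex=False)

print(whole.tolist())
print(sub.tolist())
```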

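String concatenation: `cat()`

Parameters: others=None specifies the other objects to concatenate with; Series and DataFrame are supported, as are lists of Series (and other one-dimensional objects). sep=None specifies the separator (none by default). na_rep=None specifies the string used for NA; by default, missing values are skipped when others is not given, and a row containing any missing value yields NA when others is given. join='left' selects an outer, left, right, or inner join; indexes are aligned automatically before concatenation.

When others is an object without an index, such as a list or NumPy array, the length of a one-dimensional object or the number of rows of a two-dimensional object must match the length of s.

One subtlety with numbers: if a column to be concatenated holds pure numbers, convert it with .astype('str') rather than .astype('O'). .astype('O') only changes the dtype to object and leaves each element as a number, which cat cannot concatenate, whereas .astype('str') converts every element to a string. (pandas also has a newer dedicated string dtype; earlier sections avoided mentioning it to prevent confusion.)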

s = pd.Series(["a", "b", np.nan, "d"])
# 0      a
# 1      b
# 2    NaN
# 3      d
s.str.cat(sep=',', na_rep='-')  # with no others, s is joined into a single string
# 'a,b,-,d'
s.str.cat(["A", "B", "C", "D"])  # concatenate with a list; positions holding NA cannot be concatenated
# 0     aA
# 1     bB
# 2    NaN
# 3     dD 
s.str.cat(["A", "B", "C", "D"], na_rep='-')  # with na_rep set, NA positions are concatenated too
# 0    aA
# 1    bB
# 2    -C
# 3    dD 
s2 = pd.Series(list('ABCD'), index=[2,5,0,1])
s.str.cat(s2, na_rep='-')  # automatic index alignment is applied
# 0    aC
# 1    bD
# 2    -A
# 3    d-
s.str.cat(pd.DataFrame(np.array(list('ABCDEFGH')).reshape(4,2)), na_rep='-')  # concatenate with a DataFrame
# 0    aAB
# 1    bCD
# 2    -EF
# 3    dGH 
s.str.cat(pd.DataFrame(np.arange(8).reshape(4,2)).astype('str'), na_rep='-')  # concatenate with a numeric df; convert with astype('str')
s.str.cat(np.arange(8).reshape(4,2).astype('str'), na_rep='-')  # same as above
# 0    a01
# 1    b23
# 2    -45
# 3    d67 
s2 = pd.Series(list('ABCDE'), index=[2,5,0,1,6])
s.str.cat(s2, na_rep='-', join='outer')  # outer join
# 0    aC
# 1    bD
# 2    -A
# 3    d-
# 5    -B
# 6    -E 
s3 = pd.Series(list('EFGH'), index=[2,5,0,1])
s.str.cat([s2, s3], na_rep='-')  # others can be a list of Series (and other one-dimensional objects)
# 0    aCG
# 1    bDH
# 2    -AE
# 3    d--

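Testing for a match: `match()` `fullmatch()` `contains()`

fullmatch tests whether the entire string matches the regex, match tests whether the start of the string matches, and contains tests whether the regex matches anywhere in the string. The na parameter sets whether NA values are treated as True or False.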

s = pd.Series(["1", "2", "3a", "3b", "03c", "4dx"])
# 0      1
# 1      2
# 2     3a
# 3     3b
# 4    03c
# 5    4dx
s.str.fullmatch(r'[0-9][a-z]')  # must match the whole string
# 0    False
# 1    False
# 2     True
# 3     True
# 4    False
# 5    False
s.str.match(r'[0-9][a-z]')  # must match at the start
# 0    False
# 1    False
# 2     True
# 3     True
# 4    False
# 5     True 
s.str.contains(r'[0-9][a-z]')  # match anywhere
# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# 5     True
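
The na parameter matters as soon as the result is used as a boolean mask. A minimal sketch on made-up data, treating missing values as non-matches:

```python
import numpy as np
import pandas as pd

s = pd.Series(["3a", np.nan, "xyz"])

# na=False makes missing values count as "no match", so the result
# contains only True/False and can be used directly for filtering.
mask = s.str.contains(r"[0-9][a-z]", na=False)
matched = s[mask]

print(mask.tolist())
print(matched.tolist())
```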

Extracting matches: `extract(pat)`

Each capture group in the regex pat becomes a column of a DataFrame; by default a DataFrame is always returned. Parameters: flags are regex flags.

s = pd.Series(['20241112-aaa1', '20241113-aaa1', '20241112-bbb1'])
# 0    20241112-aaa1
# 1    20241113-aaa1
# 2    20241112-bbb1
s.str.extract(r'.{6}(.{2})-(\D+)') 
#     0    1
# 0  12  aaa
# 1  13  aaa
# 2  12  bbb 
s.str.extract(r'.{6}(?P<day>.{2})-(?P<title>\D+)')  # use named groups
#    day title
# 0   12   aaa
# 1   13   aaa
# 2   12   bbb
s = pd.Series(['20241112-aaa1', '20241113-aaa1', '20241112-bbb1', '20241113-1'])
# 0    20241112-aaa1
# 1    20241113-aaa1
# 2    20241112-bbb1
# 3       20241113-1 
s.str.extract(r'.{6}(?P<day>.{2})-(?P<title>\D+)')  # when any group fails to match, the whole row becomes NA
#    day title
# 0   12   aaa
# 1   13   aaa
# 2   12   bbb
# 3  NaN   NaN 
s.str.extract(r'.{6}(?P<day>.{2})-(?P<title>\D+)?')  # append '?' to a group to make it optional
#   day title
# 0  12   aaa
# 1  13   aaa
# 2  12   bbb
# 3  13   NaN

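Sorting

Sorting by label: `.sort_index()`

Parameters: axis='index' sorts the index axis; ascending=True sorts in ascending order; [key] transforms the labels before sorting: it receives an Index object and is expected to return an Index object (sort_index then sorts by the returned Index).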

df = pd.DataFrame(np.arange(12).reshape(4, 3),index=list('acbd'),columns=list('ACB'))
#    A   C   B
# a  0   1   2
# c  3   4   5
# b  6   7   8
# d  9  10  11
df.sort_index()  # sorts the index by default
#    A   C   B
# a  0   1   2
# b  6   7   8
# c  3   4   5
# d  9  10  11
df.sort_index(axis='columns')  # sort the column labels
#    A   B   C
# a  0   2   1
# c  3   5   4
# b  6   8   7
# d  9  11  10 
df.sort_index(ascending=False)  # descending order
#    A   C   B
# d  9  10  11
# c  3   4   5
# b  6   7   8
# a  0   1   2 
df.sort_index(key=lambda index: index.to_series().replace({'b':'d', 'd':'b'}))  # row labels b and d are swapped for sorting purposes
#    A   C   B
# a  0   1   2
# d  9  10  11
# c  3   4   5
# b  6   7   8

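Sorting by value: `.sort_values()`

Parameters: by is the label(s) of the column(s) or row(s) to sort by; axis='index' sorts rows by column values; ascending=True sorts in ascending order; [key] transforms the values before sorting: it receives a Series and is expected to return a Series.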

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F 
df.sort_values(by=['col1'])
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D 
df.sort_values(by=['col1', 'col2'])  # sort by several columns
#   col1  col2  col3 col4
# 1    A     1     1    B
# 0    A     2     0    a
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D 
df.sort_values(by='col4', key=lambda col: col.str.lower())  # use key
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
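
When sorting by several columns, ascending also accepts a list with one flag per column. A minimal sketch on made-up data, sorting col1 ascending and col2 descending within it:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B"], "col2": [1, 2, 3]})

# One ascending flag per sort key: col1 ascending, then col2 descending.
out = df.sort_values(by=["col1", "col2"], ascending=[True, False])

print(out)
```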

Sorting by label and value together: `.sort_values()`

Pass the name of the Index (its name attribute) in by to sort by labels.

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col3': [0, 1, 9, 4, 2, 3],
    'col2': [2, 1, 9, 8, 7, 4],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
df.index.name = 'i_name'
#        col1  col3  col2 col4
# i_name                      
# 0         A     0     2    a
# 1         A     1     1    B
# 2         B     9     9    c
# 3       NaN     4     8    D
# 4         D     2     7    e
# 5         C     3     4    F
df.sort_values(by=['col1','i_name'])  # sort by column values and index labels together
#        col1  col3  col2 col4
# i_name                      
# 0         A     0     2    a
# 1         A     1     1    B
# 2         B     9     9    c
# 5         C     3     4    F
# 4         D     2     7    e
# 3       NaN     4     8    D

Taking the n largest/smallest values (rows): `.nsmallest()`, `.nlargest()`

s = pd.Series(np.random.randint(0,10,6))
# 0    6
# 1    5
# 2    8
# 3    6
# 4    9
# 5    0
s.nlargest(3)  # the three largest values
# 4    9
# 2    8
# 0    6 

df = pd.DataFrame({
    "a": [-2, -1, 1, 10, 8, 11, -1],
    "b": list("abdceff"),
    "c": [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0],
}) 
#     a  b    c
# 0  -2  a  1.0
# 1  -1  b  2.0
# 2   1  d  4.0
# 3  10  c  3.2
# 4   8  e  NaN
# 5  11  f  3.0
# 6  -1  f  4.0 
df.nsmallest(3, ['a'])  # the column name is required
#    a  b    c
# 0 -2  a  1.0
# 1 -1  b  2.0
# 6 -1  f  4.0

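Concatenating DataFrames

`pd.concat()` takes a list (iterable) of objects to concatenate and joins them along the specified axis; on its own it supports only outer and inner joins

Parameters: axis=0/'index' is the axis to stack along (the index axis by default); join='outer' controls how non-matching labels on the other axis are handled (union by default); ignore_index=False controls whether the labels on the stacking axis are reset to integer labels.

To get a left or right join, reindex first.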

df1 = pd.DataFrame(np.arange(9).reshape(3,3),index=list('ABC'),columns=list('abc'))
#    a  b  c
# A  0  1  2
# B  3  4  5
# C  6  7  8
df2 = pd.DataFrame(np.arange(9,15).reshape(2,3),index=['D','E'],columns=list('abd'))
#     a   b   d
# D   9  10  11
# E  12  13  14
pd.concat([df1, df2])
#     a   b    c     d
# A   0   1  2.0   NaN
# B   3   4  5.0   NaN
# C   6   7  8.0   NaN
# D   9  10  NaN  11.0
# E  12  13  NaN  14.0
pd.concat([df1, df2], axis='columns')  # stack along the columns axis
#      a    b    c     a     b     d
# A  0.0  1.0  2.0   NaN   NaN   NaN
# B  3.0  4.0  5.0   NaN   NaN   NaN
# C  6.0  7.0  8.0   NaN   NaN   NaN
# D  NaN  NaN  NaN   9.0  10.0  11.0
# E  NaN  NaN  NaN  12.0  13.0  14.0 
pd.concat([df1, df2], join='inner')  # keep only matching labels
#     a   b
# A   0   1
# B   3   4
# C   6   7
# D   9  10
# E  12  13
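
The remark about left and right joins can be sketched as follows: reindex the right-hand object to the left-hand labels before concatenating along the columns axis. The frames left and right are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [0, 3, 6]}, index=list("ABC"))
right = pd.DataFrame({"d": [11, 14, 17]}, index=list("ACE"))

# Emulate a left join: force `right` onto the row labels of `left`
# (label B has no match and becomes NaN, label E is dropped).
joined = pd.concat([left, right.reindex(left.index)], axis="columns")

print(joined)
```

For a right join, reindex left to right.index instead.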

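When concatenating a Series with a DataFrame, the Series is treated as a DataFrame with n rows and 1 column. So to append a Series as a single row of n columns, convert it to a DataFrame and transpose it.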

df1 = pd.DataFrame(np.arange(9).reshape(3,3))
#    0  1  2
# 0  0  1  2
# 1  3  4  5
# 2  6  7  8
### Series handled as n rows x 1 column
pd.concat([df1, pd.Series(np.arange(3))], axis=0)  # the Series has no name, so its column name defaults to 0
#    0    1    2
# 0  0  1.0  2.0
# 1  3  4.0  5.0
# 2  6  7.0  8.0
# 0  0  NaN  NaN
# 1  1  NaN  NaN
# 2  2  NaN  NaN 
pd.concat([df1, pd.Series(np.arange(3))], axis='columns')  # stack along the columns axis
#    0  1  2  0
# 0  0  1  2  0
# 1  3  4  5  1
# 2  6  7  8  2 
### convert the Series to 1 row x n columns
pd.concat([df1, pd.Series(np.arange(3)).to_frame().T], axis=0)
#    0  1  2
# 0  0  1  2
# 1  3  4  5
# 2  6  7  8
# 0  0  1  2
pd.concat([df1, pd.Series(np.arange(3)).to_frame().T], axis=1) 
#    0  1  2    0    1    2
# 0  0  1  2  0.0  1.0  2.0
# 1  3  4  5  NaN  NaN  NaN
# 2  6  7  8  NaN  NaN  NaN

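`pd.merge()` takes two objects to join and can only combine them column-wise; it supports inner, left, right, and outer joins as well as the Cartesian product

Parameters: how='inner' is the join type (inner by default): left, right, outer, or cross (Cartesian product). on, left_on, right_on name the key columns: by default the columns common to both DataFrames are used as keys; on names columns present in both DataFrames, while left_on and right_on name the key columns of each DataFrame separately; several columns can be used as keys. left_index=False, right_index=False control whether each DataFrame's index is used as the join key. suffixes=('_x', '_y') are the suffixes appended to non-key columns that share a label in both DataFrames; set an entry to None to add no suffix on that side.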

df1 = pd.DataFrame(np.arange(9).reshape(3,3),
    index=pd.Index(list('ABC'), name='rowname'),
    columns=list('abc'))
#          a  b  c
# rowname         
# A        0  1  2
# B        3  4  5
# C        6  7  8
df2 = pd.DataFrame(np.arange(9,18).reshape(3,3),
    index=pd.Index(list('ACE'), name='rowname'),
    columns=list('abd'))
#           a   b   d
# rowname            
# A         9  10  11
# C        12  13  14
# E        15  16  17
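pd.merge(df1, df2, on='rowname')  # name the index, or turn it into a column, so it has a name to join on. Inner join by default
pd.merge(df1, df2, left_index=True, right_index=True)  # equivalent to the line above: left/right_index use each df's index as the join key
#          a_x  b_x  c  a_y  b_y   d    # columns sharing a label in both dfs are renamed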
# rowname                           
# A          0    1  2    9   10  11
# C          6    7  8   12   13  14 
pd.merge(df1, df2, on='rowname', how='left', suffixes=['_df1', '_df2'])  # left join with custom suffixes
#          a_df1  b_df1  c  a_df2  b_df2     d
# rowname                                     
# A            0      1  2    9.0   10.0  11.0
# B            3      4  5    NaN    NaN   NaN
# C            6      7  8   12.0   13.0  14.0
df1 = df1.reset_index(names='r1')  # turn the index into a column labeled 'r1'
df2 = df2.reset_index(names='r2')
pd.merge(df1, df2, left_on='r1', right_on='r2',  suffixes=['_df1', '_df2'])  # name the join column of each df separately
#   r1  a_df1  b_df1  c r2  a_df2  b_df2   d
# 0  A      0      1  2  A      9     10  11
# 1  C      6      7  8  C     12     13  14
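
The Cartesian product listed among the how options has no example above. A minimal sketch, assuming a pandas version that supports how='cross'; the frames sizes and colors are made up for illustration.

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M"]})
colors = pd.DataFrame({"color": ["red", "green", "blue"]})

# Cross join: every row of `sizes` is paired with every row of `colors`,
# so no key columns are needed (2 x 3 = 6 rows).
combos = pd.merge(sizes, colors, how="cross")

print(combos)
```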

`df.combine_first()` fills the missing values of a DataFrame with the values at the corresponding positions in another DataFrame.

df1 = pd.DataFrame(
    [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
)
#      0    1    2
# 0  NaN  3.0  5.0
# 1 -4.6  NaN  NaN
# 2  NaN  7.0  NaN
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
#       0    1    2
# 1 -42.6  NaN -8.2
# 2  -5.0  1.6  4.0
df1.combine_first(df2)
#      0    1    2
# 0  NaN  3.0  5.0
# 1 -4.6  NaN -8.2
# 2 -5.0  7.0  4.0

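Converting a DataFrame between long and wide formats

Telling long data from wide data

Long versus wide is relative to a particular variable (or variables). To judge whether the data is long or wide with respect to a variable: when the single variable to analyze is spread over several columns, the data is wide; otherwise it is long.

Put another way, "the variable spans several columns" means that one of its categorical variables is stored in the column names. Conversely, converting long data to wide takes the distinct values of one categorical variable and turns them into column names.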

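# A table of student grades:
  Student  Math  English  Science
0       A    85       92       88
1       B    90       85       91
2       C    78       88       84
3       D    88       79       86
# If only the math score is needed, the variable to analyze occupies a single column, so this df is long data and can be analyzed directly
# If all scores are needed (the variable is "score"), the variable spans 3 columns, so this df is wide data and must be converted to long data for analysis, as shown below:
   Student   Course  Score
0        A     Math     85
1        B     Math     90
2        C     Math     78
3        D     Math     88
4        A  English     92
5        B  English     85
6        C  English     88
7        D  English     79
8        A  Science     88
9        B  Science     91
10       C  Science     84
11       D  Science     86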

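Pivoting (long to wide): `pivot()` `pivot_table()`

pivot() rebuilds the DataFrame using only the columns named by its three parameters.

Parameters: columns is the column whose values become the new column names; index is the column whose values become the new index; values is the column supplying the values. columns and index act as categorical variables. pivot() does not allow a repeated index/columns combination, because that would put two values in one cell; use pivot_table() to handle that case.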

df = pd.DataFrame({
    "value": range(12),
    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05"] * 4)
})
#     value variable       date
# 0       0        A 2020-01-03
# 1       1        A 2020-01-04
# 2       2        A 2020-01-05
# 3       3        B 2020-01-03
# 4       4        B 2020-01-04
# 5       5        B 2020-01-05
# 6       6        C 2020-01-03
# 7       7        C 2020-01-04
# 8       8        C 2020-01-05
# 9       9        D 2020-01-03
# 10     10        D 2020-01-04
# 11     11        D 2020-01-05
df.pivot(columns='variable', index='date', values='value')  # variable becomes the new column names, date the new index, value the new values
# variable    A  B  C   D
# date                   
# 2020-01-03  0  3  6   9
# 2020-01-04  1  4  7  10
# 2020-01-05  2  5  8  11
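
The restriction on repeated index/columns combinations can be seen in a small sketch: pivot() raises an error on the made-up data below, while pivot_table() aggregates the colliding values.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "variable": ["A", "A", "A"],
    "value": [1, 2, 3],
})

# ("d1", "A") occurs twice, so pivot cannot decide which value to place.
try:
    df.pivot(index="date", columns="variable", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the collision with an aggregation function.
table = df.pivot_table(index="date", columns="variable", values="value", aggfunc="sum")

print(pivot_failed)
print(table)
```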

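pivot_table() is similar to pivot() but allows repeated index/columns combinations.

Parameters: columns is the column(s) whose values become the new column names; index is the column(s) whose values become the new index; values is the column supplying the values; aggfunc='mean' specifies how values sharing the same combination are aggregated; fill_value is the replacement for NA in the new DataFrame.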

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
#      A    B      C  D  E
# 0  foo  one  small  1  2
# 1  foo  one  large  2  4
# 2  foo  one  large  2  5
# 3  foo  two  small  3  5
# 4  foo  two  small  3  6
# 5  bar  one  large  4  6
# 6  bar  one  small  5  8
# 7  bar  two  small  6  9
# 8  bar  two  large  7  9
df.pivot_table(columns='C', index=['A','B'], values='D', aggfunc='sum')  # pivot with the same arguments would raise an error
# C        large  small
# A   B                
# bar one    4.0    5.0
#     two    7.0    6.0
# foo one    4.0    1.0   # <--- the first value in this row was aggregated
#     two    NaN    6.0

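Unpivoting (wide to long): `melt()`

melt() rebuilds the DataFrame based on the columns named by id_vars and value_vars.

Parameters: id_vars is the existing identifier column(s) (left unchanged); var_name is the name of the new variable column; value_vars is the columns to unpivot (all columns except id_vars if omitted); value_name is the name of the new value column; ignore_index=True discards the original index by default, while False keeps it.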

df = pd.DataFrame({
'Student': ['A', 'B', 'C'],
'Math': [85, 90, 78],
'English': [92, 85, 88],
'Science': [88, 91, 84]
})
#   Student  Math  English  Science
# 0       A    85       92       88
# 1       B    90       85       91
# 2       C    78       88       84
df.melt(id_vars='Student')  # keep Student as the identifier column and unpivot all other columns
#   Student variable  value
# 0       A     Math     85
# 1       B     Math     90
# 2       C     Math     78
# 3       A  English     92
# 4       B  English     85
# 5       C  English     88
# 6       A  Science     88
# 7       B  Science     91
# 8       C  Science     84
df.melt(id_vars='Student', var_name='Subject', value_name='Score')  # name the new variable column Subject and the new value column Score
#   Student  Subject  Score
# 0       A     Math     85
# 1       B     Math     90
# 2       C     Math     78
# 3       A  English     92
# 4       B  English     85
# 5       C  English     88
# 6       A  Science     88
# 7       B  Science     91
# 8       C  Science     84
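
The two parameters not used above, value_vars and ignore_index, can be sketched as follows. The custom index x, y, z and the name long_df are assumptions added only to make the effect of ignore_index=False visible.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["A", "B", "C"],
    "Math": [85, 90, 78],
    "English": [92, 85, 88],
    "Science": [88, 91, 84],
}, index=["x", "y", "z"])  # hypothetical labels, not part of the original example

# Unpivot only Math and English; Science does not appear in the result.
# ignore_index=False repeats the original labels instead of renumbering rows.
long_df = df.melt(id_vars="Student", value_vars=["Math", "English"],
                  ignore_index=False)
print(long_df)
```

Each original row label appears once per unpivoted column, so the result has six rows.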

Broadcasting

Calculations between a Series/DataFrame and a scalar

In a calculation with a scalar, the scalar is broadcast to the shape of the Series/DataFrame.

pd.Series(["foo", "bar", "baz"]) == "foo"
# 0     True
# 1    False
# 2    False
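
The same rule holds for a DataFrame. A minimal sketch with a made-up two-column frame (the names doubled and mask are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

doubled = df * 2  # every cell is multiplied by 2
mask = df > 2     # every cell is compared with 2, giving a boolean DataFrame
print(doubled)
print(mask)
```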

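Calculations between Series and Series, or DataFrame and DataFrame

When two Series (or two DataFrames) are combined, the elements are matched one to one after automatic data alignment. Comparison operators such as == additionally require both operands to have the same length (and, for two Series or two DataFrames, identical labels); arithmetic operators align on labels and fill unmatched labels with NaN.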

pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])
# 0     True
# 1     True
# 2    False
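
The difference between arithmetic and comparison operators can be sketched with two Series whose labels only partly overlap. The data and the names total, compare_failed and equal_aligned are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([3, 20, 30], index=["c", "b", "d"])

# Arithmetic aligns on labels: the result covers a, b, c and d,
# and labels present in only one operand become NaN.
total = s1 + s2

# The == operator refuses to compare differently labeled Series.
try:
    s1 == s2
    compare_failed = False
except ValueError:
    compare_failed = True

# The method form eq() aligns on labels first and then compares.
equal_aligned = s1.eq(s2)
print(total)
print(compare_failed)
print(equal_aligned)
```

When alignment is wanted in a comparison, the method forms (eq, ne, lt, ...) are the safer choice; the operators are stricter on purpose.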

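Calculations between a DataFrame and a Series

The slightly more complex broadcasting behaviour mainly appears in operations between a DataFrame and a Series. The subtraction method DataFrame.sub() is used here to illustrate it.

DataFrame.sub(other, axis='columns') uses the axis parameter to control the dimension (direction) of broadcasting. With axis=1 or 'columns', the Series is broadcast vertically (imagine the Series pinned onto the column labels); with axis=0 or 'index', the Series is broadcast horizontally (imagine the Series pinned onto the row labels).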

df = pd.DataFrame(
{
"one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
"two": pd.Series([2, 3, 4, 5], index=["a", "b", "c", "d"]),
"three": pd.Series([10, 11, 12], index=["b", "c", "d"]),
}
)
#    one  two  three
# a  1.0    2    NaN
# b  2.0    3   10.0
# c  3.0    4   11.0
# d  NaN    5   12.0
df.sub([1,2,3])  # axis='columns' by default: equivalent to a-[1,2,3], b-[1,2,3] ...
#    one  two  three
# a  0.0    0    NaN
# b  1.0    1    7.0
# c  2.0    2    8.0
# d  NaN    3    9.0
df.sub([1,2,3,4], axis='index')  # equivalent to one-[1,2,3,4], two-[1,2,3,4] ...
#    one  two  three
# a  0.0    1    NaN
# b  0.0    1    8.0
# c  0.0    1    8.0
# d  NaN    1    8.0
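
The examples above pass plain lists, which are matched by position. When other is a labeled Series, the same axis rule applies but matching is done by label. A minimal sketch with a smaller made-up frame (the names by_col and by_row are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

# axis='columns' (default): the Series index is matched against column labels,
# so 10 is subtracted from column one and 1 from column two, whatever the order.
by_col = df.sub(pd.Series({"two": 1, "one": 10}))

# axis='index': the Series index is matched against row labels.
by_row = df.sub(pd.Series({"c": 3, "a": 1, "b": 2}), axis="index")
print(by_col)
print(by_row)
```

Because matching is by label, the order of the entries in the Series does not matter here, unlike with a list.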

Author: qq_44048812



