⭐本专栏旨在对Python的基础语法进行详解,精炼地总结语法中的重点,详解难点,面向零基础及入门的学习者,通过专栏的学习可以熟练掌握python编程,同时为后续的数据分析,机器学习及深度学习的代码能力打下坚实的基础。
🔥本文已收录于Python基础系列专栏: Python基础系列教程 欢迎订阅,持续更新。


文章目录
引子
11.1 对象创建
11.1.1 Pandas Series对象
11.1.2 Pandas DataFrame对象
11.2 DataFrame性质
11.3 数值运算及统计分析
11.4 缺失值处理
11.5 合并数据
11.6 分组和数据透视表
11.7 其他
引子
Numpy 在向量化的数值计算中表现优异
但是在处理更灵活、复杂的数据任务:
如为数据添加标签、处理缺失值、分组和透视表等方面
Numpy显得力不从心
而基于Numpy构建的Pandas库,提供了使得数据分析变得更快更简单的高级数据结构和操作工具
11.1 对象创建

11.1.1 Pandas Series对象
Series 是带标签数据的一维数组
Series对象的创建
通用结构: pd.Series(data, index=index, dtype=dtype)
data:数据,可以是列表,字典或Numpy数组
index:索引,为可选参数
dtype: 数据类型,为可选参数
1、用列表创建
index缺省,默认为整数序列
import pandas as pd
data = pd.Series([1.5, 3, 4.5, 6])
data
0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64
增加index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64
增加数据类型
缺省则从传入的数据自动判断
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
注意:数据支持多种类型
混合后数据类型变为object
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: object
data["a"]
1
data["c"]
'3'
数据类型可被强制改变
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
data["c"]
3.0
不能转为浮点数则会报错
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9236/4046912764.py in <module>
----> 1 data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
2 data
NameError: name 'pd' is not defined
2、用一维numpy数组创建
import numpy as np
x = np.arange(5)
pd.Series(x)
0 0
1 1
2 2
3 3
4 4
dtype: int32
3、用字典创建
默认以键为index 值为data
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64
字典创建,如果指定index,则会到字典的键中筛选,找不到的,值设为NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
population
BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64
4、data为标量的情况
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
11.1.2 Pandas DataFrame对象
DataFrame 是带标签数据的多维数组
DataFrame对象的创建
通用结构: pd.DataFrame(data, index=index, columns=columns)
data:数据,可以是列表,字典或Numpy数组
index:索引,为可选参数
columns: 列标签,为可选参数
1、通过Series对象创建
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
pd.DataFrame(population)
|
0 |
BeiJing |
2154 |
ShangHai |
2424 |
ShenZhen |
1303 |
HangZhou |
981 |
pd.DataFrame(population, columns=["population"])
|
population |
BeiJing |
2154 |
ShangHai |
2424 |
ShenZhen |
1303 |
HangZhou |
981 |
2、通过Series对象字典创建
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
pd.DataFrame({"population": population,
"GDP": GDP})
|
population |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
注意:数量不够的会自动补齐
pd.DataFrame({"population": population,
"GDP": GDP,
"country": "China"})
|
population |
GDP |
country |
BeiJing |
2154 |
30320 |
China |
ShangHai |
2424 |
32680 |
China |
ShenZhen |
1303 |
24222 |
China |
HangZhou |
981 |
13468 |
China |
3、通过字典列表对象创建
字典索引作为index,字典键作为columns
import numpy as np
import pandas as pd
data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
行的标签没有排,因此行从0开始,列的标签延续。
从中取出一列数据
data1 = data["a"].copy()
data1
0 0
1 1
2 2
Name: a, dtype: int64
data1[0] = 10
data1
0 10
1 1
2 2
Name: a, dtype: int64
data
不存在的键,会默认值为NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)
|
a |
b |
c |
0 |
1.0 |
1 |
NaN |
1 |
NaN |
3 |
4.0 |
4、通过Numpy二维数组创建
data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
[2, 9],
[4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
|
foo |
bar |
a |
1 |
6 |
b |
2 |
9 |
c |
4 |
0 |
11.2 DataFrame性质
1、属性
data = pd.DataFrame({"pop": population, "GDP": GDP})
data
|
pop |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
(1)df.values 返回numpy数组表示的数据
data.values
array([[ 2154, 30320],
[ 2424, 32680],
[ 1303, 24222],
[ 981, 13468]], dtype=int64)
(2)df.index 返回行索引
data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
(3)df.columns 返回列索引
data.columns
Index(['pop', 'GDP'], dtype='object')
(4)df.shape 形状
data.shape
(4, 2)
(5) pd.size 大小
data.size
8
(6)pd.dtypes 返回每列数据类型
data.dtypes
pop int64
GDP int64
dtype: object
2、索引
data
|
pop |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
(1)获取列
字典式
data["pop"]
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: pop, dtype: int64
data[["GDP", "pop"]]
|
GDP |
pop |
BeiJing |
30320 |
2154 |
ShangHai |
32680 |
2424 |
ShenZhen |
24222 |
1303 |
HangZhou |
13468 |
981 |
对象属性式
data.GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64
(2)获取行
绝对索引 df.loc
data.loc["BeiJing"]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]
|
pop |
GDP |
BeiJing |
2154 |
30320 |
HangZhou |
981 |
13468 |
相对索引 df.iloc
data
|
pop |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
data.iloc[0]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]
|
pop |
GDP |
ShangHai |
2424 |
32680 |
HangZhou |
981 |
13468 |
(3)获取标量
data
|
pop |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320
(4)Series对象的索引
type(data.GDP)
pandas.core.series.Series
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
GDP["BeiJing"]
30320
3、切片
dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
(1)行切片
df["2019-01-01": "2019-01-03"]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
df.loc["2019-01-01": "2019-01-03"]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
df.iloc[0: 3]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
注意:这里的3是取不到的。
(2)列切片
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
df.loc[:, "A": "C"]
|
A |
B |
C |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
df.iloc[:, 0: 3]
|
A |
B |
C |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
(3)多种多样的取值
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
行、列同时切片
df.loc["2019-01-02": "2019-01-03", "C":"D"]
|
C |
D |
2019-01-02 |
1.080779 |
-2.294395 |
2019-01-03 |
1.102248 |
1.207726 |
df.iloc[1: 3, 2:]
|
C |
D |
2019-01-02 |
1.080779 |
-2.294395 |
2019-01-03 |
1.102248 |
1.207726 |
行切片,列分散取值
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
|
A |
C |
2019-01-04 |
0.305088 |
-0.978434 |
2019-01-05 |
0.313383 |
0.163155 |
2019-01-06 |
0.250613 |
-0.858240 |
df.iloc[3:, [0, 2]]
|
A |
C |
2019-01-04 |
0.305088 |
-0.978434 |
2019-01-05 |
0.313383 |
0.163155 |
2019-01-06 |
0.250613 |
-0.858240 |
行分散取值,列切片
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
上面这种方式是行不通的。
df.iloc[[1, 5], 0: 3]
|
A |
B |
C |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
行、列均分散取值
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
同样,上面这种方式是行不通的。
df.iloc[[1, 5], [0, 3]]
|
A |
D |
2019-01-02 |
-0.234414 |
-2.294395 |
2019-01-06 |
0.250613 |
-1.573342 |
4、布尔索引
相当于numpy当中的掩码操作。
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
df > 0
|
A |
B |
C |
D |
2019-01-01 |
False |
False |
True |
False |
2019-01-02 |
False |
False |
True |
False |
2019-01-03 |
False |
True |
True |
True |
2019-01-04 |
True |
True |
False |
True |
2019-01-05 |
True |
True |
True |
False |
2019-01-06 |
True |
False |
False |
False |
df[df > 0]
|
A |
B |
C |
D |
2019-01-01 |
NaN |
NaN |
0.925984 |
NaN |
2019-01-02 |
NaN |
NaN |
1.080779 |
NaN |
2019-01-03 |
NaN |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
NaN |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
NaN |
2019-01-06 |
0.250613 |
NaN |
NaN |
NaN |
可以观察到,为true的部分都被取到了,而false没有。
df.A > 0
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2019-01-05 True
2019-01-06 True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
|
A |
B |
C |
D |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
isin()方法
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
|
A |
B |
C |
D |
E |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
one |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
one |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
two |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
three |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
four |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
three |
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
df2[ind]
|
A |
B |
C |
D |
E |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
two |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
four |
(5)赋值
df
DataFrame 增加新列
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
df["E"] = s1
df
|
A |
B |
C |
D |
E |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
修改赋值
df.loc["2019-01-01", "A"] = 0
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
-0.190742 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
df.iloc[0, 1] = 0
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
0.000000 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
df["D"] = np.array([5]*len(df)) # 可简化成df["D"] = 5
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
修改index和columns
df.index = [i for i in range(len(df))]
df
|
A |
B |
C |
D |
E |
0 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
1 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
3 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
4 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
5 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
df.columns = [i for i in range(df.shape[1])]
df
|
0 |
1 |
2 |
3 |
4 |
0 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
1 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
3 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
4 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
5 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
11.3 数值运算及统计分析

1、数据的查看
import pandas as pd
import numpy as np
dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
(1)查看前面的行
df.head() # 默认5行,也可以进行设置
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
df.head(2)
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
(2)查看后面的行
df.tail() # 默认5行
|
A |
B |
C |
D |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
df.tail(3)
|
A |
B |
C |
D |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
(3)查看总体信息
df.iloc[0, 3] = np.nan
df
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
NaN |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
2、Numpy通用函数同样适用于Pandas
(1)向量化运算
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
x+5
np.exp(x)
|
0 |
1 |
2 |
3 |
0 |
1.0 |
2.718282 |
7.389056 |
20.085537 |
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
x*y
(2)矩阵化运算
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
… |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
0 |
6 |
3 |
7 |
4 |
6 |
9 |
2 |
6 |
7 |
4 |
… |
4 |
0 |
9 |
5 |
8 |
0 |
9 |
2 |
6 |
3 |
1 |
8 |
2 |
4 |
2 |
6 |
4 |
8 |
6 |
1 |
3 |
… |
2 |
0 |
3 |
1 |
7 |
3 |
1 |
5 |
5 |
9 |
2 |
3 |
5 |
1 |
9 |
1 |
9 |
3 |
7 |
6 |
8 |
… |
6 |
8 |
7 |
0 |
7 |
7 |
2 |
0 |
7 |
2 |
3 |
2 |
0 |
4 |
9 |
6 |
9 |
8 |
6 |
8 |
7 |
… |
0 |
2 |
4 |
2 |
0 |
4 |
9 |
6 |
6 |
8 |
4 |
9 |
9 |
2 |
6 |
0 |
3 |
3 |
4 |
6 |
6 |
… |
9 |
6 |
8 |
6 |
0 |
0 |
8 |
8 |
3 |
8 |
5 |
2 |
6 |
5 |
7 |
8 |
4 |
0 |
2 |
9 |
7 |
… |
2 |
0 |
4 |
0 |
7 |
0 |
0 |
1 |
1 |
5 |
6 |
6 |
4 |
0 |
0 |
2 |
1 |
4 |
9 |
5 |
6 |
… |
5 |
0 |
8 |
5 |
2 |
3 |
3 |
2 |
9 |
2 |
7 |
2 |
3 |
6 |
3 |
8 |
0 |
7 |
6 |
1 |
7 |
… |
3 |
0 |
1 |
0 |
4 |
4 |
6 |
8 |
8 |
2 |
8 |
2 |
2 |
3 |
7 |
5 |
7 |
0 |
7 |
3 |
0 |
… |
1 |
1 |
5 |
2 |
8 |
3 |
0 |
3 |
0 |
4 |
9 |
3 |
7 |
7 |
6 |
2 |
0 |
0 |
2 |
5 |
6 |
… |
4 |
2 |
3 |
2 |
0 |
0 |
4 |
5 |
2 |
8 |
10 |
4 |
7 |
0 |
4 |
2 |
0 |
3 |
4 |
6 |
0 |
… |
5 |
6 |
1 |
9 |
1 |
9 |
0 |
7 |
0 |
8 |
11 |
5 |
6 |
9 |
6 |
9 |
2 |
1 |
8 |
7 |
9 |
… |
6 |
5 |
2 |
8 |
9 |
5 |
9 |
9 |
5 |
0 |
12 |
3 |
9 |
5 |
5 |
4 |
0 |
7 |
4 |
4 |
6 |
… |
0 |
7 |
2 |
9 |
6 |
9 |
4 |
9 |
4 |
6 |
13 |
8 |
4 |
0 |
9 |
9 |
0 |
1 |
5 |
8 |
7 |
… |
5 |
8 |
4 |
0 |
3 |
4 |
9 |
9 |
4 |
6 |
14 |
3 |
0 |
4 |
6 |
9 |
9 |
5 |
4 |
3 |
1 |
… |
6 |
1 |
0 |
3 |
7 |
1 |
2 |
0 |
0 |
2 |
15 |
4 |
2 |
0 |
0 |
7 |
9 |
1 |
2 |
1 |
2 |
… |
6 |
3 |
9 |
4 |
1 |
7 |
3 |
8 |
4 |
8 |
16 |
3 |
9 |
4 |
8 |
7 |
2 |
0 |
2 |
3 |
1 |
… |
8 |
0 |
0 |
3 |
8 |
5 |
2 |
0 |
3 |
8 |
17 |
2 |
8 |
6 |
3 |
2 |
9 |
4 |
4 |
2 |
8 |
… |
6 |
9 |
4 |
2 |
6 |
1 |
8 |
9 |
9 |
0 |
18 |
5 |
6 |
7 |
9 |
8 |
1 |
9 |
1 |
4 |
4 |
… |
3 |
5 |
2 |
5 |
6 |
9 |
9 |
2 |
6 |
2 |
19 |
1 |
9 |
3 |
7 |
8 |
6 |
0 |
2 |
8 |
0 |
… |
4 |
3 |
2 |
2 |
3 |
8 |
1 |
8 |
0 |
0 |
20 |
4 |
5 |
5 |
2 |
6 |
8 |
9 |
7 |
5 |
7 |
… |
3 |
5 |
0 |
8 |
0 |
4 |
3 |
2 |
5 |
1 |
21 |
2 |
4 |
8 |
1 |
9 |
7 |
1 |
4 |
6 |
7 |
… |
0 |
1 |
8 |
2 |
0 |
4 |
6 |
5 |
0 |
4 |
22 |
4 |
5 |
2 |
4 |
6 |
4 |
4 |
4 |
9 |
9 |
… |
1 |
7 |
6 |
9 |
9 |
1 |
5 |
5 |
2 |
1 |
23 |
0 |
5 |
4 |
8 |
0 |
6 |
4 |
4 |
1 |
2 |
… |
8 |
5 |
0 |
7 |
6 |
9 |
2 |
0 |
4 |
3 |
24 |
9 |
7 |
0 |
9 |
0 |
3 |
7 |
4 |
1 |
5 |
… |
3 |
7 |
8 |
2 |
2 |
1 |
9 |
2 |
2 |
4 |
25 |
4 |
1 |
9 |
5 |
4 |
5 |
0 |
4 |
8 |
9 |
… |
9 |
3 |
0 |
7 |
0 |
2 |
3 |
7 |
5 |
9 |
26 |
6 |
7 |
1 |
9 |
7 |
2 |
6 |
2 |
6 |
1 |
… |
0 |
6 |
5 |
9 |
8 |
0 |
3 |
8 |
3 |
9 |
27 |
2 |
8 |
1 |
3 |
5 |
1 |
7 |
7 |
0 |
2 |
… |
8 |
0 |
4 |
5 |
4 |
5 |
5 |
6 |
3 |
7 |
28 |
6 |
8 |
6 |
2 |
2 |
7 |
4 |
3 |
7 |
5 |
… |
1 |
7 |
9 |
2 |
4 |
5 |
9 |
5 |
3 |
2 |
29 |
3 |
0 |
3 |
0 |
0 |
9 |
5 |
4 |
3 |
2 |
… |
1 |
3 |
0 |
4 |
8 |
0 |
8 |
7 |
5 |
6 |
30 rows × 30 columns
转置
z = x.T
z
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
… |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
0 |
6 |
8 |
3 |
2 |
9 |
2 |
6 |
2 |
2 |
3 |
… |
4 |
2 |
4 |
0 |
9 |
4 |
6 |
2 |
6 |
3 |
1 |
3 |
2 |
5 |
0 |
9 |
6 |
4 |
3 |
2 |
7 |
… |
5 |
4 |
5 |
5 |
7 |
1 |
7 |
8 |
8 |
0 |
2 |
7 |
4 |
1 |
4 |
2 |
5 |
0 |
6 |
3 |
7 |
… |
5 |
8 |
2 |
4 |
0 |
9 |
1 |
1 |
6 |
3 |
3 |
4 |
2 |
9 |
9 |
6 |
7 |
0 |
3 |
7 |
6 |
… |
2 |
1 |
4 |
8 |
9 |
5 |
9 |
3 |
2 |
0 |
4 |
6 |
6 |
1 |
6 |
0 |
8 |
2 |
8 |
5 |
2 |
… |
6 |
9 |
6 |
0 |
0 |
4 |
7 |
5 |
2 |
0 |
5 |
9 |
4 |
9 |
9 |
3 |
4 |
1 |
0 |
7 |
0 |
… |
8 |
7 |
4 |
6 |
3 |
5 |
2 |
1 |
7 |
9 |
6 |
2 |
8 |
3 |
8 |
3 |
0 |
4 |
7 |
0 |
0 |
… |
9 |
1 |
4 |
4 |
7 |
0 |
6 |
7 |
4 |
5 |
7 |
6 |
6 |
7 |
6 |
4 |
2 |
9 |
6 |
7 |
2 |
… |
7 |
4 |
4 |
4 |
4 |
4 |
2 |
7 |
3 |
4 |
8 |
7 |
1 |
6 |
8 |
6 |
9 |
5 |
1 |
3 |
5 |
… |
5 |
6 |
9 |
1 |
1 |
8 |
6 |
0 |
7 |
3 |
9 |
4 |
3 |
8 |
7 |
6 |
7 |
6 |
7 |
0 |
6 |
… |
7 |
7 |
9 |
2 |
5 |
9 |
1 |
2 |
5 |
2 |
10 |
3 |
8 |
7 |
1 |
3 |
5 |
3 |
0 |
7 |
5 |
… |
4 |
0 |
2 |
6 |
4 |
1 |
9 |
9 |
1 |
0 |
11 |
7 |
1 |
4 |
0 |
6 |
7 |
6 |
8 |
3 |
5 |
… |
7 |
5 |
0 |
5 |
1 |
0 |
5 |
8 |
3 |
5 |
12 |
7 |
9 |
1 |
6 |
2 |
8 |
7 |
8 |
5 |
5 |
… |
9 |
0 |
4 |
1 |
2 |
9 |
2 |
4 |
3 |
1 |
13 |
2 |
8 |
4 |
6 |
5 |
3 |
0 |
1 |
7 |
2 |
… |
3 |
1 |
8 |
5 |
8 |
8 |
2 |
5 |
5 |
7 |
14 |
5 |
9 |
7 |
7 |
1 |
0 |
5 |
6 |
3 |
5 |
… |
9 |
0 |
0 |
1 |
6 |
9 |
8 |
3 |
5 |
9 |
15 |
4 |
4 |
9 |
4 |
9 |
0 |
7 |
9 |
2 |
7 |
… |
7 |
4 |
2 |
1 |
6 |
8 |
6 |
9 |
0 |
4 |
16 |
1 |
1 |
8 |
2 |
8 |
9 |
4 |
2 |
8 |
1 |
… |
9 |
9 |
3 |
1 |
5 |
8 |
4 |
1 |
7 |
6 |
17 |
7 |
3 |
8 |
7 |
4 |
3 |
3 |
6 |
2 |
4 |
… |
1 |
8 |
0 |
2 |
7 |
5 |
9 |
7 |
5 |
9 |
18 |
5 |
6 |
0 |
5 |
5 |
6 |
1 |
9 |
8 |
0 |
… |
4 |
5 |
0 |
1 |
3 |
7 |
6 |
5 |
2 |
1 |
19 |
1 |
7 |
8 |
2 |
3 |
1 |
5 |
8 |
1 |
0 |
… |
8 |
0 |
7 |
3 |
7 |
0 |
8 |
4 |
8 |
7 |
20 |
4 |
2 |
6 |
0 |
9 |
2 |
5 |
3 |
1 |
4 |
… |
3 |
0 |
1 |
8 |
3 |
9 |
0 |
8 |
1 |
1 |
21 |
0 |
0 |
8 |
2 |
6 |
0 |
0 |
0 |
1 |
2 |
… |
5 |
1 |
7 |
5 |
7 |
3 |
6 |
0 |
7 |
3 |
22 |
9 |
3 |
7 |
4 |
8 |
4 |
8 |
1 |
5 |
3 |
… |
0 |
8 |
6 |
0 |
8 |
0 |
5 |
4 |
9 |
0 |
23 |
5 |
1 |
0 |
2 |
6 |
0 |
5 |
0 |
2 |
2 |
… |
8 |
2 |
9 |
7 |
2 |
7 |
9 |
5 |
2 |
4 |
24 |
8 |
7 |
7 |
0 |
0 |
7 |
2 |
4 |
8 |
0 |
… |
0 |
0 |
9 |
6 |
2 |
0 |
8 |
4 |
4 |
8 |
25 |
0 |
3 |
7 |
4 |
0 |
0 |
3 |
4 |
3 |
0 |
… |
4 |
4 |
1 |
9 |
1 |
2 |
0 |
5 |
5 |
0 |
26 |
9 |
1 |
2 |
9 |
8 |
0 |
3 |
6 |
0 |
4 |
… |
3 |
6 |
5 |
2 |
9 |
3 |
3 |
5 |
9 |
8 |
27 |
2 |
5 |
0 |
6 |
8 |
1 |
2 |
8 |
3 |
5 |
… |
2 |
5 |
5 |
0 |
2 |
7 |
8 |
6 |
5 |
7 |
28 |
6 |
5 |
7 |
6 |
3 |
1 |
9 |
8 |
0 |
2 |
… |
5 |
0 |
2 |
4 |
2 |
5 |
3 |
3 |
3 |
5 |
29 |
3 |
9 |
2 |
8 |
8 |
5 |
2 |
2 |
4 |
8 |
… |
1 |
4 |
1 |
3 |
4 |
9 |
9 |
7 |
2 |
6 |
30 rows × 30 columns
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
… |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
0 |
5 |
8 |
9 |
5 |
0 |
0 |
1 |
7 |
6 |
9 |
… |
1 |
7 |
0 |
6 |
9 |
9 |
7 |
6 |
9 |
1 |
1 |
0 |
1 |
8 |
8 |
3 |
9 |
8 |
7 |
3 |
6 |
… |
9 |
2 |
0 |
4 |
9 |
2 |
7 |
7 |
9 |
8 |
2 |
6 |
9 |
3 |
7 |
7 |
4 |
5 |
9 |
3 |
6 |
… |
7 |
7 |
1 |
1 |
3 |
0 |
8 |
6 |
4 |
5 |
3 |
6 |
2 |
5 |
7 |
8 |
4 |
4 |
7 |
7 |
4 |
… |
0 |
1 |
9 |
8 |
2 |
3 |
1 |
2 |
7 |
2 |
4 |
6 |
0 |
9 |
2 |
6 |
6 |
2 |
7 |
7 |
0 |
… |
1 |
5 |
4 |
0 |
7 |
8 |
9 |
5 |
7 |
0 |
5 |
9 |
3 |
9 |
1 |
4 |
4 |
6 |
8 |
8 |
9 |
… |
1 |
8 |
7 |
0 |
3 |
4 |
2 |
0 |
3 |
5 |
6 |
1 |
2 |
4 |
3 |
0 |
6 |
0 |
7 |
2 |
8 |
… |
4 |
3 |
3 |
6 |
7 |
3 |
5 |
3 |
2 |
4 |
7 |
4 |
0 |
3 |
3 |
8 |
3 |
5 |
6 |
7 |
5 |
… |
1 |
7 |
3 |
1 |
6 |
6 |
9 |
6 |
9 |
6 |
8 |
0 |
0 |
2 |
9 |
6 |
0 |
6 |
7 |
0 |
3 |
… |
6 |
7 |
9 |
5 |
4 |
9 |
5 |
2 |
5 |
6 |
9 |
6 |
8 |
7 |
7 |
7 |
2 |
6 |
0 |
5 |
2 |
… |
7 |
0 |
6 |
2 |
4 |
3 |
6 |
7 |
6 |
3 |
10 |
0 |
6 |
4 |
7 |
6 |
2 |
9 |
5 |
9 |
9 |
… |
4 |
9 |
3 |
9 |
1 |
2 |
5 |
4 |
0 |
8 |
11 |
2 |
3 |
9 |
9 |
4 |
4 |
8 |
2 |
1 |
6 |
… |
0 |
5 |
9 |
8 |
6 |
6 |
0 |
4 |
7 |
3 |
12 |
0 |
1 |
6 |
0 |
6 |
1 |
6 |
4 |
2 |
5 |
… |
8 |
8 |
0 |
7 |
2 |
0 |
7 |
1 |
1 |
9 |
13 |
5 |
1 |
5 |
9 |
6 |
4 |
9 |
8 |
7 |
5 |
… |
2 |
4 |
3 |
2 |
0 |
0 |
4 |
2 |
5 |
0 |
14 |
0 |
3 |
8 |
5 |
3 |
1 |
4 |
7 |
3 |
2 |
… |
8 |
5 |
5 |
7 |
5 |
9 |
1 |
3 |
9 |
3 |
15 |
3 |
3 |
6 |
1 |
3 |
0 |
5 |
0 |
5 |
2 |
… |
7 |
1 |
7 |
7 |
3 |
8 |
3 |
0 |
6 |
3 |
16 |
0 |
6 |
5 |
9 |
6 |
4 |
6 |
6 |
2 |
2 |
… |
3 |
6 |
8 |
6 |
5 |
1 |
3 |
2 |
6 |
3 |
17 |
6 |
7 |
2 |
8 |
0 |
1 |
8 |
6 |
0 |
0 |
… |
5 |
6 |
2 |
5 |
4 |
3 |
0 |
6 |
2 |
1 |
18 |
9 |
4 |
4 |
0 |
9 |
8 |
7 |
7 |
6 |
1 |
… |
7 |
9 |
9 |
7 |
1 |
1 |
4 |
6 |
5 |
6 |
19 |
4 |
1 |
1 |
5 |
1 |
2 |
6 |
2 |
3 |
3 |
… |
0 |
0 |
0 |
9 |
8 |
5 |
9 |
3 |
4 |
0 |
20 |
9 |
8 |
6 |
3 |
9 |
9 |
0 |
8 |
1 |
6 |
… |
2 |
9 |
0 |
1 |
3 |
9 |
4 |
8 |
8 |
8 |
21 |
2 |
8 |
6 |
4 |
9 |
0 |
5 |
5 |
6 |
1 |
… |
6 |
7 |
5 |
6 |
8 |
7 |
4 |
2 |
4 |
0 |
22 |
0 |
3 |
5 |
9 |
0 |
3 |
6 |
5 |
1 |
1 |
… |
6 |
2 |
5 |
3 |
9 |
3 |
9 |
5 |
1 |
9 |
23 |
7 |
7 |
0 |
8 |
6 |
1 |
2 |
0 |
4 |
4 |
… |
1 |
9 |
6 |
0 |
2 |
8 |
3 |
7 |
2 |
5 |
24 |
6 |
0 |
4 |
2 |
3 |
1 |
0 |
5 |
7 |
0 |
… |
1 |
1 |
2 |
7 |
5 |
2 |
9 |
4 |
7 |
3 |
25 |
5 |
0 |
2 |
1 |
4 |
9 |
4 |
6 |
9 |
3 |
… |
5 |
5 |
3 |
5 |
9 |
2 |
7 |
4 |
1 |
6 |
26 |
9 |
8 |
1 |
8 |
1 |
6 |
2 |
6 |
1 |
8 |
… |
2 |
5 |
1 |
2 |
5 |
3 |
3 |
6 |
1 |
8 |
27 |
1 |
8 |
6 |
4 |
6 |
9 |
5 |
4 |
7 |
2 |
… |
9 |
3 |
1 |
5 |
1 |
1 |
7 |
1 |
2 |
6 |
28 |
0 |
7 |
7 |
4 |
3 |
2 |
7 |
8 |
5 |
2 |
… |
0 |
2 |
8 |
3 |
7 |
3 |
9 |
2 |
3 |
8 |
29 |
8 |
0 |
2 |
6 |
8 |
3 |
6 |
4 |
9 |
7 |
… |
6 |
7 |
8 |
5 |
7 |
2 |
5 |
3 |
4 |
5 |
30 rows × 30 columns
x.dot(y)
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
… |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
0 |
616 |
560 |
723 |
739 |
612 |
457 |
681 |
799 |
575 |
590 |
… |
523 |
739 |
613 |
580 |
668 |
602 |
733 |
585 |
657 |
700 |
1 |
520 |
438 |
691 |
600 |
612 |
455 |
666 |
764 |
707 |
592 |
… |
555 |
681 |
503 |
679 |
641 |
506 |
779 |
494 |
633 |
590 |
2 |
557 |
570 |
786 |
807 |
690 |
469 |
804 |
828 |
704 |
573 |
… |
563 |
675 |
712 |
758 |
793 |
672 |
754 |
550 |
756 |
638 |
3 |
605 |
507 |
664 |
701 |
660 |
496 |
698 |
806 |
651 |
575 |
… |
582 |
685 |
668 |
586 |
629 |
534 |
678 |
484 |
591 |
626 |
4 |
599 |
681 |
753 |
873 |
721 |
563 |
754 |
770 |
620 |
654 |
… |
633 |
747 |
661 |
677 |
726 |
649 |
716 |
610 |
735 |
706 |
5 |
422 |
354 |
602 |
627 |
613 |
396 |
617 |
627 |
489 |
423 |
… |
456 |
572 |
559 |
537 |
499 |
384 |
589 |
436 |
574 |
507 |
6 |
359 |
446 |
599 |
599 |
481 |
357 |
577 |
572 |
451 |
464 |
… |
449 |
550 |
495 |
532 |
633 |
554 |
663 |
476 |
565 |
602 |
7 |
531 |
520 |
698 |
590 |
607 |
537 |
665 |
696 |
571 |
472 |
… |
576 |
588 |
551 |
665 |
652 |
527 |
742 |
528 |
650 |
599 |
8 |
449 |
322 |
547 |
533 |
593 |
399 |
584 |
638 |
587 |
424 |
… |
402 |
596 |
523 |
523 |
447 |
362 |
561 |
386 |
529 |
484 |
9 |
373 |
433 |
525 |
601 |
522 |
345 |
551 |
521 |
434 |
447 |
… |
508 |
498 |
438 |
478 |
459 |
418 |
488 |
407 |
503 |
496 |
10 |
500 |
427 |
574 |
607 |
667 |
477 |
652 |
656 |
615 |
477 |
… |
622 |
702 |
531 |
610 |
558 |
532 |
598 |
471 |
582 |
561 |
11 |
664 |
694 |
772 |
841 |
779 |
574 |
730 |
810 |
711 |
608 |
… |
591 |
760 |
616 |
638 |
721 |
676 |
846 |
678 |
754 |
708 |
12 |
545 |
547 |
687 |
701 |
721 |
576 |
689 |
724 |
710 |
532 |
… |
674 |
684 |
648 |
694 |
710 |
564 |
757 |
571 |
671 |
656 |
13 |
574 |
586 |
723 |
750 |
691 |
494 |
696 |
787 |
667 |
523 |
… |
618 |
681 |
568 |
682 |
715 |
644 |
756 |
557 |
690 |
604 |
14 |
502 |
382 |
645 |
557 |
570 |
403 |
538 |
677 |
500 |
501 |
… |
369 |
650 |
507 |
576 |
546 |
531 |
554 |
437 |
616 |
463 |
15 |
510 |
505 |
736 |
651 |
649 |
510 |
719 |
733 |
694 |
557 |
… |
605 |
717 |
574 |
642 |
678 |
576 |
755 |
455 |
598 |
654 |
16 |
567 |
376 |
614 |
612 |
643 |
514 |
598 |
724 |
547 |
464 |
… |
456 |
639 |
520 |
560 |
569 |
442 |
596 |
517 |
659 |
532 |
17 |
626 |
716 |
828 |
765 |
740 |
603 |
809 |
852 |
692 |
591 |
… |
664 |
716 |
655 |
721 |
742 |
612 |
819 |
593 |
744 |
712 |
18 |
600 |
559 |
667 |
664 |
641 |
556 |
624 |
815 |
638 |
564 |
… |
581 |
701 |
559 |
677 |
710 |
554 |
748 |
597 |
614 |
657 |
19 |
445 |
431 |
661 |
681 |
641 |
552 |
690 |
719 |
602 |
474 |
… |
515 |
637 |
576 |
620 |
572 |
512 |
599 |
455 |
622 |
538 |
20 |
523 |
569 |
784 |
725 |
713 |
501 |
740 |
772 |
638 |
640 |
… |
589 |
775 |
664 |
686 |
726 |
672 |
747 |
548 |
723 |
645 |
21 |
487 |
465 |
553 |
639 |
517 |
449 |
592 |
609 |
454 |
398 |
… |
492 |
567 |
534 |
404 |
554 |
417 |
561 |
466 |
498 |
492 |
22 |
479 |
449 |
574 |
686 |
583 |
377 |
566 |
614 |
563 |
455 |
… |
453 |
539 |
491 |
501 |
596 |
520 |
722 |
478 |
565 |
501 |
23 |
483 |
386 |
476 |
526 |
550 |
426 |
492 |
585 |
536 |
482 |
… |
322 |
541 |
438 |
456 |
487 |
408 |
502 |
426 |
474 |
481 |
24 |
523 |
551 |
658 |
767 |
537 |
444 |
663 |
731 |
576 |
577 |
… |
522 |
590 |
525 |
664 |
691 |
548 |
635 |
526 |
641 |
538 |
25 |
652 |
656 |
738 |
753 |
853 |
508 |
752 |
815 |
669 |
576 |
… |
694 |
833 |
693 |
606 |
575 |
616 |
704 |
559 |
728 |
672 |
26 |
578 |
577 |
744 |
856 |
699 |
497 |
779 |
800 |
733 |
587 |
… |
630 |
754 |
704 |
834 |
760 |
680 |
765 |
592 |
731 |
629 |
27 |
554 |
494 |
665 |
689 |
630 |
574 |
695 |
703 |
636 |
599 |
… |
554 |
685 |
532 |
658 |
649 |
554 |
693 |
577 |
634 |
668 |
28 |
498 |
552 |
659 |
784 |
552 |
492 |
690 |
775 |
544 |
551 |
… |
567 |
636 |
518 |
599 |
742 |
521 |
733 |
533 |
605 |
604 |
29 |
513 |
491 |
563 |
642 |
477 |
367 |
589 |
647 |
516 |
484 |
… |
428 |
574 |
504 |
548 |
553 |
483 |
540 |
407 |
547 |
455 |
30 rows × 30 columns
%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
执行相同运算,Numpy与Pandas的对比
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
res = []
for j in i:
res.append(int(j))
x3.append(res)
for i in y2:
res = []
for j in i:
res.append(int(j))
y3.append(res)
def f(x, y):
res = []
for i in range(len(x)):
row = []
for j in range(len(y[0])):
sum_row = 0
for k in range(len(x[0])):
sum_row += x[i][k]*y[k][j]
row.append(sum_row)
res.append(row)
return res
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
一般来说,纯粹的计算在Numpy里执行的更快
Numpy更侧重于计算,Pandas更侧重于数据处理
(3)广播运算
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
|
A |
B |
C |
0 |
6 |
3 |
7 |
1 |
4 |
6 |
9 |
2 |
2 |
6 |
7 |
按行广播
x.iloc[0]
A 6
B 3
C 7
Name: 0, dtype: int32
x/x.iloc[0]
|
A |
B |
C |
0 |
1.000000 |
1.0 |
1.000000 |
1 |
0.666667 |
2.0 |
1.285714 |
2 |
0.333333 |
2.0 |
1.000000 |
按列广播
x.A
0 6
1 4
2 2
Name: A, dtype: int32
x.div(x.A, axis=0) # add sub div mul
|
A |
B |
C |
0 |
1.0 |
0.5 |
1.166667 |
1 |
1.0 |
1.5 |
2.250000 |
2 |
1.0 |
3.0 |
3.500000 |
x.div(x.iloc[0], axis=1)
|
A |
B |
C |
0 |
1.000000 |
1.0 |
1.000000 |
1 |
0.666667 |
2.0 |
1.285714 |
2 |
0.333333 |
2.0 |
1.000000 |
3、新的用法
(1)索引对齐
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
|
A |
B |
C |
0 |
7 |
5 |
1 |
1 |
4 |
0 |
9 |
2 |
5 |
8 |
0 |
pandas会自动对齐两个对象的索引,没有的值用np.nan表示
A+B
|
A |
B |
C |
0 |
10.0 |
12.0 |
NaN |
1 |
6.0 |
1.0 |
NaN |
2 |
NaN |
NaN |
NaN |
缺省值也可用fill_value来填充
A.add(B, fill_value=0)
|
A |
B |
C |
0 |
10.0 |
12.0 |
1.0 |
1 |
6.0 |
1.0 |
9.0 |
2 |
5.0 |
8.0 |
0.0 |
A*B
|
A |
B |
C |
0 |
21.0 |
35.0 |
NaN |
1 |
8.0 |
0.0 |
NaN |
2 |
NaN |
NaN |
NaN |
(2)统计相关
数据种类统计
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
用Counter方法统计数据
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
|
A |
0 |
2 |
1 |
2 |
2 |
2 |
3 |
1 |
4 |
2 |
5 |
1 |
6 |
1 |
7 |
2 |
8 |
1 |
9 |
2 |
10 |
2 |
11 |
0 |
12 |
2 |
13 |
0 |
14 |
2 |
15 |
2 |
16 |
0 |
17 |
0 |
18 |
2 |
19 |
1 |
np.unique(y1)
有value counter的方法
y1["A"].value_counts()
2 11
1 5
0 4
Name: A, dtype: int64
产生新的结果,并进行排序
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
|
population |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
|
population |
GDP |
per_GDP |
BeiJing |
2154 |
30320 |
14.076137 |
ShangHai |
2424 |
32680 |
13.481848 |
ShenZhen |
1303 |
24222 |
18.589409 |
HangZhou |
981 |
13468 |
13.728848 |
递增排序
city_info.sort_values(by="per_GDP")
|
population |
GDP |
per_GDP |
ShangHai |
2424 |
32680 |
13.481848 |
HangZhou |
981 |
13468 |
13.728848 |
BeiJing |
2154 |
30320 |
14.076137 |
ShenZhen |
1303 |
24222 |
18.589409 |
递减排序
city_info.sort_values(by="per_GDP", ascending=False)
|
population |
GDP |
per_GDP |
ShenZhen |
1303 |
24222 |
18.589409 |
BeiJing |
2154 |
30320 |
14.076137 |
HangZhou |
981 |
13468 |
13.728848 |
ShangHai |
2424 |
32680 |
13.481848 |
按轴进行排序
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
|
C |
B |
A |
D |
2 |
3 |
13 |
17 |
8 |
1 |
1 |
19 |
14 |
6 |
0 |
11 |
7 |
14 |
2 |
行排序
data.sort_index()
|
C |
B |
A |
D |
0 |
11 |
7 |
14 |
2 |
1 |
1 |
19 |
14 |
6 |
2 |
3 |
13 |
17 |
8 |
列排序
data.sort_index(axis=1)
|
A |
B |
C |
D |
2 |
17 |
13 |
3 |
8 |
1 |
14 |
19 |
1 |
6 |
0 |
14 |
7 |
11 |
2 |
data.sort_index(axis=1, ascending=False)
|
D |
C |
B |
A |
2 |
8 |
3 |
13 |
17 |
1 |
6 |
1 |
19 |
14 |
0 |
2 |
11 |
7 |
14 |
统计方法
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
13.113252 |
6.774559 |
2.874553 |
5.527044 |
2 |
-2.036341 |
-4.333177 |
5.094802 |
-0.152567 |
3 |
-3.386712 |
-1.522365 |
-2.522209 |
2.537716 |
4 |
4.328491 |
5.550994 |
5.577329 |
5.019991 |
5 |
1.171336 |
-0.493910 |
-4.032613 |
6.398588 |
非空个数
df.count()
A 6
B 6
C 6
D 6
dtype: int64
求和
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum(axis=1)
0 7.947086
1 28.289408
2 -1.427283
3 -4.893571
4 20.476806
5 3.043402
dtype: float64
最大值 最小值
df.min()
A -3.386712
B -4.333177
C -4.032613
D -0.152567
dtype: float64
df.max(axis=1)
0 6.367969
1 13.113252
2 5.094802
3 2.537716
4 5.577329
5 6.398588
dtype: float64
df
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
13.113252 |
6.774559 |
2.874553 |
5.527044 |
2 |
-2.036341 |
-4.333177 |
5.094802 |
-0.152567 |
3 |
-3.386712 |
-1.522365 |
-2.522209 |
2.537716 |
4 |
4.328491 |
5.550994 |
5.577329 |
5.019991 |
5 |
1.171336 |
-0.493910 |
-4.032613 |
6.398588 |
df.idxmax()
A 1
B 1
C 4
D 5
dtype: int64
均值
df.mean()
A 2.378704
B 1.588916
C 0.655231
D 4.283124
dtype: float64
方差
df.var()
A 34.980702
B 19.110656
C 18.948144
D 6.726776
dtype: float64
标准差
df.std()
A 5.914449
B 4.371574
C 4.352947
D 2.593603
dtype: float64
中位数
df.median()
A 1.126767
B 1.531743
C 0.176172
D 5.273518
dtype: float64
众数
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
|
A |
B |
0 |
4 |
2 |
1 |
3 |
2 |
2 |
2 |
0 |
3 |
2 |
4 |
4 |
2 |
0 |
5 |
4 |
1 |
6 |
2 |
0 |
7 |
1 |
1 |
8 |
3 |
4 |
9 |
2 |
0 |
data.mode()
75%分位数
df.quantile(0.75)
A 3.539202
B 5.052594
C 4.539740
D 6.157738
Name: 0.75, dtype: float64
用describe()可以获取所有属性
df.describe()
|
A |
B |
C |
D |
count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
2.378704 |
1.588916 |
0.655231 |
4.283124 |
std |
5.914449 |
4.371574 |
4.352947 |
2.593603 |
min |
-3.386712 |
-4.333177 |
-4.032613 |
-0.152567 |
25% |
-1.256706 |
-1.265251 |
-2.925910 |
3.158284 |
50% |
1.126767 |
1.531743 |
0.176172 |
5.273518 |
75% |
3.539202 |
5.052594 |
4.539740 |
6.157738 |
max |
13.113252 |
6.774559 |
5.577329 |
6.398588 |
data_2 = pd.DataFrame([["a", "a", "c", "d"],
["c", "a", "c", "b"],
["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
|
A |
B |
C |
D |
0 |
a |
a |
c |
d |
1 |
c |
a |
c |
b |
2 |
a |
a |
d |
c |
字符串类型的describe
data_2.describe()
|
A |
B |
C |
D |
count |
3 |
3 |
3 |
3 |
unique |
2 |
1 |
2 |
3 |
top |
a |
a |
c |
d |
freq |
2 |
3 |
2 |
1 |
相关性系数和协方差
df.corr()
|
A |
B |
C |
D |
A |
1.000000 |
0.831063 |
0.331060 |
0.510821 |
B |
0.831063 |
1.000000 |
0.179244 |
0.719112 |
C |
0.331060 |
0.179244 |
1.000000 |
-0.450365 |
D |
0.510821 |
0.719112 |
-0.450365 |
1.000000 |
df.corrwith(df["A"])
A 1.000000
B 0.831063
C 0.331060
D 0.510821
dtype: float64
自定义输出
apply(method)的用法:使用method方法默认对每一列进行相应的操作
df
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
13.113252 |
6.774559 |
2.874553 |
5.527044 |
2 |
-2.036341 |
-4.333177 |
5.094802 |
-0.152567 |
3 |
-3.386712 |
-1.522365 |
-2.522209 |
2.537716 |
4 |
4.328491 |
5.550994 |
5.577329 |
5.019991 |
5 |
1.171336 |
-0.493910 |
-4.032613 |
6.398588 |
df.apply(np.cumsum)
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
14.195450 |
10.331955 |
-0.185923 |
11.895013 |
2 |
12.159109 |
5.998778 |
4.908878 |
11.742447 |
3 |
8.772397 |
4.476413 |
2.386669 |
14.280162 |
4 |
13.100888 |
10.027406 |
7.963999 |
19.300153 |
5 |
14.272224 |
9.533497 |
3.931385 |
25.698741 |
df.apply(np.cumsum, axis=1)
|
A |
B |
C |
D |
0 |
1.082198 |
4.639594 |
1.579117 |
7.947086 |
1 |
13.113252 |
19.887811 |
22.762364 |
28.289408 |
2 |
-2.036341 |
-6.369518 |
-1.274717 |
-1.427283 |
3 |
-3.386712 |
-4.909077 |
-7.431287 |
-4.893571 |
4 |
4.328491 |
9.879485 |
15.456814 |
20.476806 |
5 |
1.171336 |
0.677427 |
-3.355186 |
3.043402 |
df.apply(sum)
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A 16.499965
B 11.107736
C 9.609942
D 6.551155
dtype: float64
def my_describe(x):
return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
|
A |
B |
C |
D |
Count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
2.378704 |
1.588916 |
0.655231 |
4.283124 |
max |
13.113252 |
6.774559 |
5.577329 |
6.398588 |
idxmin |
3.000000 |
2.000000 |
5.000000 |
2.000000 |
std |
5.914449 |
4.371574 |
4.352947 |
2.593603 |
11.4 缺失值处理
1、发现缺失值
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
|
A |
B |
C |
0 |
1 |
NaN |
2 |
1 |
NaN |
3 |
4 |
2 |
5 |
6 |
None |
注意:有None、字符串等,数据类型全部变为object,它比int和float更消耗资源
np.nan是一个特殊的浮点数,类型是浮点类型,所以表示缺失值时最好使用NaN。
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
|
A |
B |
C |
0 |
False |
True |
False |
1 |
True |
False |
False |
2 |
False |
False |
True |
data.notnull()
|
A |
B |
C |
0 |
True |
False |
True |
1 |
False |
True |
True |
2 |
True |
True |
False |
2、删除缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
3.0 |
1 |
NaN |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
NaN |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
注意:np.nan是一种特殊的浮点数
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1)删除整行
data.dropna()
|
A |
B |
C |
D |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
(2)删除整列
data.dropna(axis="columns")
|
D |
0 |
3.0 |
1 |
6.0 |
2 |
9.0 |
3 |
13.0 |
data["D"] = np.nan
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
3 |
10.0 |
11.0 |
12.0 |
NaN |
data.dropna(axis="columns", how="all")
|
A |
B |
C |
0 |
1.0 |
NaN |
2.0 |
1 |
NaN |
4.0 |
5.0 |
2 |
7.0 |
8.0 |
NaN |
3 |
10.0 |
11.0 |
12.0 |
all表示都是缺失值时才删除。
data.dropna(axis="columns", how="any")
data.loc[3] = np.nan
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
3 |
NaN |
NaN |
NaN |
NaN |
data.dropna(how="all")
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
3、填充缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
3.0 |
1 |
NaN |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
NaN |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
data.fillna(value=5)
|
A |
B |
C |
D |
0 |
1.0 |
5.0 |
2.0 |
3.0 |
1 |
5.0 |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
5.0 |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
用均值进行替换
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
|
A |
B |
C |
D |
0 |
1.0 |
7.666667 |
2.000000 |
3.0 |
1 |
6.0 |
4.000000 |
5.000000 |
6.0 |
2 |
7.0 |
8.000000 |
6.333333 |
9.0 |
3 |
10.0 |
11.000000 |
12.000000 |
13.0 |
全部数据的平均值,先进行摊平,再进行填充即可。
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
|
A |
B |
C |
D |
0 |
1.0 |
7.0 |
2.0 |
3.0 |
1 |
7.0 |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
7.0 |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
11.5 合并数据
构造一个生产DataFrame的函数
import pandas as pd
import numpy as np
def make_df(cols, ind):
"一个简单的DataFrame"
data = {c: [str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
make_df("ABC", range(3))
|
A |
B |
C |
0 |
A0 |
B0 |
C0 |
1 |
A1 |
B1 |
C1 |
2 |
A2 |
B2 |
C2 |
垂直合并
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
|
A |
B |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
3 |
A3 |
B3 |
4 |
A4 |
B4 |
水平合并
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
|
A |
B |
C |
D |
0 |
A0 |
B0 |
C0 |
D0 |
1 |
A1 |
B1 |
C1 |
D1 |
索引重叠
行重叠
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
|
A |
B |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
pd.concat([df_5, df_6],ignore_index=True)
|
A |
B |
0 |
A1 |
B1 |
1 |
A2 |
B2 |
2 |
A1 |
B1 |
3 |
A2 |
B2 |
列重叠
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
|
A |
B |
C |
B |
C |
D |
1 |
A1 |
B1 |
C1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
B2 |
C2 |
D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
|
0 |
1 |
2 |
3 |
4 |
5 |
1 |
A1 |
B1 |
C1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
B2 |
C2 |
D2 |
对齐合并merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
|
A |
B |
C |
0 |
A1 |
B1 |
C1 |
1 |
A2 |
B2 |
C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
|
A |
B |
C |
0 |
A1 |
B1 |
C1 |
1 |
A2 |
B2 |
C2 |
【例】 合并城市信息
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
|
city |
pop |
0 |
BeiJing |
2154 |
1 |
HangZhou |
981 |
2 |
ShenZhen |
1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
|
city |
GDP |
0 |
BeiJing |
30320 |
1 |
ShangHai |
32680 |
2 |
HangZhou |
13468 |
city_info = pd.merge(population, GDP)
city_info
|
city |
pop |
GDP |
0 |
BeiJing |
2154 |
30320 |
1 |
HangZhou |
981 |
13468 |
这里outer是求并集
city_info = pd.merge(population, GDP, how="outer")
city_info
|
city |
pop |
GDP |
0 |
BeiJing |
2154.0 |
30320.0 |
1 |
HangZhou |
981.0 |
13468.0 |
2 |
ShenZhen |
1303.0 |
NaN |
3 |
ShangHai |
NaN |
32680.0 |
11.6 分组和数据透视表

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
|
key |
data1 |
data2 |
0 |
A |
0 |
1 |
1 |
B |
1 |
4 |
2 |
C |
2 |
9 |
3 |
C |
3 |
9 |
4 |
B |
4 |
1 |
5 |
A |
5 |
9 |
(1)分组
延迟计算
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
这说明已经分好了,等待我们用什么样的方法进行处理后,再显示。
df.groupby("key").sum()
|
data1 |
data2 |
key |
|
|
A |
5 |
10 |
B |
5 |
6 |
C |
5 |
11 |
df.groupby("key").mean()
|
data1 |
data2 |
key |
|
|
A |
2.5 |
5.0 |
B |
2.5 |
3.0 |
C |
2.5 |
5.5 |
可以打印看看这是什么东西:
for i in df.groupby("key"):
print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
按列取值
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
按组迭代
for data, group in df.groupby("key"):
print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
调用方法
df.groupby("key")["data1"].describe()
|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
key |
|
|
|
|
|
|
|
|
A |
2.0 |
2.5 |
3.535534 |
0.0 |
1.25 |
2.5 |
3.75 |
5.0 |
B |
2.0 |
2.5 |
2.121320 |
1.0 |
1.75 |
2.5 |
3.25 |
4.0 |
C |
2.0 |
2.5 |
0.707107 |
2.0 |
2.25 |
2.5 |
2.75 |
3.0 |
支持更复杂的操作
df.groupby("key").aggregate(["min", "median", "max"])
|
data1 |
data2 |
|
min |
median |
max |
min |
median |
max |
key |
|
|
|
|
|
|
A |
0 |
2.5 |
5 |
2 |
5.0 |
8 |
B |
1 |
2.5 |
4 |
2 |
3.0 |
4 |
C |
2 |
2.5 |
3 |
3 |
5.5 |
8 |
过滤
def filter_func(x):
return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
5 |
A |
5 |
8 |
转换
df
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
1 |
B |
1 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
4 |
B |
4 |
4 |
5 |
A |
5 |
8 |
df.groupby("key").transform(lambda x: x-x.mean())
|
data1 |
data2 |
0 |
-2.5 |
-3.0 |
1 |
-1.5 |
-1.0 |
2 |
-0.5 |
2.5 |
3 |
0.5 |
-2.5 |
4 |
1.5 |
1.0 |
5 |
2.5 |
3.0 |
df
|
key |
data1 |
data2 |
0 |
A |
0 |
1 |
1 |
B |
1 |
4 |
2 |
C |
2 |
9 |
3 |
C |
3 |
9 |
4 |
B |
4 |
1 |
5 |
A |
5 |
9 |
df.groupby("key").apply(lambda x: x-x.mean())
|
data1 |
data2 |
0 |
-2.5 |
-4.0 |
1 |
-1.5 |
1.5 |
2 |
-0.5 |
0.0 |
3 |
0.5 |
0.0 |
4 |
1.5 |
-1.5 |
5 |
2.5 |
4.0 |
apply()方法
df
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
1 |
B |
1 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
4 |
B |
4 |
4 |
5 |
A |
5 |
8 |
def norm_by_data2(x):
x["data1"] /= x["data2"].sum()
return x
df.groupby("key").apply(norm_by_data2)
|
key |
data1 |
data2 |
0 |
A |
0.000000 |
2 |
1 |
B |
0.166667 |
2 |
2 |
C |
0.181818 |
8 |
3 |
C |
0.272727 |
3 |
4 |
B |
0.666667 |
4 |
5 |
A |
0.500000 |
8 |
将列表、数组设为分组键
这里的L相当于一个新的标签替代原来的行标签。
L = [0, 1, 0, 1, 2, 0]
df
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
1 |
B |
1 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
4 |
B |
4 |
4 |
5 |
A |
5 |
8 |
df.groupby(L).sum()
|
data1 |
data2 |
0 |
7 |
18 |
1 |
4 |
5 |
2 |
4 |
4 |
用字典将索引映射到分组
df2 = df.set_index("key")
df2
|
data1 |
data2 |
key |
|
|
A |
0 |
2 |
B |
1 |
2 |
C |
2 |
8 |
C |
3 |
3 |
B |
4 |
4 |
A |
5 |
8 |
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
|
data1 |
data2 |
constant |
10 |
17 |
first |
5 |
10 |
任意Python函数
df2.groupby(str.lower).mean()
|
data1 |
data2 |
a |
2.5 |
5.0 |
b |
2.5 |
3.0 |
c |
2.5 |
5.5 |
多个有效值组成的列表
只有这两个数都相等,才会分到同一个组。
df2.groupby([str.lower, mapping]).mean()
|
|
data1 |
data2 |
a |
first |
2.5 |
5.0 |
b |
constant |
2.5 |
3.0 |
c |
constant |
2.5 |
5.5 |
【例1】 行星观测数据处理
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
|
method |
number |
orbital_period |
mass |
distance |
year |
0 |
Radial Velocity |
1 |
269.300 |
7.10 |
77.40 |
2006 |
1 |
Radial Velocity |
1 |
874.774 |
2.21 |
56.95 |
2008 |
2 |
Radial Velocity |
1 |
763.000 |
2.60 |
19.84 |
2011 |
3 |
Radial Velocity |
1 |
326.030 |
19.40 |
110.62 |
2007 |
4 |
Radial Velocity |
1 |
516.220 |
10.50 |
119.47 |
2009 |
planets.describe()
|
number |
orbital_period |
mass |
distance |
year |
count |
1035.000000 |
992.000000 |
513.000000 |
808.000000 |
1035.000000 |
mean |
1.785507 |
2002.917596 |
2.638161 |
264.069282 |
2009.070531 |
std |
1.240976 |
26014.728304 |
3.818617 |
733.116493 |
3.972567 |
min |
1.000000 |
0.090706 |
0.003600 |
1.350000 |
1989.000000 |
25% |
1.000000 |
5.442540 |
0.229000 |
32.560000 |
2007.000000 |
50% |
1.000000 |
39.979500 |
1.260000 |
55.250000 |
2010.000000 |
75% |
2.000000 |
526.005000 |
3.040000 |
178.500000 |
2012.000000 |
max |
7.000000 |
730000.000000 |
25.000000 |
8500.000000 |
2014.000000 |
planets.head()
|
method |
number |
orbital_period |
mass |
distance |
year |
0 |
Radial Velocity |
1 |
269.300 |
7.10 |
77.40 |
2006 |
1 |
Radial Velocity |
1 |
874.774 |
2.21 |
56.95 |
2008 |
2 |
Radial Velocity |
1 |
763.000 |
2.60 |
19.84 |
2011 |
3 |
Radial Velocity |
1 |
326.030 |
19.40 |
110.62 |
2007 |
4 |
Radial Velocity |
1 |
516.220 |
10.50 |
119.47 |
2009 |
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
planets.head()
|
method |
number |
orbital_period |
mass |
distance |
year |
0 |
Radial Velocity |
1 |
269.300 |
7.10 |
77.40 |
2006 |
1 |
Radial Velocity |
1 |
874.774 |
2.21 |
56.95 |
2008 |
2 |
Radial Velocity |
1 |
763.000 |
2.60 |
19.84 |
2011 |
3 |
Radial Velocity |
1 |
326.030 |
19.40 |
110.62 |
2007 |
4 |
Radial Velocity |
1 |
516.220 |
10.50 |
119.47 |
2009 |
planets.groupby(["method", decade]).sum()
|
|
number |
orbital_period |
mass |
distance |
year |
method |
decade |
|
|
|
|
|
Astrometry |
2010s |
2 |
1.262360e+03 |
0.00000 |
35.75 |
4023 |
Eclipse Timing Variations |
2000s |
5 |
1.930800e+04 |
6.05000 |
261.44 |
6025 |
2010s |
10 |
2.345680e+04 |
4.20000 |
1000.00 |
12065 |
Imaging |
2000s |
29 |
1.350935e+06 |
0.00000 |
956.83 |
40139 |
2010s |
21 |
6.803750e+04 |
0.00000 |
1210.08 |
36208 |
Microlensing |
2000s |
12 |
1.732500e+04 |
0.00000 |
0.00 |
20070 |
2010s |
15 |
4.750000e+03 |
0.00000 |
41440.00 |
26155 |
Orbital Brightness Modulation |
2010s |
5 |
2.127920e+00 |
0.00000 |
2360.00 |
6035 |
Pulsar Timing |
1990s |
9 |
1.900153e+02 |
0.00000 |
0.00 |
5978 |
2000s |
1 |
3.652500e+04 |
0.00000 |
0.00 |
2003 |
2010s |
1 |
9.070629e-02 |
0.00000 |
1200.00 |
2011 |
Pulsation Timing Variations |
2000s |
1 |
1.170000e+03 |
0.00000 |
0.00 |
2007 |
Radial Velocity |
1980s |
1 |
8.388800e+01 |
11.68000 |
40.57 |
1989 |
1990s |
52 |
1.091561e+04 |
68.17820 |
723.71 |
55943 |
2000s |
475 |
2.633526e+05 |
945.31928 |
15201.16 |
619775 |
2010s |
424 |
1.809630e+05 |
316.47890 |
11382.67 |
432451 |
Transit |
2000s |
64 |
2.897102e+02 |
0.00000 |
31823.31 |
124462 |
2010s |
712 |
8.087813e+03 |
1.47000 |
102419.46 |
673999 |
Transit Timing Variations |
2010s |
9 |
2.393505e+02 |
0.00000 |
3313.00 |
8050 |
这里使用两个中括号[[]],取出来是DF类型的数据,而一个中括号[]取出来是Serios的数据,前者更美观一点。
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
|
number |
decade |
1980s |
1990s |
2000s |
2010s |
method |
|
|
|
|
Astrometry |
0.0 |
0.0 |
0.0 |
2.0 |
Eclipse Timing Variations |
0.0 |
0.0 |
5.0 |
10.0 |
Imaging |
0.0 |
0.0 |
29.0 |
21.0 |
Microlensing |
0.0 |
0.0 |
12.0 |
15.0 |
Orbital Brightness Modulation |
0.0 |
0.0 |
0.0 |
5.0 |
Pulsar Timing |
0.0 |
9.0 |
1.0 |
1.0 |
Pulsation Timing Variations |
0.0 |
0.0 |
1.0 |
0.0 |
Radial Velocity |
1.0 |
52.0 |
475.0 |
424.0 |
Transit |
0.0 |
0.0 |
64.0 |
712.0 |
Transit Timing Variations |
0.0 |
0.0 |
0.0 |
9.0 |
(2)数据透视表
【例2】泰坦尼克号乘客数据分析
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
|
survived |
pclass |
sex |
age |
sibsp |
parch |
fare |
embarked |
class |
who |
adult_male |
deck |
embark_town |
alive |
alone |
0 |
0 |
3 |
male |
22.0 |
1 |
0 |
7.2500 |
S |
Third |
man |
True |
NaN |
Southampton |
no |
False |
1 |
1 |
1 |
female |
38.0 |
1 |
0 |
71.2833 |
C |
First |
woman |
False |
C |
Cherbourg |
yes |
False |
2 |
1 |
3 |
female |
26.0 |
0 |
0 |
7.9250 |
S |
Third |
woman |
False |
NaN |
Southampton |
yes |
True |
3 |
1 |
1 |
female |
35.0 |
1 |
0 |
53.1000 |
S |
First |
woman |
False |
C |
Southampton |
yes |
False |
4 |
0 |
3 |
male |
35.0 |
0 |
0 |
8.0500 |
S |
Third |
man |
True |
NaN |
Southampton |
no |
True |
T = titanic[titanic.age.notnull()].copy()
T.age.apply(lambda x: 60 if x>=60 else x)
T.age.value_counts()
24.00 30
22.00 27
60.00 26
18.00 26
28.00 25
30.00 25
19.00 25
21.00 24
25.00 23
36.00 22
29.00 20
35.00 18
32.00 18
27.00 18
26.00 18
31.00 17
16.00 17
34.00 15
20.00 15
33.00 15
23.00 15
39.00 14
40.00 13
17.00 13
42.00 13
45.00 12
38.00 11
4.00 10
50.00 10
2.00 10
..
8.00 4
5.00 4
11.00 4
6.00 3
7.00 3
46.00 3
30.50 2
57.00 2
0.83 2
55.00 2
10.00 2
59.00 2
13.00 2
28.50 2
40.50 2
45.50 2
0.75 2
32.50 2
34.50 1
55.50 1
0.92 1
36.50 1
12.00 1
53.00 1
14.50 1
0.67 1
20.50 1
23.50 1
24.50 1
0.42 1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20 220
30 167
10 102
40 89
0 62
50 48
60 26
Name: age, dtype: int64
Age.astype(str)+"s"
0 20s
1 30s
2 20s
3 30s
4 30s
6 50s
7 0s
8 20s
9 10s
10 0s
11 50s
12 20s
13 30s
14 10s
15 50s
16 0s
18 30s
20 30s
21 30s
22 10s
23 20s
24 0s
25 30s
27 10s
30 40s
33 60s
34 20s
35 40s
37 20s
38 10s
...
856 40s
857 50s
858 20s
860 40s
861 20s
862 40s
864 20s
865 40s
866 20s
867 30s
869 0s
870 20s
871 40s
872 30s
873 40s
874 20s
875 10s
876 20s
877 10s
879 50s
880 20s
881 30s
882 20s
883 20s
884 20s
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
age |
0 |
10 |
20 |
30 |
40 |
50 |
60 |
sex |
|
|
|
|
|
|
|
female |
0.633333 |
0.755556 |
0.722222 |
0.833333 |
0.687500 |
0.888889 |
1.000000 |
male |
0.593750 |
0.122807 |
0.168919 |
0.214953 |
0.210526 |
0.133333 |
0.136364 |
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age |
0 |
10 |
20 |
30 |
40 |
50 |
60 |
sex |
|
|
|
|
|
|
|
female |
0.633333 |
0.755556 |
0.722222 |
0.833333 |
0.687500 |
0.888889 |
1.000000 |
male |
0.593750 |
0.122807 |
0.168919 |
0.214953 |
0.210526 |
0.133333 |
0.136364 |
titanic.describe()
|
survived |
pclass |
age |
sibsp |
parch |
fare |
count |
891.000000 |
891.000000 |
714.000000 |
891.000000 |
891.000000 |
891.000000 |
mean |
0.383838 |
2.308642 |
29.699118 |
0.523008 |
0.381594 |
32.204208 |
std |
0.486592 |
0.836071 |
14.526497 |
1.102743 |
0.806057 |
49.693429 |
min |
0.000000 |
1.000000 |
0.420000 |
0.000000 |
0.000000 |
0.000000 |
25% |
0.000000 |
2.000000 |
20.125000 |
0.000000 |
0.000000 |
7.910400 |
50% |
0.000000 |
3.000000 |
28.000000 |
0.000000 |
0.000000 |
14.454200 |
75% |
1.000000 |
3.000000 |
38.000000 |
1.000000 |
0.000000 |
31.000000 |
max |
1.000000 |
3.000000 |
80.000000 |
8.000000 |
6.000000 |
512.329200 |
titanic.groupby("sex")[["survived"]].mean()
|
survived |
sex |
|
female |
0.742038 |
male |
0.188908 |
titanic.groupby("sex")["survived"].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class |
First |
Second |
Third |
sex |
|
|
|
female |
0.968085 |
0.921053 |
0.500000 |
male |
0.368852 |
0.157407 |
0.135447 |
数据透视表:用更直观的方式实现上面的功能。
titanic.pivot_table("survived", index="sex", columns="class") # 默认返回平均值
class |
First |
Second |
Third |
sex |
|
|
|
female |
0.968085 |
0.921053 |
0.500000 |
male |
0.368852 |
0.157407 |
0.135447 |
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True) # aggfunc="mean"即为默认值 margins=True 会加一个总的列和总的行。
class |
First |
Second |
Third |
All |
sex |
|
|
|
|
female |
0.968085 |
0.921053 |
0.500000 |
0.742038 |
male |
0.368852 |
0.157407 |
0.135447 |
0.188908 |
All |
0.629630 |
0.472826 |
0.242363 |
0.383838 |
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"}) # 要处理的那一列和要处理的方法组成一个键值对。
|
fare |
survived |
class |
First |
Second |
Third |
First |
Second |
Third |
sex |
|
|
|
|
|
|
female |
106.125798 |
21.970121 |
16.118810 |
91 |
70 |
72 |
male |
67.226127 |
19.741782 |
12.661633 |
45 |
17 |
47 |
11.7 其他
(1)向量化字符串操作
(2) 处理时间序列
(3) 多级索引:用于多维数据
base_data = np.array([[1771, 11115 ],
[2154, 30320],
[2141, 14070],
[2424, 32680],
[1077, 7806],
[1303, 24222],
[798, 4789],
[981, 13468]])
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
, [2008, 2018]*4], columns=["population", "GDP"])
data
|
|
population |
GDP |
BeiJing |
2008 |
1771 |
11115 |
2018 |
2154 |
30320 |
ShangHai |
2008 |
2141 |
14070 |
2018 |
2424 |
32680 |
ShenZhen |
2008 |
1077 |
7806 |
2018 |
1303 |
24222 |
HangZhou |
2008 |
798 |
4789 |
2018 |
981 |
13468 |
data.index.names = ["city", "year"]
data
|
|
population |
GDP |
city |
year |
|
|
BeiJing |
2008 |
1771 |
11115 |
2018 |
2154 |
30320 |
ShangHai |
2008 |
2141 |
14070 |
2018 |
2424 |
32680 |
ShenZhen |
2008 |
1077 |
7806 |
2018 |
1303 |
24222 |
HangZhou |
2008 |
798 |
4789 |
2018 |
981 |
13468 |
data["GDP"]
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
(4) 高性能的Pandas:eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
减少了复合代数式计算中间过程的内存分配
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
实现列间运算
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
|
A |
B |
C |
0 |
0.418071 |
0.381836 |
0.500556 |
1 |
0.059432 |
0.749066 |
0.302429 |
2 |
0.489147 |
0.739153 |
0.777161 |
3 |
0.175441 |
0.016556 |
0.348979 |
4 |
0.766534 |
0.559252 |
0.310635 |
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
|
A |
B |
C |
D |
0 |
0.418071 |
0.381836 |
0.500556 |
-1.601593 |
1 |
0.059432 |
0.749066 |
0.302429 |
-1.159019 |
2 |
0.489147 |
0.739153 |
0.777161 |
-5.512052 |
3 |
0.175441 |
0.016556 |
0.348979 |
-0.294917 |
4 |
0.766534 |
0.559252 |
0.310635 |
-1.923199 |
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
|
A |
B |
C |
D |
0 |
0.418071 |
0.381836 |
0.500556 |
-1.601593 |
1 |
0.059432 |
0.749066 |
0.302429 |
-1.159019 |
2 |
0.489147 |
0.739153 |
0.777161 |
-5.512052 |
3 |
0.175441 |
0.016556 |
0.348979 |
-0.294917 |
4 |
0.766534 |
0.559252 |
0.310635 |
-1.923199 |
使用局部变量
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0 0.342788
1 0.047409
2 -0.387501
3 0.236956
4 0.694839
dtype: float64
(4) 高性能的Pandas:query()
df.head()
|
A |
B |
C |
D |
0 |
0.418071 |
0.381836 |
0.500556 |
-1.601593 |
1 |
0.059432 |
0.749066 |
0.302429 |
-1.159019 |
2 |
0.489147 |
0.739153 |
0.777161 |
-5.512052 |
3 |
0.175441 |
0.016556 |
0.348979 |
-0.294917 |
4 |
0.766534 |
0.559252 |
0.310635 |
-1.923199 |
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
|
A |
B |
C |
D |
1 |
0.059432 |
0.749066 |
0.302429 |
-1.159019 |
2 |
0.489147 |
0.739153 |
0.777161 |
-5.512052 |
7 |
0.073950 |
0.730144 |
0.646190 |
-2.272672 |
10 |
0.393200 |
0.610467 |
0.697096 |
-3.313485 |
11 |
0.065734 |
0.764699 |
0.179380 |
-1.011958 |
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
(5)eval()和query()的使用时机
小数组时,普通方法反而更快
df.values.nbytes
32000
df1.values.nbytes
8000000