Pandas
In computer programming, pandas is a Python software library for data manipulation and analysis. It is built on top of NumPy and provides data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license[2]. Its name derives from the term "panel data", an econometrics term for data sets that contain observations over multiple time periods for the same individuals[3]. The name is also a play on the phrase "Python data analysis"[4].

History

In 2008, original author Wes McKinney began building pandas at AQR Capital Management to meet the need for a high-performance, flexible tool for quantitative analysis of financial data. Before leaving AQR in 2009, he convinced management to let him release the library as open source[5].
Data model

The central data structures of pandas are the one-dimensional Series, a labeled array, and the two-dimensional DataFrame, which is analogous to the data.frame of the R language.

Main features

pandas provides a fast and efficient DataFrame object for data manipulation with integrated indexing.
pandas is highly optimized for performance, with critical code paths written in Cython or C. pandas can use PyArrow to extend its functionality and improve the performance of various APIs[15]. The default plotting backend of pandas is matplotlib, and third-party plotting backends can be plugged in[16], such as Plotly Express[17]. DuckDB, an in-process SQL OLAP columnar database, can execute SQL directly on pandas DataFrames[18].

Examples

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> data = {
... 'variable': ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
... 'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'] * 4),
... 'value': [x + 0.1 for x in range(12)]
... }
>>>
>>> df = pd.DataFrame(data)
>>> type(df['value']) == pd.Series
True
>>>
>>> df['value1'] = df['value'] + 0.1
>>> df
variable date value value1
0 A 2023-01-01 0.1 0.2
1 A 2023-01-02 1.1 1.2
2 A 2023-01-03 2.1 2.2
3 B 2023-01-01 3.1 3.2
4 B 2023-01-02 4.1 4.2
5 B 2023-01-03 5.1 5.2
6 C 2023-01-01 6.1 6.2
7 C 2023-01-02 7.1 7.2
8 C 2023-01-03 8.1 8.2
9 D 2023-01-01 9.1 9.2
10 D 2023-01-02 10.1 10.2
11 D 2023-01-03 11.1 11.2
>>>
>>> df.index
RangeIndex(start=0, stop=12, step=1)
>>>
>>> df.columns
Index(['variable', 'date', 'value', 'value1'], dtype='object')
>>>
>>> df.loc[[1, 2], ['value', 'value1']]
value value1
1 1.1 1.2
2 2.1 2.2
>>>
>>> [df.columns.get_loc(x) for x in ['value', 'value1']]
[2, 3]
>>>
>>> df.iloc[[1, 2], [2, 3]]
value value1
1 1.1 1.2
2 2.1 2.2
>>>
>>> df[(df['value']/2 > 1) & (df['value1'] < 3)]
variable date value value1
2 A 2023-01-03 2.1 2.2
>>>
>>> df.query('value/2 > 1 & value1 < 3')
variable date value value1
2 A 2023-01-03 2.1 2.2
>>>
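Beyond chained comparisons, Series.isin and Series.between cover further common selection patterns. A minimal self-contained sketch (the DataFrame is rebuilt here rather than reusing the session above):

```python
import pandas as pd

df = pd.DataFrame({
    'variable': ['A'] * 3 + ['B'] * 3,
    'value': [0.1, 1.1, 2.1, 3.1, 4.1, 5.1],
})

# Rows whose 'variable' is one of a set of labels.
sub1 = df[df['variable'].isin(['A'])]

# Rows whose 'value' lies in the closed interval [1, 4].
sub2 = df[df['value'].between(1, 4)]

print(len(sub1), len(sub2))  # 3 3
```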
Data in a DataFrame is often stored in one of two formats: stacked format or record format. In stacked format there are, where applicable, multiple rows per subject, so it is also called "long" format. In record format there is typically one row per subject, so it is also called "wide" format. In this example, to get one column per unique value of variable, the DataFrame can be pivoted:

>>> df.drop([0, 4, 8]).pivot(index='date', columns='variable')
value value1
variable A B C D A B C D
date
2023-01-01 NaN 3.1 6.1 9.1 NaN 3.2 6.2 9.2
2023-01-02 1.1 NaN 7.1 10.1 1.2 NaN 7.2 10.2
2023-01-03 2.1 5.1 NaN 11.1 2.2 5.2 NaN 11.2
>>>
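The inverse transform, from wide (record) format back to stacked (long) format, is melt(). A minimal self-contained sketch:

```python
import pandas as pd

wide = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'A': [0.1, 1.1],
    'B': [3.1, 4.1],
})

# One row per (date, variable) pair after melting.
long = wide.melt(id_vars='date', var_name='variable', value_name='value')
print(long.shape)  # (4, 3)
```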
Here pivot() is given the index and columns parameters; rows 0, 4 and 8 were dropped beforehand, which shows up as NaN in the pivoted result. Using concat() DataFrames can be concatenated:

>>> df1 = df.drop(columns='value').rename(columns={'value1': 'value'})
>>> df1 = pd.concat([df.drop(columns='value1'), df1], ignore_index=True)
>>> df1.shape
(24, 3)
>>>
>>> data1 = [
... ('A', pd.Timestamp('2023-01-01'), 0.3),
... ('A', pd.Timestamp('2023-01-02'), 1.3)
... ]
>>>
>>> rows = pd.DataFrame(data1, columns=['variable', 'date', 'value'])
>>> pd.concat([df1, rows], ignore_index=True).tail(3)
variable date value
23 D 2023-01-03 11.2
24 A 2023-01-01 0.3
25 A 2023-01-02 1.3
>>>
>>> right = pd.DataFrame(data1[0:1], columns=['variable', 'date', 'value1'])
>>> pd.merge(df1, right, on=['variable', 'date'], how='inner')
variable date value value1
0 A 2023-01-01 0.1 0.3
1 A 2023-01-01 0.2 0.3
>>>
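Besides how='inner', merge supports outer joins, and indicator=True records the provenance of each row. A minimal self-contained sketch:

```python
import pandas as pd

left = pd.DataFrame({'variable': ['A', 'B'], 'value': [0.1, 3.1]})
right = pd.DataFrame({'variable': ['A', 'C'], 'value1': [0.3, 6.3]})

# An outer join keeps unmatched rows from both sides; indicator=True
# adds a '_merge' column recording where each row came from.
out = pd.merge(left, right, on='variable', how='outer', indicator=True)
print(sorted(out['_merge'].astype(str)))  # ['both', 'left_only', 'right_only']
```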
Using groupby() rows can be grouped and aggregated:

>>> df2 = df1.groupby(['date', 'variable']).agg({'value': 'sum'})
>>> df2
value
date variable
2023-01-01 A 0.3
B 6.3
C 12.3
D 18.3
2023-01-02 A 2.3
B 8.3
C 14.3
D 20.3
2023-01-03 A 4.3
B 10.3
C 16.3
D 22.3
>>>
>>> df2.shape
(12, 1)
>>>
>>> df2.index
MultiIndex([('2023-01-01', 'A'),
('2023-01-01', 'B'),
('2023-01-01', 'C'),
('2023-01-01', 'D'),
('2023-01-02', 'A'),
('2023-01-02', 'B'),
('2023-01-02', 'C'),
('2023-01-02', 'D'),
('2023-01-03', 'A'),
('2023-01-03', 'B'),
('2023-01-03', 'C'),
('2023-01-03', 'D')],
names=['date', 'variable'])
>>>
>>> df2.columns
Index(['value'], dtype='object')
>>>
>>> df2.loc[('2023-01-02', 'A')]
value 2.3
Name: (2023-01-02 00:00:00, A), dtype: float64
>>>
>>> df2.loc['2023-01-02']
value
variable
A 2.3
B 8.3
C 14.3
D 20.3
>>>
>>> df2.xs('A', level='variable')
value
date
2023-01-01 0.3
2023-01-02 2.3
2023-01-03 4.3
>>>
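A MultiIndex produced by groupby can be reshaped into a wide table with unstack(), which moves an index level into the columns. A minimal self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'variable': ['A', 'B', 'A', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

g = df.groupby(['date', 'variable']).agg({'value': 'sum'})

# Move the inner index level 'variable' into the columns.
wide = g.unstack('variable')
print(wide.shape)  # (2, 2)
```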
Using pivot_table() the grouping and pivoting can be done in one step:

>>> df3 = df1.pivot_table(index='date', columns='variable', aggfunc='sum')
>>> df3
value
variable A B C D
date
2023-01-01 0.3 6.3 12.3 18.3
2023-01-02 2.3 8.3 14.3 20.3
2023-01-03 4.3 10.3 16.3 22.3
>>>
>>> df3.shape
(3, 4)
>>>
>>> df3.to_numpy()
array([[ 0.3, 6.3, 12.3, 18.3],
[ 2.3, 8.3, 14.3, 20.3],
[ 4.3, 10.3, 16.3, 22.3]])
>>>
>>> df3.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'], dtype='datetime64[ns]', name='date', freq=None)
>>>
>>> df3.columns
MultiIndex([('value', 'A'),
('value', 'B'),
('value', 'C'),
('value', 'D')],
names=[None, 'variable'])
>>>
>>> df1.pivot_table(index='date', columns='variable', values='value', aggfunc='sum').columns
Index(['A', 'B', 'C', 'D'], dtype='object', name='variable')
>>>
>>> df3['value'].columns
Index(['A', 'B', 'C', 'D'], dtype='object', name='variable')
>>>
>>> df3[('value', 'A')]
date
2023-01-01 0.3
2023-01-02 2.3
2023-01-03 4.3
Name: (value, A), dtype: float64
>>>
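A cross-section over MultiIndex columns can also be taken with xs along axis=1, mirroring the xs call on the row index earlier. A minimal self-contained sketch:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('value', 'A'), ('value', 'B')], names=[None, 'variable'])
df = pd.DataFrame([[0.3, 6.3], [2.3, 8.3]], columns=cols)

# Take the cross-section for variable 'A' across the column axis;
# the selected level is dropped from the result's columns.
s = df.xs('A', level='variable', axis=1)
print(s.shape)  # (2, 1)
```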
Plotting a bar chart of the DataFrame with matplotlib:

>>> ax = df3.plot.bar()
>>> h, l = ax.get_legend_handles_labels()
>>> ax.legend(h, df3.columns.get_level_values(1), title=None, loc='upper left')
<matplotlib.legend.Legend object at 0x7fdd1cff96d0>
>>> ax.set_xticklabels([x.strftime('%Y-%m-%d') for x in df3.index], rotation=0)
[Text(0, 0, '2023-01-01'), Text(1, 0, '2023-01-02'), Text(2, 0, '2023-01-03')]
>>> ax.get_xaxis().get_label().set_visible(False)
>>> ax.grid(axis='y', linestyle=':')
>>> ax.set_axisbelow(True)
>>> for i, m in enumerate(ax.containers):
... for j, n in enumerate(m.get_children()):
... n.set_x(j - 0.8*(0.5 - i/df3.columns.size))
... n.set_width(0.8/df3.columns.size)
... ax.bar_label(m, fontsize='small')
...
[Text(0, 0, '0.3'), Text(0, 0, '2.3'), Text(0, 0, '4.3')]
[Text(0, 0, '6.3'), Text(0, 0, '8.3'), Text(0, 0, '10.3')]
[Text(0, 0, '12.3'), Text(0, 0, '14.3'), Text(0, 0, '16.3')]
[Text(0, 0, '18.3'), Text(0, 0, '20.3'), Text(0, 0, '22.3')]
>>> plt.show()
>>>
Exporting and importing a CSV file:

>>> df3.to_csv('dftest.csv', float_format='%.1f')
>>>
>>> df4 = pd.read_csv('dftest.csv', header=[0, 1], index_col=0)
>>> df4.shape
(3, 4)
Here the header (header=[0, 1]) spans the first two rows of the file, matching the two column levels. Viewing the CSV file with the column utility from util-linux:

$ cat dftest.csv | column -s, -o, -t
,value,value,value,value
variable ,A ,B ,C ,D
date , , , ,
2023-01-01,0.3 ,6.3 ,12.3 ,18.3
2023-01-02,2.3 ,8.3 ,14.3 ,20.3
2023-01-03,4.3 ,10.3 ,16.3 ,22.3
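The same two-level header layout can be round-tripped in memory with io.StringIO; the snippet below is a minimal sketch using a smaller table of the same shape as the file above (two header rows plus the index-name row):

```python
import io
import pandas as pd

csv = (
    ",value,value\n"
    "variable,A,B\n"
    "date,,\n"
    "2023-01-01,0.3,6.3\n"
    "2023-01-02,2.3,8.3\n"
)

# header=[0, 1] rebuilds the two-level columns; the 'date,,' row
# is recognized as the index-name row.
df = pd.read_csv(io.StringIO(csv), header=[0, 1], index_col=0)
df.index = pd.to_datetime(df.index)  # the index comes back as strings
print(df.shape)  # (2, 2)
```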
Exporting and importing a JSON file:

>>> df3.to_json('dftest.json', orient='index', date_format='iso', date_unit='s')
>>>
>>> df4 = pd.read_json('dftest.json', orient='index')
>>> df4.shape
(3, 4)
>>>
>>> df4.columns
Index(['('value', 'A')', '('value', 'B')', '('value', 'C')', '('value', 'D')'], dtype='object')
>>>
>>> df4.columns = pd.MultiIndex.from_tuples([eval(x) for x in df4.columns])
>>> df4.index.name = df3.index.name
>>> df4.columns.name = df3.columns.name
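The stringified column tuples can also be parsed with ast.literal_eval, which accepts only Python literals and so avoids running eval on file contents. A minimal sketch of the same reconstruction:

```python
import ast
import pandas as pd

cols = ["('value', 'A')", "('value', 'B')"]

# literal_eval safely parses each string back into a tuple.
mi = pd.MultiIndex.from_tuples([ast.literal_eval(c) for c in cols])
print(mi[0])  # ('value', 'A')
```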
Here the orientation (orient='index') is specified, so each index entry becomes a top-level JSON key; the MultiIndex columns are flattened into stringified tuples, which the code above turns back into a MultiIndex. Viewing the JSON file with jq:

$ cat dftest.json | jq
{
"2023-01-01T00:00:00": {
"('value', 'A')": 0.3,
"('value', 'B')": 6.3,
"('value', 'C')": 12.3,
"('value', 'D')": 18.3
},
"2023-01-02T00:00:00": {
"('value', 'A')": 2.3,
"('value', 'B')": 8.3,
"('value', 'C')": 14.3,
"('value', 'D')": 20.3
},
"2023-01-03T00:00:00": {
"('value', 'A')": 4.3,
"('value', 'B')": 10.3,
"('value', 'C')": 16.3,
"('value', 'D')": 22.3
}
}
Exporting and importing an HDF5 file:

>>> df3.to_hdf('dftest.h5', key='df3', mode='w')
>>>
>>> df4 = pd.read_hdf('dftest.h5', key='df3')
>>> df4.shape
(3, 4)
Here the DataFrame is stored and retrieved under a key ('df3'), since a single HDF5 file can hold multiple objects. Viewing the HDF5 file with h5ls:

$ h5ls -r -d dftest.h5
/ Group
/df3 Group
/df3/axis0_label0 Dataset {4}
Data:
0, 0, 0, 0
/df3/axis0_label1 Dataset {4}
Data:
0, 1, 2, 3
/df3/axis0_level0 Dataset {1}
Data:
"value"
/df3/axis0_level1 Dataset {4}
Data:
"A", "B", "C", "D"
/df3/axis1 Dataset {3}
Data:
1672531200000000000, 1672617600000000000, 1672704000000000000
/df3/block0_items_label0 Dataset {4}
Data:
0, 0, 0, 0
/df3/block0_items_label1 Dataset {4}
Data:
0, 1, 2, 3
/df3/block0_items_level0 Dataset {1}
Data:
"value"
/df3/block0_items_level1 Dataset {4}
Data:
"A", "B", "C", "D"
/df3/block0_values Dataset {3, 4}
Data:
0.3, 6.3, 12.3, 18.3, 2.3, 8.3, 14.3, 20.3, 4.3, 10.3, 16.3, 22.3
The datetimes in this HDF5 file are represented as UNIX epoch timestamps in nanoseconds. This storage format preserves the DataFrame's two axes[20] and all of its blocks[21]. Since there is only one block here, that block's values dataset (block0_values) contains all of the numbers. An HDF5 file can also be saved in table format:

>>> df3.to_hdf('dftest.h5', key='df4', format='table', mode='a')
>>>
>>> pd.read_hdf('dftest.h5', key='df4', where='index > 20230101', columns=[('value', 'A'), ('value', 'C')])
value
variable A C
date
2023-01-02 2.3 14.3
2023-01-03 4.3 16.3
Here the file open mode (mode='a') appends to the existing file instead of overwriting it. Viewing the modified HDF5 file:

$ h5ls dftest.h5
df3 Group
df4 Group
$ h5ls -r dftest.h5/df4
/_i_table Group
/_i_table/index Group
/_i_table/index/abounds Dataset {0/Inf}
/_i_table/index/bounds Dataset {0/Inf, 127}
/_i_table/index/indices Dataset {0/Inf, 131072}
/_i_table/index/indicesLR Dataset {131072}
/_i_table/index/mbounds Dataset {0/Inf}
/_i_table/index/mranges Dataset {0/Inf}
/_i_table/index/ranges Dataset {0/Inf, 2}
/_i_table/index/sorted Dataset {0/Inf, 131072}
/_i_table/index/sortedLR Dataset {131201}
/_i_table/index/zbounds Dataset {0/Inf}
/table Dataset {3/Inf}
$ h5ls -d dftest.h5/df4/table
table Dataset {3/Inf}
Data:
{1672531200000000000, [0.3,6.3,12.3,18.3]},
{1672617600000000000, [2.3,8.3,14.3,20.3]},
{1672704000000000000, [4.3,10.3,16.3,22.3]}
Here the table dataset stores each row as a compound record of the timestamp and the four values. Exporting and importing a netCDF file can be done via xarray:

>>> df3.stack().shape
(12, 1)
>>>
>>> import xarray as xr
>>> df3.stack().to_xarray().to_netcdf('dftest.nc')
>>>
>>> df4 = xr.open_dataset('dftest.nc').to_dataframe().unstack()
>>> df4.shape
(3, 4)
Viewing the netCDF file with ncdump:

$ ncdump dftest.nc
netcdf dftest {
dimensions:
date = 3 ;
variable = 4 ;
variables:
int64 date(date) ;
date:units = "days since 2023-01-01 00:00:00" ;
date:calendar = "proleptic_gregorian" ;
string variable(variable) ;
double value(date, variable) ;
value:_FillValue = NaN ;
data:
date = 0, 1, 2 ;
variable = "A", "B", "C", "D" ;
value =
0.3, 6.3, 12.3, 18.3,
2.3, 8.3, 14.3, 20.3,
4.3, 10.3, 16.3, 22.3 ;
}
$ ncdump -k dftest.nc
netCDF-4
$ h5ls -r -d dftest.nc
/ Group
/date Dataset {3}
Data:
0, 1, 2
/value Dataset {3, 4}
Data:
0.3, 6.3, 12.3, 18.3, 2.3, 8.3, 14.3, 20.3, 4.3, 10.3, 16.3, 22.3
/variable Dataset {4}
Data:
"A", "B", "C", "D"