Datasets¶

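The statsmodels package for Python gives access to many of the datasets that ship with R packages. Pass the dataset name and the package name, exactly as they are used in R, to the get_rdataset function in statsmodels; the data and its annotation are downloaded and stored in the returned object. The list of datasets that can be loaded this way is published on GitHub.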

import statsmodels.api as sm

economics_dataset = sm.datasets.get_rdataset('economics', 'ggplot2')
economics = economics_dataset.data
economics.head()

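# Note: .data is a pandas DataFrame, so this prints the generic DataFrame docstring (shown below).
# The R documentation of the dataset itself is stored in economics_dataset.__doc__.
print(economics_dataset.data.__doc__)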

    Two-dimensional size-mutable, potentially heterogeneous tabular data
    structure with labeled axes (rows and columns). Arithmetic operations
    align on both row and column labels. Can be thought of as a dict-like
    container for Series objects. The primary pandas data structure.

    Parameters
    ----------
    data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
        Dict can contain Series, arrays, constants, or list-like objects

        .. versionchanged :: 0.23.0
           If data is a dict, argument order is maintained for Python 3.6
           and later.

    index : Index or array-like
        Index to use for resulting frame. Will default to RangeIndex if
        no indexing information part of input data and no index provided
    columns : Index or array-like
        Column labels to use for resulting frame. Will default to
        RangeIndex (0, 1, 2, ..., n) if no column labels are provided
    dtype : dtype, default None
        Data type to force. Only a single dtype is allowed. If None, infer
    copy : boolean, default False
        Copy data from inputs. Only affects DataFrame / 2d ndarray input

    See Also
    --------
    DataFrame.from_records : Constructor from tuples, also record arrays.
    DataFrame.from_dict : From dicts of Series, arrays, or dicts.
    DataFrame.from_items : From sequence of (key, value) pairs
        pandas.read_csv, pandas.read_table, pandas.read_clipboard.

    Examples
    --------
    Constructing DataFrame from a dictionary.

    >>> d = {'col1': [1, 2], 'col2': [3, 4]}
    >>> df = pd.DataFrame(data=d)
    >>> df
       col1  col2
    0     1     3
    1     2     4

    Notice that the inferred dtype is int64.

    >>> df.dtypes
    col1    int64
    col2    int64
    dtype: object

    To enforce a single dtype:

    >>> df = pd.DataFrame(data=d, dtype=np.int8)
    >>> df.dtypes
    col1    int8
    col2    int8
    dtype: object

    Constructing DataFrame from numpy ndarray:

    >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
    ...                    columns=['a', 'b', 'c'])
    >>> df2
       a  b  c
    0  1  2  3
    1  4  5  6
    2  7  8  9

rice_dataset = sm.datasets.get_rdataset('rice', 'DAAG')
rice = rice_dataset.data
rice.head()

msleep_dataset = sm.datasets.get_rdataset('msleep', 'ggplot2')
msleep = msleep_dataset.data
msleep.head()

soils_dataset = sm.datasets.get_rdataset('Soils', 'carData')
soils = soils_dataset.data
soils.head()

iris_dataset = sm.datasets.get_rdataset('iris', 'datasets')
iris = iris_dataset.data
iris.head()

orange_dataset = sm.datasets.get_rdataset('Orange', 'datasets')
orange = orange_dataset.data
orange.head()

Basic pandas operations¶

To introduce the features of pandas below, we first import several related packages so that they are ready to use.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Series¶

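Series and DataFrame are the two most commonly used object types in pandas. A Series is a data structure similar to a one-dimensional array (vector). Every element of a Series has both a position and a label (name), so individual elements can be retrieved either by position or by label.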

s = pd.Series([1, 3, 5], index=['apple', 'banana', 'cherry'])
s

apple     1
banana    3
cherry    5
dtype: int64

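# positional access via iloc (plain s[1] is deprecated for label-indexed Series in recent pandas)
s.iloc[1]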

3

s['cherry']

5
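As a small illustration of the two access styles, the sketch below uses the explicit iloc (position) and loc (label) indexers, which avoid ambiguity when the labels are themselves integers. The Series `t` and its values are made up for this example.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
t = pd.Series([10, 20, 30], index=[2, 0, 1])

by_position = t.iloc[0]  # first element, whatever its label is
by_label = t.loc[0]      # element whose label is 0

print(by_position, by_label)
```

With integer labels, the two lookups return different elements, which is why the explicit indexers are usually the safer choice.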

Elements can also be selected with a list or array of Boolean values, as follows.

k = [True, False, True]
# k = np.array([True, False, True])
s[k]

apple     1
cherry    5
dtype: int64

k = pd.Series([True, False, True], index=['apple', 'banana', 'cherry'])
s[k]

apple     1
cherry    5
dtype: int64
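In practice the Boolean mask is rarely typed by hand; it usually comes from a comparison. The following sketch, which reuses the same three-element Series, assumes a threshold of 3 purely for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 5], index=['apple', 'banana', 'cherry'])

# A comparison returns a Boolean Series aligned with s
mask = s >= 3

# Keep only the elements where the mask is True
selected = s[mask]
print(selected)
```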

Methods for inspecting or removing duplicated elements in a Series include unique, value_counts, duplicated, and drop_duplicates.

s = pd.Series(['apple', 'apple', 'orange', 'apple', 'cherry', 'orange'])
s

0     apple
1     apple
2    orange
3     apple
4    cherry
5    orange
dtype: object

s.unique()

array(['apple', 'orange', 'cherry'], dtype=object)

s.value_counts()

apple     3
orange    2
cherry    1
dtype: int64

s.duplicated()

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

s.duplicated(keep='last')

0     True
1     True
2     True
3    False
4    False
5    False
dtype: bool

s.drop_duplicates()

0     apple
2    orange
4    cherry
dtype: object

s.drop_duplicates(keep='last')

3     apple
4    cherry
5    orange
dtype: object
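Besides 'first' and 'last', the keep argument of duplicated also accepts False, which flags every member of a duplicated group. The sketch below uses that to find the values that occur exactly once; the variable names are chosen only for this example.

```python
import pandas as pd

s = pd.Series(['apple', 'apple', 'orange', 'apple', 'cherry', 'orange'])

# keep=False marks all occurrences of any repeated value
all_dups = s.duplicated(keep=False)

# Negating the mask leaves the values that appear only once
only_singletons = s[~all_dups]
print(only_singletons)
```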

crosstab¶

The crosstab function takes two arrays or Series and counts how often each combination of their values occurs. The result is returned as a DataFrame.

fruits = np.array(['apple', 'apple', 'banana', 'apple', 'banana', 'cherry', 'apple'])
colors = np.array(['red',   'red',   'yellow', 'green', 'green',  'red',    'red'])

pd.crosstab(fruits, colors, rownames=['fruit'], colnames=['color'])
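If row and column totals are also wanted, crosstab has a margins option. The sketch below repeats the fruit and color arrays from this section and adds the totals, which pandas labels 'All' by default.

```python
import numpy as np
import pandas as pd

fruits = np.array(['apple', 'apple', 'banana', 'apple', 'banana', 'cherry', 'apple'])
colors = np.array(['red', 'red', 'yellow', 'green', 'green', 'red', 'red'])

# margins=True appends a row and a column named 'All' holding the totals
table = pd.crosstab(fruits, colors, rownames=['fruit'], colnames=['color'],
                    margins=True)
print(table)
```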

cut¶

The cut function takes continuous data and converts it into discrete values (bins).

s = pd.Series([10, 12, 31, 13, 8, 30, 22, 27, 32, 37, 29])
pd.cut(s, bins=3)

0      (7.971, 17.667]
1      (7.971, 17.667]
2       (27.333, 37.0]
3      (7.971, 17.667]
4      (7.971, 17.667]
5       (27.333, 37.0]
6     (17.667, 27.333]
7     (17.667, 27.333]
8       (27.333, 37.0]
9       (27.333, 37.0]
10      (27.333, 37.0]
dtype: category
Categories (3, interval[float64]): [(7.971, 17.667] < (17.667, 27.333] < (27.333, 37.0]]
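Instead of asking for a number of equal-width bins, cut can be given explicit bin edges and labels. The edges (0, 15, 30, 45) and the labels in this sketch are assumptions chosen for illustration, not values from the original text.

```python
import pandas as pd

s = pd.Series([10, 12, 31, 13, 8, 30, 22, 27, 32, 37, 29])

# Intervals are (0, 15], (15, 30], (30, 45]; the right edge is included by default
binned = pd.cut(s, bins=[0, 15, 30, 45], labels=['low', 'mid', 'high'])

# Count the members of each bin, keeping the bin order
counts = binned.value_counts(sort=False)
print(counts)
```

Because the right edge is inclusive, the value 30 falls into the 'mid' bin rather than 'high'.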

get_dummies¶

get_dummies is a convenient function for creating dummy variables.

s = ['apple', 'apple', 'banana', 'cherry', 'cherry', 'apple']
pd.get_dummies(s)

s = pd.Series([10, 12, 31, 13, 8, 30, 22, 27, 32, 37, 29])
pd.get_dummies(pd.cut(s, bins=3))

    (7.971, 17.667]  (17.667, 27.333]  (27.333, 37.0]
0                 1                 0               0
1                 1                 0               0
2                 0                 0               1
3                 1                 0               0
4                 1                 0               0
5                 0                 0               1
6                 0                 1               0
7                 0                 1               0
8                 0                 0               1
9                 0                 0               1
10                0                 0               1
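Recent pandas releases return Boolean dummy columns by default, so the 0/1 table above may appear as True/False. A minimal sketch, assuming integer indicators are wanted, passes dtype=int; the prefix 'fruit' is a name chosen for this example.

```python
import pandas as pd

fruit = pd.Series(['apple', 'apple', 'banana', 'cherry', 'cherry', 'apple'])

# dtype=int requests 0/1 columns; prefix is prepended to each column name
dummies = pd.get_dummies(fruit, prefix='fruit', dtype=int)
print(dummies)
```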

DataFrame¶

A DataFrame is a data structure similar to a two-dimensional array (matrix). Like a Series, a DataFrame tracks its elements by both position and label, so elements can be retrieved either way. Unlike a Series, however, element access goes through an indexer: use iloc to select by position and loc to select by label.
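The difference between the two indexers can be sketched on a tiny table. The row labels 'p1' to 'p3' and the column names are invented for this example.

```python
import pandas as pd

# Small made-up table with string row labels
df = pd.DataFrame({'root': [56, 66, 40], 'shoot': [132, 120, 108]},
                  index=['p1', 'p2', 'p3'])

by_label = df.loc['p2', 'shoot']  # row label, column label
by_position = df.iloc[1, 1]       # row position, column position

print(by_label, by_position)
```

Both expressions address the same cell, once by name and once by position.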

Selecting columns¶

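The rice dataset records the plant number (PlantNo), block number (Block), root dry mass (RootDryMass), shoot dry mass (ShootDryMass), variety (variety), and fertilizer treatment (fert), among other variables. From this dataset, we extract only the variety column and check which values it contains.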

rice.head()

rice_subset = rice.loc[:, 'variety']
rice_subset.head()

0    wt
1    wt
2    wt
3    wt
4    wt
Name: variety, dtype: object

To list the unique values in the variety column, use the unique method.

rice_subset.unique()

array(['wt', 'ANU843'], dtype=object)

Next, use the value_counts method to check which values the fert column contains and how many times each occurs.

rice_subset = rice.loc[:, 'fert']
rice_subset.value_counts()

NH4Cl     24
F10       24
NH4NO3    24
Name: fert, dtype: int64

To extract several columns from a DataFrame, pass the positions or names of the target columns as a list. The next example extracts several columns (variety, fert, RootDryMass, ShootDryMass) from the rice dataset.

rice_subset = rice.loc[:, ['variety', 'fert', 'RootDryMass', 'ShootDryMass']]
rice_subset.head()

Columns can also be selected with a list or array of Boolean values.

k = [False, False, True, True, False, True, True]
rice.loc[:, k].head()
# rice.iloc[:, k].head()

Selecting rows¶

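Rows are extracted in the same way as columns. The next example extracts only the first two rows of the DataFrame. Note that loc slices by label and includes both endpoints, so 0:1 selects the rows labelled 0 and 1.

rice.loc[0:1, :]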
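One detail worth keeping in mind when slicing rows: a loc slice includes the end label, whereas an iloc slice excludes the end position. The sketch below shows the difference on a made-up four-row table with the default labels 0 to 3.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})  # default row labels 0..3

label_slice = df.loc[0:2, :]      # labels 0, 1 and 2 -> three rows
position_slice = df.iloc[0:2, :]  # positions 0 and 1 -> two rows

print(len(label_slice), len(position_slice))
```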

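Rows can also be extracted by applying a threshold or other condition to the values of a column and keeping only the rows that satisfy it. For example, to extract the rows whose fert value is F10: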

is_F10 = (rice.loc[:, 'fert'] == 'F10')

rice_subset = rice.loc[is_F10, :]
rice_subset

Multiple conditions can be combined. The next example extracts only the data for the wt variety under the F10 treatment.

is_wt = (rice.loc[:, 'variety'] == 'wt')
is_F10 = (rice.loc[:, 'fert'] == 'F10')

rice_subset = rice.loc[(is_wt & is_F10), :]
rice_subset

Next, find the wt plants whose RootDryMass is greater than 50 and whose ShootDryMass is greater than 120.

subset_idx = (rice.loc[:, 'variety'] == 'wt') & (rice.loc[:, 'RootDryMass'] > 50) & (rice.loc[:, 'ShootDryMass'] > 120)
rice_subset = rice.loc[subset_idx, :]
rice_subset
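Conditions can be combined with | (or) as well as & (and), and isin tests membership in a list of values. The sketch below runs on a few made-up rows that only borrow the column names of the rice dataset, so the numbers carry no meaning.

```python
import pandas as pd

# Made-up rows that mimic the column names of the rice dataset
df = pd.DataFrame({
    'variety': ['wt', 'wt', 'ANU843', 'wt'],
    'fert': ['F10', 'NH4Cl', 'F10', 'NH4NO3'],
    'RootDryMass': [56, 12, 6, 17],
})

# Each comparison needs its own parentheses because & and | bind more tightly than == or >
mask = (df['variety'] == 'wt') & (df['fert'].isin(['F10', 'NH4Cl']))
subset = df.loc[mask, :]

# Rows that satisfy at least one of the two conditions
either = df.loc[(df['fert'] == 'F10') | (df['RootDryMass'] > 15), :]
print(subset)
print(either)
```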

Duplicates¶

Most of the methods used on Series above, such as value_counts, duplicated, and drop_duplicates, can also be applied to a DataFrame (unique is the exception: it is defined only for Series).

rice.drop_duplicates('variety')

rice.drop_duplicates(['variety', 'fert'])

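Group operations¶

In pandas, the groupby and apply methods support group operations, which split a dataset into subsets and aggregate each one. To aggregate with a method that pandas already defines (common functions such as mean and max), call that method on the result of groupby. To aggregate with a function you define yourself, pass it to apply on the result of groupby.

The next example groups the rice DataFrame by the value of variety and computes the mean of each numeric column within each group.

# numeric_only=True skips text columns such as trt and fert, which recent pandas refuses to average
rice_ave = rice.groupby('variety').mean(numeric_only=True)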
rice_ave

To aggregate only RootDryMass and ShootDryMass, extract those columns first and then aggregate. In that case the column that defines the groups (variety) must be extracted along with them.

rice_ave =  rice.loc[:, ['variety', 'RootDryMass', 'ShootDryMass']].\
                 groupby('variety').\
                 mean()

rice_ave

Groups can also be formed from combinations of the values in several columns. For example, to group by the combination of variety and treatment (fert) and then compute the mean of RootDryMass and ShootDryMass for each group:

rice_ave = rice.loc[:, ['variety', 'fert', 'RootDryMass', 'ShootDryMass']].\
                groupby(['variety', 'fert']).\
                mean()

rice_ave
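Several built-in aggregations can be computed in one pass with agg. The following sketch uses four made-up rows with rice-like column names; selecting the value columns after groupby is one way to keep the grouping column out of the computation.

```python
import pandas as pd

# Made-up values; only the column names follow the rice dataset
df = pd.DataFrame({
    'variety': ['wt', 'wt', 'ANU843', 'ANU843'],
    'RootDryMass': [56, 66, 6, 4],
    'ShootDryMass': [132, 120, 8, 6],
})

# The result has two column levels: the original column and the statistic
summary = df.groupby('variety')[['RootDryMass', 'ShootDryMass']].agg(['mean', 'max'])
print(summary)
```

Individual cells are addressed with a tuple, for example `summary.loc['wt', ('RootDryMass', 'mean')]`.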

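If no aggregation method is chained after groupby, the result can be iterated over, yielding each group name together with its subset. For example, to compute the difference between the maximum and minimum of the RootDryMass and ShootDryMass columns for each group, iterate as follows.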

ranges = []

# group by a single column name so that gname is a plain string
for gname, gdata in rice.loc[:, ['variety', 'RootDryMass', 'ShootDryMass']].groupby('variety'):

  # maximum and minimum of RootDryMass and ShootDryMass within the group
  subset_max = gdata.iloc[:, 1:3].max()
  subset_min = gdata.iloc[:, 1:3].min()

  # range = max - min for each column
  subset_range = subset_max - subset_min

  # label the result with the group name
  subset_range.name = gname

  # collect the per-group ranges
  ranges.append(subset_range)


df = pd.DataFrame(ranges)
df

The iteration over groupby shown above can also be rewritten with the apply method. In that case, the body of the for loop must be defined as a function beforehand.

def calc_range(x):
  # note that x should be a Series or a DataFrame  of a subset
  
  # calculate MAX for each column
  x_max = x.max()
  # calculate MIN for each column
  x_min = x.min()
  # calculate range for each column
  x_range = x_max - x_min
  
  return x_range
  

  
# select the value columns after grouping so that calc_range only sees numeric data
rice_range = rice.groupby(['variety', 'fert'])[['RootDryMass', 'ShootDryMass']].\
                  apply(calc_range)

rice_range
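For a per-column statistic such as the range, agg with a function is a possible alternative to apply. This sketch uses made-up rows with rice-like column names; the lambda receives one column of one group at a time.

```python
import pandas as pd

# Made-up values; only the column names follow the rice dataset
df = pd.DataFrame({
    'variety': ['wt', 'wt', 'ANU843', 'ANU843'],
    'RootDryMass': [56, 66, 6, 4],
    'ShootDryMass': [132, 120, 8, 6],
})

# max - min is evaluated separately for every column within every group
ranges = df.groupby('variety')[['RootDryMass', 'ShootDryMass']].agg(
    lambda col: col.max() - col.min()
)
print(ranges)
```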

Combining tables¶

pandas provides the concat and merge functions for combining two DataFrames. concat joins two DataFrames as they are, either vertically or horizontally. merge is used to join two DataFrames on the values of a key column.

concat¶

When the DataFrames have column or row labels, concat uses those labels to decide where each value goes in the combined result.

df1 = pd.DataFrame(np.random.randn(3, 3))
df1.columns = ['F1', 'F2', 'F3']
df1.index = ['A',  'B', 'X']
df1

df2 = pd.DataFrame(np.random.randn(4, 3))
df2.columns = ['F1', 'F2', 'F4']
df2.index = ['A',  'B',  'C', 'Y']
df2

To stack the DataFrames along the first dimension (vertically, adding rows), specify axis=0.

df = pd.concat([df1, df2], axis=0, sort=False)
df

To join the DataFrames along the second dimension (horizontally, adding columns), specify axis=1.

df = pd.concat([df1, df2], axis=1,  sort=False)
df

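Resetting the column or row labels removes their influence on concat. The ignore_index option is provided for ignoring labels, but it may not do what you expect: it discards the labels only along the concatenation axis, while the other axis is still aligned by label.

# same alignment as `pd.concat([df1, df2], axis=0,  sort=False)`, but the row labels become 0, 1, 2, ...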
df = pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
df

# same alignment as `pd.concat([df1, df2], axis=1,  sort=False)`, but the column labels become 0, 1, 2, ...
df = pd.concat([df1, df2], axis=1, sort=False, ignore_index=True)
df
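The effect of ignore_index is easier to see with fixed numbers than with random ones. The two small frames in this sketch are made up; they share the column F1 and the row label A.

```python
import pandas as pd

df1 = pd.DataFrame({'F1': [1, 2], 'F3': [3, 4]}, index=['A', 'B'])
df2 = pd.DataFrame({'F1': [5, 6], 'F4': [7, 8]}, index=['A', 'C'])

# axis=0: row labels are replaced by 0..3, columns are still aligned by name
stacked = pd.concat([df1, df2], axis=0, ignore_index=True, sort=False)

# axis=1: column labels are replaced by 0..3, rows are still aligned by label
side_by_side = pd.concat([df1, df2], axis=1, ignore_index=True, sort=False)

print(stacked)
print(side_by_side)
```

In both results the labels on the other axis still drive the alignment, which is where the NaN cells come from.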

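To be certain that row labels have no effect, reset them yourself instead of relying on the concat option. The following code first concatenates without resetting, as above, and then resets the row labels and concatenates again.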

df1 = pd.DataFrame(np.random.randn(3, 3))
df1.columns = ['F1', 'F2', 'F3']
df1.index = ['A',  'B', 'X']
df2 = pd.DataFrame(np.random.randn(4, 3))
df2.columns = ['F1', 'F2', 'F4']
df2.index = ['A',  'B',  'C', 'Y']

df = pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
df

df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)

df = pd.concat([df1, df2], axis=1, sort=False, ignore_index=True)
df

There is no option that makes concat ignore column labels during alignment, so reset the column labels yourself.

df1 = pd.DataFrame(np.random.randn(3, 3))
df1.columns = ['F1', 'F2', 'F3']
df1.index = ['A',  'B', 'X']
df2 = pd.DataFrame(np.random.randn(4, 3))
df2.columns = ['F1', 'F2', 'F4']
df2.index = ['A',  'B',  'C', 'Y']

df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
df = pd.concat([df1, df2], axis=0, sort=False)
df

df1 = pd.DataFrame(np.random.randn(3, 3))
df1.columns = ['F1', 'F2', 'F3']
df1.index = ['A',  'B', 'X']
df2 = pd.DataFrame(np.random.randn(4, 3))
df2.columns = ['F1', 'F2', 'F4']
df2.index = ['A',  'B',  'C', 'Y']

df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
df = pd.concat([df1, df2], axis=1, sort=False)
df

When both the row and column labels are reset, the two DataFrames are combined exactly as they are, by position.

df1 = pd.DataFrame(np.random.randn(3, 3))
df1.columns = ['F1', 'F2', 'F3']
df1.index = ['A',  'B', 'X']
df2 = pd.DataFrame(np.random.randn(4, 3))
df2.columns = ['F1', 'F2', 'F4']
df2.index = ['A',  'B',  'C', 'Y']

df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)

df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))

df = pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
df

df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
df = pd.concat([df1, df2], axis=1, sort=False, ignore_index=True)
df

merge¶

merge joins two DataFrames by matching the values of a column that has the same name in both. Row labels (and the other column labels) have no effect on the result.

df1 = pd.concat([pd.Series(['P1', 'P2', 'P3']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df1.columns = ['PlantID', 'root', 'shoot']
df1.index = ['A',  'B',  'C']
df1

df2 = pd.concat([pd.Series(['P1', 'P4', 'P2']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df2.columns = ['PlantID', 'petal', 'sepal']
df2.index = ['X',  'Y',  'Z']
df2

The join type (how) used when merging two DataFrames can be inner, outer, left, or right. If nothing is specified, inner is used.

# df = pd.merge(df1, df2, on = 'PlantID')
df = pd.merge(df1, df2, on = 'PlantID', how='inner')
df

df = pd.merge(df1, df2, on = 'PlantID', how='outer')
df

df = pd.merge(df1, df2, on = 'PlantID', how='left')
df

df = pd.merge(df1, df2, on = 'PlantID', how='right')
df
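To see how the join types differ, it can help to count the rows each one produces and to ask merge where every row came from. The sketch below uses fixed, made-up values and the indicator option, which adds a `_merge` column.

```python
import pandas as pd

left = pd.DataFrame({'PlantID': ['P1', 'P2', 'P3'], 'root': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'PlantID': ['P1', 'P4', 'P2'], 'petal': [10.0, 40.0, 20.0]})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(left, right, on='PlantID', how=how))
              for how in ['inner', 'outer', 'left', 'right']}

# indicator=True records whether a row was found in the left frame, the right frame, or both
merged = pd.merge(left, right, on='PlantID', how='outer', indicator=True)
origin = merged.set_index('PlantID')['_merge']

print(row_counts)
print(merged)
```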

When the key column has a different name in each DataFrame, use the left_on and right_on options instead of on.

df1 = pd.concat([pd.Series(['P1', 'P2', 'P3']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df1.columns = ['plant_id', 'root', 'shoot']
df1.index = ['A',  'B',  'C']

df2 = pd.concat([pd.Series(['P1', 'P4', 'P2']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df2.columns = ['PlantID', 'petal', 'sepal']
df2.index = ['X',  'Y',  'Z']


df = pd.merge(df1, df2, left_on = 'plant_id', right_on = 'PlantID', how='outer')
df

When a non-key column name appears in both DataFrames, _x is appended to the name from the left DataFrame and _y to the name from the right DataFrame.

df1 = pd.concat([pd.Series(['P1', 'P2', 'P3']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df1.columns = ['PlantID', 'root', 'shoot']

df2 = pd.concat([pd.Series(['P1', 'P4', 'P2']),
                 pd.DataFrame(np.random.randn(3, 2))],  axis = 1)
df2.columns = ['PlantID', 'flower', 'shoot']

df = pd.merge(df1, df2, on = 'PlantID', how='outer')
df

The suffixes option changes the suffixes appended to such overlapping column names.

df = pd.merge(df1, df2, on = 'PlantID', how='outer', suffixes = ['_left', '_right'])
df

DataFrames of different sizes can also be joined, just like tables in an SQL database.

df1 = pd.concat([pd.Series(['S1', 'S2', 'S3', 'S1', 'S3']),
                 pd.DataFrame(np.random.randn(5, 2))],  axis = 1)
df1.columns = ['strain', 'root', 'shoot']
df1

df2 = pd.concat([pd.Series(['S1', 'S4', 'S2']),
                 pd.Series([12, 43, 25])],  axis = 1)
df2.columns = ['strain', 'lifespan']
df2

df = pd.merge(df1, df2, on = 'strain', how='outer')
df

df = pd.merge(df1, df2, on = 'strain') # how='inner'
df

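Reshaping tables¶

Tabular data comes in several layouts. Two common ones are the following. In the first, each row represents one set of observations (one record): each column holds a different attribute, and neighboring columns generally have different units. In the second, the data form a matrix: both the rows and the columns carry an attribute of the data, and all cells share the same unit.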

orange.head()

pivot¶

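Use the pivot method to convert a record-style DataFrame into a matrix. In the example above, to build a matrix from the original orange dataset with the Tree values as column labels and the age values as row labels, pass those column names to the columns and index options and set values to circumference.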

df = orange.pivot(index='age', columns='Tree', values='circumference').head()
df

melt¶

Use the melt method to convert matrix-style data back into record-style form.

iris.head()

iris.melt(id_vars=['Species']).head()

iris.melt(id_vars=['Species'], var_name='Type').head()
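pivot and melt are roughly inverse operations. The following sketch makes the round trip on four made-up rows that borrow the column names of the orange dataset.

```python
import pandas as pd

# Made-up record-style table (one measurement per row)
long_df = pd.DataFrame({
    'Tree': [1, 1, 2, 2],
    'age': [118, 484, 118, 484],
    'circumference': [30, 58, 33, 69],
})

# Record style -> matrix: ages as rows, trees as columns
wide = long_df.pivot(index='age', columns='Tree', values='circumference')

# Matrix -> record style: turn the age index back into a column, then melt
back = wide.reset_index().melt(id_vars='age', var_name='Tree',
                               value_name='circumference')

print(wide)
print(back)
```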

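Tidying data¶

In addition to pivot and melt, pandas provides the stack and unstack methods. They apply to a Series or DataFrame and move labels between the row index and the column index: stack folds column labels into an additional row-index level, and unstack moves a row-index level out into the columns.

To return a table converted with pivot to its original shape, break the DataFrame apart with unstack and then reset the index.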

orange.head()

df = orange.pivot(index='age', columns='Tree', values='circumference').head()
df

df.unstack('age').reset_index().head()

Combining melt, pivot, unstack, and stack makes it possible to reshape the results of a grouped aggregation. Chaining these methods takes some practice.

iris.melt(id_vars=['Species'], var_name='Type').\
     groupby(['Species', 'Type']).\
     mean()

iris.melt(id_vars=['Species'], var_name='Type').\
     groupby(['Species', 'Type']).\
     mean().\
     unstack('Type')

iris.melt(id_vars=['Species'], var_name='Type').\
     groupby(['Species', 'Type']).\
     mean().\
     unstack('Species')

iris.melt(id_vars=['Species'], var_name='Type').\
     groupby(['Species', 'Type']).\
     mean().\
     unstack('Species').\
     shape

(4, 3)
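The relationship between stack and unstack can be sketched on a two-by-two table. The species names and numbers here are placeholders, not values computed from the iris dataset.

```python
import pandas as pd

wide = pd.DataFrame({'Petal': [1.5, 4.3], 'Sepal': [5.0, 5.9]},
                    index=pd.Index(['setosa', 'versicolor'], name='Species'))

# stack: the column labels become an inner level of the row index (result is a Series)
stacked = wide.stack()

# unstack: the inner row level moves back out into the columns
restored = stacked.unstack()

print(stacked)
print(restored)
```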

Removing missing values¶

Use the isna or notna method to check whether values are missing. The resulting Boolean mask can then be used to extract only the rows or columns without missing values.

msleep.head()

msleep.isna().head()

msleep.notna().head()

To extract the rows in which the brainwt column is not missing:

msleep.loc[msleep.loc[:, 'brainwt'].notna(), :].head()

To drop every row or column that contains a missing value, use the dropna method.

msleep.dropna(axis=0).head()

msleep.dropna(axis=1).head()
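dropna can also be restricted to particular columns, and fillna replaces missing values instead of dropping them. The sketch below uses three rows copied from the msleep preview; filling with 0.0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Cheetah', 'Owl monkey', 'Cow'],
    'sleep_rem': [np.nan, 1.8, 0.7],
    'brainwt': [np.nan, 0.0155, 0.423],
})

# Number of missing values in each column
n_missing = df.isna().sum()

# Drop rows only when brainwt is missing
complete = df.dropna(subset=['brainwt'])

# Fill missing sleep_rem values and leave the other columns untouched
filled = df.fillna({'sleep_rem': 0.0})

print(n_missing)
```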

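Visualization with pandas¶

The matplotlib and seaborn packages are the usual choice for drawing graphs in Python. They can produce complex figures and allow fine adjustments, but they are correspondingly harder to use. Besides matplotlib and seaborn, pandas itself offers simple plotting features. Fine control is limited, but calling a method such as plot or hist on a Series or DataFrame is enough to draw a graph, which is very convenient for a quick look at the distribution of the data during analysis.

The code in the next cell is optional: the graphs can be drawn without it. Running it simply makes them look nicer.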

sns.set()
sns.set_style("whitegrid")
sns.set_palette('Set1')

Scatter plot¶

fig, ax = plt.subplots()
rice.plot.scatter(x='RootDryMass', y='ShootDryMass', ax=ax)
fig.show()

'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
/anaconda3/lib/python3.7/site-packages/matplotlib/figure.py:445: UserWarning: Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.
  % get_backend())

cols = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

fig, ax = plt.subplots()

for i, (group_name, group_subset) in enumerate(rice.groupby('variety')):
  ax = group_subset.plot.scatter(x='RootDryMass', y='ShootDryMass', ax=ax, color=cols[i])
  
fig.show()

Bar chart¶

rice_ave = (
    rice.loc[:, ['variety', 'RootDryMass', 'ShootDryMass']]
        .groupby('variety')
        .mean()
)

fig, ax = plt.subplots()
rice_ave.plot.bar(ax=ax)
fig.show()

fig, ax = plt.subplots()
rice_ave.plot.bar(stacked=True, ax=ax)
fig.show()

rice_ave = (
    rice.loc[:, ['variety', 'fert', 'RootDryMass', 'ShootDryMass']]
        .groupby(['variety', 'fert'])
        .mean()
)

fig, ax = plt.subplots()
rice_ave.plot.bar(ax=ax)
fig.show()

fig, ax = plt.subplots()
rice_ave.plot.bar(stacked=True, ax=ax)
fig.show()

Box plot¶

fig, ax = plt.subplots()
rice.plot.box(y = ['RootDryMass', 'ShootDryMass'], ax=ax)
fig.show()

fig, ax = plt.subplots()
rice.loc[:, ['RootDryMass', 'ShootDryMass']].boxplot(ax=ax)
fig.show()

fig, ax = plt.subplots()
rice.loc[:, ['variety', 'RootDryMass', 'ShootDryMass']].\
     groupby('variety').\
     boxplot(ax=ax)
fig.show()

/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3296: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
  exec(code_obj, self.user_global_ns, self.user_ns)

#fig, ax = plt.subplots()
#rice.loc[:, ['variety', 'fert', 'RootDryMass', 'ShootDryMass']].\
#     groupby(['variety', 'fert']).\
#     boxplot(ax=ax)
#
#fig.show()

Line chart¶

Line charts are well suited to time series, so this section plots time-series data.

economics.head()

fig, ax = plt.subplots()
ax = economics.plot(x='date', y='pce', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
fig.show()

fig, ax = plt.subplots()
ax = economics.plot(x='date', y=['psavert', 'uempmed'], ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
fig.show()

fig, ax = plt.subplots()
ax = economics.plot(x='date', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
fig.show()

fig, ax = plt.subplots()
ax = economics.plot(x='date', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yscale('log')
fig.show()

Histogram¶

fig, ax = plt.subplots()
rice.plot.hist(y = 'RootDryMass',  ax=ax)
fig.show()

fig, ax = plt.subplots()
## rice.plot.hist(y = ['RootDryMass', 'ShootDryMass'], stacked=False)
rice.plot.hist(y = ['RootDryMass', 'ShootDryMass'], alpha = 0.5, ax=ax)
fig.show()

fig, ax = plt.subplots()
rice.plot.hist(y = ['RootDryMass', 'ShootDryMass'], stacked=True, ax=ax)
fig.show()

Pair plot¶

iris.head()

plt.rcParams['figure.figsize'] = (10.0, 10.0)
fig, ax = plt.subplots()
pd.plotting.scatter_matrix(iris, ax=ax)
fig.show()

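Highlighting tables¶

The style accessor of a DataFrame can highlight values when the DataFrame is displayed on screen. Two methods are provided for this: style.apply and style.applymap (renamed style.map in pandas 2.1). style.apply highlights column by column or row by row, for example coloring the maximum of each column. style.applymap, in contrast, is applied to every element of the DataFrame individually, for example coloring all negative values red.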

np.random.seed(2019)

df = pd.DataFrame(np.random.randn(4, 6))
df.columns = ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']
df.index = ['A',  'B', 'C', 'D']
df

def highlight_max(val):
  val_max = val.max()
  
  highlight_tags = []
  
  for v in val:
    if v == val_max:
      highlight_tags.append('background-color: orange')
    else:
      highlight_tags.append('')
  
  return highlight_tags
  

df.style.apply(highlight_max)

df.style.apply(highlight_max, axis=1)

def highlight_negatives(val):
  
  highlight_tags = ''
  
  if val < 0:
    highlight_tags = 'color: #ff0000;'  
  
  return highlight_tags
  

df.style.applymap(highlight_negatives)

df.style.apply(highlight_max).applymap(highlight_negatives)

df.style.apply(highlight_max, subset=['F1', 'F2', 'F3']).\
         applymap(highlight_negatives, subset=['F4', 'F5', 'F6'])

df.style.applymap(highlight_negatives, subset=pd.IndexSlice[['A', 'C'], ['F1', 'F3', 'F5']])

The number of digits displayed can be adjusted with the style.format method.

df.style.format('{:.2f}')

df.style.format('{:.2f}').format({'F1': '{:.5f}'})

df.iloc[0, 5] = np.nan
df.iloc[2, 2] = np.nan
# the color is passed positionally because the keyword was renamed from null_color to color in newer pandas
df.style.highlight_null('darkgray')

#import seaborn as sns
df.style.background_gradient(cmap='Blues').\
         highlight_null('darkgray')

/anaconda3/lib/python3.7/site-packages/matplotlib/colors.py:512: RuntimeWarning: invalid value encountered in less
  xa[xa < 0] = -1

df.style.bar(color=['lightseagreen', 'sandybrown'], align='mid').highlight_null('darkgray')

After adjusting the styling, the result can also be saved in Excel format with the to_excel method.

import openpyxl

df.style.background_gradient(cmap='Blues').\
         highlight_null('#cccccc').\
         to_excel('output.xlsx')

	date	pce	pop	psavert	uempmed	unemploy
0	1967-07-01	507.4	198712	12.5	4.5	2944
1	1967-08-01	510.5	198911	12.5	4.7	2945
2	1967-09-01	516.3	199113	11.7	4.6	2958
3	1967-10-01	512.9	199311	12.5	4.9	3143
4	1967-11-01	518.1	199498	12.5	4.7	3066

	name	genus	vore	order	conservation	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NaN	NaN	11.9	NaN	50.000
1	Owl monkey	Aotus	omni	Primates	NaN	17.0	1.8	NaN	7.0	0.01550	0.480
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NaN	9.6	NaN	1.350
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4.0	0.7	0.666667	20.0	0.42300	600.000

	Group	Contour	Depth	Gp	Block	pH	N	Dens	P	Ca	Mg	K	Na	Conduc
0	1	Top	0-10	T0	1	5.40	0.188	0.92	215	16.35	7.65	0.72	1.14	1.09
1	1	Top	0-10	T0	2	5.65	0.165	1.04	208	12.25	5.15	0.71	0.94	1.35
2	1	Top	0-10	T0	3	5.14	0.260	0.95	300	13.02	5.68	0.68	0.60	1.41
3	1	Top	0-10	T0	4	5.14	0.169	1.10	248	11.92	7.88	1.09	1.01	1.64
4	2	Top	10-30	T1	1	5.14	0.164	1.12	174	14.17	8.12	0.70	2.17	1.85

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

	Tree	age	circumference
0	1	118	30
1	1	484	58
2	1	664	87
3	1	1004	115
4	1	1231	120

	PlantNo	Block	RootDryMass	ShootDryMass	trt	fert	variety
0	1	1	56	132	F10	F10	wt
1	2	1	66	120	F10	F10	wt
2	3	1	40	108	F10	F10	wt
3	4	1	43	134	F10	F10	wt
4	5	1	55	119	F10	F10	wt

	PlantNo	Block	RootDryMass	ShootDryMass	trt	fert	variety
0	1	1	56	132	F10	F10	wt
1	2	1	66	120	F10	F10	wt
2	3	1	40	108	F10	F10	wt
3	4	1	43	134	F10	F10	wt
4	5	1	55	119	F10	F10	wt
5	6	1	66	125	F10	F10	wt
6	7	2	41	98	F10	F10	wt
7	8	2	67	122	F10	F10	wt
8	9	2	40	114	F10	F10	wt
9	10	2	35	82	F10	F10	wt
10	11	2	44	37	F10	F10	wt
11	12	2	41	109	F10	F10	wt
36	1	1	6	8	F10 +ANU843	F10	ANU843
37	2	1	4	6	F10 +ANU843	F10	ANU843
38	3	1	4	3	F10 +ANU843	F10	ANU843
39	4	1	7	1	F10 +ANU843	F10	ANU843
40	5	1	5	7	F10 +ANU843	F10	ANU843
41	6	1	6	5	F10 +ANU843	F10	ANU843
42	7	2	6	10	F10 +ANU843	F10	ANU843
43	8	2	5	17	F10 +ANU843	F10	ANU843
44	9	2	7	3	F10 +ANU843	F10	ANU843
45	10	2	3	5	F10 +ANU843	F10	ANU843
46	11	2	12	15	F10 +ANU843	F10	ANU843
47	12	2	7	8	F10 +ANU843	F10	ANU843

	PlantNo	Block	RootDryMass	ShootDryMass	trt	fert	variety
0	1	1	56	132	F10	F10	wt
1	2	1	66	120	F10	F10	wt
2	3	1	40	108	F10	F10	wt
3	4	1	43	134	F10	F10	wt
4	5	1	55	119	F10	F10	wt
5	6	1	66	125	F10	F10	wt
6	7	2	41	98	F10	F10	wt
7	8	2	67	122	F10	F10	wt
8	9	2	40	114	F10	F10	wt
9	10	2	35	82	F10	F10	wt
10	11	2	44	37	F10	F10	wt
11	12	2	41	109	F10	F10	wt

	PlantNo	Block	RootDryMass	ShootDryMass	trt	fert	variety
0	1	1	56	132	F10	F10	wt
12	1	1	12	45	NH4Cl	NH4Cl	wt
24	1	1	12	71	NH4NO3	NH4NO3	wt
36	1	1	6	8	F10 +ANU843	F10	ANU843
48	1	1	4	22	NH4Cl +ANU843	NH4Cl	ANU843
60	1	1	19	75	NH4NO3 +ANU843	NH4NO3	ANU843

	PlantNo	Block	RootDryMass	ShootDryMass
variety
ANU843	6.5	1.5	9.666667	41.805556
wt	6.5	1.5	26.472222	77.305556

		RootDryMass	ShootDryMass
variety	fert
ANU843	F10	6.000000	7.333333
	NH4Cl	9.166667	46.583333
	NH4NO3	13.833333	71.500000
wt	F10	49.500000	108.333333
	NH4Cl	12.583333	50.250000
	NH4NO3	17.333333	73.333333

	F1	F2	F3
A	-0.764024	0.306628	0.776008
B	-0.344668	-0.219920	0.564846
X	0.522977	1.337932	0.247975

	F1	F2	F4
A	0.583184	0.226492	-0.764679
B	-0.357183	-1.676316	-1.389232
C	0.439857	-0.320169	-0.637880
Y	0.730783	-0.528459	-0.103577

	F1	F2	F3	F4
0	1.816683	-0.554099	-0.666154	NaN
1	-0.372624	1.366793	0.261672	NaN
2	-1.272202	0.034904	-0.893761	NaN
3	-0.332831	-0.598696	NaN	0.365123
4	0.875792	1.053824	NaN	1.494547
5	0.853575	0.034900	NaN	1.156063
6	-1.383452	-1.800315	NaN	-0.235855

color	green	red	yellow
fruit
apple	1	3	0
banana	1	0	1
cherry	0	1	0

	0	1	2
A	2.825083	0.800283	-0.462060
B	0.735840	0.146831	0.705392
X	0.046357	-0.915951	0.224711
A	0.756206	0.761853	0.419401
B	-0.225666	0.255290	-0.704153
C	-0.465020	-0.143270	0.052430
Y	1.266917	-1.913585	0.535715

	0	1	2	0	1	2
A	0.333400	-0.015632	0.174602	1.340767	-0.667910	-0.014231
B	-0.287431	0.033686	-0.775078	-0.702372	0.594559	-0.285923
X	0.707970	0.505455	-0.360213	NaN	NaN	NaN
C	NaN	NaN	NaN	0.120742	0.473677	-2.148731
Y	NaN	NaN	NaN	-1.297087	0.815260	-0.502716

	0	1	2
0	-0.533029	0.134587	0.980444
1	-1.166124	0.522789	1.661420
2	-0.065871	-0.182839	-0.207391
3	0.118123	1.626853	-0.266506
4	-0.522242	0.066521	0.086518
5	2.005034	0.617574	0.171313
6	-0.472538	1.041956	-0.542251

	PlantID	root	shoot
A	P1	1.525004	-0.378940
B	P2	-0.424634	1.227215
C	P3	0.521289	-1.068076

	PlantID	petal	sepal
X	P1	-0.688158	-2.188783
Y	P4	-0.117716	-0.588993
Z	P2	1.742187	0.560399

	plant_id	root	shoot	PlantID	petal	sepal
0	P1	-0.151409	-0.958494	P1	0.243587	-0.439967
1	P2	0.795065	-0.057825	P2	0.462670	2.093872
2	P3	0.908842	-0.912545	NaN	NaN	NaN
3	NaN	NaN	NaN	P4	1.306762	-0.571304

	PlantID	root	shoot_x	flower	shoot_y
0	P1	-0.944556	0.099846	0.842026	-3.792710
1	P2	-0.923298	0.724413	1.361027	-0.460408
2	P3	-1.927912	0.257420	NaN	NaN
3	P4	NaN	NaN	-2.093751	1.011422

	strain	root	shoot
0	S1	0.673465	-1.251846
1	S2	-0.097623	-1.879607
2	S3	1.109511	-0.078520
3	S1	-0.353518	-1.515448
4	S3	0.690651	2.782739

	strain	root	shoot	lifespan
0	S1	0.673465	-1.251846	12.0
1	S1	-0.353518	-1.515448	12.0
2	S2	-0.097623	-1.879607	25.0
3	S3	1.109511	-0.078520	NaN
4	S3	0.690651	2.782739	NaN
5	S4	NaN	NaN	43.0

		value
Species	Type
setosa	Petal.Length	1.462
	Petal.Width	0.246
	Sepal.Length	5.006
	Sepal.Width	3.428
versicolor	Petal.Length	4.260
	Petal.Width	1.326
	Sepal.Length	5.936
	Sepal.Width	2.770
virginica	Petal.Length	5.552
	Petal.Width	2.026
	Sepal.Length	6.588
	Sepal.Width	2.974