Introduction to ggplot2

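ggplot2 is a package for drawing graphs. Its output generally looks cleaner than what R's base graphics functions produce. Graphs are also simple to write: in most cases it is enough to specify the x axis and the y axis.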

ggplot2 installation and overview

Installing ggplot2

The ggplot2 package can be installed on its own or as part of the tidyverse package. Installing it through tidyverse is convenient because dplyr, tidyr, tibble, and other related packages are installed along with it. The following example installs tidyverse.

install.packages('tidyverse')

ggplot2 draws clean graphs by default, and its extension packages can polish them further. ggthemes is a well-known package that provides overall graph styles (themes). RColorBrewer and ggsci provide color patterns (color palettes) for the points and lines in a graph.

install.packages('ggthemes')
install.packages('RColorBrewer')
install.packages('ggsci')

Creating graphs with ggplot2

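Drawing a graph with ggplot2 generally follows these steps.

1. Sketch the graph you want to draw by hand on paper.
2. Create a data frame that contains the data for the x axis and the y axis. If needed, also include the information used to distinguish colors.
3. Draw the graph with ggplot2 functions.
- Prepare the plot with the ggplot function, specifying the x axis, y axis, color, and so on through aes.
- Draw the graph with functions such as geom_point and geom_line.
- Adjust the axes and the visual style with functions such as xlab, ylab, xlim, ylim, and theme.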
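
The steps above can be sketched as follows. The data frame `toy` and its columns (`dose`, `response`, `group`) are hypothetical and exist only for illustration; they are not part of the rice dataset used later.

```r
library(ggplot2)

# Step 2: a small, made-up data frame holding the x values, the y values,
# and a grouping column used for color (hypothetical data).
toy <- data.frame(
  dose     = c(1, 2, 3, 1, 2, 3),
  response = c(2.1, 3.9, 6.2, 1.0, 2.2, 2.9),
  group    = c('a', 'a', 'a', 'b', 'b', 'b')
)

# Step 3: set up the plot with ggplot() and aes(), add geoms,
# then adjust the labels and the theme.
g <- ggplot(toy, aes(x = dose, y = response, color = group)) +
  geom_point() +
  geom_line() +
  xlab('dose') +
  ylab('response') +
  theme_bw()
print(g)
```

Each `+` adds one more component to the plot object, so the graph can be built up and checked one layer at a time.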

An overview of the basic ggplot2 functions is available in the ggplot2 cheatsheet.

Loading ggplot2

Load the ggplot2 package before using it. Loading ggplot2 by itself or loading tidyverse both make it available. Here, the extension themes and color palettes are loaded along with tidyverse.

# library(ggplot2)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggthemes)
library(ggsci)

The geom_point function

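geom_point draws scatter plots. The examples here use the rice.txt dataset. The rice dataset has 7 columns: the individual number of the specimen (replicate), block number (block), root dry mass (root_dry_mass), shoot dry mass (shoot_dry_mass), line treatment (trt), treatment (fert), and line (variety).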

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
head(d)

##   replicate block root_dry_mass shoot_dry_mass trt fert variety
## 1         1     1            56            132 F10  F10      wt
## 2         2     1            66            120 F10  F10      wt
## 3         3     1            40            108 F10  F10      wt
## 4         4     1            43            134 F10  F10      wt
## 5         5     1            55            119 F10  F10      wt
## 6         6     1            66            125 F10  F10      wt

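As a simple scatter plot, the following example puts the root dry mass (root_dry_mass) of each individual on the x axis and the shoot dry mass (shoot_dry_mass) on the y axis. The coordinates are specified with aes inside the ggplot call, by giving the names of the columns to use for the x axis and the y axis.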

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass)) +
      geom_point()
print(g)

The scatter plot above contains two lines, wt and ANU843, and both are drawn as black points. To color the points by line, give the name of the column that identifies the line (variety) to color in aes.

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety)) +
      geom_point()
print(g)

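In the rice data, each line (variety) received 3 treatments (fert). Specifically, the wt line has the F10, NH4Cl, and NH4NO3 treatments, and the ANU843 line has the same three. To color the points by the combination of line and treatment, specify variety:fert for color. Note that the : operator requires both columns to be factors. In R 4.0.0 and later, read.table no longer converts strings to factors by default, so either pass stringsAsFactors = TRUE to read.table or use interaction(variety, fert) instead.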

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point()
print(g)
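
The combination coloring can also be written with interaction(), which accepts character columns as well as factors. The following is a minimal sketch; the data frame `toy` is a hypothetical stand-in for the rice data, with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for the rice data, with character (not factor) columns,
# as read.table() returns by default in R 4.0.0 and later.
toy <- data.frame(
  root_dry_mass  = c(56, 12, 17, 6, 9, 14),
  shoot_dry_mass = c(132, 60, 75, 20, 40, 65),
  variety        = c('wt', 'wt', 'wt', 'ANU843', 'ANU843', 'ANU843'),
  fert           = c('F10', 'NH4Cl', 'NH4NO3', 'F10', 'NH4Cl', 'NH4NO3'),
  stringsAsFactors = FALSE
)

# interaction() builds one factor level per variety/fert combination.
g <- ggplot(toy, aes(x = root_dry_mass, y = shoot_dry_mass,
                     color = interaction(variety, fert, sep = ':'))) +
  geom_point() +
  labs(color = 'variety:fert')
print(g)
```

The labs() call only shortens the legend title, which would otherwise show the whole interaction(...) expression.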

It is also possible to distinguish lines by color and treatments by point shape. The point shape is specified with shape in aes.

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety, shape = fert)) +
      geom_point()
print(g)

Besides color and shape, the size of the points in a scatter plot can also be changed. The following uses the trees dataset to show how to scale the point size according to a quantity.

data(trees)
head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Here, the scatter plot puts the girth (Girth) on the x axis and the height (Height) on the y axis. To make the point size reflect the volume (Volume), specify size in aes as follows.

g <- ggplot(trees, aes(x = Girth, y = Height, size= Volume)) +
        geom_point()
print(g)

Furthermore, when a continuous variable is given to color in aes, the colors are drawn as a gradient according to the magnitude of the values.

g <- ggplot(trees, aes(x = Girth, y = Height, size = Volume, color = Volume)) +
        geom_point()
print(g)
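
The end colors of that gradient can be chosen with scale_color_gradient. The sketch below uses the same trees data; the two color names are arbitrary choices made for illustration.

```r
library(ggplot2)

g <- ggplot(trees, aes(x = Girth, y = Height, size = Volume, color = Volume)) +
  geom_point() +
  # low/high set the colors at the two ends of the gradient (arbitrary picks)
  scale_color_gradient(low = 'skyblue', high = 'darkblue')
print(g)
```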

ggthemes & ggsci

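When neither the colors nor the overall style is specified explicitly, ggplot uses its default theme and color palette. With the default theme, the axes and axis titles are drawn in a near-black gray and the plot background is gray. The default color palette is a set of colors that resemble fluorescent colors.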

ggplot themes

ggplot comes with several built-in themes. For example, use theme_bw to switch to a theme with a white background. Besides theme_bw, themes such as theme_classic, theme_light, and theme_dark are available.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point() +
      theme_bw()
print(g)
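
When every graph in a session should use the same theme, theme_set can change the default once instead of adding theme_bw() to each plot. A minimal sketch, using the built-in trees data so that it runs on its own:

```r
library(ggplot2)

# theme_set() changes the default theme and returns the previous one.
old_theme <- theme_set(theme_bw())

# No theme_*() call is needed here; the plot picks up theme_bw().
g <- ggplot(trees, aes(x = Girth, y = Height)) +
  geom_point()
print(g)

# Restore the previous default when done.
theme_set(old_theme)
```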

ggthemes themes

The ggthemes package provides a wide variety of themes, for example themes in the style of economics magazines and a theme in the style of Excel.

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point() +
      theme_wsj()
print(g)

ggsci

For graphs intended for scientific journals, the color palettes provided by ggsci give a polished result. For example, to use the scale_color_npg color scale from the ggsci package, write the following.

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point() +
      theme_bw() +
      scale_color_npg()
print(g)

The geom_line function

geom_line draws line graphs. Line graphs are well suited to visualizing time series, so this example draws one for the Nile time series. The Nile dataset records the flow of the river Nile from 1871 to 1970.

data(Nile)
d <- data.frame(Nile = as.integer(Nile), year = as.integer(time(Nile)))
head(d)

##   Nile year
## 1 1120 1871
## 2 1160 1872
## 3  963 1873
## 4 1210 1874
## 5 1160 1875
## 6 1160 1876

g <- ggplot(d, aes(x = year, y = Nile)) +
      geom_line()
print(g)

Next, a dataset that records several stock indices from 1991 to 1998 is used to draw multiple lines in one graph, with a different color for each stock index.

data(EuStockMarkets)
df <- data.frame(stock = as.matrix(EuStockMarkets), time = time(EuStockMarkets))
colnames(df) <- gsub('stock.', '', colnames(df))
head(df)

##       DAX    SMI    CAC   FTSE     time
## 1 1628.75 1678.1 1772.8 2443.6 1991.496
## 2 1613.63 1688.5 1750.5 2460.2 1991.500
## 3 1606.51 1678.6 1718.0 2448.2 1991.504
## 4 1621.04 1684.1 1708.1 2470.4 1991.508
## 5 1618.16 1686.6 1723.1 2484.7 1991.512
## 6 1610.61 1671.6 1714.3 2466.8 1991.515

The EuStockMarkets data cannot be used as it is, so its structure is first converted with the dplyr/tidyr packages.

df <- df %>% gather(`DAX`, `SMI`, `CAC`, `FTSE`, key = 'stock', value = 'value')
head(df)

##       time stock   value
## 1 1991.496   DAX 1628.75
## 2 1991.500   DAX 1613.63
## 3 1991.504   DAX 1606.51
## 4 1991.508   DAX 1621.04
## 5 1991.512   DAX 1618.16
## 6 1991.515   DAX 1610.61
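
In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The following sketch shows the same reshaping written that way, assuming a tidyr version that provides pivot_longer. The names `wide` and `long` are introduced here for illustration. Converting the time index with as.numeric is an extra step not in the original code; it drops the ts class, which should also avoid the scale message that ggplot prints for ts objects.

```r
library(tidyr)

wide <- as.data.frame(EuStockMarkets)
# as.numeric() drops the ts class from the time index (optional extra step)
wide$time <- as.numeric(time(EuStockMarkets))

# Counterpart of gather(DAX, SMI, CAC, FTSE, key = 'stock', value = 'value')
long <- pivot_longer(wide, cols = c(DAX, SMI, CAC, FTSE),
                     names_to = 'stock', values_to = 'value')
head(long)
```

The rows come out in a different order than with gather, which does not matter for plotting.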

Add color to aes so that each stock index is drawn in its own color.

g <- ggplot(df, aes(x = time, y = value, color = stock)) +
        geom_line() +
        theme_bw() +
        scale_color_npg()
print(g)

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

Adding linetype to aes sets the line style (solid, dashed, dotted, and so on).

g <- ggplot(df, aes(x = time, y = value, color = stock, linetype = stock)) +
        geom_line() +
        theme_bw() +
        scale_color_npg()
print(g)

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

The geom_bar function

geom_bar draws bar charts. When both x and y data are given, stat = 'identity' must be passed to geom_bar. By default, geom_bar takes a single coordinate and draws the frequency of its values (stat = 'count') as bars, so supplying both x and y raises an error.

Here, the rice data is used to draw a bar chart whose bar heights are the mean root_dry_mass and mean shoot_dry_mass of the wt line. First, the rice data is reshaped and summarized with the dplyr/tidyr packages.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
df <- d %>%
        filter(variety == 'wt') %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

## # A tibble: 2 x 3
##   variety tissue         dry_mass
##   <fct>   <chr>             <dbl>
## 1 wt      root_dry_mass      26.5
## 2 wt      shoot_dry_mass     77.3

Pass the summarized result to ggplot and draw the bar chart with geom_bar.

g <- ggplot(df, aes(x = tissue, y = dry_mass)) +
        geom_bar(stat = 'identity') +
        theme_bw()
print(g)
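
ggplot2 also provides geom_col, which takes the bar heights directly from y and so needs no stat argument. A minimal sketch, with the two wt means copied from the summary table above into a hypothetical data frame `means`:

```r
library(ggplot2)

# Values copied from the summary table above (means of the wt line).
means <- data.frame(
  tissue   = c('root_dry_mass', 'shoot_dry_mass'),
  dry_mass = c(26.5, 77.3)
)

# geom_col() draws bar heights straight from y,
# like geom_bar(stat = 'identity').
g <- ggplot(means, aes(x = tissue, y = dry_mass)) +
  geom_col() +
  theme_bw()
print(g)
```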

Next, the mean root_dry_mass and mean shoot_dry_mass are drawn as bars for each line (wt and ANU843), with the bars filled by tissue (root or shoot).

df <- d %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

## # A tibble: 4 x 3
##   variety tissue         dry_mass
##   <fct>   <chr>             <dbl>
## 1 ANU843  root_dry_mass      9.67
## 2 wt      root_dry_mass     26.5 
## 3 ANU843  shoot_dry_mass    41.8 
## 4 wt      shoot_dry_mass    77.3

To set the fill color, use fill in aes. Like color, fill takes the name of a column in the data.

g <- ggplot(df, aes(x = variety, y = dry_mass, fill = tissue)) +
        geom_bar(stat = 'identity') +
        scale_fill_npg() +
        theme_bw()
print(g)

When there are multiple categories (fill groups), geom_bar draws a stacked bar chart by default (position = "stack"). To place the bars side by side instead, change position to dodge.

g <- ggplot(df, aes(x = variety, y = dry_mass, fill = tissue)) +
        geom_bar(stat = 'identity', position = "dodge") +
        scale_fill_npg() +
        theme_bw()
print(g)

Setting position to fill draws a proportional chart in which each stack sums to 1.

g <- ggplot(df, aes(x = variety, y = dry_mass, fill = tissue)) +
        geom_bar(stat = 'identity', position = "fill") +
        scale_fill_npg() +
        theme_bw()
print(g)

Next, a slightly more complex bar chart is drawn, one that shows the combinations of line and treatment. First, create the data frame for the graph.

df <- d %>%
        group_by(variety, fert) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

## # A tibble: 6 x 4
## # Groups:   variety [2]
##   variety fert   tissue        dry_mass
##   <fct>   <fct>  <chr>            <dbl>
## 1 ANU843  F10    root_dry_mass     6   
## 2 ANU843  NH4Cl  root_dry_mass     9.17
## 3 ANU843  NH4NO3 root_dry_mass    13.8 
## 4 wt      F10    root_dry_mass    49.5 
## 5 wt      NH4Cl  root_dry_mass    12.6 
## 6 wt      NH4NO3 root_dry_mass    17.3

Next, fill the bars by the combination of tissue and fert. To do so, set fill to interaction(tissue, fert).

g <- ggplot(df, aes(x = variety, y = dry_mass, fill = interaction(tissue, fert))) +
        geom_bar(stat = 'identity', position = "dodge") +
        scale_fill_npg() +
        theme_bw()
print(g)

The geom_errorbar function

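ggplot provides geom_errorbar for drawing error bars. Used together with geom_point or geom_bar, it can show the spread of the data (standard deviation) or the spread of the estimate of the population mean (standard error). Here, the mean and standard deviation of the data are used to draw a point plot with error bars and a bar chart with error bars.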

First, compute the mean and standard deviation of root_dry_mass and shoot_dry_mass for the wt line of the rice dataset, and reshape the result.

df.mean <- d %>%
        filter(variety == 'wt') %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'mean')
df.sd <- d %>%
        filter(variety == 'wt') %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = sd(root_dry_mass),
                  shoot_dry_mass = sd(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'sd')

df <- left_join(df.mean, df.sd)

## Joining, by = c("variety", "tissue")

head(df)

## # A tibble: 2 x 4
##   variety tissue          mean    sd
##   <fct>   <chr>          <dbl> <dbl>
## 1 wt      root_dry_mass   26.5  18.4
## 2 wt      shoot_dry_mass  77.3  32.9
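
The same table can also be built in one pipeline by reshaping to long format first and then summarizing, which removes the need for left_join. The sketch below is one possible arrangement, not the method used in this text. The data frame `toy` is a hypothetical stand-in for the rice data with made-up values, and the `se` column (standard error) is an addition for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the rice data (values are made up).
toy <- data.frame(
  variety        = rep(c('wt', 'ANU843'), each = 4),
  root_dry_mass  = c(56, 66, 40, 43, 6, 9, 14, 10),
  shoot_dry_mass = c(132, 120, 108, 134, 20, 40, 65, 42)
)

# Reshape to long format first, then compute every statistic per group
# in a single summarise() call.
summary_df <- toy %>%
  gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass') %>%
  group_by(variety, tissue) %>%
  summarise(mean = mean(dry_mass),
            sd   = sd(dry_mass),
            se   = sd / sqrt(n())) %>%   # standard error from the sd column
  ungroup()
summary_df
```

Using `se` instead of `sd` for ymin and ymax would draw standard-error bars rather than standard-deviation bars.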

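geom_errorbar needs the upper and lower ends of each error bar. Specify them through ymax and ymin in aes as the maximum (mean + standard deviation) and the minimum (mean - standard deviation). The error bars are then overlaid on the point plot.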

g <- ggplot(df, aes(x = tissue, y = mean, ymin = mean - sd, ymax = mean + sd)) +
        geom_point(size = 3) +
        geom_errorbar(width = 0.2) +
        theme_bw()
print(g)

The next example overlays the error bars on a bar chart.

g <- ggplot(df, aes(x = tissue, y = mean, ymin = mean - sd, ymax = mean + sd)) +
        geom_bar(stat = 'identity') +
        geom_errorbar(width = 0.5) +
        theme_bw()
print(g)

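Next, error bars are overlaid on point plots and bar charts that have multiple attributes. The error bars use the mean and standard deviation of the dry mass of each tissue (root and shoot) in each line (wt and ANU843).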

df.mean <- d %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'mean')
df.sd <- d %>%
        group_by(variety) %>% 
        summarise(root_dry_mass = sd(root_dry_mass),
                  shoot_dry_mass = sd(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'sd')

df <- left_join(df.mean, df.sd)

## Joining, by = c("variety", "tissue")

head(df)

## # A tibble: 4 x 4
##   variety tissue          mean    sd
##   <fct>   <chr>          <dbl> <dbl>
## 1 ANU843  root_dry_mass   9.67  5.13
## 2 wt      root_dry_mass  26.5  18.4 
## 3 ANU843  shoot_dry_mass 41.8  30.4 
## 4 wt      shoot_dry_mass 77.3  32.9

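Because there are multiple categories, set color to tissue to distinguish them. ggplot then colors each category automatically. However, every category shares the same x coordinate, so the points and error bars are drawn on top of one another.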

g <- ggplot(df, aes(x = variety, y = mean, ymin = mean - sd, ymax = mean + sd, color = tissue)) +
        geom_point() +
        geom_errorbar() +
        theme_bw() +
        scale_color_npg()
print(g)

To draw root_dry_mass and shoot_dry_mass side by side rather than on top of each other, add position = position_dodge().

g <- ggplot(df, aes(x = variety, y = mean, ymin = mean - sd, ymax = mean + sd, color = tissue)) +
        geom_point(position = position_dodge(0.5), size = 3) +
        geom_errorbar(aes(width = 0), position = position_dodge(0.5)) +
        theme_bw() +
        scale_color_npg()
print(g)

Bar charts are drawn the same way. For the bars, position = 'dodge' may be used in place of position = position_dodge(0.9).

g <- ggplot(df, aes(x = variety, y = mean, ymin = mean - sd, ymax = mean + sd,fill = tissue)) +
        geom_bar(stat = 'identity', position = position_dodge(0.9)) +
        geom_errorbar(aes(width = 0.3), position = position_dodge(0.9)) +
        theme_bw() +
        scale_fill_npg()
print(g)

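A plot of only the mean and standard deviation does not show the actual distribution of the data. It is therefore common to overlay the raw data values on the graph as well. Here, the raw values are reshaped into a new data frame.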

df2 <- d %>%
        select(variety, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df2)

##   variety        tissue dry_mass
## 1      wt root_dry_mass       56
## 2      wt root_dry_mass       66
## 3      wt root_dry_mass       40
## 4      wt root_dry_mass       43
## 5      wt root_dry_mass       55
## 6      wt root_dry_mass       66

Overlay the raw data points on the point plot with error bars. Because two datasets are involved, no dataset is passed to ggplot; each one is passed to the geom function (such as geom_point) that uses it.

g <- ggplot() + theme_bw() + scale_color_npg()

# points
g <- g + geom_point(aes(x = variety, y = dry_mass, color = tissue),
                    position = position_dodge(0.8),
                    alpha = 0.3, size = 3,
                    data = df2)

# points (the average) with errorbars
g <-  g + geom_point(aes(x = variety, y = mean, color = tissue), position = position_dodge(0.8), size = 3, data = df) +
        geom_errorbar(aes(x = variety, group = tissue, ymin = mean - sd, ymax = mean + sd, width = 0),
                      position = position_dodge(0.8), data = df)

print(g)

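Drawn this way, all the root_dry_mass data points of the ANU843 line, for example, share the same x position (ANU843 × root_dry_mass), so points with similar y values pile up on top of one another. This makes it hard to see how many data points there are, so the points are shifted left and right. To do this, use position_jitterdodge instead of position_dodge.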

g <- ggplot() + theme_bw() + scale_color_npg()

# points
g <- g + geom_point(aes(x = variety, y = dry_mass, color = tissue),
                    position = position_jitterdodge(jitter.width = 0.2, dodge.width=0.8),
                    alpha = 0.3, size = 3,
                    data = df2)

# points (the average) with errorbars / use `fill = tissue` to plot mean as black point
g <-  g + geom_point(aes(x = variety, y = mean, fill = tissue), position = position_dodge(0.8), size = 3, data = df) +
        geom_errorbar(aes(x = variety, group = tissue, ymin = mean - sd, ymax = mean + sd, width = 0),
                      position = position_dodge(0.8), data = df)

print(g)

Bar charts can be drawn in the same way.

g <- ggplot() + theme_bw() + scale_fill_npg() + scale_color_npg()

# barplot with errorbars
g <-  g + geom_bar(aes(x = variety, y = mean, fill = tissue), stat = 'identity', position = "dodge", data = df) +
        geom_errorbar(aes(x = variety, group = tissue, ymin = mean - sd, ymax = mean + sd, width = 0.3),
                      position = position_dodge(0.9), data = df)

# points / use `fill = tissue` to plot raw data as black point
g <- g + geom_point(aes(x = variety, y = dry_mass, fill = tissue),
                    position = position_jitterdodge(jitter.width = 0.2, dodge.width=0.9),
                    alpha = 0.3, size = 3,
                    data = df2)

print(g)

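Next, a slightly more complex graph is drawn, one that shows the combinations of line and treatment. First, create the data frames for the graph. They are built almost the same way as before; the difference is that fert is added to group_by alongside variety.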

df.mean <- d %>%
        group_by(variety, fert) %>% 
        summarise(root_dry_mass = mean(root_dry_mass),
                  shoot_dry_mass = mean(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'mean')
df.sd <- d %>%
        group_by(variety, fert) %>% 
        summarise(root_dry_mass = sd(root_dry_mass),
                  shoot_dry_mass = sd(shoot_dry_mass)) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'sd')

df <- left_join(df.mean, df.sd)

## Joining, by = c("variety", "fert", "tissue")

df2 <- d %>%
        select(variety, fert, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')

Next, draw the graph, filling by the combination of tissue and fert. To do so, set fill to interaction(tissue, fert).

g <- ggplot() + theme_bw() + scale_fill_npg() + scale_color_npg()

# barplot with errorbars
g <-  g + geom_bar(aes(x = variety, y = mean, fill = interaction(tissue, fert)),
                   stat = 'identity', position = "dodge", data = df) +
        geom_errorbar(aes(x = variety, group = interaction(tissue, fert), ymin = mean - sd, ymax = mean + sd, width = 0.3),
                      position = position_dodge(0.9), data = df)

# points / use `fill = tissue` to plot raw data as black point
g <- g + geom_point(aes(x = variety, y = dry_mass, fill = interaction(tissue, fert)),
                    position = position_jitterdodge(jitter.width = 0.2, dodge.width=0.9),
                    alpha = 0.3, size = 2,
                    data = df2)
print(g)

As introduced later, functions such as facet_wrap can make the graph somewhat easier to read.

g <- ggplot() +  scale_fill_npg() + scale_color_npg() + facet_wrap(~ fert)

# barplot with errorbars
g <-  g + geom_bar(aes(x = variety, y = mean, fill = tissue),
                   stat = 'identity', position = "dodge", data = df) +
        geom_errorbar(aes(x = variety, group = interaction(tissue, fert), ymin = mean - sd, ymax = mean + sd, width = 0.3),
                      position = position_dodge(0.9), data = df)

# points / use `fill = tissue` to plot raw data as black point
g <- g + geom_point(aes(x = variety, y = dry_mass, fill = tissue),
                    position = position_jitterdodge(jitter.width = 0.2, dodge.width=0.9),
                    alpha = 0.3, size = 2,
                    data = df2)
print(g)

The geom_boxplot function

geom_boxplot draws box plots. This example draws box plots of the root_dry_mass and shoot_dry_mass data of the wt line. First, prepare the data for the graph.

df <- d %>%
        filter(variety == 'wt') %>%
        select(root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

##          tissue dry_mass
## 1 root_dry_mass       56
## 2 root_dry_mass       66
## 3 root_dry_mass       40
## 4 root_dry_mass       43
## 5 root_dry_mass       55
## 6 root_dry_mass       66

In aes, set x to tissue and y to dry_mass.

g <- ggplot(df, aes(x = tissue, y = dry_mass)) +
        geom_boxplot() +
        theme_bw()
print(g)

Next, box plots show the distribution of root_dry_mass and shoot_dry_mass for each of the wt and ANU843 lines, filled by tissue (root or shoot).

df <- d %>%
        select(variety, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

##   variety        tissue dry_mass
## 1      wt root_dry_mass       56
## 2      wt root_dry_mass       66
## 3      wt root_dry_mass       40
## 4      wt root_dry_mass       43
## 5      wt root_dry_mass       55
## 6      wt root_dry_mass       66

g <- ggplot(df, aes(x = variety, y = dry_mass, fill = tissue)) +
        geom_boxplot() +
        theme_bw() +
        scale_fill_npg()
print(g)

To overlay the actual data on box plots, call geom_point after geom_boxplot. Specify position = position_jitterdodge() so that the points do not overlap. First, prepare the dataset for the graph.

df2 <- d %>%
        select(variety, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df2)

##   variety        tissue dry_mass
## 1      wt root_dry_mass       56
## 2      wt root_dry_mass       66
## 3      wt root_dry_mass       40
## 4      wt root_dry_mass       43
## 5      wt root_dry_mass       55
## 6      wt root_dry_mass       66

g <- ggplot() + theme_bw() + scale_color_npg()

# boxplots
g <-  g + geom_boxplot(aes(x = variety, y = dry_mass, color = tissue), data = df)

# points
g <- g + geom_point(aes(x = variety, y = dry_mass, color = tissue),
                    position = position_jitterdodge(jitter.width = 0.2, dodge.width=0.8),
                    alpha = 0.3, size = 3,
                    data = df2)

print(g)

The geom_histogram function

geom_histogram draws histograms. The following examples draw histograms from the rice data.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
df <- d %>% 
        filter(variety == 'wt') %>%
        select(root_dry_mass)
head(df)

##   root_dry_mass
## 1            56
## 2            66
## 3            40
## 4            43
## 5            55
## 6            66

A histogram visualizes the distribution of a single variable, so only x needs to be specified when setting the axes. By default, the number of bins is set to 30.

g <- ggplot(df, aes(x = root_dry_mass)) +
      geom_histogram() +
      theme_bw()
print(g)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The bin width can be changed with the binwidth argument of geom_histogram. For example, to set the bin width to 1, write the following.

g <- ggplot(df, aes(x = root_dry_mass)) +
      geom_histogram(binwidth = 1) +
      theme_bw()
print(g)

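To choose the bin width automatically by Sturges' formula, you need to define a function that computes the bin width yourself. The code below defines one function based on Sturges' formula and another based on Scott's rule, and passes the former to binwidth.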

library(grDevices)

sturges.width <- function(x) {
  ceiling ((max(x)- min (x)) / nclass.Sturges(x))
}

scott.width <- function(x) {
  ceiling ((max(x)- min (x)) / nclass.scott(x))
}


g <- ggplot(df, aes(x = root_dry_mass)) +
      geom_histogram(binwidth = sturges.width) +
      theme_bw()
print(g)
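
To see what sturges.width computes, the calculation can be traced by hand. The sketch below uses the Height column of the built-in trees data as an arbitrary example vector; the variable names are introduced here for illustration.

```r
library(grDevices)

x <- trees$Height

# Sturges' formula: number of classes = ceiling(log2(n) + 1)
n_classes <- nclass.Sturges(x)
by_hand   <- ceiling(log2(length(x)) + 1)

# Bin width as in sturges.width(): the data range divided by the class count,
# rounded up to a whole number.
width <- ceiling((max(x) - min(x)) / n_classes)

c(classes = n_classes, by_hand = by_hand, width = width)
```

Because of the ceiling call, the resulting width is always a whole number, which suits integer-valued data but may be too coarse for data with a small range.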

Passing color to geom_histogram changes the outline color of the bars, and passing fill sets the color that fills them.

g <- ggplot(df, aes(x = root_dry_mass)) +
      geom_histogram(binwidth = 5, color='#00B5E2', fill='#00B5E210') +
      theme_bw()
print(g)

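When there are two categories, for example to look at the distributions of root_dry_mass and shoot_dry_mass in the wt line, specify fill as with the functions above, and ggplot assigns the colors automatically. First, prepare the dataset.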

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
df <- d %>% 
        filter(variety == 'wt') %>%
        select(root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

##          tissue dry_mass
## 1 root_dry_mass       56
## 2 root_dry_mass       66
## 3 root_dry_mass       40
## 4 root_dry_mass       43
## 5 root_dry_mass       55
## 6 root_dry_mass       66

With its default options, geom_histogram draws a stacked histogram as shown next. The default is position = 'stack'.

g <- ggplot(df, aes(x = dry_mass, fill  = tissue)) +
      geom_histogram(binwidth = 5) +
      theme_bw() +
      scale_fill_npg()

print(g)

Changing position to position = 'identity' draws the two categories on top of each other.

g <- ggplot(df, aes(x = dry_mass, fill  = tissue)) +
      geom_histogram(binwidth = 5, position = 'identity', alpha = 0.8) +
      theme_bw() +
      scale_fill_npg()

print(g)

Changing position to position = 'dodge' draws the two categories side by side.

g <- ggplot(df, aes(x = dry_mass, fill  = tissue)) +
      geom_histogram(binwidth = 5, position = 'dodge') +
      theme_bw() +
      scale_fill_npg()

print(g)

Changing position to position = 'fill' stacks the two categories and scales each bin so that its total height is 1.

g <- ggplot(df, aes(x = dry_mass, fill  = tissue)) +
      geom_histogram(binwidth = 5, position = 'fill') +
      theme_bw() +
      scale_fill_npg()

print(g)

## Warning: Removed 4 rows containing missing values (geom_bar).

The facet_wrap and facet_grid functions

To split a graph into several subplots by some variable in the data, use facet_wrap and facet_grid. facet_wrap is used to split the graph by a single variable.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
df <- d %>% 
        select(variety, fert, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

##   variety fert        tissue dry_mass
## 1      wt  F10 root_dry_mass       56
## 2      wt  F10 root_dry_mass       66
## 3      wt  F10 root_dry_mass       40
## 4      wt  F10 root_dry_mass       43
## 5      wt  F10 root_dry_mass       55
## 6      wt  F10 root_dry_mass       66

For example, to make one subplot per line, pass variety to facet_wrap. With ncol = 2 the subplots are arranged in 2 columns, as follows.

g <- ggplot(df, aes(x = dry_mass, fill = fert)) +
      geom_histogram(binwidth = 5) +
      facet_wrap(~ variety, ncol = 2) +
      scale_fill_jco()
print(g)

With nrow = 2 instead, the per-line subplots are arranged in 2 rows, as follows.

g <- ggplot(df, aes(x = dry_mass, fill = fert)) +
      geom_histogram(binwidth = 5) +
      facet_wrap(~ variety, nrow = 2) +
      scale_fill_jco()
print(g)

To draw subplots for the combinations of two variables, use facet_grid. The example here draws subplots for the combinations of line (variety) and treatment (fert). First, prepare the dataset.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
df <- d %>% 
        select(variety, fert, root_dry_mass, shoot_dry_mass) %>%
        gather(`root_dry_mass`, `shoot_dry_mass`, key = 'tissue', value = 'dry_mass')
head(df)

##   variety fert        tissue dry_mass
## 1      wt  F10 root_dry_mass       56
## 2      wt  F10 root_dry_mass       66
## 3      wt  F10 root_dry_mass       40
## 4      wt  F10 root_dry_mass       43
## 5      wt  F10 root_dry_mass       55
## 6      wt  F10 root_dry_mass       66

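Draw the subplots for the combinations of line (variety) and treatment (fert). The variable on the left of ~ defines the rows and the one on the right defines the columns.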

g <- ggplot(df, aes(x = dry_mass)) +
      geom_histogram(binwidth = 5) +
      facet_grid(variety ~ fert)
print(g)

g <- ggplot(df, aes(x = dry_mass)) +
      geom_histogram(binwidth = 5) +
      facet_grid(fert ~ variety)
print(g)
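
When the strip labels alone do not make clear which variable a panel belongs to, the labeller argument can help. The following sketch uses the built-in mtcars data, with cyl and am standing in for variety and fert, so that it runs without rice.txt.

```r
library(ggplot2)

# mtcars ships with R; cyl and am stand in for variety and fert.
g <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  # label_both prints "variable: value" in each strip
  facet_grid(cyl ~ am, labeller = label_both)
print(g)
```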

Adjusting graphs

Graph titles, axis labels, and so on

Load the rice dataset used in this subsection.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
head(d)

##   replicate block root_dry_mass shoot_dry_mass trt fert variety
## 1         1     1            56            132 F10  F10      wt
## 2         2     1            66            120 F10  F10      wt
## 3         3     1            40            108 F10  F10      wt
## 4         4     1            43            134 F10  F10      wt
## 5         5     1            55            119 F10  F10      wt
## 6         6     1            66            125 F10  F10      wt

When a graph is drawn with ggplot, the variable names given to x and y in aes become the x-axis and y-axis labels as they are. To rewrite these labels afterwards, use the x and y arguments of labs. Likewise, when color or fill is used in aes, the variable name given there becomes the legend title. To rewrite the legend title, change the corresponding argument of labs (color or fill). labs can also add a title to the graph.

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point(size = 4) +
      scale_color_npg() +
      labs(x = 'root dry mass [g]',
           y = 'shoot dry mass [g]',
           title = 'Relations between the dry mass of roots and shoots',
           color = 'treatments')
print(g)

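The size, color, and font style (bold, italic, and so on) of the axis text and labels can be changed. In particular, when a graph is saved to a file with large dimensions, the points, lines, and axis labels tend to look small. In that case, the font size can be adjusted with theme as follows.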

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety:fert)) +
      geom_point(size = 4) +
      scale_color_npg() +
      labs(x = 'root dry mass [g]',
           y = 'shoot dry mass [g]',
           title = 'Relations between the dry mass of roots and shoots',
           color = 'treatments') +
      theme(axis.title = element_text(face = 'bold'),
            axis.text.y = element_text(angle = 90, hjust = 0.5),
            plot.title = element_text(face = 'bold', size = 18, color = 'darkgrey'),
            legend.title = element_text(face = 'bold', size = 14),
            legend.text = element_text(face = 'italic', size = 10))
print(g)

The following example adjusts a graph whose legend has both color and shape.

fontsize <- 16

g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass, color = variety, shape = fert)) +
      geom_point(size = 4) +
      scale_color_npg() +
      labs(x = 'root dry mass [g]',
           y = 'shoot dry mass [g]',
           title = 'Relations between the dry mass of roots and shoots',
           color = 'variety',
           shape = 'treatment') +
      theme(axis.title = element_text(face = 'bold', size = fontsize),
            axis.text = element_text(size= fontsize),
            axis.text.y = element_text(angle = 90, hjust = 0.5),
            plot.title = element_text(face = 'bold', size = fontsize, color = 'darkgrey'),
            legend.title = element_text(face = 'bold', size = fontsize- 4),
            legend.text = element_text(face = 'italic', size = fontsize - 4))
print(g)

Axes

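With ggplot, the tick marks and display format of the axes can be adjusted. For example, the vertical axis can be shown in scientific notation or on a logarithmic scale. First, load the dataset used in this subsection.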

data(msleep)
head(msleep)

## # A tibble: 6 x 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle
##   <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
## 1 Chee… Acin… carni Carn… lc                  12.1      NA        NA    
## 2 Owl … Aotus omni  Prim… <NA>                17         1.8      NA    
## 3 Moun… Aplo… herbi Rode… nt                  14.4       2.4      NA    
## 4 Grea… Blar… omni  Sori… lc                  14.9       2.3       0.133
## 5 Cow   Bos   herbi Arti… domesticated         4         0.7       0.667
## 6 Thre… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
## # … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>

First, draw the graph in its default state, without adjusting the axes.

g <- msleep %>% 
      select(name, bodywt, vore) %>%
      drop_na() %>%
      ggplot(aes(x = name, y = bodywt, fill = vore)) +
      geom_bar(stat = 'identity') +
      scale_fill_npg() +
      labs(x = '',
           y = 'body weights [kg]',
           title = 'body weight of mammals',
           fill = 'vore') +
      theme_bw() +
      theme(axis.title = element_text(face = 'bold'),
            axis.text.x = element_text(angle = 90, hjust = 1),
            plot.title = element_text(face = 'bold'),
            legend.title = element_text(face = 'bold'))
print(g)

Next, change the y axis to scientific notation. Use scale_y_continuous and set its labels argument to scales::scientific.

g <- msleep %>% 
      select(name, bodywt, vore) %>%
      drop_na() %>%
      ggplot(aes(x = name, y = bodywt, fill = vore)) +
      geom_bar(stat = 'identity') +
      scale_fill_npg() +
      labs(x = '',
           y = 'body weights [kg]',
           title = 'body weight of mammals',
           fill = 'vore') +
      theme_bw() +
      theme(axis.title = element_text(face = 'bold'),
            axis.text.x = element_text(angle = 90, hjust = 1),
            plot.title = element_text(face = 'bold'),
            legend.title = element_text(face = 'bold')) + 
      scale_y_continuous(labels = scales::scientific)
print(g)

To adjust the tick marks on the y axis, use the breaks argument of scale_y_continuous, giving it the values at which tick marks should actually be drawn.

g <- msleep %>% 
      select(name, bodywt, vore) %>%
      drop_na() %>%
      ggplot(aes(x = name, y = bodywt, fill = vore)) +
      geom_bar(stat = 'identity') +
      scale_fill_npg() +
      labs(x = '',
           y = 'body weights [kg]',
           title = 'body weight of mammals',
           fill = 'vore') +
      theme_bw() +
      theme(axis.title = element_text(face = 'bold'),
            axis.text.x = element_text(angle = 90, hjust = 1),
            plot.title = element_text(face = 'bold'),
            legend.title = element_text(face = 'bold')) + 
      scale_y_continuous(labels = scales::scientific,
                         breaks = c(1, 10, 100, 1000))
print(g)

The y axis can also be drawn on a logarithmic scale. In this case, set the trans argument of scale_y_continuous to log10. Besides log10, values such as log2 and sqrt can be specified.

g <- msleep %>% 
      select(name, bodywt, vore) %>%
      drop_na() %>%
      ggplot(aes(x = name, y = bodywt, fill = vore)) +
      geom_bar(stat = 'identity') +
      scale_fill_npg() +
      labs(x = '',
           y = 'body weights [kg]',
           title = 'body weight of mammals',
           fill = 'vore') +
      theme_bw() +
      theme(axis.title = element_text(face = 'bold'),
            axis.text.x = element_text(angle = 90, hjust = 1),
            plot.title = element_text(face = 'bold'),
            legend.title = element_text(face = 'bold')) +
      scale_y_continuous(trans = 'log10')
print(g)
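
ggplot2 also offers scale_y_log10 as a shorthand for a log10-transformed y scale. The sketch below applies it to a scatter plot of the same msleep data rather than to the bar chart above; the choice of sleep_total for the x axis is only for illustration.

```r
library(ggplot2)

# scale_y_log10() is shorthand for a log10-transformed continuous y scale.
g <- ggplot(msleep, aes(x = sleep_total, y = bodywt, color = vore)) +
  geom_point() +
  scale_y_log10() +
  labs(x = 'total sleep [h]', y = 'body weight [kg]')
print(g)
```

Points are often easier to read than bars on a logarithmic axis, because a bar's baseline at 0 has no position on a log scale.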

Saving graphs

Use ggsave to save a graph. The width and height of the graph can be specified in inches, and the resolution (DPI) can be specified as well.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass)) +
      geom_point()

# default size (taken from the current graphics device; 7 x 5 inches here)
ggsave("figure-1.pdf", g)

# 10 x 8 inches
ggsave("figure-1.pdf", g, width = 10, height = 8)

# 10 x 8 inches at 300 dpi (dpi applies to raster formats such as PNG)
ggsave("figure-1.png", g, width = 10, height = 8, dpi = 300)

Graphs can also be saved with R's standard functions for saving images, as follows. Suppose graph g has been created with ggplot. To save it, wrap print(g) between calls to png and dev.off. plot(g) can be used in place of print(g), but when plot(g) is used inside a for loop, the graph sometimes fails to be saved.

d <- read.table('data/rice.txt', header = TRUE, sep = '\t')
g <- ggplot(d, aes(x = root_dry_mass, y = shoot_dry_mass)) +
      geom_point()

png('figure-1.png')
print(g)
dev.off()
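
Inside a for loop a ggplot object is not printed automatically, so each graph has to be printed or saved explicitly. The following is a minimal sketch that saves one graph per column with ggsave. It uses the built-in trees data, and the output directory and file names are hypothetical choices for illustration.

```r
library(ggplot2)

out_dir <- tempdir()   # write to a temporary directory for this sketch
columns <- c('Girth', 'Height')
saved   <- character(0)

for (col in columns) {
  # .data[[col]] looks up the column whose name is stored in col
  g <- ggplot(trees, aes(x = .data[[col]], y = Volume)) +
    geom_point() +
    labs(x = col)
  path <- file.path(out_dir, paste0('trees-', col, '.pdf'))
  # ggsave() takes the plot object directly, so no print() call is needed
  ggsave(path, plot = g, width = 10, height = 8)
  saved <- c(saved, path)
}
saved
```

When png and dev.off are used instead of ggsave, the loop body needs an explicit print(g) between them.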

References

Many books that introduce useful features of R are published as HTML on the web. If you want to learn more about tidyverse and ggplot2, or to carry out data analysis more efficiently, please consult the following web resources.