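Medical Data Analysis with R

This blog covers various kinds of data analysis, centered on an online course on data analysis with R.

067-069 Overview of data cleaning and tidy data

Introduction to Medical Data Analysis in R, Taught by a Physician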

The big picture of data cleaning

Overview

[Figure: overview of data cleaning]

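tidy vs messy data

tidy = neat, well organized
messy = troublesome, hard to work with

What is tidy data?

Each column corresponds to one variable
Each row corresponds to one observation
(Each type of observational unit forms one table)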

Wickham H (2014). "Tidy Data". Journal of Statistical Software, 59(10).

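Common kinds of messy data

Column names are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is split across multiple tables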

[Figure: example of messy data]

For people, this messy form is often easier to read (a weather forecast, for example).
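As a rough illustration of the first kind of messiness (column names that are really values), the sketch below reshapes a small forecast-style table into tidy form. The table and its column names are made up for this example, and it assumes the tidyr package (version 1.0.0 or later, for pivot_longer) is installed.

```r
library(tidyr)

# Hypothetical forecast table: the column names mon/tue/wed are values
# of a "day" variable, not variables themselves
messy <- data.frame(
  city = c("Tokyo", "Osaka"),
  mon = c(12, 14),
  tue = c(15, 16),
  wed = c(11, 13)
)

# Gather the day columns so that each row is one city-day observation
tidy <- pivot_longer(messy,
                     cols = c(mon, tue, wed),
                     names_to = "day",
                     values_to = "temp_c")
tidy
```

The wide table is easier to scan by eye, while the long one has one column per variable (city, day, temp_c), which is the shape ggplot2 and dplyr expect.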

065 Supplement: rotating X-axis labels

Supplement: rotating X-axis labels with theme

This covers a case where individual settings via theme() are needed.

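library(ggplot2)

# uriage_kosu = number of units sold
table <- data.frame(
  item_name = c("Ultimate muskmelon ice cream",
                "Italian-style strawberry shortcake",
                "Luxurious black Mont Blanc with Japanese chestnuts",
                "Smooth pudding made with morning-fresh eggs",
                "Mango and tropical fruit tart",
                "Roll cake with plenty of fruit"),
  uriage_kosu = c(39,42,73,88,93,132)
)

ggplot(table) + 
  geom_bar(aes(item_name, uriage_kosu), stat="identity")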

When the X-axis labels overlap like this, change the setting with theme().

ggplot(table) + 
  geom_bar(aes(item_name, uriage_kosu), stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

However, adding a complete theme such as theme_classic() afterwards overwrites this setting, so the following loses the rotation:

ggplot(table) + 
  geom_bar(aes(item_name, uriage_kosu), stat="identity") + 
  theme(axis.text.x = element_text(angle=45, hjust = 1)) +
  theme_classic()

So swap the order:

ggplot(table) + 
  geom_bar(aes(item_name, uriage_kosu), stat="identity") +
  theme_classic() +
  theme(axis.text.x = element_text(angle=45, hjust = 1))
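As an alternative sketch, the rotation can be set on the axis guide of the scale instead of in the theme, so it does not depend on the order in which themes are added. This assumes ggplot2 3.3.0 or later (for guide_axis); the small sales table here is hypothetical.

```r
library(ggplot2)

# Hypothetical sales table with long item names
sales <- data.frame(
  item_name = c("Ultimate muskmelon ice cream",
                "Italian-style strawberry shortcake",
                "Smooth pudding made with morning-fresh eggs"),
  units_sold = c(39, 42, 88)
)

# The label angle lives in the scale's axis guide, not in the theme,
# so adding theme_classic() afterwards does not reset it
p <- ggplot(sales) +
  geom_bar(aes(item_name, units_sold), stat = "identity") +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  theme_classic()
p
```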

064 Supplement: plotting pre-aggregated data

Supplement: plotting pre-aggregated data with the stat option

Suppose you want to display data that has already been aggregated:

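library(ggplot2)

table <- data.frame(
  age_group = c("~20","21~40","41~60","61~80","81~100","100~"),
  yearly_admission = c(39,42,73,88,93,132)
)

# Default stat = "count": counts rows per group, so every bar has height 1
ggplot(table) + geom_bar(aes(age_group))

# stat = "identity": use yearly_admission itself as the bar height
ggplot(table) + 
  geom_bar(aes(age_group, yearly_admission), 
           stat = "identity")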

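With the option stat = "identity", geom_bar behaves like geom_point and similar geoms: it plots the y values as given. The reverse also works, as the next line shows.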

ggplot(table) + geom_point(aes(age_group), stat="count")

In other words, stat = "identity" manually overrides geom_bar's default setting of stat = "count".
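ggplot2 also provides geom_col(), which draws bars with stat = "identity" by default. The sketch below, using a shortened hypothetical version of the admissions table, checks that both spellings compute the same bar heights.

```r
library(ggplot2)

# Shortened, hypothetical version of the pre-aggregated table
admissions <- data.frame(
  age_group = c("~20", "21~40", "41~60"),
  yearly_admission = c(39, 42, 73)
)

p_bar <- ggplot(admissions) +
  geom_bar(aes(age_group, yearly_admission), stat = "identity")
p_col <- ggplot(admissions) +
  geom_col(aes(age_group, yearly_admission))

# layer_data() returns the values each layer actually draws
same_heights <- identical(layer_data(p_bar)$y, layer_data(p_col)$y)
same_heights
```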

Which means...

diamonds
ggplot(diamonds) + geom_bar(aes(x = clarity))
ggplot(diamonds) + geom_point(aes(x = clarity)) # Error: geom_point needs a y aesthetic
ggplot(diamonds) + geom_point(aes(x = clarity), stat = "count") # Works

So by setting the stat option, a geom_XXX that normally plots a variable against an aggregate can be made to plot a variable against a variable, and vice versa.

061-062 Setting ggplot themes

Theme settings

Simply adding theme_XXX() to a graph is an easy way to set a theme.

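library(ggplot2)

text_label_of_clarity <- c("Included",
                           "Slightly included (SI2)","Slightly included (SI1)",
                           "Very slightly included (VS2)","Very slightly included (VS1)",
                           "Very, very slightly included (VVS2)","Very, very slightly included (VVS1)",
                           "Internally flawless")


# A histogram's y axis is a count, so it is labelled as such
graph <- ggplot(diamonds) + 
  geom_histogram(aes(carat, fill = clarity)) +
  labs(title = "Distribution of weight by clarity", x = "Weight (carat)", y = "Count") +
  scale_fill_discrete(name = "Clarity",
                      labels = text_label_of_clarity) 
graph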

graph + theme_gray()

graph + theme_bw()

graph + theme_linedraw()

graph + theme_light()

graph + theme_dark()

graph + theme_minimal()

graph + theme_classic()
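If the same theme should apply to every plot in a session, one option is theme_set(), which changes the default theme and returns the previous one. This is only a minimal sketch; the histogram is a simplified stand-in for the graph object above.

```r
library(ggplot2)

# theme_set() replaces the session default and returns the old default
old_theme <- theme_set(theme_bw())

# No theme_XXX() is added here, so the plot uses the new default, theme_bw
p <- ggplot(diamonds) +
  geom_histogram(aes(carat, fill = clarity), bins = 30)
p

# Restore the previous default when done
theme_set(old_theme)
```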

?theme

The ggthemes package collects a variety of additional themes.

install.packages("ggthemes")

graph + ggthemes::theme_base()
graph + ggthemes::theme_calc()
graph + ggthemes::theme_economist() # theme resembling The Economist (magazine)
graph + ggthemes::theme_economist_white()
graph + ggthemes::theme_excel() # the help text is harsh: it asks you never to use this
graph + ggthemes::theme_few()
graph + ggthemes::theme_fivethirtyeight()
graph + ggthemes::theme_gdocs()
graph + ggthemes::theme_stata()
graph + ggthemes::theme_wsj() # theme resembling the Wall Street Journal

056-059 Legend settings

Working with legends

library(tidyverse)

graph <- ggplot(diamonds) + geom_histogram(aes(x = price, fill = clarity))
graph

First, hide the legend with guide = "none" (the older guide = FALSE is deprecated in recent ggplot2 versions)

graph + scale_fill_discrete(guide = "none")

Change the legend title

graph
graph + scale_fill_discrete(name = "Clarity")

Change the order in which the entries are displayed

graph
graph + 
  scale_fill_discrete(breaks = c("I1", "IF", "VVS1",
                                 "VVS2", "VS2", "VS1", "SI2", "SI1"))

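Incidentally, I realized I knew nothing about diamond clarity, so I looked it up.
Clarity grades how many tiny inclusions a diamond contains:
- I1: included
- SI2: slightly included
- SI1: slightly included
- VS2: very slightly included
- VS1: very slightly included
- VVS2: very, very slightly included
- VVS1: very, very slightly included
- IF: internally flawless
Let's attach labels.

In the order shown by levels(diamonds$clarity):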

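text_label_of_clarity <- c("Included",
                           "Slightly included (SI2)","Slightly included (SI1)",
                           "Very slightly included (VS2)","Very slightly included (VS1)",
                           "Very, very slightly included (VVS2)","Very, very slightly included (VVS1)",
                           "Internally flawless")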

graph
graph + scale_fill_discrete(labels = text_label_of_clarity)
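Because the labels are matched to the factor levels purely by position, it can help to line them up side by side before plotting. The following is a small sketch of that check; clarity_labels and label_table are names introduced only for this example.

```r
library(ggplot2)

# Factor levels of clarity, in the order the legend will use
clarity_levels <- levels(diamonds$clarity)

# Labels written in the same order as the levels (illustrative wording)
clarity_labels <- c("Included",
                    "Slightly included (SI2)", "Slightly included (SI1)",
                    "Very slightly included (VS2)", "Very slightly included (VS1)",
                    "Very, very slightly included (VVS2)",
                    "Very, very slightly included (VVS1)",
                    "Internally flawless")

# Stop early if a label is missing or extra
stopifnot(length(clarity_labels) == length(clarity_levels))

# Side-by-side view for checking each pairing by eye
label_table <- data.frame(level = clarity_levels, label = clarity_labels)
label_table
```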

Putting everything so far together:

ggplot(diamonds) + 
  geom_histogram(aes(x = price, fill = clarity)) +
  labs(title = "Histogram of price by inclusions", x = "Price", y = "Count") +
  scale_fill_discrete(name = "Clarity", 
                      labels = text_label_of_clarity)

Now we can draw a graph like this!
Exercise: add a title, axis labels, and a clarity legend to the following graph.

ggplot(diamonds) + geom_point(aes(carat, price, color = clarity))

Answer:

ggplot(diamonds) + geom_point(aes(carat, price, color = clarity))

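In the order shown by levels(diamonds$clarity):

text_label_of_clarity <- c("Included",
                           "Slightly included (SI2)","Slightly included (SI1)",
                           "Very slightly included (VS2)","Very slightly included (VS1)",
                           "Very, very slightly included (VVS2)","Very, very slightly included (VVS1)",
                           "Internally flawless")

ggplot(diamonds) + 
  geom_point(aes(carat, price, color = clarity)) +
  labs(title = "Relationship among price, weight, and clarity", x = "Weight (carat)", y = "Price") +
  scale_color_discrete(name = "Clarity",
                       labels = text_label_of_clarity)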

053-055 Specifying titles and labels

ggplot(diamonds) +
  geom_histogram(aes(x = price, fill = color))

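Let's add the following:
Title: "Diamond price distribution"
X axis: "Price (dollars)"
Y axis: "Count"

ggplot(diamonds) + 
  geom_histogram(aes(price, fill=color)) +
  labs(title = "Diamond price distribution",
       x = "Price (dollars)", 
       y = "Count")

Of course, variables can be used too:

title_text <- "Some amazing title"
label_x_axis <- "Amazing label X"
label_y_axis <- "Tremendous label Y"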

yoi_graph <- ggplot(diamonds) + 
  geom_histogram(aes(price, fill=color)) 

yoi_graph + labs(title = title_text, 
                 x = label_x_axis, 
                 y = label_y_axis)

You can also reassign the variables and reuse the same graph object:

title_text <- "A more proper title"
label_x_axis <- "Price"
label_y_axis <- "Count"

yoi_graph + labs(title = title_text, 
                 x = label_x_axis, 
                 y = label_y_axis)

050-052 Specifying colors in ggplot

Let's color the graphs made in the exercises to produce graphs that convey more meaning.

library(ggplot2)

Q1: Draw the relationship between diamond weight and price as a scatter plot.

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price))

It trends upward to the right, but even at the same weight the price seems to vary considerably. Could the cut be a factor?

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, 
                           y = price, 
                           color = cut))

Hmm. What about clarity?

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat,
                           y = price,
                           color = clarity))

This looks meaningful. Let's look at color too.

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat,
                           y = price,
                           color = color))

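This one looks related too.
Note: mapping can also be written inside ggplot(), in which case it carries over to every geom function added after the +. So when drawing many nearly identical graphs for exploratory visualization, as in the examples above, you can also write: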

gg <- ggplot(data = diamonds, mapping = aes(x = carat,
                                            y = price))

gg + geom_point()
gg + geom_point(mapping = aes(color = cut))
gg + geom_point(mapping = aes(color = clarity))

Furthermore, arguments passed without names are matched in the order given in the help file, so data = and mapping = can be omitted:

gg <- ggplot(diamonds, aes(carat, price))
gg + geom_point()
gg + geom_point(aes(color = cut))

This is the shortened form.
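To see that the long and short spellings really describe the same plot, the sketch below builds both and compares the x and y values each one draws. Positional matching works here because data and mapping are the first two arguments of ggplot(); p_named, p_short and same_points are names introduced only for this check.

```r
library(ggplot2)

# Fully named arguments
p_named <- ggplot(data = diamonds,
                  mapping = aes(x = carat, y = price)) +
  geom_point()

# Positional arguments: data first, then mapping; inside aes(), x then y
p_short <- ggplot(diamonds, aes(carat, price)) +
  geom_point()

# Compare the coordinates computed for the point layer of each plot
same_points <- identical(layer_data(p_named)[, c("x", "y")],
                         layer_data(p_short)[, c("x", "y")])
same_points
```

Note that geom functions take mapping as their first argument, which is why geom_point(aes(color = cut)) also works without naming it.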

Q2: Draw the relationship between diamond color and price as a box plot.

library(ggplot2)

ggplot(data =diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price))

Here too, color and price appear to be related in some way. Let's color by other variables.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, 
                             y = price, 
                             color = cut))

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color,
                             y = price,
                             color = clarity))

Using fill instead of color:

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, 
                             y = price, 
                             fill = cut))

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color,
                             y = price,
                             fill = clarity))

You can also write it like this:

gg <- ggplot(diamonds, aes(color, price))

gg + geom_boxplot(aes(color=cut))
gg + geom_boxplot(aes(color=clarity))

gg + geom_boxplot(aes(fill=cut))
gg + geom_boxplot(aes(fill=clarity))

gg + geom_boxplot(aes(fill=clarity, color = clarity))

Q3: Draw the relationship between diamond clarity and color in some form.

gdia <- ggplot(data = diamonds)

gdia + geom_count(aes(clarity, color))
gdia + geom_jitter(aes(clarity, color))

gdia + geom_jitter(aes(clarity, color, color = cut))
gdia + geom_jitter(aes(clarity, color, color = clarity))

gdia + geom_jitter(aes(clarity, color, color = carat))

Q4: Draw the distribution of diamond prices as a histogram.

gdia <- ggplot(data = diamonds)

gdia + geom_histogram(aes(price))

gdia + geom_histogram(aes(price, color = cut)) # ??? color only changes the bar outlines
gdia + geom_histogram(aes(price, fill = cut))
gdia + geom_histogram(aes(price, fill = clarity))

Q5: Draw how many records this dataset contains for each category of cut quality.

gdia + geom_bar(aes(cut))

gdia + geom_bar(aes(cut, fill = clarity))
gdia + geom_bar(aes(cut, fill = price)) # this does not work: price is continuous

gdia + geom_bar(aes(cut, fill = clarity))  # what if we want the bars separated by clarity?

Geoms (geom_XXX) that have a position option let you do the following:

# "dodge" means to step aside: the bars are placed side by side
gdia + geom_bar(mapping = aes(cut, fill=clarity), 
                position = "dodge")
# "stack" piles the bars on top of each other
gdia + geom_bar(aes(cut, fill=clarity),
                position = "stack")

The same applies to the Q4 histogram.

gdia + geom_histogram(aes(price, fill = clarity), 
                      position = "dodge")
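Besides "dodge" and "stack", ggplot2 also accepts position = "fill", which stretches every bar to the same height so the shares of each clarity grade can be compared across cuts. A minimal sketch (p_fill and top_of_bars are names introduced for this example):

```r
library(ggplot2)

# position = "fill" rescales each stacked bar to a total height of 1,
# so the y axis shows proportions instead of counts
p_fill <- ggplot(diamonds) +
  geom_bar(aes(cut, fill = clarity), position = "fill") +
  labs(y = "Proportion")
p_fill

# The top of every stacked bar should sit at 1
top_of_bars <- max(layer_data(p_fill)$ymax)
top_of_bars
```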

Q6: Draw the trend in the number of unemployed people in the United States in some form.

economics
geco <- ggplot(economics) 

geco + geom_line(aes(date, unemploy,color = pop))

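# Fixed (non-mapped) settings go outside aes()
# linewidth replaces size for lines in ggplot2 3.4.0 and later
geco + geom_line(mapping = aes(date, unemploy),
                 color = "red",
                 linewidth = 1.5,
                 linetype = "dashed")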
