Trying Out the Data Visualization Library Altair (Cross-Tabulation Edition) | ハムレットエンジニアのカンニングノート

This post tries out the data visualization library Altair on cross-tabulated data.


Installing the modules

python

!pip3 install altair
!pip3 install altair_saver

Creating the data

First, create some demo data to visualize.

python

import numpy as np
import pandas as pd

np.random.seed(1) # fix the random seed

n = 300 # number of students
s = np.random.normal(55,10,n) # underlying ability (score) of each student
c = np.random.randint(0,3,n) # class index
s = s * (1 + c * 0.015) # add an ability gap between classes
g = np.random.randint(0,2,n) # sex index

# generate the score data
s1 = np.random.uniform(0.75,1.1,n) * s * (1 + g * 0.02)
s2 = np.random.uniform(0.9,1.1,n) * s * (1 - g * 0.05)
s3 = np.random.uniform(0.9,1.05,n) * s * (1 + g * 0.03)
s4 = np.random.uniform(0.9,1.2,n) * s * (1 - g * 0.02)
s5 = np.random.uniform(0.8,1.1,n) * s * (1 + g * 0.01)

sex = ['男','女'] # sex (male, female)
cl = ['普通','理数','特進'] # class (regular, science/math, advanced)
sub = ['国語','数学','理科','社会','英語'] # subjects (Japanese, math, science, social studies, English)

df = pd.DataFrame()
df['学生番号'] = list(map(lambda x: 'ID'+str(x).zfill(3), range(1,1+n)))
df['国語'] = list(map(lambda x: round(x), s1))
df['数学'] = list(map(lambda x: round(x), s2))
df['理科'] = list(map(lambda x: round(x), s3))
df['社会'] = list(map(lambda x: round(x), s4))
df['英語'] = list(map(lambda x: round(x), s5))
df['合計'] = df['国語'] + df['数学'] + df['社会'] + df['理科'] + df['英語']
df['クラス'] = list(map(lambda x: cl[x], c))
df['性別'] = list(map(lambda x: sex[x], g))
display(df.head(5))

	学生番号	国語	数学	理科	社会	英語	合計	クラス	性別
0	ID001	65	68	68	72	76	349	普通	男
1	ID002	48	52	49	56	47	252	普通	男
2	ID003	52	45	50	49	45	241	普通	女
3	ID004	48	39	46	45	39	217	普通	女
4	ID005	52	62	71	68	63	316	特進	女

python

# convert to tidy (long-form) data
mdf = pd.melt(df.drop('合計',axis=1),id_vars=['学生番号','性別','クラス'],var_name="科目",value_name="得点" )
display(mdf) # the melted dataframe
display(mdf[mdf['学生番号']=='ID001']) # rows of the melted dataframe for student ID001

	学生番号	性別	クラス	科目	得点
0	ID001	男	普通	国語	65
1	ID002	男	普通	国語	48
2	ID003	女	普通	国語	52
3	ID004	女	普通	国語	48
4	ID005	女	特進	国語	52
...	...	...	...	...	...
1495	ID296	男	特進	英語	50
1496	ID297	男	理数	英語	65
1497	ID298	女	普通	英語	69
1498	ID299	男	特進	英語	44
1499	ID300	男	特進	英語	52

1500 rows × 5 columns

	学生番号	性別	クラス	科目	得点
0	ID001	男	普通	国語	65
300	ID001	男	普通	数学	68
600	ID001	男	普通	理科	68
900	ID001	男	普通	社会	72
1200	ID001	男	普通	英語	76
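
Since the post is about cross-tabulated data, it may help to see how the long-form table relates to a wide or cross-tabulated one. The following is a minimal sketch, assuming a small hand-made frame `mini` (hypothetical, with made-up rows in the same column layout as `mdf`); it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the long-form table (rows made up for illustration)
mini = pd.DataFrame({
    '学生番号': ['ID001', 'ID001', 'ID002', 'ID002'],
    '性別': ['男', '男', '女', '女'],
    '科目': ['国語', '数学', '国語', '数学'],
    '得点': [65, 68, 52, 45],
})

# Long -> wide: one row per student, one column per subject (the inverse of melt)
wide = mini.pivot(index='学生番号', columns='科目', values='得点')

# Cross-tabulation: mean score for each 性別 x 科目 cell
cross = mini.pivot_table(index='性別', columns='科目', values='得点', aggfunc='mean')

print(wide)
print(cross)
```

Altair generally works most naturally with the long-form layout, which is why the melted `mdf` is prepared here even though the charts in this post read from `df`.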

Visualizing the demo data

Scatter plot

python

import altair as alt
from altair_saver import save

scatter = alt.Chart(df).mark_circle(
        size=30
        ).encode(
        x=alt.X('国語',
            scale=alt.Scale(
                domain=[0,100]
                ),
            axis=alt.Axis(
                labelFontSize=15, 
                ticks=True, 
                titleFontSize=18, 
                title='国語の得点')
            ),
        y=alt.Y('数学',
            scale=alt.Scale(
                domain=[0, 100]
                ),
            axis=alt.Axis(labelFontSize=15, 
                ticks=True, 
                titleFontSize=18, 
                title='数学の得点')
            ),
        column=alt.Column('クラス',
            header=alt.Header(
                labelFontSize=15, 
                titleFontSize=18), 
            sort = alt.Sort(
                cl
                ), 
            title='クラス'
            ),
        color=alt.Color('性別', 
            scale=alt.Scale(
                domain=sex,
                range=['blue', 'red']
                ),
            ),
        tooltip=['国語', '数学'],
    ).properties(
        width=300,
        height=300,
        title="国語と数学の得点分布"
    ).interactive()

# render
display(scatter)
# save
# save(scatter,'qiita1.html',embed_options={'actions':True})

!(/image/Altair_scatter01.png)

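Scatter plot (displaying summary statistics)

Altair can also plot summary statistics such as the mean and the standard deviation.

Incidentally, a scatter plot can be created with either mark_circle() or mark_point().

The main summary statistics can be computed with aggregation functions. Typical ones include mean(), median(), sum(), min(), max(), stderr() (standard error), and stdev() (standard deviation).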
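
To make these aggregates concrete, here is a rough sketch using only the Python standard library. It reuses the five 合計 values shown in the `df.head(5)` output above; the variable names are my own, and it assumes (as I understand the Vega-Lite documentation) that `stdev` is the sample standard deviation and `stderr` is that value divided by the square root of the count.

```python
import math
import statistics

# The five 合計 (total) values from the df.head(5) output
totals = [349, 252, 241, 217, 316]

mean = statistics.mean(totals)       # corresponds to mean(合計)
median = statistics.median(totals)   # corresponds to median(合計)
total = sum(totals)                  # corresponds to sum(合計)
lowest, highest = min(totals), max(totals)

# Sample standard deviation (divides by n - 1), assumed to match stdev(合計)
stdev = statistics.stdev(totals)

# Standard error of the mean, assumed to match stderr(合計)
stderr = stdev / math.sqrt(len(totals))

print(mean, median, total, lowest, highest, stdev, stderr)
```

In the chart, Altair computes such values once per group defined by the other encodings, so `mean(合計)` combined with a `性別` color yields one point per sex.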

python

import altair as alt
from altair_saver import save

scatter = alt.Chart(df).mark_point(
        filled=True, 
        size=200,
        opacity=0.7
    ).encode(
        x=alt.X(
            'mean(合計):Q',
            scale=alt.Scale(
                domain=[0,500]
                ),
            axis=alt.Axis(
                title='合計得点の平均'
                )
            ),
        y=alt.Y(
            'stdev(合計):Q',
            scale=alt.Scale(
                domain=[0,100]
                ),
            axis=alt.Axis(
                title='合計得点の標準偏差'
                )
            ),
        color=alt.Color('性別', 
            scale=alt.Scale(
                domain=sex,
                range=['blue', 'red']
                )
            )
    )
# render
display(scatter)
# save
# save(scatter,'qiita2.html',embed_options={'actions':True})

!(/image/Altair_scatter02.png)

Histogram

A histogram is useful for examining the distribution of scores.

python

import altair as alt
from altair_saver import save

histgram = alt.Chart(df).mark_bar(opacity=0.5).encode(
    x=alt.X("合計", 
        bin=alt.Bin(
            step=10,
            extent=[0,500]
            ),
        axis=alt.Axis(
            labelFontSize=15, 
            ticks=True, 
            titleFontSize=18, 
            title='得点の分布'
            )
        ),
    y=alt.Y('count(合計)',
        axis=alt.Axis(
            labelFontSize=15, 
            ticks=True, 
            titleFontSize=18,
            title='人数'
            ),
        stack=None
        ),
    color=alt.Color('性別', 
            scale=alt.Scale(
                domain=sex,
                range=['blue','red']
                ),
            ),
    ).properties(
    width=600,
    height=500
    ).interactive()

# render
display(histgram)
# save
# save(histgram,'qiita3.html',embed_options={'actions':True})

!(/image/Altair_histgram.png)
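
The `alt.Bin(step=10, extent=[0,500])` setting above groups totals into intervals of width 10. As a rough illustration of what that binning plus `count()` amounts to, here is a standard-library sketch; the helper `bin_start` and the sample values are hypothetical, and it assumes half-open bins of the form [start, start + 10).

```python
from collections import Counter

STEP = 10  # bin width, matching step=10 in the chart

# Hypothetical sample of 合計 values (made up for illustration)
totals = [349, 252, 241, 217, 316, 250]

def bin_start(value, step=STEP):
    """Return the lower edge of the bin that contains value."""
    return (value // step) * step

# Count how many totals fall into each bin
counts = Counter(bin_start(t) for t in totals)

for start in sorted(counts):
    print(f"[{start}, {start + STEP}): {counts[start]}")
```

Because the chart also encodes `性別` as color with `stack=None`, the two sexes are counted separately and drawn as overlapping semi-transparent bars rather than stacked ones.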

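Summary

This time, I implemented three representative chart types for cross-tabulated data.

Next, I plan to do the same for time-series data, and later to try the visualizations on other datasets.

Reference site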

【Python】データ可視化ライブラリ Altair を使いこなす
