pandasでRDBの読み書きをする

概要

postgresqlのデータをpandasで分析したかったので、pandas.io.sqlを使って読み込んでみる。

あと、書き込みの方もやってみる。こちらはsqliteで。

@CretedDate 2014/09/20
@Versions Python2.7.6 pandas0.14.1-2

参考資料

mysqlやsqliteについてはstackoverflowの下記記事参照

http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql

尚、上記で使っているframe_queryはdeprecatedになっていて、現在ではread_queryを使うらしい。

postgresqlとの接続

psycopg2を使う

pip install psycopg2

SELECTする際の適当なサンプルコード

import psycopg2 as pg
with pg.connect(database='db_name',
                user='user_name',
                host='host_name',
                port=5432) as conn, conn.cursor() as cur:
  cur.execute( 'SELECT * FROM foo' )
  print( cur.fetchall() )

pandasに食べさせる

pandas.io.sqlを使えば良いらしい。

import pandas.io.sql as psql
import psycopg2 as pg
with pg.connect(database='db_name',
                user='user_name',
                host='host_name',
                port=5432) as conn:
  sql = 'SELECT * FROM foo'
  df = psql.read_sql(sql, conn)

これだけでカラム名もちゃんと認識したdataframeを作ってくれる。便利。

dataframeをsqliteに書き込む

# 登録するデータを作る
import pandas as pd
df = pd.DataFrame( [ [1, 2, 3], [4, 5, 6] ], columns=[ 'a', 'b', 'c' ] )

# 登録実行（CREATEも行われるよ）
import sqlite3
import pandas.io.sql as psql
with sqlite3.connect("tmp.db") as conn:
  # これでINSERTされる
  psql.to_sql( df, 'foo', conn )

# 中身確認
with sqlite3.connect("tmp.db") as conn:
  cur = conn.execute("select * from foo")
  print( cur.fetchall() )
  cur.close()

実行結果

[(0, 1, 2, 3), (1, 4, 5, 6)]

先頭にindexが付いた状態で登録される。indexはいらないという場合は、index=False を設定する。

import pandas.io.sql as psql
with sqlite3.connect("tmp.db") as conn:
  # DROPしておく
  conn.execute( 'DROP TABLE foo' )
  # ここでindex=Falseを設定
  psql.to_sql( df, 'foo', conn, index=False )
  cur = conn.execute("select * from foo")
  print( cur.fetchall() )

これでindexが消える。普通にデータを出し入れする場合はこちらの方が良いかも。

実行結果

[(1, 2, 3), (4, 5, 6)]

あとは既にテーブルができている場合は、if_exists='replace' で置き換えたり

import pandas as pd
df = pd.DataFrame( [ [7, 8, 9], [10, 11, 12] ], columns=[ 'a', 'b', 'c' ] )

import pandas.io.sql as psql
with sqlite3.connect("tmp.db") as conn:
  psql.to_sql( df, 'foo', conn, index=False, if_exists='replace' )
  cur = conn.execute("select * from foo")
  print( cur.fetchall() )

実行結果

[(7, 8, 9), (10, 11, 12)]