Study Anything ๐Ÿง

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ - PySpark์—์„œ SQL ์‚ฌ์šฉํ•˜๊ธฐ ๋ณธ๋ฌธ

ํ”„๋กœ์ ํŠธ/[DA] ๋ฐ์ดํ„ฐ ๋ถ„์„ : ๋ฐฐ๊ตฌ

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ - PySpark์—์„œ SQL ์‚ฌ์šฉํ•˜๊ธฐ

์†” 2022. 3. 12. 22:48

์ €๋ฒˆ ํฌ์ŠคํŠธ์—์„œ ์„ค๋ช…ํ–ˆ๋“ฏ์ด ์ด๋ฒˆ ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ๋ฐฐ๊ตฌ ์„ ์ˆ˜๋“ค์˜ ๋Šฅ๋ ฅ์น˜๋ฅผ ํ†ตํ•ด ํŒ€ ๋ณ„ ๊ฒฝ๊ธฐ์—์„œ ์–ด๋–ค ํŒ€์ด ์Šน๋ฆฌํ• ์ง€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค ๊ฒƒ์ด๋‹ค.

๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์„ ์ˆ˜ ๋Šฅ๋ ฅ์น˜๋ฅผ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ๋Š” ์ง€ํ‘œ๊ฐ€ ํ•„์š”ํ•œ๋ฐ, ์ด์ „์— ๋ชจ์•˜๋˜ ๋‹จ์ˆœ ๋ฐ์ดํ„ฐ๋กœ๋Š” ํ•œ ๋ˆˆ์— ํŒŒ์•…ํ•˜๊ธฐ ์–ด๋ ต๋‹ค. 

 

ํ˜„์žฌ ๋ฐ์ดํ„ฐ์˜ ์ปฌ๋Ÿผ์—๋Š” ์ด๋ฆ„, ์†Œ์† ํŒ€, ํฌ์ง€์…˜, ์ด ์ ์ˆ˜ ์™ธ์—๋„ ๊ณต๊ฒฉ ์‹œ๋„, ๊ณต๊ฒฉ ์„ฑ๊ณต, ๋ฆฌ์‹œ๋ธŒ ์‹œ๋„, ๋ฆฌ์‹œ๋ธŒ ์ •ํ™•, ๋ฆฌ์‹œ๋ธŒ ์‹คํŒจ, ์„œ๋ธŒ ์„ฑ๊ณต, ์„ธํŠธ ์„ฑ๊ณต, ๋ธ”๋กœํ‚น ์„ฑ๊ณต, ๋””๊ทธ ์„ฑ๊ณต ์ด ์žˆ๋‹ค.

์ด ์ค‘ ๊ณต๊ฒฉ ์ง€ํ‘œ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ณต๊ฒฉ ์„ฑ๊ณต๋ฅ ๊ณผ ์ˆ˜๋น„ ์ง€ํ‘œ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ฆฌ์‹œ๋ธŒ ํšจ์œจ์„ ๊ตฌํ•˜๊ณ , ์ฐธ์—ฌํ•œ ์„ธํŠธ ์ˆ˜๋ฅผ ํ†ตํ•ด ์„ธํŠธ๋‹น ์„œ๋ธŒ, ์„ธํŠธ, ๋ธ”๋กœํ‚น, ๋””๊ทธ ๊ฐœ์ˆ˜๋ฅผ ๊ตฌํ•ด์„œ ์ƒˆ๋กญ๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ค ๊ฒƒ์ด๋‹ค.

 

์œ„์˜ ๋‚ด์šฉ์œผ๋กœ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ฅผ ํ•ด ๋ณผ ํ…๋ฐ, ์–ด๋–ค ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ• ๊นŒ ๊ณ ๋ฏผํ•˜๋‹ค๊ฐ€ ํŒŒ์ด์ŠคํŒŒํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋‹ˆ ์ŠคํŒŒํฌ์—์„œ ์ œ๊ณตํ•˜๋Š” SQL์„ ์‚ฌ์šฉํ•ด๋ณด๊ธฐ๋กœ ํ–ˆ๋‹ค.

 

 

 

๋จผ์ €, ์ŠคํŒŒํฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋“ค์„ import ํ•˜๊ณ  ์„ธ์…˜์„ ์ดˆ๊ธฐํ™”ํ•œ๋‹ค.

# spark ์ƒ์„ฑ

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ํฌ๋กค๋งํ•ด์„œ ๋งŒ๋“  ์„ ์ˆ˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‘ก์— ์ €์žฅํ•˜๊ณ  ๋ถˆ๋Ÿฌ์˜จ๋‹ค.

csv ํŒŒ์ผ์„ ํ•˜๋‘ก์— ์ €์žฅํ•˜๊ณ  ์ฅฌํ”ผํ„ฐ ๋…ธํŠธ๋ถ์„ ํ†ตํ•ด ํŒŒ์ด์ŠคํŒŒํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋‚ด์šฉ์€ ์ด์ „์— ํฌ์ŠคํŠธ๋กœ ์ž‘์„ฑํ•ด๋‘์—ˆ๋‹ค.

data = spark.read.csv("hdfs://localhost:9000/data/player.csv", header="true", inferSchema="true")

๋ถˆ๋Ÿฌ์˜จ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•ด๋ณด์•˜๋‹ค. show() ๋ฉ”์†Œ๋“œ๋ฅผ ์†์„ฑ๊ฐ’ ์ง€์ • ์—†์ด ์‚ฌ์šฉํ•˜๋ฉด ์ƒ์œ„ 20์ค„์˜ ๋‚ด์šฉ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

์„ ์ˆ˜ ๋ฐ์ดํ„ฐ

๋ฐ์ดํ„ฐ๋ฅผ ํ…Œ์ด๋ธ”๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์„ ์–ธํ•ด์ค€๋‹ค.

data.createOrReplaceTempView("player")

 

 

๊ธฐ์กด ๋ฐ์ดํ„ฐ ๋‚ด์šฉ์„ ํ† ๋Œ€๋กœ ๊ณต๊ฒฉ ์„ฑ๊ณต๋ฅ , ๋ฆฌ์‹œ๋ธŒ ํšจ์œจ, ์„ธํŠธ๋‹น ์„œ๋ธŒ/์„ธํŠธ/๋ธ”๋กœํ‚น/๋””๊ทธ ๋ฅผ ๊ณ„์‚ฐํ•ด์„œ ๊ตฌํ•˜๊ณ  ํ™•์ธํ•ด๋ณธ๋‹ค.

# ๊ณต์„ฑ, ๊ณตํšจ(๋ณด๋ฅ˜), ๋ฆฌํšจ, ์„ธํŠธ๋‹น ์„œ๋ธŒ/์„ธํŠธ/๋ธ”๋กœํ‚น/๋””๊ทธ ๊ณ„์‚ฐ

player_rate = spark.sql("""
      select name,team,pos,score,error,setcount,round(at_succ/at_try,2) as at_srate,
          round((rs_corr-rs_fail)/rs_try,2) as rs_rate, round(sv_succ/setcount,2) as sv, 
          round(set_succ/setcount,2) as set,round(bl_succ/setcount,2) as bl,
          round(dg_succ/setcount,2) as dg
      from player
      """)

player_rate.show()

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐ€๊ณตํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ…Œ์ด๋ธ”๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์„ ์–ธํ•ด์ฃผ์—ˆ๋‹ค.

player_rate.createOrReplaceTempView("player_rate")

SQL์˜ WHERE ์ ˆ์„ ์ด์šฉํ•ด ๊ฐ ๊ตฌ๋‹จ๋ณ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ ์ฃผ์—ˆ๋‹ค.

hgun_rate = spark.sql("""
          select *
          from player_rate
          where team like 'ํ˜„๋Œ€๊ฑด์„ค'
          """)
hgun_rate.show(5)

 

 

๊ฐ€๊ณตํ•œ ์„ ์ˆ˜ ๋ฐ์ดํ„ฐ์™€ ๊ตฌ๋‹จ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‘ก์— csv ํŒŒ์ผ๋กœ ์ €์žฅํ•˜๊ณ  ํ™•์ธํ•ด๋ณธ๋‹ค.

# ์„ ์ˆ˜ ๋ฐ์ดํ„ฐ ์ €์žฅ - ํ•˜๋‘ก

player_rate.write.format('csv').save('hdfs://localhost:9000/data/player_r.csv',header='true')

data = spark.read.csv("hdfs://localhost:9000/data/player_r.csv", header='true', inferSchema='true')
data.show(5)

# ๊ตฌ๋‹จ๋ณ„ ๋ฐ์ดํ„ฐ ์ €์žฅ - ํ•˜๋‘ก

hgun_rate.write.format('csv').save('hdfs://localhost:9000/data/hgun_r.csv',header='true')
dogong_rate.write.format('csv').save('hdfs://localhost:9000/data/dogong_r.csv',header='true')
gscal_rate.write.format('csv').save('hdfs://localhost:9000/data/gscal_r.csv',header='true')
insam_rate.write.format('csv').save('hdfs://localhost:9000/data/insam_r.csv',header='true')
ibk_rate.write.format('csv').save('hdfs://localhost:9000/data/ibk_r.csv',header='true')
hguk_rate.write.format('csv').save('hdfs://localhost:9000/data/hguk_r.csv',header='true')
pepper_rate.write.format('csv').save('hdfs://localhost:9000/data/pepper_r.csv',header='true')
# ๋ฐ์ดํ„ฐ ์ž˜ ์ €์žฅ๋๋Š”์ง€ ํ™•์ธ

data = spark.read.csv("hdfs://localhost:9000/data/pepper_r.csv", header='true', inferSchema="true")
data.show(5)

 

๋ฐ์ดํ„ฐ๊ฐ€ ์ž˜ ์ €์žฅ๋˜์—ˆ๋Š”์ง€ ํ•˜๋‘ก์—์„œ ํ™•์ธํ•ด๋ณด์•˜๋‹ค.

 

 

 

์ด๋ ‡๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•ด์„œ ํ•˜๋‘ก์— ์ €์žฅํ•˜๋Š” ๊ฒƒ๊นŒ์ง€ ์™„๋ฃŒํ–ˆ๊ณ , ์•ž์œผ๋กœ๋Š” ์ด ๋ฐ์ดํ„ฐ๋“ค์„ ํ†ตํ•ด ๊ฒฝ๊ธฐ ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด ๋ณผ ์˜ˆ์ •์ด๋‹ค.

 

 

728x90
Comments