[Day38 ~ Day43] Section 2 Project

2022. 3. 25. 21:11ㆍAI/Codestates

https://github.com/JooJaeHwan/Codestates-Project/tree/main/Section_2

Crawling batter data

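import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Crawling
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.implicitly_wait(3)
for i in range(89):

  # pa is the offset passed to the site: 0, 100, 200, ...
  url = 'http://www.statiz.co.kr/stat.php?mid=stat&re=0&ys=1982&ye=2021&sn=100&pa={}'.format(i * 100)

  driver.get(url)
  driver.implicitly_wait(5)

  # Save the stats table as an HTML string
  # (find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH)
  html = driver.find_element(By.XPATH, '//*[@id="mytable"]/tbody').get_attribute("innerHTML")
  soup = BeautifulSoup(html, 'html.parser')  # Parse the string into a BeautifulSoup object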

  temp = [tr.text.strip() for tr in soup.find_all("tr")]  # Keep only the text of each tr tag
  temp = pd.Series(temp)  # Convert the list to a Series

  # Drop the rows that start with '순' or 'W',
  # so that only per-player records remain, then reset the index
  temp = temp[~temp.str.match("[순W]")].reset_index(drop=True)

  temp = temp.apply(lambda x: pd.Series(x.split(' ')))  # Split on single spaces to build a DataFrame

  # The player/team field is separated from the first stat by one space, and the later stats by two spaces,
  # so splitting on a single space leaves empty columns; drop them
  temp = temp.replace('', np.nan).dropna(axis=1)

  # There are two WAR columns (the column at index 1 and the last column);
  # drop the one at index 1
  temp = temp.drop(1, axis=1)

  # Strip the number in front of the player name (regex=True is required in pandas 2.0+)
  temp[0] = temp[0].str.replace(r"^\d+", '', regex=True)

  # Collect the tags that carry each player's birthday
  birth = [tr.find("a") for tr in soup.find_all('tr') if 'birth' in tr.find('a').attrs['href']]

  # Extract just the date (YYYY-MM-DD) from each tag's href
  p = re.compile(r"\d{4}-\d{2}-\d{2}")
  birth = [p.findall(a.attrs['href'])[0] for a in birth]

  # Add the birthday ('생일') column
  temp['생일'] = birth

  # Append each page's finished DataFrame to result
  # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
  if i == 0:
    result = temp
  else:
    result = pd.concat([result, temp], ignore_index=True)

  print(i, "done")

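# Build the column names from the header row
# ('선수' = player, '타율' = batting average, '출루' = on-base percentage, '장타' = slugging percentage, '생일' = birthday)
columns = ['선수'] + [th.text for th in soup.find_all("tr")[0].find_all("th")][4:-3] + ['타율', '출루', '장타', 'OPS', 'wOBA',
                                                                                  'wRC+', 'WAR+', '생일']

# Apply the column names
result.columns = columns

# Shut down the webdriver (quit ends the whole session, not just the window)
driver.quit()

print("All done")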
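
The per-row cleanup in the loop above can be tried offline on a single row. The following is a minimal sketch, not the crawler itself: it uses only the standard library, filters empty fields per row where the crawler uses `dropna(axis=1)` on the whole DataFrame, and the sample row text, the sample href, and the helper names `parse_row` and `extract_birth` are made up for illustration. The real statiz.co.kr markup may differ.

```python
import re

# Hypothetical sample, shaped like the text the comments above describe:
# a leading number, the player/team field, one space, then stats separated by two spaces.
SAMPLE_ROW = "12SamplePlayer21X 5.10  144  600  0.300"
SAMPLE_HREF = "player.php?name=SamplePlayer&birth=1990-01-02"  # made-up href

BIRTH_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_row(text):
    """Split a row on single spaces, drop the empty fields, strip the leading number."""
    fields = [f for f in text.strip().split(" ") if f != ""]
    fields[0] = re.sub(r"^\d+", "", fields[0])
    return fields


def extract_birth(href):
    """Return the first YYYY-MM-DD date found in href, or None if there is none."""
    match = BIRTH_RE.search(href)
    return match.group(0) if match else None


print(parse_row(SAMPLE_ROW))
print(extract_birth(SAMPLE_HREF))
```

Returning `None` for an href without a date makes a missing birthday visible, whereas indexing `findall(...)[0]` would raise an `IndexError`.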

Crawling pitcher data

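import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Crawling
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.implicitly_wait(3)
for i in range(65):

  # Same URL as the batter crawler except for re=1; pa is the offset: 0, 100, 200, ...
  url = 'http://www.statiz.co.kr/stat.php?mid=stat&re=1&ys=1982&ye=2021&sn=100&pa={}'.format(i * 100)

  driver.get(url)
  driver.implicitly_wait(5)

  # Save the stats table as an HTML string
  # (find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH)
  html = driver.find_element(By.XPATH, '//*[@id="mytable"]/tbody').get_attribute("innerHTML")
  soup = BeautifulSoup(html, 'html.parser')  # Parse the string into a BeautifulSoup object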

  temp = [tr.text.strip() for tr in soup.find_all("tr")]  # Keep only the text of each tr tag
  temp = pd.Series(temp)  # Convert the list to a Series

  # Drop the rows that start with '순' or 'W',
  # so that only per-player records remain, then reset the index
  temp = temp[~temp.str.match("[순W]")].reset_index(drop=True)

  temp = temp.apply(lambda x: pd.Series(x.split(' ')))  # Split on single spaces to build a DataFrame

  # The player/team field is separated from the first stat by one space, and the later stats by two spaces,
  # so splitting on a single space leaves empty columns; drop them
  temp = temp.replace('', np.nan).dropna(axis=1)

  # There are two WAR columns (the column at index 1 and the last column);
  # drop the one at index 1
  temp = temp.drop(1, axis=1)

  # Strip the number in front of the player name (regex=True is required in pandas 2.0+)
  temp[0] = temp[0].str.replace(r"^\d+", '', regex=True)

  # Collect the tags that carry each player's birthday
  birth = [tr.find("a") for tr in soup.find_all('tr') if 'birth' in tr.find('a').attrs['href']]

  # Extract just the date (YYYY-MM-DD) from each tag's href
  p = re.compile(r"\d{4}-\d{2}-\d{2}")
  birth = [p.findall(a.attrs['href'])[0] for a in birth]

  # Add the birthday ('생일') column
  temp['생일'] = birth

  # Append each page's finished DataFrame to result
  # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
  if i == 0:
    result = temp
  else:
    result = pd.concat([result, temp], ignore_index=True)

  print(i, "done")

# Build the column names from the header row
# ('선수' = player, '생일' = birthday)
headers = [th.text for th in soup.find_all("tr")[0].find_all("th")]
columns = ['선수'] + headers[4:17] + headers[19:-3] + ['ERA', 'FIP', 'WHIP', 'ERA+', 'FIP+', 'WAR', '생일']

# Apply the column names
result.columns = columns

# Shut down the webdriver (quit ends the whole session, not just the window)
driver.quit()

print("All done")
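
The batter and pitcher crawlers differ only in the `re` query value (0 vs 1), the number of pages (89 vs 65), and the column names. A small helper could build the page URLs for both. This is a sketch using only the standard library; `build_stat_url` and its parameter names are hypothetical, and the query values are copied from the two URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.statiz.co.kr/stat.php"


def build_stat_url(re_value, page):
    """Build one stat page URL.

    re_value is 0 in the batter crawler and 1 in the pitcher crawler;
    page is the loop index, turned into the pa offset (page * 100).
    """
    query = {
        "mid": "stat",
        "re": re_value,
        "ys": 1982,
        "ye": 2021,
        "sn": 100,
        "pa": page * 100,
    }
    return BASE_URL + "?" + urlencode(query)


batter_urls = [build_stat_url(0, page) for page in range(89)]
pitcher_urls = [build_stat_url(1, page) for page in range(65)]
print(batter_urls[0])
print(pitcher_urls[-1])
```

With the URLs in one place, the two loops could share a single crawl function that takes the `re` value and page count as arguments.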

Crawling the Golden Glove table

import requests
import pandas as pd
from bs4 import BeautifulSoup
from html_table_parser import parser_functions

url = "https://www.koreabaseball.com/History/Etc/GoldenGlove.aspx"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.find("table", {"class": "tData mgt20"})
table = parser_functions.make2d(data)  # Convert the table tag into a 2-D list

# The first row is the header; the rest are data rows
df = pd.DataFrame(data=table[1:], columns=table[0])
df.to_csv("/Users/jjwani/Downloads/Golden_Glove.csv")
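
What `make2d` is used for here (turning a table tag into a list of rows) can be approximated for simple tables with the standard library. The following is a minimal sketch, assuming a table without `rowspan`/`colspan` or nested tables. The class name `TableTo2D` and the sample HTML are made up, and this is not how `html_table_parser` is implemented.

```python
import csv
import io
from html.parser import HTMLParser


class TableTo2D(HTMLParser):
    """Collect the text of th/td cells into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample; the real Golden Glove table has different columns.
SAMPLE_HTML = """
<table class="tData mgt20">
  <tr><th>Year</th><th>Position</th><th>Player</th></tr>
  <tr><td>2000</td><td>P</td><td>Player A</td></tr>
  <tr><td>2000</td><td>C</td><td>Player B</td></tr>
</table>
"""

parser = TableTo2D()
parser.feed(SAMPLE_HTML)
table = parser.rows

# Write header + rows as CSV text (to a string buffer here instead of a file)
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)
csv_text = buffer.getvalue()
print(csv_text)
```

As in the code above, `table[0]` is the header row and `table[1:]` are the data rows, so the result can be passed to `pd.DataFrame` the same way.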
