The Relationship between news and stocks 16

2022. 8. 3. 23:47 · Project/Predicting Stock Price Movements from News Articles

TextCNN

1. Take word embedding vectors as input
2. Generate feature maps by convolving filters over the word embedding vectors
3. Map each feature map to an activation map with an activation function
4. Max-pool each activation map and concatenate the results
5. Feed the concatenated vector into a fully-connected layer for classification
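
The five steps above can be sketched as a forward pass in plain NumPy. This is only an illustration under assumed toy sizes: the vocabulary size, embedding dimension, kernel widths, and the random (untrained) weights are all hypothetical, and `conv1d_valid` and `textcnn_forward` are helper names made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, embed_dim, seq_len = 50, 8, 10
kernel_sizes = (2, 3, 4)   # one group of filters per n-gram width
filters_per_size = 4

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_valid(x, w, b):
    """x: (seq_len, embed_dim), w: (k, embed_dim, n_filters) -> (seq_len - k + 1, n_filters)."""
    k = w.shape[0]
    # Stack every window of k consecutive word vectors: (steps, k, embed_dim)
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b

# Randomly initialised, untrained parameters.
embedding = rng.normal(size=(vocab_size, embed_dim))
conv_params = [
    (rng.normal(size=(k, embed_dim, filters_per_size)), np.zeros(filters_per_size))
    for k in kernel_sizes
]
dense_w = rng.normal(size=(len(kernel_sizes) * filters_per_size, 1))
dense_b = np.zeros(1)

def textcnn_forward(token_ids):
    x = embedding[token_ids]                        # step 1: word embedding vectors
    pooled = []
    for w, b in conv_params:
        feature_map = conv1d_valid(x, w, b)         # step 2: convolution -> feature map
        activation_map = relu(feature_map)          # step 3: activation map
        pooled.append(activation_map.max(axis=0))   # step 4: max pooling over time
    features = np.concatenate(pooled)               # step 4: concatenation
    return sigmoid(features @ dense_w + dense_b)[0] # step 5: fully-connected + sigmoid

token_ids = rng.integers(0, vocab_size, size=seq_len)
prob = textcnn_forward(token_ids)
print(prob)
```

Because each filter group is max-pooled over the whole sequence, the concatenated vector has a fixed length (`len(kernel_sizes) * filters_per_size`) regardless of how long the input is.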

▶ Advantages of TextCNN

- Condenses information while capturing the contextual meaning of a sentence → faster computation

- Can outperform RNNs on classification tasks

Code

Installing the required libraries

%pip install gensim --upgrade
%pip install -U keras-tuner
%pip install pymysql

Importing libraries

import IPython
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import pandas as pd
import numpy as np
import pymysql

Connecting to MySQL

conn = pymysql.connect(
    user    = 'stocks',
    passwd  = 'Stocks!',
    host    = "-",
    port    = 3306,
    db      = 'Data',
    charset = 'utf8'
)

Loading the data

sql = 'SELECT stock_id, text, date, token, label FROM Token'
news = pd.read_sql(sql, conn)
news

Data preprocessing - converting the token column from str to list

import re

def str_to_list(d):
  text = re.sub(r'[\[\'\]]', '', d)
  return text.split(", ")

news["token"] = news.token.apply(str_to_list)
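
To make the conversion concrete, here is a small sketch on a hypothetical value. It assumes the `token` column stores the string form of a Python list (which is what the regex suggests), and compares `str_to_list` with the standard library's `ast.literal_eval` as an alternative.

```python
import ast
import re

def str_to_list(d):
    # Strip brackets and single quotes, then split on the ", " separator.
    text = re.sub(r'[\[\'\]]', '', d)
    return text.split(", ")

raw = "['삼성전자', '실적', '발표']"   # hypothetical stored value

by_regex = str_to_list(raw)
by_literal = ast.literal_eval(raw)
print(by_regex == by_literal)

# Edge case: an empty list yields [''] with the regex approach but [] with literal_eval.
empty_regex = str_to_list("[]")
empty_literal = ast.literal_eval("[]")
print(empty_regex, empty_literal)
```

The two approaches agree on ordinary rows. They differ on empty lists, and the regex version would also alter tokens that themselves contain a bracket or a single quote.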

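Data preprocessing - removing stopwords

from tqdm import tqdm

# Tokens beginning with these prefixes are dropped (prefix match, e.g. '상승세', '하락세').
STOP_PREFIXES = ('상승', '하락', '급등', '급락')
STOP_WORDS = {'폭등', '폭락', '오름세', '약세', '강세', '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로', '자', '에', '와', '한', '하다', '하'}

def stopword(x):
  # Remove stopwords, tokens with a stop prefix, and purely numeric tokens.
  return [i for i in x if i not in STOP_WORDS and not i.startswith(STOP_PREFIXES) and not i.isdigit()]

tqdm.pandas()
news["token"] = news.token.progress_apply(stopword)
news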
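
Whether a stopword entry is treated as a literal string or as a pattern changes what gets filtered. A minimal sketch with hypothetical tokens, comparing a plain membership test against regex matching for prefix-style entries such as r'상승.*':

```python
import re

tokens = ['상승세', '상승', '주가', '급락했다']   # hypothetical tokens
patterns = [r'상승.*', r'급락.*']

# Membership compares whole strings, so a pattern only "matches" a token
# that is literally the pattern text itself.
kept_by_membership = [t for t in tokens if t not in patterns]

# Applying the entries as regexes drops every token that starts with the prefix.
compiled = [re.compile(p) for p in patterns]
kept_by_regex = [t for t in tokens if not any(c.fullmatch(t) for c in compiled)]

print(kept_by_membership)
print(kept_by_regex)
```

With the membership test all four tokens survive; with regex matching only '주가' remains.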

Modeling - splitting the dataset

test = news.loc[news["date"] >= '2022-07-01 00:00:00']
train = news.loc[news["date"] < '2022-07-01 00:00:00']

X_train = train['token']
y_train = train['label']
X_test = test['token']
y_test = test['label']
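
The split is chronological: everything before July 2022 is used for training and everything from July onward for testing, so the model is never evaluated on articles older than its training data. A small sketch on hypothetical rows, assuming the `date` column can be parsed with `pd.to_datetime` so that the comparison is made on datetimes and not on strings:

```python
import pandas as pd

# Hypothetical rows standing in for the Token table.
toy = pd.DataFrame({
    "date": ["2022-06-29 09:00:00", "2022-06-30 23:59:59",
             "2022-07-01 00:00:00", "2022-07-02 10:30:00"],
    "label": [1, 0, 1, 0],
})
toy["date"] = pd.to_datetime(toy["date"])
cutoff = pd.Timestamp("2022-07-01")

toy_train = toy.loc[toy["date"] < cutoff]
toy_test = toy.loc[toy["date"] >= cutoff]

print(len(toy_train), len(toy_test))
```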

Modeling - Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

threshold = 4
words_cnt = len(tokenizer.word_index)
rare_cnt = 0
words_freq = 0
rare_freq = 0

for key, value in tokenizer.word_counts.items():
  words_freq += value

  if value < threshold:
    rare_cnt +=1
    rare_freq += value

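print("Total number of words:", words_cnt)
print("Number of rare words with frequency {} or less: {}".format(threshold-1, rare_cnt))
print("Share of rare words in the vocabulary (%): {}".format((rare_cnt / words_cnt) * 100))
print("Share of rare-word occurrences among all occurrences (%): {}".format((rare_freq / words_freq) * 100))

'''
Total number of words: 62463
Number of rare words with frequency 3 or less: 18802
Share of rare words in the vocabulary (%): 30.101019803723805
Share of rare-word occurrences among all occurrences (%): 0.07036167954849475
'''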

vocab_size = words_cnt - rare_cnt + 2

tokenizer = Tokenizer(vocab_size, oov_token='OOV')
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

y_train = np.array(y_train)
y_test = np.array(y_test)
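
The `+ 2` in `vocab_size` reserves index 0 for padding and index 1 for the OOV token; with the counts printed above this gives 62463 - 18802 + 2 = 43663. The sketch below is a simplified pure-Python mimic of that indexing scheme on a hypothetical corpus, not the Keras `Tokenizer` itself; `encode` is a made-up helper name.

```python
from collections import Counter

# Hypothetical tokenised corpus.
docs = [["금리", "인상", "발표"], ["금리", "동결"], ["금리", "인상"], ["실적", "발표"]]
threshold = 2   # words seen fewer than 2 times count as rare

counts = Counter(tok for doc in docs for tok in doc)
words_cnt = len(counts)
rare_cnt = sum(1 for c in counts.values() if c < threshold)
vocab_size = words_cnt - rare_cnt + 2   # +1 for padding (0), +1 for OOV (1)

# Index 1 is OOV; real words get indices from 2 upward by descending frequency.
ranked = [w for w, _ in counts.most_common()]
word_index = {w: i for i, w in enumerate(ranked, start=2)}

def encode(doc):
    # Words that are unknown or whose index falls outside the vocabulary map to OOV (1).
    return [word_index[w] if word_index.get(w, vocab_size) < vocab_size else 1 for w in doc]

print(vocab_size)
print(encode(["실적", "발표"]))
print(encode(["금리", "동결"]))
```

Every encoded index stays below `vocab_size`, which is why the same value can be passed to the embedding layer as its input dimension.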

drop_train = [index for index, sentence in enumerate(X_train) if len(sentence) < 1]

# Remove empty samples from both the inputs and the labels so they stay aligned.
X_train = [sentence for sentence in X_train if len(sentence) > 0]
y_train = np.delete(y_train, drop_train, axis=0)

print('Maximum article length:', max(len(l) for l in X_train))
print('Average article length:', sum(map(len, X_train)) / len(X_train))


'''
Maximum article length: 5759
Average article length: 489.9088054952696
'''

import matplotlib.pyplot as plt
plt.style.use('seaborn-white')  # named 'seaborn-v0_8-white' in Matplotlib 3.6 and later

plt.hist([len(s) for s in X_train], bins=50)
plt.xlabel('Length of Samples')
plt.ylabel('Number of Samples')
plt.show()

Modeling - Padding

max_len = 800

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
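
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). Since the longest article has 5759 tokens and `max_len` is 800, long articles keep only their last 800 tokens. The function below is a simplified pure-Python stand-in for that default behaviour, written only for illustration; `pad_pre` is a made-up name.

```python
def pad_pre(seq, maxlen, value=0):
    """Simplified stand-in for pad_sequences with padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # left-pad with the padding value

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

If the opening of an article matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.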

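Modeling - TextCNN

embedding_dim = max_len # dimension of the embedding vectors (set equal to max_len here)
dropout_ratio = 0.4 # dropout rate
num_filters = 2 # number of kernels (filters)
kernel_size = 3 # kernel size
hidden_units = 128 # number of units in the dense layer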

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(Dropout(dropout_ratio))
model.add(Conv1D(num_filters, kernel_size, padding = 'valid', activation = 'relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_units, activation='relu'))
model.add(Dropout(dropout_ratio))
model.add(Dense(1, activation = 'sigmoid'))

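# Adam with binary cross-entropy is an assumed choice that fits the single sigmoid output;
# 'accuracy' is tracked so that 'val_accuracy' is available to ModelCheckpoint.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()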
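
As a rough check on what the summary should report, the trainable parameter counts can be worked out by hand. This sketch assumes the vocabulary counts printed in the Tokenizer step (62463 words, 18802 of them rare) and the hyperparameters defined above; the post itself does not show the summary output.

```python
vocab_size = 62463 - 18802 + 2   # from the counts printed in the Tokenizer step
embedding_dim = 800              # equal to max_len in this post
num_filters = 2
kernel_size = 3
hidden_units = 128

embedding_params = vocab_size * embedding_dim                            # one vector per index
conv_params = kernel_size * embedding_dim * num_filters + num_filters    # weights + biases
dense_params = num_filters * hidden_units + hidden_units                 # pooled vector -> hidden
output_params = hidden_units * 1 + 1                                     # hidden -> sigmoid unit
total_params = embedding_params + conv_params + dense_params + output_params

print(embedding_params, conv_params, dense_params, output_params, total_params)
```

Under these assumptions nearly all of the roughly 34.9 million parameters sit in the embedding layer, while the convolution and dense layers together account for only a few thousand.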

es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
mc = ModelCheckpoint('TextCNN_best_model.h5', monitor = 'val_accuracy', mode = 'max', verbose = 1, save_best_only = True)

history = model.fit(X_train, y_train, epochs=10, validation_split = 0.2, callbacks=[es, mc], batch_size = 128)
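
The two callbacks watch different metrics: `es` stops training once `val_loss` has failed to improve for 3 consecutive epochs, while `mc` saves the weights from the epoch with the highest `val_accuracy`. The function below is a simplified pure-Python mimic of the patience rule, not the Keras implementation, and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop under a patience rule."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)   # ran through every epoch without triggering

# Hypothetical validation losses: best at epoch 2, then three epochs without improvement.
stopped_at = early_stop_epoch([0.70, 0.60, 0.65, 0.66, 0.67, 0.50])
print(stopped_at)
```

In this example training stops after epoch 5, so the lower loss at epoch 6 is never reached. Because the checkpoint follows `val_accuracy`, the model that gets reloaded may come from a different epoch than the one with the lowest `val_loss`.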

loaded_model = load_model('TextCNN_best_model.h5')
loaded_model.evaluate(X_test, y_test)

References

- Dr. Yoon Kim's paper "Convolutional Neural Networks for Sentence Classification"