Stopwords
#

ref: https://bkshin.tistory.com/entry/NLP-3-%EB%B6%88%EC%9A%A9%EC%96%B4Stop-word-%EC%A0%9C%EA%B1%B0

Stopwords are words that carry little meaning for analysis. Articles such as a, an, the and pronouns such as I, my are typical examples.

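import nltk
from nltk.corpus import stopwords

# Download the stopword lists (only needed the first time)
nltk.download('stopwords')
print('Number of English stopwords:', len(stopwords.words('english')))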
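To show what removing stopwords looks like in practice, here is a minimal sketch. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are hypothetical names made up for this illustration; with NLTK one would more likely pass `set(stopwords.words('english'))` as the stopword set.

```python
# Hypothetical mini stopword set, for illustration only.
# With NLTK, set(nltk.corpus.stopwords.words('english')) could be used instead.
SAMPLE_STOPWORDS = {"a", "an", "the", "i", "my", "is", "on"}


def remove_stopwords(tokens, stopword_set=SAMPLE_STOPWORDS):
    """Return the tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]


tokens = "The cat is on my desk".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'desk']
```

Lowercasing before the lookup matters because stopword lists are usually stored in lowercase, so "The" would otherwise slip through.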

Lemmatization
#

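A word is made up of a stem and affixes.

Lemmatization reduces a word to its base (dictionary) form, the lemma. Strictly speaking, simply stripping affixes to obtain the stem is called stemming; lemmatization also takes the word's meaning and part of speech into account.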
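The difference between cutting off affixes and looking up a base form can be sketched with a toy example. Everything here is an assumption made for illustration: `naive_stem` is a crude suffix stripper and `TOY_LEMMAS` is a hand-written lookup table. Neither is how a real library such as NLTK or spaCy implements these steps.

```python
# Toy lookup table standing in for a real lemma dictionary (illustration only).
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good", "cats": "cat"}


def naive_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Map a word to its dictionary form if the toy table knows it."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "running", "better", "cats"]:
    print(w, naive_stem(w), toy_lemmatize(w))
```

The stemmer turns "studies" into "studi", which is not a real word, while the lookup returns "study". Irregular forms such as "better" cannot be handled by suffix stripping at all.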

Removing punctuation is one of the most commonly used text normalization steps.

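Removing it with a regex
- text = re.sub(r"[^a-zA-Z0-9]", " ", text)
- Every character other than ASCII letters and digits is replaced with a space.
- Replacing with a space rather than deleting keeps word boundaries intact, so the sentence structure is preserved as much as possible.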
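The regex approach above can be wrapped in a small function. This is a minimal sketch: the function name `remove_punctuation_regex` is made up here, and the final whitespace-collapsing step is an extra assumption that is not part of the original one-liner.

```python
import re


def remove_punctuation_regex(text):
    """Replace every non-alphanumeric ASCII character with a space."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Extra step (an assumption, not in the original one-liner):
    # collapse the runs of spaces left behind by the substitution.
    return " ".join(cleaned.split())


print(remove_punctuation_regex("Hello, world! It's 2024."))
# Hello world It s 2024
```

Note that this pattern also replaces non-ASCII letters (accented characters, Hangul and so on), and it splits contractions such as "It's" into two tokens.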
In spaCy, each token has an is_punct attribute that tells you whether it is punctuation.
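A short sketch of filtering on `is_punct`, assuming spaCy is installed. It uses `spacy.blank("en")`, which provides only the English tokenizer and needs no model download; the sample sentence is arbitrary.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

doc = nlp("Hello, world!")
# Keep only the tokens that spaCy does not flag as punctuation.
words = [token.text for token in doc if not token.is_punct]
print(words)  # ['Hello', 'world']
```

Unlike the regex approach, this works on tokens rather than characters, so the tokenizer decides how punctuation attached to a word is split off.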
Python's standard library works too.
- Use string.punctuation, a string containing the ASCII punctuation characters.
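One way to use `string.punctuation` is with `str.maketrans` and `str.translate`, sketched below. The function name `remove_punctuation_builtin` is made up for this example. It maps each punctuation character to a space, matching the replace-with-space approach described earlier.

```python
import string


def remove_punctuation_builtin(text):
    """Map every character in string.punctuation to a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


result = remove_punctuation_builtin("Hello, world!")
print(result.split())  # ['Hello', 'world']
```

`string.punctuation` only covers ASCII punctuation, so characters such as curly quotes pass through unchanged.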
