[Python] Collecting data (HTML) with BeautifulSoup

I used BeautifulSoup to collect HTML data from a web page.

(https://github.com/aruymeek/python_BeautifulSoup/tree/master/sportstest)

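The goal is to collect the title, link, summary, publisher, and other details of news articles from Naver Sports (https://sports.news.naver.com/index.nhn).

Of the articles shown on the Naver Sports main page, I collected only the section marked with a red rectangle.

1. Creating a class

First, to keep the data easy to manage, I created a SportsModel class.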

class SportsModel:

    def __init__(self, _title, _href, _content, _media, _league):
        self.title = _title
        self.href = _href
        self.content = _content
        self.media = _media
        self.league = _league
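
As an aside, the same container could also be written with the standard library's dataclasses module, which generates __init__ and a readable repr automatically. This is only a sketch of an alternative, not the code used in the rest of this post, and the field values below are made-up placeholders rather than scraped data.

```python
from dataclasses import dataclass


@dataclass
class SportsModel:
    """Holds the fields collected for one news article."""
    title: str
    href: str
    content: str
    media: str
    league: str


# Placeholder values for illustration only (not real scraped data)
example = SportsModel(
    title='Sample title',
    href='https://sports.news.naver.com/sample',
    content='Sample summary',
    media='Sample Press',
    league='Sample League',
)
print(example.title)
print(example)
```

Because the fields are declared in the same order as the original __init__ parameters, SportsModel(title, href, content, media, league) would work the same way with either version.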

2. Collecting the data

Before collecting anything, import the required modules.

import requests
from bs4 import BeautifulSoup

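BeautifulSoup offers find() and select() for locating the part of a document that contains the tags you want. They serve basically the same purpose, but differ slightly in how you describe what to look for.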

url = 'https://sports.news.naver.com/index.nhn'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

To compare find() and select(), I used both to pull the markup for the main menu of the Naver Sports news page.

2-1) find

result_find = soup.find('div', class_='menu_area')

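With find(), fetching the <div> tag whose class attribute is 'menu_area' means spelling out both the attribute name and its value as arguments. It returns only the first matching tag, or None if nothing matches.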
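
To try find() without requesting the live page, here is a small sketch that runs on a hand-written HTML snippet. The markup is invented for illustration and is an assumption, not Naver's actual page structure.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real Naver page)
html = """
<div class="menu_area"><a href="/kbaseball">Baseball</a></div>
<div class="menu_area"><a href="/kfootball">Football</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching Tag, or None if nothing matches
first = soup.find('div', class_='menu_area')
missing = soup.find('div', class_='no_such_class')

# find_all() returns every match as a list
all_menus = soup.find_all('div', class_='menu_area')

print(first.a.text)
print(missing)
print(len(all_menus))
```

Since find() can return None, it is worth checking the result before accessing attributes on it when the page layout might change.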

2-2) select

result_select = soup.select('div.menu_area')

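select() takes a CSS selector such as 'tag.class', so it finds the desired result more concisely than find(). (The '.' indicates a class attribute.)

select() also returns every element that matches the selector as a list. To get a single element rather than several, use select_one().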
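
The difference between select() and select_one() can be seen in a short sketch. The HTML below is a made-up sample (only the class name 'today_list' is borrowed from this post), so the structure is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page structure may differ
html = """
<ul class="today_list">
  <li><a href="/a" title="First">one</a></li>
  <li><a href="/b" title="Second">two</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns all matches as a list (empty if nothing matches)
items = soup.select('ul.today_list li')
nothing = soup.select('div.menu_area')

# select_one() returns the first match only, or None if nothing matches
first_item = soup.select_one('ul.today_list li')
no_item = soup.select_one('div.menu_area')

print(len(items), len(nothing))
print(first_item.a['title'])
print(no_item)
```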

Let's collect the data with select(), since it is more convenient to use.

url = 'https://sports.news.naver.com/index.nhn'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

result = soup.select_one('.today_list')

The section I wanted is wrapped in a <ul> tag whose class is 'today_list'. I stored that part of the HTML in result.

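newsList = []

for r in result.select('li'):
    # Read the title and link straight from the <a> tag's attributes
    title = r.select_one('a')['title']
    href = 'https://sports.news.naver.com/' + r.select_one('a')['href']

    atag = r.select_one('a')
    content = atag.select_one('.news').text.strip()

    # Split the .information text by line to get the publisher and league
    info = atag.select_one('.information').text
    media = info.split('\n')[1]
    league = info.split('\n')[2]

    # Wrap the fields in a SportsModel instance and add it to the list
    spm = SportsModel(title, href, content, media, league)
    newsList.append(spm)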

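Inside the <ul> tag, the information for each article is wrapped in an <li> tag. As before, a for loop extracts the information one article at a time.

The way the title and the article link (href) are obtained is worth noting. .select_one(tag)[attribute] directly returns the value of that attribute on the matched tag. The article's title and link come straight from the title and href attributes of the <a> tag.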
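
Attribute access can be tried on its own with the sketch below. The <a> tag is a made-up sample, and 'data-subtitle' is a hypothetical attribute used only to show what happens when an attribute is missing. The sketch also uses urljoin from the standard library as one possible alternative to joining the base URL and href by hand; that is a swapped-in technique, not what the code in this post does.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up sample tag (an assumption, not the real Naver markup)
html = '<li><a href="/news/read?id=1" title="Sample headline">text</a></li>'
atag = BeautifulSoup(html, 'html.parser').select_one('a')

# tag['attr'] returns the attribute value; it raises KeyError if absent
title = atag['title']

# tag.get('attr') returns None instead of raising when the attribute is absent
subtitle = atag.get('data-subtitle')  # hypothetical attribute name

# urljoin resolves a relative href against the base URL
base = 'https://sports.news.naver.com/'
full_link = urljoin(base, atag['href'])

print(title)
print(subtitle)
print(full_link)
```

urljoin produces a well-formed URL whether or not the href starts with a slash, which plain string concatenation does not guarantee.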

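To hold the article data, I created an empty list, newsList, then built an instance named spm with the SportsModel class and appended it to the list.

3. Printing the data

The information is stored in the list as class instances, so I printed it in an easy-to-read form.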

for news in newsList:
    print('title: {0}\nlink: {1}\ncontent: {2}\nmedia: {3}\nleague: {4}'
          .format(news.title, news.href, news.content, news.media, news.league))
    print('------------------------------')


imymemine
