
web crawling

Instagram crawler #posted_texts, hashtags, ids, posted_time crawler #selenium, beautifulsoup

from selenium import webdriver
from bs4 import BeautifulSoup
import urllib.parse
import time

# Note: the '취미' (hobby) story tag has more everyday-life posts.

search = input('Enter a search term: ')
search = urllib.parse.quote(search)
url = 'https://www.instagram.com/explore/tags/' + search + '/'

# Launch Chrome through the chromedriver executable at an absolute path
driver = webdriver.Chrome('C:\\Users\\user\\Desktop\\chromedriver\\chromedriver.exe')
driver.get(url)
time.sleep(5)  # give the page 5 seconds to load fully before continuing

SCROLL_PAUSE_TIME = 1
reallink = []  # collects the unique href of every post on the tag page

while True:
    pagestring = driver.page_source
    bs = BeautifulSoup(pagestring, 'lxml')

    # Each div of class 'Nnq7C weEfm' is a row of post thumbnails;
    # find_all descends into child tags, and each <a> holds one post's href.
    for link1 in bs.find_all(name='div', attrs={'class': 'Nnq7C weEfm'}):
        for a in link1.select('a'):
            reallink.append(a.attrs['href'])

    last_height = driver.execute_script('return document.body.scrollHeight')   # height before scrolling
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')   # scroll to (x=0, y=scrollHeight)
    time.sleep(SCROLL_PAUSE_TIME)  # wait 1 second for new posts to load
    new_height = driver.execute_script('return document.body.scrollHeight')    # height after scrolling
    if new_height == last_height:
        # Height unchanged: scroll down once more and wait again.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # still unchanged after the retry: end of the feed
        else:
            continue

# Visit each post and grab its caption block (id, text, hashtags)
posting_infos = []
for address in reallink:
    post_url = 'https://www.instagram.com/' + address
    driver.get(post_url)
    time.sleep(3)
    try:
        infos = driver.find_element_by_class_name('C4VMK').text
        posting_infos.append(infos)
    except:
        print(address)  # log posts whose caption block was not found
        continue
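
The title also promises posted_time, which the loop above never collects. A minimal sketch of grabbing it in the same way, assuming the post page renders a <time> element with a datetime attribute, as the 2020 Instagram layout did (like the C4VMK class, this markup changes often):

# Sketch: collect each post's ISO timestamp alongside its caption.
# Assumes a <time datetime="..."> element exists on the post page.
posted_times = []
for address in reallink:
    driver.get('https://www.instagram.com/' + address)
    time.sleep(3)
    try:
        stamp = driver.find_element_by_tag_name('time').get_attribute('datetime')
        posted_times.append(stamp)
    except:
        posted_times.append(None)  # keep indexes aligned with reallink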
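Once posting_infos is filled, each caption can be split into the id, body text, and hashtags named in the title. A small post-processing sketch; the assumption that the first line of the C4VMK block is the poster's username comes from the 2020 layout, not from anything Instagram guarantees:

import re

def parse_posting_info(info):
    lines = info.split('\n')
    user_id = lines[0] if lines else ''     # first line: poster's id (assumed)
    text = '\n'.join(lines[1:])             # remainder: caption body
    hashtags = re.findall(r'#(\w+)', text)  # '#tag' tokens; \w also matches Korean
    return {'id': user_id, 'text': text, 'hashtags': hashtags}

parsed = [parse_posting_info(info) for info in posting_infos]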

https://riptutorial.com/ko/selenium-webdriver/example/13934/webdriver%EB%A5%BC-%EC%82%AC%EC%9A%A9%ED%95%98%EC%97%AC-%ED%8E%98%EC%9D%B4%EC%A7%80-%EC%9A%94%EC%86%8C-%EC%B0%BE%EA%B8%B0

This page (riptutorial.com's selenium-webdriver documentation) lists the various ways of locating page elements with Selenium's WebDriver.
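
For reference, a few of the locator strategies it covers, written with the same Selenium 3 find_element_by_* helpers the crawler above uses (the class name is Instagram's obfuscated 2020 name and is only an example):

driver.find_element_by_class_name('C4VMK')              # by CSS class
driver.find_element_by_tag_name('time')                 # by tag name
driver.find_element_by_css_selector('div.C4VMK span')   # by CSS selector
driver.find_element_by_xpath('//article//a')            # by XPath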
