In R, you can scrape web pages with the rvest package. Here, url is the link to the page you want to scrape, and read_html(url) %>% html_nodes("section p") %>% html_text() fetches that page and extracts the text wrapped in p tags nested under a section tag.
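A minimal sketch of that pipeline follows. To keep it runnable offline, it parses a small inline HTML string (made up for illustration) instead of a live page; for a real page you would pass the URL to read_html() instead.

```r
library(rvest)

# Hypothetical page content, used only so the sketch runs without a network.
# For a real page, use: page <- read_html(url)
html <- '<html><body>
  <section><p>First paragraph.</p><p>Second paragraph.</p></section>
  <div><p>Not inside a section.</p></div>
</body></html>'

page <- read_html(html)

# Select every <p> that is a descendant of a <section>, then extract its text
paragraphs <- page %>%
  html_nodes("section p") %>%
  html_text()

print(paragraphs)
```

The paragraph inside the div is skipped because the selector "section p" only matches p tags that sit under a section tag.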
To scrape several known links, put the URLs in a vector and loop over them to fetch the content of each page. When the pages belong to the same website, add a call such as Sys.sleep(runif(1,2,4)) inside the loop so the program pauses for a random 2 to 4 seconds between requests. Without a pause, rapid repeated requests can look like a DDoS attack, and the website may block you, leaving you temporarily unable to access it.
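The loop described above might be sketched as follows. The helper name scrape_pages and its parameters selector, min_wait, and max_wait are hypothetical, introduced only for illustration, and the URLs in the usage comment are placeholders.

```r
library(rvest)

# Hypothetical helper: fetch each page in turn and return its paragraph text.
# `selector`, `min_wait`, and `max_wait` are illustrative parameter names.
scrape_pages <- function(urls, selector = "section p",
                         min_wait = 2, max_wait = 4) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- read_html(urls[i]) %>%
      html_nodes(selector) %>%
      html_text()
    # Pause for a random interval between requests so the site is not
    # hit too often; no pause is needed after the last page.
    if (i < length(urls)) {
      Sys.sleep(runif(1, min_wait, max_wait))
    }
  }
  results
}

# Placeholder URLs; replace them with the pages you want to scrape.
# urls <- c("https://example.com/page1", "https://example.com/page2")
# texts <- scrape_pages(urls)
```

The result is a list with one character vector per URL, in the same order as the input vector. With the defaults, the pause matches Sys.sleep(runif(1,2,4)).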
(A companion write-up on GitHub, which includes a code example: Simple way to scrape website by rvest)