Web Scraping with the Selenium Automation Tool

November 25, 2019

This is selenium

$ selenium

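1. Introduction to Selenium

1.1 Selenium overview

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, and Opera. Selenium is a complete web application testing suite that covers test recording (Selenium IDE), test authoring and execution (Selenium Remote Control, which WebDriver replaced starting with Selenium 2), and parallel test execution (Selenium Grid).
Selenium Core, the heart of the original Selenium, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript.
Because Selenium drives a real browser and supports many of them, crawlers mainly use it to handle pages whose content is rendered by JavaScript.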
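
To see why a plain HTTP fetch can fall short on such pages, here is a minimal sketch in plain Java. The HTML string `RAW_HTML` and the helper `innerMarkup` are made up for illustration and are not part of Selenium or of the original post. The list is filled in by a script, so the raw source that a simple downloader receives contains an empty list; only a browser that executes the script ever sees the item.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticHtmlDemo {
    // Hypothetical page source: the <ul> is empty until the script runs in a browser.
    static final String RAW_HTML =
            "<html><body>"
            + "<ul id=\"repos\"></ul>"
            + "<script>document.getElementById('repos').innerHTML = '<li>demo-repo</li>';</script>"
            + "</body></html>";

    /** Returns the markup between <ul id="..."> and the first </ul>, or null if there is no such list. */
    static String innerMarkup(String html, String id) {
        Pattern p = Pattern.compile("<ul id=\"" + Pattern.quote(id) + "\">(.*?)</ul>", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Without a JavaScript engine the list has no items in the raw source.
        System.out.println("Items in raw source: [" + innerMarkup(RAW_HTML, "repos") + "]");
    }
}
```

With Selenium, `driver.getPageSource()` is called after the browser has run the page's scripts, which is why the later examples parse that output instead of a raw download.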

2. Installing Selenium

2.1 Installing from the JAR package

Download it from [https://selenium.dev/downloads/](https://selenium.dev/downloads/ "Selenium downloads")

2.2 Adding the dependency to the Maven pom file

    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>3.4.0</version>
    </dependency>

2.3 Hello Selenium

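    package javaBase;

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class Itest {
        public static void main(String[] args) {
            // Launch Chrome; the chromedriver executable must be discoverable (see the browser driver section)
            WebDriver driver = new ChromeDriver();
            try {
                driver.get("http://www.itest.info");

                // println avoids treating '%' in the title as a format specifier
                String title = driver.getTitle();
                System.out.println(title);
            } finally {
                // quit() ends the session and closes every window, even if an exception was thrown
                driver.quit();
            }
        }
    }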

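3. Browser Drivers for Selenium 3

3.1 Downloading the browser driver

Chrome: http://chromedriver.storage.googleapis.com/index.html

3.2 Configuring the browser driver

Making the driver available to Selenium is simple. Create a directory to hold browser drivers, for example C:\driver, and put the downloaded driver files (such as chromedriver or geckodriver) in it.
Then go to My Computer -> Properties -> System Settings -> Advanced -> Environment Variables -> System Variables -> Path and add the "C:\driver" directory to the Path value.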
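
As a quick check that the Path change took effect, the following sketch scans the directories on PATH for the driver file. If it finds one, it sets the `webdriver.chrome.driver` system property, the same property the complete example later sets by hand. The class name `DriverLocator` and the method `findOnPath` are hypothetical names chosen for illustration; only the JDK is used.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class DriverLocator {
    /** Searches each directory listed in pathValue for a regular file named exeName. */
    static Optional<Path> findOnPath(String exeName, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, exeName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries instead of failing the whole search
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String exe = windows ? "chromedriver.exe" : "chromedriver";
        Optional<Path> driver = findOnPath(exe, System.getenv("PATH"));
        if (driver.isPresent()) {
            // Point Selenium at the driver explicitly
            System.setProperty("webdriver.chrome.driver", driver.get().toString());
            System.out.println("Using driver at " + driver.get());
        } else {
            System.out.println(exe + " was not found on PATH");
        }
    }
}
```

Setting the property explicitly is an alternative to editing Path, and it is useful when several driver versions are installed side by side.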
    @ControllerAdvice
    public class GlobalExceptionHandler {
        @ExceptionHandler(RuntimeException.class)
        @ResponseBody
        public Map<String, Object> exceptionHandler() {
            Map<String, Object> map = new HashMap<String, Object>();
            map.put("errorCode", "500");
            map.put("errorMsg", "系統错误!");
            return map;
        }
    }

4. A Complete Example

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.openqa.selenium.By;
    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;

    // GitHubUser, Repositories and DownloaderExcel are the author's own classes and are not shown in this post.
    public class CrawLers {
        public static void main(String[] args) throws InterruptedException {
            System.setProperty("webdriver.chrome.driver", "C:\\Users\\HaiRui\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
            WebDriver driver = new ChromeDriver();
            // Set once: every findElement call now retries for up to 20 seconds
            driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
            try {
                driver.get("https://github.com/trending");

                // Open the sign-in page and log in
                driver.findElement(By.xpath("/html/body/div[1]/header/div/div[2]/div[2]/a[1]")).click();
                GitHubUser user = new GitHubUser();
                driver.findElement(By.id("login_field")).sendKeys(user.getUsername());
                driver.findElement(By.id("password")).sendKeys(user.getPassword());
                driver.findElement(By.xpath("//*[@id=\"login\"]/form/div[3]/input[8]")).click();

                // Parse the rendered page with Jsoup and keep at most the first 10 rows
                Document doc = Jsoup.parse(driver.getPageSource());
                Elements rows = doc.getElementsByClass("Box-row");
                List<Element> top10 = rows.subList(0, Math.min(10, rows.size()));

                ArrayList<Repositories> reposList = new ArrayList<>();
                int i = 1;
                for (Element row : top10) {
                    // Create a new object per row; reusing one instance fills the list with identical entries
                    Repositories repos = new Repositories();

                    // The link text is expected to look like "owner / name"
                    String[] name = row.getElementsByClass("h3 lh-condensed").get(0)
                            .getElementsByTag("a").get(0).text().split(" ");
                    repos.setUsername(name[0]);
                    repos.setReposName(name[2]);
                    repos.setLanguage(row.getElementsByAttributeValue("itemprop", "programmingLanguage").text());

                    String star = row.getElementsByClass("f6 text-gray mt-2").get(0)
                            .getElementsByClass(" muted-link d-inline-block mr-3").text();
                    repos.setStarsCount(star.split(" ")[0]);

                    // Open the repository page and read the branch count
                    driver.findElement(By.xpath("/html/body/div[4]/main/div[3]/div/div[2]/article[" + i + "]/h1/a")).click();
                    i++;
                    Document repoDoc = Jsoup.parse(driver.getPageSource());
                    String[] counts = repoDoc.getElementsByClass("num text-emphasized").text().split(" ");
                    repos.setBranchesCount(counts[1]);
                    System.out.println(repos);

                    // Highlight and open the <details> dropdown in the first header list item
                    String menuXpath = "//*[@id=\"js-repo-pjax-container\"]/div[1]/div/ul/li[1]/form/details";
                    WebElement menu = driver.findElement(By.xpath(menuXpath));
                    highlight(driver, menu);
                    menu.click();
                    Thread.sleep(10000);

                    // Highlight and click the third button in the opened menu
                    WebElement option = driver.findElement(By.xpath(menuXpath + "/details-menu/div[2]/button[3]"));
                    highlight(driver, option);
                    option.click();

                    reposList.add(repos);
                    driver.navigate().back();
                }

                for (Repositories r : reposList) {
                    System.out.println(r);
                }

                // Export the results to Excel with the author's helper class
                DownloaderExcel excel = new DownloaderExcel();
                excel.excel(reposList);
            } finally {
                // End the session even if a lookup fails part-way through
                driver.quit();
            }
        }

        /** Draws a yellow border around an element so it is easy to spot in the browser window. */
        private static void highlight(WebDriver driver, WebElement element) {
            ((JavascriptExecutor) driver).executeScript("arguments[0].style.border = \"5px solid yellow\"", element);
        }
    }
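
The example above extracts the owner, repository name and star count with `split(" ")` and fixed indices, which throws an `ArrayIndexOutOfBoundsException` as soon as the text has an unexpected shape. A more defensive variant might look like the sketch below. The class `TrendingText` and both helper methods are hypothetical names for illustration, and the sample strings are invented, not taken from GitHub.

```java
import java.util.Optional;
import java.util.OptionalInt;

class TrendingText {
    /** Splits link text such as "octocat / Hello-World" into {owner, repository}. */
    static Optional<String[]> parseRepoName(String linkText) {
        if (linkText == null) {
            return Optional.empty();
        }
        // Tolerate any amount of whitespace around the slash
        String[] parts = linkText.trim().split("\\s*/\\s*");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(parts);
    }

    /** Parses the first whitespace-separated token as an integer, ignoring thousands separators. */
    static OptionalInt parseLeadingCount(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        String first = text.trim().split("\\s+")[0].replace(",", "");
        try {
            return OptionalInt.of(Integer.parseInt(first));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        parseRepoName("octocat / Hello-World")
                .ifPresent(p -> System.out.println(p[0] + " owns " + p[1]));
        parseLeadingCount("1,234 567")
                .ifPresent(n -> System.out.println("stars: " + n));
    }
}
```

Returning `Optional` values lets the crawl loop skip a malformed row instead of aborting the whole run.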