Web crawler series: headless Chrome with Puppeteer Sharp

Introduction

In this tutorial you’ll use Puppeteer Sharp to drive headless Chrome or Chromium from an ASP.NET Core application, deployed on Ubuntu 18.04 behind Nginx.

How it works

The components relate to each other as follows.

OS---------- Ubuntu Server 18.04 x64
 ├── PuppeteerSharp                  <---- .NET library that controls the browser
 │   ├── Chromium                     <---- open-source, free web browser, launched locally
 │   └── Remote browser               <---- or connect to a browser that is already running
 ├── WebDriver
 │   ├── ChromeDriver                 <---- standalone server that drives Chrome
 │   ├── Firefox
 │   └── EdgeChromium
 ├── Web browser applications
 │   ├── Chromium                     <---- open-source, free web browser
 │   └── Chrome                       <---- Google's proprietary browser
 └── Nginx web server                 <---- front-end server; listens on ports 443/80 and forwards HTTP/S calls to http://localhost:5000

 

Table of contents

 

To see which Chrome version each Puppeteer release targets, check: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md

Proxy in PuppeteerSharp

Puppeteer Sharp v5.0 connect to local Chrome headless

Puppeteer Sharp v5.0 Connect to a Remote Browser

Using Puppeteer with Edge Chromium

 

Chromium vs ChromeDriver vs Chrome vs chromium-browser vs chromium_webview

Chromium(Chromium.exe) is an open-source and free web browser that is managed by the Chromium Project.

All Chinese-made "dual-engine" browsers, as well as Chrome itself, are built on Chromium.

Chrome (chrome.exe) is a proprietary browser developed and managed by Google, first released in 2008. Chromium and Chrome are closely tied because Google’s Chrome is built from Chromium’s source code.

Google has fully open-sourced Chromium for Android, so it is possible to build Android browsers comparable to Chrome for Android.

 

 

Introduction

Headless Browsers

A web browser without a graphical user interface, controlled programmatically. Used for automation, testing, and other purposes.

Puppeteer

https://github.com/puppeteer/puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

Puppeteer Sharp

Controls Chrome or Chromium

Puppeteer Sharp is a port of the popular Headless Chrome NodeJS API built by Google. Puppeteer Sharp was written in C# and released in 2017 by Darío Kondratiuk to offer the same functionality to .NET developers.

Puppeteer Sharp enables a .NET developer to programmatically control, or ‘puppeteer’, the open-source Google Chromium web browser or, since Puppeteer Sharp v5.0, the Firefox web browser. A key convenience of the Puppeteer API is that it can drive a headless instance of the browser, which skips displaying the UI and so performs better.

If you are a .NET developer, installing the Puppeteer Sharp Nuget package into your project can enable you to achieve:

Crawling the web using a headless web browser
Automated testing of a web application using a test framework
Retrieve JavaScript rendered HTML

https://github.com/hardkoded/puppeteer-sharp

Requirements

Platform / OS version: Ubuntu 18.04

IDE: Visual Studio 2019

.NET version: net5.x

Table of contents

Create a new ASP.NET Core web application

Download chromium browser revision package

Load a Webpage

 

Connect to a Remote Browser

 

 

PuppeteerSharp V5.0

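Step 1: Install-Package PuppeteerSharp -Version 5.0.0

To use Puppeteer Sharp in your project, run the command above in the Package Manager Console, or add the package reference to the project file: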

<Project Sdk="Microsoft.NET.Sdk.Web">

  <PropertyGroup>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
   
    <PackageReference Include="PuppeteerSharp" Version="5.0.0" />
   
  </ItemGroup>

</Project>

 

Step 2: Download chromium.zip (download the Chromium browser)

The version mapping is listed at github.com/GoogleChrome/puppeteer/blob/v1.10.0/docs/api.md, and a compatible build can be downloaded from chromium.woolyss.com/download/en.

using PuppeteerSharp;

            var browserFetcher = new BrowserFetcher();
            // DownloadAsync() performs two steps (see its source for details):
            // 1. it requests the archive from https://storage.googleapis.com
            // 2. it unpacks the archive once the download completes
            await browserFetcher.DownloadAsync();

            // Headless = false: Puppeteer in headful mode will display the browser UI
            await using var browser = await Puppeteer.LaunchAsync(
                new LaunchOptions { Headless = true });
            await using var page = await browser.NewPageAsync();
            // Fetch the final page (i.e. the page after its JavaScript has loaded)
            await page.GoToAsync("http://stockpage.10jqka.com.cn/603906/operate/");
            // outputFile is the screenshot path; the value here is only an example
            var outputFile = "screenshot.png";
            await page.ScreenshotAsync(outputFile);
            // Get and return the HTML content of the page
            var content = await page.GetContentAsync();

 

 

 

 

Step 2 (alternative): reuse an already downloaded Chrome

Puppeteer Sharp: avoid downloading Chromium (bundle Chromium locally)

You can use the LaunchOptions.ExecutablePath property. If you use ExecutablePath you don't even need to call BrowserFetcher. It would look something like this:

var options = new LaunchOptions
{
    Headless = true,
    ExecutablePath = "<CHROME.EXE FULL PATH>"
};

using (var browser = await Puppeteer.LaunchAsync(options))
{
}

 

 F:\Miniblog.Core
2020/11/07  16:23    <DIR>          .local-chromium
                  ---F:\Miniblog.Core\Admin2\Miniblog.Core\.local-chromium\Win64-884014\chrome-win\chrome.exe  it is a chromium browser, not chrome browser
2021/08/03  00:54    <DIR>          .local-firefox
...
2021/08/03  00:18               860 Miniblog.Core.csproj
...
2021/04/17  22:45             6,056 Startup.cs

 

namespace PuppeteerSharp
{
    //
    // Summary:
    //     Browser to use (Chrome or Firefox).
    public enum Product
    {
        //
        // Summary:
        //     Chrome.
        Chrome = 0,
        //
        // Summary:
        //     Firefox.
        Firefox = 1
    }
}

 

            var bf = new BrowserFetcher();

            // Resolves to e.g. F:\Miniblog\Miniblog.Core\.local-chromium\Win64-884014\chrome-win\chrome.exe
            var executablePath = bf.GetExecutablePath(BrowserFetcher.DefaultChromiumRevision);

            var options = new LaunchOptions
            {
                Headless = true,
                ExecutablePath = executablePath
            };

            var browser = await Puppeteer.LaunchAsync(options);

            var page = await browser.NewPageAsync();

            await page.GoToAsync("http://stockpage.10jqka.com.cn/603906/operate/");

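Puppeteer Sharp v5.0 connect to local Chrome headless=false (connect Puppeteer to an already running Chrome browser)

First, add a debugging launch argument to Chrome's desktop shortcut: right-click the shortcut, open Properties, and append --remote-debugging-port=9222 to the end of the Target field, separated from .exe by a space.
Second, request http://localhost:9222/json/version in a browser. It is an ordinary GET request, and the response contains the webSocketDebuggerUrl parameter.
Third, connect Puppeteer to the running Chrome browser with
const browser = await puppeteer.connect({
browserWSEndpoint: webSocketDebuggerUrl
});
and you are done.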

step 1:

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 http://basic.10jqka.com.cn/603906/operate.html#stockpage
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --headless  --remote-debugging-port=9222 http://basic.10jqka.com.cn/603906/operate.html#stockpage

step 2:

http://localhost:9222/json/version

{
   "Browser": "HeadlessChrome/92.0.4515.159",
   "Protocol-Version": "1.3",
   "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/92.0.4515.159 Safari/537.36",
   "V8-Version": "9.2.230.29",
   "WebKit-Version": "537.36 (@0185b8a19c88c5dfd3e6c0da6686d799e9bc3b52)",
   "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/733d8577-676a-4ea9-838d-80cfb327bf86"
}
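The webSocketDebuggerUrl value changes every time Chrome starts, so copying it by hand gets tedious. As a minimal sketch, the helper below pulls that property out of the /json/version response body with System.Text.Json. The class and method names are illustrative and not part of Puppeteer Sharp; fetching the body itself (for example with HttpClient.GetStringAsync) is left out so the example stays self-contained.

```csharp
using System;
using System.Text.Json;

public static class DevToolsEndpoint
{
    // Extracts "webSocketDebuggerUrl" from the JSON body returned by
    // http://localhost:9222/json/version. Returns null when the property is absent.
    public static string ExtractWebSocketUrl(string versionJson)
    {
        using JsonDocument doc = JsonDocument.Parse(versionJson);
        return doc.RootElement.TryGetProperty("webSocketDebuggerUrl", out JsonElement url)
            ? url.GetString()
            : null;
    }

    public static void Main()
    {
        // Sample body, shortened from the response shown above.
        string json = "{ \"Browser\": \"HeadlessChrome/92.0.4515.159\", " +
            "\"webSocketDebuggerUrl\": \"ws://localhost:9222/devtools/browser/733d8577-676a-4ea9-838d-80cfb327bf86\" }";
        Console.WriteLine(ExtractWebSocketUrl(json));
    }
}
```

The returned string is what goes into ConnectOptions.BrowserWSEndpoint in step 3.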

step 3:

        /// <summary>
        /// Puppeteer Sharp v5.0: connect to a locally running browser over its debugging port
        /// </summary>
        /// <returns></returns>
        public async Task<ActionResult> Index2()
        {
            var connectOptions = new ConnectOptions()
            {
                // Use the webSocketDebuggerUrl value from step 2 exactly as returned (no trailing slash)
                BrowserWSEndpoint = "ws://localhost:9222/devtools/browser/733d8577-676a-4ea9-838d-80cfb327bf86"
            };

            var browser = await Puppeteer.ConnectAsync(connectOptions);

            Page page = await browser.NewPageAsync();

            // Record a trace of the page load
            await page.Tracing.StartAsync(new TracingOptions { Path = "C:\\Files\\trace.json" });

            await page.GoToAsync("http://basic.10jqka.com.cn/603906/operate.html#stockpage");

            // Stop tracing so the trace file is completed
            await page.Tracing.StopAsync();

            // HTML of the page after its JavaScript has run
            string content = await page.GetContentAsync();

            return View();
        }

 

 

other:

(1) Using Puppeteer with Edge Chromium

Point Puppeteer at your installation of Microsoft Edge (Chromium) by setting executablePath in the Puppeteer options:

import * as puppeteer from 'puppeteer';

const puppeteerOptions = {
    headless: false,
    executablePath: 'C:\\Users\\joe\\AppData\\Local\\Microsoft\\Edge SxS\\Application\\msedge.exe',
};

export default async () => {
    return await puppeteer.launch(puppeteerOptions);
};

To find the executablePath, navigate to edge://version and copy the Executable path.

User Agent string difference between Puppeteer headless and headful modes

1. User Agent string when running Puppeteer in headless mode:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/79.0.3945.0 Safari/537.36

Notice the substring HeadlessChrome.

2. User Agent string when running Puppeteer in headful mode (headless: false):

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36

This User Agent string looks like that of a normal web browser.

How to set User Agent on headless Chrome

 // set user agent (override the default headless User Agent)
            await page.SetUserAgentAsync("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36");
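Since the only difference between the two strings above is the HeadlessChrome token, a site can spot headless mode with a substring check, and a crawler can derive a headful-style string the same way. The sketch below shows both with plain string operations; the class and method names are illustrative and not part of Puppeteer Sharp.

```csharp
using System;

public static class UserAgentCheck
{
    // True when the User Agent carries the HeadlessChrome token.
    public static bool LooksHeadless(string userAgent) =>
        userAgent != null && userAgent.Contains("HeadlessChrome", StringComparison.Ordinal);

    // Rewrites a headless User Agent so that it matches the headful form.
    public static string ToHeadfulStyle(string userAgent) =>
        userAgent.Replace("HeadlessChrome", "Chrome", StringComparison.Ordinal);

    public static void Main()
    {
        string headless = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) HeadlessChrome/79.0.3945.0 Safari/537.36";
        Console.WriteLine(LooksHeadless(headless));
        Console.WriteLine(ToHeadfulStyle(headless));
    }
}
```

The rewritten string could then be passed to page.SetUserAgentAsync, as in the call shown above.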

 

 

"C:\Program Files\Google\Chrome\Application\chrome.exe"  --remote-debugging-port=9222

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  --remote-debugging-port=9222

 

PuppeteerSharp V2.0.4

Step 1:Install-Package PuppeteerSharp

To use Puppeteer Sharp in a new or existing .NET project, install the ‘PuppeteerSharp’ NuGet package (version 2.0.4 in this section):

<Project Sdk="Microsoft.NET.Sdk.Web">

  <PropertyGroup>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="PuppeteerSharp" Version="2.0.4" />
  </ItemGroup>

</Project>

 

Step 2: Download chromium.zip

To get headless Chrome working, first download a Chromium revision with BrowserFetcher:

    public class AdController : Controller
    {
        // GET: AdController
        public async Task<ActionResult> Index()
        {
            // Download the Chromium browser revision package
            RevisionInfo revisionInfo = await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

            // Print where the revision was stored and whether the download succeeded
            Console.WriteLine($"revisionInfo.Downloaded:{revisionInfo.Downloaded}");
            Console.WriteLine($"revisionInfo.ExecutablePath:{revisionInfo.ExecutablePath}");
            Console.WriteLine($"revisionInfo.FolderPath:{revisionInfo.FolderPath}");
            Console.WriteLine($"revisionInfo.Platform:{revisionInfo.Platform}");
            Console.WriteLine($"revisionInfo.Revision:{revisionInfo.Revision}");

            return View();
        }
    }

output

Content root path: /var/www/Net5/
revisionInfo.Downloaded:True
revisionInfo.ExecutablePath:/var/www/Net5/.local-chromium/Linux-706915/chrome-linux/chrome
revisionInfo.FolderPath:/var/www/Net5/.local-chromium/Linux-706915 
revisionInfo.Platform:Linux    
revisionInfo.Revision:706915 

On the first run, it downloads the portable installation package (for example download-Win64-536395.zip) from the network into the .local-chromium directory of the current program.

This takes some time.

If a network problem makes the download fail, it throws an exception and the program exits.

 An unhandled exception has occurred while executing the request.
      System.Net.WebException: The operation has timed out.
         at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
         at System.Net.WebClient.GetWebResponse(WebRequest request, IAsyncResult result)
         at System.Net.WebClient.GetWebResponseTaskAsync(WebRequest request)
         at System.Net.WebClient.DownloadBitsAsync(WebRequest request, Stream writeStream, AsyncOperation asyncOp, Action`3 completionDelegate)
         at PuppeteerSharp.BrowserFetcher.DownloadAsync(Int32 revision)
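A timeout like the one above is often transient, so one option is to retry the download a few times before giving up. The helper below is a generic sketch of that idea using only the base class library; the Retry class, its parameters, and the simulated flaky action in Main are all assumptions made for illustration, not part of Puppeteer Sharp.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs the action up to maxAttempts times, waiting `delay` between attempts.
    // The exception from the final attempt is allowed to propagate.
    public static async Task<T> WithRetriesAsync<T>(Func<Task<T>> action, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Not the last attempt yet: wait, then loop and try again.
                await Task.Delay(delay);
            }
        }
    }

    public static async Task Main()
    {
        int calls = 0;
        // Simulated flaky operation: fails twice, then succeeds.
        string result = await WithRetriesAsync(() =>
        {
            calls++;
            if (calls < 3) throw new TimeoutException("simulated timeout");
            return Task.FromResult("downloaded");
        }, maxAttempts: 5, delay: TimeSpan.FromMilliseconds(10));
        Console.WriteLine($"{result} after {calls} calls");
    }
}
```

In the controller above, the download call could be wrapped as Retry.WithRetriesAsync(() => new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision), 3, TimeSpan.FromSeconds(5)).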

Other:

Changing the default download host

BrowserFetcher accepts a download host to use.

1. It defaults to https://storage.googleapis.com.

https://github.com/hardkoded/puppeteer-sharp/blob/master/lib/PuppeteerSharp/BrowserFetcher.cs

 public class BrowserFetcher
    {
        private const string DefaultDownloadHost = "https://storage.googleapis.com";
        private static readonly Dictionary<Platform, string> _downloadUrls = new Dictionary<Platform, string>
        {
            [Platform.Linux] = "{0}/chromium-browser-snapshots/Linux_x64/{1}/{2}.zip",
            [Platform.MacOS] = "{0}/chromium-browser-snapshots/Mac/{1}/{2}.zip",
            [Platform.Win32] = "{0}/chromium-browser-snapshots/Win/{1}/{2}.zip",
            [Platform.Win64] = "{0}/chromium-browser-snapshots/Win_x64/{1}/{2}.zip"
        };
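To see what these templates expand to, the sketch below fills one in with string.Format. It assumes {0} is the host, {1} the revision and {2} the archive name, and that the Linux archive is called chrome-linux (matching the unpacked folder name in the output shown earlier); both points are inferred from this page rather than quoted from the library, and the SnapshotUrl class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SnapshotUrl
{
    // Two of the templates quoted above, keyed by platform name.
    private static readonly Dictionary<string, string> Templates = new Dictionary<string, string>
    {
        ["Linux"] = "{0}/chromium-browser-snapshots/Linux_x64/{1}/{2}.zip",
        ["Win64"] = "{0}/chromium-browser-snapshots/Win_x64/{1}/{2}.zip"
    };

    // Assumed placeholder meaning: {0} = host, {1} = revision, {2} = archive name.
    public static string Build(string platform, string host, int revision, string archiveName) =>
        string.Format(CultureInfo.InvariantCulture, Templates[platform],
            host.TrimEnd('/'), revision, archiveName);

    public static void Main()
    {
        Console.WriteLine(Build("Linux", "https://storage.googleapis.com", 706915, "chrome-linux"));
    }
}
```

Because the template already contains the chromium-browser-snapshots segment, a custom host would be given without that segment.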

2. Third-party hosts, for example https://mirrors.huaweicloud.com/chromium-browser-snapshots/

https://storage.googleapis.com   //Default Download Host
https://mirrors.huaweicloud.com/chromium-browser-snapshots/  //Other Download Host
https://npm.taobao.org/mirrors/chromium-browser-snapshots/   //Other Download Host

2.2 Using https://mirrors.huaweicloud.com/chromium-browser-snapshots/

            // Set options. The URL templates in BrowserFetcher already append
            // "/chromium-browser-snapshots/...", so the host is given without that segment.
            BrowserFetcherOptions browserFetcherOptions = new BrowserFetcherOptions();
            browserFetcherOptions.Host = "https://mirrors.huaweicloud.com";

            RevisionInfo revisionInfo = await new BrowserFetcher(browserFetcherOptions).DownloadAsync(BrowserFetcher.DefaultRevision);

            Console.WriteLine(browserFetcherOptions.Host);
            Console.WriteLine(browserFetcherOptions.Path);
            Console.WriteLine(browserFetcherOptions.Platform);

 

downloaded path:

A path for the downloads folder. Defaults to [root]/.local-chromium, where [root] is where the project binaries are located.

Windows
.local-chromium\download-Win64-706915.zip

Linux 
/var/www/Net5/.local-chromium/Linux-706915
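A deployment script can check whether a revision folder is already present before attempting a download. The sketch below rebuilds the folder name from the pattern visible in the paths above (.local-chromium/<Platform>-<revision>). That pattern is inferred from those sample paths, and the LocalChromium class is hypothetical, so treat this as an illustration rather than the library's own lookup logic.

```csharp
using System;
using System.IO;

public static class LocalChromium
{
    // Builds <root>/.local-chromium/<platform>-<revision>, following the sample paths above.
    public static string RevisionFolder(string root, string platform, int revision) =>
        Path.Combine(root, ".local-chromium", $"{platform}-{revision}");

    // True when the revision folder already exists on disk.
    public static bool IsDownloaded(string root, string platform, int revision) =>
        Directory.Exists(RevisionFolder(root, platform, revision));

    public static void Main()
    {
        // AppContext.BaseDirectory is where the application binaries are located.
        string root = AppContext.BaseDirectory;
        Console.WriteLine(RevisionFolder(root, "Linux", 706915));
        Console.WriteLine(IsDownloaded(root, "Linux", 706915));
    }
}
```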

How to download and reuse Chrome from a custom location

Problem

You want to download Chrome in a custom folder and you want to reuse Chrome from a location where it was previously downloaded instead of from the default location.

Solution

Use BrowserFetcherOptions to specify the full path for where to download Chrome.

var browserFetcherOptions = new BrowserFetcherOptions { Path = downloadPath };
var browserFetcher = new BrowserFetcher(browserFetcherOptions);
await browserFetcher.DownloadAsync(BrowserFetcher.DefaultRevision);

Use Puppeteer.LaunchAsync() with LaunchOptions with the LaunchOptions.ExecutablePath property set to the fully qualified path to the Chrome executable.

var options = new LaunchOptions { Headless = true, ExecutablePath = executablePath };

using (var browser = await Puppeteer.LaunchAsync(options))
using (var page = await browser.NewPageAsync())
{
    // use page
}

see:https://www.cnblogs.com/TTonly/p/10920294.html

BrowserFetcher: download through a proxy

 

 

Step 3: Load a webpage

    public class AdController : Controller
    {
        // GET: AdController
        public async Task<ActionResult> Index()
        {
            // Download the browser executable.
            // On the first run, the portable package download-Win64-536395.zip is downloaded to your
            // machine; unpacked, it contains a Chromium browser. This takes a while, depending on
            // your network speed.
            // When deploying elsewhere later, you can ship this folder along: at startup the fetcher
            // checks whether the folder exists to decide whether to download and unpack again.
            await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

            // Create a browser instance
            using var browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                Args = new string[] { "--no-sandbox" }
            });

            // Open a page
            using var page = await browser.NewPageAsync();

            // Set the viewport size
            await page.SetViewportAsync(new ViewPortOptions
            {
                Width = 1920,
                Height = 1080
            });

            var url = "https://juejin.im";
            await page.GoToAsync(url, WaitUntilNavigation.Networkidle0);
            var content = await page.GetContentAsync();
            Console.WriteLine(content);

            return View();
        }
    }

other 

How to use Proxy in PuppeteerSharp without using the system global proxy settings

 

private static readonly LaunchOptions puppeteer_launchOptions = new LaunchOptions 
    {
    Headless = false,
    IgnoreHTTPSErrors = true,
        Args = new [] {
            "--proxy-server=http://1.2.3.4:5678",
            "--no-sandbox",
            "--disable-infobars",
            "--disable-setuid-sandbox",
            "--ignore-certificate-errors",
        },
};

 

GitHub login

    await page.goto('https://github.com/login');
    // page.type takes the target selector and the text to type
    await page.type('#login_field', 'username');
    await page.type('#password', 'password');

    // Start waiting for the navigation before clicking, so the event is not missed
    await Promise.all([
        page.waitForNavigation(),
        page.click('#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block'),
    ]);
Source: https://www.cnblogs.com/dh-dh/p/8490047.html

baidu login

public static async Task<string> LogInAsync()
        {
            try
            {
                string ResultCookies = "";
                // Get the current user name
                string UserName = Environment.UserName;

                // Chrome.exe must exist at this location for the local browser to be used
                var currentDirectory = Path.Combine(@"C:\Users\", UserName, @"AppData\Local\Google\Chrome\Application\", "Chrome.exe");

                if (!File.Exists(currentDirectory))
                {
                    currentDirectory = Path.GetDirectoryName(AppDomain.CurrentDomain.BaseDirectory);
                    var downloadPath = Path.Combine(currentDirectory,  "LocalChromium");
                    Console.WriteLine($"Attemping to set up puppeteer to use Chromium found under directory {downloadPath} ");
                    if (!Directory.Exists(downloadPath))
                    {
                        Console.WriteLine("Custom directory not found. Creating directory");
                        Directory.CreateDirectory(downloadPath);

                        Console.WriteLine("Downloading Chromium");

                        var browserFetcherOptions = new BrowserFetcherOptions { Host = "https://npm.taobao.org/mirrors", Path = downloadPath }; // use the Taobao mirror
                        var browserFetcher = new BrowserFetcher(browserFetcherOptions);
                        await browserFetcher.DownloadAsync(BrowserFetcher.DefaultRevision);

                        var executablePath = browserFetcher.GetExecutablePath(BrowserFetcher.DefaultRevision);

                        if (string.IsNullOrEmpty(executablePath))
                        {
                            Console.WriteLine("Custom Chromium location is empty. Unable to start Chromium. Exiting.\n Press any key to continue");
                            Console.ReadLine();
                            return "Custom Chromium location is empty. Unable to start Chromium. Exiting.\n Press any key to continue";
                        }
                        Console.WriteLine($"Attemping to start Chromium using executable path: {executablePath}");
                        // GetExecutablePath already returns the full path to the browser executable
                        currentDirectory = executablePath;
                    }
                    else
                    {
                        // Set path. Note: a failed earlier download is not detected here
                        currentDirectory = Path.Combine(downloadPath, "Chromium.exe");
                    }

                }

                var options = new LaunchOptions
                {
                    Headless = false, // headful; set to true to run headless
                    ExecutablePath = currentDirectory, // local browser path
                    Args = new string[]
                    {
                        "--disable-infobars", // hide the automation infobar
                    }, // arguments are passed much as with WebDriver
                    DefaultViewport = new ViewPortOptions
                    {
                        Width = 500,
                        Height = 500,
                        IsMobile = true,
                        //DeviceScaleFactor = 2
                    },
                    //SlowMo=250, // slow down by 250ms
                };

                using (var browser = await Puppeteer.LaunchAsync(options))
                using (var page = await browser.NewPageAsync())
                {
                    // disable images to download
                    //await page.SetRequestInterceptionAsync(true);
                    //page.Request += (sender, e) =>
                    //{
                    //    if (e.Request.ResourceType == ResourceType.Image)
                    //        e.Request.AbortAsync();
                    //    else
                    //        e.Request.ContinueAsync();
                    //};
                    // Emulate a mobile device
                    DeviceDescriptor deviceOptions = Puppeteer.Devices.GetValueOrDefault(DeviceDescriptorName.IPhone7);
                    await page.EmulateAsync(deviceOptions);
                    //await page.SetUserAgentAsync("Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1");
        
                    await page.GoToAsync("https://www.baidu.com/");

                    // Log in
                    Console.WriteLine("Start Login!");
                    await page.GetContentAsync();
                    // Type the credentials
                    //ElementHandle input = await page.WaitForSelectorAsync("#search_form_input_homepage");
                    //await input.TypeAsync("Lorem ipsum dolor sit amet.");
                    await page.TypeAsync("input[name=id]", "yourname");
                    await page.TypeAsync("input[name=pwd]", "yourpassword");
                    await Task.WhenAll(page.ClickAsync("#login"), page.WaitForNavigationAsync());
                    Console.WriteLine("Finish Login!");

                    // Get the cookies
                    //CookieParam[] cookies = await page.GetCookiesAsync();
      
                    Console.WriteLine(ResultCookies);
                    Console.WriteLine("Press any key to continue...");
                    Console.ReadLine();
                }
                return ResultCookies;
            }
            catch (Exception ex)
            {
                return ex.Message;
            }
        }

Google account login

 

Log into a Google account with Puppeteer

Step 1: Try changing the user agent like so:

const browser = await require('puppeteer').launch({
    headless: false,
});

const page = await browser.newPage();
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136");

One reported solution works with headless: true on AppEngine.
The idea is to save the cookies to a file once you have logged in to Google (or to any site that uses Google login), and then load that cookie file at the beginning of the script. Google then treats the session as already logged in, so the login screen is skipped.
You only need to save the cookies once; after that, just load them.
To create the cookie file (here it was created on a local machine with headless: false):

const jsonfile = require('jsonfile')

const cookiesObj = await page.cookies()
jsonfile.writeFile('cookiesFile.json', cookiesObj, { spaces: 1 },
  function(err) {
    if (err) {
      console.log('The file could not be written.', err)
      return
    }
    console.log('Session has been successfully saved')
  })

to load the cookie file:

const fs = require('fs')
let cookiesFilePath = 'cookiesFile.json'
const previousSession = fs.existsSync(cookiesFilePath)
if (previousSession) {
  let rawdata = fs.readFileSync(cookiesFilePath);
   let cookiesArr = JSON.parse(rawdata);
  if (cookiesArr.length !== 0) {
    for (let cookie of cookiesArr) {
      await page.setCookie(cookie)
    }
    console.log('Session has been loaded in the browser!')
    
  }
}

 

 

Scraping search results from a website

Bing search

This example searches www.bing.com/maps for "CN Tower, Toronto, Ontario, Canada"

First, we will programmatically initiate an instance of the headless web browser, load a new tab and go to ‘https://www.bing.com/maps’:

// Create an instance of the browser and configure launch options
Browser browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
   Headless = true
});

// Create a new page and go to Bing Maps
Page page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

With the webpage successfully loaded in the headless browser, let’s interact with the webpage by searching for a local tourist attraction:

// Search for a local tourist attraction on Bing Maps
await page.WaitForSelectorAsync(".searchbox input");
await page.FocusAsync(".searchbox input");
await page.Keyboard.TypeAsync("CN Tower, Toronto, Ontario, Canada");
await page.ClickAsync(".searchIcon");
await page.WaitForNavigationAsync();

We’re able to use Puppeteer Sharp to interact with the JavaScript rendered HTML of Bing Maps and search for ‘CN Tower, Toronto, Ontario, Canada’!

If you would like to store the HTML to parse elements such as the address or description, you can easily store the HTML in a variable:

// Store the HTML of the current page
string content = await page.GetContentAsync();

Once you are finished, close the browser to free up resources:

// Close the browser
await browser.CloseAsync();

Puppeteer Sharp v5.0 Connect to a Remote Browser

One last feature of Puppeteer Sharp that I would like to mention is the ability to connect to a remote browser. This may be useful if you are using a serverless environment where installing a browser is not an option, such as the scalable ‘Azure Functions’.

const browser = await puppeteer.connect({ browserWSEndpoint: 'wss://chrome.browserless.io/' });
 
        /// <summary>
        /// Connect to a Remote Browser
        /// </summary>
        /// <returns></returns>
        public async Task<ActionResult> Index()
        {
            var connectOptions = new ConnectOptions()
            {
                BrowserWSEndpoint = "wss://chrome.browserless.io/"
            };

            var browser = await Puppeteer.ConnectAsync(connectOptions);

            Page page = await browser.NewPageAsync();

            // Record a trace of the page load
            await page.Tracing.StartAsync(new TracingOptions { Path = "C:\\Files\\trace.json" });

            await page.GoToAsync("http://basic.10jqka.com.cn/603906/operate.html#stockpage");

            // Stop tracing so the trace file is completed
            await page.Tracing.StopAsync();

            // HTML of the page after its JavaScript has run
            string content = await page.GetContentAsync();

            return View();
        }

 

Chromium vs ChromeDriver vs Chrome vs chromium-browser vs chromium_webview

 

 

Selenium

Selenium is the library whose API provides the FindBy, Click, and SendKeys calls used to script a browser.

 

WebDriver

ChromeDriver

ChromeDriver (chromedriver.exe) is a program that controls the Chrome browser.

chromedriver.exe is a standalone server that takes commands from Selenium Server and relays them to the browser's API as JSON commands. It is separate from the Chrome browser itself.

 

Chromium

Chromium is an open-source, free web browser, also called chromium-browser.

Chrome

Chrome (chrome.exe) is Google's proprietary browser, built from Chromium's source code.

 

1.MAKING CHROME HEADLESS UNDETECTABLE

https://intoli.com/blog/making-chrome-headless-undetectable/

https://intoli.com/blog/not-possible-to-block-chrome-headless

 

use puppeteer-extra-plugin-stealth

A plugin for puppeteer-extra to prevent detection

https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth

 

 

Useful links

Puppeteer Sharp

Running Chrome on Linux

Puppeteer Sharp Examples

Bypassing CAPTCHAs with Headless Chrome

Web crawling: using pyppeteer instead of Selenium to bypass WebDriver detection

 

The portable browser package download-Win64-706915.zip is downloaded from the network to your machine, but the code throws an exception: "System.Net.WebException"

https://www.cnblogs.com/cdyy/p/PuppeteerSharp.html