Distinct options for every new browser instance #107

Closed
NikolaiT opened this issue Feb 26, 2019 · 6 comments
Labels
discussion Talk about features or implementation

Comments

@NikolaiT

NikolaiT commented Feb 26, 2019

Hi!

First of all: very beautiful code and software. I should start learning TypeScript.

Is it possible to pass different options to different browser launches?

As I can see in the concurrency implementation of CONCURRENCY_BROWSER in src/concurrency/built-in/Browser.ts, every new browser is started with identical options:

let chrome = await this.puppeteer.launch(this.options) as puppeteer.Browser;

Would it be possible to pass different options to new launches of browser instances?

I ask because I want to set a different --proxy-server=some-proxy flag for each new browser launch.

Thanks for viewing

@NikolaiT
Author

NikolaiT commented Feb 27, 2019

OK, I managed to do this myself.

Here is the test case:

const { Cluster } = require('./dist/index.js');

(async () => {

    let browserArgs = [
        '--disable-infobars',
        '--window-position=0,0',
        '--ignore-certificate-errors',
        '--ignore-certificate-errors-spki-list',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--disable-gpu',
        '--window-size=1920,1080',
        '--hide-scrollbars',
        '--proxy-server=socks5://78.94.172.42:1080',
    ];

    // each new call to workerInstance() shift()s one element
    // off the front of this list, so
    // maxConcurrency should equal perBrowserOptions.length
    let perBrowserOptions = [
        {
            headless: false,
            ignoreHTTPSErrors: true,
            args: browserArgs.concat(['--proxy-server=socks5://78.94.172.42:1080'])
        },
        {
            headless: true,
            ignoreHTTPSErrors: true,
            args: browserArgs.concat(['--proxy-server=socks5://CENSORED'])
        },
    ];

    const cluster = await Cluster.launch({
        monitor: true,
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 2,
        puppeteerOptions: {
            headless: false,
            args: browserArgs,
            ignoreHTTPSErrors: true,
        },
        perBrowserOptions: perBrowserOptions
    });

    // Event handler to be called in case of problems
    cluster.on('taskerror', (err, data) => {
        console.log(`Error crawling ${data}: ${err.message}`);
    });


    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, {waitUntil: 'domcontentloaded', timeout: 20000});
        const pageTitle = await page.evaluate(() => document.title);
        console.log(`Page title of ${url} is ${pageTitle}`);
        console.log(await page.content());
    });

    await cluster.queue('http://ipinfo.io/json');
    await cluster.queue('http://ipinfo.io/json');
    // many more pages

    await cluster.idle();
    await cluster.close();
})();
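
The test case above puts a --proxy-server flag into the shared browserArgs and then appends another one per browser, so each launch ends up with two proxy flags. As a minimal sketch (the helper name buildPerBrowserOptions and the proxy addresses are assumptions for illustration, not part of puppeteer-cluster), the per-browser list could instead be generated from a list of proxy URLs so that every entry carries exactly one proxy flag:

```javascript
// Hypothetical helper: builds one launch-options object per proxy URL.
function buildPerBrowserOptions(baseArgs, proxyUrls, launchDefaults = {}) {
    // Drop any proxy flag already present in the shared args,
    // so each browser ends up with exactly one --proxy-server flag.
    const sharedArgs = baseArgs.filter((arg) => !arg.startsWith('--proxy-server='));
    return proxyUrls.map((proxyUrl) => ({
        ...launchDefaults,
        args: [...sharedArgs, `--proxy-server=${proxyUrl}`],
    }));
}

const perBrowserOptions = buildPerBrowserOptions(
    ['--no-sandbox', '--disable-gpu', '--proxy-server=socks5://10.0.0.1:1080'],
    ['socks5://10.0.0.1:1080', 'socks5://10.0.0.2:1080'], // placeholder addresses
    { headless: true, ignoreHTTPSErrors: true },
);
console.log(perBrowserOptions.length); // 2
```

The resulting array has the same shape as the hand-written perBrowserOptions list in the test case.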

Here is the diff:

diff --git a/src/Cluster.ts b/src/Cluster.ts
index c2ee9f0..23678f0 100644
--- a/src/Cluster.ts
+++ b/src/Cluster.ts
@@ -20,6 +20,7 @@ interface ClusterOptions {
     maxConcurrency: number;
     workerCreationDelay: number;
     puppeteerOptions: LaunchOptions;
+    perBrowserOptions: any;
     monitor: boolean;
     timeout: number;
     retryLimit: number;
@@ -42,6 +43,7 @@ const DEFAULT_OPTIONS: ClusterOptions = {
     puppeteerOptions: {
         // headless: false, // just for testing...
     },
+    perBrowserOptions: [],
     monitor: false,
     timeout: 30 * 1000,
     retryLimit: 0,
@@ -72,6 +74,8 @@ export default class Cluster extends EventEmitter {
     static CONCURRENCY_BROWSER = 3; // no cookie sharing and individual processes (uses contexts)
 
     private options: ClusterOptions;
+    private perBrowserOptions: any;
+    private usePerBrowserOptions: boolean = false;
     private workers: Worker[] = [];
     private workersAvail: Worker[] = [];
     private workersBusy: Worker[] = [];
@@ -139,7 +143,14 @@ export default class Cluster extends EventEmitter {
         } else if (this.options.concurrency === Cluster.CONCURRENCY_CONTEXT) {
             this.browser = new builtInConcurrency.Context(browserOptions, puppeteer);
         } else if (this.options.concurrency === Cluster.CONCURRENCY_BROWSER) {
+            this.perBrowserOptions = this.options.perBrowserOptions;
+            if (this.perBrowserOptions.length !== this.options.maxConcurrency) {
+                debug('Not enough perBrowserOptions! perBrowserOptions.length must equal maxConcurrency');
+            } else {
+                this.usePerBrowserOptions = true;
+            }
             this.browser = new builtInConcurrency.Browser(browserOptions, puppeteer);
+
         } else if (typeof this.options.concurrency === 'function') {
             this.browser = new this.options.concurrency(browserOptions, puppeteer);
         } else {
@@ -165,12 +176,17 @@ export default class Cluster extends EventEmitter {
         this.nextWorkerId += 1;
         this.lastLaunchedWorkerTime = Date.now();
 
+        let nextBrowserOption: any; // stays undefined so Browser.ts falls back to this.options
+        if (this.usePerBrowserOptions && this.perBrowserOptions.length > 0) {
+            nextBrowserOption = this.perBrowserOptions.shift();
+        }
+
         const workerId = this.nextWorkerId;
 
         let workerBrowserInstance: WorkerInstance;
         try {
             workerBrowserInstance = await (this.browser as ConcurrencyImplementation)
-                .workerInstance();
+                .workerInstance(nextBrowserOption);
         } catch (err) {
             throw new Error(`Unable to launch browser for worker, error message: ${err.message}`);
         }
diff --git a/src/concurrency/ConcurrencyImplementation.ts b/src/concurrency/ConcurrencyImplementation.ts
index ce1a1bc..7550467 100644
--- a/src/concurrency/ConcurrencyImplementation.ts
+++ b/src/concurrency/ConcurrencyImplementation.ts
@@ -34,7 +34,7 @@ export default abstract class ConcurrencyImplementation {
     /**
      * Creates a worker and returns it
      */
-    public abstract async workerInstance(): Promise<WorkerInstance>;
+    public abstract async workerInstance(perBrowserOptions: any): Promise<WorkerInstance>;
 
 }
 
diff --git a/src/concurrency/built-in/Browser.ts b/src/concurrency/built-in/Browser.ts
index 9f29753..b3232a6 100644
--- a/src/concurrency/built-in/Browser.ts
+++ b/src/concurrency/built-in/Browser.ts
@@ -11,8 +11,8 @@ export default class Browser extends ConcurrencyImplementation {
     public async init() {}
     public async close() {}
 
-    public async workerInstance(): Promise<WorkerInstance> {
-        let chrome = await this.puppeteer.launch(this.options) as puppeteer.Browser;
+    public async workerInstance(perBrowserOptions: any): Promise<WorkerInstance> {
+        let chrome = await this.puppeteer.launch(perBrowserOptions || this.options) as puppeteer.Browser;
         let page: puppeteer.Page;
         let context: any; // puppeteer typings are old...
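
The hand-out logic in the diff (take the next per-browser entry, otherwise fall back to the shared options) can be tried in isolation. The following is only a sketch of that idea in plain JavaScript; createOptionAllocator is a hypothetical name, not something in the patch. Note that an empty object is truthy in JavaScript, so the sketch signals "nothing specific" with undefined instead of relying on `||`:

```javascript
// Hypothetical stand-in for the per-worker option hand-out.
function createOptionAllocator(perBrowserOptions, defaultOptions) {
    const remaining = [...perBrowserOptions]; // copy, so the caller's array is not consumed
    return function nextLaunchOptions() {
        // shift() returns undefined once the list is empty
        const specific = remaining.shift();
        // Explicit undefined check: `{}` is truthy, so `specific || defaultOptions`
        // would never fall back for an empty options object.
        return specific !== undefined ? specific : defaultOptions;
    };
}

const nextLaunchOptions = createOptionAllocator(
    [{ headless: false }, { headless: true, args: ['--no-sandbox'] }],
    { headless: true },
);
console.log(nextLaunchOptions()); // first per-browser entry
console.log(nextLaunchOptions()); // second per-browser entry
console.log(nextLaunchOptions()); // list exhausted: shared defaults
```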

@thomasdondorf
Owner

thomasdondorf commented Feb 27, 2019

Thank you. The parallelization part of the code will probably be rewritten in the not-too-distant future, which should make this use case easier to implement. It is already possible today with a custom concurrency implementation, but that is just too complicated for now...

// Edit: I hope it's okay I closed this? Otherwise feel free to re-open :)
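
For readers who want to try the custom concurrency route without patching the library, here is a rough sketch. It assumes, based on the diff earlier in this thread, that the cluster instantiates a function-valued `concurrency` option as `new Impl(options, puppeteer)` and then calls `init()`, `workerInstance()` and `close()`. The shape of the returned worker (`jobInstance`, `close`, `repair`) and the class name PerBrowserOptionsConcurrency are assumptions and should be checked against the installed version. Unlike the built-in implementation, this sketch opens a plain page rather than a separate browser context.

```javascript
// Sketch only: a duck-typed concurrency implementation, no library import needed.
class PerBrowserOptionsConcurrency {
    constructor(options, puppeteer) {
        this.options = options;     // shared puppeteerOptions passed by the cluster
        this.puppeteer = puppeteer;
        this.launchCount = 0;
    }

    async init() {}
    async close() {}

    async workerInstance() {
        // Pick the overrides for this launch, cycling if there are more workers than entries.
        const overrides = PerBrowserOptionsConcurrency.perBrowserOptions;
        const specific = overrides.length > 0
            ? overrides[this.launchCount % overrides.length]
            : {};
        this.launchCount += 1;
        const launchOptions = { ...this.options, ...specific };
        let browser = await this.puppeteer.launch(launchOptions);

        return {
            // Assumed contract: one page per job, closed when the job ends.
            jobInstance: async () => {
                const page = await browser.newPage();
                return {
                    resources: { page },
                    close: async () => { await page.close(); },
                };
            },
            close: async () => { await browser.close(); },
            // Relaunch with the same options if the browser crashed.
            repair: async () => {
                try { await browser.close(); } catch (err) { /* browser already gone */ }
                browser = await this.puppeteer.launch(launchOptions);
            },
        };
    }
}

// Fill this before launching the cluster, one entry per browser.
PerBrowserOptionsConcurrency.perBrowserOptions = [];
```

If those assumptions hold, passing `concurrency: PerBrowserOptionsConcurrency` to `Cluster.launch` after filling `PerBrowserOptionsConcurrency.perBrowserOptions` should give each worker its own launch options.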

@lazybotter

Thanks for the great module!

Is this possible yet? I need to set a different http proxy per browser instance.

Could some kind of event be fired before each launch (beforeLaunch or something like that), so that we can configure each browser/page instance?

The approach above would not work for my application, as I need to dynamically queue tasks, fetched from a server, to the cluster object every X minutes.

Thanks
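
To make the beforeLaunch idea concrete: nothing in this thread indicates that puppeteer-cluster provides such an event, so the following is purely a sketch of how a hook of that kind could behave. launchWithHook, pickProxy and the proxy hosts are hypothetical names; the only real call assumed is `puppeteer.launch(options)`.

```javascript
// Hypothetical hook runner: `beforeLaunch` is NOT a puppeteer-cluster API.
async function launchWithHook(puppeteer, baseOptions, workerId, beforeLaunch) {
    // Hand the hook a copy so the shared defaults are never mutated.
    const proposed = { ...baseOptions, args: [...(baseOptions.args || [])] };
    const returned = beforeLaunch
        ? await beforeLaunch({ workerId, options: proposed })
        : undefined;
    // The hook may return fresh options or simply edit the copy in place.
    return puppeteer.launch(returned || proposed);
}

// Example hook: choose an HTTP proxy per worker (placeholder hosts).
const proxyPool = ['http://proxy-a.example:8080', 'http://proxy-b.example:8080'];
function pickProxy({ workerId, options }) {
    const proxyUrl = proxyPool[workerId % proxyPool.length];
    options.args.push(`--proxy-server=${proxyUrl}`);
}
```

Because the hook runs at launch time rather than consuming a fixed list up front, it would also cover workers that are started later, for example when tasks are queued periodically.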

@TahorSuiJuris

TahorSuiJuris commented Jun 25, 2019

Thank you for a wonderful module. I am incorporating proxy rotation. NOTE: A commonly cited limitation of Puppeteer is that proxies can only be set at the browser level, not the page level, so every page (browser tab) must use the same proxy. To use a different proxy for each page, one needs the proxy-chain module.

Below is the current code under development. Comments from anyone who has accomplished this are greatly appreciated.

httpbin.org/ip is being used to confirm proxy switch.

const { Cluster } = require('puppeteer-cluster');
const ProxyChain = require('proxy-chain');

// Upstream proxies, keyed by the User-Agent header of the incoming request
const proxies = {
    'useragent1': 'http://proxyusername1:proxypassword1@proxyhost1:proxyport1',
    'useragent2': 'http://proxyusername2:proxypassword2@proxyhost2:proxyport2',
    'useragent3': 'http://proxyusername3:proxypassword3@proxyhost3:proxyport3',
};

(async () => {
    // Start the local proxy once, not once per task:
    // only one server can listen on port 8000
    const server = new ProxyChain.Server({
        port: 8000,
        prepareRequestFunction: ({ request }) => {
            const userAgent = request.headers['user-agent'];
            // null means "no upstream proxy" for unknown user agents
            return { upstreamProxyUrl: proxies[userAgent] || null };
        },
    });
    await server.listen();
    console.log('proxy server started');

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
        // Assumption: the browser has to be pointed at the local proxy-chain server
        puppeteerOptions: { args: ['--proxy-server=http://127.0.0.1:8000'] },
    });

    await cluster.task(async ({ page, data: url }) => {
        // NOTE: the lookup above only matches if the page sends one of the
        // user agents listed in `proxies` (e.g. set via page.setUserAgent)
        await page.goto(url);

        // Build a timestamp prefix for the screenshot file name
        const currentdate = new Date();
        const datetime = "_" + (currentdate.getMonth() + 1) + "/"
            + currentdate.getDate() + "/"
            + currentdate.getFullYear() + " @ "
            + currentdate.getHours() + ":"
            + currentdate.getMinutes() + ":"
            + currentdate.getSeconds();
        let fileName = datetime.replace(/(\. |\&|\.\r|\, |\  |\ |\-|\,|\r\n|\n|\r|\.|\/|:|%|#)/gm, "_");
        if (fileName.length > 100) {
            fileName = fileName.substring(0, 100);
        }
        const url2 = page.url();

        const screen = fileName + '_' + url2.replace(/[^a-zA-Z]/g, '_') + '.png'; // ☑ added timestamp
        await page.screenshot({ path: './screenshots/' + screen }); // size is 800x600
        console.log(`Screenshot of: ${url2} saved: ${screen}`);
    });

    cluster.queue('http://httpbin.org/ip');
    cluster.queue('http://www.google.com/');
    cluster.queue('http://httpbin.org/ip');
    cluster.queue('http://www.wikipedia.org/');
    cluster.queue('http://httpbin.org/ip');

    await cluster.idle();
    await cluster.close();
    await server.close(true); // also drop any open proxy connections
})();
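
The user-agent lookup used for prepareRequestFunction in the snippet above can be pulled out into a plain function and checked without starting a proxy or a browser. This is only a sketch: makePrepareRequestFunction is a hypothetical name, and the `{ request }` argument and `{ upstreamProxyUrl }` return value follow the usage shown above, with `null` meaning that no upstream proxy is used.

```javascript
// Hypothetical factory for the lookup passed as prepareRequestFunction.
function makePrepareRequestFunction(proxiesByUserAgent) {
    return ({ request }) => {
        const userAgent = request.headers['user-agent'];
        // Own-property check avoids matching inherited keys such as "toString".
        const known = Object.prototype.hasOwnProperty.call(proxiesByUserAgent, userAgent);
        return { upstreamProxyUrl: known ? proxiesByUserAgent[userAgent] : null };
    };
}

const prepare = makePrepareRequestFunction({
    useragent1: 'http://user1:pass1@proxyhost1:8001', // placeholder credentials and hosts
    useragent2: 'http://user2:pass2@proxyhost2:8002',
});
console.log(prepare({ request: { headers: { 'user-agent': 'useragent2' } } }));
// { upstreamProxyUrl: 'http://user2:pass2@proxyhost2:8002' }
```

The returned function could then be passed as `prepareRequestFunction` when constructing the proxy-chain server.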

@xL0b0

xL0b0 commented Sep 27, 2022

> OK, I managed to do this myself.
>
> Here is the test case:

Could you share the modified files from puppeteer cluster that you used to get this working?

@bobitza

bobitza commented May 30, 2024

How can I give each running task instance a different IP?
