
Introduction

Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.

“But you should use an API for this!”

However, not every website offers an API, and APIs don't always expose every piece of information you need. So scraping is often the only solution for extracting website data.

There are many use cases for web scraping:

  • E-commerce price monitoring
  • News aggregation
  • Lead generation
  • SEO (search engine result page monitoring)
  • Bank account aggregation (Mint in the US, Bankin’ in Europe)
  • Individuals and researchers building datasets otherwise not available.

The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except Google - they all want to be scraped by Google).

So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.

This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.

Emulate Human Tools: Headless Chrome

Why Use Headless Browsing?

When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.

The thing is, if you just do: curl www.google.com, Google has many ways to know that you are not a human (for example, by looking at the headers). Headers are small pieces of information that accompany every HTTP request that hits a server. One of those pieces of information precisely describes the client making the request: the infamous “User-Agent” header. Just by looking at the “User-Agent” header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great. As an experiment, just go over here: this webpage simply displays the header information of your request.

Headers are easy to alter with cURL, and copying the User-Agent header of a legitimate browser could do the trick. In the real world, you'd need to set more than one header, but it is not difficult to artificially forge an HTTP request with cURL or any library so that it looks exactly like a request made with a browser. Everybody knows this. So, to determine whether you are using a real browser, websites will check for something that cURL and plain HTTP libraries cannot do: executing Javascript code.
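For illustration, here is a minimal sketch (Node.js 18+, using the built-in fetch) of forging browser-like headers from a script; the header values are illustrative rather than copied from any particular Chrome build.

```javascript
// A minimal sketch: sending a request whose headers mimic a desktop Chrome browser.
// The header values below are illustrative, not guaranteed to match any real Chrome build.
(async () => {
  const response = await fetch('https://www.example.com', {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
    },
  });
  console.log(response.status, response.headers.get('content-type'));
})();
```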

Do you speak Javascript?

The concept is simple: the website embeds a Javascript snippet in its page that, once executed, will “unlock” the webpage. If you're using a real browser, you won't notice the difference. If you're not, you'll receive an HTML page with some obscure Javascript code in it:

an actual example of such a snippet
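The original snippet isn't reproduced here, but a purely hypothetical challenge could look something like the sketch below: the server injects a seed, the client computes a token from it, stores it in a cookie, and reloads the page.

```javascript
// A purely hypothetical challenge snippet (not the one from the screenshot above):
// the server embeds a seed, the client must derive a token and set it as a cookie,
// and only then does a reload return the real content.
(function () {
  var seed = 1837465; // value injected by the server
  var token = 0;
  for (var i = 0; i < 1000; i++) {
    token = (token * 31 + seed + i) % 999983;
  }
  document.cookie = 'js_challenge=' + token + '; path=/';
  window.location.reload();
})();
```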

Once again, this solution is not completely bulletproof, mainly because it is now very easy to execute Javascript outside of a browser with Node.js. However, the web has evolved and there are other tricks to determine if you are using a real browser.

Headless Browsing

Trying to execute Javascript snippets on the side with Node.js is difficult and not robust. More importantly, as soon as the website has a more complicated check system or is a big single-page application, cURL and pseudo-JS execution with Node.js become useless. So the best way to look like a real browser is to actually use one.

Headless Browsers will behave like a real browser except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.

The easiest way to use Headless Chrome is by calling a driver that wraps all its functionality into an easy API. Selenium, Playwright, and Puppeteer are the three most famous solutions.
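As a quick taste, here is a minimal sketch using Puppeteer (assuming you have installed the puppeteer npm package): it launches headless Chrome, sets a believable User-Agent, waits for the page's Javascript to settle, and reads the title.

```javascript
// A minimal sketch: launch headless Chrome, let Javascript finish, read the page title.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // A believable User-Agent; the exact string is illustrative.
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  await page.goto('https://www.example.com', { waitUntil: 'networkidle2' });
  console.log(await page.title());
  await browser.close();
})();
```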

However, it will not be enough as websites now have tools that detect headless browsers. This arms race has been going on for a long time.

While these solutions can be easy to do on your local computer, it can be trickier to make this work at scale.

Managing lots of Chrome headless instances is one of the many problems we solve at ScrapingBee

Tired of getting blocked while scraping the web? Our API handles headless browsers and rotates proxies for you.

Browser Fingerprinting

Everyone, especially front-end devs, knows that every browser behaves differently. Sometimes it's about rendering CSS, sometimes Javascript, and sometimes just internal properties. Most of these differences are well known, and it is now possible to detect whether a browser is actually who it pretends to be. In other words, the website asks: “do all of the browser properties and behaviors match what I know about the User-Agent sent by this browser?”

This is why there is an everlasting arms race between web scrapers who want to pass themselves as a real browser and websites who want to distinguish headless from the rest.

However, in this arms race, web scrapers tend to have a big advantage. Here is why:

Screenshot of Chrome malware alert

Most of the time, when Javascript code tries to detect whether it's being run in headless mode, it is malware trying to evade behavioral fingerprinting. The malware will behave nicely inside a scanning environment and badly inside real browsers. This is why the team behind Chrome's headless mode works to make it indistinguishable from a real user's web browser, in order to stop malware from doing that. Web scrapers can profit from this effort.

Another thing to know is that, while running 20 cURL requests in parallel is trivial, and Chrome Headless is relatively easy to use for small use cases, it can be tricky to put at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.

If you want to learn more about browser fingerprinting I suggest you take a look at Antoine Vastel's blog, which is entirely dedicated to this subject.

That's about all you need to know about pretending to be using a real browser. Let's now take a look at how to behave like a real human.

TLS Fingerprinting

What is it?

TLS stands for Transport Layer Security and is the successor of SSL, which is what the “S” in HTTPS stands for.

This protocol ensures privacy and data integrity between two or more communicating computer applications (in our case, a web browser or a script and an HTTP server).

Similar to browser fingerprinting, the goal of TLS fingerprinting is to uniquely identify users based on the way they use TLS.

How this protocol works can be split into two big parts.

First, when the client connects to the server, a TLS handshake happens. During this handshake, many messages are exchanged between the two to ensure that each party is actually who it claims to be.

Then, if the handshake has been successful the protocol describes how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.

Most of the data points used to build the fingerprint come from the TLS handshake. If you want to see what a TLS fingerprint looks like, you can visit this awesome online database.

On this website, you can see that the most-used fingerprint of the last week was seen 22.19% of the time (at the time of writing this article).

A TLS fingerprint

This number is very big: at least two orders of magnitude higher than the share of the most common browser fingerprint. It actually makes sense, as a TLS fingerprint is computed using far fewer parameters than a browser fingerprint.

Those parameters are, amongst others:

  • TLS version
  • Handshake version
  • Cipher suites supported
  • Extensions

If you wish to know what your TLS fingerprint is, I suggest you visit this website.

How do I change it?

Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.

Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won't work. Your fingerprint will be so rare that it will be instantly flagged as fake.

Secondly, TLS parameters are low-level stuff that relies heavily on system dependencies, so changing them is not straightforward.

For example, the famous Python requests module doesn't support changing the TLS fingerprint out of the box. Here are a few resources for changing your TLS version and cipher suite in your favorite language:

  • Python with HTTPAdapter and requests
  • NodeJS with the TLS package
  • Ruby with OpenSSL

Keep in mind that most of these libraries rely on your system's SSL and TLS implementation. OpenSSL is the most widely used, and you might need to change its version in order to completely alter your fingerprint.
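For instance, in Node.js you can restrict the TLS versions and cipher suites offered during the handshake through the standard https/tls options. The sketch below is illustrative, and the cipher list is an assumption, not a recommended fingerprint.

```javascript
// A minimal sketch (Node.js): restricting the cipher suites and TLS versions offered
// during the handshake, which changes part of the resulting TLS fingerprint.
// The cipher list below is illustrative; a rare combination can make you stand out more.
const https = require('https');

const agent = new https.Agent({
  minVersion: 'TLSv1.2',
  maxVersion: 'TLSv1.3',
  ciphers: 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256',
});

https.get('https://www.example.com', { agent }, (res) => {
  console.log('Status:', res.statusCode);
  res.resume();
});
```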

Emulate Human Behaviour: Proxy, Captcha Solving and Request Patterns

Proxy Yourself

A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you have to trick it into thinking that all those requests come from different places in the world, i.e. different IP addresses. In other words, you need to use proxies.

Proxies are not very expensive: ~$1 per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxy IPs need to be constantly monitored so you can discard the ones that stop working and replace them.

There are several proxy solutions on the market; here are the most-used rotating proxy providers: Luminati Network, Blazing SEO and SmartProxy.

There are also a lot of free proxy lists, but I don’t recommend using them: they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Free proxy lists are usually public, and therefore their IPs will be automatically banned by most websites. Proxy quality is important: anti-crawling services are known to maintain internal lists of proxy IPs, so any traffic coming from those IPs will also be blocked. Be careful to choose proxies with a good reputation. This is why I recommend using a paid proxy network, or building your own.

Another proxy type that you could look into is mobile 3G and 4G proxies. These are helpful for scraping hard-to-scrape, mobile-first websites, like social media.

To build your own proxy you could take a look at Scrapoxy, a great open-source API that allows you to build a proxy API on top of different cloud providers. Scrapoxy creates a proxy pool by spinning up instances on various cloud providers (AWS, OVH, Digital Ocean). You can then configure your client to use the Scrapoxy URL as its main proxy, and Scrapoxy will automatically assign a proxy from the pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist, …), but it can be a little tedious to put in place.

You could also use the TOR network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin, which makes network surveillance and traffic analysis very difficult. There are a lot of use cases for TOR, such as privacy, freedom of speech, journalists working under dictatorship regimes, and of course, illegal activities. In the context of web scraping, TOR can hide your IP address and change your bot’s IP address every 10 minutes. However, the TOR exit nodes’ IP addresses are public, and some websites block TOR traffic using a simple rule: if the server receives a request from one of the public TOR exit nodes, it blocks it. That’s why, in many cases, TOR won’t help you compared to classic proxies. It's also worth noting that traffic through TOR is inherently much slower because of the multiple routing hops.
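Whichever proxy source you end up with, pointing headless Chrome at it is straightforward. The sketch below uses Puppeteer's --proxy-server argument; the proxy address and credentials are placeholders.

```javascript
// A minimal sketch: routing headless Chrome traffic through a proxy.
// The proxy endpoint and credentials below are placeholders.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--proxy-server=http://my-proxy-host:3128'], // hypothetical proxy endpoint
  });
  const page = await browser.newPage();
  // If the proxy requires authentication:
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://www.example.com');
  console.log(await page.title());
  await browser.close();
})();
```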

Captchas

Sometimes proxies will not be enough. Some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time, CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the remaining cases, you'll need to use a CAPTCHA-solving service (2Captcha and Death by Captcha come to mind).

While some CAPTCHAs can be automatically solved with optical character recognition (OCR), the most recent ones have to be solved by hand.

Old captcha, breakable programmatically
Google ReCaptcha V2

If you use the aforementioned services, on the other side of the API call you'll have hundreds of people solving CAPTCHAs for as little as 20 cents an hour.

But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your data extraction process.

Request Pattern

Another advanced tool websites use to detect scraping is pattern recognition. So if you plan to scrape every ID from 1 to 10,000 at the URL www.example.com/product/, try not to do it sequentially or at a constant request rate. You could, for example, maintain a set of the integers from 1 to 10,000, randomly pick one from the set, and scrape that product, as in the sketch below.
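Here is a minimal sketch of that idea (Node.js 18+, built-in fetch): shuffle the IDs, then add a random delay between requests. The URL pattern is the illustrative one from above.

```javascript
// A minimal sketch: scrape product IDs 1..10000 in random order with a jittered delay,
// instead of a predictable sequential loop at a constant rate.
const ids = Array.from({ length: 10000 }, (_, i) => i + 1);

// Fisher-Yates shuffle so the IDs are visited in random order
for (let i = ids.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1));
  [ids[i], ids[j]] = [ids[j], ids[i]];
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  for (const id of ids) {
    const res = await fetch(`https://www.example.com/product/${id}`);
    const html = await res.text();
    // ... parse the HTML here ...
    console.log(id, res.status, html.length);
    // Random pause between 2 and 10 seconds to avoid a constant request rate
    await sleep(2000 + Math.random() * 8000);
  }
})();
```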

Some websites also keep statistics on browser fingerprints per endpoint. This means that if you don't vary some parameters in your headless browser while targeting a single endpoint, they might block you.

Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam.

But from experience, I can tell you that rate is the most important factor in “Request Pattern Recognition”, so the slower you scrape, the less chance you have of being discovered.

Emulate Machine Behaviour: Reverse Engineering of APIs

Sometimes, the server expects the client to be a machine. In these cases, hiding yourself is way easier.

Reverse Engineering of APIs

Basically, this “trick” comes down to two things:

  1. Analyzing a web page's behaviour to find interesting API calls
  2. Forging those API calls with your code

For example, let's say that I want to get all the comments of a famous social network. I notice that when I click on the “load more comments” button, this happens in my inspector:

Request being made when clicking more comments

Notice that we filter out every request except “XHR” ones to avoid noise.

When we look at which request is being made and what response we get back… bingo!

Request response

Now, if we look at the “Headers” tab, we should have everything we need to replay this request and understand the value of each parameter. This will allow us to make the request from a simple HTTP client.

HTTP Client response
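As an illustration, replaying such a call from a plain HTTP client might look like the sketch below (Node.js 18+, built-in fetch); the endpoint, query parameters, and header values are placeholders for whatever shows up in your own inspector.

```javascript
// A minimal sketch: replaying an API call discovered in the inspector with a plain HTTP client.
// Endpoint, parameters, and headers are placeholders for the values from the "Headers" tab.
(async () => {
  const response = await fetch('https://social.example.com/api/comments?postId=12345&offset=20', {
    headers: {
      'User-Agent': 'Mozilla/5.0 ...', // copy the value your browser sent
      'Accept': 'application/json',
      'X-Requested-With': 'XMLHttpRequest',
    },
  });
  const comments = await response.json();
  console.log(comments);
})();
```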

The hardest part of this process is understanding the role of each parameter in the request. Know that you can right-click on any request in the Chrome DevTools inspector, export it in HAR format and then import it into your favorite HTTP client (I love Paw and Postman).

This will allow you to have all the parameters of a working request laid out, and will make your experimentation much faster and more fun.

Previous request imported in Paw

Reverse-Engineering of Mobile Apps

The same principles apply when it comes to reverse engineering mobile apps. You will want to intercept the requests your mobile app makes to the server and replay them with your code.

Doing this is hard for two reasons:

  • To intercept requests, you will need a man-in-the-middle proxy (Charles Proxy, for example)
  • Mobile apps can fingerprint their requests and obfuscate them more easily than a web app can

For example, when Pokemon Go was released a few years ago, tons of people cheated the game after reverse-engineering the requests the mobile app made.

What they did not know was that the mobile app was sending a “secret” parameter that was not sent by the cheating script. It was then easy for Niantic to identify the cheaters, and a few weeks later a massive number of players were banned.

Also, here is an interesting example about someone who reverse-engineered the Starbucks API.

Conclusion

Here is a recap of all the anti-bot techniques we saw in this article:

Anti-bot technique | Countermeasure
------------------ | --------------
Browser fingerprinting | Headless browsers
IP rate limiting | Rotating proxies
Banning data-center IPs | Residential IPs
TLS fingerprinting | Forging and rotating TLS fingerprints
Captchas on suspicious activity | All of the above
Systematic captchas | Captcha-solving tools and services

I hope that this overview will help you understand web scraping, and that you learned a lot reading this article.

We leverage everything I talked about in this post at ScrapingBee. Our web scraping API handles thousands of requests per second without ever being blocked. If you don’t want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us :).

We recently published a guide about the best web scraping tools on the market, don't hesitate to take a look!

Learn how to scrape webpages using Puppeteer and Serverless Functions built with OpenFaaS.

Introduction to web testing and scraping

In this post I’ll introduce you to Puppeteer and show you how to use it to automate and scrape websites using OpenFaaS functions.

There are two main reasons you may want to automate a web browser:

  • to run compliance and end-to-end tests against your application
  • to gather information from a webpage which doesn’t have an API available

When testing an application, there are numerous options, and these fall into two categories: rendered webpages, running with JavaScript in a real browser, and text-based tests, which can only parse static HTML. As you may imagine, loading a full web browser in memory is a heavy-weight task. In a previous position I worked heavily with Selenium, which has language bindings for C#, Java, Python, Ruby and other languages. Whilst our team tried to implement most of our tests in the unit-testing layer, there were instances where automated web tests added value, and they meant that the QA team could be involved in the development cycle by writing User Acceptance Tests (UATs) before the developers had started coding.

Selenium is still popular in the industry, and it inspired the W3C Working Draft of the WebDriver API that browsers can implement to make testing easier.

The other use case is not to test websites, but to extract information from them when an API is not available, or does not have the endpoints required. In some instances you see a mixture of both use cases; for instance, a company may file tax documents through a web page using automated web browsers when that particular jurisdiction doesn’t provide an API.

Kicking the tires with AWS Lambda

I recently learned of a friend who offers a search for Trademarks through his SaaS product, and for that purpose he chose a more modern alternative to Selenium called Puppeteer. In fact, if you search StackOverflow or Google for “scraping and Lambda”, you will likely see “Puppeteer” mentioned along with “headless-chrome.” I was curious to try out Puppeteer with AWS Lambda, and the path was less than ideal, with friction at almost every step of the way.

  • The popular chrome-aws-lambda npm module is over 40MB in size because it ships a static binary, meaning it can’t be uploaded as a regular Lambda zip file, or as a Lambda layer
  • The zip file needs to be uploaded through a separate AWS S3 bucket in the same region as the function
  • The layer can then be referenced from your function.
  • Local testing is very difficult, and there are many StackOverflow issues about getting the right combination of npm modules

I am sure that this can be done, and is being run at scale. It could be quite compelling for small businesses if they don’t spend too much time fighting the above, and can stay within the free-tier.

Getting the title of a simple webpage - 15.5s

That said, OpenFaaS can run anywhere, even on a 5-10 USD VPS, and because OpenFaaS uses containers, it got me thinking.

Is there another way?

I wanted to see if the experience would be any better with OpenFaaS, so I set out to get Puppeteer working with it. This isn’t the first time I’ve been here; it’s something that I’ve come back to from time to time. Today, things seem even easier, with a pre-compiled headless Chrome browser being available from buildkite.com.

Typical tasks involve logging into a portal and taking screenshots. Anecdotally, when I ran a simple test to navigate to a blog and take a screenshot, this took 15.5s in AWS Lambda, but only 1.6s running locally within OpenFaaS on my laptop. I was also able to build and test the function locally, the same way as in the cloud.

Walkthrough

We’ll now walk through the steps to set up a function with Node.js and Puppeteer, so that you can adapt an example and try out your existing tests that you may have running on AWS Lambda.

OpenFaaS features for web-scraping

What are the features we can leverage from OpenFaaS?

  • Extend the function’s timeout to whatever we want
  • Run the invocation asynchronously, and in parallel
  • Get a HTTP callback with the result when done, such as a screenshot or test result in JSON
  • Limit concurrency with max_inflight environment variable in our stack.yml file to prevent overloading the container
  • Trigger the invocations from cron, or events like Kafka and NATS
  • Get rate, error and duration (RED) metrics from Prometheus, and view them in Grafana

OpenFaaS deployment options

We have made OpenFaaS as easy as possible to deploy on a single VM or on a Kubernetes cluster.

  • Deploy to a single VM if you are new to containers and just want to kick the tires whilst keeping costs low. This is also ideal if you only have a few functions, or are worried about needing to learn Kubernetes.

    See also: Bring a lightweight Serverless experience to DigitalOcean with Terraform and faasd

  • Deploy to a Kubernetes cluster. This is the standard option we recommend for production usage. Through the use of containers and Kubernetes, OpenFaaS can be deployed and run at scale on any cloud.

    Many cloud providers have their own managed Kubernetes services which means it’s trivial to get a working cluster. You just click a button and deploy OpenFaaS, then you can start deploying functions. The DigitalOcean and Linode Kubernetes services are particularly economic.

Deploy Kubernetes and OpenFaaS on your computer

In this post we’ll be running Kubernetes on your laptop, meaning that you don’t have to spend any money on public cloud to start trying things out. The tutorial should take you 15-30 minutes to try.

For the impatient, our arkade tool can get you up and running in less than 5 minutes. You’ll just need to have Docker installed on your computer.

The arkade info openfaas command will print out everything you need to log in and get a connection to your OpenFaaS gateway UI.

Create a function with the puppeteer-node12 template

Let’s get the title of a webpage passed in via a JSON HTTP body, then return the result as JSON.

Now edit ./scrape-title/handler.js
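The handler below is a sketch of what scrape-title/handler.js could look like. It assumes the puppeteer-node12 template bundles the puppeteer module and uses the standard OpenFaaS Node.js (event, context) handler signature; the url field in the JSON body and the sandbox flags are assumptions.

```javascript
'use strict'
// Sketch of scrape-title/handler.js, assuming the standard (event, context) signature.
const puppeteer = require('puppeteer');

module.exports = async (event, context) => {
  // Expect a JSON body such as {"url": "https://example.com"} (assumed field name)
  const url = (event.body && event.body.url) || 'https://example.com';

  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'], // often required inside containers
  });

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    return context.status(200).succeed({ url, title });
  } finally {
    await browser.close();
  }
};
```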

Deploy and test the scrape-title function

Deploy the scrape-title function to OpenFaaS.

You can run faas-cli describe FUNCTION to get a synchronous or asynchronous URL for use with curl along with whether the function is ready for invocations. The faas-cli can also be used to invoke functions and we’ll do that below.

Try invoking the function synchronously:

Running with time curl was 10 times faster than my test with AWS Lambda with 256MB RAM allocated.

Alternatively run async:

Run async, post the response to another service like requestbin or another function:

Example of a result posted back to RequestBin

Each invocation has a unique X-Call-Id header, which can be used for tracing and connecting requests to asynchronous responses.

Take a screenshot and return it as a PNG file

One of the limitations of AWS Lambda is that it can only return a JSON response. Whilst there may be good reasons for this approach, OpenFaaS allows binary input and responses for functions.

Let’s try taking a screenshot of the page, and capturing it to a file.

Edit ./screenshot-page/handler.js
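Again as a sketch, screenshot-page/handler.js could capture the page as a PNG buffer and return it with an image/png content type; the handler signature and the context.headers() setter are assumed from the standard node12 template.

```javascript
'use strict'
// Sketch of screenshot-page/handler.js: return the raw PNG bytes of the page.
const puppeteer = require('puppeteer');

module.exports = async (event, context) => {
  const url = (event.body && event.body.url) || 'https://example.com'; // assumed field name

  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto(url, { waitUntil: 'networkidle2' });
    const png = await page.screenshot({ type: 'png' }); // returns a Buffer
    return context
      .status(200)
      .headers({ 'Content-Type': 'image/png' }) // headers() setter assumed from the template
      .succeed(png);
  } finally {
    await browser.close();
  }
};
```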

Now deploy the function as before:

Invoke the function, and capture the response to a file:

Now open screenshot.png and check the result.

Produce homepage banners and social sharing images

You can also produce homepage banners and social sharing images by rendering HTML locally, and then saving a screenshot.

Unlike with a SaaS service, you’ll have no monthly fees to pay and unlimited use, and you can also customise the code and trigger it however you like.

The execution time is very quick, at under 0.5s per image, and could be made faster by preloading the Chromium browser and re-using it. If you cache the images to /tmp/ or save them to a CDN, you’ll have single-digit latency.

Edit ./banner-gen/handler.js
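A sketch of banner-gen/handler.js might render an inline HTML template with page.setContent and screenshot it. The title and author query parameters below are illustrative; the example from the post (sponsor-cta.html) loads its template from disk instead.

```javascript
'use strict'
// Sketch of banner-gen/handler.js: render HTML locally, then capture it as a PNG banner.
const puppeteer = require('puppeteer');

module.exports = async (event, context) => {
  // Illustrative query parameters, e.g. ?title=Hello&author=Alex
  const title = (event.query && event.query.title) || 'Hello world';
  const author = (event.query && event.query.author) || 'anonymous';

  const html = `<html>
  <body style="width:1024px;height:512px;font-family:sans-serif;background:#24292e;color:#fff">
    <h1 style="padding-top:180px;text-align:center">${title}</h1>
    <h3 style="text-align:center">by ${author}</h3>
  </body>
</html>`;

  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1024, height: 512 });
    await page.setContent(html, { waitUntil: 'networkidle0' });
    const png = await page.screenshot({ type: 'png' });
    return context.status(200).headers({ 'Content-Type': 'image/png' }).succeed(png);
  } finally {
    await browser.close();
  }
};
```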

Deploy the function:

Example usage:

Note that the inputs are URLEncoded for the querystring. You can also use the event.body if you wish to access the function programmatically, instead of from a browser.

This is an example image generated for my GitHub Sponsors page which uses a different HTML template, that’s loaded from disk.

HTML: sponsor-cta.html

Deploy a Grafana dashboard

We can observe the RED metrics from our functions using the built-in Prometheus UI, or we can deploy Grafana and access the OpenFaaS dashboard.

Access the UI at http://127.0.0.1:3000 and login with admin/admin.

See also: OpenFaaS Metrics

Hardening

If you’d like to limit how many browsers can open at once, you can set max_inflight within the function’s deployment file:

A separate queue can also be configured in OpenFaaS for web-scraping with a set level of parallelism that you prefer.

See also: Async docs

You can also set a hard limit on memory if you wish:

See also: memory limits

Long timeouts

Whilst a timeout value is required, this number can be as large as you like.

See also: Featured Tutorial: Expanded timeouts in OpenFaaS

Getting triggered

If you want to trigger the function periodically, for instance to generate a weekly or daily report, then you can use a cron syntax.

Users of NATS or Kafka can also trigger functions directly from events.

See also: OpenFaaS triggers

Wrapping up

You now have the tools you need to deploy automated tests and web-scraping code using Puppeteer. Since OpenFaaS can leverage Kubernetes, you can use auto-scaling pools of nodes and much longer timeouts than are typically available with cloud-based functions products. OpenFaaS plays well with others such as NATS which powers asynchronous invocations, Prometheus to collect metrics, and Grafana to observe throughput and duration and share the status of the system with others in the team.

The pre-compiled versions of Chrome included with docker-puppeteer and chrome-aws-lambda will not run on a Raspberry Pi or an ARM64 machine; however, there is a possibility that they could be rebuilt. For speedy web scraping from a Raspberry Pi or ARM64 server, you could consider other options such as Scrapy.

Ultimately, I am going to be biased here, but I found the experience of getting Puppeteer to work with OpenFaaS much simpler than with AWS Lambda, and I think you should give it a shot.

Find out more: