When I used to run a scraping service, I managed to scrape at most a couple of million Google SERPs per week. But I never ever purchased proxies from proxy providers such as Brightdata, Packetstream or Oxylabs, because I could not fully trust the other customers with whom I shared the proxy bandwidth. What if I share proxy servers with criminals who do more malicious stuff than the somewhat innocent SERP scraping?

Full disclosure: Non-DoS scraping of public information is okay for me. Ad-fraud, social media spam and web attacks such as automated SQL injections or XSS are not.

Furthermore, those proxy services are quite pricey, and, me being a stingy German, I simply didn't see a reasonable way for this combination to work out.

So how did I manage to scrape millions of Google SERPs? I used AWS Lambda: I put Headless Chrome into an AWS Lambda function and used puppeteer-extra and chrome-aws-lambda to create a function that automatically launches a browser for 300 seconds that I could use solely for scraping.

Actually, I could probably have achieved the same with plain curl, because Google really doesn't put much effort into blocking bots from their own search engine (they mostly rate limit by IP). But I needed a full browser for other projects, so there was that.

Anyhow, AWS gives you access to 16 regions all around the world (are they offering even more regions in the meantime?), and after three AWS Lambda function invocations, your function obtains a new public IP address. If you concurrently invoke 1,000 Lambda functions, you bottom out at around 250 public IP addresses. With 16 regions, that gives you around 16 * 250 = 4,000 public IP addresses at any time when using AWS Lambda.

This was enough to scrape millions of Google SERPs per week, even when sharing public datacenter IP addresses. I tried the same with Google Cloud Platform, but funnily enough, Google blocks traffic from their own cloud infrastructure much more aggressively than traffic from AWS. (This was all in 20, things have possibly changed.)

This approach will work for scraping Google / Bing / Amazon, because they want to be scraped to a certain extent. But it will never work against well-protected websites that employ protection from anti-bot companies such as DataDome, Akamai or Imperva (there are more anti-bot companies, don't be salty when I didn't name you, okay?).
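The Lambda-plus-Headless-Chrome setup described above can be sketched roughly as follows. This is a minimal, hypothetical handler assuming the chrome-aws-lambda package (which bundles a Lambda-compatible Chromium build and re-exports puppeteer-core); the puppeteer-extra plugins mentioned in the article would be layered on top of this. Note that the 300-second browser lifetime comes from the function's configured timeout, not from the code itself.

```javascript
// Hypothetical Lambda handler sketch -- not the author's actual code.
// Assumes the chrome-aws-lambda package; the 300 s lifetime would be
// set as the function timeout in the Lambda configuration.
const chromium = require('chrome-aws-lambda');

exports.handler = async (event) => {
  // Launch the bundled Chromium with Lambda-friendly flags.
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    // event.url is a hypothetical input field carrying the URL to fetch.
    await page.goto(event.url, { waitUntil: 'networkidle2' });
    return { statusCode: 200, body: await page.content() };
  } finally {
    await browser.close();
  }
};
```

Invoking a function like this concurrently across many regions is what yields the pool of rotating public IP addresses described above: each cold start may land on a different shared datacenter IP.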
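As a sanity check on the claim that roughly 4,000 rotating datacenter IPs are enough for millions of SERPs per week, here is a back-of-the-envelope sketch. Only the 16 regions and ~250 IPs per region come from the article; the 10 requests per IP per hour figure is a purely illustrative assumption.

```javascript
// Back-of-the-envelope capacity estimate. The per-IP hourly rate is a
// made-up illustrative number; only the 16 * 250 IP pool is from the text.
function weeklyCapacity(ipCount, requestsPerIpPerHour) {
  const hoursPerWeek = 24 * 7; // 168
  return ipCount * requestsPerIpPerHour * hoursPerWeek;
}

const ipPool = 16 * 250; // 16 AWS regions * ~250 IPs each = 4000

console.log(weeklyCapacity(ipPool, 10)); // → 6720000
```

So even at a very conservative hypothetical rate of 10 requests per IP per hour, the pool sustains several million requests per week, which is consistent with the throughput the article reports.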