The second most popular question people ask when they learn that I work in the web data collection / acquisition space is "how do you deal with banning?". Can you guess what the first question is? No, it's not related to the first challenge we looked at last week: Setup. One hint: if this banning question is only asked by people who have some experience with web data acquisition, then the first question is also asked by people who have only heard of the concept.
I see many articles going over tactics to overcome banning or blocking, but it's important to understand the fundamentals of how the different mechanisms map out conceptually.
First, what do we mean by banning? Banning is when web servers implement limiting mechanisms so that they do not return responses in the format you're expecting at the rate you're requesting.
Each web server has a different capacity and implements its own rules to ensure its resources are available, not abused, and used legitimately (minimising DDoS attacks and fraud). To do this, the server will try to recognise whether a real human is accessing the webpages or a bot / crawler is. These mechanisms are also known as rate limiting, anti-bot systems, bot detection systems, or crawling countermeasures.
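In practice, a ban rarely announces itself; it shows up as an HTTP 429 or 403, or as a block page returned where you expected content. Here is a minimal detection sketch, assuming the Python requests library and a placeholder URL:

```python
import requests

def looks_banned(response: requests.Response) -> bool:
    """Heuristic check for rate limiting or blocking."""
    # Explicit refusals: 429 Too Many Requests, 403 Forbidden.
    if response.status_code in (403, 429):
        return True
    # Some sites return 200 with a challenge page instead of the content you asked for.
    return "captcha" in response.text.lower()

response = requests.get("https://example.com/listings", timeout=30)  # placeholder URL
if looks_banned(response):
    # Honour Retry-After when the server provides it; otherwise back off conservatively.
    wait = response.headers.get("Retry-After", "60")
    print(f"Blocked or rate limited; backing off (Retry-After: {wait})")
```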
As much as we follow best practices so as not to abuse service providers while accessing public web data, our data collection needs will conflict with this varying set of rules and capacities. Essentially: how do you make sure you don't cross that thin and arguable line between running legitimate web scrapers and malicious bots throwing aggressive traffic?
We’ll look at different bot detection mechanisms and common methods for overcoming them -- activities I like to label as anti-ban efforts.
This image illustrates the different methods commonly used (in combination) arranged by their level of sophistication.
Up until about five years ago, it was sufficient to rely on back end methods to keep most bots away. In the early days, sites were served as plain HTML, vanilla CSS, and basic Javascript. These days, websites are built to deliver a more interactive user experience, afforded by the amount of innovation in web technologies over the past 12 years, which has also enabled more advanced tracking -- something I want to unpack in a future series. Couple this with more people realising the value of web data, and more sophisticated bots are being developed alongside the technologies built to keep them out.
If I do it in-house, what should I expect?
Here is a more tactical look at what to expect if you're looking to overcome anti-bot systems, with some quick wins for each approach.
Back end level
Manage your sessions (a combination of user agents, source IP, headers sent, cookies used). Quick win: think like a browser. Fake it until you make it (see the sketch after this list).
Mind your request pattern (watch out for suspicious path bypassing, such as jumping straight to deep pages instead of following the site's navigation, and the velocity of your requests). Quick win: apply throttling, slow down.
Use appropriate geolocation. Quick win: proxy management. Start with static proxies and subscribe to a more advanced solution with a larger pool when needed.
Ensure your TCP/IP fingerprint is consistent. Quick win: make sure the TTL and window size fields are consistent.
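A minimal sketch of the session and throttling quick wins above, using Python's requests library; the header values and URLs are illustrative placeholders, not a recipe for any particular site:

```python
import random
import time

import requests

# Browser-like headers; keep the set internally consistent with the claimed browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def polite_fetch(urls, min_delay=2.0, max_delay=5.0):
    """Fetch URLs with one persistent session (so cookies carry over)
    and a randomised pause between requests."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    pages = []
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        pages.append(response.text)
        # Throttle: jittered delays look less mechanical than a fixed interval.
        time.sleep(random.uniform(min_delay, max_delay))
    return pages

# Example (placeholder URLs):
# pages = polite_fetch(["https://example.com/page/1", "https://example.com/page/2"])
```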
Front end level
Be prepared for:
Javascript capability / rendering checks. Quick win: Playwright or Puppeteer (see the sketch after this list).
Browser fingerprinting. Browsers expose a set of properties that anti-bot systems can check for inconsistencies: things like OS version, canvas API checks, WebGL tests, TLS fingerprinting, and WebRTC. (Not so) quick win: headless-browser-farm-as-a-service.
Captchas (graphical / explicit, behavioural / implicit). Quick win: find ways not to trigger them in the first place. This is one good place to practice "you don't need to solve a problem when you can just run away from it". One thing to note: this method is also the most intrusive for human visitors, so websites tend to use it sensibly.
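A minimal sketch of passing a Javascript rendering check with Playwright (the quick win mentioned above): launch headless Chromium, let the page's scripts run, then read the rendered DOM. The URL is a placeholder.

```python
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    """Return the HTML of a page after its Javascript has executed."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-driven requests to settle
        html = page.content()  # rendered DOM, not the raw server response
        browser.close()
    return html

# Example (placeholder URL):
# html = fetch_rendered("https://example.com/products")
```

Note that a stock headless browser still leaks fingerprintable properties, which is why the browser fingerprinting item above has no equally quick win.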
If I want to buy, what are my options?
All the solutions below employ a combination of the methods described above to either detect bots or evade detection.
Different anti-bot solutions:
Datadome
PerimeterX
Incapsula
Akamai
F5 Bot Detection & Security
Cloudflare
Alibaba Cloud
Google ReCaptcha
Different anti-ban solutions:
Oxylabs.io offers different proxy types and services
Limeproxies offers different proxy types and services
Netnut.io (Proxy provider)
https://smartproxy.com/ (Proxy provider)
https://rayobyte.com/ (Proxy provider)
https://github.com/claffin/cloudproxy (self hosted, if you manage your own pool of proxies)
2captcha.com
Death by Captcha
Side note: you can't help but notice that a lot of these solutions (and many more) revolve around proxies. Proxy rotation and management is indeed a whole other interesting and substantial topic that I think deserves its own post to go deeper. It is still one of the most effective ways to overcome most bot detection systems.
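To give a flavour, here is a minimal sketch of the simplest form, round-robin rotation with the requests library, assuming you already have a pool of proxy endpoints (the addresses below are placeholders; commercial providers typically hide the rotation behind a single gateway endpoint):

```python
import itertools

import requests

# Placeholder proxy endpoints; in practice these come from your provider or your own pool.
PROXY_POOL = [
    "http://user:pass@proxy1.example:8000",
    "http://user:pass@proxy2.example:8000",
    "http://user:pass@proxy3.example:8000",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch_via_next_proxy(url: str) -> requests.Response:
    """Send each request through the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```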
All of these are simple to grasp but not easy to execute. Most of the time it's better to buy the SaaS solutions mentioned above and free up your team's precious time from the hassle (hence, cost) of managing different proxy types, applying throttling, handling bans manually, writing middlewares, setting up headless browser infrastructure, and reverse engineering complex Javascript. Of course, I'm approaching this from the point of view of a data collection project, but the same goes if you have a website and are struggling with malicious bots.
Final thoughts
At the end of the day, there is no silver bullet for anti-ban efforts. If you decide to run the show yourself, it all comes down to: be consistent, be thorough, and be respectful. If you go with SaaS anti-ban solutions, be prepared to experiment with your crawlers' parameters, be patient, and be ready to make trade-offs.
Next week we’ll look at the next challenge in managing data collection projects: scaling.
Originally posted in Proses.ID.