DDoS by Web Crawlers
I have this website that crawls Reddit posts and puts them online. There are hundreds of thousands of entries, each with its own URL. The listing is paginated, which creates a lot of URLs on top of that. All those URLs get indexed by web crawlers, which then regularly ping them to check the content, and so on. You know how it works.
The problem is that this unintentionally creates DDoS attacks on my poor web server. And if you don't want to upgrade to a beefier infrastructure, you're probably going to look for a way to mitigate those "attacks".
An easy approach is to rate limit the web crawlers based on the User-Agent header in their requests. Schematically this looks like the following.
At the http level of the Nginx configuration:
map $http_user_agent $bot_ua {
    default            '';
    "~*Googlebot|Bing" Y;
}
limit_req_zone $bot_ua zone=bot:1m rate=1r/s;
This makes sure that all requests with Googlebot or Bing in the User-Agent header are rate limited to 1 request per second. Note that the rate limiting is "global" (vs. per-IP): all the bots share a single queue to access the web site.
The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents, as sketched below.
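For instance, here is a rough sketch of the per-IP variant as an alternative zone definition. The variable name, the whitelisted UptimeRobot entry, the zone size and the rate are illustrative, not taken from my real configuration. Keying the zone on the client address instead of a fixed flag gives each bot IP its own bucket, while an empty value still exempts whitelisted agents and regular visitors:

map $http_user_agent $bot_ua_ip {
    default             '';                    # regular visitors: empty key, never limited
    "~*UptimeRobot"     '';                    # example of a whitelisted crawler
    "~*Googlebot|Bing"  $binary_remote_addr;   # key on the client IP => per-IP limiting
}

limit_req_zone $bot_ua_ip zone=bot_per_ip:10m rate=1r/s;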
At server or location level:
limit_req zone=bot burst=5;
This allows a "burst" of up to 5 excess requests, which are queued and served at the configured rate; anything beyond that is rejected. You can omit this option if you don't want to allow bursts.
When a request is rejected by the rate limiter, Nginx returns HTTP 503 by default; since version 1.3.15 the limit_req_status directive lets you return 429 Too Many Requests instead, which is the more appropriate code here. "Responsible" web crawlers detect this and decrease their scanning speed on the website.
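For completeness, here is roughly how the pieces fit together in a server block. The server_name and the commented-out serving directives are placeholders; the zone name refers to the one defined at http level above:

server {
    listen      80;
    server_name example.com;              # placeholder

    location / {
        limit_req zone=bot burst=5;       # use the zone defined at http level
        limit_req_status 429;             # reply 429 instead of the default 503
        limit_req_log_level warn;         # optional: log rejected requests as warnings

        # ... root / proxy_pass / whatever actually serves the pages
    }
}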
However, the whole issue is more complex than that. A lot of malicious requests pretend to come from Google, Twitter, Facebook and other popular services, while actually originating from various scanners and crawlers, as the question mentioned earlier illustrates. These bots do not respect the rules in robots.txt, and they don't care about the 429 status code either. They can be quite smart and even mimic real web browsers with their User-Agent headers. In such cases the approach described above will not help.
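One coarse counter-measure that does not rely on the User-Agent header at all is a per-IP limit applied to every client. It won't stop a distributed crawl, but it caps what any single address can do to the server. A minimal sketch, where the rate, burst and zone size are placeholders to tune against your real traffic:

limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;   # applies to everyone, spoofed UA or not

server {
    listen 80;

    location / {
        limit_req zone=per_ip burst=20 nodelay;   # absorb small spikes, reject the rest
        limit_req_status 429;
    }
}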