DDoS by Web Crawlers
I have this website that crawls Reddit posts and puts them online. There are hundreds of thousands of entries, each with its own URL. The listing is paginated, which creates a lot of URLs on top of that. All those URLs get indexed by web crawlers, which then regularly ping them to check the content, and so on. You know how it works.
The problem is that this unintentionally creates DDoS attacks on my poor web server. And if you don't want to upgrade to a beefier infrastructure, you're probably going to look for a way to mitigate those "attacks".
An easy approach is to rate limit the web crawlers based on the User-Agent header in their requests. Schematically this looks like the following.
At the http level of the Nginx configuration:
map $http_user_agent $bot_ua {
    default            '';
    "~*Googlebot|Bing" Y;
}
limit_req_zone $bot_ua zone=bot:1m rate=1r/s;
This makes sure that all requests with Googlebot or Bing in the User-Agent header are rate limited to 1 request per second. Note that the rate limiting is "global" (vs. per-IP): all the bots share a single queue to access the web site.
The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents, as sketched below.
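For instance, here is a rough sketch of the per-IP variant as an alternative zone definition. The variable name, the whitelisted UptimeRobot entry, the zone size and the rate are illustrative, not taken from my real configuration. Keying the zone on the client address instead of a fixed flag gives each bot IP its own bucket, while an empty value still exempts whitelisted agents and regular visitors:

map $http_user_agent $bot_ua_ip {
    default             '';                    # regular visitors: empty key, never limited
    "~*UptimeRobot"     '';                    # example of a whitelisted crawler
    "~*Googlebot|Bing"  $binary_remote_addr;   # key on the client IP => per-IP limiting
}

limit_req_zone $bot_ua_ip zone=bot_per_ip:10m rate=1r/s;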
At server or location level:
limit_req zone=bot burst=5;
This allows a "burst" of up to 5 excess requests, which are queued and served at the configured rate; anything beyond that is rejected. You can omit this option if you don't want to allow bursts.
When a request is rejected by the rate limiter, Nginx returns HTTP 503 by default; since version 1.3.15 the limit_req_status directive lets you return 429 Too Many Requests instead, which is the more appropriate code here. "Responsible" web crawlers detect this and decrease their scanning speed on the website.
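For completeness, here is roughly how the pieces fit together in a server block. The server_name and the commented-out serving directives are placeholders; the zone name refers to the one defined at http level above:

server {
    listen      80;
    server_name example.com;              # placeholder

    location / {
        limit_req zone=bot burst=5;       # use the zone defined at http level
        limit_req_status 429;             # reply 429 instead of the default 503
        limit_req_log_level warn;         # optional: log rejected requests as warnings

        # ... root / proxy_pass / whatever actually serves the pages
    }
}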
However, the whole issue is more complex than that. A lot of malicious requests pretend to come from Google, Twitter, Facebook and other popular services, while actually originating from various scanners and crawlers, as the question mentioned earlier illustrates. These bots do not respect the rules in robots.txt, and they don't care about the 429 status code either. They can be quite smart and even mimic real web browsers with their User-Agent headers. In such cases the approach described above will not help.
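One coarse counter-measure that does not rely on the User-Agent header at all is a per-IP limit applied to every client. It won't stop a distributed crawl, but it caps what any single address can do to the server. A minimal sketch, where the rate, burst and zone size are placeholders to tune against your real traffic:

limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;   # applies to everyone, spoofed UA or not

server {
    listen 80;

    location / {
        limit_req zone=per_ip burst=20 nodelay;   # absorb small spikes, reject the rest
        limit_req_status 429;
    }
}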