The first thing I would look for is whether real users, both browsers and API clients, are capable of doing HTTP/2.0 and whether they default to it. If so, that's an easy win: block anything lower than HTTP/2.0 and that will nuke most bots outside of headless Chrome. If any real clients are still on HTTP/1.1, give them a separate listener/URL and limit access to known-good CIDR blocks with a firewall, assuming this is a corporate GitLab server. Or block this on HAProxy and give trusted networks a way to reach NGinx directly, such as via a VPN or a firewall rule.
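On the HAProxy side, a rough sketch of that rule, keying off the frontend connection's HTTP version (the names, certificate path, and trusted range below are placeholders):

    frontend fe_https
        bind :443 ssl crt /etc/haproxy/site.pem alpn h2,http/1.1
        acl is_h2    fc_http_major eq 2
        acl trusted  src 10.0.0.0/8
        # refuse anything that is not HTTP/2 unless it comes from a trusted network
        http-request deny if !is_h2 !trusted
        default_backend be_nginx

    backend be_nginx
        server gitlab_nginx 127.0.0.1:8080 check

Swap the deny for a redirect or a silent-drop if you prefer; the point is that the decision happens before the request ever reaches NGinx.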
If there are archived access logs, those would be a good place to try to figure this out.
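Something like this over the archived logs shows how much traffic is still on HTTP/1.x, assuming a standard combined log format where the request line carries the protocol (adjust the filename glob to your setup):

    zcat -f access.log* | awk -F'"' '{ split($2, r, " "); print r[3] }' | sort | uniq -c | sort -rn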
In NGinx the block looks like this [1], or change it to a redirect to a static landing page.
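A minimal sketch of the idea behind [1] (the linked file may differ in the details); nginx's 444 closes the connection without sending a response:

    server {
        listen 443 ssl;
        http2 on;                          # on older nginx: listen 443 ssl http2;
        server_name gitlab.example.com;    # placeholder

        if ($server_protocol != "HTTP/2.0") {
            return 444;                    # or: return 302 /landing.html;
        }

        # ... rest of the GitLab proxy config ...
    }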
If this is not an option, then restrict repo access to approved SSH clients.
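One way to read "approved SSH clients" is approved source networks for the git user. A hedged sshd_config sketch with placeholder addresses (keep in mind that once AllowUsers is set, every account not listed is refused):

    # /etc/ssh/sshd_config
    # Only let the GitLab "git" user connect from trusted ranges;
    # list the admin accounts too or they get locked out.
    AllowUsers adminuser git@10.0.0.0/8 git@192.0.2.0/24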
If this is not an option either, then put authentication on the repos that are hit hardest, along with a page that explains what the user/password is and an acceptable use policy for using that authentication. If the AIs are trained to learn the credentials, they will be violating the AUP written by your lawyers. Make the AI vendors give you enough money to upgrade your infrastructure to handle their load.
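In the bundled NGinx that can be a basic-auth location in front of the heaviest-hit repos. The repo path, htpasswd file, and policy URL below are placeholders; Omnibus GitLab normally proxies to a gitlab-workhorse upstream, so check your own config for the right proxy_pass target:

    location ~ ^/heavy-group/heavy-repo(/|$) {
        auth_basic           "Credentials and AUP at /botpolicy";
        auth_basic_user_file /etc/nginx/aup.htpasswd;
        proxy_pass           http://gitlab-workhorse;
        proxy_set_header     Host $http_host;
    }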
TL;DR: find the differences between bot behavior and real people, then make rules that will break the bots. There's always a difference. When all else fails, block the CIDR blocks of all the known AI networks and play whack-a-mole for anything outside of their networks. Not perfect, nothing is, but it will lower the load.
If going the blocking route, add all of their CIDR blocks and IPs to a text file that gets read by a startup script and fed to:
    ip route add blackhole "${CIDR}" 2>/dev/null
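A minimal sketch of such a startup script, assuming one CIDR or IP per line in a file like /etc/blackhole.txt (the filename and path are made up):

    #!/bin/sh
    # Blackhole every CIDR/IP listed in the file; skip blank lines and comments.
    while read -r CIDR; do
        case "$CIDR" in ""|\#*) continue ;; esac
        ip route add blackhole "$CIDR" 2>/dev/null
    done < /etc/blackhole.txt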
Blackholing those routes prevents HAProxy from ever completing the handshake, and it is a much lower CPU and memory load on the server than using firewall rules.
[1] - https://mirror.newsdump.org/nginx/inc.d/40_https2_stuff.conf...
I saw them try to read some static files I posted here but they were instantly blocked by a combination of nftables and nginx.
That's what made it past the nftables TCP MSS and TCP window rules. The 200s were members of HN; the 444s were bots.
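Roughly along these lines, assuming an inet filter table with an input chain; the exact thresholds depend on the traffic you see, so treat the numbers as placeholders:

    # drop SYNs advertising an implausibly small MSS or TCP window
    nft add rule inet filter input tcp flags syn tcp option maxseg size 1-536 drop
    nft add rule inet filter input tcp flags syn tcp window lt 1024 drop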
Does Gitlab front-end with Nginx or Haproxy?
Both - first haproxy, then nginx.