Zardoz: a lightweight WAF based on Pseudo-Bayes machine learning.
Zardoz is a small WAF that aims to filter out HTTP calls which are known to end in some HTTP error. It behaves like a reverse proxy running as a frontend: it intercepts the calls, forwards them when needed, and learns from the status code how the server reacts.
After a while, the Bayes classifier is able to tell a "good" HTTP call from a "bad" one, based on the header contents.
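The learning step described above can be sketched roughly as follows. This is only an illustration in Python, not Zardoz's own code: the tokenizer, the `TokenCounts` class, and the rule that any status code of 400 or above counts as "BAD" are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize_headers(headers):
    """Split header names and values into lowercase alphanumeric tokens.

    Hypothetical tokenizer: the real tokenization rules may differ.
    """
    tokens = []
    for name, value in headers.items():
        for part in re.split(r"[^A-Za-z0-9]+", f"{name} {value}"):
            if part:
                tokens.append(part.lower())
    return tokens


def label_from_status(status_code):
    """Assumption: a 4xx or 5xx response marks the request as BAD."""
    return "BAD" if status_code >= 400 else "GOOD"


class TokenCounts:
    """Per-class token counts, the raw material of a naive Bayes classifier."""

    def __init__(self):
        self.counts = {"GOOD": Counter(), "BAD": Counter()}

    def learn(self, headers, status_code):
        # The backend's answer decides which class the tokens are counted in.
        label = label_from_status(status_code)
        self.counts[label].update(tokenize_headers(headers))
        return label
```

For example, after `TokenCounts().learn({"User-Agent": "sqlmap/1.0"}, 404)` the token `sqlmap` has been counted once in the "BAD" class.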
It is designed to consume little memory and CPU, so you don't need powerful servers to keep it running, and it should not add much latency in front of the web server.
This is just an experiment I'm doing with Pseudo-Bayes classifiers. It works pretty well with my blog. Run it in production at your own risk.
git clone https://git.keinpfusch.net/LowEel/zardoz
Zardoz has no configuration file; it is configured entirely through environment variables.
In a Dockerfile, this looks like:
ENV REVERSEURL http://10.0.1.1:3000
ENV PROXYPORT :17000
ENV TRIGGER 0.6
ENV SENIORITY 1025
ENV DEBUG false
ENV DUMPFILE /somewhere/bayes.txt
ENV COLLECTION 2048
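Using a bash script, this means something like:
export REVERSEURL="http://10.0.1.1:3000"
export PROXYPORT=":17000"
export TRIGGER="0.6"
export SENIORITY="1025"
export DEBUG="false"
export DUMPFILE="/somewhere/bayes.txt"
export COLLECTION="2048"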
REVERSEURL is the server Zardoz will act as a reverse proxy for. Set it to the IP and port of the server you want to protect.
PROXYPORT is the IP and port where Zardoz will listen. If you want Zardoz to listen on all interfaces, write only the port, like ":17000", meaning it will listen on all interfaces at port 17000.
TRIGGER: this is one of the trickiest parts. We can describe the behavior of Zardoz in quadrants, like:

| | BAD > GOOD | BAD < GOOD |
|---|---|---|
| \|GOOD - BAD\| >= TRIGGER | BLOCK | PASS |
| \|GOOD - BAD\| < TRIGGER | BLOCK+LEARN | PASS+LEARN |

The value of TRIGGER can range from 0 to 1, like "0.5" or "0.6". The difference between BLOCK without learning and BLOCK with learning is execution time. From the point of view of user experience, nothing changes (the user is blocked either way), but in the "BLOCK+LEARN" case the machine will try to learn the lesson.
Basically, if GOOD and BAD are very far apart, the "likelihood" is very high, so the BLOCK or PASS decision is applied strictly, without learning.
If the likelihood is lower than TRIGGER, then we aren't sure the prediction is good, so Zardoz still executes the PASS or BLOCK, but it waits for the response and learns from it. To summarize, the concept is about "likelihood", which makes the difference between an action and the same action + LEARN.
Personally, I've had good results with the trigger set to 0.6: it doesn't disturb users much, and at the same time it has filtered tons of malicious scans.
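The TRIGGER rule can be written down in a few lines. This is a minimal sketch in Python, not Zardoz's own code. The function name `decide` is hypothetical, the two scores are assumed to be normalized between 0 and 1 so their difference is comparable with TRIGGER, and a tie between GOOD and BAD (which the description does not cover) is treated here as PASS.

```python
def decide(good, bad, trigger):
    """Return (action, learn) for the class scores of one request.

    good, bad: classifier scores, assumed to lie in [0, 1].
    trigger:   the TRIGGER setting, between 0 and 1.
    """
    # The larger score picks the action (assumption: a tie means PASS).
    action = "BLOCK" if bad > good else "PASS"

    # "Likelihood" as described above: how far apart the two scores are.
    likelihood = abs(good - bad)

    # A small distance means an uncertain prediction: act, then learn
    # from the status code the server returns.
    learn = likelihood < trigger
    return action, learn
```

With TRIGGER at 0.6, `decide(0.9, 0.1, 0.6)` gives a strict PASS, while `decide(0.4, 0.6, 0.6)` gives BLOCK+LEARN because the two scores are only 0.2 apart.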
SENIORITY: since Zardoz learns what is good for your web server, it takes time to gain seniority. Starting Zardoz empty and letting it decide right away would produce terrible behavior, because of false positives and false negatives. Plus, at the beginning Zardoz is supposed to ALWAYS learn.
The "SENIORITY" parameter is therefore the number of requests Zardoz will handle in "PASS+LEARN" mode before filtering starts. During this time it learns from real traffic and blocks nothing until the seniority is reached. If you set it to 1025, it will learn from 1025 requests and only then start to actually filter requests. The right number depends on many factors: if you serve a lot of pages and a lot of content, I suggest increasing it.
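As an illustration, the warm-up period can be seen as a gate in front of the normal decision. The sketch below is in Python and is not Zardoz's own code: the function name `gate`, the `requests_seen` counter, and the normalized scores are assumptions made for the example.

```python
def gate(requests_seen, seniority, good, bad, trigger):
    """Return (action, learn), forcing PASS+LEARN during the warm-up.

    requests_seen: how many requests have been handled so far (hypothetical
                   counter kept by the proxy).
    seniority:     the SENIORITY setting.
    """
    if requests_seen < seniority:
        # Still learning: never block, always learn from the response.
        return "PASS", True

    # Seniority reached: fall back to the TRIGGER rule.
    action = "BLOCK" if bad > good else "PASS"
    return action, abs(good - bad) < trigger
```

With SENIORITY set to 1025, a request that looks clearly bad is still passed (and learned from) while fewer than 1025 requests have been seen, and is blocked afterwards.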
DUMPFILE: this is where you want the dumpfile to be saved. Useful with Docker volumes.
COLLECTION: the number of collected tokens considered enough to do a good job. This depends on your service. It is useful to limit memory usage if, for example, your server has very complex content.
If DEBUG is set to "false" or not set, every minute Zardoz will dump the sparse matrix describing the whole Bayesian learning into a file named bayes.json. This contains the weighted matrix of calls and classes. If Zardoz is not behaving as you expected, you may want to have a look at this file. The format is a classic sparse matrix. WARNING: this file may contain cookies or other sensitive headers.
DEBUG: if set to "true", Zardoz will create a folder "logs" and log what happens, together with the dump of the sparse matrix. If set to "false" or not set, the sparse matrix will still be available on disk for post-mortem.
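To tie the settings together, here is a hedged sketch of how the environment variables described above could be read and validated. It is written in Python for illustration and is not how Zardoz itself parses them. The `Config` class and `load_config` function are hypothetical, and the fallback values are simply the ones from the Dockerfile example, not documented defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    reverse_url: str
    proxy_port: str
    trigger: float
    seniority: int
    debug: bool
    dumpfile: str
    collection: int


def load_config(env=None):
    """Build a Config from a mapping of environment variables."""
    env = os.environ if env is None else env

    trigger = float(env.get("TRIGGER", "0.6"))
    if not 0.0 <= trigger <= 1.0:
        raise ValueError("TRIGGER must be between 0 and 1")

    return Config(
        reverse_url=env["REVERSEURL"],  # required: the server to protect
        proxy_port=env.get("PROXYPORT", ":17000"),
        trigger=trigger,
        seniority=int(env.get("SENIORITY", "1025")),
        # DEBUG counts as enabled only when it is the string "true".
        debug=env.get("DEBUG", "false").strip().lower() == "true",
        dumpfile=env.get("DUMPFILE", "/somewhere/bayes.txt"),
        collection=int(env.get("COLLECTION", "2048")),
    )
```

Passing a plain dictionary instead of the real environment, as in `load_config({"REVERSEURL": "http://10.0.1.1:3000"})`, makes it easy to try out different settings.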
Credits for the Bayesian implementation go to Jake Brukhman: https://github.com/jbrukh/bayesian