Spam comments are a notorious annoyance. They are either malicious data sent by attackers to exploit loopholes in a site or advertisements posted by robots. These comments have their own features and patterns, and if we are careful enough, we can find ways to block most of them, although it is not easy.
To block comments that carry malicious executable code such as JavaScript, remember one rule: never trust user input. Wherever the site accepts user input, we need to validate the data, escape it before it is displayed, and remove unnecessary HTML tags from it. Posts on how to do this are easy to find online.
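As a rough illustration of the escaping and tag removal mentioned above, here is a minimal sketch. The function name sanitize_comment is made up for this example, and the choice to strip every tag (rather than allow a few safe ones) is an assumption, not a rule from this post.

```php
<?php
// Hypothetical helper: clean a raw comment before storing or displaying it.
function sanitize_comment(string $raw): string
{
    // Remove all HTML tags. Note that the text between the tags is kept.
    $noTags = strip_tags($raw);
    // Escape the remaining special characters so the browser shows them as text.
    return htmlspecialchars(trim($noTags), ENT_QUOTES, 'UTF-8');
}
```

Note that strip_tags removes the script tags but keeps the text between them, so escaping the result afterwards is still necessary.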
To block advertisement comments, we need to find the features and patterns of these comments, adopt a method for each, and combine the methods into a complete blocking system. This post focuses on how to block advertisement comments. The methods include checking the number of HTML links in the comment, building a black list of banned content, and checking the source of the comment.
Simple one first. Most advertisement comments are posted by robots because robots are fast and automatic. Once spammers find the form used to post a comment, they build a similar form, fill in the data themselves, and then let the robots take over the process. So we can first check the request type of the comment: if the request is not of an expected type such as HTTP POST, we should block it. The code to check whether a request is HTTP POST is:
isset($_SERVER["REQUEST_METHOD"]) && $_SERVER["REQUEST_METHOD"] == 'POST'
You can check whether the request is submitted through AJAX as well:
(!empty($_SERVER['HTTP_X_REQUESTED_WITH']) && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest')
With the above check, we can block every comment that does not arrive as an HTTP POST request. But robots nowadays are smart enough to simulate an HTTP POST request easily, so this method alone will not help much.
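The two request checks can be wrapped in one function, sketched below. The function name is_expected_request and its $requireAjax parameter are hypothetical. Passing the server array as an argument, instead of reading $_SERVER directly, is a design choice made here so the check can be tried without a web server.

```php
<?php
// Hypothetical helper: $server is normally $_SERVER.
function is_expected_request(array $server, bool $requireAjax = false): bool
{
    // Same test as the one-liner in the text: the request must be an HTTP POST.
    $isPost = isset($server['REQUEST_METHOD'])
        && $server['REQUEST_METHOD'] === 'POST';
    if (!$isPost) {
        return false;
    }
    if (!$requireAjax) {
        return true;
    }
    // X-Requested-With is sent by the client, so a robot can forge it.
    return !empty($server['HTTP_X_REQUESTED_WITH'])
        && strtolower($server['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';
}
```

In a comment handler it would be called as is_expected_request($_SERVER), or is_expected_request($_SERVER, true) if the form is only ever submitted through AJAX.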
Next, another feature of spam comment robots is that they post comments within a very short time interval. In reality, when a user visits a page, writes a comment, and clicks the submit button, the whole process takes at least a few seconds. So we can check the interval between the page load time and the comment post time (here the load time means the moment the page is loaded, not how long the page takes to load). If this interval is too short, we can assume that the comment is spam. The page load time can be recorded when the page loads, saved in a form field, and submitted to the server along with the comment. When the server processes the request, it generates the comment post time and makes the check.
(time() - $_POST["ntime"]) < $timeInterval // If true, block the comment. $_POST["ntime"] is the page load time, time() is the comment post time, and $timeInterval is the minimum interval in seconds
Some robots are smart enough to get around this check by delaying their posts. But we can still block some spam comments with this method.
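The interval check can be made a little more defensive, as in the sketch below. The function name is_posted_too_fast and the optional $now parameter are hypothetical, and treating a missing or malformed ntime field as suspicious is an assumption of this sketch. The field itself could be a hidden input whose value is filled with time() when the page is rendered.

```php
<?php
// Hypothetical helper: $post is normally $_POST, and "ntime" is the form field
// that carries the page load time as a Unix timestamp.
function is_posted_too_fast(array $post, int $minSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    $loadTime = $post['ntime'] ?? null;
    // Assumption: a missing or non-numeric timestamp is treated as suspicious.
    if (!is_scalar($loadTime) || !ctype_digit((string) $loadTime)) {
        return true;
    }
    // A timestamp in the future gives a negative interval, which is also blocked.
    return ($now - (int) $loadTime) < $minSeconds;
}
```

Because the timestamp travels through the browser, a robot can change it. This check only filters out the robots that do not bother to.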
One feature of spam comments is that they usually contain links meant to lure users into visiting the sites of the spammer's clients. So we can also check the number of links contained in the comment. If a comment contains far too many links, we should block it.
max(substr_count(strtolower($data), 'href='), substr_count(strtolower($data), 'http://') + substr_count(strtolower($data), 'https://'));
The above is a simple way to count the links in a comment; you can use your own way to count them. Most of the time, a comment that contains more than 2 links is a spam comment.
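As one possible alternative way to count, the sketch below uses regular expressions. The function names count_links and has_too_many_links are hypothetical, and the default limit of 2 simply follows the rule of thumb above.

```php
<?php
// Hypothetical helper: count link-like patterns in a comment.
function count_links(string $comment): int
{
    // Anchor tags that carry an href attribute.
    $anchors = (int) preg_match_all('/<a\s[^>]*href\s*=/i', $comment);
    // URLs with an http or https scheme, inside or outside a tag.
    $urls = (int) preg_match_all('~https?://~i', $comment);
    // A link written as <a href="http://..."> matches both patterns,
    // so take the larger count instead of the sum.
    return max($anchors, $urls);
}

// Hypothetical helper: flag a comment whose link count exceeds the limit.
function has_too_many_links(string $comment, int $maxLinks = 2): bool
{
    return count_links($comment) > $maxLinks;
}
```

Taking the maximum rather than the sum keeps a single anchor tag from being counted twice, at the cost of undercounting a comment that mixes relative links with bare URLs.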
Another very important feature of spam comments is that they contain keywords that are easy to spot and are illegal or inappropriate, such as gamble, drug, or pharmacy. So we can build a black list of banned keywords and then check whether a comment contains any of them. The black list may differ according to the type of website. Black lists can also be found online, such as this one.
To detect banned keywords, we need to process the comment posted by the user and generate a list of word tokens. We may also need stemming so that different forms of the same keyword are matched. If you are using PHP, here is a good stem class to use.
For the source code that processes the comment, please check it on GitHub.
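To give an idea of the token step, here is a minimal sketch. It is not the code from the GitHub repository mentioned above. The function names tokenize_comment and find_banned_words are hypothetical, the tokenizer only handles ASCII letters and digits, and stemming is left out, so a word like "gambling" would not match "gamble" here.

```php
<?php
// Hypothetical helper: split a comment into lowercase word tokens.
function tokenize_comment(string $comment): array
{
    $text = strtolower(strip_tags($comment));
    // Split on anything that is not an ASCII letter or digit and drop empty pieces.
    return preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: return the banned words that appear in the comment.
function find_banned_words(string $comment, array $blacklist): array
{
    $banned = array_map('strtolower', $blacklist);
    $tokens = array_unique(tokenize_comment($comment));
    // array_values re-indexes the result from 0.
    return array_values(array_intersect($tokens, $banned));
}
```

A comment would then be blocked when find_banned_words returns a non-empty array. Matching whole tokens instead of substrings avoids flagging harmless words that merely contain a banned word.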
With this method, you can block most spam comments. Many well-known blog platforms, such as WordPress, adopt this method.
By combining all the above methods, you can achieve even more. If you have better ways to block spam comments, please feel free to share. The war against spam comments will continue for a long time.