Fighting blog spam with Common Lisp

People often ask for practice projects to get started with Common Lisp: "I just can't think of any way to use it!" One possible answer: if you run into a practical problem in your life, write a Common Lisp program to solve it. Here's a program I wrote to fight spam. It's not fancy, or innovative, or anything like that, but it's a short, easy project, the kind of thing for which I use CL all the time.

Every day I check Google Blog Search for Common Lisp to see if anything interesting pops up. A while ago the results started filling up with a particular kind of spam blog. Each one was hosted on blogspot, and the blog titles were things like "Cheap Discount Computer Books," "Cheap Computer Programming Books," etc., with posts full of computer-related keywords, including "Common Lisp."

Blogspot-hosted blogs normally have a toolbar across the top of each page for reporting the blog as spam, as on the Quicklisp blog.

The spam blogs hid the toolbar somehow.

To fight this spam, I wrote a little project called "spamfighter." It scrapes the blogs and makes it easy for me to report them as spam.

The URL for reporting a Blogspot blog as spam looks like this: http://www.google.com/support/blogger/bin/request.py?hl=en&contact_type=main_tos&blog_ID=blog-id&blog_URL=blog-url&rd=1

The blog-id and blog-url are embedded in JavaScript data in every blog page. The format is highly regular, so I decided to scrape it out with a regular expression instead of using an HTML and JavaScript parser.
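
As a rough illustration of that kind of regex scraping, here is a minimal sketch using cl-ppcre's register-groups-bind on a made-up snippet. The sample text, the field layout, and the names *sample-page* and extract-sample-fields are all assumptions for the example; they are not Blogspot's real markup, and the sketch assumes Quicklisp is installed so cl-ppcre can be loaded.

```lisp
;; Assumption: Quicklisp is available to load cl-ppcre.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload "cl-ppcre" :silent t))

;; Hypothetical page fragment, invented for this example.
(defparameter *sample-page*
  "var data = {'blogId': '12345', 'homepageUrl': 'http://example.blogspot.com/'};")

(defun extract-sample-fields (text)
  "Return a list (ID URL) scraped from TEXT, or NIL if the pattern does not match."
  ;; Each parenthesized group in the regex is bound, in order, to ID and URL.
  (ppcre:register-groups-bind (id url)
      ("'blogId': '(\\d+)'.*?'homepageUrl': '(http://[^']*)'" text)
    (list id url)))

(extract-sample-fields *sample-page*)
;; => ("12345" "http://example.blogspot.com/")
```

When the pattern does not match, register-groups-bind skips its body and returns NIL, so a caller can treat NIL as "nothing found."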

  * (quickproject:make-project "~/src/lisp/spamfighter/"
                               :depends-on '(drakma cl-ppcre))
  "spamfighter"
  * (ql:quickload "spamfighter")
  [loading output]
  ("spamfighter")

Quickproject created the skeleton of a project for me, setting up spamfighter.asd, package.lisp, and spamfighter.lisp. (For more information about how I use quickproject, see Making a small Lisp project with quickproject and Quicklisp.)

I used the quickproject-created ASDF system file without modification:

;;;; spamfighter.asd

(asdf:defsystem #:spamfighter
  :serial t
  :depends-on (#:cl-ppcre
               #:drakma)
  :components ((:file "package")
               (:file "spamfighter")))

I did make changes to the package.lisp file, though, by adding two import-from clauses:

;;;; package.lisp

(defpackage #:spamfighter
  (:use #:cl)
  (:import-from #:drakma
                #:url-encode
                #:http-request)
  (:import-from #:ppcre
                #:create-scanner
                #:register-groups-bind))

I used :import-from instead of :use here for a few reasons. First, if a future version of cl-ppcre or drakma exports more symbols, I don't have to care; they can't clash with anything in my project. Second, :import-from can import symbols that aren't actually exported, as with drakma::url-encode. It isn't a robust practice when working with someone else's libraries, but I needed URL encoding and Drakma's function turned up first in apropos, so I didn't mind doing it in this quick and dirty project.
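
To see the second point in isolation, here is a small sketch with two toy packages. The names toy-lib, toy-app, public-fn, internal-fn, and demo are hypothetical and exist only for this example; nothing here comes from drakma or cl-ppcre.

```lisp
;; A stand-in "library" that exports only one of its two functions.
(defpackage #:toy-lib
  (:use #:cl)
  (:export #:public-fn))

(in-package #:toy-lib)

(defun public-fn () :public)
(defun internal-fn () :internal)        ; deliberately not exported

;; A stand-in "application" that names exactly the symbols it wants.
;; :import-from works for INTERNAL-FN even though TOY-LIB does not export it.
(defpackage #:toy-app
  (:use #:cl)
  (:import-from #:toy-lib
                #:public-fn
                #:internal-fn))

(in-package #:toy-app)

(defun demo ()
  "Call both imported functions without any package prefix."
  (list (public-fn) (internal-fn)))

(in-package #:cl-user)

(toy-app::demo)
;; => (:PUBLIC :INTERNAL)
```

Because toy-app lists its imports one by one, any symbol toy-lib exports later stays out of toy-app unless it is added to the list.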

Here's the code for the spam fighting bit:

;;;; spamfighter.lisp

(in-package #:spamfighter)

;;; "spamfighter" goes here. Hacks and glory await!

(defun page-content (url)
  "Return the content of URL as a string."
  (multiple-value-bind (content status headers uri stream must-close phrase)
      (http-request url)
    (declare (ignore headers))
    (when must-close
      (close stream))
    (unless (= status 200)
      (error "Unexpected status ~A (~A) on ~A" status phrase uri))
    content))

(defclass blog ()
  ((id
    :initarg :id
    :accessor id)
   (url
    :initarg :url
    :accessor url)))

(defmethod print-object ((blog blog) stream)
  (print-unreadable-object (blog stream :type t)
    (format stream "~S (~A)" (url blog) (id blog))))

(defparameter *scanner*
  (create-scanner "blog_id.. = '(\\d+).*homepageUrl....(http://.*?)'"
                  :single-line-mode t
                  :multi-line-mode t))

(defun extract-blog (content)
  (register-groups-bind (id url)
      (*scanner* content)
    (when (and id url)
      (make-instance 'blog
                     :id id
                     :url url))))

(defun report-url (blog)
  (format nil "http://www.google.com~
               /support/blogger/bin/request.py?~
               hl=en&contact_type=main_tos&blog_ID=~A&blog_URL=~A&rd=1"
          (id blog)
          (url-encode (url blog) :utf-8)))

(defun report-spam (url)
  (let* ((content (page-content url))
         (blog (and content (extract-blog content))))
    (when blog
      (report-url blog))))

I wrote it out as you see here, from top to bottom, in just a few minutes.

Initially I used the program by doing my daily search, getting a results page with 5-10 spam blogs, and then:

  1. Get spam blog URL
  2. Paste URL into (report-spam "URL")
  3. Copy & paste return value back into browser

That got pretty tedious for multiple URLs, so I changed report-spam and added spam-repl:

(defun report-spam (url)
  (let* ((content (page-content url))
         (blog (and content (extract-blog content))))
    (when blog
      (asdf:run-shell-command "gnome-open ~S" (report-url blog)))))

(defun spam-repl ()
  (with-simple-restart (quit-repl "Leave spam-repl")
    (loop
      (with-simple-restart (skip "Skip URL")
        (format t "> ")
        (force-output)
        (let ((url (read-line)))
          (when (string-equal url "quit")
            (return))
          (report-spam url))))))

That improved the process to just calling spam-repl and pasting URLs after the prompt.
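
A side note for anyone adapting this today: later ASDF releases document run-shell-command as obsolete in favor of uiop:run-program. The following is only a sketch of that substitution, not what the program above does. The helper names open-command and open-in-browser are hypothetical, and "xdg-open" is an assumed opener that may not exist on every system.

```lisp
;; UIOP ships with ASDF 3; make sure it is loaded before using its package.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require "asdf"))

(defun open-command (url &optional (opener "xdg-open"))
  "Return an argument list for opening URL. OPENER is an assumption."
  (list opener url))

(defun open-in-browser (url)
  "Hand URL to the desktop opener. Passing a list avoids shell quoting issues."
  (uiop:run-program (open-command url) :ignore-error-status t))

(open-command "http://example.com/?a=1&b=2")
;; => ("xdg-open" "http://example.com/?a=1&b=2")
```

Passing the command as a list of strings means the URL's & characters never reach a shell, so no quoting is needed.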

I've used the with-simple-restart technique a few times now. It's especially handy when working on a list of things, some of which might take a long time to process. If there's an error processing a particular thing, I either want to keep going with the rest (so I don't lose my work on the ones that processed fine), or give up and start afresh later.
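
Here is a minimal sketch of that pattern applied to a plain list, using only standard Common Lisp. The function names process-all and process-all-skipping-errors and the restart names skip and give-up are made up for this example.

```lisp
(defun process-all (items function)
  "Call FUNCTION on each of ITEMS, offering SKIP and GIVE-UP restarts.
Return the results for the items that were processed successfully."
  (let ((results '()))
    ;; Outer restart: abandon the remaining items but keep what is done.
    (with-simple-restart (give-up "Stop processing and return results so far")
      (dolist (item items)
        ;; Inner restart: drop just this item and carry on with the next.
        (with-simple-restart (skip "Skip ~S" item)
          (push (funcall function item) results))))
    (nreverse results)))

(defun process-all-skipping-errors (items function)
  "Like PROCESS-ALL, but choose the SKIP restart automatically on any error."
  (handler-bind ((error (lambda (condition)
                          (let ((restart (find-restart 'skip condition)))
                            (when restart
                              (invoke-restart restart))))))
    (process-all items function)))

(process-all-skipping-errors '(1 2 0 4) (lambda (n) (/ 12 n)))
;; => (12 6 3)
```

Called interactively without the handler, process-all drops into the debugger on an error and lists both restarts, which is the situation described above: skip the bad item, or give up and keep the results so far.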

After using this for a few days, the spam situation improved a lot. See, for example, cheaplispprogrammingbookreview.blogspot.com, which now returns a page that says "Blog has been removed."

Hacks and glory await!

Comments

(Anonymous)

Congratulations

Excellent post and code! I learn a lot from you! Thanks!

thanks

Thanks for writing this up. When I started using CL a couple of years ago, there were already some excellent books on CL programming (e.g., PCL), but what I was missing was this kind of tutorial on how to set up a project. Quicklisp and Quickproject lower the entry cost for CL programming considerably.

Good work

Good work and very well explained Xach.

Cheers,

Ruben