I've been trying to make small improvements to the Erik Naggum comp.lang.lisp archive. One of them is to change the Google Groups link on an article from Google's individual article view to the thread view, so you can quickly see more of the context of the article without clicking multiple times.
Unfortunately, getting the thread view URL requires scraping an individual article's HTML, and Google limits the rate at which you can do that from a single IP. Fortunately, the task is easily distributed among multiple workers.
Here's a bit of client and server code I used to get some friends help me gather the 5000+ links I was looking for.
First, the client. It's meant to be run with sbcl --load megamid.lisp. Here's what it does:
- Fetch a message id from my server
- Fetch that message id's page from Google
- Scrape the thread URL from that page
- Post the message id and thread URL back to my server
- Repeat while my server still returns message ids
;;;; megamid.lisp (sb-ext:disable-debugger) (require 'asdf) (require 'drakma) (defpackage #:megamid (:use #:cl) (:shadowing-import-from #:drakma #:http-request)) (in-package #:megamid) (defun parameters (params) (loop for (key value) on params by #'cddr collect (cons (string-downcase key) (princ-to-string value)))) (defun request (url &key (method :get) parameters) (multiple-value-bind (content code headers uri stream must-close) (http-request url :method method :parameters (parameters parameters) :want-stream nil) (declare (ignore headers uri)) (when must-close (ignore-errors (close stream))) (when (<= 400 code) (error "Bad response code ~A for ~A ~A" code method url)) (unless (= 204 code) content))) (defun google-page (message-id) (request "http://groups.google.com/groups" :parameters (list :selm (string-trim "<>" message-id)))) (defun extract-thread-link (string) "Return the URL linking to a thread discussion in STRING, which should be a Google Groups article HTML page." (let ((index (search "/browse_thread/" string))) (when index (let ((start (1+ (position #\" string :from-end t :end index))) (end (position #\" string :start index))) (concatenate 'string "http://groups.google.com" (remove #\? (subseq string start end))))))) (defun thread-link (message-id) (extract-thread-link (google-page message-id))) (defun message-id-p (string) "Is STRING a lot like a message-id?" (and (char= (char string 0) #\<) (char= (char string (1- (length string))) #\>) (position #\@ string))) (defun resolver-loop () (loop (let ((message-id (request "http://lisp.xach.com/naggum/unresolved"))) (unless (and message-id (message-id-p message-id)) (sb-ext:quit)) (let ((thread-link (thread-link message-id))) (request "http://lisp.xach.com/naggum/resolve" :method :post :parameters (list :message-id message-id :url (or thread-link "none"))) (format t "~A => ~A~%" message-id thread-link))) (force-output) (sleep (+ 3 (random 3))))) (resolver-loop)
Here's the server. Basically:
- Keep a stack of all message ids
- If a client requests a message id, pop the stack and return one
- If a client posts a message id and url, save it to a file
;;;; megamid-server.lisp (defpackage #:megamid-server (:use #:cl) (:shadowing-import-from #:sb-thread #:make-mutex #:with-mutex)) (in-package #:megamid-server) (defvar *lock* (make-mutex :name "megamid")) (defvar *pending-message-ids* '()) (defun message-id-p (string) "Is STRING a lot like a message-id?" (and (char= (char string 0) #\<) (char= (char string (1- (length string))) #\>) (position #\@ string))) (defun string-digest (string) (ironclad:byte-array-to-hex-string (ironclad:digest-sequence 'ironclad:md5 (sb-ext:string-to-octets string :external-format :ascii)))) (defun file (message-id) (make-pathname :name (string-digest message-id) :type "txt" :defaults #p"site:db;naggum;thread-urls;")) (defun resolvedp (message-id) (probe-file (file message-id))) (defun load-unresolved () (with-mutex (*lock*) (setf *pending-message-ids* (remove-if 'resolvedp (site:file-lines #p"site:db;naggum;msgids.txt"))))) (defun handle-resolve () (let ((message-id (hunchentoot:post-parameter "message-id")) (url (hunchentoot:post-parameter "url"))) (when (and message-id url (message-id-p message-id)) (let ((file (file message-id))) (with-mutex (*lock*) (unless (probe-file file) (ensure-directories-exist file) (site:barf url file))))) "ok")) (defun handle-unresolved () (let ((message-id)) (with-mutex (*lock*) (setf message-id (pop *pending-message-ids*))) (if message-id message-id (progn (setf (hunchentoot:return-code*) 204) "")))) (site:handle-url "/naggum/resolve" 'handle-resolve) (site:handle-url "/naggum/unresolved" 'handle-unresolved)
With a bunch of people helping me, I was able to get all the URLs I needed within a few hours.