August 18th, 2009

Dan Weinreb on High-Performance Transaction Processing in Lisp

Via Twitter I came across a link to Dan Weinreb's Tech Talk, Lisp for High-Performance Transaction Processing. It was posted yesterday.

Update: Here are some notes on the talk. It wasn't really about the low-level details of transaction processing; it was more about what ITA does, how they do it, what kinds of problems they face, and how they solve them.

Explanation of what ITA does

Discussion of the products

Discussion of the product architecture

Why Lisp? Macros: syntax does matter. Example: flexible, cheap logging with a macro.
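
The notes don't record what the logging macro looked like. One common reason to make logging a macro is that the message arguments are not evaluated at all when logging is turned off. Assuming that is the "cheap" part, here is a rough Python analog. Python has no macros, so the sketch delays evaluation with a lambda instead. The names log_debug_lazy and expensive_summary are made up for illustration.

```python
import logging

logger = logging.getLogger("notes.example")  # hypothetical logger name


def expensive_summary(calls):
    """Stand-in for a costly message computation; records that it ran."""
    calls.append("ran")
    return "summary of a large structure"


def log_debug_lazy(make_message):
    """Build the message only if DEBUG logging is enabled.

    A Lisp macro could do this without the caller writing a lambda,
    because the macro controls when its argument forms are evaluated.
    """
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(make_message())


calls = []

logger.setLevel(logging.WARNING)
log_debug_lazy(lambda: expensive_summary(calls))  # thunk is never called

logger.setLevel(logging.DEBUG)
log_debug_lazy(lambda: expensive_summary(calls))  # thunk is called once
```

With logging disabled, the only cost is the level check; the expensive call happens only on the second invocation.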

100 people on the team, 50 writing Lisp

Lisp also used for ORM, "Quake", written by Scott McKay

What Lisp changes were needed? Timeouts on all I/O operations
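
The talk didn't show how the timeouts were added to the Lisp implementations. As a language-neutral illustration of the idea (every blocking read gives up after a bounded wait instead of hanging forever), here is a minimal Python sketch using the standard socket module. The helper name recv_with_timeout is an assumption, not anything from ITA's code.

```python
import socket


def recv_with_timeout(sock, nbytes, seconds):
    """Return received bytes, or None if nothing arrives within `seconds`."""
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The caller decides what to do: retry, fail the request, etc.
        return None


# A connected pair of sockets stands in for a client and a backend.
a, b = socket.socketpair()
try:
    silent = recv_with_timeout(a, 1024, 0.05)    # peer sends nothing
    b.sendall(b"ping")
    answered = recv_with_timeout(a, 1024, 0.05)  # peer has sent data
finally:
    a.close()
    b.close()
```

Here silent ends up as None after roughly 50ms, while answered holds the bytes that were sent.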

Four nines reliability, 24/7, with strict availability and response-time requirements (300ms or better); otherwise ITA has to pay penalties.

How do you upgrade the software while things are running?

Handling Unix signals.

QPX uses SBCL; RES uses ClozureCL. The ClozureCL compiler is 10x faster than SBCL's, but its code generation is not as good. SBCL has a ragtag group behind it; ClozureCL has a company with employees, who get paid to do what ITA needs. And the Clozure people are friends with ITA people (Dan has known Gail Zacharias since high school).

You don't have to advocate for garbage collection any more.

They use CFFI to call crypto libraries, Oracle interfaces.

Presentation layer is in Java due to libraries.

The inventory manager is written in Python. Why? The guy who wrote it said "I knew Python, I thought it would be fun, and I like Twisted." There's no more special reason than that; it could have been written in Lisp.

Dan started 3 years ago: a lot of cowboy programmers. Things are much tighter now, more professional. Tasks, schedules, code reviews, buildbots, etc.

Languages used: Lisp, Java, Python, C++, random shell scripts, Perl scripts.

How do you get high level, possibly strong-ego people to work together? The example set by Guy Steele for the ANSI standardization process was eye opening for Dan: Guy is diplomatic and a great writer and he made things work.

Code reviews speed up software development: finding bugs early outweighs checking in quickly.

Q: Is there static type checking for Lisp applications? A: There are macros people can put in. But there's also DEFINE-STRICT-FUNCTION: has type checking for inputs, outputs, preconditions, and postconditions, and is used for module interfaces.
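
The talk didn't show the definition of DEFINE-STRICT-FUNCTION, so the following is not ITA's macro. It is a small Python sketch of the same idea as described: check argument types, the return type, a precondition, and a postcondition at a function boundary. The decorator name strict and the example function total_price are hypothetical, and the type check only handles plain classes as annotations.

```python
import functools
import inspect


def strict(pre=None, post=None):
    """Check annotated types plus optional pre/postconditions on each call."""
    def decorate(fn):
        sig = inspect.signature(fn)
        hints = fn.__annotations__

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            # Input types, taken from the annotations.
            for name, value in bound.arguments.items():
                expected = hints.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(f"{fn.__name__}: {name} must be {expected.__name__}")
            # Precondition over the named arguments.
            if pre is not None and not pre(**bound.arguments):
                raise ValueError(f"{fn.__name__}: precondition failed")
            result = fn(*args, **kwargs)
            # Output type and postcondition.
            expected = hints.get("return")
            if expected is not None and not isinstance(result, expected):
                raise TypeError(f"{fn.__name__}: result must be {expected.__name__}")
            if post is not None and not post(result):
                raise ValueError(f"{fn.__name__}: postcondition failed")
            return result
        return wrapper
    return decorate


@strict(pre=lambda fare, seats: seats > 0, post=lambda total: total >= 0)
def total_price(fare: int, seats: int) -> int:
    return fare * seats
```

Calling total_price(100, 2) passes every check, total_price("100", 2) raises TypeError, and total_price(100, 0) raises ValueError from the precondition. Reserving checks like these for module interfaces, as the answer describes, keeps the runtime cost at the boundaries rather than in every internal call.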

What do you do when a program knows it has a bug? Depends on the kind of bug. Heisenbugs can be restarted in their stateless system. Reproducible bugs require a different strategy.

ITA is divided into developers and operators. The ops team demands "opsification" of delivered software: startable, restartable, monitorable, maintainable, etc.

Should the cluster use threads? The discussion was heated: some said "Threads are problematic, locking, crashing, everything will break." Dan wrote a 35-page design paper covering possible uses of threads in the project. 8 different schemes, 12 different criteria, hard to decide what's best. Finally decided not to use multiple threads for request processing. Timers and monitoring do run in their own threads.

Careful planning for loading new business data without shutting things down, and even loading data early and specifying "Make this the active data at some particular future timestamp."
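
How ITA implements this wasn't covered in detail. One simple way to picture "load early, activate at a future timestamp" is to keep every loaded data set alongside its activation time and have each lookup pick the newest set whose time has passed. The sketch below assumes that approach; the class name VersionedData and the sample values are invented for illustration.

```python
import bisect


class VersionedData:
    """Data sets keyed by activation time; lookups never need a restart."""

    def __init__(self):
        self._times = []   # activation timestamps, kept sorted
        self._values = []  # data sets, parallel to _times

    def load(self, active_from, value):
        """Load a data set now; it becomes active at `active_from`."""
        i = bisect.bisect_right(self._times, active_from)
        self._times.insert(i, active_from)
        self._values.insert(i, value)

    def active(self, now):
        """Return the newest data set whose activation time is <= now."""
        i = bisect.bisect_right(self._times, now)
        if i == 0:
            raise LookupError("no data active yet")
        return self._values[i - 1]


rules = VersionedData()
rules.load(100, {"fee": 300})
rules.load(200, {"fee": 320})  # loaded early, takes effect at t=200
```

A request handled at t=150 sees the first data set via rules.active(150), and one handled at t=200 or later sees the second, with no shutdown in between.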

Strategies for live updates of clusters: support for multiple versions in the same cluster should be planned early, not added in late.

Air Canada's usage patterns are diurnal, so upgrades can be scheduled for the lighter load at night. But a big American airline might be less diurnal and have ten times the volume.

(45 minute mark: "Oh! I don't have much time!")

Clojure is up-and-coming.

State of Common Lisp: Really old, 11 active implementations, 3 European conferences, 250 people at the most recent International Lisp Conference. (ILC videos will be online sometime soon.)

To improve Common Lisp, Dan feels the library ecosystem is a great place to start.

Covers LAIR, Lisp All-Inclusive Repository. Metadata, reviews, etc. Single out exemplary libraries, make it easier to document things in textbooks.

Practical Common Lisp brings their hackers up to Lisp-speed quickly.

Common Lisp is on life support: new libraries will extend its life, but it's not the future in Dan's view.

Scheme is interesting, lots of academic work ongoing. How to do macros right, modules, continuations, etc.

Clojure: "amazing". Dan is really enthusiastic about it and has high hopes for its future. You can reuse the huge amount of Java effort. "CL library situation isn't as bad as people think, but we'll never catch up to the volume of Java libraries." Good concurrency story for Clojure: STM, side-effect free, etc. The Java tool ecosystem is also applicable. Dan thinks Clojure will fill the "industrial strength Lisp" niche in the future, instead of CL. Scheme is trickier, but maybe it could be unified.

Q: GC is slow on huge heaps, is that a problem with the 300ms limit?

A: GC has not yet been a problem. The performance is acceptable. Generational GC works great. In the cluster context, if you have a GC issue, just crash the instance and let something else take over.