Moving On

September 19, 2010

I’ve moved to my own hosted blog

State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure (from Scratch)

September 14, 2010

Recently, Yoong-Keok Lee, Regina Barzilay, and I published a paper on unsupervised part-of-speech tagging: that is, learning the syntactic categories of words from raw text. The model is actually pretty simple relative to other published approaches, yet yields the best results on several languages. The C++ code for the project is available and finishes in a few minutes on a large corpus.

Although the model is pretty simple, you might not be able to tell that from the C++ code, despite Yoong being a top-notch coder. The problem is that the language just doesn’t facilitate expressiveness the way my favorite language, Clojure, does. In fact, the entire model, with no dependencies beyond the language and its standard library, clojure-contrib, can be written in about 300 lines of code, comments included. That includes a lot of the standard probabilistic computation utilities needed for something like Gibbs sampling, which is how inference is done here.
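To give a flavor of the kind of probabilistic utility a Gibbs sampler leans on at every step, here is a hypothetical sketch (in Python rather than Clojure, and not taken from the paper’s code; the name `sample_index` and the log-sum-exp approach are my own illustration): sampling an index in proportion to a vector of unnormalized log-weights, which is essentially what happens each time the sampler re-draws a tag for a word.

```python
import math
import random


def sample_index(log_weights, rng=random):
    """Draw index i with probability proportional to exp(log_weights[i]).

    Works in log space with the max subtracted (the log-sum-exp trick),
    so very negative log-weights underflow to 0 instead of breaking.
    """
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point round-off
```

A Gibbs sweep over a corpus is then just a loop that, for each token, scores every candidate tag under the current counts and calls something like `sample_index` on the resulting log-scores.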

Without further ado, the code is on Gist and on GitHub (in case I make changes).

Computer and Computational Science

August 30, 2010

There’s a divide I’ve noticed among people lumped into a “computer science” department. Compactly: I think there are computer scientists and computational scientists; the knowledge bases of these two groups are rapidly diverging, and CS departments should do a better job catering to the needs of each.

So what exactly is the difference? It’s definitely a fuzzy distinction, but essentially a computational scientist works with data, and her primary job is extracting useful information from it. Typically, a computational scientist needs a significant amount of statistical knowledge, as well as substantial knowledge of a particular domain, in order to make use of that data.

Take myself for example: my specialty is statistical natural language processing. My research essentially involves inducing structure from unstructured language data, and this requires far more knowledge of statistics and linguistics than it does expertise in computer architecture, databases, or systems.

A computer scientist, on the other hand, well, is a computer scientist. Her daily bread is understanding the science of how computers run: low-level operating and embedded systems, tuning a database, scaling a web server, and so on. A post like this one, for instance, is all about the computer science.

Now, most computational scientists have to know a little computer science in order to implement what it is she wants to do with data. Increasingly, though, advances made by computer scientists have enabled data scientists to work at higher levels of abstraction without having to think much about what computer scientists think about. These improvements range from the fact that you can now build performant systems in higher-level languages to frameworks like Hadoop that let a computational scientist focus on her data and her domain.

There is plenty the two areas share that justifies putting them in the same department: I believe much of standard algorithms and computational theory is still broadly relevant to both. Procedural thinking, for better or worse, is at the foundation of computer science as well as of how we think about doing things with data.

Thinking about data and how to use it certainly isn’t new; statisticians have been doing it for centuries. What is new is the availability of large datasets and a focus on what actionable decisions should be made from them. Computational science has certainly enjoyed a lot of recent success and growth; The New York Times recently called the area the new sexy job. The number of fields that can make use of computational science is growing and will continue to grow for a long time. Computational science will, hopefully, remain a big part of CS departments for a long time to come.

Here’s the issue, though: I don’t think the educational curriculum of CS departments has adjusted to this growing area. Machine learning isn’t a standard part of the undergraduate curriculum; some instructors have converted their artificial intelligence courses into ML ones, but those aren’t always required either. A statistics course isn’t typically required, and no, bundling probability theory into the tail end of a discrete math course doesn’t count. I mean a course where a student does basic analytics on a large-ish dataset, including things such as simple statistical tests, which are useful in a surprising number of contexts. Many universities require physics and EE courses for the computer scientists; where is the equivalent statistics course for the computational scientists?

A related problem with the standard CS curriculum (conflating computer and computational) is that it doesn’t really convey the broad range of potential CS applications: social science, biology, law, finance, linguistics, astronomy, even comparative literature. I think this is one of the most exciting things about doing CS, and exploring these applications is important for budding young computational scientists. However, early CS courses focus on the nuts and bolts important to computer scientists: programming language details, data structures, low-level memory management, and so on. I’m not sure it’s a fair analogy, but it’s as though your first-year biology course focused on the structure and use of lab equipment; this week: Bunsen burners. Clearly, you need a little computer science to do computational science, but I don’t think it needs to be buried so deep in the curriculum.