Perl or Python for bioinformatics

I plan to learn both. Just curious about opinions regarding the difference between these two languages and their applications with respect to bioinformatics. I haven’t decided which one to try first.

Python is a real programming language, with a significant set of extensions for just about every scientific venture imaginable. Perl is a text manipulation language. Perl has its place. But you don’t write real code in it. You may find a lot of little utilities written in Perl for manipulating various files, and munging formats. But you can’t write serious programs in it. At least not and stay sane.

I’ve written “real” bioinformatics code in both Perl and Python. Both have their place. Both have been around long enough that both have lots of extensions and libraries for nearly anything under the sun. Both have been used for numerical ‘heavy lifting’, though you will need to do low-level optimization to get either of them to operate at ‘high performance’ levels. Neither is really what you’d want to write a huge project in unless there’s some specific feature that you desperately need.

My very broad rule of thumb for picking a language when starting a project that’ll be relatively small is to consider what its primary goal will be. If the point of the project is to carry out transformations on what is fundamentally textlike data (and note that a very substantial part of bioinformatics, perhaps the majority, is this kind of code), Perl is a great language; it was designed from the ground up for this sort of thing. Write conservatively and comment often, and the codebase will even be maintainable. If the point of the project is to manipulate lists (either explicitly or implicitly with infinite streams), then Python has a lot of features geared for just that.
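To make the "textlike data as a lazy stream" idea concrete, here is a minimal Python sketch. It assumes FASTA-style input (a format nobody above named; I picked it only as a familiar example), and the names read_fasta, gc_fraction, and EXAMPLE are made up for illustration. It is not how BioPython or any other library does it, just the generator pattern in miniature.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Lazily yield (header, sequence) pairs from FASTA-style lines.

    Hypothetical helper: records are produced one at a time, so the
    input can be a huge file or an open-ended stream.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy data, invented for this example.
EXAMPLE = """>seq1 toy record
ATGC
GGCC
>seq2 another toy record
ATATAT
""".splitlines()

# islice takes only the first two records; nothing past them is parsed.
first_two = list(islice(read_fasta(EXAMPLE), 2))
for name, seq in first_two:
    print(name, len(seq), gc_fraction(seq))
```

Because read_fasta is a generator, you can pass it an open file handle instead of a list and it will never hold more than one record in memory.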

Both Perl and Python have extensive bioinformatics libraries (Bioperl and BioPython respectively), and this, by the way, is a good rule of thumb for deciding on whether or not to use a language for bioinformatics applications; if your data format of choice doesn’t have a parsing+manipulation library available for a given language, you really need to think hard about whether whatever features that language has are worth the grief of having to write those libraries yourself. It can turn a day-long lark into a month-long root canal.

I don’t write code for bioinformatics, but I wanted to defend Perl a bit. It’s plenty serious for doing real work and writing “real” applications. I wouldn’t write a flight simulator with it (though I’m sure someone has) but it has a niche. Perl is getting a bit long in the tooth but continues to be useful.

Based on Wikipedia, BioPerl is the basis for quite a list of bioinfo tools.

I also note there is a BioRuby and a BioJava, so other programmers can chime in to support their languages!

For bioinformatics it looks like both languages have support libraries and books to get you started. I’d start with the one with the cheapest book. Both languages are useful for non-bioinformatics too so it’s hard to go wrong.

Do you have a mentor/advisor/trainer in bioinformatics? That’s probably essential, for making any real progress in the field. I’d suggest that you just find out what language that person uses, and focus on that, so you can share code easily.

You do not know what you’re talking about.

This is kind of a dickish reply, so let me expand a bit.

It is wrong to dismiss Perl as “a text manipulation language.” Perl is a general-purpose, modern, object-oriented, functional programming language. The core language has a lot of historical baggage and is crufty, but a large ecosystem of extensions makes it a useful and powerful tool with all the same capabilities as any other high-level programming language.

Not everybody likes Perl syntax. That’s fine. It leaves a lot to be desired in some places and some people seem intent on writing unreadable Perl code. Perl thus suffers from a poor reputation in some places.

For scientific and math applications, Perl has PDL (for very fast numeric computation). It has packages for symbolic computation (aka computer algebra systems). There are fast, extremely well-tested packages for doing math on complex numbers, rationals, bignums, trig, financial models, and other stuff. It has widely used interfaces to all computer algebra systems, like MATLAB and Mathematica.

For bioinformatics, BioPerl is a mature and stable framework. It’s been around for nearly 10 years. It was largely responsible for the success of the Human Genome Project.

With Moose and its ecosystem, Perl has the best OOP framework of any language in common use.

So I get a bit testy when people who don’t use Perl dismiss it because they aren’t aware of its capabilities. I have made a good living for more than a decade writing large, sophisticated, documented, readable, well-tested production systems in Perl.

Perl vs. Python is a matter of religion, not merit. The fact of the matter is that they’re both perfectly good languages for whatever you might want to do, and your choice really needs to be based on what the people you’re going to be working with use.

Anecdotally, our bioinformatics guy uses Perl and R pretty much exclusively and does wonderful work.

Thanks for all of the input.

Not really. I’m beginning grad school this fall. I haven’t yet picked my lab, but the lab options all incorporate bioinformatics into their research. Any coding in these labs is done primarily by the grad students as none of these particular professors are bioinformatic experts. My bioinformatics training will come from courses outside of the lab and other grad students. Eventually I’ll learn both, so that I can more easily work with various labs regardless of what particular language they use.

I haven’t used Perl, but I program bioinformatics with Python, and it is great.

Nah, it was a fair retort. My comment was born of the end of a long frustrating day. I do a great deal of work in Python, and have seen some guys struggling to do scientific work in Perl, although it was probably more a reflection on their capability than the language. It is interesting how the languages steal from one another. Python has pretty much lifted Perl’s regexp syntax. The current reality for both languages is a far cry from their beginnings. (That said, I still do dislike Perl’s syntax, and I doubt that will ever change.)
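For what the shared regexp syntax looks like in practice, here is a small Python sketch using the standard re module. The pattern and the find_orfs name are my own toy illustration (a start codon, some codons, then a stop codon), not anything from a real pipeline. The same pattern text should also work between slashes in a Perl match.

```python
import re

# Toy pattern (an assumption for illustration): "ATG", then as few
# three-base codons as possible, then one of the stop codons.
# The non-capturing groups (?:...) and the lazy quantifier *? are
# Perl-style constructs that Python's re module also supports.
ORF = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(seq):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ORF.finditer(seq.upper())]


print(find_orfs("ccATGAAATAGgg"))
```

This only scans one strand and one pass of non-overlapping matches, so treat it as a syntax demo rather than a gene finder.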

Thank you for the input. I definitely appreciate your enthusiasm for Perl, and others I have spoken with in person share your enthusiasm. However, I do question this one statement in support of Perl. I’m well aware of the efforts of the Human Genome Project and am fairly well versed in many aspects of it. I’m aware of several advancements (outside of scripting languages) that contributed to its success. Taking you at your word that BioPerl has been around for nearly 10 years, however (I haven’t bothered to look up when BioPerl was first established), that doesn’t fit with its having been “largely responsible for the success of the Human Genome Project.” The Human Genome Project was begun in 1989, with a working draft in 2000 (I even remember reading about this long before I decided to enter the field) and a complete draft in 2003. If you’re correct about BioPerl being around for nearly 10 years, that would put its conception at late 2001 or later, after the initial working draft was announced. I’m not sure how BioPerl could be largely responsible for its success if it was conceived after practically all of the work for the Human Genome Project had been completed.

It’s possible that BioPerl applications have been around longer (again, I haven’t looked it up) and were used in some aspects of the Human Genome Project. However, since the vast majority of the work on the Human Genome Project, in both the Venter lab and the government collaboration, was done pre-2000, it seems unlikely that a technology that had yet to be conceived (according to your stated time) was “largely responsible.” It’s also possible that BioPerl applications were conceived while working on the Human Genome Project. However, neither of these assumptions fits with your statement that BioPerl has been around for “nearly 10 years.”

I only point this out in response to your admittedly “dickish” reply that another responder doesn’t know what he’s talking about.

I appreciate your enthusiasm for a particular language, but overemphasizing and possibly embellishing support to back up your enthusiasm really doesn’t help. My guess is that you’re a programmer with a particular taste, and your assessment of Perl vs. Python may be correct. I also assume that you’re not a bioinformatics or bio-anything professional who made a hasty statement on the matter, which is fine. It’s just unfortunate that the hasty statement came directly after a rash comment toward someone with a different opinion.

Thanks for the input though, it really does help. I appreciate the input of everyone, and would like for everyone with an opinion on the subject to comment. So let’s please not insult those with differing points of view and chase them away.

Wikipedia says that BioPerl dates back to 1996, and has a Usenet cite to support that. It wasn’t officially released until 2002, but it seems likely that it was used internally to the HGP well before that date.

Wikipedia also states “It has played an integral role in the Human Genome Project.[2]” with a citation. (http://www.bioperl.org/wiki/How_Perl_saved_human_genome). To my eyes the article is more about Perl than BioPerl. My quick skim leads me to believe that it is describing the creation of BioPerl. It also matches up with the 1996 creation date of BioPerl.

I don’t know what all falls under bioinformatics that might distinguish it from any other programming task, so I can only comment on the languages as a generic platform for anything.

Many programming languages are better suited to particular sizes of application. The following would be a rough guess as to what might be considered the relative size of applications:

Script < 200 lines
Tool < 1000 lines
Small application < 3 files @ ~1000 lines each
Medium application < 15 files
Large application < 50 files
Giant >= 50 files

Perl is best for scripts. It can be used for tools and (if you use object-oriented coding) small applications, but probably isn’t ideal above that. Though you might be able to write perfectly decent Perl code, people around you might not, so once you’re getting to an application with multiple authors, Perl isn’t great. But it is a great language for scripts. It has a very compact syntax that allows you to do common tasks in a very small amount of code. If you don’t want to spend a lot of time and the task is straightforward, it’s a good language to know.

Python probably scales up to about the medium application level. It has a less arcane syntax, so it’s easier to read and understand, and to write as well since you won’t have to spend as much time going back through the reference material to remember what symbol means what.

For medium+ applications, I’d recommend a language like C# or Java. They’re no harder to code in than Python, and they give a performance boost so you get your answers faster. If you need a GUI, they both have good abilities to enable this, without you having to go through arcane means.

Now on the other hand, if you’re really looking for speed of processing, then you’ll want to look into C (script -> medium) and C++ (tool -> giant). But you’ll need a team of skilled, experienced coders for that.

Your experience with Perl may be somewhat dated. It has a talented and vibrant development community and has greatly evolved in the last decade.

Perl is widely used for “Giant” applications with large diverse development teams.

Users include Amazon, Priceline, Ticketmaster, NASA, the BBC, Craigslist, etc.

(Disclaimer: I work for such an organization and have a small team working on “Giant” Perl applications.)

And? Any language can be used for anything. PHP is the backbone of many of the largest websites, like Facebook and MySpace. That doesn’t mean that PHP is a good language to make large websites from. People can and do write giant, object oriented programs in C. That doesn’t mean that C is the ideal language for that. Theoretically, C doesn’t even support object oriented coding at all.

Given a hammer, nails, and a saw, I can build every house in the world, but I’d still be better off starting out with a chainsaw, nail gun, nails, and a power source.

Perl won’t prevent you from making a giant application, nor will it prevent you from making a well-formed, bug free, and maintainable set of source code. But, you’ll end up working harder to get it to that point than if you’d chosen a language that was better suited to the task.

I’m suggesting that due to developments in the last decade, Perl has advanced to the state that it is very well suited to a certain class of tasks, including some that involve “Giant” applications developed by large teams.

One can get into language wars quickly. Snide comments like mine above never help. I currently work on a project that easily fits the above definition of large, verging on giant, which has a very large component in Python. I spend my time coding in C++ and Python. Languages have strengths and weaknesses, and there are many times where a language has been pushed past what you might regard as its intrinsic limits. C++ is a good example of that. It is not a happy language. Python is generally a pleasure to program in, and where speed isn’t critical, you can write a lot of very powerful code very quickly. If performance is critical you need to be able to program to the metal, and the language needs to get out of your way. Performance critical code needs a clear understanding of the underlying machine architecture as well as algorithm design skills. Fortran is still around, and brilliant for some tasks. Especially in F95 and HPF guise, where you have the ability to express data parallelism with great ease.

C# and Java provide a managed runtime, which is a huge win for any large project. There is a point where you simply can’t manage the sheer complexity of code properly and you need what one might regard as proper language support. C++ doesn’t provide this, and they dropped automatic GC from C++0x at the last moment, which is a huge pity.

Choice of language is usually much more constrained by a project’s environment than by purely technical considerations.