Thursday, February 19, 2009

tips for bioinformatics

This was from a talk by Joel Dudley, originally posted by Shirley Wu at

Quite useful if you want to start programming something...

1. Learn UNIX. It’s quick, it’s powerful, it’s easy to learn. What often takes several lines to code in a scripting language can usually be reduced to a single line on the command line.

2. Be jack of all trades, but master of ONE. That is, be familiar with most programming languages, but be really good at one of them. In the hierarchy of languages, VB and C are more “primitive” while Ruby and Python are more “advanced” - he recommends starting with one of the more advanced languages if you are new to programming. Of Ruby and Python, Python will probably give you more bang for your buck, due to the smorgasbord of libraries available and broad acceptance (e.g. academic labs, Google). In addition, there are lots of bridges between languages, such as Jython (Java and Python) and JRuby (Java and Ruby), so expert knowledge of one is usually sufficient for you to make a lot of things work practically everywhere.

3. Don’t reinvent the wheel. “Frameworks are your friends.” Take advantage of large existing projects like BioPython/Perl/Ruby/Java, Django, Rails, etc., which contain lots of ready-to-go code for practically everything. Use the internet to find existing code solutions - e.g. Koders is like a Google search for open source code on the web.

4. Learn one text editor really well. Take your pick of Emacs, vi, or a GUI-based editor like TextMate for Macs. The advantage of Emacs and vi is that they will be installed on pretty much any system you come across.

5. “Don’t trust yourself”, i.e. use code versioning. Examples are Subversion, CVS, and Git. You can even outsource your code hosting with GitHub. Combine this with project management in GForge.

6. Don’t be afraid to use more than 3 letters to define a variable. Having short variable names won’t make the code run faster. It will, however, make the code more difficult for others (and you, 3 months from now) to understand!

7. Balance architecture and accomplishment. You may be tempted to create something that is complete, elegant, and perfectly structured. This will likely be a waste of time. It’s ok to sacrifice a little bit of structure to get something that actually works.

8. Automate documentation. Documentation is necessary, but it’s a pain to write. So come up with a convention for your headers and make it automatic. Use available tools like Doxygen, JavaDoc, and RDoc, many of which are free.

The above are generic for academic-level software engineering. Some tips that more specifically address high-throughput biomedical computing:

9. Kill the flat file (sort of). This is the most common file format used in bioinformatics, but it hardly lends itself to efficient computation. A common task is to read in the data from the file and store it keyed, so that we can look up specific pieces of the data later. Hate databases? Cringe at SQL? If you can represent your data as key/value pairs, consider using an embeddable database like the open source BerkeleyDB (now licensed by Oracle), which requires no administration. If you don’t mind SQL, but hate the administration, SQLite allows you to create embedded, serverless databases. Other options that go beyond the relational database concept are CouchDB (“a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”) and Hypertable (“a high performance distributed data storage system”).
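
To make tip 9 a bit more concrete, here is a minimal sketch of turning a flat file into a keyed lookup using Python's built-in sqlite3 module. The tab-separated layout, the table name `records`, and the helper names `build_index` and `lookup` are all made up for illustration; they are not from the talk.

```python
import sqlite3

# Hypothetical flat-file content: one "ID<TAB>description" record per line.
FLAT_FILE_LINES = [
    "BRCA1\tbreast cancer 1",
    "TP53\ttumor protein p53",
    "EGFR\tepidermal growth factor receptor",
]


def build_index(lines, db_path=":memory:"):
    """Load tab-separated key/value lines into a SQLite table keyed by ID."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, description TEXT)"
    )
    # Split each line once, on the first tab, into (id, description).
    rows = (line.rstrip("\n").split("\t", 1) for line in lines)
    conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, record_id):
    """Return the description stored for record_id, or None if it is absent."""
    row = conn.execute(
        "SELECT description FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return row[0] if row else None


conn = build_index(FLAT_FILE_LINES)
print(lookup(conn, "TP53"))
```

Passing a file path instead of ":memory:" keeps the index on disk, so the flat file only has to be parsed once.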

10. New ways to do parallel computing. Determine whether your tasks are loosely coupled (independent) or tightly coupled. Although personal computers and laptops are coming out with more cores, most programs only use one at a time. Find ways to utilize idle cores - e.g. there is a way to do this in R. Think in terms of MapReduce. Take advantage of cloud computing, like Amazon’s EC2. Use platforms like Hadoop and Disco to build parallel computing applications. A cool example of this is Cloudburst-Bio, a massively parallel project that uses MapReduce to map next-generation sequencing reads to a reference genome.
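
As a rough illustration of tip 10, the sketch below spreads a loosely coupled task over idle cores with Python's standard multiprocessing module, written in a map/reduce style. Counting G and C bases is just a stand-in workload, and the function names `gc_count` and `total_gc` are invented for this example.

```python
from functools import reduce
from multiprocessing import Pool


def gc_count(sequence):
    """Map step: count G and C bases in one sequence, independent of all others."""
    return sum(1 for base in sequence.upper() if base in "GC")


def total_gc(sequences, map_fn=map):
    """Reduce step: add up the per-sequence counts produced by map_fn."""
    return reduce(lambda a, b: a + b, map_fn(gc_count, sequences), 0)


if __name__ == "__main__":
    reads = ["ATGCGC", "GGGTTT", "ccat"]  # made-up reads
    # pool.map has the same shape as the built-in map, but runs gc_count
    # in worker processes, one per available core by default.
    with Pool() as pool:
        print(total_gc(reads, map_fn=pool.map))
```

Because each read is processed independently, swapping the built-in `map` for `pool.map` is the only change needed to use more than one core.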

11. Embrace hardware. New (and old) hardware is available that can give you significant speedups in biomedical computation, notably graphics processing units (GPUs), which have been used to accelerate molecular dynamics. Hardware vendors like NVIDIA are starting to respond; you can now get GPU workstations like NVIDIA’s Tesla personal supercomputer, offering speedups of many hundreds of times over traditional workstations. So if you don’t want to utilize the cloud, you can get an affordable and powerful cluster that fits on top of your desk. Aside from GPUs, there are field programmable gate arrays (FPGAs) - chips you can program after manufacturing.

12. Playing nice with others. Think a bit about data exchange formats - but definitely use them! Suggestions are JSON, YAML, and, of course, XML. When working in teams, use an “agile software development” strategy - mainly many fast iterations of the specification-development-feedback cycle. Use tools to automate the development process, such as unit testing and the granddaddy, “make”. Tools like Basecamp (and perhaps Science 2.0 versions like Laboratree) can help with the more general project management aspects.
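
For tip 12, a small sketch of what using a data exchange format can look like in practice, with Python's standard json module. The field names, the sample ID, and the marker names are all hypothetical, and the final check is only a tiny nod to unit testing, not a full test suite.

```python
import json


def export_sample(sample_id, genotypes):
    """Serialize one sample as JSON text (field names are illustrative only)."""
    return json.dumps(
        {"sample_id": sample_id, "genotypes": genotypes}, sort_keys=True
    )


def import_sample(text):
    """Parse JSON text written by export_sample back into (sample_id, genotypes)."""
    record = json.loads(text)
    return record["sample_id"], record["genotypes"]


# Made-up sample: any tool that reads JSON can consume this payload.
payload = export_sample("S001", {"marker_a": "AG", "marker_b": "TT"})

# Round-trip check: what comes back should equal what went in.
assert import_sample(payload) == ("S001", {"marker_a": "AG", "marker_b": "TT"})
print(payload)
```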


In summary:

Focus on the goal (biology or medicine).
Don’t be clever (you’ll trick yourself).
Value your time.
Outsource everything but genius.
Use tools available to you.
And have fun. ;)

Slides for Joel’s presentation are up on Slideshare

by Beyond Lab

Saturday, August 16, 2008

tough decision - how to focus

Like any graduate student doing research, I go through some tough times, and this is one of them: balancing research and family, meeting the department's requirements, and struggling to focus on only one project.

Looking back, I have to admit that I have not been very focused in my research. As a result, I have worked on 4 or 5 different projects. They are all exciting projects, and I have made reasonable progress in most of them. But now I realize that to graduate quickly, I have to focus on one thing and go deep enough. So, which project to drop?? I am struggling between two projects. The one on which I have made the most progress, and which has already produced quite a publication, is really risky to follow up but should definitely have a big impact (this is like marketing). The other one is relatively straightforward (I have to say relatively) and will very likely generate some small paper(s).

Plus, the department requires me to teach for two semesters, and one of them has to involve some kind of lecturing (like lab courses). This will definitely eat up a lot of my research time.

I wish I had 48 hours a day to work!!!

I think I probably will go with the relatively easier one so that I can graduate by the end of next year. After graduation, if the risky one has not been published by other researchers, I could continue it. (Again, this is marketing!!!)

Wish me luck!

by Beyond Lab

Thursday, July 31, 2008

lab design

For most biology research (especially molecular biology), lab design is an art. A poorly designed lab will "effectively" limit your productivity: it affects how you move around, organize equipment, communicate, and store things, and it wastes space. All these things must be considered, and to do that, the actual researchers have to be actively involved in the design.

Our lab has moved a few times in the last two years, either moving across states, or from one building to another building. The building I am currently in is brand new. But we've already seen a lot of design problems.

To name a few:
1. No storage space. Biology research uses a lot of tubes, dishes, and so on. It is impossible to buy new ones every day or every week, so you always buy by the box, even 10-20 boxes at a time. Therefore you need space to store them. However, there is absolutely no space designed for this purpose. As a result, our lab manager had to personally buy shelving material from Home Depot and ask a carpenter to put shelves on the walls, which of course are not as stable, but are much better than nothing. Now, every time our big boxes come, we have to unpack them and put the contents onto the shelves (obviously the boxes are too big for regular shelves).

2. Curtains and emergency light in the microscopy room. Modern microscopy mainly means fluorescence microscopy, which requires dim or no light while working; in other words, a dark room. Ours was designed badly, with an emergency light right above the microscope. Because it is an emergency light, you have to leave it on all the time, so how do you use the microscope when you need darkness? Also, microscopy rooms usually have curtains to block light out. Ironically, they chose white curtains for a dark room!

3. Conference room. You have to present your PowerPoint slides - always with images - so you want the lights in the conference room to be adjustable. That part was designed in. However, the room has two big windows facing south, and the regular blinds let light in easily. Bad, huh? It gets worse: the room is about 2 meters wide and 5 meters long, with the screen installed on the long side. You know what I mean: when you want to see the presentation, you have to bend your whole body back. No one's neck likes that. If there are more than 5 people, you'll have to watch the presentation from half a meter in front of the screen. Like it?

There are more...

Who the hell designed the building???

by Beyond Lab

Wednesday, July 30, 2008

another stupid microsoft thing

Have you ever tried to set up two internet connections for your laptop? Are you tired of switching between a static IP address and DHCP?

Well, Microsoft actually has this function. How to use it? -- This is the help page.

Following it, can you find the "Alternate Configuration" tab??

If you are lucky, you will see it. But very likely you won't. Why? Simple! It only shows up if you select DHCP (obtain an IP address automatically). If you set a static IP address, there is no Alternate Configuration tab.

Why? Ask Bill Gates. (Oh, I guess he's off the hook already. But sorry, we have to blame him.)

by Beyond Lab

Monday, July 28, 2008

better than others?

Something I am doing right now requires me to demonstrate that I am better than others with the same level of education. What can I do that others cannot?

While thinking about this question, I remembered something like "If you cannot describe what you are doing in one sentence, stop doing it. You are wasting your time."

Sure, I can do that. But my training has been a bit diverse, although always in life science: from medical school, I gradually shifted toward basic pathology research and then to my current biological research. I'll have to explain some basics before people really get what I am doing. On the other hand, this really gives me something that sets me apart from others. I will have to take advantage of my training background.

However, I understand that to succeed, I need to be able to explain my research clearly to lay people in PLAIN language, so that they understand the importance of my research and therefore provide support. This is not just grant writing, which is for experts to review. This is for everyone else. I will post it here once I am done.

by Beyond Lab

Wednesday, July 23, 2008

DNA for dating??

by Beyond Lab

It has always been said that it is chemistry that brings two people together. Probably that is right (lol) - check out this site: .

When doing a transplant, you always need to match certain genes, for example MHC. In the future, before you date someone, you will test his/her DNA first. Or you will try to steal something he/she used on your first date to get the DNA tested.

This is a joke. --That's all I can say.

But will the company profit? Probably. There are enough idiots in this world. Only marketing matters in business.

Friday, July 18, 2008

where are your saliva samples being analyzed?

by Beyond Lab

It is clear that 23andme uses Laboratory Corporation of America (LabCorp) as its genotyping service lab, with Illumina chips (arrays). Beyond Lab is wondering how much 23andme pays LabCorp for each sample, how much the chip costs, and how much they keep from what customers pay. This market is huge but is absolutely quantity-dependent: the more customers, the more profit. What will the customers get, and how will the information help them? That's something else.