Chris' pdbCharmer |
In case you don't know... PDB stands for Protein Data Bank |
"Everybody is always coming up with (mostly vaporware) data structures that handle all possible situations and are extensible in all thinkable (and unthinkable) ways."-- Thomas Hamelryck, The Biopython Structural Bioinformatics FAQ |
Hmmm. Maybe everybody is trying to tell you something important. |
It is easy to write a PDB parser. It is hard to write the PDB parser.
In real life I personally know at least 6 people who have written some kind of parser for information contained in PDB files. I have come to realize that this wheel has been reinvented thousands of times. Why should we Python programmers continue to suffer this fate when a single, comprehensive cure to the problem is possible?
pdbCharmer is my attempt to put an end to the nonsense of writing custom parsers for every little bioinformatics project that needs to parse PDB files. That part is actually working pretty well. The grander goal is to provide an interface for all matters dealing directly with the PDB file format.
If you want the source code it is here: pdbCharmer.py
I've done a lot of work recently to clean up some problem areas and add some features. Now all of the information in the COMPND and SOURCE records are available. Now the current way to get information from these records is like this:
X.section[pdbCharmer.TI].compnd.compnd['molecule']
X.section[pdbCharmer.TI].source.source['organism_common']
I've also done a lot of work writing a tester program that can run through all the PDB files and see if the parsing system breaks. So far, I've found some pretty flagrant errors in quite a few PDB files. Perhaps the worst is a mangling of the MASTER record which is generated by PDB themselves. There are about 245 PDB entries with an incorrect MASTER record. Also it seems that a lot of non-crystallographers resent the fact that the CRYST1 record is mandatory because many folks simply leave it out. There are hundreds of such examples. All my testing has also caused me to set up a somewhat clumsy system that sort of auto-detects what version a PDB file is and act accoringly. This reduces the crashing quite a bit (though if you're looking for new things in old files, you get what you deserve).
I just added a filter function to clean up all the strings extracted from the PDB. Now if the PDB is supposed to contain an "Integer", it is actually an integer in Python too. Also strings are stripped if there was extra space. Any object that is blank in the PDB now returns a None object type allowing for checks like this:
import pdbCharmer
pdb1zx8= pdbCharmer.info()
pdb1zx8.get_raw_pdb_data("PDB:1zx8")
pdb1zx8.input([pdbCharmer.SS],True)
helices= pdb1zx8.section[pdbCharmer.SS].helix_list
for h in helices:
if h.initICode: # <-- Note that it will be None if empty.
print ( h.initICode + '\t'
+ "Insertion code of initial residue of helix")
This parser is now getting to be pretty useful.
Here is some code to illustrate new features. It is meant to be read, not run. Read it.
#!/usr/bin/python
# PDB file parsing with Xed's system:
# (http://xed.ucsd.edu/PDB/)
# Chris X Edwards - 05.08.08
# This is a sample program that is designed to illustrate some new
# features of my PDB parsing system. It is meant to be read more than
# run and serves only to explain concepts incorporated in the design
# of the system. Specifically, this shows the first phase of my
# optimization strategy which is PDB section selectivity. The next
# phase (if it ever gets done) will be PDB record-type selectivity.
# Since there was some concern about how many steps it took to get PDB
# data into objects, this clearly demonstrates why that was set up in
# such a way. Also it addresses concerns about my design's performance
# potential.
import pdbCharmer
import time
def test_PDB_input_features(exclude_list=[], sense=False):
"""This is a sample function which illustrates step by step how to
convert a PDB file into useful data objects. It is set up to
illustrate the different optimized conversion modes."""
sample_pdb_list= ['1J6U', '1J6P', '1O2D', '1O0W', '1J5P', '1J5U', '1J5X',
'1J5W', '1J5Y', '1J6O', '1J6R', '1O12', '1O13', '1O1Y',
'1O0Y', '1O22', '1O5J', '1O51', '1O3U', '1O1Z', '1O20',
'1O5H', '1O59', '1O4W', '1VPV', '1VL6',]
for s in sample_pdb_list:
# Set up an info object (will contain file's info).
pdbob= pdbCharmer.info()
# This is a new feature that will take a pdbid and fetch a pdb
# file out of a local mirror in some place like:
# /hardcodedlocation/for/now/xy/pdb1xyz.ent.Z
file= "LOC:"+s # Means "Get from LOCal repository."
# ("PDB:1xyz" means downloadownload from pdb.org.)
# Actually load the file - raw data collected.
pdbob.get_raw_pdb_data(file)
# Here's the point of this exercise:
# Now parse the data into meaningful objects.
# Notice how the exclude list and sense are positioned.
pdbob.input( exclude_list, sense )
# Just at test to make sure it's doing anything at all.
# print s+':'+pdbob.section[pdbCharmer.TI].expdta.technique
def timer(start):
"""Just a function to time things."""
print "Time:"+ str( time.clock() - start )
return time.clock() # Start time for next use.
# -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
t= time.clock() # Start the timer.
# ------ TEST ONE -------
# The default mode. This converts all raw data to usable objects.
test_PDB_input_features()
t= timer(t)
# RESULTING OUTPUT: Time:2.36
# ------ TEST TWO -------
# Specifying an exclude list, i.e. don't worry about these
# sections. Anything in the sections specified won't get
# translated to objects. Notice there is no "sense" argument.
test_PDB_input_features(
[ pdbCharmer.CO, pdbCharmer.PS, pdbCharmer.HE,
# Don't like my abbreviations? This = pdbCharmer.CN
"Connectivity", # The long names were taken from PDB spec.
"Crystallographic",
# Of course you can always mix them up.
pdbCharmer.RE, pdbCharmer.BK ] )
t= timer(t)
# RESULTING OUTPUT: Time:0.58
# ------ TEST THREE -------
# Here the sense argument is turned on signifying that instead of an
# excluded list, it is an include list. This means that it will only
# convert what is listed. In this case, it will only process records
# from the title section.
test_PDB_input_features( [ pdbCharmer.TI ], True )
t= timer(t)
# RESULTING OUTPUT: Time:0.55
# ------ TEST FOUR -------
# This test shows what happens if I ask it to not process anything. I
# turn the exclude list into an include list with the sense flag and
# then leave the list empty, i.e., include nothing. It stil takes time
# however and in fact demonstrates how long it takes just for the
# overhead of going and getting the files and looking at them to
# decide what to do next. Note that this now returns an error if you
# try to print out the "technique" as shown in the test function.
try:
test_PDB_input_features( [ ], True )
except KeyError:
print ( "Note that this now returns an error if you try to "
+ "actually access the nonexistent information." )
t= timer(t)
# RESULTING OUTPUT: Time:0.55
A day late and a dollar short... It has come to my attention that
my work here is pretty heavily redundant. I have recently learned
about The Python
Macromolecular Library (mmLib). This package has lots of fine
features, but its most interesting to me is a rather complete PDB
parser that looks strikingly similar to mine (conceptually). From the
README file:
"Although several Python PDB parsers already exist, this is
the only one supporting all PDB v2.2 records."
And indeed, looking at the code makes it obvious that its author
actually read the
PDB
spec. Not only is there a rather complete PDB system, these guys
have already accomplished what I've only conceptualized - an mmCIF
system too.
This is all rather frustrating for Python-programming PDB users who have been denied a centralized, unifed, comprehensive, high-quality system for working with macromolecular file formats. What will the future direction of this project be? Hard to say. Right now, I'm more interested in integrating my efforts into mmLib. But certainly more thought and better management are required.
I have started a group for discussing the very narrow topic of PDB files and Python. If you are interested in this topic, you are encouraged to participate.
Ok, PDB-with-Python fans. I've made a lot of progress. I now have a pretty huge system that is roughly capable of pulling out ANYthing that a PDB file can possibly contain. This is a big and complex system but PDB files are big and complex. At this stage, there are still some limitations:
if X.section[pdbCharmer.SS].helix_list[n].initICode:
print "There is an insertion code."
I wrote a stupid script to show how this works:
#!/usr/bin/python
# An example program that shows off features of the pdbCharmer module.
# The module that is being tested.
import pdbCharmer
# Needed for argv.
import sys
# Allowing the tester to specify a PDB file.
file= sys.argv[1] # Take first argument as filename.
# Use "PDB:1xyz" to retrieve from pdb.org.
# Set up an info object (will contain file's info).
X= pdbCharmer.info()
# Actually load the file - raw data collected.
X.get_raw_pdb_data(file)
# Now parse the data into meaninful objects.
X.input()
print "---------------------------------------------"
print "Helix ratio is:"
print X.section[pdbCharmer.SS].helix_ratio()
print "Type of the first helix:"
print X.section[pdbCharmer.SS].helix_list[0].typeOfHelix
print "Sense of the first sheet:"
print X.section[pdbCharmer.SS].sheet_list[0].sense
print "Sample of original PDB file:"
print X.rawline_list[22]
print "Experimental technique:"
print X.section[pdbCharmer.TI].expdta.technique
print "First lines of JRNL in title section:"
print X.section[pdbCharmer.TI].jrnl_list[0].text
print X.section[pdbCharmer.TI].jrnl_list[1].text
print "Experimental technique:"
print X.section[pdbCharmer.TI].expdta.technique
print "Show all atoms with residue name:"
for sa in X.section[pdbCharmer.CO].atom_list:
print sa.resName
print sa.x + ',' + sa.y + ',' + sa.z
...to find and extract any and all information that a PDB file can possibly convey.
Explanation: The idea here is that PDB files are often (and certainly can be) packed with an extremely diverse collection of data. Thomas says: "99.9% of people using crystal structures think in terms of ...". I think he has rightly considered the most important problem first, but I also think his estimate is a little high. I, for example, hardly seem like the 1 in 1000 user who doesn't care about the coordinate based model at all (for now). So maybe the number is more like 95%. Yet if this is a successful piece of software, the 5% add up and even ordinary users may expect to use some non-coordinate information 1 in 20 applications. Therefore, there must be a facility or at least a strategy with which to cure any PDB related information problem.
When I first considered using Bio.PDB, I felt confident that it was going to be able to extract whatever information I could possibly want out of a PDB file. That was not the case. Then I felt confident that there was a plan in place where the system would be easily extensible so that it could be modifed in a predetermined way to add the functionality I needed. This was not the case either. That is why I became involved and have created this framework and proof of concept.
...to understand the correct syntax of the PDB file format so that a dataset can be confirmed to correctly follow all rules.
Explanation: This may seem a bit extreme and like something we would be tempted to try and ignore, but if we want the best tool for handling PDB files possible, then the concept of understanding the syntactic and grammatical rules becomes important. It is conformance to the rules that makes a PDB file a PDB file. This could be functionality that could cause unnecessary delay in both development and runtime applications. That is fine - it's relatively easy to be able to turn this functionality off as an optimization. But the system is not a complete PDB file management tool if it has no mechanism to catch completely absurd mistakes. The logical extension of that is an ability to catch ALL mistakes and violations of the explicit PDB format. This point is conceptual and it may require separate facilities to deal with the various versions of the PDB format. However, I propose that having this goal and designing the foundation with it in mind will provide powerful and compelling functionality that will raise the profile and popularity of this system. Already the system that I have created can detect records and sections that are out of order, records that are used more than once that shouldn't be, and a variety of similar points. It does not do the "perfect" validation and probably never will, but it has been a goal to be able to confirm that all rules have been followed when looking at the input. If some rule is not checked for that is important to someone, it should be obvious where and how to implement the check for that conformance. Perhaps the most work on this can be done in the record classes which can check for correctness by record-type as necessary. But the plan is there.
...to complete a "round-trip", i.e. data is read from a file into a data structure and then written out from a data structure to a valid file. (After the 2nd round, diffs should be null).
Explanation: When a system exists that can bring any correct data set from a PDB file into Python memory objects, then there is a logical parity in the ability to make the return trip from a data object to a correctly formatted PDB file. This goal is closely associated with the other goals and represents a huge potential for usefulness in bioinformatic analysis. It seems to me that a lot of structure work involves reading in a known structure, adjusting it some how, and then comparing it to the original in some way. A simple example might involve a desire to remove certain components of a structure from consideration during further analysis by programs that read PDB files. Unless your problem can be solved with "grep -v", this type of situation would benefit greatly from a round trip capability. Also, on the horizon should be the concept of conversion between protein structure file formats. Sometimes you need a PDB file, but have an mmCIF or PDBML file. This problem is related, or should be, to the mission of Bio.PDB.
Maybe flawed, but not vaporware.
Chris X Edwards <cedwards At cs Dot ucsd Dot edu>
Last Updated 2005.08.08