Software Carpentry Day 10: XML and Regular Expressions

Morning: XML

A (simplified) PDB file looks like this:

COMPND      METHANE
AUTHOR      DAVE WOODCOCK  95 12 18
ATOM      1  C           1       0.257  -0.363   0.000  1.00  0.00
ATOM      2  H           1       0.257   0.727   0.000  1.00  0.00
ATOM      3  H           1       0.771  -0.727   0.890  1.00  0.00
ATOM      4  H           1       0.771  -0.727  -0.890  1.00  0.00
ATOM      5  H           1      -0.771  -0.727   0.000  1.00  0.00
TER       6              1
END

The author field may be any length and need not be a person's name; it may, for example, be the name of the laboratory where the data was collected. The trailing '1.00' and '0.00' on each ATOM line can be ignored for this exercise.

An XML representation of the same data might look like this:

<molecule name="METHANE">
  <author>DAVE WOODCOCK</author>
  <created year="95" month="12" day="18"/>
  <atoms>
    <atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
    <atom symbol="H" x="0.257" y="0.727" z="0.000"/>
    <atom symbol="H" x="0.771" y="-0.727" z="0.890"/>
    <atom symbol="H" x="0.771" y="-0.727" z="-0.890"/>
    <atom symbol="H" x="-0.771" y="-0.727" z="0.000"/>
  </atoms>
</molecule>
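
As a rough illustration of how this XML layout maps back onto the PDB layout, the sketch below reads the document with Python's standard xml.etree.ElementTree module and rebuilds PDB-style lines. The helper name xml_to_pdb, the column widths, the fixed "1" residue column, and the constant '1.00' and '0.00' fields are assumptions that imitate the simplified sample above; they are not taken from the full PDB specification.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the XML example above.
SAMPLE_XML = """<molecule name="METHANE">
  <author>DAVE WOODCOCK</author>
  <created year="95" month="12" day="18"/>
  <atoms>
    <atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
    <atom symbol="H" x="0.257" y="0.727" z="0.000"/>
    <atom symbol="H" x="0.771" y="-0.727" z="0.890"/>
    <atom symbol="H" x="0.771" y="-0.727" z="-0.890"/>
    <atom symbol="H" x="-0.771" y="-0.727" z="0.000"/>
  </atoms>
</molecule>"""


def xml_to_pdb(xml_text):
    """Return simplified PDB-style text for one <molecule> document.

    Hypothetical helper: the spacing imitates the sample file only.
    """
    root = ET.fromstring(xml_text)
    created = root.find("created")
    date = " ".join(created.get(key) for key in ("year", "month", "day"))
    lines = [
        "COMPND      " + root.get("name"),
        "AUTHOR      " + root.findtext("author") + "  " + date,
    ]
    serial = 0  # stays 0 if the document has no atoms
    for serial, atom in enumerate(root.iter("atom"), start=1):
        x, y, z = (float(atom.get(axis)) for axis in "xyz")
        lines.append(
            "ATOM  %5d  %-2s          1    %8.3f%8.3f%8.3f  1.00  0.00"
            % (serial, atom.get("symbol"), x, y, z)
        )
    # The TER record carries the next serial number after the last atom.
    lines.append("TER   %5d              1" % (serial + 1))
    lines.append("END")
    return "\n".join(lines)


print(xml_to_pdb(SAMPLE_XML))
```

Atom serial numbers are not stored in the XML, so the sketch regenerates them from document order.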

Write a program that reads an XML file and prints out a (plain text) PDB file. For bonus marks, modify your program so that it puts the molecule data in an SQLite database that has two tables structured as follows:

Molecule
Name       Author           Year  Month  Day
"METHANE"  "DAVE WOODCOCK"  95    12     18
"AMMONIA"  "DAVE WOODCOCK"  97    10     31
...        ...              ...   ...    ...

Atom
Name       Id   Kind   X       Y       Z
"METHANE"  1    "C"    0.257  -0.363   0.000
"METHANE"  2    "H"    0.257   0.727   0.000
"METHANE"  3    "H"    0.771  -0.727   0.890
"METHANE"  4    "H"    0.771  -0.727  -0.890
"METHANE"  5    "H"   -0.771  -0.727   0.000
"AMMONIA"  1    "N"    0.257  -0.363   0.000
"AMMONIA"  2    "H"    0.257   0.727   0.000
"AMMONIA"  3    "H"    0.771  -0.727   0.890
"AMMONIA"  4    "H"    0.771  -0.727  -0.890
...        ...  ...    ...     ...     ...
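
The two tables could be set up and filled along the following lines using Python's standard sqlite3 module. This is a minimal sketch, not a reference solution: the column types (TEXT, INTEGER, REAL), the in-memory database, and the helper names create_tables and store_molecule are all assumptions, since the exercise only fixes the table and column names. The sample XML is shortened to two atoms to keep the example brief.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Shortened sample: only the first two atoms of the methane example.
SAMPLE_XML = """<molecule name="METHANE">
  <author>DAVE WOODCOCK</author>
  <created year="95" month="12" day="18"/>
  <atoms>
    <atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
    <atom symbol="H" x="0.257" y="0.727" z="0.000"/>
  </atoms>
</molecule>"""


def create_tables(connection):
    """Create the Molecule and Atom tables (column types are assumed)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS Molecule "
        "(Name TEXT, Author TEXT, Year INTEGER, Month INTEGER, Day INTEGER)"
    )
    connection.execute(
        "CREATE TABLE IF NOT EXISTS Atom "
        "(Name TEXT, Id INTEGER, Kind TEXT, X REAL, Y REAL, Z REAL)"
    )


def store_molecule(connection, xml_text):
    """Insert one <molecule> document into both tables."""
    root = ET.fromstring(xml_text)
    name = root.get("name")
    created = root.find("created")
    connection.execute(
        "INSERT INTO Molecule VALUES (?, ?, ?, ?, ?)",
        (
            name,
            root.findtext("author"),
            int(created.get("year")),
            int(created.get("month")),
            int(created.get("day")),
        ),
    )
    # Atom ids are not in the XML, so number the atoms in document order.
    atoms = [
        (name, serial, atom.get("symbol"),
         float(atom.get("x")), float(atom.get("y")), float(atom.get("z")))
        for serial, atom in enumerate(root.iter("atom"), start=1)
    ]
    connection.executemany("INSERT INTO Atom VALUES (?, ?, ?, ?, ?, ?)", atoms)
    connection.commit()


connection = sqlite3.connect(":memory:")  # use a file name to keep the data
create_tables(connection)
store_molecule(connection, SAMPLE_XML)
rows = connection.execute(
    "SELECT Id, Kind, X, Y, Z FROM Atom WHERE Name = ? ORDER BY Id",
    ("METHANE",),
).fetchall()
print(rows)
```

The "?" placeholders let sqlite3 handle quoting, which is safer than building SQL strings by hand from values read out of the XML.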

Afternoon: Regular Expressions

  1. Find words written in all caps and replace them with title case, e.g., "ALL CAPS" becomes "All Caps".
  2. Find amounts written in French-style currency and replace them with the English style, e.g., "123,40$" becomes "$123.40".
  3. Translate plain double quotes into LaTeX quotes, e.g., "something" becomes ``something''.
  4. Write a regular expression to match a single line from a molecule inventory file. Make sure it handles comment lines correctly.
  5. Write a regular expression to match a single line from a molecule formula file. Make sure it handles comments correctly.
  6. Write a regular expression that matches things that might be chemical formulae, such as 'He' (monatomic helium), 'NaCl' (salt), 'CO2' (carbon dioxide), and 'C12H6Cl6' (aldrin). Is there any way to exclude license plates (like 'A1B2C3')?
  7. Write a program using regular expressions that will read a PDB file and return both:
    1. a tuple containing the molecule's name, the author's name, and the date the file was created, and
    2. a list of (atom, X, Y, Z) tuples.
    For example, if the input is the file at the top of the page, the program should return:
    ('METHANE', 'DAVE WOODCOCK', '95 12 18')
    and
    [('C', 0.257, -0.363, 0.000),
     ('H', 0.257, 0.727, 0.000),
     ('H', 0.771, -0.727, 0.890),
     ('H', 0.771, -0.727, -0.890),
     ('H', -0.771, -0.727, 0.000)]
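
One possible shape for the PDB exercise is sketched below with Python's standard re module. The pattern names, the parse_pdb helper, and the choice to match line by line are assumptions made for illustration; other patterns would work equally well. Because the author field may be any length, the AUTHOR pattern anchors on the three numeric date fields at the end of the line rather than on the name itself.

```python
import re

# Sample file copied from the top of the page.
SAMPLE_PDB = """\
COMPND      METHANE
AUTHOR      DAVE WOODCOCK  95 12 18
ATOM      1  C           1       0.257  -0.363   0.000  1.00  0.00
ATOM      2  H           1       0.257   0.727   0.000  1.00  0.00
ATOM      3  H           1       0.771  -0.727   0.890  1.00  0.00
ATOM      4  H           1       0.771  -0.727  -0.890  1.00  0.00
ATOM      5  H           1      -0.771  -0.727   0.000  1.00  0.00
TER       6              1
END
"""

COMPND = re.compile(r"^COMPND\s+(.+?)\s*$")
# Lazy name group, then the last three numeric fields as the date.
AUTHOR = re.compile(r"^AUTHOR\s+(.+?)\s+(\d+\s+\d+\s+\d+)\s*$")
# Serial number, symbol, residue number, then X, Y, Z; the trailing
# '1.00' and '0.00' fields are deliberately left unmatched.
ATOM = re.compile(
    r"^ATOM\s+\d+\s+([A-Za-z]+)\s+\d+"
    r"\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)"
)


def parse_pdb(text):
    """Return ((name, author, date), [(symbol, x, y, z), ...])."""
    name = author = date = None
    atoms = []
    for line in text.splitlines():
        match = COMPND.match(line)
        if match:
            name = match.group(1)
            continue
        match = AUTHOR.match(line)
        if match:
            author, date = match.group(1), match.group(2)
            continue
        match = ATOM.match(line)
        if match:
            symbol, x, y, z = match.groups()
            atoms.append((symbol, float(x), float(y), float(z)))
    return (name, author, date), atoms


header, atoms = parse_pdb(SAMPLE_PDB)
print(header)
print(atoms)
```

Lines that match none of the three patterns, such as TER and END, are simply skipped.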