Software Carpentry Day 10: XML and Regular Expressions
Morning: XML
A (simplified) PDB file looks like this:
COMPND METHANE
AUTHOR DAVE WOODCOCK 95 12 18
ATOM 1 C 1 0.257 -0.363 0.000 1.00 0.00
ATOM 2 H 1 0.257 0.727 0.000 1.00 0.00
ATOM 3 H 1 0.771 -0.727 0.890 1.00 0.00
ATOM 4 H 1 0.771 -0.727 -0.890 1.00 0.00
ATOM 5 H 1 -0.771 -0.727 0.000 1.00 0.00
TER 6 1
END
Note that the author's "name" may be any length; for example, it
may be the name of the laboratory where the data was collected. Note
also that the trailing '1.00' and '0.00' are unimportant for this
exercise.
An XML representation of the same data might look like this:
<molecule name="METHANE">
<author>DAVE WOODCOCK</author>
<created year="95" month="12" day="18"/>
<atoms>
<atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
<atom symbol="H" x="0.257" y="0.727" z="0.000"/>
<atom symbol="H" x="0.771" y="-0.727" z="0.890"/>
<atom symbol="H" x="0.771" y="-0.727" z="-0.890"/>
<atom symbol="H" x="-0.771" y="-0.727" z="0.000"/>
</atoms>
</molecule>
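To see how the two representations line up, here is a minimal sketch that turns the XML above back into the simplified PDB text. It assumes Python's standard `xml.etree.ElementTree` module; the function name `xml_to_pdb` is hypothetical. The XML does not carry the residue number `1` or the trailing `1.00` and `0.00`, so the sketch writes them as fixed values, and it emits the whitespace-separated layout shown on this page rather than the fixed-column layout of real PDB files.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<molecule name="METHANE">
  <author>DAVE WOODCOCK</author>
  <created year="95" month="12" day="18"/>
  <atoms>
    <atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
    <atom symbol="H" x="0.257" y="0.727" z="0.000"/>
    <atom symbol="H" x="0.771" y="-0.727" z="0.890"/>
    <atom symbol="H" x="0.771" y="-0.727" z="-0.890"/>
    <atom symbol="H" x="-0.771" y="-0.727" z="0.000"/>
  </atoms>
</molecule>"""


def xml_to_pdb(xml_text):
    """Return simplified PDB text for one <molecule> document."""
    root = ET.fromstring(xml_text)
    created = root.find("created")
    lines = [
        "COMPND " + root.get("name"),
        "AUTHOR {} {} {} {}".format(
            root.findtext("author"),
            created.get("year"), created.get("month"), created.get("day")),
    ]
    serial = 0  # stays 0 if the molecule has no atoms
    for serial, atom in enumerate(root.iter("atom"), start=1):
        # Assumption: residue number 1 and the trailing 1.00 / 0.00
        # are constants, since the XML does not record them.
        lines.append("ATOM {} {} 1 {} {} {} 1.00 0.00".format(
            serial, atom.get("symbol"),
            atom.get("x"), atom.get("y"), atom.get("z")))
    lines.append("TER {} 1".format(serial + 1))
    lines.append("END")
    return "\n".join(lines)


print(xml_to_pdb(XML_TEXT))
```

Coordinates are copied through as strings, so values such as `0.000` keep their original formatting.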
Write a program that reads an XML file and prints out a (plain
text) PDB file. For bonus marks, modify your program so that it puts
the molecule data in an SQLite database that has two tables structured
as follows:
| Molecule |
| Name | Author | Year | Month | Day |
| "METHANE" | "DAVE WOODCOCK" | 95 | 12 | 18 |
| "AMMONIA" | "DAVE WOODCOCK" | 97 | 10 | 31 |
| ... | ... | ... | ... | ... |
| Atom |
| Name | Id | Kind | X | Y | Z |
| "METHANE" | 1 | "C" | 0.257 | -0.363 | 0.000 |
| "METHANE" | 2 | "H" | 0.257 | 0.727 | 0.000 |
| "METHANE" | 3 | "H" | 0.771 | -0.727 | 0.890 |
| "METHANE" | 4 | "H" | 0.771 | -0.727 | -0.890 |
| "METHANE" | 5 | "H" | -0.771 | -0.727 | 0.000 |
| "AMMONIA" | 1 | "N" | 0.257 | -0.363 | 0.000 |
| "AMMONIA" | 2 | "H" | 0.257 | 0.727 | 0.000 |
| "AMMONIA" | 3 | "H" | 0.771 | -0.727 | 0.890 |
| "AMMONIA" | 4 | "H" | 0.771 | -0.727 | -0.890 |
| ... | ... | ... | ... | ... | ... |
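The two tables above could be created and filled along the following lines. This is only a sketch using Python's standard `sqlite3` and `xml.etree.ElementTree` modules: the function name `store_molecule`, the column types, and the choice of `Name` as primary key are assumptions, since the exercise specifies only the table and column names.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML_TEXT = """<molecule name="METHANE">
  <author>DAVE WOODCOCK</author>
  <created year="95" month="12" day="18"/>
  <atoms>
    <atom symbol="C" x="0.257" y="-0.363" z="0.000"/>
    <atom symbol="H" x="0.257" y="0.727" z="0.000"/>
    <atom symbol="H" x="0.771" y="-0.727" z="0.890"/>
    <atom symbol="H" x="0.771" y="-0.727" z="-0.890"/>
    <atom symbol="H" x="-0.771" y="-0.727" z="0.000"/>
  </atoms>
</molecule>"""


def store_molecule(conn, xml_text):
    """Insert one <molecule> document into the Molecule and Atom tables."""
    # Assumed schema: the exercise names the columns but not their types.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Molecule ("
        "Name TEXT PRIMARY KEY, Author TEXT, "
        "Year INTEGER, Month INTEGER, Day INTEGER)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Atom ("
        "Name TEXT, Id INTEGER, Kind TEXT, X REAL, Y REAL, Z REAL)")

    root = ET.fromstring(xml_text)
    name = root.get("name")
    created = root.find("created")
    conn.execute(
        "INSERT INTO Molecule VALUES (?, ?, ?, ?, ?)",
        (name, root.findtext("author"),
         int(created.get("year")), int(created.get("month")),
         int(created.get("day"))))

    # Atom ids are assigned in document order, starting at 1.
    rows = [
        (name, i, atom.get("symbol"),
         float(atom.get("x")), float(atom.get("y")), float(atom.get("z")))
        for i, atom in enumerate(root.iter("atom"), start=1)
    ]
    conn.executemany("INSERT INTO Atom VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration
store_molecule(conn, XML_TEXT)
for row in conn.execute("SELECT * FROM Atom ORDER BY Id"):
    print(row)
```

Using `?` placeholders rather than string formatting lets SQLite handle quoting, which matters when an author field contains apostrophes.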
Afternoon: Regular Expressions
- Find words written in all caps and replace them with title case, e.g., "ALL CAPS" becomes "All Caps".
- Find amounts written in French-style currency and replace them with English-style amounts, e.g., "123,40$" becomes "$123.40".
- Translate human quotes into LaTeX, e.g., "something" becomes ``something''.
- Write a regular expression to match a single line from a molecule
inventory file. Make sure it handles comment lines correctly.
- Write a regular expression to match a single line from a molecule
formula file. Make sure it handles comments correctly.
- Write a regular expression that matches things that might be
chemical formulae, such as 'He' (monatomic helium), 'NaCl' (salt),
'CO2' (carbon dioxide), and 'C12H6Cl6' (aldrin). Is there any way to
exclude license plates (like 'A1B2C3')?
- Write a program using regular expressions that will read a PDB file
and return both:
- a tuple containing the molecule's name, the author's name, and
the date the file was created, and
- a list of (atom, X, Y, Z) tuples.
For example, if the input is the file at the top of the page, the program should return:
('METHANE', 'DAVE WOODCOCK', '95 12 18')
and
[('C', 0.257, -0.363, 0.000),
('H', 0.257, 0.727, 0.000),
('H', 0.771, -0.727, 0.890),
('H', 0.771, -0.727, -0.890),
('H', -0.771, -0.727, 0.000)]
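Several of the exercises above can be sketched with Python's standard `re` module. The function names and patterns below are assumptions made for illustration, not reference solutions: the all-caps pattern deliberately skips one-letter words such as "I", the currency pattern assumes exactly two digits after the comma, and the PDB patterns assume the simplified whitespace-separated layout shown at the top of the page.

```python
import re

PDB_TEXT = """COMPND METHANE
AUTHOR DAVE WOODCOCK 95 12 18
ATOM 1 C 1 0.257 -0.363 0.000 1.00 0.00
ATOM 2 H 1 0.257 0.727 0.000 1.00 0.00
ATOM 3 H 1 0.771 -0.727 0.890 1.00 0.00
ATOM 4 H 1 0.771 -0.727 -0.890 1.00 0.00
ATOM 5 H 1 -0.771 -0.727 0.000 1.00 0.00
TER 6 1
END"""


def caps_to_title(text):
    # Words made of two or more capital letters become title case.
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).capitalize(), text)


def french_to_english_currency(text):
    # "123,40$" -> "$123.40"; assumes two digits after the comma.
    return re.sub(r"(\d+),(\d{2})\$", r"$\1.\2", text)


def quotes_to_latex(text):
    # Paired straight double quotes become LaTeX-style quotes.
    return re.sub(r'"([^"]*)"', r"``\1''", text)


COMPND = re.compile(r"^COMPND\s+(.+?)\s*$")
# The author's name may be any length; the date is the last three numbers.
AUTHOR = re.compile(r"^AUTHOR\s+(.+?)\s+(\d+\s+\d+\s+\d+)\s*$")
# Serial number, symbol, residue number, then X, Y, Z; the rest is ignored.
ATOM = re.compile(
    r"^ATOM\s+\d+\s+([A-Z][a-z]?)\s+\d+"
    r"\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)")


def parse_pdb(text):
    """Return ((name, author, date), [(atom, x, y, z), ...])."""
    name = author = date = None
    atoms = []
    for line in text.splitlines():
        m = COMPND.match(line)
        if m:
            name = m.group(1)
            continue
        m = AUTHOR.match(line)
        if m:
            author = m.group(1)
            date = " ".join(m.group(2).split())  # normalise spacing
            continue
        m = ATOM.match(line)
        if m:
            atoms.append((m.group(1),
                          float(m.group(2)),
                          float(m.group(3)),
                          float(m.group(4))))
    return (name, author, date), atoms


header, atoms = parse_pdb(PDB_TEXT)
print(header)
print(atoms)
```

The non-greedy `(.+?)` in `AUTHOR` is what lets a multi-word name, or a laboratory name, be captured whole while the date is still split off at the end of the line.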