While cleaning the OpenStreetMap data for Warsaw, I needed to convert selected information from an XML file to CSV format. The operation itself is rather straightforward, but it is a good opportunity to share a snippet of working code.
The OpenStreetMap extract for Warsaw can be downloaded from the website of the MapZen Project. If you take a look at the file (or at an abbreviated version of it, since the original is quite large), you will see a mysterious field containing a ‘SIMC’ number:
<tag k="addr:city" v="Warszawa" />
<tag k="addr:postcode" v="05-075" />
<tag k="addr:city:simc" v="0921728" />
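Because the full extract is large, one way to see which values this tag takes is to stream the file rather than load it whole. The following is a minimal Python 3 sketch, assuming the extract is plain OSM XML; the function name count_simc_values is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_simc_values(osm_path):
    """Count how often each addr:city:simc value occurs in an OSM XML file."""
    counts = Counter()
    # iterparse streams the file; "end" fires once an element is fully read
    for _event, elem in ET.iterparse(osm_path, events=("end",)):
        if elem.tag == "tag" and elem.get("k") == "addr:city:simc":
            counts[elem.get("v")] += 1
        elif elem.tag in ("node", "way", "relation"):
            # The child tags were already handled, so drop them to save memory
            elem.clear()
    return counts
```

Calling count_simc_values on the downloaded extract and then .most_common(5) on the result would list the five most frequent codes.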
After some research, I learned that the number refers to the territorial classification of Polish cities. I found an XML file on the website of the Central Statistical Office of Poland that lists the SIMC codes, but I needed the data in CSV format for easier comparison. The original data had the following format:
Of all this information, I needed only the data contained in the fields named ‘NAZWA’, ‘SYM’ and ‘SYMPOD’. After experimenting with the file for a while, I came up with the following Python code:
import codecs
import csv
import sys
import xml.etree.cElementTree as ET

# Python 2 workaround: it seems to be necessary to handle Polish chars
reload(sys)
sys.setdefaultencoding('utf-8')

outputs = 'simc.csv'
tree = ET.parse('SIMC.xml')
root = tree.getroot()
fields = ['name', 'sym', 'sympod']  # CSV column order (no header row is written)

with codecs.open(outputs, 'w') as f:
    writer = csv.writer(f)
    for child in root:
        for stuff in child:  # one 'row' element
            for element in stuff:  # one 'col' element
                if element.attrib['name'] == 'NAZWA':
                    name = element.text
                elif element.attrib['name'] == 'SYM':
                    sym = element.text
                elif element.attrib['name'] == 'SYMPOD':
                    sympod = element.text
            # End of the 'row' element: write the collected values
            writer.writerow((name, sym, sympod))
At the beginning, I set the default encoding so that Polish characters are handled properly. The code is very straightforward and needs little commentary. It reads the entire XML document and, for every ‘col’ element, checks the name of the field. If the name matches one of the predefined ones, it saves the element's text, and once it reaches the end of the ‘row’ element, it writes the saved values as one row of the CSV file.
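The script above targets Python 2. On Python 3 the encoding workaround is not needed, since strings are Unicode by default and the output encoding can be passed to open() directly. Here is a minimal sketch of the same conversion, assuming the ‘row’ and ‘col’ element names described above. The function name simc_xml_to_csv, the WANTED tuple and the header row are additions for illustration and not part of the original script.

```python
import csv
import xml.etree.ElementTree as ET

# Field names to keep, in output column order
WANTED = ("NAZWA", "SYM", "SYMPOD")


def simc_xml_to_csv(xml_path, csv_path):
    """Write one CSV row per 'row' element, keeping only the WANTED fields."""
    tree = ET.parse(xml_path)
    # newline="" is what the csv module recommends; utf-8 keeps Polish characters
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "sym", "sympod"])  # header row (an addition)
        # iter("row") finds 'row' elements at any depth below the root
        for row in tree.getroot().iter("row"):
            values = {col.get("name"): col.text for col in row.iter("col")}
            # A field missing from a row becomes an empty cell
            writer.writerow([values.get(key, "") for key in WANTED])
```

Collecting each row into a dictionary first means a missing field yields an empty cell instead of silently reusing the value from the previous row.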
The often helpful ElementTree has detailed documentation with many code examples. The Unicode problems I ran into the first time I ran the code were solved thanks to a StackOverflow post. If you are curious about the OpenStreetMap project, take a look at its wiki, or download sample code from MapZen.