Python XML Parsing

Summary

The client had multiple XML files with horse racing results they wanted parsed and saved to a CSV file for later import to a database. The XML file contained race results for a single day at a single track.

Implementation

The client specified the XML fields to save, data format, and CSV columns for the data. Each row of the CSV file was denormalized, meaning that each row contain the horse information along with the race and common track data. So each row contains the same track information and for a given race, each horse in a given race contains the same race data. So, a CSV file has data for 1 track, N races, and multiple horses per race. If there are 6 races with 5 horses each, then there are 30 rows in the CSV file.

Because of the denormalized data stored in the CSV file, separate lists were created for track, race, and horse. Once all of the horse information was parsed, the three separate lists were combined and then written to the CSV file as a row.

The input file directory is searched through all sub directories for XML files. All CSV output files are stored in a common directory. The XML file is parsed with a try/except condition in case there are XML formatting issues. If an exception is thrown the file is not processed and the code continues to the next file.

# parse xml file
# create element tree object 
parseOK = True
try:
    tree = ET.parse(xmlfile) 
except ET.ParseError as err:
    parseOK = False
    print("Error: {0} in file {1}. File skipped.".format(str(err), file))

# check to make sure xml parsing was ok
if parseOK:

The race date is not contained within the XML data but is part of the filename. The date was extracted from the filename string using regular expression.

Once the track information was saved to a list (track code and race date), parsing looped through each race. Within each race the code looped through each horse. The only data issue to handle was the track distance where the client wanted track distance converted from fraction (1 1/16) to decimal (1.0625).

# watch for fractional distances
tdistance = race.find('distance').text.split(' ')   # space between parts
fdist = 0.
if len(tdistance) > 1:                              # we have fractional distance
    fdiststr = tdistance[1].split("/")              # separate fractions
    fdist = float(fdiststr[0])/float(fdiststr[1])   # calculate fraction
                            
distance = float(tdistance[0]) + fdist              # make floating point distance

The Python code was developed on a Windows 10 platform using Wing editor/environment with Python 3.8.2.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s