MDF4 Files to CSV: Handling Automotive Data Easily with Python and Open-Source Tools
Let me show you how I use open-source Python libraries to handle MDF files from Automotive testing and calibration environments
Using the asammdf API and pandas to handle automotive ASAM data
In this article, I'll walk through an example of how you can parse MDF4 and other ASAM files and convert them to common formats.
Traditionally, the tools you needed to process this type of file were only accessible through expensive licenses, like ETAS MDA or Matlab's Vehicle Network Toolbox, to cite two commonly known examples.
However, nowadays there are excellent free and open-source alternatives, like asammdf, which is a Python-friendly parser and editor for ASAM files.
The library was originally developed by Daniel Hrisca, who still maintains it. He has done a great job of collaborating with engineers and developers from the open-source community and turning the project into a very robust solution.
What are MDF4 files?
MDF4 files are commonly used in the Automotive Industry for software development, calibration, and testing data. MDF4 is the most recent evolution of the original MDF format, developed in the 1990s by Vector Informatik GmbH and Robert Bosch GmbH.
MDF stands for Measurement Data Format. Nowadays, it is defined by ASAM (Association for Standardisation of Automation and Measuring Systems) in the ASAM MDF standard.
ASAM MDF files are very useful for Automotive measurement and calibration data, because they can efficiently store large amounts of data, retaining information about the communication and acquisition systems used to create them, while being very fast to query and index.
The problem with them is that, unlike .csv or ASCII files, you can't just load them and preprocess them with Data Visualization tools like Tableau, Power BI, or more IoT-oriented ones like Grafana. To handle them, you first need to convert them to a more manageable format, and that's what I'm going to show you, using asammdf and Python.
Taming the beast they don't tell you about in your Data Analytics online course
This is an industry-specific, tightly packed type of animal you probably never heard of while learning about data analytics, at least not until you decided to work for an Automotive company or a service provider for those companies.
Fortunately, converting MDF4 files to a more manageable format is easier than you'd think - in most cases.
Here's the barebones version of the code I'd use:
# You need to pip install asammdf, of course
from asammdf import MDF
import pandas as pd
# Define path of MDF4 file (.mf4 or .dat)
my_file = 'file_path'
# Define list of signals to extract
my_ch_list = ['Signal_A', 'Signal_B', 'Signal_C']
# Open ASAM MDF file
data = MDF(my_file, channels=my_ch_list)
# Define the unique sampling rate (raster, in seconds) to apply to all the signals
raster = 1
# Convert to pandas DataFrame with the selected raster
data_pandas = data.to_dataframe(raster=raster)
# Export to csv file
export_file = 'my_export_file_path'
data_pandas.to_csv(export_file)
And there it is, a CSV file you can work with following most online tutorials about data visualization 101.
I hope you found this useful and feel free to copy-paste it for your own projects. Please do read the official documentation of this API so you know what you are doing, though.
I still have some words about the choices you'll have to make when working with Automotive data. You might want to stick around and keep reading, especially if you aren't familiar with the Automotive testing and calibration area.
Why use a signal list?
Using the asammdf API, we can easily import all the signals contained in an input ASAM file. However, in my experience, when preparing data for a visualization pipeline, you likely won't want to include all the signals available.
In Automotive software (and therefore in data coming from Automotive Networks, logged in ASAM MDF formats), it's common to have tens, or even hundreds of signals containing error flags, status values, and other information you might not need. Software developers need them, Application Engineers need them, but you, as a Data Analytics expert, don't.
If you are working on a project that requires you to look at some of those, then, by all means, include them. But in most cases, you don't want to visualize them all.
Of course, it depends on what you are trying to achieve, so there isn't a definitive solution here. My suggestion is to import only the channels you'll actually use.
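If you don't yet know which channels a file contains, you can list them first and decide what to keep. Here's a minimal sketch, assuming a placeholder file path and that your asammdf version exposes the channels_db attribute:
from asammdf import MDF
# 'my_file.mf4' is a placeholder path
with MDF('my_file.mf4') as mdf:
    # channels_db maps each channel name to the (group, index) pairs where it occurs
    for name in sorted(mdf.channels_db):
        print(name)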
Something interesting about the signal list is that it can act as additional documentation of the information you are using. It can be descriptive, with a comment documenting each signal, like this:
my_signals_list = [
    'ECU*_Signal_A//XCP*',   # Signal A description, e.g. units, source, measured or estimated, what have you...
    'ECU*_Signal_B//XCP*',   # Signal B description
    'ECU*_Signal_C//XCP*',   # Signal C description
    'Comp*_Signal_M//CCP*',  # Signal M description
    'ECU*_Signal_Z//XETK*',  # Signal Z description
]
Keeping the descriptions in comments, rather than inside the strings, means the list still works as a channel filter. You could also try to pull the descriptions from the MDF4 file itself, but:
- You might need to over-complicate your workflow just to get the description from the raw MDF4 file.
- When you do so, it's very possible that the signals you are ingesting don't have an associated description, or the description is cryptic and hard to understand (sometimes even cryptic AND in a different language, because you are logging signals from a component made by a foreign supplier).
That is why, in my opinion, you should always contact the function experts or system engineers and align with them about what information is carried by the signals you are using.
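That said, if you do want to peek at what the file itself carries, asammdf's Signal objects expose the channel's unit and comment. A minimal sketch, with a hypothetical channel name and placeholder file path:
from asammdf import MDF
with MDF('my_file.mf4') as mdf:
    sig = mdf.get('Signal_A')  # hypothetical channel name
    # the comment may well be empty, or cryptic, as discussed above
    print(sig.name, sig.unit, sig.comment)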
What about the sampling rates?
Whether or not to apply a single sampling rate to all your signals is a decision you'll have to make based on a good understanding of the data you are using, what it represents, and what the visualization is going to be used for. What do you or the stakeholders want to achieve by visualizing these data?
Looking at short-term events: rapidly-varying, dynamic phenomena
If you are visualizing data to analyze in detail something that happened over a short period (minutes, hours) you might want to look at the short-term trends or patterns in your signals.
Maybe you need to analyze when a failure happened, so you need to see signals varying rapidly and perhaps identify the instant when the system you are looking at reacted in some way. This could mean you need to visualize signals sampled at 10 Hz or more. Common Automotive ECUs have different task rates, such as 1000 ms, 100 ms, 10 ms and sometimes even 1 ms or faster.
In that case, you might want to retain each signal with its associated raster, so you can look at the information as it was logged by the system, with its "native" sampling time, especially if you need to look at the details of a 2-second phenomenon, for instance.
The drawback of doing this is that you'll need to keep track of each signal together with its sampling time in your analyses, meaning there won't be a single time base you can use to visualize time series data.
Another approach would be to re-sample all signals to match the highest sampling rate among the signals you'll need to visualize. That way, you don't crop any information, and you get to work with a single time base. In this case, though, you'll end up with many more samples, so your workflow might slow down.
In my case, I always strive for a trade-off between the number of channels (signals) I need to look at, and the fastest phenomena I need to observe. As with all things engineering-related: do not lose or crop the information your stakeholders will actually want to look at.
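If you decide to keep each signal on its native raster, you can pull channels individually instead of forcing a common raster on the whole file. A minimal sketch, with hypothetical channel names and a placeholder file path:
from asammdf import MDF
channels = ['Signal_A', 'Signal_B']  # hypothetical names
with MDF('my_file.mf4') as mdf:
    for name in channels:
        # each Signal keeps its own timestamps, i.e. its native raster
        sig = mdf.get(name)
        print(name, 'samples:', len(sig.samples),
              'from', sig.timestamps[0], 'to', sig.timestamps[-1])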
Looking at longer-term events: trends and patterns over longer periods of time
Another common scenario - one in which you might want to use a data pipeline and create a visualization dashboard - is when you need to look at phenomena that change slowly over time.
For example, the temperature trends over several days, or weeks. In that case, it makes sense to re-sample everything to a lower frequency, because you don't care about the high-frequency information in your data. Why would you plot a signal that has 100 samples per second (100 Hz) when you need to look at several weeks? By resampling, you can reduce the amount of data you have to work with, without losing the information you care about.
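With asammdf, this is just a matter of passing a coarser raster. Here's a sketch that resamples everything to one sample per minute, which is plenty for temperature trends spanning weeks (the file path and output name are placeholders):
from asammdf import MDF
with MDF('my_file.mf4') as mdf:
    # raster is in seconds: 60.0 means one sample per minute
    df = mdf.to_dataframe(raster=60.0)
df.to_csv('weekly_trends.csv')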
Why convert to Pandas Dataframes?
With the asammdf API, exporting an .MF4 file as .csv is straightforward using the MDF.export() method with fmt='csv'. So converting to a pandas DataFrame first might seem like an unnecessary step.
In my opinion, though, having a pandas DataFrame lets you use all of pandas' great functionality to pre-process your data before you plot it, export it, or write it to a database.
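For instance, here's a sketch of the kind of pre-processing a DataFrame makes easy, continuing from the data_pandas variable in the earlier snippet; the column names and the unit conversion are hypothetical:
# Rename a cryptic channel name to something readable
data_pandas = data_pandas.rename(columns={'ECU_Signal_A': 'coolant_temp'})
# Drop rows where the signal wasn't sampled
data_pandas = data_pandas[data_pandas['coolant_temp'].notna()]
# Unit conversion, e.g. if the signal was logged in Kelvin
data_pandas['coolant_temp_degC'] = data_pandas['coolant_temp'] - 273.15
data_pandas.to_csv('preprocessed.csv')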
If you are interested in learning more about Pandas, a good place to start is the Pandas & Python for data analysis full course by FreeCodeCamp, and there are many tutorials in video and text formats online. I've also written a few articles where I show things I've found interesting about it, so you might want to check them out.
Conclusion
This example shows a manual workflow to open the data within an MDF4 file and export it as a .csv you can visualize. The idea is to showcase the basic usage of the asammdf API for local conversion and analysis.
If you want to automate this and deploy it at a larger scale, you'll need to integrate some of it into your pipeline, handle file selection and import more efficiently, and consider writing the data to a database you can query instead of exporting it as a file.
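As one possible direction, here's a sketch of writing the DataFrame to a database instead of a CSV, using pandas with SQLAlchemy; the connection string and table name are placeholders:
from sqlalchemy import create_engine
# Placeholder connection string and table name
engine = create_engine('postgresql://user:password@host:5432/measurements')
data_pandas.to_sql('mdf_signals', engine, if_exists='append', index=True)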
This subject is interesting to me because I'm an Engineer working in Automotive/Heavy-Duty Embedded Software, but I'm also a data nerd. I hope this was useful to people wanting to work with MDF4 files without having access to industry-standard tools and licenses.
Nowadays, I feel like converting MDF files to different formats using non-proprietary tools will become more and more common, as Automotive OEMs start hiring more people for Data Analytics positions, and more companies start to offer Analytics services for the Automotive sector.
Also, companies aiming at creating Digital Twins can use open-source tools like these to leverage the capabilities of Cloud Computing and microservices to handle their MDF data - think a Python script running on a cloud instance, versus a Matlab script for which you need a license.
That's another good reason for replacing Matlab with Python in your workflow.