
Hello,

In this article, let us take a detailed look at how the APM (ARIS Process Mining) tool behaves during a sequential data load, from extraction to loading the data into the system, and the steps in between.

Once you have made a system connection to the desired source (e.g. SAP, a database, or a data lake via JDBC), you need to define the extraction logic (the tables, their dependencies, and the necessary columns from each table) to extract the relevant data into APM. Once this logic is defined, you can extract all the data or only specific parts of it as needed. While defining the extraction logic, you can also choose to pseudonymize the relevant columns (SR24 and above) to ensure data privacy, which we strongly recommend.
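APM's pseudonymization is configured in the extraction settings rather than coded by hand. Purely to illustrate the underlying idea, here is a minimal Python sketch that replaces a sensitive column with a salted hash; the column name and salt are illustrative assumptions, not APM internals.

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # hypothetical salt; keep it out of the data set

def pseudonymize(value: str) -> str:
    """Return a stable pseudonym so the same input always maps to the same token."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

rows = [{"Case ID": "001", "Clerk": "jane.doe"}]  # 'Clerk' is an illustrative column
for row in rows:
    row["Clerk"] = pseudonymize(row["Clerk"])

print(rows)  # the 'Clerk' column now holds a pseudonym instead of the real name
```

Because the hash is deterministic, the same person keeps the same token across loads, so the process data stays analysable while the identity stays hidden.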

However, extraction does not have to be a manual task. In APM you can also automate the extraction to run at specific time intervals or on a specific day of the week, and pause it when it is not needed. All the extracted data can be seen in the source tables of the data set.
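In APM the schedule itself is configured in the extraction settings; the following is only a minimal Python sketch of the concept of a weekly trigger, where run_extraction is a hypothetical placeholder for the actual job.

```python
import datetime
import time

def run_extraction():
    # Hypothetical placeholder for the actual extraction job.
    print("Extraction started at", datetime.datetime.now())

last_run_date = None
while True:
    now = datetime.datetime.now()
    # Run once every Monday at 02:00; remembering the date prevents
    # the job from firing again within the same hour.
    if now.weekday() == 0 and now.hour == 2 and last_run_date != now.date():
        run_extraction()
        last_run_date = now.date()
    time.sleep(600)  # check again in ten minutes
```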

While periodically fetching new data ensures that relevant data is always available, there may be cases where you still need to see the old data in the source tables, not only the latest load. This is why it is important not to overwrite the source tables on every load but to increment them and make the data load sequential. Let us see below how this can be done.

As of today, APM supports two types of source tables, each with its own benefits.

  1. Standard source table
  2. Incremental source table

Standard source tables

This behaviour works well when each analysis only looks at the data of a specific period. In this case, you do not want any data in the source tables other than what is currently being analysed.

This is the standard behaviour across the tool when a user loads a source table via CSV upload, extraction, or the Data Ingestion API. Every time the user loads data, with or without an update, the previous data is erased and replaced by the new data in the source tables of the data set.
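As a simple mental model (illustrative Python, not APM code), the standard behaviour amounts to a load that replaces the table contents entirely:

```python
def load_standard(table: list, new_rows: list) -> None:
    """Standard source table: every load erases the old rows first."""
    table.clear()           # previous data is erased ...
    table.extend(new_rows)  # ... and replaced by the new load

source_table = [{"Case ID": "001", "Activity Name": "First activity"}]
load_standard(source_table, [{"Case ID": "001", "Activity Name": "Fifth activity"}])
print(source_table)  # only the rows from the latest load remain
```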

As an example, let us consider the following data.

| Case ID | Activity Name   | Activity Start time |
|---------|-----------------|---------------------|
| 001     | First activity  | 01-01-2023 00:00:00 |
| 001     | Second activity | 01-01-2023 00:01:00 |
| 001     | Third activity  | 01-01-2023 00:02:00 |
| 001     | Fourth activity | 01-01-2023 00:03:00 |

Let us assume that a customer extracts or uploads the above data into the APM source tables and then performs a load into APM storage. APM simply copies the current state of the source tables into the APM cloud. This copying of the source table state into the APM cloud is the standard behaviour.

If, in a subsequent load operation, the customer extracts or uploads the data below into the APM source tables, the existing data in the source tables is erased, and the source tables then only show the following data.

| Case ID | Activity Name  | Activity Start time |
|---------|----------------|---------------------|
| 001     | Fifth activity | 01-01-2023 00:04:00 |
| 001     | Sixth activity | 01-01-2023 00:05:00 |

Although the old data is not erased from the APM cloud, this behaviour limits analyses over longer timeframes with multiple extractions in between, for the following reasons:

  1. The source tables no longer show the same representation of the data as the APM cloud, which makes data maintenance a challenge.
  2. Transformations only run on the existing source tables. This means that any change to a transformation does not apply to the data already loaded into the APM cloud, as it is no longer in the source tables.

Incremental source tables

This is our latest addition to the tool. A user must explicitly convert a standard source table into an incremental source table. This type of table comes with certain benefits, the primary one being that transformations changed after a data load can still be applied to the old data.

Let us consider the same example as above.

| Case ID | Activity Name   | Activity Start time |
|---------|-----------------|---------------------|
| 001     | First activity  | 01-01-2023 00:00:00 |
| 001     | Second activity | 01-01-2023 00:01:00 |
| 001     | Third activity  | 01-01-2023 00:02:00 |
| 001     | Fourth activity | 01-01-2023 00:03:00 |

As soon as the user loads the first table and converts it into an incremental source table, a column named “_aris.lastChanged” is added to the table. In subsequent loads, this column helps the system automatically load only those rows that are new.

| Case ID | Activity Name   | Activity Start time | _aris.lastChanged   |
|---------|-----------------|---------------------|---------------------|
| 001     | First activity  | 01-01-2023 00:00:00 | 01-05-2023 00:00:00 |
| 001     | Second activity | 01-01-2023 00:01:00 | 01-05-2023 00:00:00 |
| 001     | Third activity  | 01-01-2023 00:02:00 | 01-05-2023 00:00:00 |
| 001     | Fourth activity | 01-01-2023 00:03:00 | 01-05-2023 00:00:00 |

In the subsequent load operation, when the user extracts the new data, it is appended, and all the data is now available in the source tables, with each row showing its last-changed date. The data then looks as follows (notice the different dates in the _aris.lastChanged column).

| Case ID | Activity Name   | Activity Start time | _aris.lastChanged   |
|---------|-----------------|---------------------|---------------------|
| 001     | First activity  | 01-01-2023 00:00:00 | 01-05-2023 00:00:00 |
| 001     | Second activity | 01-01-2023 00:01:00 | 01-05-2023 00:00:00 |
| 001     | Third activity  | 01-01-2023 00:02:00 | 01-05-2023 00:00:00 |
| 001     | Fourth activity | 01-01-2023 00:03:00 | 01-05-2023 00:00:00 |
| 001     | Fifth activity  | 01-01-2023 00:04:00 | 01-07-2023 00:00:00 |
| 001     | Sixth activity  | 01-01-2023 00:05:00 | 01-07-2023 00:00:00 |

This offers two benefits.

  1. The source tables are not wiped clean but appended to.
  2. When a data load is performed, only the latest changes are uploaded into APM storage, as shown in the sketch after this list.
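Both benefits can be modelled with a short sketch (illustrative Python, not APM internals): new rows are stamped with the load timestamp and appended, and only rows newer than the last upload's watermark are sent to storage.

```python
import datetime

def load_incremental(table, new_rows, load_time):
    # Benefit 1: the table is appended to, not wiped clean.
    for row in new_rows:
        row["_aris.lastChanged"] = load_time
        table.append(row)

def delta_for_upload(table, last_upload_time):
    # Benefit 2: only rows changed since the last upload go to storage.
    return [row for row in table if row["_aris.lastChanged"] > last_upload_time]

table = []
load_incremental(table, [{"Case ID": "001", "Activity Name": "First activity"}],
                 datetime.datetime(2023, 5, 1))
load_incremental(table, [{"Case ID": "001", "Activity Name": "Fifth activity"}],
                 datetime.datetime(2023, 7, 1))

# Only the July rows are uploaded; the May rows were uploaded earlier.
print(delta_for_upload(table, last_upload_time=datetime.datetime(2023, 6, 1)))
```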

However, this type of table requires a “Merge Key” to be defined, which helps the system understand which rows need to be updated and which appended. It therefore requires more knowledge than the standard source tables.
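Conceptually, the merge key works like an upsert. Here is a minimal sketch, assuming for illustration that ("Case ID", "Activity Name") has been chosen as the merge key:

```python
def merge(table, new_rows, merge_key=("Case ID", "Activity Name")):
    """Update rows whose merge key already exists, append the rest."""
    index = {tuple(row[k] for k in merge_key): row for row in table}
    for new_row in new_rows:
        key = tuple(new_row[k] for k in merge_key)
        if key in index:
            index[key].update(new_row)  # existing key: update the row in place
        else:
            table.append(new_row)       # unseen key: append as a new row
            index[key] = new_row

table = [{"Case ID": "001", "Activity Name": "First activity",
          "Activity Start time": "01-01-2023 00:00:00"}]
merge(table, [
    {"Case ID": "001", "Activity Name": "First activity",
     "Activity Start time": "01-01-2023 00:00:30"},   # updates the existing row
    {"Case ID": "001", "Activity Name": "Fifth activity",
     "Activity Start time": "01-01-2023 00:04:00"},   # appended as a new row
])
print(table)
```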

As we have learnt from the standard source tables, if a user amends a transformation, only the next set of data is affected, not the data already loaded into APM. With incremental source tables, this can be solved by manually updating “_aris.lastChanged” in a transformation to the latest date, as the system then treats every row as newly changed and re-processes it with the latest data.
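In APM this update would be written inside the transformation itself; the effect, sketched here in Python purely for illustration, is that every row looks newly changed and is therefore re-processed on the next load.

```python
import datetime

def touch_last_changed(table):
    """Mark every row as just changed so the amended transformation
    is re-applied to the old data on the next load."""
    now = datetime.datetime.now()
    for row in table:
        row["_aris.lastChanged"] = now

table = [{"Case ID": "001", "Activity Name": "First activity",
          "_aris.lastChanged": datetime.datetime(2023, 5, 1)}]
touch_last_changed(table)
print(table)  # all rows now carry the latest timestamp
```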

PS: Do not forget to remove the update on “_aris.lastChanged” afterwards. Otherwise, you lose the benefits of the incremental table, and each upload will run a full data load instead of the delta.
