== Overview

[IMPORTANT]
====
This overview talks about the work of the author and others, but without bibliographic references. Currently, it is just meant as background to better understand the technical documentation in the sections to follow.
Maybe it could be developed into a more serious paper later.
====

The overall motivation for the work on data catalogs for simulation is to make it easier to develop and perform computer simulations in complex and _data rich_ domains like building physics, transportation, and all kinds of urban infrastructure.

=== The Bigger Picture

A good part of computer science was and is driven by the motivation to make it easier to develop computer programs of all sorts.
"Higher" programming languages were invented to make programs human readable, and soon special constructs for _functional programming_ (computation without side effects) and _structured programming_ (computation without go to statements) were introduced to help programmers write and understand ever-growing programs.
Then, between 1962 and 1967, the programming language Simula was developed specifically to deal with the challenges of simulating systems comprising many different types of objects.
This opened the door to more direct computer representations of real world objects, their attributes, relationships and behavior, ultimately leading to _object-oriented_ software development that today is embodied in programming languages like Java, C++, Python, and graphical notations like the Unified Modeling Language (UML).

While these achievements boosted the productivity of software developers, the creation of correct, efficient and maintainable programs, including simulations, still required a great deal of expert knowledge and experience.
To overcome this bottleneck, starting in the 70s, so-called 4th generation languages entered the stage.
These languages were tailored to specific tasks like statistics ("S" 1976, "R" being its successor), database programming (SQL 1979), or simulation (MATLAB around 1979, Mathematica 1988, Modelica 1999), to name a few.
By sacrificing generality, these special languages became more accessible to domain experts, not just trained software developers.
To flatten the learning curve even more, formal _graphical_ languages for special purposes were invented, e.g. Simulink for block diagram simulation models in 1984, entity-relationship diagrams for data modeling in 1976, UML for object-oriented systems design in the 1990s, or graphical languages to specify business and also scientific workflows around 2000.

This very short history of technologies for developing software in general, and simulations in particular, illustrates the tools at our disposal:

* general purpose programming languages that combine structured, functional and object-oriented approaches to enable the creation of big, modular software systems, often called "programming in the large"
* formal textual domain specific languages (DSLs) dedicated to solve specific tasks with ease
* formal graphical DSLs.

****
Note that DSLs tend to describe _what_ shall be achieved by a computation rather than describing in detail _how_ to achieve it.
Therefore, DSLs usually look more like a model than like an algorithm.
****

Now back to the task at hand.

Some domains deal with a few types of simple objects to be simulated.
Take the building blocks of an electric circuit as an example.
The algorithms to simulate these correctly and efficiently may be quite complex, but the model elements can usually be described by very few parameters like resistance or capacitance.
More complex domains like (regenerative) energy systems or building physics deal with more complex objects to be simulated, e.g. PV modules or layered walls of buildings, often coming in different types and configurations, and dozens of possibly interdependent parameters.


=== Lessons Learned

The above problem of navigating huge parameter spaces and assembling complex simulation models popped up as the author worked on the diagram editor for *INSEL*, a simulation language and runtime environment developed for renewable energy systems simulation.
To make existing catalogs on weather data, solar panels and inverter modules accessible to the modeler, special dialogs were added to the INSEL user interface that allowed browsing through the catalogs.
Using these browsers, the modeler would choose a weather station, panel or inverter to parameterize a corresponding INSEL block.
However, there are some severe disadvantages with this approach:

. Data catalogs were stored in a proprietary data format on disk within the INSEL application distribution, meaning they could not be used independently from INSEL by other interested parties (systems or users).
. The catalogs had to be maintained by editing text files manually.
. While INSEL modelers could browse the catalogs, searching and sorting were not supported.
. Development of Java Swing UIs for the different kinds of catalogs is time-consuming, as is their maintenance, e.g. if a catalog data format were to change.
. Putting UIs to handle big amounts of data into a diagram editor is not very user friendly.

From 2013 to 2016, the simulation platform *SimStadt* was developed to make specific modeling and simulation workflows accessible to experts in urban planning and energy systems.
SimStadt uses INSEL and other simulators under the hood; the use of 3D city data, provided as CityGML files, was a core requirement of this project.

To enable simulation of, say, the heating demand of a district, geometric building data had to be enriched with data on building physics and usage.
To do so, existing information about building physics and usage, often only available as informal typologies or tables, had to be provided to the SimStadt user on an abstract level, e.g. to choose between refurbishment scenarios.
At the same time, concrete building configurations and parameter sets had to be injected into the simulation models to obtain the desired results.

Again, we implemented data catalogs to fulfill these requirements, but compared to the quite simple catalogs used in INSEL, the data models for building materials, window, wall and roof types as well as the typologies of buildings, households, usage patterns, and so on were more intricate.
They had to be created iteratively in collaboration with domain experts.
In this situation, manually coding data formats and data access in a general purpose programming language would have led to relatively long iteration cycles and high communication effort between programmer and domain expert.
Instead, we decided to use a DSL for data modeling and use code generation whenever possible.
Since SimStadt was developed within the Java eco-system, we followed this standard approach:footnote:[A similar approach is in use to standardize extensions to CityGML via so-called application domain extensions (ADE) like the energy ADE for exchanging energy related data.]

. Developer and domain expert create a first version of the data model as XML Schema Definition (our DSL).
. For plausibility checks any standard XML editor can be used to create example data conforming to the XSD.
. With JAXB, the Java Architecture for XML Binding, Java code is generated to read our XML catalogs into Java objects that, in turn, can be accessed by SimStadt workflows to generate and parameterize simulations.
. If required, developer and domain expert go back to step one to refine data model and catalog data.
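
The binding step in this workflow can be illustrated with a minimal sketch. It is written in Python rather than Java and does not use JAXB or an XSD; it only shows the general idea of reading an XML catalog into typed objects that a workflow can then query. The element names, attribute names, and values are invented for illustration and do not reflect the actual SimStadt catalogs.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict

# Hypothetical catalog document; names and values are illustrative only.
CATALOG_XML = """
<catalog>
  <material id="brick" conductivity="0.8" density="1800"/>
  <material id="mineral_wool" conductivity="0.04" density="30"/>
</catalog>
"""


@dataclass(frozen=True)
class Material:
    """Typed counterpart of a <material> element.

    In the workflow described above, a class like this would be
    generated from the schema instead of being written by hand.
    """
    id: str
    conductivity: float  # assumed unit: W/(m K)
    density: float       # assumed unit: kg/m^3


def load_materials(xml_text: str) -> Dict[str, Material]:
    """Parse the catalog and return its materials indexed by id."""
    root = ET.fromstring(xml_text)
    materials: Dict[str, Material] = {}
    for elem in root.findall("material"):
        material = Material(
            id=elem.attrib["id"],
            conductivity=float(elem.attrib["conductivity"]),
            density=float(elem.attrib["density"]),
        )
        materials[material.id] = material
    return materials


materials = load_materials(CATALOG_XML)
print(materials["brick"])
```

With generated binding code, the hand-written `Material` class and the parsing loop disappear, which is why a change to the data model only requires editing the schema and regenerating.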

After the data model for building physics catalogs had matured, we developed an extra application for convenient creation and maintenance of building physics data catalogs separate from SimStadt.
It was developed in Java with a user interface written in JavaFX and was well received by domain expert users.

However, when a different catalog for building usages had to be created, it was quite difficult to reuse the XML schema and application code from the building physics catalog: The usage catalog data model was "pressed" into a form similar to the building physics catalog data model, and the UI code was "over-engineered" to accommodate both catalogs' requirements.


=== Low-Code-Development of Data Catalogs

From INSEL and SimStadt we learned that manual and automatic construction and parameterization of complex simulation models with many types of interrelated objects should be supported by means of domain specific data catalogs.

Close collaboration with domain experts in designing and implementing these catalogs in short development cycles is desirable.

Data catalogs and the software for their creation, maintenance and deployment should be independent of any specific simulation software, (a) to be reusable and (b) not to overload simulation applications. 

In SimStadt, catalog development was partly facilitated by a textual DSL for data modeling (XML schema language) and automatic generation of Java code from it.
On the other hand, user interfaces as well as the generation and parameterization of simulations from templates within SimStadt workflows still had to be coded manually, hindering the routine creation of new catalogs.

Now, in 2020, several developments in different projects provide an opportunity to re-think the topic of data catalogs for simulations, namely: 

. Plans for a new Urban Simulation Platform at Concordia University, Montreal 
. New implementation of INSEL front-end based on the Eclipse application framework and Eclipse-Sirius diagram editors
. Enhancement of existing building physics and usage catalogs from SimStadt and their adaptation to new regions
. Development of a new comprehensive catalog of electric systems components to be used in SimStadt as well as in Concordia's Urban Simulation Platform.

In what follows, the new technology stack used to implement (4) is documented in detail.
Plans are to use the same approach also for implementation of (3).

The new technology stack is rooted in the Eclipse application framework and eco-system.footnote:[An alternative with comparable goals, but a completely different architecture, would be to combine several web applications and services via portal software in web browsers.]
Its main advantage is the possibility to implement CRUD (Create, Read, Update, Delete) applications like data catalogs and their underlying data models with no or very few lines of handwritten code (_low-code-development_).
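
The low-code idea can be sketched independently of Eclipse: if the data model is declared once, generic code can provide the CRUD operations for every catalog entry type. The following Python sketch is only an analogy under that assumption; it does not use EMF, and the names `Catalog` and `Inverter` as well as the example values are hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, Generic, Type, TypeVar

T = TypeVar("T")


class Catalog(Generic[T]):
    """Generic in-memory CRUD store driven by a dataclass definition."""

    def __init__(self, entry_type: Type[T], key: str):
        # The declared model is inspected to validate the key field.
        if key not in [f.name for f in fields(entry_type)]:
            raise ValueError(f"{entry_type.__name__} has no field {key!r}")
        self._type = entry_type
        self._key = key
        self._entries: Dict[str, T] = {}

    def create(self, **values) -> T:
        entry = self._type(**values)
        entry_key = getattr(entry, self._key)
        if entry_key in self._entries:
            raise KeyError(f"duplicate key {entry_key!r}")
        self._entries[entry_key] = entry
        return entry

    def read(self, entry_key: str) -> T:
        return self._entries[entry_key]

    def update(self, entry_key: str, **changes) -> T:
        if self._key in changes:
            raise ValueError("the key field cannot be changed")
        self._entries[entry_key] = replace(self._entries[entry_key], **changes)
        return self._entries[entry_key]

    def delete(self, entry_key: str) -> None:
        del self._entries[entry_key]


# The only catalog-specific code: a declaration of the data model.
@dataclass(frozen=True)
class Inverter:
    name: str
    rated_power_kw: float


inverters = Catalog(Inverter, key="name")
inverters.create(name="INV-5", rated_power_kw=5.0)
inverters.update("INV-5", rated_power_kw=5.5)
print(inverters.read("INV-5"))
```

Adding a further catalog in this sketch means declaring another dataclass; none of the CRUD code has to be touched.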

Since task (2) and maybe (1) will use Eclipse, too, close integration of data catalogs and simulation environments seems feasible.
For example, a user could drag an electric system component from a catalog onto an INSEL block for parameterization.

The Eclipse application framework offers:

* OSGi plug-in mechanism and UI framework for integrating applications and services
* General notion of _project_ with specific file types, help system, preferences etc.
* IDE support for important general purpose languages like Java, https://marketplace.eclipse.org/content/pydev-python-ide-eclipse[Python], Ruby, C, Fortran, C++
* Support for creating textual and graphical DSLs (https://www.eclipse.org/Xtext[Xtext], https://www.eclipse.org/sirius[Sirius])

* Industry proven DSLs and code generators for data models and form based UIs via the https://www.eclipse.org/modeling/emf[_Eclipse Modeling Framework_] (EMF) providing:
** https://www.eclipse.org/ecoretools[_Ecore_] for model driven generation of Java classes and persistence layers for XML or databases
** https://eclipsesource.com/blogs/tutorials/emf-forms-view-model-elements[_EMF Forms_] for describing and generating form based UIs
** Mechanisms to adapt or extend data models and forms to special needs (e.g., we added quantities, that is, numbers _with_ units, to Ecore and EMF Forms, a feature very important for data catalogs)

* Rich open source eco-system with lots of plugins and projects important for an urban simulation platform:
** model server for distributed access and work on Ecore models, including model comparison and migration (https://projects.eclipse.org/projects/modeling.emf.cdo[CDO], https://www.eclipse.org/emf/compare[EMFCompare])
** a https://pyecore.readthedocs.io/en/latest[Python implementation of Ecore]
** GIS: storage, processing, and visualization of geographical data (list of projects under the umbrella https://projects.eclipse.org/projects/locationtech[LocationTech], e.g. user-friendly desktop internet GIS http://udig.refractions.net[uDig])
** workbench for traffic simulation (https://www.eclipse.org/sumo[SUMO])
** spatial multi-agent-simulation (https://gama-platform.github.io/wiki/Home[GAMA-Platform])
** scientific workflows (https://projects.eclipse.org/projects/science.triquetrum[Triquetrum]) 
** visualizations (https://www.eclipse.org/nebula/widgets/visualization/visualization.php[Nebula])
** machine learning (https://deeplearning4j.org[deeplearning4j])
** 45+ projects in the area of https://iot.eclipse.org[IoT]
** ...
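
The notion of a quantity mentioned in the list above, a number _with_ a unit, can be sketched as a small value type. This is a generic Python illustration, not the extension that was added to Ecore and EMF Forms; the class name `Quantity` and the tiny conversion table are assumptions made for the example.

```python
from dataclasses import dataclass

# Unit -> (dimension, factor to the base unit of that dimension).
# Illustrative table only, not the unit system of the actual catalogs.
_TO_BASE = {
    "m": ("length", 1.0),
    "cm": ("length", 0.01),
    "mm": ("length", 0.001),
    "W": ("power", 1.0),
    "kW": ("power", 1000.0),
}


@dataclass(frozen=True)
class Quantity:
    """A numeric value that always carries its unit."""
    value: float
    unit: str

    def to(self, target_unit: str) -> "Quantity":
        """Convert to another unit of the same dimension."""
        dimension, factor = _TO_BASE[self.unit]
        target_dimension, target_factor = _TO_BASE[target_unit]
        if dimension != target_dimension:
            raise ValueError(f"cannot convert {self.unit} to {target_unit}")
        return Quantity(self.value * factor / target_factor, target_unit)


rated_power = Quantity(2.5, "kW")
wall_thickness = Quantity(240.0, "mm")
print(rated_power.to("W"))
print(wall_thickness.to("m"))
```

Storing the unit next to the value lets a catalog reject dimensionally inconsistent entries, such as a length given where a power is expected, before they reach a simulation model.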

As always, all that glitters is not gold. When we go through the details below, some bugs and inconsistencies, typical for open source projects of this age and size, have to be addressed.