Doktorandenkolloquium Data and Knowledge Engineering

During this colloquium, current research work by doctoral students in the field of Data and Knowledge Engineering (DKE) will be presented.
The colloquium usually takes place on Thursdays from 1:00 pm alternating with the DKE research colloquium in room G29-301.

Please address questions about the colloquium to Michael Kotzyba or Andreas Nürnberger.

Current lectures: 

07/03/2024 (11:00 s.t. in room G29-R130)
Synthetic tabular data generation: challenges, methods, and counterfactual explanations
Mr. Emmanouil Panagiotou (Research Assistant / Doctoral Researcher)

In this presentation we explore different methods for generating synthetic tabular data, examining typical challenges related to data scarcity, imbalances, and metrics. Our main focus is on an application within the offshore wind domain, where we aim to produce innovative yet realistic designs for offshore substructures. Additionally, we explore counterfactual explanations, highlighting their significance in generating alternative scenarios and improving model interpretability.
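
As a rough, self-contained illustration of the counterfactual idea mentioned above (not the method presented in the talk), the sketch below greedily perturbs one tabular record until a simple classifier's prediction flips; the synthetic data, the logistic regression model, and the step size are all assumptions made for the example.

```python
# Minimal, illustrative counterfactual search for tabular data.
# The dataset, model, and greedy search strategy are assumptions for
# illustration only; they do not reproduce the talk's actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three numeric features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labels
clf = LogisticRegression().fit(X, y)

def counterfactual(x, target=1, step=0.05, max_iter=500):
    """Greedily nudge one feature at a time until the prediction flips."""
    x_cf = x.copy()
    for _ in range(max_iter):
        if clf.predict(x_cf.reshape(1, -1))[0] == target:
            return x_cf
        # try small steps in both directions for every feature and keep
        # the candidate that most increases the target-class probability
        candidates = []
        for j in range(x_cf.size):
            for d in (-step, step):
                c = x_cf.copy()
                c[j] += d
                p = clf.predict_proba(c.reshape(1, -1))[0, target]
                candidates.append((p, c))
        x_cf = max(candidates, key=lambda t: t[0])[1]
    return x_cf

x0 = X[y == 0][0]
print("original:", x0, "counterfactual:", counterfactual(x0))
```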

 

Past Lectures:

06/13/2019 (13:00 st in room G29-301)
SAP Enterprise Search - Characteristics and Challenges of a Company-Wide Search
Dipl.-Inf. Marcus Nitsche (SAP Design's Lead UX Designer for Enterprise Search, SAP SE)

When users search for information at work, they have come to expect the same user experience they get from web search engines, where they can find relevant information within a fraction of a second. However, searching for business content is a much more complex process. Business-critical information is often distributed across different databases, located within different systems, using different networks - even within the same company. And in contrast to web searches, not all users share the same authorizations, so enterprise search systems need to consider individual access rights. Also, because changes to business data can be critical for decision-making, an enterprise search needs to ensure that the information found is always up-to-date. This talk discusses these characteristics of a company-wide search and the challenges they pose for SAP Enterprise Search.

03/21/2019 (1:00 pm s.t. in room G29-301)
Linked Learning Items: Creation of digital learning material from web resources
MA Katrin Krieger (Faculty of Computer Science (FIN), OVGU)

Observations have shown that learners in e-learning environments not only use the learning material provided, but also draw on web resources, for example to solve tasks. However, such resources reside outside of the actual learning environment, so searching for and working with web resources can distract learners from completing the task at hand; the learning process is interrupted. One solution to this problem is to embed appropriate web resources into the e-learning environment. This lecture presents a method for the automatic generation of digital learning material that can be embedded in web-based e-learning environments. The process is divided into three steps: extracting a semantic fingerprint for the web resources, obtaining educational objectives, and publishing the learning material as Linked Data.

11/20/2018 (11:30 am s.t. in room G29-301)
Musical Tempo Estimation with Convolutional Neural Networks
M.Sc. Hendrik Schreiber (Ph.D. candidate @ Int. Audio Lab. Erlangen; tagtraum industries)

Global musical tempo estimation is a well-established task in music information retrieval (MIR) research. While traditional systems typically first identify onsets or beats and then derive a tempo, our proposed system estimates the tempo directly from a conventional Mel-spectrogram. This is achieved using a single convolutional neural network (CNN). The CNN approach performs as well as or better than other state-of-the-art algorithms. In particular, exact tempo estimation without tempo-octave confusion is significantly improved. Furthermore, the same approach can be used for local tempo estimation. The talk focuses on how to design such a network by drawing inspiration from a traditional signal-processing-based approach and translating it into a fully convolutional neural network (FCN).
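
The following is a minimal PyTorch sketch of the kind of convolutional architecture described above: a small stack of convolutions over a Mel-spectrogram followed by a classification head over tempo classes. The layer sizes, the choice of 256 tempo classes, and the input shape are illustrative assumptions and do not reproduce the published network.

```python
# Sketch of a CNN that maps a Mel-spectrogram to tempo-class logits.
# Layer sizes and the number of tempo classes are illustrative assumptions.
import torch
import torch.nn as nn

class TempoCNN(nn.Module):
    def __init__(self, n_mels=40, n_tempo_classes=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool along frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),              # collapse time and frequency
        )
        self.classifier = nn.Linear(32, n_tempo_classes)

    def forward(self, spec):                           # spec: (batch, 1, n_mels, frames)
        h = self.features(spec).flatten(1)
        return self.classifier(h)                      # logits over tempo classes

model = TempoCNN()
dummy = torch.randn(2, 1, 40, 256)                     # two spectrogram excerpts
print(model(dummy).shape)                              # torch.Size([2, 256])
```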

08/23/2018 (13:00 s.t. in room G29-301)
Evaluating Semantic CoCreation in Cognitive Representation Models
M.Sc. Stefan Schneider (Data and Knowledge Engineering Group (DKE), OvGU (FIN))

In Cognitive Computing, researchers are trying to implement a unified computational theory of mind. Recent contributions suggest that such artificial systems are most successful if they operate on language games in a community of agents. This confirms Wittgenstein's assumptions about Language in Use and lets us speak of Interactive Cognitive Computing. To understand the basic principles of Interactive Cognitive Computing, we use the concept of Semantic CoCreation on top of cognitive representation models. In this session we present a new spatial language game, which we call the Location Identification Task. This task differs from other similar spatial games because of its high degree of flexibility. For evaluation purposes we define intrinsic parameters to enable a quantitative usability evaluation.

05/03/2018 (1:00 pm st in room G29-412)
Hidden Markov models for signal decoding in brain-computer-interfaces
Dipl.-Phys. Tim Pfeiffer (Chair of Medical Telematics and Medical Technology, OvGU (FEIT))

Hidden Markov models (HMM) are widely used for decoding purposes in the field of automated speech recognition (ASR), and their application has shown great success for various problems. Their beneficial features are also highly promising for the decoding of brain signals, which is an essential task in so-called brain-computer interfaces (BCI). A BCI is a system that provides a way of direct communication between the human brain and a computer. This can be used to grant patients with severe handicaps control over assistive devices (e.g. wheelchairs or prostheses) without the need for muscle activity. While HMM-based decoders are well investigated in ASR, only a small number of studies in the literature have so far considered using HMMs for brain signal decoding. This talk discusses adaptations to the central components of the signal processing chain that need to be considered when applying HMM-based decoding approaches to BCI settings. The central focus is on strategies for incorporating prior knowledge into the decoding by effective use of so-called language models. Exemplary results from a finger movement decoding task are shown to demonstrate the benefit of the proposed routines.
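
To make the decoding step concrete, here is a minimal Viterbi sketch in plain Python. The transition matrix plays the role of the "language model" prior over state sequences mentioned above, while the emission probabilities stand in for the brain-signal likelihoods; all states and numbers are invented for illustration.

```python
# Minimal Viterbi decoding sketch: transitions act as a simple "language model"
# prior over state sequences; emissions stand in for brain-signal likelihoods.
# All states and probabilities are made-up illustrative values.
import numpy as np

states = ["rest", "flex", "extend"]
trans = np.array([[0.8, 0.1, 0.1],     # prior knowledge about plausible
                  [0.3, 0.6, 0.1],     # state sequences ("language model")
                  [0.3, 0.1, 0.6]])
start = np.array([0.6, 0.2, 0.2])

def viterbi(emission_probs):
    """emission_probs: (T, n_states) likelihoods of the observed signal."""
    T, n = emission_probs.shape
    delta = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)
    delta[0] = start * emission_probs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans           # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emission_probs[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

obs = np.array([[0.7, 0.2, 0.1],
                [0.2, 0.7, 0.1],
                [0.1, 0.8, 0.1],
                [0.6, 0.3, 0.1]])
print(viterbi(obs))
```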

April 26, 2018 (1:00 p.m. st in room G29-335)
Conception and implementation of a knowledge-based system for the sustainable development of health systems in West Africa
M.Sc. Wendgounda Francis Ouédraogo (TH Brandenburg, Department of Economics)

Health systems in Africa are characterized not only by a scarcity of resources and their asymmetric distribution, but also by other challenges with enormous implications for health policy. One is the management of medical information: its constant collection and updating proves difficult because the information is still mainly made available on paper and in libraries. Another problem is the continuing education of health-care actors. The challenge is to make knowledge available in digital form in such a way that its dissemination becomes more efficient, exploiting the rapid spread of mobile devices even into remote regions of Africa.

11/09/2017 (4:00 p.m. st in room G29-301)
A framework model for the introduction and evaluation of social media
Peter Gerhard (Festo AG & Co. KG)

Social media has found its way into companies in recent years; in this context one speaks of Enterprise Social Media (ESM). ESM allow their users not only to consume content but also to create it themselves, and they encourage interaction and networking. This, in turn, enables employees to promote a cause, gain fellow campaigners, organize themselves, work out positions together, plan actions, and ultimately initiate and shape organizational change. With regard to the actual use of social media in companies, however, the picture is divided: on the one hand, their use is constantly increasing; on the other hand, their penetration is still low, especially in German medium-sized companies. A basic understanding of what social media can accomplish within a company, both for employees and for their managers, is still in its infancy.

This work develops a framework model that integrates the central ideas of social media into an organization and describes how the actors operate within it. The model consists of five parts: the contribution of ESM to organizational development (increasing efficiency and improving working life), the level at which this contribution occurs, the process for introducing and evaluating ESM, a description of the context in which ESM are used, and the role of leaders. It is shown how the framework model can be applied to a specific company, and concrete design recommendations for its application in operational practice as well as directions for the further development of ESM are derived from this. In general, the work contributes to a better understanding of the role of IT in organizations.

May 11, 2017 (1:00 p.m. s.t. in room G29-301)
Context-based fusion of lane information considering reliability
Tuan T. Nguyen (Volkswagen, Corporate Research, Automated Driving)

In recent years, automated driving has become the focus of numerous research institutions and companies. Lane detection is one of the crucial tasks. In the literature, this task is addressed by using one or more information sources, e.g. optical road marking detection using camera sensors, the trajectory of the vehicle in front, or a digital map. These sources differ in their performance depending on the road and environmental conditions. Camera-based marking detection works well on freeways and country roads, but its performance drops in urban scenarios, where often only asphalt transitions or curbs delimit the road. In such cases, the alternative is to follow the vehicle in front. When merging different information sources, many existing works are based on the assumption that the sources always have the same performance and are therefore equally reliable. However, the above examples show that the performance of the information sources also depends on many factors, such as location and weather. Therefore, automated driving in all scenarios requires an explicit and correct reliability estimation of the information sources, which can be achieved by integrating a reliability estimate into the fusion. This thesis presents an approach to define, measure, learn, and integrate reliability into lane estimation. The approach:

  • defines the notion of reliability of a lane hypothesis,
  • shows how reliability can be measured using ground-truth data,
  • shows that the reliability of a hypothesis depends significantly on the context,
  • shows which context and sensor data are significant for determining reliability,
  • presents a method that estimates reliability from the significant context and sensor data,
  • defines a performance measure for fusion algorithms, and
  • demonstrates that context-based reliability estimation significantly improves the performance of existing algorithms (in all scenarios) without introducing significant degradation (e.g. through new systematic errors).

The goal of this work is to always fuse the reliable ego-lane hypotheses for lateral control in automated driving.
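
As a purely illustrative sketch of reliability-weighted fusion (not the fusion algorithm developed in the thesis), the following combines several ego-lane hypotheses, reduced here to a single lateral offset per source, with weights given by context-dependent reliability estimates; the source names and values are invented.

```python
# Illustrative reliability-weighted fusion of ego-lane hypotheses.
# Each source reports a lateral offset (metres) for the lane centre; the
# reliabilities are assumed to come from a context-dependent estimator.
def fuse_lane_hypotheses(hypotheses):
    """hypotheses: list of (source_name, lateral_offset_m, reliability in [0, 1])."""
    total = sum(r for _, _, r in hypotheses)
    if total == 0:
        raise ValueError("no reliable source available")
    return sum(offset * r for _, offset, r in hypotheses) / total

# Urban scenario: camera markings are weak, leading vehicle and map dominate.
sources = [
    ("camera_markings", 0.42, 0.2),
    ("leading_vehicle", 0.30, 0.7),
    ("digital_map",     0.35, 0.6),
]
print(f"fused lateral offset: {fuse_lane_hypotheses(sources):.2f} m")
```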

04/13/2017 (13:00 s.t. in room G29-301)
Visual Analytics in Participatory Processes
M.Sc. Lars Schütz (Hochschule Anhalt)

Today, e-participation in the domain of planning and decision processes attracts more and more attention. The growing number of participants and the use of ICT lead to several key challenges. First, the processes contain complex data in terms of diversity and connectedness, e.g., natural language text, images, and geospatial and time-oriented data, which may additionally relate to each other in the form of ideas, comments, formal statements, or documents. A network of explicit and implicit information containing all contributions evolves. Second, the exploration of the process data is time-consuming and places high cognitive demands on users. It is challenging to obtain an overall view and context. Third, knowledge discovery is currently based on manual analysis only. Implicit information, e.g., similar or contrary contributions, remains hidden. Automated data analysis and (information) visualization can provide a more comprehensive approach. The goal of the intended thesis is not to examine these fields solely in isolation, but rather to combine them tightly while focusing on interaction. That is, visual analytics methods are applied to the e-participation domain. We investigate methods for the analysis of contributions, the moderation of processes, and the exploration of the involved data. This research is guided by additional questions. For example, how can interactively triggered model updates be computed in real time in order to provide instant feedback, and how can these changes be visualized? Several prototypes will be implemented and evaluated in a Web-based context to illustrate that the targeted groups of participants, namely public administrations and citizens, can accomplish their tasks more efficiently.

01/26/2017 (13:00 s.t. in room G29-301)
Training Visual Concept Classifiers
M.Sc. Christian Hentschel (Hasso Plattner Institute for Software Systems Engineering)

Visual Concept Detection describes the process of automatically classifying images and video based on the depicted visual content. This talk will start by comparing different approaches for visual concept detection, namely Bag-of-Visual-Words and Deep Convolutional Neural Networks (CNN). Bag-of-Visual-Words methods represented the de facto standard until CNNs emerged, backed by highly parallel hardware as well as large training datasets. The talk will present the impact of the available amount of training data on the classification performance as achieved by the individual approaches. Furthermore, techniques for model visualization will be presented. Non-linear models suffer from the lack of interpretability. The presented visualization methods help to qualitatively compare visual concept models by highlighting image regions considered important for the final classification decision. Finally, the talk will address the problem of leveraging social photo communities in order to increase the amount of available training data without additional manual labeling efforts. A social community language model will be presented as well as an outlook for multi-modal retrieval.
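
A minimal Bag-of-Visual-Words sketch, assuming local descriptors have already been extracted per image: k-means builds the visual vocabulary, and each image is then represented as a normalized histogram of visual-word assignments. The random descriptors and the vocabulary size are illustrative assumptions.

```python
# Bag-of-Visual-Words sketch: cluster local descriptors into a visual
# vocabulary, then represent each image as a histogram of word assignments.
# Random descriptors and the vocabulary size are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# pretend each image yields ~100 local descriptors of dimension 128 (e.g. SIFT)
descriptors_per_image = [rng.normal(size=(100, 128)) for _ in range(5)]

vocabulary_size = 32
kmeans = KMeans(n_clusters=vocabulary_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))      # build the visual vocabulary

def bovw_histogram(descriptors):
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary_size).astype(float)
    return hist / hist.sum()                      # normalise to a distribution

histograms = [bovw_histogram(d) for d in descriptors_per_image]
print(histograms[0].shape)                        # (32,)
```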

06/30/2016 (13:00 in room G29-128)
Automatic Derivation of Balanced Scorecards from Text Corpora
Henner Graubitz

Companies worldwide are currently confronted with the age of digitalization. Key figures such as profits or balance-sheet ratios are no longer necessarily decisive for a company's successful future. The world's most highly valued companies are characterized by flat hierarchies in which employees are given more trust, can work autonomously, and can make transparent decisions. Such transparency is achieved by forwarding documents that may be of interest to everyone, or by storing them company-internally so that they are accessible to all employees; an idea can thus be given room to develop digitally. Information retrieval, methods from natural language processing (NLP), and data mining can help to aggregate this unstructured information and to derive insights about the company beyond traditional key figures, in order to be prepared for the growing digital future. A major challenge is the sheer breadth of unstructured information within a company. This talk presents this challenge and proposes various methods from the areas mentioned above. An approach is presented for splitting unstructured texts into manageable fragments. The network of individual employees within the company is uncovered by algorithms that first recognize named entities by matching them against frequently occurring patterns. In addition, classic methods for detecting name duplicates and word stems are applied in order to aggregate all information and extract further information from it. Likewise, strategies are derived from the individual classes using text summarization methods. As a result, this talk presents new perspectives and strategies for the company beyond the usual financial figures, which can be used in a Balanced Scorecard (BSC).
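
As a small illustration of the name-duplicate detection step mentioned above (not the talk's actual method), the following sketch groups name variants by string similarity using only Python's standard library; the example names and the similarity threshold are assumptions.

```python
# Toy name-duplicate detection via string similarity (difflib).
# Threshold and example names are illustrative assumptions.
from difflib import SequenceMatcher

names = ["Anna Schmidt", "A. Schmidt", "Anna Schmid", "Peter Meier", "P. Meier"]

def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

groups = []
for name in names:
    for group in groups:
        if any(similar(name, member) for member in group):
            group.append(name)       # add to an existing cluster of variants
            break
    else:
        groups.append([name])        # start a new cluster

print(groups)                        # clusters of likely name variants
```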

05/03/2016 (13:00 in room G29-301)
Role-based Data Management
Tobias Jäkel, TU Dresden, GRK 1907 RoSI

Software systems are ubiquitous and indispensable in today's life, in which everyone is connected with everything, everywhere. In addition, these systems are constantly being extended with new functionality and operate in continuously changing environments. The resulting challenges for modern software systems, such as context-dependent behavior of objects that arises both at development time and at runtime, can be addressed with the role concept. As a consequence, roles are used today for modeling and implementing complex, context-dependent software systems. Databases, as an essential part of such systems, are often neglected in this respect, with the result that the role semantics cannot be represented directly in the database system. An indirect mapping, however, brings disadvantages with it, such as increased transformation effort or the loss of the context-dependent information.

To address these problems and challenges from the perspective of a database management system, the RSQL framework is presented, a three-part approach consisting of a data model, a query language, and a result representation. The data model is the foundation and defines the role semantics in the database system: at the schema level, dynamic data types are introduced to represent the context-dependent information, while dynamic tuples capture this information at the instance level. The query language is an external interface for users and applications based on the defined data model; it is therefore designed for defining dynamic data types and manipulating dynamic tuples. The third component preserves the role semantics in the query results, which are represented as a network of connected dynamic tuples. Furthermore, various paths for navigating within this network are provided and explained.
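
The following Python sketch illustrates only the role concept itself, an entity playing context-dependent roles that add state per context; it is deliberately not RSQL, whose dynamic data types, dynamic tuples, and query syntax are not reproduced here.

```python
# Illustration of the role concept (not RSQL): a core entity plays
# context-dependent roles that add state and behaviour per context.
class Person:
    def __init__(self, name):
        self.name = name
        self.roles = {}                  # context -> role instance

    def play(self, context, role):
        self.roles[context] = role

class Student:
    def __init__(self, matriculation_no):
        self.matriculation_no = matriculation_no

class Employee:
    def __init__(self, salary):
        self.salary = salary

alice = Person("Alice")
alice.play("University", Student("12345"))
alice.play("Company", Employee(42000))

# The same core object exposes different information depending on context.
print(alice.roles["University"].matriculation_no)
print(alice.roles["Company"].salary)
```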

04/28/2016 (13:15 in room G29-301)
Dynamic Clustering in Social Networks
Pascal Held  (FIN, IWS)

In recent years, social networks have gained more and more influence on our lives. At the latest since the rise of Facebook, Twitter, and other large platforms, the popularity of such networks has been growing. This increased interest is also reflected in science and in the analysis of these networks. Social network analysis (SNA) does not only refer to the obvious networks of large social platforms, but also to social networks that lie hidden, for example a communication network, a co-author network, or the link structure of websites. Networks with similar properties can also be found in the human body, for instance in protein-protein interactions or in the interplay between individual brain regions. Social network analysis has become a research field of its own with a wide range of research directions, not only in computer science but also in other disciplines. These include, for example, the analysis of social relationships, the status of individual participants within a group, or density studies of various subgraphs. A further focus lies in finding cohesive groups within the networks, also called clusters or communities. Many works assume that the networks under consideration are static, or the analyses are performed on static snapshots and compared across different points in time. The focus of my research is cluster and community analysis for dynamic networks: when the underlying network changes, the detected cluster and community structure should be updated dynamically. To this end, I will build on work from the static case and adapt existing methods, or develop new ones, that provide exactly these capabilities.
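
As a small illustration of the problem setting (not of the methods developed in this research), the sketch below detects communities on a static snapshot with NetworkX and then naively recomputes them from scratch after an edge is added; dynamic approaches aim to avoid exactly this full recomputation.

```python
# Static community detection recomputed from scratch after a graph change.
# Dynamic approaches aim to update the communities incrementally instead.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                       # classic small social network
before = greedy_modularity_communities(G)
print("communities before:", [sorted(c) for c in before])

G.add_edge(0, 33)                                # the network changes over time
after = greedy_modularity_communities(G)         # naive: recompute everything
print("communities after:", [sorted(c) for c in after])
```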

03/10/2016 (13:15 in room G29-301)
Feature Improvement and Matching Refinement for Near and Semi Duplicate Image Retrieval in Large Collection (Thesis Proposal)
Afraa Ahmad Alyosef  (FIN, ITI)

Near-duplicate image retrieval is a very challenging field. The goal is to detect similar images in order to address problems such as copyright infringement, forged images, and altered versions of existing images being passed off as unrelated ones. Furthermore, images of the same site taken hours, days, or even months apart may not be identical because of the movement or occlusion of foreground objects, or because of changes in lighting between day and night. Moreover, changes in camera parameters, photometric (lighting) conditions, contrast, or resolution, or the use of different cameras for the same scene, make the task of determining similar images more complex. In this thesis, we aim to improve near-duplicate image retrieval for the case in which the query image is a sub-image of one of the database images. Such a sub-image may be an exact crop of the original scene or a zoomed-in view, and it may be taken from a different viewpoint, under different lighting conditions, or even with a different camera. These kinds of variation make the retrieval task more complex. From this point of view, it is important to answer the following questions:
- What size can a sub-image have and still be considered a near-duplicate image?
- Which types of changes make it difficult to detect near-duplicate images?
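
To make the sub-image retrieval setting concrete, here is a minimal OpenCV sketch that matches SIFT descriptors between a query sub-image and a candidate database image using Lowe's ratio test. The file names, the ratio threshold, and the match-count criterion are illustrative assumptions, not the thesis' actual pipeline.

```python
# Minimal near-duplicate check: match SIFT descriptors between a query
# sub-image and a database image and count ratio-test survivors.
# File names and thresholds are illustrative placeholders.
import cv2

query = cv2.imread("query_crop.png", cv2.IMREAD_GRAYSCALE)
candidate = cv2.imread("database_image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_q, des_q = sift.detectAndCompute(query, None)
kp_c, des_c = sift.detectAndCompute(candidate, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des_q, des_c, k=2)

# Lowe's ratio test keeps only distinctive matches.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
if len(good) > 30:                       # purely illustrative threshold
    print("candidate is a likely near-duplicate of the query")
```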

02/11/2016 (13:15 in room G29-301)
Clinical decision support system based on Bayesian networks to support interdisciplinary tumor board decisions
Mario Cypko (Universität Leipzig, Innovation Center Computer Assisted Surgery)

The Innovation Center Computer Assisted Surgery (ICCAS) is a research initiative funded by the German Federal Ministry of Education and Research. It was founded in 2005 as a central facility at the University of Leipzig. It is a place of research for surgeons from various disciplines as well as engineers and computer scientists, who collaborate on the development of state-of-the-art technologies for clinical assistance systems and the operating room of the future. The increasing understanding of the complexity of oncological diseases and the dramatic growth of available patient information allow, in principle, for a highly individualized treatment of patients. At the same time, however, optimal treatment decisions are becoming more difficult to make. Clinical decision support systems based on patient-specific Bayesian networks can help to keep an overview of the entire patient situation and find the best treatment decisions. Cypko will highlight aspects of decision making in tumor boards and also present the complexity of developing a clinical decision support system and integrating it into tumor boards.
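
A tiny, self-contained illustration of the kind of inference such a patient-specific Bayesian network performs: observing a finding updates the probability of a treatment-relevant state via Bayes' rule. The variables, structure, and probabilities are invented for illustration and carry no clinical meaning.

```python
# Two-node Bayesian network by hand: P(State) and P(Finding | State).
# All variables and numbers are invented for illustration only.
p_state = {"advanced": 0.3, "early": 0.7}                  # prior
p_finding_given_state = {                                   # likelihoods
    "advanced": {"positive": 0.85, "negative": 0.15},
    "early":    {"positive": 0.20, "negative": 0.80},
}

def posterior(finding):
    """P(State | Finding = finding) via Bayes' rule."""
    joint = {s: p_state[s] * p_finding_given_state[s][finding] for s in p_state}
    evidence = sum(joint.values())
    return {s: joint[s] / evidence for s in joint}

print(posterior("positive"))   # state probabilities after a positive finding
```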

12/17/2015 (13:15 in room G29-128)
A New Approach to Touch Gesture Recognition for Distinguishing Example-Defined Gestures with Different Temporal Dynamics (Thesis Proposal)
Tim Dittmar (FIN, ISG)

Touch-based mobile devices such as smartphones and tablets have spread enormously in recent years and can be found almost everywhere today. Access to password-protected online services also frequently happens via such devices and their touch interface. However, entering secure passwords on a virtual keyboard is considerably more cumbersome and time-consuming than on a physical one. Gesture passwords could be used as a more convenient alternative. The idea of using touch gestures for authentication already exists in a very simple form on Android devices (pattern lock) and has also been examined more closely in scientific papers, but the focus there was mostly on recognizing the shape of a gesture. The execution speed of a gesture has not been taken into account so far, although doing so would increase the security of gesture-based authentication. However, there is not yet a specialized method that can define touch gestures by examples in which the execution speeds are also relevant. For many gesture recognition tasks in which gestures are defined by examples, hidden Markov models are used; the so-called Conversive Hidden non-Markovian Models (CHnMM) are an extension of this model class. They allow a much more precise specification of temporal behavior and therefore appear much better suited to distinguishing gestures by their temporal profile as well. The goal of this work is to develop a method for automatically creating CHnMM-based gesture models from examples, in order to enable the recognition of touch gestures with different execution speeds. To evaluate the method, a gesture recognition system will also be implemented so that measures of recognition quality (precision, recall) and recognition speed can be collected.

10/22/2015 (13:00 in room G29-301)
Infrastructure for Research Data Publication in the Frame of High-Throughput Plant Phenotyping
Daniel Arend (Leibniz Institute of Plant Genetics and Crop Plant Research)

Life sciences have become one of the most data-intensive disciplines and a major player in the “big data” age. High-throughput technologies have become affordable and produce a huge amount of research data, which is the basis for nearly every bioinformatics analysis pipeline. But there is a huge gap in standards and policies for its maintenance, life cycle, and citation. Furthermore, there are many domain-specific archives that interact little with each other, such as the databases maintained by the European Bioinformatics Institute (EBI), but also several general data sharing services such as figshare. Research institutes use no or only internal policies defining how to describe research data with metadata or how to preserve it. Therefore the reproducibility and the long-term preservation of research data depend strongly on the scientists, the project bodies, or the journal in which they want to publish their results. In the scientific life cycle, research data passes through different domains, and the scientists are often faced with the problem of insufficient infrastructures that guarantee persistent preservation and support them during their work, as well as missing incentives for making their research data available. The focus of this thesis will be the development of a generally applicable framework and a concept for research data management. A comprehensive requirement analysis will review current strategies, established systems, and their pros and cons. Based on a use case in the field of plant phenotyping, a workflow for data publication, the long-term preservation of research data, and its citation is under investigation. The conceptual work and the implementation of the necessary infrastructure will take place within the frame of the running five-year DPPN research project, a large international project with the aim of developing an infrastructure and standards for the storage and analysis of high-throughput plant phenotyping experiments. The developed framework is a main component for realizing future-proof storage and sustainable citation using persistent identifiers, such as the popular Digital Object Identifier (DOI).

07/21/2015 (14:00 in room G29-E036)
Creating Learning Material from Web Resources
Katrin Krieger (FIN, IWS)

Technology-enhanced learning (TEL), especially Web-based learning, has become a fundamental part of education over the last decades. E-learning platforms provide access to electronic learning material, accompany in-class lectures in blended learning scenarios, or offer assessment facilities for formal and informal testing. Whole courses are held online, whether as qualification training, school education in sparsely populated areas, or courses dealing with special topics, letting remotely located experts teach students all over the world. TEL has torn down barriers in time and space, enabling students to learn wherever and whenever they want. We observed that learners use general Web resources as learning material. In order to overcome problems such as distraction and abandonment of a given learning task, we want to integrate these Web resources into Web-based learning systems and make them available as learning material within the learning context. We present an approach to generating learning material from Web resources that extracts a semantic fingerprint for these resources, obtains educational objectives, and publishes the learning material as Linked Data.

06/04/2015 (13:15 in room 301)
A Framework for an Intelligent Decision Support System for Onshore Drilling Rig Selection
Opeyemi Bello (Institute of Petroleum Engineering, Clausthal University of Technology, Germany)

Today, choosing drilling rig equipment during the well planning phase of E&P wells can be a very challenging task, mainly because multiple drilling rig manufacturers on the market meet the operational conditions but not most operators' design specifications. The conventional approach for selecting an appropriate drill rig for onshore operations is based on a method of exclusion, with engineering experience and the lithology of the field to be developed serving as key driving factors. A poorly selected drill rig can add unnecessary operating costs.
The objective of this study is to develop an unconventional approach to drill rig selection using data mining and machine learning techniques. An intelligent decision support system will be developed that guides well designers and E&P operators in selecting an appropriate drilling rig that delivers reliable performance, resulting in safe drilling operations, mitigating the effects of time delays, being environmentally friendly, and, most importantly, being economically viable. To solve this problem, a scientific approach will be adopted. First, this study will identify the most effective factors considered in the selection of a drill rig, establish an objective function, and incorporate these factors (i.e. both qualitative and quantitative parameters influencing drill rig selection) into a data mining and machine learning environment in order to evaluate their performance and identify a suitable drilling rig. The results will score each existing drilling rig so that their performance for onshore applications can be compared and the best rig selected.
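
As a loose illustration of the exclusion-plus-scoring idea (not the study's actual data mining or machine learning model), the sketch below filters candidate rigs against minimum requirements and ranks the remainder with a weighted objective function; all criteria, weights, and rig data are invented.

```python
# Illustrative multi-criteria scoring of candidate rigs; the criteria, weights,
# and candidate data are invented and do not reflect the study's actual model.
requirements = {"hook_load_t": 250, "depth_rating_m": 4000}
weights = {"hook_load_t": 0.4, "depth_rating_m": 0.4, "day_rate_keur": 0.2}

candidates = {
    "Rig A": {"hook_load_t": 300, "depth_rating_m": 4500, "day_rate_keur": 45},
    "Rig B": {"hook_load_t": 230, "depth_rating_m": 3800, "day_rate_keur": 30},
    "Rig C": {"hook_load_t": 320, "depth_rating_m": 5000, "day_rate_keur": 60},
}

def score(rig):
    # method of exclusion: drop rigs that miss the minimum technical requirements
    if any(rig[k] < v for k, v in requirements.items()):
        return float("-inf")
    # higher capacity is better, a lower day rate is better (hence the minus sign)
    return (weights["hook_load_t"] * rig["hook_load_t"]
            + weights["depth_rating_m"] * rig["depth_rating_m"]
            - weights["day_rate_keur"] * rig["day_rate_keur"])

best = max(candidates, key=lambda name: score(candidates[name]))
print("selected rig:", best)
```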

11/27/2014 (10:30 in room 301)
High Performance Data Management beyond Counting Cache Misses
Holger Pirk (Database Architectures group, CWI Amsterdam)

Databases are bandwidth bound applications - this litany has driven research for more than twenty years. However, recent developments in computer hardware have changed the status quo significantly, motivating a re-investigation of this assumption. To illustrate the urgency of this line of research, I present a recent study on the efficiency of pivoted two-way partitioning (the basis for many algorithms such as quicksort or database cracking). This study indicates that even such simple algorithms need significant tuning to actually hit the "memory wall". For these tuning efforts we can draw from an arsenal of techniques such as vectorized processing, predication and the use of SIMD instructions. However, a classic technique still plays a key role: parallelization. Unfortunately, the parallel implementation of data processing systems becomes increasingly challenging due to the increasing diversity of involved devices: CPUs, GPUs, APUs, SSDs and classic spinning disks perform best at different degrees of parallelism. For that reason, I will also use this opportunity to present a novel DBMS architecture that aims to mediate between the different devices, allowing each to work at sweet-spot performance.
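
For reference, pivoted two-way partitioning in its textbook form looks as follows; this plain-Python sketch shows only the algorithm, whereas the talk's point is precisely that such a routine needs vectorization, predication, SIMD, and parallelization before it actually hits the memory wall.

```python
# Textbook pivoted two-way partitioning (Hoare-style): elements < pivot end
# up left of the returned boundary, elements >= pivot end up right of it.
def partition(values, pivot):
    lo, hi = 0, len(values) - 1
    while True:
        while lo <= hi and values[lo] < pivot:
            lo += 1
        while lo <= hi and values[hi] >= pivot:
            hi -= 1
        if lo > hi:
            return lo                      # boundary between the two partitions
        values[lo], values[hi] = values[hi], values[lo]

data = [7, 2, 9, 4, 1, 8, 5]
boundary = partition(data, 5)
print(data[:boundary], data[boundary:])    # [1, 2, 4] [9, 7, 8, 5]
```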

07/21/2014 (12:00 in room 301)
Exploration by Learning Views from Templates
Thomas Low (AG DKE, Institut für Technische und Betriebliche Informationssysteme)

Nowadays, data not only explodes in terms of size, it also grows in richness. Current search and exploration tools usually ignore a lot of information to provide specialized views on the data. For example, web search engines present search results in a sorted list based on their relevance to a query. However, it also might be interesting to find groups of similar results in order to get an overview. There are many different views on the data. Each emphasizes certain properties of the information space and neglects or ignores others. Depending on the task some views are more appropriate or interesting. In contrast to recent approaches, the goal of this thesis is not to personalize a single application-specific view, but instead to provide means to explore the space of different views on the data. The vision is that views can be interactively selected or learned from partial information given in the form of direct manipulations of visual representations of the information space, e.g., partially sorting a list or moving objects in a two-dimensional map. This translates to questions like: What is a suitable sorting such that one item is an extremum and another is rather average? What is a suitable map-based projection such that two items are close together, but another one is far away? Such user-specified templates allow to narrow down the search space to useful views, which are more likely to contain the desired patterns or clusters.

05/26/2014 (10:00 in room 301)
Analyzing Similarity of Cloned Software Variants using Hierarchical Set Models
Slawomir Duszynski (Fraunhofer-Institut für Experimentelles Software Engineering (IESE), Kaiserslautern)

Software reuse approaches, such as software product lines, are known to enable considerable effort and cost savings when developing families of software systems with a significant overlap in functionality. In practice, however, the need for strategic reuse often becomes apparent only after a number of product variants have already been delivered. The variants are often created in an ad-hoc manner: cloning the original system's code and changing it according to the specific requirements of the customer is frequently observed in industrial practice. In such a situation, a reuse approach has to be introduced afterwards, based on the already existing product implementations. An approach for code similarity analysis, which is needed for that purpose, is the main focus of the presented dissertation research.

In the talk, we present a reverse engineering approach for obtaining information about the source code similarity of existing product variants. The variant systems are modeled as hierarchical sets of uniquely identifiable elements of known size, and the similarity of the variants is expressed using set algebra. The similarity information is available at any abstraction level, from a single code line up to a whole system group. A generic analysis framework is proposed, which can be used for diverse system representations and diverse similarity detection algorithms, including clone detection. The approach supports the simultaneous analysis of multiple source code variants and proposes visualization concepts that enable easy interpretation of the analysis results even for large systems and a high number of variants. We hypothesize that the analysis approach allows for obtaining more detailed and more correct variant similarity information with lower analysis effort compared to existing approaches. The performed empirical evaluations of the hypothesized improvements are discussed.
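
A minimal sketch of the set-algebra view described above, with invented element identifiers and variants: each variant is a set of uniquely identified elements, and intersections, unions, and differences immediately yield the shared and variant-specific portions as well as similarity ratios.

```python
# Variants as sets of uniquely identified elements (e.g. hashed code lines);
# set algebra yields shared and variant-specific portions. Example data is invented.
variant_a = {"core.c:1", "core.c:2", "ui.c:1", "ui.c:2"}
variant_b = {"core.c:1", "core.c:2", "ui.c:1", "net.c:1"}
variant_c = {"core.c:1", "core.c:2", "net.c:1", "net.c:2"}

common_to_all = variant_a & variant_b & variant_c
only_in_a = variant_a - (variant_b | variant_c)

print("shared by all variants:", sorted(common_to_all))
print("specific to variant A: ", sorted(only_in_a))
print("A/B similarity:", len(variant_a & variant_b) / len(variant_a | variant_b))
```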

05/22/2014 (10:00 in room 301)
Long-Term Preservation and Management of Scientific Research Data
Daniel Arend (Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben)

The “big data” problem is one of the main challenges in life sciences. High-throughput technologies have become affordable and produce a huge amount of primary data, which is the basis for nearly every bioinformatics analysis pipeline. But there is a huge gap in standards and policies for its maintenance, life cycle, and citation. Furthermore, there is a large number of domain-specific databases that interact little with each other, such as the European Nucleotide Archive or the BioModels database, but also several general databases and data sharing services such as figshare or DRYAD.
Besides these technical aspects, research institutes use no policies or only internal ones, which define how to handle primary data, how to describe it with metadata, or which state of the datasets must be preserved. Therefore the sustainability and the long-term preservation of research data depend strongly on the scientists, the project bodies, or the journal in which they want to publish their results. In the scientific life cycle, primary data passes through different domains, and the scientists are often faced with the problem of insufficient infrastructures that guarantee persistent preservation and support them during their work, as well as missing incentives for making their research data available.
The focus of this thesis will be the development of a generally applicable framework and policy for research data management. A comprehensive requirement analysis will review current strategies, established systems, and their pros and cons. Based on two use cases in the fields of systems biology and plant phenotyping, a workflow for data publication, the long-term preservation of primary data, and its citation is under investigation. The conceptual work and the implementation of the necessary infrastructure will take place within the frame of a running five-year research project. Here the developed e!DAL API (electronic Data Archive Library) for Java is a possible solution to address those shortcomings and close the gap between the storage of scientific primary data and its long-term availability. It provides an enhanced storage back end, comparable to a file system but offering additional features, based on literature studies and the recommendations of several organizations, to guarantee long-term preservation of the digital objects. In the case of the DPPN/EPPN project, a large international project with the aim of developing an infrastructure and standards for the storage and analysis of high-throughput plant phenotyping experiments, the API can be a main component for realizing future-proof storage and sustainable citation using persistent identifiers, such as the popular Digital Object Identifier (DOI).
The talk will summarize challenges in research data management with a special focus on the long-term preservation of primary data. First, an overview of the state of the art in the research field and of existing databases will be given. Furthermore, the use case scenario for the research data life cycle, with a focus on high-throughput phenotyping in the DPPN research collaboration, is introduced. A first prototype of the data citation infrastructure e!DAL will be presented. The talk concludes with an outline of the planned PhD thesis.

03/20/2014 (13:00 in room 128)
Collaborative Technology Search Using Search Maps: Enhancing Traceability, Overview and Sensemaking in Collaborative Information Seeking
Dominic Stange (Volkswagen AG)

We propose a search user interface that is especially designed to support information seeking in a collaborative search setting. The motivation of the thesis is twofold. The first goal is to support awareness, understanding, and sensemaking within a group working together on the same search task. The support is achieved by visualizing the information seeking activities of the user group with an interactive two-dimensional search map. The users share the same search map and can actively collaborate and evolve their search topic together. The search map serves as a common ground and enables each user to gain a more comprehensive understanding of the domain in question by taking advantage of the shared view of the community.

The second goal of the thesis is to create a graphical network of entities which are discovered during the search process. The entities are manually extracted by highlighting text within documents encountered during the search process and classified given a previously developed domain taxonomy of a business application in technology search. These classified entities are then linked to each other in a graph database using their classes and the context of the search map to create the link structure. Technology search focuses on identifying and evaluating interesting technologies that can be used in a business application.
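
As a small sketch of the entity network described above, using NetworkX in place of an actual graph database and with invented entities, classes, and contexts: classified entities become nodes, and co-occurrence in the same search-map context becomes the linking criterion.

```python
# Toy entity network: classified entities as nodes, linked when they were
# found in the same search-map context. Entities and classes are invented;
# a real system would use a graph database instead of NetworkX.
import networkx as nx
from itertools import combinations

G = nx.Graph()
G.add_node("lidar sensor", entity_class="Technology")
G.add_node("Supplier X", entity_class="Company")
G.add_node("object detection", entity_class="Capability")

# entities highlighted within the same search-map node share a context
context = ["lidar sensor", "Supplier X", "object detection"]
for a, b in combinations(context, 2):
    G.add_edge(a, b, context="search map node 7")

print(G.nodes(data=True))
print(G.edges(data=True))
```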
