Conceptual Modeling - ER 2004: 23rd International Conference on Conceptual Modeling, Shanghai, China, November 8-12, 2004. Proceedings (Lecture Notes in Computer Science)

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison Lancaster University, UK
Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler University of Surrey, Guildford, UK
Jon M. Kleinberg Cornell University, Ithaca, NY, USA
Friedemann Mattern ETH Zurich, Switzerland
John C. Mitchell Stanford University, CA, USA
Moni Naor Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz University of Bern, Switzerland
C. Pandu Rangan Indian Institute of Technology, Madras, India
Bernhard Steffen University of Dortmund, Germany
Madhu Sudan Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos New York University, NY, USA
Doug Tygar University of California, Berkeley, CA, USA
Moshe Y. Vardi Rice University, Houston, TX, USA
Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany

3288


Paolo Atzeni Wesley Chu Hongjun Lu Shuigeng Zhou Tok Wang Ling (Eds.)

Conceptual Modeling – ER 2004 23rd International Conference on Conceptual Modeling Shanghai, China, November 8-12, 2004 Proceedings

Springer

eBook ISBN: 3-540-30464-9
Print ISBN: 3-540-23723-2

©2005 Springer Science + Business Media, Inc. Print ©2004 Springer-Verlag Berlin Heidelberg. All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher. Created in the United States of America.

Visit Springer's eBookstore at: http://ebooks.springerlink.com and the Springer Global Website Online at: http://www.springeronline.com

Foreword

On behalf of the Organizing Committee, we would like to welcome you to the proceedings of the 23rd International Conference on Conceptual Modeling (ER 2004). This conference provided an international forum for technical discussion on conceptual modeling of information systems among researchers, developers and users. This was the third time that this conference was held in Asia; the first time was in Singapore in 1998 and the second time was in Yokohama, Japan in 2001. China is the third largest nation with the largest population in the world. Shanghai, the largest city in China and a great metropolis, famous in Asia and throughout the world, is therefore a most appropriate location to host this conference. This volume contains papers selected for presentation and includes the two keynote talks by Prof. Hector Garcia-Molina and Prof. Gerhard Weikum, and an invited talk by Dr. Xiao Ji. This volume also contains industrial papers and demo/poster papers. An additional volume contains papers from 6 workshops. The conference also featured three tutorials: (1) Web Change Management and Delta Mining: Opportunities and Solutions, by Sanjay Madria, (2) A Survey of Data Quality Issues in Cooperative Information Systems, by Carlo Batini, and (3) Visual SQL – An ER-Based Introduction to Database Programming, by Bernhard Thalheim. The technical program of the conference was selected by a distinguished program committee consisting of three PC Co-chairs, Hongjun Lu, Wesley Chu, and Paolo Atzeni, and more than 70 members. They faced a difficult task in selecting 57 papers from many very good contributions. This year the number of submissions, 293, was a record high for ER conferences. We wish to express our thanks to the program committee members, external reviewers, and all authors for submitting their papers to this conference. We would also like to thank: the Honorary Conference Chairs, Peter P. Chen and Ruqian Lu; the Coordinators, Zhongzhi Shi, Yoshifumi Masunaga, Elisa Bertino, and Carlo Zaniolo; Workshop Co-chairs, Shan Wang and Katsumi Tanaka; Tutorial Co-chairs, Jianzhong Li and Stefano Spaccapietra; Panel Co-chairs, Chin-Chen Chang and Erich Neuhold; Industrial Co-chairs, Philip S. Yu, Jian Pei, and Jiansheng Feng; Demos and Posters Co-chair, Mong-Li Lee and Gillian Dobbie; Publicity Chair, Qing Li; Publication Chair cum Local Arrangements Chair, Shuigeng Zhou; Treasurer, Xueqing Gong; Registration Chair, Xiaoling Wang; Steering Committee Liaison, Arne Solvberg; and Webmasters, Kun Yue, Yizhong Wu, Zhimao Guo, and Keping Zhao. We wish to extend our thanks to the Natural Science Foundation of China, the ER Institute (ER Steering Committee), the K.C. Wong Education Foundation in Hong Kong, the Database Society of the China Computer Federation, ACM SIGMOD, ACM SIGMIS, IBM China Co., Ltd., Shanghai Baosight Soft-

ware Co., Ltd., and the Digital Policy Management Association of Korea for their sponsorships and support. At this juncture, we wish to remember the late Prof. Yahiko Kambayashi who passed away on February 5, 2004 at age 60 and was then a workshop co-chair of the conference. Many of us will remember him as a friend, a mentor, a leader, an educator, and our source of inspiration. We express our heartfelt condolence and our deepest sympathy to his family. We hope that the attendees found the technical program of ER 2004 to be interesting and beneficial to their research. We trust they enjoyed this beautiful city, including the night scene along the Huangpujiang River and the postconference tours to the nearby cities, leaving a beautiful and memorable experience for all.

November 2004

Tok Wang Ling Aoying Zhou

Preface

The 23rd International Conference on Conceptual Modeling (ER 2004) was held in Shanghai, China, November 8–12, 2004. Conceptual modeling is a fundamental technique used in analysis and design as a real-world abstraction and as the basis for communication between technology experts and their clients and users. It has become a fundamental mechanism for understanding and representing organizations, including new e-worlds, and the information systems that support them. The International Conference on Conceptual Modeling provides a major forum for presenting and discussing current research and applications in which conceptual modeling is the major emphasis. Since the first edition in 1979, the ER conference has evolved into the most prestigious one in the areas of conceptual modeling research and applications. Its purpose is to identify challenging problems facing high-level modeling of future information systems and to shape future directions of research by soliciting and reviewing high-quality applied and theoretical research findings. ER 2004 encompassed the entire spectrum of conceptual modeling. It addressed research and practice in areas such as theories of concepts and ontologies underlying conceptual modeling, methods and tools for developing and communicating conceptual models, and techniques for transforming conceptual models into effective information system implementations. We solicited forward-looking and innovative contributions that identify promising areas for future conceptual modeling research as well as traditional approaches to analysis and design theory for information systems development. The Call for Papers attracted 295 exceptionally strong submissions of research papers from 36 countries/regions. Due to limited space, we were only able to accept 57 papers from 21 countries/regions, for an acceptance rate of 19.3%. Inevitably, many good papers had to be rejected. The accepted papers covered topics such as ontologies, patterns, workflows, metamodeling and methodology, innovative approaches to conceptual modeling, foundations of conceptual modeling, advanced database applications, systems integration, requirements and evolution, queries and languages, Web application modeling and development, schemas and ontologies, and data mining. We are proud of the quality of this year’s program, from the keynote speeches to the research papers, with the workshops, panels, tutorials, and industrial papers. We were honored to host the outstanding keynote addresses by Hector Garcia-Molina and Gerhard Weikum. We appreciate the hard work of the organizing committee, with interactions around the clock with colleagues all over the world. Most of all, we are extremely grateful to the program committee members of ER 2004 who generously spent their time and energy reviewing submitted papers. We also thank the many external referees who helped with the review process. Last but not least, we thank the authors who wrote high-quality

research papers and submitted them to ER 2004, without whom the conference would not have existed.

November 2004

Paolo Atzeni, Wesley Chu, and Hongjun Lu

ER 2004 Conference Organization

Honorary Conference Chairs Peter P. Chen Ruqian Lu

Louisiana State University, USA Fudan University, China

Conference Co-chairs Aoying Zhou Tok Wang Ling

Fudan University, China National University of Singapore, Singapore

Program Committee Co-chairs Paolo Atzeni Wesley Chu Hongjun Lu

Università Roma Tre, Italy University of California at Los Angeles, USA Univ. of Science and Technology of Hong Kong, China

Workshop Co-chairs
Shan Wang Renmin University of China, China
Katsumi Tanaka Kyoto University, Japan
Yahiko Kambayashi1 Kyoto University, Japan

Tutorial Co-chairs
Jianzhong Li Harbin Institute of Technology, China
Stefano Spaccapietra EPFL Lausanne, Switzerland

Panel Co-chairs Chin-Chen Chang Erich Neuhold

Chung Cheng University, Taiwan, China IPSI, Fraunhofer, Germany

Industrial Co-chairs Philip S. Yu Jian Pei Jiansheng Feng

IBM T.J. Watson Research Center, USA Simon Fraser University, Canada Shanghai Baosight Software Co., Ltd., China

Prof. Yahiko Kambayashi died on February 5, 2004.


Demos and Posters Chair Mong-Li Lee Gillian Dobbie

National University of Singapore, Singapore University of Auckland, New Zealand

Publicity Chair Qing Li

City University of Hong Kong, China

Publication Chair Shuigeng Zhou

Fudan University, China

Coordinators Zhongzhi Shi Yoshifumi Masunaga Elisa Bertino Carlo Zaniolo

ICT, Chinese Academy of Science, China Ochanomizu University, Japan Purdue University, USA University of California at Los Angeles, USA

Steering Committee Liaison Arne Solvberg

Norwegian University of Sci. and Tech., Norway

Local Arrangements Chair Shuigeng Zhou

Fudan University, China

Treasurer Xueqing Gong

Fudan University, China

Registration Xiaoling Wang

Fudan University, China

Webmasters
Kun Yue Fudan University, China
Yizhong Wu Fudan University, China
Zhimao Guo Fudan University, China
Keping Zhao Fudan University, China


Program Committee Jacky Akoka Hiroshi Arisawa Sonia Bergamaschi Mokrane Bouzeghoub Diego Calvanese Cindy Chen Shing-Chi Cheung Roger Chiang Stefan Conrad Bogdan Czejdo Lois Delcambre Debabrata Dey Johann Eder Ramez Elmasri David W. Embley Johann-Christoph Freytag Antonio L. Furtado Andreas Geppert Shigeichi Hirasawa Arthur ter Hofstede Matthias Jarke Christian S. Jensen Manfred Jeusfeld Yahiko Kambayashi Hannu Kangassalo Kamalakar Karlapalem Vijay Khatri Dongwon Lee Mong-Li Lee Wen Lei Mao Jianzhong Li Qing Li Stephen W. Liddle Ee-Peng Lim Mengchi Liu Victor Zhenyu Liu Ray Liuzzi Bertram Ludäscher Ashok Malhotra Murali Mani Fabio Massacci Sergey Melnik

CNAM & INT, France Yokohama National University, Japan Università di Modena e Reggio Emilia, Italy Université de Versailles, France Università di Roma La Sapienza, Italy University of Massachusetts at Lowell, USA Univ. of Science and Technology of Hong Kong, China University of Cincinnati, USA Heinrich-Heine-Universität Düsseldorf, Germany Loyola University, New Orleans, USA Oregon Health Science University, USA University of Washington, USA Universität Klagenfurt, Austria University of Texas at Arlington, USA Brigham Young University, USA Humboldt-Universität zu Berlin, Germany PUC Rio de Janeiro, Brazil Credit Suisse, Switzerland Waseda University, Japan Queensland University of Technology, Australia Technische Hochschule Aachen, Germany Aalborg Universitet, Denmark Universiteit van Tilburg, Netherlands Kyoto University, Japan University of Tampere, Finland Intl. Institute of Information Technology, India Indiana University at Bloomington, USA Pennsylvania State University, USA National University of Singapore, Singapore University of California at Los Angeles, USA Harbin Institute of Technology, China City University of Hong Kong, Hong Kong, China Brigham Young University, USA Nanyang Technological University, Singapore Carleton University, Canada University of California at Los Angeles, USA Air Force Research Laboratory, USA San Diego Supercomputer Center, USA Microsoft, USA Worcester Polytechnic Institute, USA Università di Trento, Italy Universität Leipzig, Germany


Xiaofeng Meng Renate Motschnig John Mylopoulos Sham Navathe Jyrki Nummenmaa Maria E. Orlowska Oscar Pastor Jian Pei Zhiyong Peng Barbara Pernici Dimitris Plexousakis Sandeep Purao Sudha Ram Colette Rolland Elke Rundensteiner Peter Scheuermann Keng Siau Janice C. Sipior Il-Yeol Song Nicolas Spyratos Veda C. Storey Ernest Teniente Juan C. Trujillo Michalis Vazirgiannis Dongqing Yang Jian Yang GeYu Lizhu Zhou Longxiang Zhou Shuigeng Zhou

Renmin University of China, China Universität Wien, Austria University of Toronto, Canada Georgia Institute of Technology, USA University of Tampere, Finland University of Queensland, Australia Universidad Politécnica de Valencia, Spain Simon Fraser University, Canada Wuhan University, China Politecnico di Milano, Italy FORTH-ICS, Greece Pennsylvania State University, USA University of Arizona, USA Univ. Paris 1 Panthéon-Sorbonne, France Worcester Polytechnic Institute, USA Northwestern University, USA University of Nebraska-Lincoln, USA Villanova University, USA Drexel University, USA Université de Paris-Sud, France Georgia State University, USA Universitat Politècnica de Catalunya, Spain Universidad de Alicante, Spain Athens Univ. of Economics and Business, Greece Peking University, China Tilburg University, Netherlands Northeastern University, China Tsinghua University, China Chinese Academy of Science, China Fudan University, China


External Referees A. Analyti Michael Adams Alessandro Artale Enrico Blanzieri Shawn Bowers Paolo Bresciani Linas Bukauskas Ugo Buy Luca Cabibbo Andrea Calì Cinzia Cappiello Alain Casali Yu Chen V. Christophidis Fang Chu Valter Crescenzi Michael Derntl Arnaud Giacometti Paolo Giorgini Cristina Gómez Daniela Grigori

Wynne Hsu Stamatis Karvounarakis Ioanna Koffina George Kokkinidis Hristo Koshutanski Kyriakos Kritikos Lotfi Lakhal Domenico Lembo Shaorong Liu Stéphane Lopes Bertram Ludaescher Chang Luo Gianni Mecca Massimo Mecella Carlo Meghini Paolo Merialdo Antonis Misargopoulos Paolo Missier Stefano Modafferi Wai Yin Mok Enrico Mussi

Noel Novelli Alexandros Ntoulas Phillipa Oaks Seog-Chan Oh Justin O’Sullivan Manos Papaggelis V. Phan-Luong Pierluigi Plebani Philippe Rigaux Nick Russell Ulrike Sattler Monica Scannapieco Ka Cheung Sia Riccardo Torlone Goce Trajcevski Nikos Tsatsakis Haixun Wang Moe Wynn Yi Xia Yirong Yang Fan Ye


Co-organized by Fudan University of China National University of Singapore

In Cooperation with Database Society of the China Computer Federation ACM SIGMOD ACM SIGMIS

Sponsored by National Natural Science Foundation of China (NSFC) ER Institute (ER Steering Committee) K.C. Wong Education Foundation, Hong Kong

Supported by IBM China Co., Ltd. Shanghai Baosight Software Co., Ltd. Digital Policy Management Association of Korea

Table of Contents

Keynote Addresses Entity Resolution: Overview and Challenges Hector Garcia-Molina

1

Towards a Statistically Semantic Web Gerhard Weikum, Jens Graupmann, Ralf Schenkel, and Martin Theobald

3

Invited Talk The Application and Prospect of Business Intelligence in Metallurgical Manufacturing Enterprises in China Xiao Ji, Hengjie Wang, Haidong Tang, Dabin Hu, and Jiansheng Feng

18

Conceptual Modeling I Conceptual Modelling – What and Why in Current Practice Islay Davies, Peter Green, Michael Rosemann, and Stan Gallo

30

Entity-Relationship Modeling Re-revisited Don Goelman and Il-Yeol Song

43

Modeling Functional Data Sources as Relations Simone Santini and Amarnath Gupta

55

Conceptual Modeling II Roles as Entity Types: A Conceptual Modelling Pattern Jordi Cabot and Ruth Raventós

69

Modeling Default Induction with Conceptual Structures Julien Velcin and Jean-Gabriel Ganascia

83

Reachability Problems in Entity-Relationship Schema Instances Sebastiano Vigna

96

Conceptual Modeling III A Reference Methodology for Conducting Ontological Analyses Michael Rosemann, Peter Green, and Marta Indulska Pruning Ontologies in the Development of Conceptual Schemas of Information Systems Jordi Conesa and Antoni Olivé

110

122


Definition of Events and Their Effects in Object-Oriented Conceptual Modeling Languages Antoni Olivé

136

Conceptual Modeling IV Enterprise Modeling with Conceptual XML David W. Embley, Stephen W. Liddle, and Reema Al-Kamha

150

Graphical Reasoning for Sets of Functional Dependencies János Demetrovics, András Molnár, and Bernhard Thalheim

166

ER-Based Software Sizing for Data-Intensive Systems Hee Beng Kuan Tan and Yuan Zhao

180

Data Warehouse Data Mapping Diagrams for Data Warehouse Design with UML Sergio Luján-Mora, Panos Vassiliadis, and Juan Trujillo

191

Informational Scenarios for Data Warehouse Requirements Elicitation Naveen Prakash, Yogesh Singh, and Anjana Gosain

205

Extending UML for Designing Secure Data Warehouses Eduardo Fernández-Medina, Juan Trujillo, Rodolfo Villarroel, and Mario Piattini

217

Schema Integration I Data Integration with Preferences Among Sources Gianluigi Greco and Domenico Lembo Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas Qi He and Tok Wang Ling Managing Merged Data by Vague Functional Dependencies An Lu and Wilfred Ng

231

245 259

Schema Integration II Merging of XML Documents Wanxia Wei, Mengchi Liu, and Shijun Li

273

Schema-Based Web Wrapping Sergio Flesca and Andrea Tagarelli

286

Web Taxonomy Integration Using Spectral Graph Transducer Dell Zhang, Xiaoling Wang, and Yisheng Dong

300


Data Classification and Mining I Contextual Probability-Based Classification Gongde Guo, Hui Wang, David Bell, and Zhining Liao

313

Improving the Performance of Decision Tree: A Hybrid Approach LiMin Wang, SenMiao Yuan, Ling Li, and HaiJun Li

327

Understanding Relationships: Classifying Verb Phrase Semantics Veda C. Storey and Sandeep Purao

336

Data Classification and Mining II Fast Mining Maximal Frequent ItemSets Based on FP-Tree Yuejin Yan, Zhoujun Li, and Huowang Chen

348

Multi-phase Process Mining: Building Instance Graphs B.F. van Dongen and W.M.P. van der Aalst

362

A New XML Clustering for Structural Retrieval Jeong Hee Hwang and Keun Ho Ryu

377

Web-Based Information Systems Link Patterns for Modeling Information Grids and P2P Networks Christopher Popfinger, Cristian Pérez de Laborda, and Stefan Conrad

388

Information Retrieval Aware Web Site Modelling and Generation Keyla Ahnizeret, David Fernandes, João M.B. Cavalcanti, Edleno Silva de Moura, and Altigran S. da Silva

402

Expressive Profile Specification and Its Semantics for a Web Monitoring System Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy

420

Query Processing I On Modelling Cooperative Retrieval Using an Ontology-Based Query Refinement Process Nenad Stojanovic and Ljiljana Stojanovic

434

Load-Balancing Remote Spatial Join Queries in a Spatial GRID Anirban Mondal and Masaru Kitsuregawa

450

Expressing and Optimizing Similarity-Based Queries in SQL Like Gao, Min Wang, X. Sean Wang, and Sriram Padmanabhan

464


Query Processing II XSLTGen: A System for Automatically Generating XML Transformations via Semantic Mappings Stella Waworuntu and James Bailey 479 Efficient Recursive XML Query Processing in Relational Database Systems Sandeep Prakash, Sourav S. Bhowmick, and Sanjay Madria

493

Situated Preferences and Preference Repositories for Personalized Database Applications Stefan Holland and Werner Kießling

511

Web Services I Analysis and Management of Web Service Protocols Boualem Benatallah, Fabio Casati, and Farouk Toumani

524

Semantic Interpretation and Matching of Web Services Chang Xu, Shing-Chi Cheung, and Xiangye Xiao

542

Intentional Modeling to Support Identity Management Lin Liu and Eric Yu

555

Web Services II WUML: A Web Usage Manipulation Language for Querying Web Log Data Qingzhao Tan, Yiping Ke, and Wilfred Ng

567

An Agent-Based Approach for Interleaved Composition and Execution of Web Services Xiaocong Fan, Karthikeyan Umapathy, John Yen, and Sandeep Purao

582

A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows San-Yih Hwang, Haojun Wang, Jaideep Srivastava, and Raymond A. Paul

596

Schema Evolution Lossless Conditional Schema Evolution Ole G. Jensen and Michael H. Böhlen

610

Ontology-Guided Change Detection to the Semantic Web Data Li Qin and Vijayalakshmi Atluri

624

Schema Evolution in Data Warehousing Environments – A Schema Transformation-Based Approach Hao Fan and Alexandra Poulovassilis

639


Conceptual Modeling Applications I Metaprogramming for Relational Databases Jernej Kovse, Christian Weber, and Theo Härder Incremental Navigation: Providing Simple and Generic Access to Heterogeneous Structures Shawn Bowers and Lois Delcambre Agent Patterns for Ambient Intelligence Paolo Bresciani, Loris Penserini, Paolo Busetta, and Tsvi Kuflik

654

668 682

Conceptual Modeling Applications II Modeling the Semantics of 3D Protein Structures Sudha Ram and Wei Wei

696

Risk-Driven Conceptual Modeling of Outsourcing Decisions Pascal van Eck, Roel Wieringa, and Jaap Gordijn

709

A Pattern and Dependency Based Approach to the Design of Process Models Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, and Petia Wohed

724

UML Use of Tabular Analysis Method to Construct UML Sequence Diagrams Margaret Hilsbos and Il-Yeol Song

740

An Approach to Formalizing the Semantics of UML Statecharts Xuede Zhan and Huaikou Miao

753

Applying the Application-Based Domain Modeling Approach to UML Structural Views Arnon Sturm and Iris Reinhartz-Berger

766

XML Modeling A Model Driven Approach for XML Database Development Belén Vela, César J. Acuña, and Esperanza Marcos

780

On the Updatability of XML Views Published over Relational Data Ling Wang and Elke A. Rundensteiner

795

XBiT: An XML-Based Bitemporal Data Model Fusheng Wang and Carlo Zaniolo

810


Industrial Presentations I: Applications Enterprise Cockpit for Business Operation Management Fabio Casati, Malu Castellanos, and Ming-Chien Shan

825

Modeling Autonomous Catalog for Electronic Commerce Yuan-Chi Chang, Vamsavardhana R. Chillakuru, and Min Wang

828

GiSA: A Grid System for Genome Sequences Assembly Jun Tang, Dong Huang, Chen Wang, Wei Wang, and Baile Shi

831

Industrial Presentations II: Ontology in Applications Analytical View of Business Data: An Example Adam Yeh, Jonathan Tang, Youxuan Jin, and Sam Skrivan

834

Ontological Approaches to Enterprise Applications Dongkyu Kim, Yuan-Chi Chang, Juhnyoung Lee, and Sang-goo Lee

838

FASTAXON: A System for FAST (and Faceted) TAXONomy Design Yannis Tzitzikas, Raimo Launonen, Mika Hakkarainen, Pekka Korhonen, Tero Leppänen, Esko Simpanen, Hannu Törnroos, Pekka Uusitalo, and Pentti Vänskä

841

CLOVE: A Framework to Design Ontology Views Rosario Uceda-Sosa, Cindy X. Chen, and Kajal T. Claypool

844

Demos and Posters iRM: An OMG MOF Based Repository System with Querying Capabilities Ilia Petrov, Stefan Jablonski, Marc Holze, Gabor Nemes, and Marcus Schneider

850

Visual Querying for the Semantic Web Sacha Berger, François Bry, and Christoph Wieser

852

Query Refinement by Relevance Feedback in an XML Retrieval System Hanglin Pan, Anja Theobald, and Ralf Schenkel

854

Semantics Modeling for Spatiotemporal Databases Peiquan Jin, Lihua Yue, and Yuchang Gong

856

Temporal Information Management Using XML Fusheng Wang, Xin Zhou, and Carlo Zaniolo

858

SVMgr: A Tool for the Management of Schema Versioning Fabio Grandi

860


GENNERE: A Generic Epidemiological Network for Nephrology and Rheumatology Ana Simonet, Michel Simonet, Cyr-Gabin Bassolet, Sylvain Ferriol, Cédric Gueydan, Rémi Patriarche, Haijin Yu, Ping Hao, Yi Liu, Wen Zhang, Nan Chen, Michel Forêt, Philippe Gaudin, Georges De Moor, Geert Thienpont, Mohamed Ben Saïd, Paul Landais, and Didier Guillon


862

Panel Beyond Webservices – Conceptual Modelling for Service Oriented Architectures Peter Fankhauser

865

Author Index

867


Entity Resolution: Overview and Challenges Hector Garcia-Molina Stanford University, Stanford, CA, USA

Entity resolution is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An entity resolution algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can. Entity resolution algorithms typically rely on user-defined functions that (a) compare fields or records to determine if they match (are likely to represent the same real world entity), and (b) merge matching records into one, and in the process perhaps combine fields (e.g., creating a new name based on two slightly different versions of the name). In this talk I will give an overview of the Stanford SERF Project, that is building a framework to describe and evaluate entity resolution schemes. In particular, I will give an overview of some of the different entity resolution settings: De-duplication versus fidelity enhancement. In the de-duplication problem, we have a single set of records, and we try to merge the ones representing the same real world entity. In the fidelity enhancement problem, we have two sets of records: a base set of records of interest, and a new set of acquired information. The goal is to coalesce the new information into the base records. Clustering versus snapping. With snapping, we examine records pair-wise and decide if they represent the same entity. If they do, we merge the records into one, and continue the process of pair-wise comparisons. With clustering, we analyze all records and partition them into groups we believe represent the same real world entity. At the end, each partition is merged into one record. Confidences. In some entity resolution scenarios we must manage confidences. For example, input records may have a confidence value representing how likely it is they are true. Snap rules (that tells us when two records match) may also have confidences representing how likely it is that two records actually represent the same real world entity. As we merge records, we must track their confidences. Schema Mismatches. In some entity resolution scenarios we must deal, not just with resolving information on entities, but also with resolving discrepancies among the schemas of the different sources. For example, the attribute names and formats from one source may not match those of other sources. In the talk I will address some of the open problems and challenges that arise in entity resolution. These include: P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 1–2, 2004. © Springer-Verlag Berlin Heidelberg 2004


Performance. Entity resolution algorithms must perform very large number of field and record comparisons (via the user provided functions), so it is critical to perform only the absolutely minimum number of invocations to the comparison functions. Developing efficient algorithms is analogous to developing efficient join algorithms for relational databases. Confidences. Very little is understood as to how confidences should be manipulated in an entity resolution setting. For example, say we have two records, one reporting that “Joe” uses cell phone 123, and the other reporting that “Joseph” uses phone 456. The first record has confidence 0.9 and the second one 0.7. A snap rule tells us that “Joe” and “Joseph” are the same person with confidence 0.8. Do we assume this person has been using two phones? Or that 123 is the correct number because that record has a higher confidence? If we do merge the records, what are the resulting confidences? Metrics. Say we have two entity resolution schemes, A and B. How do we know if A yields “better” results and compared to B? Or say we have one base set of records, and we wish to enhance its fidelity with either new set X or new set Y. Since it costs money to acquire either new set, we only wish to use one. Based on samples of X and Y, how do we decide which set is more likely to enhance our base set? To address questions such as these we need to develop metrics that quantify not just to performance of entity resolution, but also its accuracy. Privacy. There is a strong connection between entity resolution and information privacy. To illustrate, say Alice has given out two records containing some of her private information: Record 1 gives Alice’s name, phone number and credit card number; record 2 gives Alice’s name, phone and national identity number. How much information has actually “leaked” depends on how well and adversary, Bob, can piece together these two records. If Bob can determine that the records refer to the same person, then he knows Alice’s credit card number and her national identity number, opening the door for say identity theft. If the records do not snap together, then Bob knows less and we have a smaller information leak. We need to develop good ways to model information leakage in an entity resolution context. Such a model can lead us, for example, to techniques for quantifying the leakage caused by releasing one new fact, or the decrease in leakage caused by releasing disinformation. Additional information on our SERF project can be found at http://www-db.stanford.edu/serf This work is joint with Qi Su, Tyson Condie, Nicolas Pombourcq, and Jennifer Widom.
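
As a concrete, minimal illustration of the match-and-merge ("snapping") processing described in this overview, the following Python sketch repeatedly merges record pairs using user-supplied match and merge functions. The field names, the crude matching heuristic, and the merge policy are invented for the example; this is not the SERF framework's actual interface.

def match(r1, r2):
    # Two records "match" if their names share a token and their phone
    # numbers do not contradict each other (a deliberately crude stand-in
    # for a real user-defined comparison function).
    names_overlap = set(r1["name"].lower().split()) & set(r2["name"].lower().split())
    phones_agree = (r1.get("phone") is None or r2.get("phone") is None
                    or r1["phone"] == r2["phone"])
    return bool(names_overlap) and phones_agree

def merge(r1, r2):
    # Keep the longer name and fill missing fields from either record.
    merged = {"name": max(r1["name"], r2["name"], key=len)}
    for field in ("phone", "address"):
        merged[field] = r1.get(field) or r2.get(field)
    return merged

def resolve(records):
    # Pairwise "snapping": merge any matching pair and restart until stable.
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if match(records[i], records[j]):
                    merged = merge(records[i], records[j])
                    records = [r for k, r in enumerate(records) if k not in (i, j)]
                    records.append(merged)
                    changed = True
                    break
            if changed:
                break
    return records

print(resolve([
    {"name": "Joe Smith", "phone": "123", "address": None},
    {"name": "Joseph Smith", "phone": None, "address": "12 Main St"},
]))
# -> [{'name': 'Joseph Smith', 'phone': '123', 'address': '12 Main St'}]

A clustering-style algorithm would instead partition all records first and merge each partition at the end, and handling confidences (as in the 0.9/0.7/0.8 example above) would additionally require a policy for combining them during merge, which is exactly one of the open questions raised here.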

Towards a Statistically Semantic Web Gerhard Weikum, Jens Graupmann, Ralf Schenkel, and Martin Theobald Max-Planck Institute of Computer Science Saarbruecken, Germany

Abstract. The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying needs to be enhanced by statistical means, leading to relevanceranked lists as query results. This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are crucial also for peer-to-peer collaborative Web search.

1 The Challenge of “Semantic” Information Search The age of information explosion poses tremendous challenges regarding the intelligent organization of data and the effective search of relevant information in business and industry (e.g., market analyses, logistic chains), society (e.g., health care), and virtually all sciences that are more and more data-driven (e.g., gene expression data analyses and other areas of bioinformatics). The problems arise in intranets of large organizations, in federations of digital libraries and other information sources, and in the most humongous and amorphous of all data collections, the World Wide Web and its underlying numerous databases that reside behind portal pages. The Web bears the potential of being the world’s largest encyclopedia and knowledge base, but we are very far from being able to exploit this potential. Database-system and search-engine technologies provide support for organizing and querying information; but all too often they require excessive manual preprocessing, such as designing a schema and cleaning raw data or manually classifying documents into a taxonomy for a good Web portal, or manual postprocessing such as browsing through large result lists with too many irrelevant items or surfing in the vicinity of promising but not truly satisfactory approximate matches. The following are a few example queries where current Web and intranet search engines fall short or where data P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 3–17, 2004. © Springer-Verlag Berlin Heidelberg 2004


integration techniques and the use of SQL-like querying face unsurmountable difficulties even on structured, but federated and highly heterogeneous databases: Q1: Which professors from Saarbruecken in Germany teach information retrieval and do research on XML? Q2: Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g? And are there any metabolic models for acid reflux that could be related to the gene expression data? Q3: What are the most important research results on large deviation theory? Q4: Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king? Q5: Who was the French woman that I met in a program committee meeting where Paolo Atzeni was the PC chair? Q6: Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture? Why are these queries difficult (too difficult for Google-style keyword search unless one invests a huge amount of time to manually explore large result lists with mostly irrelevant and some mediocre matches)? For Q1 no single Web site is a good match; rather one has to look at several pages together within some bounded context: the homepage of a professor with his address, a page with course information linked to by the homepage, and a research project page on semistructured data management that is a few hyperlinks away from the homepage. Q2 would be easy if asked for a single bioinformatics database with a familiar query interface, but searching the answer across the entire Web and Deep Web requires discovering all relevant data sources and unifying their query and result representations on the fly. Q3 is not a query in the traditional sense, but requires gathering a substantial number of key resources with valuable information on the given topic; it would be best served by looking up a well maintained Yahoo-style topic directory, but highly specific expert topics are not covered there. Q4 cannot be easily answered because a good match does not necessarily contain the keywords “woman”, “prophecy”, “nobleman”, etc., but may rather say something like “Third witch: All hail, Macbeth, thou shalt be king hereafter!” and the same document may contain the text “All hail, Macbeth! hail to thee, thane of Glamis!”. So this query requires some background knowledge to recognize that a witch is a woman, “shalt be” refers to a prophecy, and thane is a title for a Scottish nobleman. Q5 is similar to Q4 in the sense that it also requires background knowledge, but it is more difficult because it additionally requires putting together various information fragments: conferences on which I served on the PC found in my email archive, PC members of conferences found on Web pages, and detailed information found on researchers’ homepages. And after having identified a candidate like Sophie Cluet from Paris, one needs to infer that Sophie is a typical female first name and that Paris most likely denotes the capital of France rather than the 500-inhabitants town of Paris, Texas, that became known through a movie. Q6 finally is what some researchers call “AI-complete”, it will remain a challenge for a long time. For a human expert who is familiar with the corresponding topics, none of these queries is really difficult. With unlimited time, the expert could easily identify relevant pages and combine semantically related information units into query answers. 
The challenge is to automate or simulate these intellectual capabilities and implement them so that they can handle billions of Web pages and petabytes of data in structured (but schematically highly diverse) Deep-Web databases.


2 The Need for Statistics What if all Web pages and all Web-accessible data sources were in XML, RDF, or OWL (a description-logic representation) as envisioned in the Semantic Web research direction [25,1]? Would this enable a search engine to effectively answer the challenging queries of the previous section? And would such an approach scale to billions of Web pages and be efficient enough for interactive use? Or could we even load and integrate all Web data into one gigantic database and use XQuery for searching it? XML, RDF, and OWL offer ways of more explicitly structuring and richly annotating Web pages. When viewed as logic formulas or labeled graphs, we may think of the pages as having “semantics”, at least in terms of model theory or graph isomorphisms1. In principle, this opens up a wealth of precise querying and logical inferencing opportunities. However, it is extremely unlikely that all pages will use the very same tag or predicate names when they refer to the same semantic properties and relationships. Making such an assumption would be equivalent to assuming a single global schema: this would be arbitrarly difficult to achieve in a large intranet, and it is completely hopeless for billions of Web pages given the Web’s high dynamics, extreme diversity of terminology, and uncertainty of natural language (even if used only for naming tags and predicates). There may be standards (e.g., XML schemas) for certain areas (e.g., for invoices or invoice-processsing Web Services), but these will have limited scope and influence. A terminologically unified and logically consistent Semantic Web with billions of pages is hard to imagine. So reasoning about diversely annotated pages is a necessity and a challenge. Similarly to the ample research on database schema integration and instance matching (see, e.g., [49] and the references given there), knowledge bases [50], lexicons, thesauri [24], or ontologies [58] are considered as the key asset to this end. Here an ontology is understood as a collection of concepts with various semantic relationships among them; the formal representation may vary from rigorous logics to natural language. The most important relationship types are hyponymy (specialization into narrower concepts) and hypernymy (generalization into broader concepts). To the best of my knowledge, the most comprehensive, publicly available kind of ontology is the WordNet thesaurus hand-crafted by cognitive scientists at Princeton [24]. For the concept “woman” WordNet lists about 50 immediate hyponyms, which include concepts like “witch” and “lady” which could help to answer queries like Q4 from the previous section. However, regardless of whether one represents these hyponymy relationships in a graph-oriented form or as logical formulas, such a rigid “trueor-false” representation could never discriminate these relevant concepts from the other 48 irrelevant and largely exotic hyponyms of “woman”. In information-retrieval (IR) jargon, such an approach would be called Boolean retrieval or Boolean reasoning; and IR almost always favors ranked retrieval with some quantitative relevance assessment. In fact, by simply looking at statistical correlations of using words like “woman” and “lady” together in some text neighborhood within large corpora (e.g., the Web or large digital libraries) one can infer that these two concepts are strongly related, as opposed to concepts like “woman” and “siren”. Similarly, mere statistics strongly suggests that 1

Some people may argue that all computer models are mere syntax anyway, but this is in the eye of the beholder.

Gerhard Weikum et al.

6

a city name “Paris” denotes the French capital and not Paris, Texas. Once making a distinction of strong vs. weak relationships and realizing that this is a full spectrum, it becomes evident that the significance of semantic relationships needs to be quantified in some manner, and the by far best known way of doing this (in terms of rigorous foundation and rich body of results) is by using probability theory and statistics. This concludes my argument for the necessity of a “statistically semantic” Web. The following sections substantiate and illustrate this point by sketching various technical issues where statistical reasoning is key. Most of the discussion addresses how to handle non-schematic XML data; this is certainly still a good distance from the Semantic Web vision, but it is a decent and practically most relevant first step.

3 Towards More “Semantics” in Searching XML and Web Data Non-schematic XML data that comes from many different sources and inevitably exhibits heterogeneous structures and annotations (i.e., XML tags) cannot be adequately searched using database query languages like XPath or XQuery. Often, queries either return too many or too few results. Rather the ranked-retrieval paradigm is called for, with relaxable search conditions, various forms of similarity predicates on tags and contents, and quantitative relevance scoring. Note that the need for ranking goes beyond adding Boolean text-search predicates to XQuery. In fact, similarity scoring and ranking are orthogonal to data types and would be desirable and beneficial also on structured attributes such as time (e.g., approximately in the year 1790), geographic coordinates (e.g., near Paris), and other numerical and categorical data types (e.g., numerical sensor readings and music style categories). Research on applying IR techniques to XML data has started five years ago with the work [26,55,56,60] and has meanwhile gained considerable attention. This research avenue includes approaches based on combining ranked text search with XPath-style conditions [4,13,35,11,31,38], structural similarities such as tree-editing distances [5,54,69,14], ontology-enhanced content similarities [60,61,52], and applying probabilistic IR and statistical language models to XML [28,2]. Our own approach, the XXL2 query language and search engine [60,61,52], combines a subset of XPath with a similarity operator ~ that can be applied to element or attribute names, on one hand, and element or attribute contents, on the other hand. For example, the queries Q1 and Q4 of Section 1 could be expressed in XXL as follows (and executed on a heterogeneous collection of XML documents):

Here XML data is interpreted as a directed graph, including href or XLink/XPointer links within and across documents that go beyond a merely tree-oriented approach. End nodes of connections that match a path condition such as drama//scene are bound to node variables that can be referred to in other search conditions. Content conditions 2

Flexible XML Search Language.


such as = "~woman" are interpreted as keyword queries on XML elements, using IR-style measures (based on statistics like term frequencies and inverse element frequencies) for scoring the relevance of an element. In addition and most importantly, we allow expanding the query by adding “semantically” related terms taken from an ontology. In the example, “woman” could be expanded into “woman wife lady girl witch ...”. The score of a relaxed match, say for an element containing “witch”, is the product of the traditional score for the query “witch” and the ontological similarity of the query term and the related term, .sim(woman, witch) in the particular example. Element (or attribute) name conditions such as ~course are analogously relaxed, so that, for example, tag names “teaching”, “class”, or “seminar” would be considered as approximate matches. Here the score is simply the ontological similarity, for tag names are only single words or short composite words. The result of an entire query is a ranked list of subgraphs of the XML data graph, where each result approximately matches all query conditions with the same binding of all variables (but different results have different bindings). The total score of a result is computed from the scores of the elementary conditions using a simple probabilistic model with independence assumptions, and the result ranking is in descending order of total scores. Query languages of this kind work nicely on heterogeneous and non-schematic XML data collections, but the Web and also large fractions of intranets are still mostly in HTML, PDF, and other less structured formats. Recently we have started to apply XXLstyle queries also to such data by automatically converting Web data into XML format. The COMPASS3 search engine that we have been building supports XML ranked retrieval on the full suite of Web and intranet data including combined data collections that include both XML documents and Web pages [32]. For example, query Q1 can be executed on an index that is built over all of DBLP (cast into XML) and the crawled homepages of all authors and other Web pages reachable through hyperlinks. Figure 1 depicts the visual formulation of query Ql. Like in the original XXL engine, conditions with the similarity operator ~ are relaxed using statistically quantified relationships from the ontology.
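
To make the scoring model concrete, here is a small Python sketch of how one relaxed content condition could be scored and combined into a total score. The ontological similarities, the per-element content score, and all names are invented toy values standing in for the engine's real statistics; it illustrates the described model, not the XXL implementation.

ONTO_SIM = {("woman", "woman"): 1.0, ("woman", "wife"): 0.68, ("woman", "witch"): 0.48}

def content_score(element_terms, term):
    # Stand-in for an IR-style (term frequency / inverse element frequency) score.
    return 0.8 if term in element_terms else 0.0

def relaxed_score(element_terms, query_term, expansions):
    # Best relaxation of a single ~condition: ontological similarity times content score.
    return max(ONTO_SIM.get((query_term, t), 0.0) * content_score(element_terms, t)
               for t in expansions)

def total_score(condition_scores):
    # Total relevance of one result graph: product over all elementary conditions.
    total = 1.0
    for s in condition_scores:
        total *= s
    return total

speaker = {"third", "witch", "all", "hail", "macbeth"}
s = relaxed_score(speaker, "woman", ["woman", "wife", "witch"])
print(s, total_score([s, 0.9]))   # ~0.384 for the relaxed condition, ~0.35 overall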

Fig. 1. Visual COMPASS Query

3 Concept-oriented Multi-format Portal-aware Search System.


The conversion of HTML and other formats into XML is based on relatively simple heuristic rules, for example, casting HTML headings into XML element names. For additional automatic annotation we use the information extraction component ANNIE that is part of the GATE System developed at the University of Sheffield [20]. GATE offers various modules for analyzing, extracting, and annotating text; its capabilities range from part-of-speech tagging (e.g., for noun phrases, temporal adverbial phrases, etc.) and lexicon lookups (e.g., for geographic names) to finite state transducers for annotations based on regular expressions (e.g., for dates or currency amounts). One particularly useful and fairly light-weight component is the Gazetteer Module for named entity recognition based on part-of-speech tagging and a large dictionary containing names of cities, countries, person names (e.g., common first names), etc. This way one can automatically generate tags like and . For example, we were able to annotate the popular Wikipedia open encyclopdia corpus this way, generating about 2 million person and location tags. And this is the key for more advanced “semantics-aware” search on the current Web. For example, searching for Web pages about the physicist Max Planck would be phrased as person = "Max Planck", and this would eliminate many spurious matches that a Google-style keyword query “Max Planck” would yield about Max Planck Institutes and the Max Planck Society4. There is a rich body of research on information extraction from Web pages and wrapper generation. This ranges from purely logic-based or pattern-matching-driven approaches (e.g., [51,17,6,30]) to techniques that employ statistical learning (e.g., Hidden Markov Models) (e.g., [15,16,39,57,40]) to infer structure and annotations when there is too much diversity and uncertainty in the underlying data. As long as all pages to be wrapped come from the same data source (with some hidden schema), the logicbased approaches work very well. However, when one tries to wrap all homepages of DBLP authors or the course programs of all computer science departments in the world, uncertainty is inevitable and statistics-driven techniques are the only viable ones (unless one is willing to invest a lot of manual work for traditional schema integration, writing customized wrappers and mappers). Despite advertising our own work and mentioning our competitors, the current research projects on combining IR techniques and statistical learning with XML querying is still in an early stage and there are certainly many open issues and opportunities for further research. These include better theoretical foundations for scoring models on semistructured data, relevance feedback and interactive information search, and, of course, all kinds of efficiency and scalability aspects. Applying XML search techniques to Web data is in its infancy; studying what can be done with named-entity recognition and other automatic annotation techniques and understanding the interplay of queries with such statistics-based techniques for better information organization are widely open fields.
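
As a toy illustration of this kind of rule-based conversion plus dictionary-driven annotation (far simpler than the GATE/ANNIE components actually used; the heading rule and the two-entry gazetteer are invented):

import re

PERSON_GAZETTEER = {"max planck", "sophie cluet"}

def headings_to_xml(html):
    # Crude heuristic: an HTML heading such as <h1>Research</h1> opens an
    # XML element named after the heading text.
    return re.sub(r"<h\d>\s*([^<]+?)\s*</h\d>",
                  lambda m: "<" + m.group(1).strip().lower().replace(" ", "_") + ">",
                  html)

def annotate_persons(text):
    # Gazetteer lookup for named-entity annotation.
    for name in PERSON_GAZETTEER:
        text = re.sub(re.escape(name),
                      lambda m: "<person>" + m.group(0) + "</person>",
                      text, flags=re.IGNORECASE)
    return text

page = "<h1>Research</h1> Max Planck founded quantum theory."
print(annotate_persons(headings_to_xml(page)))
# -> <research> <person>Max Planck</person> founded quantum theory.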

4 Statistically Quantified Ontologies The important role of ontologies in making information search more “semantics-aware” has already been emphasized. In contrast to most ongoing efforts for Semantic-Web on4

Germany’s premier scientific society, which encompasses 80 institutes in all fields of science.


tologies, our work has focused on quantifying the strengths of semantic relationships based on corpus statistics [52,59] (see also the related work [10,44,22,36] and further references given there). In contrast to early IR work on using thesauri for query expansion (e.g., [64]), the ontology itself plays a much more prominent role in our approach with carefully quantified statistical similarities among concepts. Consider a graph of concepts, each characterized by a set of synonyms and, optionally, a short textual description, connected by “typed” edges that represent different kinds of relationships: hypernyms and hyponyms (generalization and specialization, aka. is-a relations), holonyms and meronyms (part-of relations), is-instance-of relations (e.g., Cinderella being an instance of a fairytale or IBM Thinkpad being a notebook), to name the most important ones. The first step in building an ontology is to create the nodes and edges. To this end, existing thesauri, lexicons, and other sources like geographic gazetteers (for names of countries, cities, rivers, etc. and their relationships) can be used. In our work we made use of the WordNet thesaurus [24] and the Alexandria Digital Library Gazetteer [3], and also started extracting concepts from page titles and href anchor texts in the Wikipedia encyclopedia. One of the shortcomings of WordNet is its lack of instances knowledge, for example, brand names and models of cars, cameras, computers, etc. To further enhance the ontology, we crawled Web pages with HTML tables and forms, trying to extract relationships between table-header column and form-field names and the values in table cells and the pulldown menus of form fields. Such approaches are described in the literature (see, e.g., [21,63,68]). Our experimental findings confirmed the potential value of these techniques, but also taught us that careful statistical thresholding is needed to eliminate noise and incorrect inferencing, once again a strong argument for the use of statistics. Once the concepts and relationships of a graph-based ontology are constructed, the next step is to quantify the strengths of semantic relationships based on corpus statistics. To this end we have performed focused Web crawls and use their results to estimate statistical correlations between the characteristic words of related concepts. One of the measures for the similarity of concepts and that we used is the Dice coefficient
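
The displayed formula itself did not survive the conversion to text; in its standard form, with df(·) denoting document frequencies in the crawled corpus, the Dice coefficient of two concepts is presumably

\[ \mathrm{dice}(c_1, c_2) \;=\; \frac{2 \cdot df(c_1 \wedge c_2)}{df(c_1) + df(c_2)} \]

where df(c_1 ∧ c_2) counts the documents that contain both concepts and df(c_i) those that contain concept c_i.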

In this computation we represent concept by the terms taken from its set of synonyms and its short textual description (i.e., the WordNet gloss). Optionally, we can add terms from neighbors or siblings in the ontological graph. A document in the corpus is considered to contain concept if it contains at least one word of the term set for and considered to contain both and if it contains at least one word from each of the two term sets. This is a heuristics; other approaches are conceivable which we are investigating. Following this methodology, we constructed an ontolgy service [59] that is accessible via Java RMI or as a SOAP-based Web Service described in WSDL. The service is used in the COMPASS search engine [32], but also in other projects. Figure 2 shows a screenshot from our ontology visualization tool. One of the difficulties in quantifying ontological relationships is that we aim to measure correlations between concepts but merely have statistical information about
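
A compact Python sketch of this estimation procedure, with a toy corpus and invented term sets standing in for the focused Web crawl and the WordNet-derived synonym sets:

CORPUS = [
    "the lady spoke to the woman at the market",
    "a woman and her wife travelled to paris",
    "the siren of the ship sounded twice",
    "every woman in the play is a lady of the court",
]

TERM_SETS = {
    "woman": {"woman", "lady"},
    "wife": {"wife", "spouse"},
    "siren": {"siren"},
}

def contains(doc, concept):
    # A document "contains" a concept if it has at least one word of its term set.
    return bool(set(doc.split()) & TERM_SETS[concept])

def dice(c1, c2):
    both = sum(1 for d in CORPUS if contains(d, c1) and contains(d, c2))
    df1 = sum(1 for d in CORPUS if contains(d, c1))
    df2 = sum(1 for d in CORPUS if contains(d, c2))
    return 2.0 * both / (df1 + df2) if df1 + df2 else 0.0

print(dice("woman", "wife"), dice("woman", "siren"))   # 0.5 and 0.0 on this toy corpus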


Fig. 2. Ontology Visualization

correlations between words. Ideally, we should first map the words in the corpus onto the corresponding concepts, i.e., their correct meanings. This is known as the word sense disambiguation problem in natural language processing [45], obviously a very difficult task because of polysemy. If this were solved it would not only help in deriving more accurate statistical measures for “semantic” similarities among concepts but could also potentially boost the quality of search results and automatic classification of documents into topic directories. Our work [59] presents a simple but scalable approach to automatically mapping text terms onto ontological concepts, in the context of XML document classification. Again, statistical reasoning, in combination with some degree of natural language parsing, is key to tackling this difficult problem. Ontology construction is a highly relevant research issue. Compared to the ample work on knowledge representations for ontological information, the aspects of how to “populate” an ontology and how to enhance it with quantitative similarity measures have been underrated and deserve more intensive research.

5 Efficient Top-k Query Processing with Probabilistic Pruning For ranked retrieval of semistructured, “semantically” annotated data, we face the problem of reconciling efficiency with result quality. Usually, we are not interested in a complete result but only in the top-k results with the highest relevance scores. The state-of-the-art algorithm for top-k queries on multiple index lists, each sorted in descending order of relevance scores, is the Threshold Algorithm, TA for short [23,33, 47]. It is applicable to both relational data such as product catalogs and text documents such as Web data. In the latter case, the fact that TA performs random accesses on very long, disk-resident index lists (e.g., all URLs or document ids for a frequently occurring word), with only short prefixes of the lists in memory, makes TA much less attractive, however.


In such a situtation, the TA variant with sorted access only, coined NRA (no random accesses), stream-combine, or TA-sorted in the literature, is the method of choice [23, 34]. TA-sorted works by maintaining lower bounds and upper bounds for the scores of the top-k candidates that are kept in a priority queue in memory while scanning the index lists. The algorithm can safely stop when the lower bound for the score of the rank-k result is at least as high as the highest upper bound for the scores of the candidates that are not among the current top-k. Unfortunately, albeit theoretically instance-optimal for computing a precise top-k result [23], TA-sorted tends to degrade in performance when operating on a large number of index lists. This is exactly the case when we relax query conditions such as ~speaker = ~woman using semantically related concepts from the ontology5. Even if the relaxation uses a threshold for the similarity of related concepts, we may often arrive at query conditions with 20 to 50 search terms. Statistics about the score distributions in the various index lists and some probabilistic reasoning help to overcome this efficiency problem and re-gain performance. In TAsorted a top-k candidate that has already been seen in the index lists in achieving score in list and has unknown scores in the index lists satisfies:
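
The inline symbols of this passage were lost in conversion; in the usual NRA/TA-sorted notation, with E(d) the set of index lists in which candidate d has already been encountered, s_i(d) its score in list i, and high_i the score last seen under sorted access in list i, the invariant presumably reads

\[ \mathrm{worstscore}(d) \;=\; \sum_{i \in E(d)} s_i(d) \;\le\; \mathrm{score}(d) \;\le\; \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i \;=\; \mathrm{bestscore}(d) \]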

where denotes the total, but not yet known, score that achieves by summing up the scores from all index lists in which occurs, and are the lower and upper bounds of score, and is the score that was last seen in the scan of index list upper-bounding the score that any candidate may obtain in list A candidate remains a candidate as long as where is the candidate that currently has rank with regard to the candidates’ lower bounds (i.e., the worst one among the current top-k). Assuming that can achieve a score in all lists in which it has not yet been encountered is conservative and, almost always, overly conservative. Rather we could treat these unknown scores as random variables and estimate the probability that total score can exceed Then is discarded from the candidate list if

with some pruning threshold This probabilistic interpretation makes some small, but precisely quantifiable, potential error in that it could dismiss some candidates too early. Thus, the top-k result computed this way is only approximate. However, the loss in precision and recall, relative to the exact top-k result using the same index lists, is stochastically bounded and can be set according to the application’s needs. A value of seems to be acceptable in most situations. Technically, the approach requires computing the convolution 5

Note that the TA and TA-sorted algorithms can be easily modified to handle both elementname and element-contents conditions (as opposed to mere keyword sets in standard IR and Web search engines).

12

Gerhard Weikum et al.

of the random variables based on assumed distributions (with parameter fitting) or precomputed histograms for the individual index lists and taking into account the current values, and predicting the of the sum’s distribution. Details of the underlying mathematics and the implementation techniques for this Prob-sorted method can be found in [62]. Experiments with the TREC-12 .Gov corpus and the IMDB data collection have shown that such a probabilistic top-k method gains about a factor of ten (and sometimes more) in run-time compared to TA-sorted. The outlined algorithm for approximate top-k queries with probabilistic guarantees is a versatile building block for XML ranked retrieval. In combination with ontologybased query relaxation, for example, expanding ~woman into (woman or wife or witch), it can add index lists dynamically and incrementally, rather than having to expand the query upfront based on thresholds. To this end, the algorithm considers the ontological similarity between concept from the original query and concept in the relaxed query, and multiplies it with the value of index list to obtain an upper bound for the score (and characterize the score distribution) that a candidate can obtain from the relaxation This information is dynamically combined with the probabilistic prediction of the other unknown scores and their sum. The algorithm can also be combined with distance-aware path indexes for XML data (e.g., the HOPI index structure [53]). This is required when queries contain elementname and element-contents conditions as well as path conditions of the form professor//course where matches for “course” that are close to matches for “professor” should be ranked higher than matches that are far apart. Thus, the Probsorted algorithm covers a large fraction of an XML ranked retrieval engine.
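The probabilistic pruning test described above can be sketched as follows; for brevity, each unknown per-list score is modeled here as a uniform random variable on [0, high_i] and their sum is approximated by a normal distribution, which is a deliberate simplification of the convolution- and histogram-based machinery of the actual Prob-sorted method [62].

```python
import math

# Hedged sketch of the probabilistic candidate-pruning test. Distributional
# assumptions (Uniform(0, high_i) per unseen list, normal approximation of the
# sum) are illustrative simplifications, not the method of [62].

def can_prune(worstscore_d, unseen_highs, worstscore_dk, epsilon):
    """worstscore_d:  sum of candidate d's already-seen scores.
    unseen_highs:  high_i bounds of the lists where d has not been seen yet.
    worstscore_dk: lower-bound score of the current rank-k candidate.
    epsilon:       pruning threshold on the probability of d still making the top-k."""
    mean = worstscore_d + sum(h / 2.0 for h in unseen_highs)
    var = sum(h * h / 12.0 for h in unseen_highs)
    if var == 0.0:
        return mean <= worstscore_dk
    z = (worstscore_dk - mean) / math.sqrt(var)
    p_exceed = 0.5 * math.erfc(z / math.sqrt(2.0))   # P[score(d) > worstscore(d_k)]
    return p_exceed <= epsilon
```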

6 Exploiting Collective Human Input

The statistical information considered so far refers to data (e.g., scores in index lists) or metadata (e.g., ontological similarities). Yet another kind of statistics is information about user behavior. This could include relatively static properties like bookmarks or embedded hyperlinks pointing to high-quality Web pages, but also dynamic properties inferred from query logs and click streams. For example, Google's PageRank views a Web page as more important if it has many incoming links and the sources of these links are themselves high authorities [9,12]. Technically, this amounts to computing stationary probabilities for a Markov-chain model that mimics a "random surfer". What PageRank essentially does is to exploit the intellectual endorsements that many human users (or Web administrators on behalf of organizations) provide by means of hyperlinks. This rationale can be carried over to analyzing and exploiting entire surf trails and query logs of individual users or an entire user community. These trails, which can be gathered from browser histories, local proxies, or Web servers, capture implicit user judgements. For example, suppose a user clicks on a specific subset of the top 10 results returned by a search engine for a query with several keywords, based on having seen the summaries of these pages. This implicit form of relevance feedback establishes a strong correlation between the query and the clicked-on pages. Further suppose that the user refines a query by adding or replacing keywords, e.g., to eliminate ambiguities in the previous query. Again, this establishes correlations between the new keywords and


the subsequently clicked-on pages, but also, albeit possibly to a lesser extent, between the original query and the eventually relevant pages. We believe that observing and exploiting such user behavior is a key element in adding more "semantic" or "cognitive" quality to a search engine. The literature contains some very interesting work in this direction (e.g., [19,65,67]), but it is rather preliminary at this point. Perhaps the difficulty of obtaining comprehensive query logs and surf trails outside of big service providers is a limiting factor in this line of experimental research. Our own, very recent, work generalizes the notion of a "random surfer" into a "random expert user" by enhancing the underlying Markov chain to also incorporate query nodes and transitions from queries to query refinements as well as to clicked-on documents. Transition probabilities are derived from the statistical analysis of query logs and click streams. The resulting Markov chain converges to stationary authority scores that reflect not only the link structure but also the implicit feedback and collective human input of a search engine's users [43].

The de-facto monopoly that large Internet service providers have on being able to observe user behavior and statistically leverage this valuable information may be overcome by building next-generation Web search engines in a truly decentralized and ideally self-organized manner. Consider a peer-to-peer (P2P) system where each peer has a full-fledged Web search engine, including a crawler and an index manager. The crawler may be thematically focused, or crawl results may be postprocessed so that the local index contents reflect the corresponding user's interest profile. With such a highly specialized and personalized "power search engine", most queries should be executed locally, but once in a while the user may not be satisfied with the local results and would then want to contact other peers. A "good" peer to which the user's query should be forwarded would have thematically relevant index contents, which could be measured by statistical notions of similarity between peers. These measures may be dependent on the current query or may be query-independent; in the latter case, statistics is used to effectively construct a "semantic overlay network" with neighboring peers sharing thematic interests [8,42,48,18,7,66]. Both query routing and "statistically semantic" networks could greatly benefit from collective human input in addition to standard IR measures like term and document frequencies or term-wise score distributions: knowing the bookmarks and query logs of thousands of users would be a great resource to build on. Further exploring these considerations on P2P Web search should become a major research avenue in computer science. Note that our interpretation of Web search includes ranked retrieval and thus is fundamentally more difficult than Gnutella-style file sharing or simple key lookups via distributed hash tables. Further note that, although query routing in P2P Web search resembles earlier work on metasearch engines and distributed IR (see, e.g., [46] and the references given there), it is much more challenging because of the large scale and the high dynamics of the envisioned P2P system with thousands or millions of computers and users.
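For reference, the stationary-probability computation behind the "random surfer" model mentioned at the beginning of this section can be written as a few lines of power iteration; the graph encoding and damping factor below are illustrative, and the "random expert user" extension of [43] would add query states and click-derived transitions to the same chain.

```python
# Hedged sketch of PageRank-style power iteration over a link graph.
# out_links: dict mapping each node to the list of nodes it points to;
# every linked node is assumed to also appear as a key.

def pagerank(out_links, damping=0.85, iterations=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v] or nodes        # dangling nodes jump uniformly
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank                                    # stationary authority scores
```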

7 Conclusion

With the ongoing information explosion in all areas of business, science, and society, it will be more and more difficult for humans to keep information organized and


extract valuable knowledge in a timely manner. The intellectual time for schema design, schema integration, data cleaning, data quality assurance, manual classification, directory and search result browsing, clever formulation of sophisticated queries, etc. is already the major bottleneck today, and the situation is likely to become worse. In my opinion, this will render all attempts to master Web-scale information in a perfectly consistent, purely logic-based manner more or less futile. Rather, the ability to cope with uncertainty, diversity, and high dynamics will be mandatory. To this end, statistics and their use in probabilistic inferences will be key assets. One may envision a rich probabilistic algebra that encompasses relational or even object-relational and XML query languages, but interprets all data and results in a probabilistic manner and always produces ranked result lists rather than Boolean result sets (or bags). There are certainly some elegant and interesting, but mostly theoretical, approaches along these lines (e.g., [27,29,37]). However, there is still a long way to go towards practically viable solutions. Among the key challenges that need to be tackled are customizability, composability, and optimizability.

Customizability: The appropriate notions of ontological relationships, "semantic" similarities, and scoring functions are dependent on the application. Thus, the envisioned framework needs to be highly flexible and adaptable to incorporate application-specific or personalized similarity and scoring models.

Composability: Algebraic building blocks like a top-k operator need to be composable so as to allow the construction of rich queries. The desired property that operators produce ranked lists with some underlying probability (or "score mass") distribution poses a major challenge, for we need to be able to infer these probability distributions for the results of complex operator trees. This problem is related to the difficult issues of selectivity estimation and approximate query processing in a relational database, but goes beyond the state of the art as it needs to incorporate text term distributions and has to yield full distributions at all levels of operator trees.

Optimizability: Regardless of how elegant a probabilistic query algebra may be, it would not be acceptable unless one can ensure efficient query processing. Performance optimization requires a deep understanding of rewriting complex operator trees into equivalent execution plans that have significantly lower cost (e.g., pushing selections below joins or choosing efficient join orders). At the same time, the top-k querying paradigm that avoids computing full result sets before applying some ranking is a must for efficiency, too. This combination of desiderata leads to a great research challenge in query optimization for a ranked retrieval algebra.

References

1. Karl Aberer et al.: Emergent Semantics Principles and Issues, International Conference on Database Systems for Advanced Applications (DASFAA) 2004
2. Mohammad Abolhassani, Norbert Fuhr: Applying the Divergence from Randomness Approach for Content-Only Search in XML Documents, ECIR 2004
3. Alexandria Digital Library Project, Gazetteer Development, http://www.alexandria.ucsb.edu/gazetteer/


4. Shurug Al-Khalifa, Cong Yu, H. V. Jagadish: Querying Structured Text in an XML Database, SIGMOD 2003
5. Sihem Amer-Yahia, Laks V. S. Lakshmanan, Shashank Pandit: FleXPath: Flexible Structure and Full-Text Querying for XML, SIGMOD 2004
6. Arvind Arasu, Hector Garcia-Molina: Extracting Structured Data from Web Pages, SIGMOD 2003
7. Mayank Bawa, Gurmeet Singh Manku, Prabhakar Raghavan: SETS: Search Enhanced by Topic Segmentation, SIGIR 2003
8. Matthias Bender, Sebastian Michel, Gerhard Weikum, Christian Zimmer: Bookmark-driven Query Routing in Peer-to-Peer Web Search, SIGIR Workshop on Peer-to-Peer Information Retrieval 2004
9. Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW Conference 1998
10. Alexander Budanitsky, Graeme Hirst: Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures, Workshop on WordNet and Other Lexical Resources 2001
11. David Carmel, Yoëlle S. Maarek, Matan Mandelbrod, Yosi Mass, Aya Soffer: Searching XML Documents via XML Fragments, SIGIR 2003
12. Soumen Chakrabarti: Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002
13. T. Chinenyanga, N. Kushmerick: An Expressive and Efficient Language for XML Information Retrieval, Journal of the American Society for Information Science and Technology (JASIST) 53(6), 2002
14. Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv: XSEarch: A Semantic Search Engine for XML, VLDB 2003
15. William W. Cohen, Matthew Hurst, Lee S. Jensen: A Flexible Learning System for Wrapping Tables and Lists in HTML Documents, in: A. Antonacopoulos, J. Hu (Editors), Web Document Analysis: Challenges and Opportunities, World Scientific Publishing, 2004
16. William W. Cohen, Sunita Sarawagi: Exploiting Dictionaries in Named Entity Extraction: Combining Semi-markov Extraction Processes and Data Integration Methods, KDD 2004
17. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites, VLDB 2001
18. Arturo Crespo, Hector Garcia-Molina: Semantic Overlay Networks, Technical Report, Stanford University, 2003
19. Hang Cui, Ji-Rong Wen, Jian-Yun Nie, Wei-Ying Ma: Query Expansion by Mining User Logs, IEEE Transactions on Knowledge and Data Engineering 15(4), 2003
20. Hamish Cunningham: GATE, a General Architecture for Text Engineering, Computers and the Humanities 36, 2002
21. Hasan Davulcu, Srinivas Vadrevu, Saravanakumar Nagarajan, I. V. Ramakrishnan: OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites, IEEE Intelligent Systems 18(5), 2003
22. Anhai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos, Alon Y. Halevy: Learning to Match Ontologies on the Semantic Web, VLDB Journal 12(4), 2003
23. Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware, Journal of Computer and System Sciences 66(4), 2003
24. Christiane Fellbaum (Editor): WordNet: An Electronic Lexical Database, MIT Press, 1998
25. Dieter Fensel, Wolfgang Wahlster, Henry Lieberman, James A. Hendler (Editors): Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press, 2002
26. Norbert Fuhr, Kai Großjohann: XIRQL – An Extension of XQL for Information Retrieval, SIGIR Workshop on XML and Information Retrieval 2000


27. Norbert Fuhr: Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications, Journal of the American Society for Information Science (JASIS) 51(2), 2000
28. Norbert Fuhr, Kai Großjohann: XIRQL: A Query Language for Information Retrieval in XML Documents, SIGIR 2001
29. Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer: Learning Probabilistic Relational Models, in: S. Dzeroski, N. Lavrac (Editors), Relational Data Mining, Springer, 2001
30. Georg Gottlob, Christoph Koch, Robert Baumgartner, Marcus Herzog, Sergio Flesca: The Lixto Data Extraction Project – Back and Forth between Theory and Practice, PODS 2004
31. Torsten Grabs, Hans-Jörg Schek: Flexible Information Retrieval on XML Documents, in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003
32. Jens Graupmann, Michael Biwer, Christian Zimmer, Patrick Zimmer, Matthias Bender, Martin Theobald, Gerhard Weikum: COMPASS: A Concept-based Web Search Engine for HTML, XML, and Deep Web Data, Demo Program, VLDB 2004
33. Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Optimizing Multi-Feature Queries for Image Databases, VLDB 2000
34. Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments, International Symposium on Information Technology (ITCC) 2001
35. Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram: XRANK: Ranked Keyword Search over XML Documents, SIGMOD 2003
36. Maria Halkidi, Benjamin Nguyen, Iraklis Varlamis, Michalis Vazirgiannis: THESUS: Organizing Web Document Collections Based on Link Semantics, VLDB Journal 12(4), 2003
37. Joseph Y. Halpern: Reasoning about Uncertainty, MIT Press, 2003
38. Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, Raghu Ramakrishnan: On the Integration of Structure Indexes and Inverted Lists, SIGMOD 2004
39. Nicholas Kushmerick, Bernd Thomas: Adaptive Information Extraction: Core Technologies for Information Agents, in: M. Klusch et al. (Editors), Intelligent Information Agents, Springer, 2003
40. Kristina Lerman, Lise Getoor, Steven Minton, Craig A. Knoblock: Using the Structure of Web Sites for Automatic Segmentation of Tables, SIGMOD 2004
41. Zhenyu Liu, Chang Luo, Junghoo Cho, Wesley W. Chu: A Probabilistic Approach to Metasearching with Adaptive Probing, ICDE 2004
42. Jie Lu, James P. Callan: Content-based Retrieval in Hybrid Peer-to-peer Networks, CIKM 2003
43. Julia Luxenburger, Gerhard Weikum: Query-log Based Authority Analysis for Web Information Search, submitted for publication
44. Alexander Maedche, Steffen Staab: Learning Ontologies for the Semantic Web, International Workshop on the Semantic Web (SemWeb) 2001
45. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press, 1999
46. Weiyi Meng, Clement T. Yu, King-Lup Liu: Building Efficient and Effective Metasearch Engines, ACM Computing Surveys 34(1), 2002
47. Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases, ICDE 1999
48. Henrik Nottelmann, Norbert Fuhr: Combining CORI and the Decision-Theoretic Approach for Advanced Resource Selection, ECIR 2004
49. Erhard Rahm, Philip A. Bernstein: A Survey of Approaches to Automatic Schema Matching, VLDB Journal 10(4), 2001
50. Stuart J. Russell, Peter Norvig: Artificial Intelligence - A Modern Approach, Prentice Hall, 2002


51. Arnaud Sahuguet, Fabien Azavant: Building Light-weight Wrappers for Legacy Web Datasources using W4F, VLDB 1999
52. Ralf Schenkel, Anja Theobald, Gerhard Weikum: Ontology-Enabled XML Search, in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003
53. Ralf Schenkel, Anja Theobald, Gerhard Weikum: An Efficient Connection Index for Complex XML Document Collections, EDBT 2004
54. Torsten Schlieder, Holger Meuss: Querying and Ranking XML Documents, Journal of the American Society for Information Science and Technology (JASIST) 53(6), 2002
55. Torsten Schlieder, Holger Meuss: Result Ranking for Structured Queries against XML Documents, DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries, 2000
56. Torsten Schlieder, Felix Naumann: Approximate Tree Embedding for Querying XML Data, SIGIR Workshop on XML and Information Retrieval, 2000
57. Marios Skounakis, Mark Craven, Soumya Ray: Hierarchical Hidden Markov Models for Information Extraction, IJCAI 2003
58. Steffen Staab, Rudi Studer (Editors): Handbook on Ontologies, Springer, 2004
59. Martin Theobald, Ralf Schenkel, Gerhard Weikum: Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data, International Workshop on Web and Databases (WebDB) 2003
60. Anja Theobald, Gerhard Weikum: Adding Relevance to XML, International Workshop on Web and Databases (WebDB) 2000, extended version in: LNCS 1997, Springer, 2001
61. Anja Theobald, Gerhard Weikum: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking, EDBT 2002
62. Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees, VLDB 2004
63. Yuri A. Tijerino, David W. Embley, Deryle W. Lonsdale, George Nagy: Ontology Generation from Tables, WISE 2003
64. Ellen M. Voorhees: Query Expansion Using Lexical-Semantic Relations, SIGIR 1994
65. Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang: Query Clustering Using User Logs, ACM TOIS 20(1), 2002
66. Linhao Xu, Chenyun Dai, Wenyuan Cai, Shuigeng Zhou, Aoying Zhou: Towards Adaptive Probabilistic Search in Unstructured P2P Systems, Asia-Pacific Web Conference (APWeb) 2004
67. Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, Chao-Jun Lu: Implicit Link Analysis for Small Web Search, SIGIR 2003
68. Shipeng Yu, Deng Cai, Ji-Rong Wen, Wei-Ying Ma: Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation, WWW Conference 2003
69. Pavel Zezula, Giuseppe Amato, Fausto Rabitti: Processing XML Queries with Tree Signatures, in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003

The Application and Prospect of Business Intelligence in Metallurgical Manufacturing Enterprises in China

Xiao Ji, Hengjie Wang, Haidong Tang, Dabin Hu, and Jiansheng Feng

Data Strategies Dept., Shanghai Baosight Software Co., Ltd., Shanghai 201203, China

Abstract. This paper introduces the application of Business Intelligence (BI) technologies in metallurgical manufacturing enterprises in China. It sets forth the development procedure and successful cases of BI in Shanghai Baoshan Iron & Steel Co., Ltd. (Shanghai Baosteel in short), and puts forward a methodology adaptable to the construction of BI systems in metallurgical manufacturing enterprises in China. Finally, it outlines the prospects for the next generation of BI technologies in Shanghai Baosteel. It should be mentioned as well that it is the Data Strategies Dept. of Shanghai Baosight Software Co., Ltd. (Shanghai Baosight in short) and the Technology Center of Shanghai Baoshan Iron & Steel Co., Ltd. that support and carry out research work on BI solutions in Shanghai Baosteel.

1 Introduction

1.1 The Application of BI Technologies in Metallurgical Manufacturing Enterprises in the World

The executives of enterprises are sometimes totally at a loss when faced with the explosively increasing data from different kinds of application systems at different levels, such as MES, ERP, CRM, SCM, etc. Statistics show that the amount of data doubles within eighteen months. But how much of it do we really need, and how much can we really use for further analysis? The main advantage of BI technologies is to discover and turn such massive data into useful information for enterprise decision-making. Research on and application of BI have become a hot topic in the global IT area since the term BI was first brought forward by Howard Dresner of Gartner Group in 1989. Through our years of practice, we consider BI a concept rather than an information technology: it is a business concept for solving problems in enterprise production, operation, management, etc. Taking the enterprise data warehouse as the basis, BI technologies use professional


knowledge and special data mining technologies to disclose the key factors in business problems and to assist operational management and decision-making.

As the most advanced metallurgical manufacturing enterprise in China, Shanghai Baosteel began to use BI technologies to solve key problems in daily production and management over the last decade. It has applied BI technologies such as data analysis and data mining, both spontaneously and deliberately, since the development of the Iron Ore Mixing Solution in 1995, followed by the quality control system, SPC, IPC, and finally today's large-scale enterprise data warehouse. In the meantime, Shanghai Baosight has developed its own characteristic approach to applying BI in metallurgical manufacturing enterprises, especially in the quality control area. In addition, Shanghai Baosight has cultivated an experienced professional team in system development and project management. The following are some achievements in specific areas.

Data Warehouse: Considering its size, complexity and technical level, the Shanghai Baosteel enterprise data warehouse is a rare and advanced system in China. As a successful BI case, it has become a model in metallurgical manufacturing today.

Quality Control and Analysis: In this area, many advanced and distinctive data mining techniques have been widely applied for quality improvement, and they can be extended to other manufacturing enterprises as well.

SPC and IPC: As the basis of quality control, SPC and IPC systems with special characteristics are commonly used in Shanghai Baosteel; they are suited to other manufacturing enterprises too.

The achievements in the above three areas show that Shanghai Baosteel is leading in BI application among metallurgical and manufacturing enterprises in China. With experience transfer, other metallurgical manufacturing enterprises will follow in Shanghai Baosteel's footsteps, and Shanghai Baosight will go further in the related BI application areas. Compared with international peers such as POSCO and the United States Steel Corporation (UEC), Shanghai Baosteel is also among the top in BI application. UEC once invited Shanghai Baosteel to introduce its experience in building a metallurgical manufacturing enterprise.

1.2 The Information System Development of Shanghai Baosteel

Shanghai Baosteel is the largest and most modernized iron and steel complex in China. Baosteel has established its status as a world steel-making giant with comprehensive advantages in its reputation, talents, innovation, management and technology. According to the publication "Guide to the World Steel Industry", Shanghai Baosteel ranks among the first three of the most competitive steel-makers worldwide, and is also believed to be the most potentially competitive iron and steel enterprise of the future. Shanghai Baosteel specializes in producing high-tech and high-value-added steel products. Meanwhile, it has become the main steel supplier to the automobile, household appliance, container, oil and natural gas exploration, and


pressure vessel industries in China. Shanghai Baosteel also exports its products to over forty countries and regions, including Japan, South Korea, and countries in Europe and America. All the facilities that the company possesses are based on the advanced technologies of contemporary steel smelting, cold and hot processing, hydraulic sensing, electronic control, and computer and information communications. They feature large scale, continuity and automation, and are kept at the most advanced technological level in the world. Shanghai Baosteel possesses tremendous research and development strength. It has made great efforts in developing new technologies, new products and new equipment, and has accumulated a vigorous driving force for the company's further development.

Shanghai Baosteel is located in Shanghai, China. Its first-phase construction project began on the 23rd of December 1978, and was completed and put into production on the 15th of September 1985. Its second-phase project went into operation in June 1991, and its third-phase project was completed before the end of 2000. Shanghai Baosteel officially became a stock company on the 3rd of February 2000, and was successfully listed on the Shanghai Stock Exchange on the 12th of December of the same year.

In the early days when Shanghai Baosteel was being set up in 1978, the sponsors considered that they should build computer systems to assist management. They realized they should import the most advanced equipment, techniques and management of the time from Japan, and take some factories of Nippon Steel as models. In May 1981, at the urging of the minister of the Ministry of Metallurgy and Manufacturing, Shanghai Baosteel finished the "Feasibility Research of the Synthetic Computer System" and proposed to build the Shanghai Baosteel information system with a five-level computer structure by setting up four area-control computer systems between the L3 systems and the central management information system. On the 15th of February 1996, Shanghai Baosteel contracted with IBM to import the advanced IBM 9672 computer system from the US as the area-level management information system for the hot and cold rolling areas in the phase-three project, changing the phase-two arrangement in which there were two separate management systems for the hot rolling and cold rolling areas. The decision was a revolution in information system construction at Shanghai Baosteel. Subsequently, the executives of Shanghai Baosteel decided to build a comprehensive information system using the IBM 9672 to integrate the distributed information systems. They then cancelled the fifth-level management information system, and the new system was put into production in March 1998, ensuring the proper operation of the 1580 hot rolling mill, the 1420 cold rolling mill, and the subsequent second steel-making system.

In May 2001, Shanghai Baosteel raised the new strategic concept of Enterprise System Innovation (ESI). The ESI system included a three-level architecture: first, to rebuild the business processes of Shanghai Baosteel to bring up new, effective


ones; second, to reconstruct the organizational structure on the basis of the new business processes; and third, to build corresponding information systems to help realize the new business processes. The main objective of the ESI system is to help Shanghai Baosteel realize its business targets, to be a good competitor among steel enterprises, and to prepare to face the overall challenges after China's accession to the WTO. The ESI decision was made prospectively by the executives of Shanghai Baosteel to help Baosteel realize modernized management and become one of the Global 500 in the world.

Shanghai Baosteel has now successfully finished its third-phase information system development. In the first-phase project, several process control systems, a self-developed central management system (IBM 4341) with batch processing, and PC networks were set up. In the second-phase project, process control systems and product control systems, an imported-technology-based management information system (IBM 4381) for the 2050 hot rolling mill, a self-developed management information system (IBM RS6000) for the 2030 cold rolling mill, an iron-making regional management information system, and a steel-making regional management information system were built. In the third-phase project, better configured process control systems, production control systems for the 1580 hot rolling mill and the 1420 and 1550 cold rolling mills, enterprise-wide OA and human resource management systems, and an ERP system, which included an integrated production and sales system and an equipment management system, were successfully developed. After the three-phase project construction, Shanghai Baosteel had formed its four-level production computer system. In recent years, under the ESI concept, many auxiliary information systems have been set up as well, such as the integrated equipment maintenance management system, data warehouse and data mining applications, information services systems for mills and departments, the e-business platform BSTEEL.COM online, and Supply Chain Management. The architecture of Shanghai Baosteel's information system is illustrated in Fig. 1.

Fig. 1. Information Architecture of Shanghai Baosteel


2 Application of Business Intelligence in Shanghai Baosteel

2.1 The Application and Methodology of Business Intelligence in Shanghai Baosteel

As one of the most advanced metallurgical manufacturing enterprises in China, Shanghai Baosteel is now in an age of rapid development. In order to continuously reduce cost and improve competitiveness in the international and domestic markets, its executives strongly realize the importance of the following: to speed up logistic turnover and improve the level of product turnover; to stabilize and improve product quality; to promote sales and related capabilities to expand market share; to strengthen the infrastructure of cost and finance; and to optimize the allocation of enterprise resources so as to best satisfy market requirements.

In order to achieve the above objectives, the requirement to build an enterprise data warehouse system was raised. To satisfy the strategy of Shanghai Baosteel's information development, the data warehouse system should help Shanghai Baosteel organize every kind of data required by the enterprise analysts and transfer all needed information to end users. Shanghai Baosteel and Shanghai Baosight then started to evaluate and plan the data warehouse system. The evaluation assessed the current enterprise infrastructure and operational environment of Shanghai Baosteel. Given the high level of information development, the data warehouse system could be built, and it was planned to build the first data warehouse subject area for Shanghai Baosteel: the technique and quality management data mart.

Currently, Shanghai Baosteel runs the enterprise data warehouse system on two IBM S85 machines, with the major data source being the ERP system. This data warehouse system includes ODS data stores and well-integrated subject data stores built according to the "Quick Data Warehouse Building" methodology. The first quality management data mart accumulated much experience, and it includes decision-support information about the related products and their quality management. Nowadays, the system comprises the enterprise statistics management data mart, the technique and quality management data mart, the sales and marketing management data mart, the production management data mart, the equipment management data mart, the finance and cost data mart (which includes planning values, metal balancing, cost analysis, BUPC and finance analysis), the production administration information system, the enterprise guidelines system, and manufacturing mill area analysis (covering steel-making, hot rolling, cold rolling, etc.). The amount of data currently in the system is around 2 TB; the ETL tasks deal with about 3 GB of data every day, and the newly appended data amount to about 1 GB. In addition, nearly 1700 static analytical reports are produced each day, and 1600 kinds of dynamic queries are provided.


At the same time, through many years of practice and research, Shanghai Baosight has distilled a set of effective business intelligence solutions for the manufacturing industry. This solution is significant for product design, quality management and cost management in metallurgical manufacturing enterprises. Typically, the implementation of business intelligence for a metallurgical manufacturer consists of the following six phases, which offer a logical segmentation of the work and checkpoints on whether the project is proceeding steadily. The flow chart in Fig. 2 illustrates the overview and work flow of the development phases of this methodology.

Fig. 2. The Methodology of the BI construction

1. Assessment: In this phase, the users' current situation and conditions are studied; these factors will strongly affect the data warehouse solution. The target of this phase is to analyze the users' problems and the methods to resolve them. The initial assessment should identify and clarify the targets, and the requirements for the investigation needed to clarify them. This assessment results in a decision to start, delay or cancel the project.

2. Requirements investigation: In this phase, the project group gathers the high-level requirements in the aspects of operation and information technology (IT), and collects the information required by the departments' targets. The result of this phase is a report which identifies the business purpose, meaning, information requirements and user interfaces. These requirements are also used in other phases of the project and in the design of the data warehouse. In addition, the enterprise-level topic data model and data warehouse subjects are produced in this phase.

3. Design: For the selected subject area, the project group concentrates on collecting detailed information requirements and designing the scheme of the data platform, including data, process and application modeling. In this phase, many methods of collecting information and testing, such as data modeling, process modeling, meetings and prototype presentations, are used. The project group evaluates the technology scheme, the business requirements and the information requirements. The gap between the current IT landscape and the required IT scheme is often considerable, so an appropriate data warehouse design and scheme should be applied.

4. Construction: This phase includes creating the physical databases, data gathering, application testing and code review. The manager of the data warehouse and the leading end users should get to know the system well. After successful testing, the data platform can be used.

5. Deployment and maintenance: In this phase, the data warehouse and BI system are rolled out to business users. At the same time, training of the users starts. After deployment, maintenance and user feedback should be attended to.

6. Summary: In this phase, the whole project is evaluated, in three steps. The first is to sum up the successes and lessons learned. The second is to check whether the configuration is realized as expected; if needed, plans should be changed. The third is to evaluate the influence on, and the benefit to, the company.

2.2 Successful Cases of Shanghai Baosteel's BI Application

Shanghai Baosteel's BI involves not only knowledge of data warehousing, mathematics and statistics, and data mining and knowledge discovery, but also professional knowledge of metallurgy, automatic control, management, etc. These are the main characteristics of Shanghai Baosteel's BI application, and there have been many successful cases in Shanghai Baosteel to date.

The Production Administration System Based on Data Warehouse. As a metallurgical manufacturing enterprise, Shanghai Baosteel requires rational production and proper administration. According to the management requirements, in order to report the latest production status to the high-level executives and get the latest guidance from top managers, managers from all mills and functional departments must take part in the morning conference, which is presided over by the general manager assistant or the vice general manager. Before the data warehouse system was built, the corresponding staff had to go to the production administration center every day, and all conference information was organized by a FoxPro system with manual input, with the data mainly coming from the phone and the ERP system. The new production administration system takes full advantage of the enterprise data warehouse system. Based on the product information collected by the data warehouse system, it can automatically organize on the Web the daily information of production administration, material flow charts, quality analysis results, etc., to support daily production administration and the routine


executives' morning conference. Now, with the online meeting system on the Baosteel intranet, the managers can even take part in the conference and get all kinds of information from their offices. Since the system was put into production, it has won a good reputation.

Integrated Process Control Systems. Quality is the life of the enterprise. In order to meet the fierce competition in the market, continuous improvement of quality control is needed. The IPC systems have realized quality improvement during production at the lowest cost, and they form the core capability of Shanghai Baosteel's QA management know-how. As a supporting analysis system, the IPC system assists the quality managers' control abilities during production processes, advances the technical staff's statistical and analytical abilities, and provides more accurate, convenient and institutionalized approaches for the operators to inspect products. These systems integrate both highly visualized functions and multi-layered data mining functions in a subtle way.

The Quality Data Mining System. The quality data mart was the first BI system that brought benefits to Shanghai Baosteel, and it plays a more and more important role in daily management. On one side, it provides daily reports and analysis functions such as online quality analysis, capability change analysis, quality exception analysis, finished product quality analysis, quality cost statistics, index data maintenance, integrated analysis, KIV-KOV modeling, and so on. On the other side, the data mart supports quality data mining well. Quality data mining based on the data warehouse is a strong aid to the metallurgical industry. There are many cases of data mining and knowledge discovery, such as reducing the sampling of steel ST12, improving the bend strength of the hot-dip galvanized products of steel ST06Z, and knowledge-based material calculation design. In the case of reducing the sampling of steel ST12, the original specification required sampling at both head and tail, which cost much manpower and equipment. After the analysis of some key indexes such as the bend strength, tensile strength, etc., some

Fig. 3. The Production Administration System of Shanghai Baosteel.

Fig. 4. The Web Page of the Storage Presentation.


similar analysis results for the head sampling and the tail sampling were found, with the tail sampling being slightly worse. Through expert reviews and testing practice, after April 2004 Shanghai Baosteel released a new operational specification to test only the tail sample for steel ST12. As a result, it reduces costs by RMB 2.60 million annually.

The Iron Ore Mixing System. Raw material mixing is one of the most important jobs at the beginning of steel-making. Shanghai Baosteel once faced many problems, such as: How to evaluate a new ore that was not listed in the original ore mixing scheme? Which sintering ore most affects the final quality? Is there one scheme that can fit all different needs? Can we improve the quality of the sinter while at the same time reducing its cost? Data mining in the Iron Ore Mixing System aims to find ways to meet the needs of all kinds of sintering ores. The system forecasts the sinter quality through modeling, supports low-cost mixing methods, creates an iron ore mixing knowledge database, and also provides a friendly user interface. The data mining of iron ore mixing proceeds in four steps: data preparation, iron ore evaluation with clustering analysis, modeling with neural networks, and optimization. The evaluation results from the system are almost the same as those from experts, and the forecasting accuracy reaches above 85%.

A Defect Diagnosis Expert System. Defect diagnosis is an important basis of reliability engineering, and is an important component and key technology of total quality control. Computer-aided defect diagnosis can reduce and prevent the same defects from occurring repeatedly. It can also assist in providing information for decision-making. The system builds on experiments and the massive data produced by technicians after real accidents happen. It was developed with computer technologies, statistical analysis, data mining technologies and artificial intelligence, and it consists of data storage, statistical analysis, a knowledge repository, and defect diagnosis. The system contains both highly generalized visualization functions and multi-layered data mining functions.
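The four-step pipeline described above (data preparation, clustering-based ore evaluation, neural-network modeling, optimization) could be sketched roughly as follows; the scikit-learn components, feature layout and model sizes are assumptions made for illustration and are not details of Baosteel's actual system.

```python
# Hedged illustration of the iron-ore-mixing data mining pipeline; library
# choice (scikit-learn), features and model sizes are assumptions.
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def evaluate_and_model(ore_properties, mix_recipes, sinter_quality, n_groups=4):
    """ore_properties: (n_ores, n_features) array of ore measurements.
    mix_recipes:    (n_batches, n_ores) historical blending proportions.
    sinter_quality: (n_batches,) measured sinter quality index per batch."""
    # Steps 1-2: normalize ore properties and group similar ores by clustering
    scaled = StandardScaler().fit_transform(ore_properties)
    ore_groups = KMeans(n_clusters=n_groups, n_init=10).fit_predict(scaled)

    # Step 3: learn a forecast model from blending recipe to sinter quality
    forecaster = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)
    forecaster.fit(mix_recipes, sinter_quality)

    # Step 4 (optimization) would search candidate recipes, trading predicted
    # quality against raw-material cost, e.g., via grid search or an LP solver.
    return ore_groups, forecaster
```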

Fig. 5. The Quality Control of IPC.

Fig. 6. Improvement of the Bend Strength of the Steel ST06Z Products.


3 The Next Generation of Business Intelligence in Shanghai Baosteel

Shanghai Baosteel is the main body of the Shanghai Baosteel Group. As the Shanghai Baosteel Group entered Fortune's Global 500 in 2003, the application of BI in Shanghai Baosteel will be strengthened and developed further. The following are the tasks to perform.

3.1 Carrying out the Application at the Department Level

Shanghai Baosteel will persist in developing its own characteristic BI applications, will take quality control and synthetic reports as its main goal, and will extend the combination of IPC, data warehousing and data mining. Quality control is an everlasting subject in manufacturing and a lasting market in which product design and development should be strengthened. Nowadays, enterprises place particular strategic emphasis on process improvement, in response both to continually rising client requirements and to a drastically competitive market. In industrial manufacturing, and especially in metallurgical manufacturing, there are many factors that cause quality problems, such as equipment failure, staff carelessness, abnormal parameters, raw material differences, and fluctuating settings. Especially in large steel enterprises with complicated business and technical flows, "timely finding and forecasting of exceptions, prompt control and quality analysis" is a necessity. Therefore, based on the Six Sigma notion of quality control, applications based on data warehouse technologies, together with process control, fuzzy control, neural networks, expert systems and data mining, can be applied to complicated working procedures such as blast furnace operation, iron-making, steel-making, continuous casting and steel rolling. This is certainly the road to further developing BI at the department level.

Fig. 7. The Forecast of the RDI in the Iron Ore.

Fig. 8. The Diagnosis Expert System of the Steel Tube.


3.2 Strengthening Research on the Application of BI at the Enterprise Level

At the enterprise level there are many requirements, which can lead to a data-warehouse-based EIS, a Key Performance Indicator (KPI) system, etc. A KPI is a measurable management target obtained by setting, sampling, calculating and analyzing the key parameters of the inputs and outputs of internal organizational processes. It is a tool that can decompose the organizational strategic goal into operational tasks, and it is the basis of organizational performance management. It can assign definite responsibilities to a department manager and extend them to the staff in the department. So building a definite, credible KPI system is the key to good performance management.

3.3 Following up the Technical Tide of BI and Applying New Technologies in Industry

BI is a subject which overlaps many disciplines. Shanghai Baosteel and Shanghai Baosight are actively following the technical tide of BI and researching new BI techniques for metallurgical manufacturing in fields such as stream data management, text (message) data mining, KPI practice in manufacturing, position-based customized information, knowledge management, etc.

Stream data management: Data from the L3 (production control) systems have the characteristics of stream data, so stream data management techniques can be applied when IPC systems need to analyze and mine data on the production line in a timely fashion.

Text (message) data mining: Data communication between the ERP system and the other information systems of Shanghai Baosteel is implemented by messages. All the messages have been extracted and loaded into the data warehouse system, so how to use text mining techniques to analyze and resolve exceptions quickly will be a new challenge.

The practice of KPI in manufacturing, position-based customized information, and knowledge management are new subjects and trends for providing extensive BI applications in metallurgical manufacturing. Meanwhile, Shanghai Baosight and the Technology Center of Shanghai Baosteel are taking full advantage of their previous experience to develop data mining tools with independent intellectual property rights. Practical Miner from the Technology Center has been popular in Shanghai Baosteel for years, while Shanghai Baosight is developing a data mining tool according to the CWM 1.1 and CORBA standards, expected to be released in early 2005.

4 Conclusions

Shanghai Baosteel leads the Chinese metallurgical manufacturing industry, and it is a leader in BI application as well. Through many years of application and practice, it has benefited much from BI, and it will pursue even more ambitious BI goals in the near future.



Conceptual Modelling – What and Why in Current Practice

Islay Davies1, Peter Green2, Michael Rosemann1, and Stan Gallo2

1 Centre for Information Technology Innovation, Queensland University of Technology, Brisbane, Australia
{ig.davies,m.rosemann}@qut.edu.au
2 UQ Business School, University of Queensland, Ipswich, Australia
{p.green,s.gallo}@uq.edu.au

Abstract. Much research has been devoted over the years to investigating and advancing the techniques and tools used by analysts when they model. As opposed to what academics, software providers and their resellers promote as should be happening, the aim of this research was to determine whether practitioners still embraced conceptual modelling seriously. In addition, what are the most popular techniques and tools used for conceptual modelling? What are the major purposes for which conceptual modelling is used? The study found that the top six most frequently used modelling techniques and methods were ER diagramming, data flow diagramming, systems flowcharting, workflow modelling, RAD, and UML. However, the primary contribution of this study was the identification of the factors that uniquely influence the continued-use decision of analysts, viz., communication (using diagrams) to/from stakeholders, internal knowledge (lack of) of techniques, user expectations management, understanding models integration into the business, and tool/software deficiencies.

1 Introduction

The areas of business systems analysis, requirements analysis, and conceptual modelling are well-established research directions in academic circles. Comprehensive analytical work has been conducted on topics such as data modelling, process modelling, meta modelling, model quality, and the like. A range of frameworks and categorisations of modelling techniques have been proposed (e.g. [6, 9]). However, they mostly lack an empirical foundation. Thus, it is difficult to provide solid statements on the importance and potential impact of related research on the actual practice of conceptual modelling. More recently, Wand and Weber [13, p. 364] assume "the importance of conceptual modelling" and they state "Practitioners report that conceptual modelling is difficult and that it often falls into disuse within their organizations." Unfortunately, anecdotal feedback to us from information systems (IS) practitioners largely confirmed the assertion of Wand and Weber [13]. Accordingly, as researchers involved in attempting to advance the theory of conceptual modelling in organisations, we were concerned to determine whether practitioners still found conceptual modelling useful and whether they were indeed still performing conceptual modelling as part of their business systems analysis processes. Moreover, if practitioners still found modelling useful, why


did they find it useful and what were the major factors that inhibited the wider use of modelling in their projects? In this way, the research that we were performing would be relevant to the practice of information systems development (see the IS Relevance debate on ISWorld, February 2001). Hence, the research in this paper is motivated in several ways. First, we want to obtain empirical data showing that conceptual modelling is indeed being performed in IS practice in Australia. Such data will give overall assurance of the practical relevance of the research that we perform in conceptual modelling. Second, we want to find out what are the principal tools, techniques, and purposes for which conceptual modelling is performed currently in Australia. In this way, researchers can obtain valuable information to help them direct their research towards aspects of conceptual modelling that contribute most to practice. Finally, we were motivated to perform this study so that we could gather and analyse data on major problems and benefits unique to the task of conceptual modelling in practice. So, this research aims to provide current insights into actual modelling practice. The underlying research question is "Do practitioners actually use conceptual modelling in practice?" The derived and more detailed questions are: What are popular tools and techniques used for conceptual modelling in Australia? What are the purposes of modelling? What are major problems and benefits unique to modelling? In order to provide answers to these questions, an empirical study using a web-based questionnaire was designed. The goal was to determine what modelling practices are being used in business, as opposed to what academics, software providers and their resellers believe should be used. In summary, we found that the current state of usage of business systems/conceptual modelling in Australia is: ER diagramming, data flow diagramming, systems flowcharting, and workflow modelling are most frequently used for database design and management, software development, and documenting and improving business processes. Moreover, this modelling work is supported in most cases by the use of Visio (in some version) as an automated tool. Furthermore, the planned use of modelling techniques and tools in the short-term future appears set to reduce significantly compared to current usage levels. The remainder of the paper unfolds in the following manner. The next section reviews the related work in terms of empirical data in relation to conceptual modelling practice. The third section explains briefly the instrument and methodology used. Then, an overview of the quantitative results of the survey is given. The fifth section presents succinctly the results of the analysis of the textual data on the problems and benefits of modelling. The last section concludes and gives an indication of further work planned.

2 Related Work

Over the years, much work has been done on how to do modelling – the quality, correctness, completeness, goodness of representation, understandability, differences between novice and expert modellers, and many other aspects (e.g., [7]). Comparatively little empirical work, however, has been undertaken on modelling in practice. Floyd [3] and Necco et al. [8] conducted comprehensive empirical work into the use of modelling techniques in practice, but that work is now considerably dated.


Batra and Marakas [1] attempted to address this lack of current empirical evidence; however, their work focused on comparing the perspectives of the academic and practitioner communities regarding the applications of conceptual data modelling. Indeed, these authors simply reviewed the academic and practitioner literatures without actually collecting primary data on the issue. Moreover, their work is now dated. However, it is interesting that they (p. 189) observe that "there is a general lack of any substantive evidence, anecdotal or empirical, to suggest that the concepts are being widely used in the applied design environment." Batra and Marakas [1, p. 190] also state that "Researchers have not attempted to conduct case or field studies to gauge the cost-benefits of enterprise-wide conceptual data modelling (CDM)." This research attempts to address the problems alluded to by Batra and Marakas [1]. Iivari [4] provided some data on these questions in a Finnish study of the perceived effectiveness of CASE tools. However, he found the adoption rate of CASE tools by developers in organisations to be very low (and presumably the extent of conceptual modelling to be low as well). More recently, Persson and Stirna [10] noted the problem, but their work was limited to an exploratory study of practice. Most recently, Chang et al. [2] conducted 11 interviews with experienced consultants in order to explore the perceived advantages and disadvantages of business process modelling. This descriptive study did not, however, investigate the critical success factors of process modelling. Sedera et al. [11] conducted three case studies to determine a process modelling success model, but they have not yet reported on a planned empirical study to test this model. Furthermore, the studies by Chang et al. [2] and Sedera et al. [11] are limited to the area of process modelling.

3 Methodology

This study was conducted in the form of a web-based survey issued, with the assistance of the Australian Computer Society (ACS), to its members. The survey consisted of seven pages1. The first page explained the objectives of our study. It also highlighted the available incentive, i.e., free participation in one of five workshops on business process modelling. The second page asked for the purpose of the respondent's modelling activities. In total, 17 purposes (e.g., database design and management, software development) were listed. The respondents were asked to evaluate the relevance of each of these purposes using a five-point Likert scale ranging from 1 (not relevant) to 5 (highly relevant). The third page asked for the modelling techniques2 used by the respondent. It provided a list of 18 different modelling techniques, ranging from data flow diagrams and ER diagrams, to the various IDEF standards, up to UML. For each modelling technique, the participants had to provide information about their past, current and future use of the technique. It was possible to differentiate between infrequent and frequent use. Furthermore, participants could indicate whether they knew the technique or did not use it at all. It was also possible to add further modelling techniques that they used. The fourth page was related to modelling tools. Following the same structure as for the modelling techniques, a list of 24 modelling tools was provided, with a hyperlink to the homepage of each tool.

1 A copy of the survey pages is available from the authors on request.
2 'Technique' here is used as an umbrella term referring to the constructs of the technique, their rules of construction, and the heuristics and guidelines for refinement.


It was also clarified whether a tool had previously been known under a different name (e.g., Designer2000 for the Oracle9i Developer Suite). The fifth page explored qualitative issues. Participants were asked to list major problems and issues they had experienced with modelling, as well as perceived key success factors. On the sixth page, demographic data was collected. This data included person type (practitioner, academic or student), years of experience in business systems analysis and modelling, working area (business or IT), training in modelling, and the size of the organisation. The seventh page allowed respondents to enter contact details in order to receive the summarised results of the study and the free workshop. The instrument was piloted with 25 members of two research centres as well as with a selected group of practitioners; minor changes were made based on the experiences from this pilot. A major contribution of this paper is an examination of the data gathered through the fifth page of the survey. This section of the survey asked respondents to list critical success factors for them in the use of conceptual modelling, and problems or issues they encountered in successfully undertaking modelling in their organisations. The phenomenon that responses to these questions allowed us to investigate is why analysts continue (or discontinue) to use a technical method – conceptual modelling – implemented using a technological tool. To analyse this phenomenon, we used the following two-step procedure: (1) determine which responses confirm the factors we already know about in regard to this phenomenon; and (2) determine which responses identify new factors that are unique to the domain of conceptual modelling. To achieve step 1, we reviewed the current thinking and literature in the areas of adoption and continued use of a technology. Then, using NVivo 2, one researcher classified the textual comments, where relevant, according to these known factors. This researcher's classification was then reviewed and confirmed with a second researcher. The factors identified from the literature and used in this first phase of the process are summarised and defined in Table 1. After step 1, there remained responses that did not readily fit into one or other of the known factor categories. These unclassified responses had the potential to provide us with insight into factors unique and important to the domain of conceptual modelling. The question, however, was how to derive this information from the textual data in a relatively objective and unbiased manner. We used a state-of-the-art textual content analysis tool called Leximancer3. Using this tool, we identified five new factors specific to conceptual modelling from the unclassified text. Subsequently, one researcher again classified the remaining responses using these newly identified factors. This classification was reviewed and confirmed by a second researcher. Finally, the relative importance of each of the new factors was determined.

3.1 Why Use Leximancer?

The Leximancer system allows its users to analyse large amounts of text quickly. The tool performs this analysis both systematically and graphically by creating a map of the constructs – the document map – displayed in such a manner that links to related subtext may subsequently be explored. Each of the words on the document map represents a concept that was identified.

3 For more information on Leximancer, see www.leximancer.com


The concept is placed on the map in proximity to other concepts through a derived combination of the direct and indirect relationships between those concepts. Essentially, the Leximancer system is a machine-learning technique based on the Bayesian approach to prediction.


The procedure used for this is a self-ordering optimisation technique and does not use neural networks. Once the optimal weighted set of words is found for each concept, it is used to predict the concepts present in fragments of related text. In other words, each concept has other concepts that it attracts (with which it is highly associated contextually) as well as concepts that it repels (from which it is highly disassociated contextually). The relationships are measured by the weighted sum of the number of times two concepts are found in the same 'chunk'. An algorithm weights them and determines the confidence and relevance of the terms to others within a specific chunk and across chunks. Leximancer was selected for this qualitative data analysis for several reasons:
- its ability to derive the main concepts within text and their relative importance using a scientific, objective algorithm;
- its ability to identify the strength of association between concepts (how often they co-occur) – centrality of concepts;
- its ability to assist the researcher in applying grounded theory analysis to a textual dataset;
- its ability to assist in visually exploring textual information for related themes to create new ideas or theories; and
- its ability to assist in identifying similarities in the context in which the concepts occur – contextual similarity.
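To make the co-occurrence measurement concrete, the sketch below counts, for every pair of concepts, how many text chunks mention both. It only illustrates the counting idea described above, not Leximancer's actual algorithm, and the table Mention(chunk_id, concept) is a hypothetical staging table assumed for the example, not something produced by the tool.

```sql
-- Hypothetical staging table: at most one row per (chunk, concept) pair.
-- CREATE TABLE Mention (chunk_id INTEGER, concept VARCHAR(100));

SELECT m1.concept AS concept_a,
       m2.concept AS concept_b,
       COUNT(*)   AS shared_chunks          -- number of chunks mentioning both concepts
FROM   Mention m1
JOIN   Mention m2
       ON  m1.chunk_id = m2.chunk_id
       AND m1.concept  < m2.concept         -- count each unordered pair once
GROUP  BY m1.concept, m2.concept
ORDER  BY shared_chunks DESC;
```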

4 Survey Results and Discussion

Of the 674 individuals who started to fill out the survey, 370 completed it in its entirety, which gives a completion rate of 54.8%. Moreover, of the 12,000 members of the ACS, 1,567 indicated in their most recent membership profiles that they were interested in conceptual modelling/business systems analysis. Accordingly, our 370 responses represent a relevant response rate of 23.6%, which is very acceptable for a survey. Moreover, we offered free participation in one of five seminars on business process modelling as an inducement for members to participate; this offer was accepted by 186 of the 370 respondents. Consistent with the nature of the ACS as a professional organisation, 87% of the participants were practitioners. The remaining respondents were academics (6%) and students (7%). It is also not a surprise that 85% of the participants characterised themselves as IT service people, while only 15% referred to themselves as businesspeople or end users. Sixty-eight percent of the respondents indicated that they gained their knowledge of Business Systems Analysis at university. Other answers were TAFE (Technical and Further Education) (6%) and the ACS (3%). Twenty-three percent indicated that they did not have any formal training in Business Systems Analysis. Forty percent of the respondents indicated that they have less than five years' experience with modelling, thirty-eight percent have between 5 and 15 years of experience, and a significant proportion, 22%, has more than 15 years of experience with modelling. These figures indicate that the average expertise of the respondents is presumably quite high. Twenty-eight percent of respondents indicated that they worked in firms employing fewer than 50 people, most likely small software consulting firms. However, a quarter of the respondents worked in organisations with 1,000 or more employees. So, by Australian standards, they would be involved in software projects of reasonable size.


We were concerned to obtain information on three principal areas of conceptual modelling in Australia, viz., which techniques are currently used in practice, which tools are used for modelling in practice, and the purposes for which conceptual modelling is used. Table 2 presents the top six most frequently used modelling techniques from the data. It describes the usage of techniques as not known or not used, infrequently used (defined in the survey instrument as used less than five times per week), and frequently used. The table clearly demonstrates that the top six most frequently used (used five or more times a week) techniques are ER diagramming, data flow diagramming, systems flowcharting, workflow modelling (a range of workflow modelling techniques), RAD, and UML. It is significant to note that, even though object-oriented analysis, design, and programming has been the predominant paradigm for systems development over the last decade, 64 percent of respondents either did not know or did not use UML. While not every available conceptual modelling technique was named in the survey, the eighteen techniques listed were selected based on their popularity as reported in prior literature. It is again interesting to note that at least approximately 40 percent of respondents either do not know or do not use any of the 18 techniques named in the survey.

Moreover, while not explicitly reported in Table 2, this current level of non-usage appears set to increase over the short-term future (the next 12 months), as the planned frequent use of each of the top four techniques is expected to drop to less than half its current level, viz., ER diagramming (17 percent), data flow diagramming (15 percent), systems flowcharting (10 percent), and workflow modelling (12 percent). Furthermore, no balancing increase in the intention to use any of the other techniques was reported. Perhaps this short-term trend reflects a perception that the current general downturn in the IT industry will persist into the future; accordingly, respondents foresee a significant reduction in new developmental work requiring business systems modelling in the short-term future. It may also simply reflect a lack of planning of future modelling activities. Our work was also interested in which tools were used to perform the conceptual modelling work currently being undertaken. Table 3 presents the top six most frequently used tools for business systems analysis and design; the data is reported using the same legend as for Table 2. Again, while not every available conceptual modelling tool was named in the survey, the twenty-four tools listed were selected based on their popularity as reported in prior literature. Table 3 clearly indicates that Visio (58 percent – both infrequent and frequent use) is currently the preferred tool for business systems modelling.


This result is not surprising, as the top four most frequently used techniques are well supported by Visio (in its various versions). A distant second in frequent use is Rational Rose (19 percent – both infrequent and frequent use), reflecting the current level of use of object-oriented analysis and design techniques. Again, approximately 40 percent of respondents (at least) either do not know or do not use any of the 24 tools named in the survey – even a relatively simple tool like Flowcharter or Visio.

Moreover, while not explicitly reported in Table 3, the planned frequent use of the top two tools over the short-term future (next 12 months) is expected to drop significantly from current usage levels, viz., Visio (21 percent) and Rational Rose (8 percent), with no real increase reported in the planned use of other tools to compensate for this drop. Again, this trend in planned tool usage appears to reflect an expectation among respondents of a significant reduction in new developmental work requiring business systems modelling in the short-term future. Business systems modelling (conceptual modelling) must be performed for some purpose. Accordingly, we were interested in obtaining data on the various purposes for which people might be undertaking modelling. Using a five-point Likert scale (where 5 indicates very frequent use), Table 4 presents (in rank order from highest to lowest score) the average score for each purpose of use reported by the respondents.


Table 4 indicates that database design and management remains the highest-scoring purpose for the use of modelling techniques. This links to the earlier result of ER diagramming being the most frequently used modelling technique. Moreover, software development as a purpose would support the high usage of data flow diagramming and ER diagramming noted earlier. Indeed, the relatively highly regarded purposes of documenting and improving business processes, and managing workflows, would further support the relatively high usage of workflow modelling and flowcharting indicated earlier. More specialised tasks, such as identifying activities for activity-based costing and internal control purposes in auditing, appear to be relatively infrequent purposes for modelling. This, however, may derive from the type of population surveyed, viz., members of the Australian Computer Society.

5 Textual Analysis Results and Discussion

Nine hundred and eighty (980) individual comments were received across the questions on critical success factors and problems/issues for modelling. Using the known factors (Table 1) influencing continued use of new technologies in firms, Table 5 shows the classification of the 980 comments after phase 1 of the analysis using Nvivo.

Clearly, relative advantage (disadvantage)/usefulness from the perspective of the analyst was the major factor driving the decision to continue (or discontinue) modelling. Does conceptual modelling (and/or its supporting technology) take too much time, make my job easier, make my job harder, or make it easier/harder for me to elicit and confirm requirements with users? Comments of this kind typically contributed to this factor. Furthermore, it is not surprising to see that complexity of the method and/or tool, compatibility of the method and/or tool with the responsibilities of my job, the views of "experts", and top management support were other major factors driving analysts' decisions on continued use. Prior literature had led us to expect these results, in particular the key importance of top management support to the continued successful use of key business planning and quality assurance mechanisms such as conceptual modelling. However, nearly one-fifth of the comments remained unclassified. Were there any new, important factors unique to the conceptual modelling domain contained in this data? Fig. 1 shows the document (concept) map produced by Leximancer from the unclassified comments.

Fig. 1. Concept map produced by Leximancer on the unclassified comments

Five factors were identified from this map using the centrality of concepts and the relatedness of concepts to each other within identifiable 'chunks'.


While the resolution of the Leximancer-generated concept map (Fig. 1) may be difficult to read on its own here, the concepts (terms) depicted are referred to within the discussion of the relevant factors below.

A. Internal Knowledge (Lack of) of Techniques
This group centred on such concepts as knowledge, techniques, information, large, easily and lack. Related concepts were work, systems, afraid, UML and leading. Accordingly, we used these concepts to identify this factor as the degree of direct/indirect knowledge (or lack thereof) in relation to the use of effective modelling techniques. The inadequacies highlighted raise issues about the modeller's skill level and questions of insufficient training.

B. User Expectations Management
This group centred on such concepts as expectations, stakeholders, audience and review. Understanding, involved, logic and find were related concepts. Consequently, we used these items to identify this factor as the issues arising from the need to manage users' expectations of what conceptual modelling will do for them and what it will produce. In other words, the analyst must ensure that the stakeholders/audience for the outputs of conceptual modelling have a realistic understanding of what will be achieved. Continued (or discontinued) use of conceptual modelling may be influenced by difficulties experienced (or expected) with users over such issues as acceptance, understanding and communication of the outcomes of the modelling techniques.

C. Understanding the Models' Integration into the Business
This group centred on understanding, enterprise, high, details, architecture, logic, physical, implementation and prior. Accordingly, we identified this factor as the degree to which decisions are affected by the stakeholder's/modeller's perceived understanding (or lack thereof) of the models' integration into business processes (initial and ongoing). In other words, for the user, to what extent do the current outputs of the modelling process integrate with the existing business processes and physical implementations to support the goals of the overall enterprise architecture?

D. Tool/Software Deficiencies
This group focused on such concepts as software, issues, activities, and model. A factor was subsequently identified as the degree to which decisions are affected by issues relating directly to the perceived lack of capability of the software and/or the tool design.

E. Communication (Using Diagrams) to/from Stakeholders
This final group involved such concepts as diagram, information, ease, communication, method, examples, and articulate. Related concepts were means, principals, inability, hard, audience, find, and stakeholders. From these key concepts, we deduced a factor as the degree to which diagrams can facilitate effective communication between analysts and key stakeholders in the organisation. In other words, to what extent can the use of diagrams enhance (or hinder) the explanation to, and understanding by, the stakeholders of the situation being modelled?

Using these five new factors, we revisited the unclassified comments and, using the same dual-coder process as before, easily confirmed a classification for those outstanding comments. Table 6 presents this classification and the relative importance of the newly identified factors.


As can be seen from Table 6, communication using diagrams and internal knowledge (lack of) of the modelling techniques are major issues specific to the continued use of modelling in organisations. To a lesser degree, properly managing users’ expectations of modelling and ensuring users understand how the outcomes of a specific modelling task support the overall enterprise systems architecture are important to the continued use of conceptual modelling. Deficiencies in software tools that support conceptual modelling frustrate the analyst’s work occasionally.

6 Conclusions and Future Work

This paper has reported the results of a survey conducted nationally in Australia on the status of conceptual modelling. It achieved 370 responses and a relevant response rate of 23.6 percent. The study found that the top six most frequently used modelling techniques were ER diagramming, data flow diagramming, systems flowcharting, workflow modelling, RAD, and UML. Furthermore, it found that Visio is clearly the preferred tool for business systems modelling at present; Rational Rose and the Oracle Developer suite were a distant second in frequent use. Database design and management remains the highest-scoring purpose for the use of modelling techniques. This links to the result of ER diagramming being the most frequently used modelling technique. Moreover, software development as a purpose would support the high usage of data flow diagramming and ER diagramming. A major contribution of this study is the analysis of textual data concerning critical success factors and problems/issues in the continued use of conceptual modelling. Clearly, relative advantage (disadvantage)/usefulness from the perspective of the analyst was the major factor driving the decision to continue (or discontinue) modelling. Moreover, using a state-of-the-art textual analysis and machine-learning software package called Leximancer, this study identified five factors that uniquely influence analysts' continued-use decisions, viz., communication (using diagrams) to/from stakeholders, internal knowledge (lack of) of techniques, user expectations management, understanding the models' integration into the business, and tool/software deficiencies. The results of this work are limited in several ways. Although every effort was made to mitigate potential limitations, the study still suffers from the usual problems with surveys, most notably potential bias in the responses and a lack of generalisability of the results to other people and settings. More specifically, in relation to the qualitative analysis, even though a form of dual coding (with confirmation) was employed, subjectivity remains in the classification of comments.


Furthermore, while all members of the research team participated, the identification of the factors from the Leximancer document map, using the principles of relatedness and centrality, remains arguable. We intend to extend this work in two ways. First, we will analyse the data further by investigating cross-tabulations and correlations between the quantitative data and the qualitative results reported in this paper. For example, do the factors influencing the continued-use decision vary by demographic dimensions such as source of formal training, years of modelling experience, and the like? Second, we want to administer the survey in other countries (Sweden and the Netherlands are already planned) to address the lack of generalisability of the current results and cultural differences in conceptual modelling.

References
1. Batra, D., Marakas, G.M.: Conceptual data modelling in theory and practice. European Journal of Information Systems, Vol. 4, Nr. 3 (1995) 185-193
2. Chang, S., Kesari, M., Seddon, P.: A content-analytic study of the advantages and disadvantages of process modelling. 14th Australasian Conference on Information Systems. Eds.: J. Burn, C. Standing, P. Love. Perth (2003)
3. Floyd, C.: A comparative evaluation of system development methods. Information Systems Design Methodologies: Improving the Practice. North-Holland, Amsterdam (1986) 19-37
4. Iivari, J.: Factors affecting perceptions of CASE effectiveness. IEEE Software, Vol. 4 (1995) 143-158
5. Karahanna, E., Straub, D.W., Chervany, N.L.: Information Technology Adoption Across Time: A Cross-Sectional Comparison of Pre-Adoption and Post-Adoption Beliefs. MIS Quarterly, Vol. 23, Nr. 2 (1999) 183-213
6. Karam, G.M., Casselman, R.S.: A cataloging framework for software development methods. IEEE Computer, Feb. (1993) 34-46
7. Lindland, O.I., Sindre, G., Solvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software, March (1994) 42-49
8. Necco, C.R., Gordon, C.L., Tsai, N.W.: Systems analysis and design: Current practices. MIS Quarterly, Dec. (1987) 461-475
9. Olle, T.W., Hagelstein, J., Macdonald, I.G., Rolland, C., Sol, H.G., van Assche, F.J.M., Verrijn-Stuart, A.A.: Information Systems Methodologies: A Framework for Understanding. Addison-Wesley, Wokingham (1991)
10. Persson, A., Stirna, J.: Why Enterprise Modelling? An Explorative Study Into Current Practice. 13th Conference on Advanced Information Systems Engineering, Switzerland (2001) 465-468
11. Sedera, W., Gable, G., Rosemann, M., Smyth, R.: A success model for business process modeling: findings from a multiple case study. Pacific Asia Conference on Information Systems (PACIS'04). Eds.: Liang, T.P., Zheng, Z. Shanghai (2004)
12. Tan, M., Teo, T.S.H.: Factors Influencing the Adoption of Internet Banking. Journal of the Association for Information Systems, Vol. 1 (2000) 1-43
13. Wand, Y., Weber, R.: Research Commentary: Information Systems and Conceptual Modeling – A Research Agenda. Information Systems Research, Vol. 13, Nr. 4 (2002) 363-376

Entity-Relationship Modeling Re-revisited

Don Goelman and Il-Yeol Song
Department of Computer Science, Villanova University, Villanova, PA 19085
College of Information Science and Technology, Drexel University, Philadelphia, PA 19104

Abstract. Since its introduction, the Entity-Relationship (ER) model has been the vehicle of choice in communicating the structure of a database schema in an implementation-independent fashion. Part of its popularity has no doubt been due to the clarity and simplicity of the associated pictorial Entity-Relationship Diagrams (“ERD’s”) and to the dependable mapping it affords to a relational database schema. Although the model has been extended in different ways over the years, its basic properties have been remarkably stable. Even though the ER model has been seen as pretty well “settled,” some recent papers, notably [4] and [2 (from whose paper our title is derived)], have enumerated what their authors consider serious shortcomings of the ER model. They illustrate these by some interesting examples. We believe, however, that those examples are themselves questionable. In fact, while not claiming that the ER model is perfect, we do believe that the overhauls hinted at are probably not necessary and possibly counterproductive.

1 Introduction

Since its inception [5], the Entity-Relationship (ER) model has been the primary approach for presenting and communicating a database schema at the "conceptual" level (i.e., independently of its subsequent implementation), especially by means of the associated Entity-Relationship Diagram (ERD). There is also a fairly standard method for converting it to a relational database schema. In fact, if the ER model is in some sense "correct," then the associated relational database schema should be in reasonably good normal form [15]. Of course, there have been some suggested extensions to Chen's original ideas (e.g., specialization and aggregation as in [10, 19]), some different approaches for capturing information in the ERD, and some variations on the mapping to the relational model, but the degree of variability has been relatively minor. One reason for the remarkable robustness and popularity of the approach is no doubt the wide appreciation of the simplicity of the diagram. Consequently, the desirability of incorporating additional features into the ERD must be weighed against the danger of overloading it with so much information that it loses its visual power in communicating the structure of a database. In fact, the model's versatility is also evident in its relatively straightforward mappability to the newer Object Data Model [7]. Admittedly, an industrial-strength ERD reflecting an actual enterprise would necessarily be an order of magnitude more complex than even the production numbers in standard texts [e.g., 10].


However, this does not weaken the ability of a simple ERD to capture local pieces of the enterprise, nor does it lessen the importance of ER-type thinking in communicating a conceptual model. Quite recently, however, both Camps and Badia have demonstrated [4, 2 (from whose paper the title of this one is derived)] some apparent shortcomings of the ER model, both in the model itself and in the processes of conversion to the relational model and its subsequent normalization. They illustrate these problems through some interesting examples, and they also make some recommendations for improvements based on these examples. However, while not claiming that the ER model can be all things to all users, we believe that the problems presented in the examples described in those two papers are due less to the model and more to its incorrect application. Extending the ERD to represent complex multi-relation constraints or constraints at the attribute level is an interesting research topic, but such extensions are not always desirable. We claim that representing them would clutter the ERD as a conceptual model at the enterprise level; complex constraints are better specified in a textual or language-oriented format than at the ERD level. The purpose of this paper is to take these examples as a starting point for discussing the possible shortcomings of the ER model and the necessity, or lack thereof, of modifying it to address them. We therefore begin by reviewing and analyzing those illustrations. Section 2 describes and critiques Camps' scenarios; Section 3 does the same for Badia's. Section 4 considers some related issues, most notably a general design principle only minimally offered in the ER model. Section 5 concludes our paper.

2 The Camps Paper

In [4], the author begins by describing an apparently simple enterprise. It has a straightforward ERD that leads to an equally straightforward relational database schema. But Camps then escalates the situation in stages, to the point where the ER model is not currently able to accommodate the design, and where normalizing the associated relational database schema is also unsatisfying. Since we are primarily concerned with problems attributed to the ER model, we will concentrate here on that aspect of the paper. However, the normalization process at this point is closely tied to that model, so we will include some discussion of it as well. We now give a brief recapitulation, with commentary. At first, Camps considers an enterprise with four ingredients: Dealer, Product, State, and Concession, where Concession is a ternary relationship among the other three, which are implemented as entity types. Each ingredient has attributes with fairly obvious semantics, paraphrased here: d-Id, d-Address; p-Id, p-Type; s-Id, s-Capital; and c-Date. The last attribute represents the date on which a given state awards a concession to a given dealer for a given product. As for functional dependencies, besides the usual ones, we are told that for a given state/product combination there can only be one dealer. Thus, a minimal set of dependencies is as follows:
d-Id → d-Address
p-Id → p-Type
s-Id → s-Capital
s-Id, p-Id → d-Id, c-Date


An ERD for this is given in Figure 1 (attributes are omitted in the figures for the sake of clarity), and the obvious relational database schema is as follows:
Dealer(d-Id, d-Address), Product(p-Id, p-Type), State(s-Id, s-Capital), Concession(s-Id, p-Id, d-Id, c-Date), with {s-Id, p-Id} as the key of Concession.

Fig. 1. Example of 1:N:N relationship (from Figure 1 in [4], modified)
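As a concrete illustration of this mapping, the DDL below renders the four relation schemas in SQL. The column types are placeholders chosen for this sketch and are not taken from [4].

```sql
CREATE TABLE Dealer  (d_Id INTEGER PRIMARY KEY, d_Address VARCHAR(200));
CREATE TABLE Product (p_Id INTEGER PRIMARY KEY, p_Type    VARCHAR(50));
CREATE TABLE State   (s_Id INTEGER PRIMARY KEY, s_Capital VARCHAR(100));

CREATE TABLE Concession (
  s_Id   INTEGER NOT NULL,
  p_Id   INTEGER NOT NULL,
  d_Id   INTEGER NOT NULL,
  c_Date DATE,
  PRIMARY KEY (s_Id, p_Id),                      -- a state/product pair has one dealer
  FOREIGN KEY (s_Id) REFERENCES State  (s_Id),
  FOREIGN KEY (p_Id) REFERENCES Product(p_Id),
  FOREIGN KEY (d_Id) REFERENCES Dealer (d_Id)
);
```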

The foreign key constraints derive here from the two components of Concession's key, which are primary keys of their native schemas. Since the only functional dependencies are those induced by keys, the schema is in BCNF. Here Camps imposes two further functional dependencies:
p-Id → d-Id
s-Id → d-Id

In other words, if a product is offered as a concession, then it can only be with a single dealer, regardless of the state; and analogously on the state-dealer side. The author is understandably unhappy about the absence of a standard ERD approach to accommodate the resulting binary constraining relationships (using the language of [12]), which he renders in a rather UML-like fashion [17], similar to Figure 2. At this point, in order to highlight the generic structure, he introduces new notation (A, B, C, D for State, Dealer, Product, Concession, respectively). However, we will keep the current names for the sake of familiarity, while still following the structure of his narrative. He notes that the resulting relational database schema includes the non-3NF relation schema Concession(s-Id, p-Id, d-Id, c-Date). Further, when Camps wishes to impose the constraints that a state (respectively product) instance can determine a dealer if and only if there has been a concession arranged with some product (respectively state), he expresses them with two conditions, one for the state side and one for the product side.

Each of these can be viewed as a double inclusion dependency and must be expressed using the CHECK construct in SQL.
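As an illustration of what such a condition looks like in SQL, the sketch below enforces the state-side requirement: a state appears with a dealer in the binary constraining relationship exactly when it appears in some concession. The table name StateDealer for that binary relationship is our own assumption for the example, and CREATE ASSERTION, while standard SQL, is rarely supported by commercial systems, which is why conditions of this kind are usually implemented with triggers instead.

```sql
-- Assumed table for the binary constraining relationship between State and Dealer.
-- CREATE TABLE StateDealer (s_Id INTEGER PRIMARY KEY, d_Id INTEGER NOT NULL);

CREATE ASSERTION state_dealer_iff_concession CHECK (
  NOT EXISTS (SELECT s_Id FROM StateDealer
              WHERE  s_Id NOT IN (SELECT s_Id FROM Concession))   -- every constrained state occurs in a concession
  AND
  NOT EXISTS (SELECT s_Id FROM Concession
              WHERE  s_Id NOT IN (SELECT s_Id FROM StateDealer))  -- every conceded state occurs in the binary relationship
);
```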


Fig. 2. Two imposed FDs (from Figure 2 of [4])

Now we note that it is actually possible to capture the structural properties of the enterprise at this stage with the simple (i.e., ternary-free) ERD of either Figure 3a [13] or Figure 3b [18]. The minimal set of associated functional dependencies in Figure 3a is as follows:
d-Id → d-Address
p-Id → p-Type, d-Id
s-Id → s-Capital, d-Id
s-Id, p-Id → c-Date

One therefore obtains the following relational database schema, which is, of course, in BCNF, since all functional dependencies are due to keys:
Dealer(d-Id, d-Address), Product(p-Id, p-Type, d-Id), State(s-Id, s-Capital, d-Id), Concession(s-Id, p-Id, c-Date)

Fig. 3a. A binary model of Figure 2 with Concession as an M:N relationship


Fig. 3b. A binary model of Figure 2 with Concession as an intersection (associative) entity

Admittedly, this approach loses something: the ternary character of Concession. However, any dealer information relevant to a concession instance can be recovered by a simple join, and a view can also be conveniently defined. The ternary relationship in Figure 2 is therefore something of a red herring when constraining binary relationships are imposed on a ternary relationship. In other words, it is possible that an extension of the standard ERD language to include n-ary relationships being constrained by m-ary ones would be a very desirable feature, but its absence is not a surprising one. Jones and Song showed that the ternary schema with the FDs imposed in Figure 2 has a lossless decomposition, but cannot have an FD-preserving schema (Pattern 11 in [13]). Camps now arrives at the same schema (E) (by normalizing his non-3NF one, not by way of our ERD in Figure 3a). The problem he sees is incorporating the semantics of (C); he develops three conditions to capture it.

The last two of these conditions do not seem to make sense syntactically; the intention is most likely a version that keeps the first condition as is and rephrases the other two.

At any rate, Camps shows how SQL can accommodate these conditions too, using CHECKs in the form of ASSERTIONs, but he considers any such effort (needing conditions beyond key dependencies and inclusion constraints) to be anomalous. We feel that this is not so surprising a situation after all. The complexity of real-world database design is so great that, on the contrary, it is quite common to encounter situations where many integrity constraints are not expressible in terms of functional and inclusion dependencies alone. Instead, one must often use the type of constructions that Camps shows us, or use triggers, to implement complex real-world integrity constraints.
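For readers unfamiliar with the trigger route, the sketch below shows one way to enforce, on the original ternary Concession table, the earlier rule that a product may only ever be conceded to a single dealer, whatever the state. It is written in PostgreSQL's trigger dialect purely to illustrate the general technique; it is not code from [4], and the details vary across database systems.

```sql
CREATE FUNCTION check_single_dealer() RETURNS trigger AS $$
BEGIN
  -- Reject the row if this product already has a concession with a different dealer.
  IF EXISTS (SELECT 1
             FROM   Concession c
             WHERE  c.p_Id = NEW.p_Id
               AND  c.d_Id <> NEW.d_Id) THEN
    RAISE EXCEPTION 'product % is already conceded to another dealer', NEW.p_Id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER one_dealer_per_product
BEFORE INSERT OR UPDATE ON Concession
FOR EACH ROW EXECUTE FUNCTION check_single_dealer();
```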

3 The Badia Paper

In his paper [2], Badia in turn revisits the ER model because of its usefulness and importance.


He contends that, as database applications become more complex and sophisticated and the need to capture more semantics grows, the ER model should be extended with more powerful constructs to express richer semantics and varied constraints. He presents six scenarios that apparently illustrate inadequacies of the ER model; he classifies the first five as relationship constraints that the model cannot incorporate and the sixth as an attribute constraint. We feel that some of the examples he marshals, described below in Sections 3.4 and 3.5, are questionable, leading us to ask whether they warrant extending the model. Badia does, however, discuss the downside of overloading the model, including a thoughtful mention of the tradeoff between minimality and power. In this section we give a brief recapitulation of the examples, together with our analyses.

3.1 Camps Redux

In this portion of his paper, Badia presents Camps' illustrations and conclusions, which he accepts. We've already discussed this.

3.2 Commutativity in ERD's

In mathematical contexts, we call a diagram commutative [14] if all different routes from a common source to a common destination are equivalent. In Figure 4, from Badia's paper (there called Figure 1), there are two different ways to navigate from Course to Department: directly, or via the Teacher entity. To say that this particular diagram commutes, then, is to say that for each course, its instructor must be a faculty member of the department that offers it. Again, there is a SQL construct for indicating this. Although Badia does not use the term, his point here is that there is no mechanism for ERD's to indicate a commutativity constraint. This is correct, of course. Consider, however, representing this kind of multi-relation constraint in a diagram with over 50 entities and relationships, which is quite common in real-world applications. We believe, therefore, that this kind of multi-relation constraint is better specified in a textual or language-oriented syntax, such as OCL [17], rather than at the diagram level. In this way, a diagram can clearly deliver its major semantics without incurring visual overload and clutter.
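To make the constraint concrete, suppose the ERD of Figure 4 is mapped to relations in the usual way, say Course(course_id, dept_id, teacher_id) and Teacher(teacher_id, dept_id); these table layouts are our assumption for the example and are not part of [2]. The commutativity condition, namely that a course's instructor belongs to the department that offers the course, can then be stated declaratively, for instance as a (rarely implemented) SQL assertion:

```sql
CREATE ASSERTION course_offering_commutes CHECK (
  NOT EXISTS (
    SELECT *
    FROM   Course  c
    JOIN   Teacher t ON c.teacher_id = t.teacher_id
    WHERE  c.dept_id <> t.dept_id      -- instructor's department differs from the offering department
  )
);
```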

Fig. 4. An example of multi-paths between two entities (from Figure 1 in [2])

Entity-Relationship Modeling Re-revisited

49

In certain limited situations [8] the Offers relationship might be superfluous and recoverable by composing the other two relationships (or, in the relational database schema, by performing the appropriate joins). We would need to be careful about dropping Offers, however. For example, if a particular course were at present unstaffed, then the Teaches link would be broken; this is the case when the Course entity has partial (optional) participation on the path to the Department entity. Without an explicit Offers instance, we would not know which department offers the course. This is an example of a chasm trap, which requires an explicit Offers relationship [6]. Another case where we could not rely on merely dropping one of the relationship links would arise if a commutative diagram involved the composition of two relationships in each path; then we would surely need to retain them both and to implement the constraint explicitly. We note that allowing cycles and redundancies in ERD's has been a topic of research in the past. Atzeni and Parker [1] advise against it; Markowitz and Shoshani [15] feel that it is not harmful if it is done right. Dullea and Song [8, 9] provide a complete analysis of redundant relationships in cyclic ERD's. Their decision rules on redundant relationships are based on both maximum and minimum cardinality constraints.

3.3 Acyclicity of a Recursive Closure

Next, Badia considers the recursive relationship ManagerOf (on an Employee entity). He would like to accommodate the hierarchical property that nobody can be an indirect manager of him- or herself. Again, we agree with this observation, but we are unsure how desirable such an ER feature would be at the diagram level. Badia points out that this is a problem even at the level of the relational database, although some Oracle releases can now accommodate the constraint.
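At the relational level, the acyclicity requirement can at least be checked with a recursive query. The sketch below assumes the recursive ManagerOf relationship is stored as a self-referencing column in Employee(emp_id, mgr_id), which is our own mapping for the example, and it lists any employee who turns out to be his or her own direct or indirect manager; the depth bound guards against non-termination when a cycle is actually present.

```sql
WITH RECURSIVE chain(emp_id, ancestor_id, depth) AS (
  SELECT emp_id, mgr_id, 1
  FROM   Employee
  WHERE  mgr_id IS NOT NULL
  UNION ALL
  SELECT c.emp_id, e.mgr_id, c.depth + 1           -- follow the management chain upwards
  FROM   chain c
  JOIN   Employee e ON c.ancestor_id = e.emp_id
  WHERE  e.mgr_id IS NOT NULL
    AND  c.depth < 1000                            -- safety bound in case the data is cyclic
)
SELECT DISTINCT emp_id
FROM   chain
WHERE  emp_id = ancestor_id;                       -- an employee reachable from itself: a cycle
```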

3.4 Fan Traps

At this point the author brings Figure 5 (adapted from [6], where it appears as Figure 11.19(a); for Badia it is Figure 2) to our attention. (The original figure uses the "Merise," or "look here," approach [17]; we have modified it to make it consistent with the other figures in this paper.) The problem, called a fan trap, arises when one attempts to enforce a constraint that a staff person must work in a branch operated by her/his division. This ER anomaly percolates to the relational schemas as well. Further, if one attempts to patch things up by including a third binary link, between Staff and Branch, then one is faced with the commutativity dilemma of Section 3.2. In general, fan traps arise when there are two 1:N relationships from a common entity type to two different destinations. The two typical solutions for fan traps are either to add a third relationship between the two many-side entities or to rearrange the entities to make the connection unambiguous. The problem in Figure 5 is simply caused by an incorrect ERD and can be resolved by rearranging the entities as shown in Figure 6, which avoids the difficulties at both the ER and relational levels. In fact, this fix is even exhibited in the Connolly source itself. We note that the chasm trap discussed in Section 3.2 and the fan trap are commonly called connection traps [6], which make the connection between two entities separated by a third entity ambiguous.

50

Don Goelman and Il-Yeol Song

Fig. 5. A semantically wrong ERD with a fan trap (from Figure 2 in [2] and Figure 11.19(a) from [6])

Fig. 6. A correct ERD of Figure 5, after rearranging entities

3.5 Temporal Considerations

Here Badia looks at a Works-in relationship, M:N between Employee and Project, with attributes start-date and end-date. A diagram for this might look something like Figure 7b; for the purposes of clarity, most attributes have been omitted. Badia states that the rule that, even though an employee may work in many projects, an employee may not work in two projects at the same time cannot be represented in an ERD. It does appear impossible to express the rule, even though the relationship is indeed M:N. But wouldn't this problem be solved by creating a third entity type, TimePeriod, with the two date attributes as its composite key, and letting Works-in be ternary? The new relationship would be M:N:1, as indicated in Figure 7c, with the 1 on the Project node, of course. In Figures 7a through 7d, we show several variations of this case related to capturing the history of the Works-in relationship and the above constraint; a query sketch for checking the non-overlap rule at the relational level follows the figure captions below. We will comment additionally on this in Section 4.

Fig. 7a. An employee may work in only one project, and each project can have many employees. The diagram already assumes that an employee must work for only one project at a time. This diagram is not intended to capture any history of the Works-in relationship.

Fig. 7b. An employee may work in many projects, and each project may have many employees. The diagram assumes that an employee may work for many projects at the same time. This diagram is also not intended to capture any history of the Works-in relationship.


Fig. 7c. An employee may work in only one project at a time. This diagram can capture a history of an employee's Works-in relationships with projects and still satisfies the constraint that an employee may work in only one project at a time.

Fig. 7d. In Figure 7c, if the entity TimePeriod is not easily materialized, we can reify the relationship Works-in into an intersection entity. This diagram can capture the history of the Works-in relationship, but does not satisfy the constraint that an employee may work in only one project at a time.
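The following query checks the Section 3.5 rule at the instance level. Assuming the reified Works-in of Figure 7d is stored as Works_in(emp_id, proj_id, start_date, end_date), a table layout we assume for the example, it lists every employee assigned to two different projects with overlapping periods, i.e., every violation of the constraint.

```sql
SELECT DISTINCT w1.emp_id
FROM   Works_in w1
JOIN   Works_in w2
       ON  w1.emp_id  = w2.emp_id
       AND w1.proj_id < w2.proj_id              -- two different projects, each pair considered once
WHERE  w1.start_date <= w2.end_date             -- the two assignment periods overlap
  AND  w2.start_date <= w1.end_date;
```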

3.6 Range Constraints

While the five previous cases exemplify what Badia calls relationship constraints, this one is an attribute constraint. The example given uses the following two tables:
Employee(employee_id, rank_id, salary, ...)
Rank(rank_id, max_salary, min_salary)
The stated problem is that an ERD representing the above schema cannot express the fact that the salary of an employee must be within the range determined by his or her rank. Indeed, in order to enforce this constraint, explicit SQL code must be generated. Badia correctly states that the absence of information at the attribute level is a limitation and causes difficulty in resolving semantic heterogeneity. We believe, however, that information and constraints at the attribute level could be expressed at the data dictionary level or in a separate low-level diagram below the ERD level. Again, this keeps the ERD as a conceptual model at the enterprise level without too much clutter. Consider the complexity of representing attribute constraints in ERDs for real-world applications that have over 50 entities and several hundred attributes. The use of a CASE tool that supports a conceptual ERD together with a lower-level diagram for attributes and/or its associated data dictionary seems the right direction for this problem.
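The "explicit SQL code" for this example can take the form below, using the two tables named above. CREATE ASSERTION is standard SQL but rarely available, so in practice the same condition would be wrapped in triggers on both tables; either way, this is a sketch of the technique rather than code from [2].

```sql
CREATE ASSERTION salary_within_rank CHECK (
  NOT EXISTS (
    SELECT *
    FROM   Employee e
    JOIN   Rank     r ON e.rank_id = r.rank_id
    WHERE  e.salary < r.min_salary
       OR  e.salary > r.max_salary     -- salary outside the band allowed for the rank
  )
);
```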

4 General Cardinality Constraints

While on the whole, as indicated above, we feel that many of the alleged shortcomings of the ER model claimed in recent papers are not justified, some of those points are well taken and quite interesting. However, there is another important feature of conceptual design that we shall consider here, one that the ER model really does lack. In this section, we briefly discuss McAllister's general cardinality constraints [16] and their implications. McAllister's setting is a general n-ary relationship R. In other words, R involves n different roles. This term is used, rather than entity types, since the entity types may not all be distinct.


For example, a recursive relationship, while binary in the mathematical sense, involves only a single entity type. Given two disjoint sets of roles A and B, McAllister defines Cmax(A,B) and Cmin(A,B) as follows: for a tuple a, with one component from each role in A, and a tuple b, with one component from each role in B, let us denote by ab the tuple generated by the two sets of components; we recall that A and B are disjoint. Then Cmax(A,B) (respectively Cmin(A,B)) is the maximum (respectively minimum) allowable number, taken over all tuples a of instances of the roles in A, of tuples b such that ab appears in the relationship. For example, consider the Concession relationship of Figure 1. To say that Cmax({State, Product},{Dealer}) = 1 is to express the fact that a given state/product pair determines a single dealer (the dependency s-Id, p-Id → d-Id), and the condition Cmin({Product},{State,Dealer}) = 1 is equivalent to the constraint that Product is total on Concession. Now, as we see from these examples, Cmax gives us information about functional dependencies and Cmin about participation constraints. When B is a singleton set and A its complement, this is sometimes called the "Chen" approach to cardinality [11], or "look across"; when A is a singleton set and B its complement, it is called the "Merise" approach [11], or "look here." All told, McAllister shows that the number of possible combinations for A and B grows exponentially in n, the number of different roles. Clearly, given this explosive growth, it is impractical to include all possible cardinality constraints in a general ERD, although McAllister shows a tabular approach that works quite well for ternary relationships. He shows further that there are many equalities and inequalities that must hold among the cardinalities, so that the entries in the table are far from independent. The question arises as to which cardinalities have the highest priority and should thus appear in an ERD. It turns out that the Merise and Chen approaches give the same information in the binary case but not in the ternary one, which becomes the contentious case (n > 3 is rare enough not to be a serious issue). In fact one finds both Chen [as in 10] and Merise [as in 3] systems in practice. In his article, Genova [11] feels that UML [17] made the wrong choice by using the Chen method for its Cmin's, and he suggests that class diagrams include both sets of information (but only when either A or B is a singleton). That does not seem likely to happen, though. Still, consideration of these general cardinality constraints and McAllister's axioms comes in handy in a couple of the settings we have discussed. The general setting helps in understanding the connections between, for example, ternary and related binary relationships, as in Figure 2 and [12]. It similarly sheds light on the preservation (and loss) of information in Section 3.5 above, when a binary relationship is replaced by a ternary one. Finally, we believe that it also provides the deep structural information needed to describe the properties of decompositions of the associated relation schemas. It is therefore indisputable, in our opinion, that these general cardinality constraints do much to describe the fundamental structure of a relationship in the ER model, only portions of which, like the tip of an iceberg, are currently visible in a typical ERD. And yet we are not claiming that such information should routinely be included in the model.
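To connect these definitions back to stored data, the queries below compute the instance-level counterparts of the two example cardinalities over Concession and Product tables laid out as in the Section 2 sketch (underscore column names, which are our own convention). The declared Cmax/Cmin are constraints; these queries simply measure what a given database state exhibits. Note that Cmin({Product},{State,Dealer}) drops to 0 as soon as some product has no concession at all, i.e., participation is partial.

```sql
-- Instance-level Cmax({State, Product}, {Dealer})
SELECT MAX(dealer_cnt) AS cmax
FROM  (SELECT s_Id, p_Id, COUNT(DISTINCT d_Id) AS dealer_cnt
       FROM   Concession
       GROUP  BY s_Id, p_Id) AS per_pair;

-- Instance-level Cmin({Product}, {State, Dealer}); products without concessions contribute 0
SELECT MIN(combo_cnt) AS cmin
FROM  (SELECT p.p_Id, COUNT(c.s_Id) AS combo_cnt   -- (state, dealer) combinations per product
       FROM   Product p
       LEFT JOIN Concession c ON c.p_Id = p.p_Id
       GROUP  BY p.p_Id) AS per_product;
```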

5 Conclusion


We have reviewed recent literature ([4] and [2]) that illustrates, through some interesting examples, areas of conceptual database design that the Entity-Relationship model does not at present accommodate sufficiently. However, some of these examples seem not to hold up under scrutiny. Capabilities that the model does indeed lack are constraints on commutative diagrams (Section 3.2 above), recursive closures (3.3), and some range conditions (3.6), as pointed out by Badia. Another major conceptual modeling tool missing from the ER model is that of general cardinality constraints [16]. These constraints are the deep structure that underlies such more visible behavior as constraining and related relationships, Chen and Merise cardinality constraints, functional dependencies and decompositions, and participation constraints. How many of these missing features should actually be incorporated into the ER model is largely a question of triage, of weighing the benefits of a feature against the danger of circuit overload. We believe that some complex constraints, such as multi-relation constraints, are better represented in a textual or language-oriented syntax, such as OCL [17], than at the ER diagram level. We also believe that information and constraints at the attribute level could be expressed at the data dictionary level or in a separate low-level diagram below the ERD level. In these ways, the ERD is kept as a conceptual model at the enterprise level, delivering the major semantics without visual overload and too much clutter. Consider the complexity of an ERD for a real-world application that has over 50 entities and hundreds of attributes, and of representing all of its complex multi-relation and attribute constraints in the ERD. The use of a CASE tool that supports a conceptual ERD together with a lower-level diagram for attributes and/or an associated data dictionary seems the right direction for this problem. We note that we do not claim that the research topics suggested by Badia, such as relationships over relationships and attributes over attributes, are uninteresting or unworthy; research on those topics would bring interesting new insights and powerful ways of representing complex semantics. What we claim here is that the ERD itself has much value as it is now, especially for relational applications, as all of Badia's examples indicate. We believe, however, that extending the ER model to support new application semantics, such as those of biological applications, should be encouraged. The "D" in ERD connotes to many researchers and practitioners the simplicity and power of communication that account for the model's popularity. Indeed, as the Entity-Relationship model nears its thirtieth birthday, we find its robustness remarkable.

References
1. Atzeni, P. and Parker, D.S., "Assumptions in relational database theory", in Proceedings of the ACM Symposium on Principles of Database Systems, March 1982.
2. Badia, A., "Entity-Relationship Modeling Revisited", SIGMOD Record, 33(1), March 2004, pp. 77-82.
3. Batini, C., Ceri, S., and Navathe, S., Conceptual Database Design, Benjamin/Cummings, 1992.
4. Camps Paré, R., "From Ternary Relationship to Relational Tables: A Case against Common Beliefs", SIGMOD Record, 31(2), June 2002, pp. 46-49.
5. Chen, P., "The Entity-Relationship Model – towards a Unified View of Data", ACM Transactions on Database Systems, 1(1), 1976, pp. 9-36.
6. Connolly, T. and Begg, C., Database Systems, Addison-Wesley, 2002.
7. Dietrich, S. and Urban, S., Beyond Relational Databases, Prentice-Hall, to appear.
8. Dullea, J. and Song, I.-Y., "An Analysis of Cardinality Constraints in Redundant Relationships", in Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM97), Las Vegas, Nevada, USA, Nov. 10-14, 1997, pp. 270-277.


9. Dullea, J., Song, I.-Y., and Lamprou, I., "An Analysis of Structural Validity in Entity-Relationship Modeling", Data and Knowledge Engineering, 47(3), 2003, pp. 167-205.
10. Elmasri, R. and Navathe, S.B., Fundamentals of Database Systems, Addison-Wesley, 2003.
11. Genova, G., Llorenz, J., and Martinez, P., "The meaning of multiplicity of n-ary associations in UML", Journal of Software and Systems Modeling, 1(2), 2002.
12. Jones, T. and Song, I.-Y., "Analysis of binary/ternary cardinality combinations in entity-relationship modeling", Data & Knowledge Engineering, 19(1), 1996, pp. 39-64.
13. Jones, T. and Song, I.-Y., "Binary Equivalents of Ternary Relationships in Entity-Relationship Modeling: a Logical Decomposition Approach", Journal of Database Management, 11(2) (April-June 2000), pp. 12-19.
14. MacLane, S., Categories for the Working Mathematician, Springer-Verlag, 1971.
15. Markowitz, V. and Shoshani, A., "Representing Extended Entity-Relationship Structures in Relational Databases: A Modular Approach", ACM Transactions on Database Systems, 17(3), 1992, pp. 423-464.
16. McAllister, A., "Complete rules for n-ary relationship cardinality constraints", Data & Knowledge Engineering, 27, 1998, pp. 255-288.
17. Rumbaugh, J., Jacobson, I., and Booch, G., The Unified Modeling Language Reference Manual, Addison-Wesley, 1999.
18. Song, I.-Y., Evans, M., and Park, E.K., "A Comparative Analysis of Entity-Relationship Diagrams", Journal of Computer and Software Engineering, 3(4) (1995), pp. 427-459.
19. Teorey, T., Database Modeling & Design, Morgan Kaufmann, 1999.

Modeling Functional Data Sources as Relations Simone Santini and Amarnath Gupta* University of California, San Diego

Abstract. In this paper we present a model of functional access to data that, we argue, is suitable for modeling a class of data repositories characterized by functional access, such as web sites. We discuss the problem of modeling such data sources as a set of relations, of determining whether a given query expressed on these relations can be translated into a combination of functions defined by the data sources, and of finding an optimal plan to do so. We show that, if the data source is modeled as a single relation, an optimal plan can be found in a time linear in the number of functions in the source but, if the source is modeled as a number of relations that can be joined, finding the optimal plan is NP-hard.

1 Introduction These days, we see a great diversification in the type, structure, and functionality of the data repositories with which we have to deal, at least when compared with as little as fifteen or twenty years ago. Not too long ago, one could quite safely assume that almost all the data that a program had both the need and the possibility to access were stored in a relational database or, were this not the case, that the amount of data, their stability, and their format made their insertion into a relational database feasible. As of today, such a statement would be quite undefensible. A large share of the responsibility for this state of affairs must be ascribed, of course, to the rapid diffusion of data communication networks, which created a very large collection of data that a person or a program might want to use. Most of the data available on data communication networks, however, are not in relational form [1] and, due to the volume and the instability of the medium, the idea of storing them all into a stable repository is quite unfeasible. The most widely known data access environment of today, the world-wide web, was created with the idea of displaying reasonably well formatted pages of material to people, and of letting them “jump” from one page to another. It followed, in other words, a rather procedural model, in which elements of the page definition language (tags) often stood for actions: a link specified a “jump” from one page to another. While a link establishes a connection between two pages, this connection is not symmetric (a link that carries you from page A to page B will not carry you from page B to page A) and therefore is not a relation between two pages (in the sense in which the term *

The work presented in this paper was done under the auspices and with the funding of NIH project NCRR RR08 605, Biomedical Informatics Research Network, which the authors gratefully acknowledge.



“relation” is used in databases), but rather a functional connection that, given page A, will produce page B. In addition to this basic mechanism, today many web sites that contain a lot of data allow one to specify search criteria using the so-called forms. A form is an input device through which a fixed set of values can be assigned to an equally fixed set of parameters, the values forming a search criterion against which the data in the web site will be matched, returning the data that satisfy the criterion. Consider the web site of a public library (an example to which we will return in the following). Here one can have a form that, given the name of an author returns a web page (or other data structures) containing the titles of the books written by that author. This doesn’t imply that a corresponding form will exist that, given the title of a book, will return its author. In other words, the dependence is not necessarily invertible. This limitation tells us that we are not in the presence of a set of relations but, rather, in the presence of a data repository with functional access. The diffusion of the internet as a source of data has, of course, generated a great interest in the conceptual modeling of web sites [2–4]. In this paper we present a formalization of the problem of representing a functional data source as a set of relations, and of translating (whenever possible) relational queries into sequences of functions.

2 The Model For the purpose of this work, a functional data source is a set of procedures that, given a number of attributes whose value has been fixed, instructs us on how to obtain a data structure containing further attributes related to the former. To fix the ideas, consider again the web site of a library. A procedure is defined that, given the name of an author, retrieves a data structure containing the titles of all the books written by that author. The procedure for doing so looks something like this: Procedure 1: author - > set(title) i) go to the “search by author” page; ii) put the desired name into the “author” slot of the form that you find there; iii) press the button labeled “go”; iv) look at the page that will be displayed next, and retrieve the list of titles. Getting the publisher and the year of publication of a book, given its author and title is a bit more complicated: Procedure 2: author, title - > publisher, year i) execute procedure 1 and get a list of titles; ii) search the desired title in the list; iii) if found then iii.1) access the book page, by clicking on the title; iii.2) search the publisher and year, and return them; iv) else fail. On the other hand, in most library web pages there is no procedure that allows one to obtain a list of all the books published by a given publisher in a given year, and a query asking for such information would be impossible to answer.
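To make the functional-access idea concrete, here is a minimal Python sketch (ours, not the paper's) in which the two procedures above are the only access paths to a toy catalog; the data and names are invented for illustration:

    # Toy catalog standing in for the library web site (invented data).
    BOOKS = [
        {"author": "A. Smith", "title": "Rivers", "publisher": "Dover", "year": 1997},
        {"author": "A. Smith", "title": "Lakes", "publisher": "Wiley", "year": 2001},
    ]

    def titles_by_author(author):
        # Procedure 1: author -> set(title)
        return {b["title"] for b in BOOKS if b["author"] == author}

    def publisher_and_year(author, title):
        # Procedure 2: author, title -> publisher, year (fails if the title is not found)
        for b in BOOKS:
            if b["author"] == author and b["title"] == title:
                return b["publisher"], b["year"]
        raise LookupError("no such book for this author")

    # Note what is missing: there is no procedure publisher, year -> set(title),
    # so a query on that access path cannot be answered, even though the data
    # would support it in a truly relational source.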


We start by giving an auxiliary definition, and then we give the definition of the kind of functional data sources that we will consider in the rest of the paper. Definition 1. A data sort S is a pair (N, T), written S = N : T, where N is the name of the sort, and T its type. Two data sorts are equal if their names and their types coincide. A data sort, in the sense in which we use the term here, is not quite a "physical" data type. For instance, author:string and title:string are both of the same data type (string) but they represent different data sorts1. The set of complex sorts is the transitive closure of the set of data sorts with respect to Cartesian product of sorts and the formation of collection types (sets, bags, and lists). Definition 2. A functional data source is a pair (S, F) where S = {S1, ..., Sn} is a set of data sorts and F = {f1, ..., fm} is a set of functions fi : Pi -> Qi, where both Pi and Qi are composite sorts made of sorts in S.

In the library web site, author:string, and year:int are examples of data sorts. The procedures are instantiations of functions. Procedure 1, for example, instantiates a function The elements “author:string” and “title:string” are examples of composite sorts. Sometimes, when there is no possibility of confusion, we will omit the type of the sort. Our goal in this paper is to model a functional data source like this one in a way that resembles a set of relations upon which we can express our query conditions. To this end, we give the following definition. Definition 3. A relational model of a functional data source is a set of relations where and all the are sorts of the functional data source. The relation is called a relational façade for the underlying data source, and will sometimes be indicated as The problems we consider in this paper are the following: (1) Given a model R of a functional data source (S, F) and a query on the model, is it possible to answer the query using the procedures defined for the functional data source?, and (2) if the answer to the previous question is “yes,” is it possible to find an optimal sequence of procedures that will answer the query with minimal cost? It goes without saying that not all the queries that are possible on the model are also possible on the data source. Consider again the library web site; a simple model for this data source is composed of a single relation, that we can call “book,” and defined as: book(name:string, title:string, publisher:string, year:int). 1

The entities that we call data sorts are known in other quarters as “semantic data types.” This name, however, entails a considerable epistemological commitment, quite out of place for a concept that, all in all, has nothing semantic about it: an author:string is as syntactic an entity as any abstract data type, and does not require extravagant semantic connotations.


A query like (N, T):- book(N, T, ‘dover’, 1997), asking for the author and title of all books published by dover in 1997 is quite reasonable in the model, but there are no procedures on the web site to execute it. We will assume, to begin with, that the model of the web site contains a single relation. In this case we can also assume, without loss of generality, that the relation is defined in the Cartesian product of all the sorts in the functional data source: Throughout this paper, we will only consider non-recursive queries. It should be clear in the following that recursive queries require a certain extension of our method, but not a complete overhaul of it. Also, we will consider conjunctive queries2, whose general form can be written as:

where are constants, all the S’s come from the sorts of the relation R, and the are comparison operators drawn from a suitable set, say We will for the moment assume that the functional data source provides no mechanism for verifying conditions of the type The only operations allowed are retrieving data by entering values (constants) in a suitable field of a form or traversing a link in a web site with a constant as a label (such as the title of a book in the library example). Given the query (1) in a data source like this, we would execute it by first determining whether the function can be computed. If it can, we compute and, for each result returned, check whether the conditions are verified. The complicated part of this query schema is the first step: the determination of the function that, given the constants in the query, allows us to obtain the query outputs augmented with all the quantities needed for the comparisons.

3 Query Translation Informally, the problem that we consider in this section is the following. We have a collection of data sorts Given two data sorts defined as Cartesian products of elements of and one can define a formal (and unique) correspondence function This function operates on the model of the data source (this is why we used the adjective “formal” for it: it is not necessarily a function that one can compute) and, given the values returns the corresponding values If are the input values, this function computes the relational algebra operation

where the N’s are the names of the sorts S, as per definition 1. A correspondence function can be seen, in other words, as the functional counterpart of the query (2) 2

Any query can, of course, be translated in a disjunctive normal form, that is, in a disjunction of conjunctive queries. The system in this case will simply pose all the conjunctive queries and then take the union of all the results.


which, on a single table, is completely general. (Remember that we don’t yet consider conditions other than the equality with a constant.) The set of all correspondence functions contains the grounding of all queries that we might ask on the model. The functional data source, on the other hand, has procedures each one of which implements a specific function a situation that we will indicate with The set of all implemented correspondence functions if Our query implementation problem is then, given a query with the relative correspondence function to find a suitable combination of functions in that is equal to In order to make this statement more precise, we need to clarify what do we mean by “suitable combination of functions” that is, we need to specify a function algebra. We will limit our algebra to three simple operations that create sequences of functions, as shown in Table 1. (We assume, pragmatically, that more complex manipulations are done by the procedures

A function for which a procedure is defined, and that transforms a data sort S into a data sort P can be represented as a diagram

The operators of the function algebra generate diagrams like those in the first and third column of Table 2. In order to obtain the individual data types, we introduce the formal operator of projection. The projection is "formal" in that it exists only in the diagrams: in practice, when we have the data type P × Q we simply select the portion of it that we need. The projections don't correspond to any procedure and their cost is zero. The dual of the projection operator is the Cartesian product which, given two data of type A and B, produces from them a datum of type A × B. This is also a formal operator with zero cost. In the diagrams, a dotted line with the × symbol reminds us that we are using a Cartesian product operator, and the arrow goes from the type that will appear first in the product to the type that will appear second (we will omit the arrow when this indication is superfluous). The Cartesian product of two functions is represented by a similar diagram.

With these operations, and the corresponding diagrams, in place, we can arrange the correspondence functions in a diagram, which we call the computation diagram of a data source.


Definition 4. The computation diagram of a functional data source is a graph G = (N, E) with nodes labeled by a labeling function S being the set of composite data sorts of the source, and edges labeled by the labeling function such that each edge is one of the following: 1. A function edge, such that if the edge is and represented as in (3); 2. projection edges, 3. cartesian product edges

then

Let us go back now to our original problem. We have a query and a correspondence function that we need to compute, where the inputs are the data sorts for which we give values and the outputs are the results that we desire. In order to see whether the computation is possible, we adopt the following strategy: first, we build the computation diagram of the data source; then we add a source node to the graph, connected to the sorts for which values are given, as well as a sink node with edges coming from the desired output sorts; finally, we check whether a path exists from the source node to the sink node. If we are to find an optimal solution to the grounding of a correspondence function, we need to assign a cost to each node of the graph and, in order to do this, we need to determine the cost of traversing an edge. The cost functions of the various combinations that appear in a computation graph are defined in table 2.

The problem of finding the optimal functional expression for a given query can therefore be reduced to that of finding the shortest path in a suitable function graph, a problem that we will now briefly elucidate. Let G be a function graph, G.V the set of its vertices, and G.E the set of its edges. For every node we record its distance from the source of the path, its predecessor(s) in the minimal path, and the set of nodes adjacent to it (accounting for the edge directions). In addition, a cost function vertex × vertex → real is defined, giving the cost of each edge; edges not in G.E have infinite cost. The algorithm in table 3 uses Dijkstra's shortest path algorithm to build a function graph that produces a given set of outputs from a given set of inputs, if such a graph


exists: the function returns the set of nodes in G where, for each node, the distance is set to the cost of the path from the source to that node according to the cost function. Dijkstra's algorithm is a standard one and is not reported here.
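As a simplified sketch of this step (ours; it collapses projection and Cartesian-product edges, which cost zero, into ordinary zero-cost edges), a Dijkstra search over a graph whose edges are labeled with procedures can both test feasibility and recover a cheapest plan:

    import heapq

    def cheapest_plan(edges, source, target):
        # edges: dict mapping a node (a composite data sort) to a list of
        # (next_node, cost, procedure_label) triples.
        dist, pred = {source: 0.0}, {}
        frontier = [(0.0, source)]
        while frontier:
            d, node = heapq.heappop(frontier)
            if d > dist.get(node, float("inf")):
                continue
            for nxt, cost, label in edges.get(node, []):
                nd = d + cost
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt], pred[nxt] = nd, (node, label)
                    heapq.heappush(frontier, (nd, nxt))
        if target not in dist:
            return None              # no path: the query cannot be grounded
        plan, node = [], target
        while node != source:
            node, label = pred[node]
            plan.append(label)
        plan.reverse()
        return plan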

4 Relaxing Some Assumptions The model presented so far is a way of solving a well known problem: given a set of functions, determine what other functions can be computed using their combination; our model is somewhat more satisfying from a modeling point of view because of the explicit inclusion of the cartesian product of data sorts and the function algebra operators necessary to take them into account but, from an algorithmic point of view, what we are doing is still finding the transitive closure of a set of functional dependencies. We will now try to ease some of the restrictions on the data source. These extensions, in particular the inclusion of joins, can’t be reduced to the transitive closure of a set of functional dependencies, and therein lies, from our point of view, the advantage of the particular form of our model. Comparisons. The first limitation that we want to relax is the assumption that the data source doesn’t have the possibility of expressing any of the predicates in the query (1). There are cases in which some limited capability in this sense is available.


We will assume that the following limitations are in place: firstly, the data source provides a finite number of predicate possibilities; secondly, each predicate is of the form s op r, where s and r are values of fixed data sorts S and R, and "op" is an operator that can be chosen amongst a finite number of alternatives. The general idea here comes, of course, from an attempt to model web sites in which conditions can be expressed as part of "forms." In order to incorporate these conditions into our method, one can consider them as data sorts: each condition is a data sort that takes values in the set of triples (s, op, r), with s of sort S and r of sort R. In other words, indicating a sort as a pair N : T, where N is the name and T the data type of the sort, a comparison data sort is isomorphic to a sort built from S, R, and 2, where 2 is the data type of the booleans. A procedure that accepts in input a value of a data sort and a condition on the data sorts would be represented as

The only difference between condition data sorts and regular data sorts is that conditions can’t be obtained as the result of a procedure, so that in a computation graph a condition should not have any incoming edge. Joins. Let us consider now the case in which the model of the functional data source consists of a number of relations. We can assume, for the sake of clarity, that there are only two relations in the model:

Each of these relations supports intra-relational queries that can be translated into functions and executed using the computation graph of that part of the functional source that deals with the data sorts in the relation. In addition, however, we have now queries that need to join data between the two relations. Consider the relations: and the following query:

We can compute this query in two ways. The first makes use of the following two correspondence functions:

To implement this query, we adopt the following procedure:


Procedure 3: i) use the computation graph of the first relation to compute the first correspondence function, returning a set of pairs; ii) for each pair returned: ii.1) compute the second correspondence function using the graph of the second relation, obtaining a set of results; ii.2) for each result, form the output pair and add it to the output. The procedure can be represented using a computation graph in which the graphs that compute the two correspondence functions are used as components. Let us indicate the graph that computes the function as:

Then a join like that in the example is computed by the following diagram:

The second possibility to compute the join is symmetric. While in this case we used the relation to produce the variable on which we want to join and the relation to impose the join condition, we will now do the reverse. We will use the functions

and a computation diagram similar to the previous one. Checking whether the source can process the join, therefore, requires checking if either the pairs of functions or can be computed. The concept can be easily extended to a source with many relations and a query with many joins as follows. Take a conjunctive query, and let the set of its joins, with We can always rewrite a query so that each variable X will appear in only one relation, possibly adding some join conditions. Consider, for example, the fragment R(A, X), P(B, X), Q(C, X), which can be rewritten as

We will assume that all queries are normalized in this way. Given a variable X, let be the relation in which X appear. Also, given a relation R in the query, let the Cartesian product of its input sorts, and the Cartesian product of its output sorts.


The algorithm for query rewriting is composed of two parts. The first is a function that determines whether a function from a given set of inputs to a given set of outputs can be implemented; it is represented in Table 4. The second finds a join combination that satisfies the query. It is assumed that the set of join conditions that appear in the query is given. The algorithm, reported in Table 5, returns a computation graph that computes the query with all the required joins.
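The two algorithms themselves (tables 4 and 5) are not reproduced here; the following Python fragment only sketches the symmetric check described above for a single join, assuming a helper can_compute(relation, inputs, outputs) that wraps the reachability test of Section 3, and assuming r1.bound and r1.outputs are the sets of sorts bound by constants and required as outputs for that relation (all names are ours, not the paper's):

    def join_is_executable(can_compute, r1, r2, join_var):
        # Direction 1: r1 produces the join variable, r2 consumes it as an input.
        first = (can_compute(r1, r1.bound, r1.outputs | {join_var}) and
                 can_compute(r2, r2.bound | {join_var}, r2.outputs))
        # Direction 2: the symmetric plan, with the roles of r1 and r2 swapped.
        second = (can_compute(r2, r2.bound, r2.outputs | {join_var}) and
                  can_compute(r1, r1.bound | {join_var}, r1.outputs))
        return first or second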

The correctness of the algorithm is proven in the following proposition: Proposition 1. Algorithm 1 succeeds if and only if the query with the required joins can be executed. The proof can be found in [5]. While the algorithm “joins” is an efficient (linear in the number of joins) way of finding a plan whose correctness is guaranteed, finding an optimal plan is inherently harder: Theorem 1. Finding the minimal set of functions that implements all the joins in the query is NP-hard. Proof. We prove the theorem with a reduction from graph cover. let G = (V, E) be a graph, with sets of nodes edges and with


Given such a graph, we build a functional source and a query as follows. For each node define a sort and a function All the sorts are of the same data type. For each edge define a condition Also, define a function Finally, define the relations and the query

where the equality conditions are derived from the edges of the graph. The reduction procedure is clearly polynomial so, in order to prove the theorem, we only need to prove that a solution of graph cover for G exists if and only if a cost-bound plan can be found for the query. 1. Suppose that a query plan for the query exists that uses B + 1 functions (the function that produces the required output Y must obviously be part of every plan, since it is the only function that gives us Y). Consider the set of nodes whose associated functions appear in the plan, which contains, clearly, B nodes, and consider an edge of the graph. This edge is associated to an equality condition in the query and, since the query has been successfully planned, the function associated with at least one of its endpoints is in the plan. Consequently, at least one of the endpoints is in the set, and the edge is covered. 2. Let now S be a covering, and consider the plan composed of the functions associated with the nodes of S, together with the function that produces Y. The output is clearly produced correctly as long as all the join conditions are satisfied. Consider a join condition; it corresponds to an edge of the graph and, since S is a covering, at least one of its endpoints is in S. Assume that it is the first endpoint (if it is the second we can clearly carry out a similar argument). Then the plan contains the function associated with that node, so that the two variables of the condition and the join itself are determined by the following graph fragment

5 Related Work The idea of modeling certain types of functional sources using a relational façade (or some modification thereof) is, of course, not new. The problem of conciliating the broad matching possibilities of a relation with the constraints deriving from the source has been solved in various ways the most common of which, to the best of our knowledge, is by the use of adornments [6,7], which also go under the name of binding patterns.


Given a relation, a binding pattern is a classification of the variables into input variables (which must be "bound" when the relation is accessed in the query, hence the name of the technique), output variables (which must be free when the relation is accessed), and dyadic variables, which can be indifferently inputs or outputs. Any query that accesses the relation by assigning values to the input variables and requiring values for some or all the output variables can be executed on that relation façade. A relational façade can, of course, have multiple binding patterns. If the relational façade is used to model a relation isomorphic to it, for instance, it allows all the possible bound/free binding patterns on its variables or, equivalently, all its variables are dyadic. In the following, a binding pattern for any relation will be represented as a string of the symbols i, o, and d (which stand for input, output, and dyadic, respectively, although dyadic variables will not appear in the examples that follow). Unlike our technique, which determines query feasibility at run time, binding patterns are determined as part of the model. This difference results in a number of limitations of binding patterns, some examples of which are given below.

We want to model this source as a pair of relations: and while the sort W should not be exported. Considering the two relations and the functions needed to answer queries on them, we can see that the relation has two binding patterns: and while has only A query such as would be rejected by the binding pattern verification system because produces a set of X values from the query constant but can’t take the X’s as an input, Mapping the query to a functional diagram, however, produces


which is computable. Therefore, the query can be answered using the model presented in this paper. Non-binding conditions. Binding patterns are based, as the name suggests, on the idea of binding certain variables in a relation, that is, on the idea of assigning them specific values. Because of these foundations, binding pattern models are ill-equipped to deal with non-binding conditions (that is, essentially, with all conditions except equality and membership in a finite set). As an example, consider a source with three sorts, A, B, and C, and a function in addition, the source has a comparison capability which allows it to compare B with a fourth sort D and return C’s for a specified value of A such that a specified condition is verified: the diagram of this source is:

Because the condition is non-binding, it doesn’t contribute any binding pattern to the relation R(A, B, C) for which the only binding pattern is, therefore, A query such as where “=18 and self.agesize() select(dueDate>Today)->size(). A type with this stereotype defines an event type. Like any other entity type, event types may be specialized and/or generalized. This will allow us to build a taxonomy of event types, where common elements are defined only once. It is convenient to define a root entity type, named Event, as shown in Figure 2. All event types are direct or indirect subtypes of Event. In fact, Event is defined as derived by union of its subtypes. We define in this event type the attribute time, which gives the occurrence time of each event. We define also the abstract operation effect, whose purpose will be made clear later. It is not necessary to stereotype event types as > because all direct or indirect subtypes of Event will be considered event types. The view of events as entities is not restricted to domain events. We apply it also to query events.

3.2 Event Characteristics The characteristics of an event are the set of relationships in which it participates. There is at least one relationship between each event entity and a time point, representing the event occurrence time. We assume that the characteristics of an event are determined when the event occurs, and remain fixed. In a particular language, the characteristics of events should be modeled like those of ordinary entities. In the UML, we model them as attributes or associations. Figure 2 shows the definition of the external domain event type NewProduct, with four attributes (including time) and an association with Vendor. The immutability of characteristics can be defined by setting their isReadOnly property to true (not shown in the Figure) [29, p. 89+].

Fig. 2. Definition of event type NewProduct


Event characteristics may be derived. The value for a derived characteristic may be computed from other characteristics and/or the state of the IB when the event occurs, as specified by the corresponding derivation rule. The practical importance of derived characteristics is that they can be referred to in any expression (integrity constraints, effect, etc.) exactly as the base ones, but their definition appears in a single place (derivation rule). In the UML, derived elements (attributes, associations, entity types) are marked with a slash (/). We define derivation rules by means of defining operations [26]. In the example of Figure 2, attribute vendorName gives the name of the vendor that will supply the new product. The association between NewProduct and Vendor may be derived from the vendor’s name. The defining operation: NewProduct::vendor( ): vendor gives the vendor associated with an event instance. In the UML 2.0, the result of operations is specified by a body expression [29, p. 76+]. Using the OCL, the formal specification of the above operation may be:
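The OCL specification announced here is not reproduced. Purely as an illustration of the derivation it describes — pick the Vendor whose name equals the event's vendorName — a Python stand-in (our sketch, not the paper's OCL) could be:

    def derived_vendor(event, all_vendors):
        # all_vendors stands in for Vendor.allInstances() in the OCL version.
        matches = [v for v in all_vendors if v.name == event.vendorName]
        return matches[0] if matches else None   # None if the name matches no vendor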

3.3 Event Constraints An event constraint is a condition an event must satisfy to occur [8]. An event constraint involves the event characteristics and the state of the IB before the event occurrence. It is assumed that the state of the IB before the event occurrence satisfies all defined constraints. Therefore, an event E can occur when the domain is in state S if: (1) state S satisfies all constraints, and (2) event E satisfies its event constraints. An IS checks event constraints when the events occur and the values of their characteristics have been established, but before the events have any effect in the IB or produce any answer. Events that do not satisfy their constraints are not allowed to occur and, therefore, they must be rejected. Event constraints checking is (assumed to be) done instantaneously. In a particular conceptual modeling language, event constraints can be represented like any other constraint. In the UML, they can be expressed as invariants or as constraint operations [27]. Event constraints are always creation-time constraints because they must be satisfied when events occur. Here we will define constraints by operations, called constraint operations, and we specify them in the OCL. In the UML, we show graphically constraint operations with the stereotype . The result of the evaluation of constraint operations must be true. A constraint of the NewProduct event (Figure 2) is that the product being added cannot exist already. We define it with the constraint operation doesNotExist. The specification in the OCL is:
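Again, the OCL text itself is not reproduced here. Its stated intent — the product being added cannot exist already — can be sketched as follows (the attribute name productName and the helper names are our assumptions):

    def does_not_exist(event, all_products):
        # Constraint operation: true iff no current Product carries the new name.
        # all_products stands in for Product.allInstances().
        return not any(p.name == event.productName for p in all_products)

    # Events whose constraint operations do not all evaluate to true are
    # rejected before they have any effect on the information base.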

On the other hand, the vendor must exist. This is also an event constraint. However, in this case the constraint can be expressed as a cardinality constraint. The multiplicity 1 in the vendor role requires that each instance of NewProduct must be


linked to exactly one vendor. The constraint is violated if the vendor() operation does not return an instance of Vendor. An event constraint defined in a supertype applies to all its direct and indirect instances. This is one of the advantages of defining event taxonomies: common constraints can be defined in a generalized event type. Figure 3 shows an example. The event type ExistingProductEvent is defined as the union of NewRequirement, PurchaseOrderRelease and ProductDetails. The constraint that the product must exist is defined in ExistingProductEvent, and it applies to all its indirect instances. Note that the constraint has been defined by a cardinality constraint, as explained above. Although it is not shown in Figure 3, the event type ExistingProductEvent is a subtype of Event. Figure 3 shows also the constraint validDate in NewRequirement. The constraint is satisfied if dateRequired is greater than the event date.

Fig. 3. ExistingProductEvent is a subtype of Event (not shown here) and a common supertype of the domain event types NewRequirement and PurchaseOrderRelease and of the query event type ProductDetails

3.4 Query Events Effects The effect of a query event is an answer providing the requested information. The effect is specified by an expression whose evaluation on the IB gives the requested information. The query is written in some language, depending on the conceptual modeling language used. In the UML, we can represent the answer to a query event and the query expression in several ways. We explain one of them here, which can be used as is, or as a basis for the development of alternative ways. The answer to a query event is modeled as one or more attributes and/or associations of the event, with some predefined name. In the examples, we shall use names starting with answer. An alternative could be the use of a stereotype to indicate that an attribute or association is the answer of the event. Now, we need a way to define the value of the answer attributes and associations. To this end, we use the operation effect that we have defined in Event. This operation will have a different specification in each event type. For query events, its purpose is


to specify the values of the answer attributes and associations. The specification of the operation can be done by means of postconditions, using the OCL. Figure 3 shows the representation of external query event type ProductDetails. The answer is given by attribute: The specification of the effect operation may be:

Alternatively, in O-O languages the answer to a query event could be specified as the invocation of some operation. The effect of this operation would then be the answer of the query event.

3.5 Domain Events Effects: The Postcondition Approach The effect of a domain event is a set of structural events. There are two main approaches to the definition of that set: the postconditions and the structural events approaches [25]. These approaches are called declarative and imperative specifications, respectively, in [34]. In the former, the definition is a condition satisfied by the IB after the application of the event effect. In the latter, the definition is an expression whose evaluation gives the corresponding structural events. Both approaches can be used in our method, although we (as many others) tend to favor the use of postconditions. We deal with the postcondition approach in this subsection, and the structural events approach in the next one.

Fig. 4. Definition of OrderReception and OrderReschedule event types

In the postcondition approach, the effect of an event Ev is defined by a condition C over the IB. The idea is that the event Ev leaves the IB in a state that satisfies C. It is also assumed that the state after the event occurrence satisfies all constraints defined over the IB. Therefore, the effect of event Ev is a state that satisfies condition C and all IB constraints.


In an O-O language, we can represent the effect of a domain event in several ways. As we did for query events, we explain one way here, which can be used as is, or as a basis for the development of alternative ways. We define a particular operation in each domain event type, whose purpose is to specify the effect. To this end, we use the operation effect that we have defined in Event. This operation will have a different specification in each event type. Now, the postcondition of this operation will be exactly the postcondition of the corresponding event. As we have been doing until now, in the UML we also use the OCL to specify these postconditions formally. As an example, consider the external domain event type NewRequirement, shown in Figure 3. The effect of one instance of this event type is the addition of one instance into entity type Requirement (see Figure 1). Therefore, in this case the specification of the effect operation is:

In our method, we do not define preconditions in the specification of effect operations. The reason is that we implicitly assume that the events satisfy their constraints before the application of their effect. In the example, we assume implicitly that a NewRequirement event references an existing product, and that its required date is valid. The postcondition states simply that a new instance of Requirement has been created in the IB, with the corresponding values for its attributes and association. Any implementation of the effect operation that leaves the IB in a state that satisfies the postcondition and the IB constraints is valid. Another example is the external domain event type OrderReception (see Figure 4). An instance of OrderReception occurs when a scheduled receipt is received. The event effect is that the purchase order now becomes ReceivedOrder (see Figure 1), and that the quantity on hand of the corresponding product is increased by the quantity received. We specify this effect with two postconditions of effect() in OrderReception:
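Neither OCL block is reproduced here. As a loose, imperative rendering of the described postconditions (the information-base API and the attribute and navigation names such as qtyOnHand, qtyReceived, product() and purchaseOrder() are our assumptions, not the paper's), one might write:

    def new_requirement_effect(ev, ib):
        # Postcondition: a new Requirement exists for the event's product,
        # with the event's quantity and required date.
        ib.create("Requirement", product=ev.product(),
                  quantity=ev.quantity, dateRequired=ev.dateRequired)

    def order_reception_effect(ev, ib):
        # Postcondition 1: the purchase order is now a ReceivedOrder.
        order = ev.purchaseOrder()
        ib.reclassify(order, as_type="ReceivedOrder")
        # Postcondition 2: quantity on hand grows by the quantity received.
        order.product.qtyOnHand += ev.qtyReceived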

3.6 Domain Events Effects: The Structural Events Approach In the structural events approach, the effect of an event Ev is defined by an expression written in some language. The idea is that the evaluation of the expression gives the


set S of structural events corresponding to the event Ev effect. The application of S to the previous state of the IB produces the new state. The new state of the IB is the previous state plus the entities or relationships inserted, and minus the entities or relationships deleted. This approach is in contrast with the previous one, which defines a condition that characterizes the state of the IB after the event. It is assumed that the set S is such that it leaves the IB in a new state that satisfies all the constraints. Therefore, when defining the expression, one must take into account the existing constraints, and to ensure that the new state of the IB will satisfy all of them. Our method could be used in O-O languages that follow the structural events approach. The idea is to provide a method for the effect operations. The method is a procedural expression, written in the corresponding language, whose evaluation yields the structural events.
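As a sketch of the contrast with the postcondition approach (the structural-event encoding and all names are ours), the effect method here evaluates to the structural events themselves rather than to a condition on the resulting state:

    def new_requirement_effect(ev):
        # Returns the set of structural events; applying them to the previous
        # state of the IB yields the new state.
        return [("insert", "Requirement",
                 {"product": ev.product(),
                  "quantity": ev.quantity,
                  "dateRequired": ev.dateRequired})]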

3.7 Comparison with Previous Work In most current conceptual modeling methods and languages, events are not considered objects. Instead of this, events are represented as invocations of actions or operations, or the reception of signals or messages. Event types are defined by means of operations (with their signatures) or an equivalent construct. We believe that the view of events as entities (albeit of a special kind) provides substantial benefits to behavioral modeling. The reason is that the uniform treatment of event and entity types implies that most (if not all) language constructs available for entity types can be used also for event types. In particular: (1) Event types with common characteristics, constraints, derivation rules and effects can be generalized, so that common parts are defined in a single place, instead of repeating them in each event type. We have found that, in practice, many event types have characteristics, constraints and derivation rules in common with others [14]; (2) The graphical notation related to entity types (including attributes, associations, multiplicities, generalization, packages, etc.) can be used also for event types; and (3) Event types can be specialized in a way similar to entity types, as we explain in the next section.

4 Event Specialization One of the fundamental constructs of O-O conceptual modeling languages is the specialization of entity types. When we consider events as entities, we have the possibility of defining specializations of event types. We may use these specializations when we want to define an event type whose characteristics, constraints and/or effect are extensions and/or specializations of another event type. For example, assume that some instances of NewRequirement are special because they require a large quantity of their product and, for some reason, the quantity required must be ordered immediately to the corresponding vendor. This special behavior can be defined in a new event type, SpecialRequirement, defined as a specialization of NewRequirement, as shown Figure 5. Note that SpecialRequirement redefines the constraint validDate, and adds a new constraint called largeQuantity. The required date of the new events must be beyond


Fig. 5. Two specializations of the event type NewRequirement (Fig. 3)

the current date plus the vendor’s lead time, and the quantity required must be at least ten times the product order minimum. In the UML, the body of operations may be overridden when an operation is redefined, whereas preconditions and postconditions can only be added [29, p. 78]. Therefore, we redefine validDate as:

The new constraint largeQuantity can be defined as:

The effect of a SpecialRequirement is the same as that of a NewRequirement, but we want the system to generate an instance of PurchaseOrderRelease (see Figure 3). We define this extension as an additional postcondition of the effect operation:
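The three OCL fragments referred to in this subsection (the redefined validDate, the added largeQuantity, and the extra postcondition of effect) are not reproduced here. A hedged Python sketch of the same behavior, with attribute names such as leadTime and orderMinimum assumed from the prose, could be:

    class NewRequirement:                         # minimal stub of the supertype
        def effect(self, ib):
            ib.create("Requirement", product=self.product(),
                      quantity=self.quantity, dateRequired=self.dateRequired)

    class SpecialRequirement(NewRequirement):
        def valid_date(self):
            # Redefined: required date must lie beyond today plus the vendor's lead time.
            return self.dateRequired > self.time + self.product().vendor.leadTime

        def large_quantity(self):
            # Added constraint: at least ten times the product's order minimum.
            return self.quantity >= 10 * self.product().orderMinimum

        def effect(self, ib):
            super().effect(ib)                    # same effect as a NewRequirement...
            ib.create("PurchaseOrderRelease",     # ...plus an immediate order release
                      product=self.product(), quantity=self.quantity)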

On the other hand, we can define event types derived by specialization. A derived event type is an event type whose instances at any time can be inferred by means of a derivation rule. An event type Ev is derived by specialization of event types when Ev is derived and their instances are also instance of [28]. We may use event types defined by specialization when we want to define particular constraints and/or effect for events that satisfy some condition. For example, suppose that some instances of NewRequirement are urgent because they are required within the temporal horizon of the current MRP plan (seven days), and therefore they could not have been taken into account when the plan was generated. We want a behavior similar to the previous example. The difference is that now we determine automatically which are the urgent requirements. We define a new


event type, UrgentRequirement, shown in Figure 5, defined as derived by specialization of NewRequirement. In the UML, the name we give to the defining operations for derived entity types is allInstances [26]. In this case, allInstances is a class operation that gives the population of the type. The derivation rule of UrgentRequirement is then:
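The allInstances rule itself is not reproduced here; the membership condition it describes — NewRequirement events whose required date falls within the seven-day horizon of the current plan — can be sketched as follows (the date arithmetic via timedelta is our assumption):

    from datetime import timedelta

    PLAN_HORIZON = timedelta(days=7)

    def urgent_requirement_instances(new_requirement_events):
        # Derivation by specialization: the urgent requirements are exactly the
        # NewRequirement events required within the current plan's horizon.
        return [ev for ev in new_requirement_events
                if ev.dateRequired <= ev.time + PLAN_HORIZON]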

The effect of an urgent requirement is the same as that of a new requirement, but again we want the system to generate an instance of PurchaseOrderRelease (see Figure 3). We would define this extension as an additional postcondition of the effect operation, as we did in the previous example. Comparison with Previous Work. Event specialization is not possible when events are seen as operation invocations. The consequence is that this powerful modeling construct cannot be used in methods like those mentioned in the Introduction.

5 Conclusions In the context of O-O conceptual modeling languages, we have proposed a method that models events as entities (objects), and event types as a special kind of entity types. The method makes an extensive use of language constructs such as constraints, derived types, derivation rules, type specializations, operations and operation redefinition, which are present in all complete conceptual modeling languages. The method can be adapted to most O-O languages. In this paper we have explained in detail its adaptation to the UML. The method is fully compatible with the UML-based CASE tools, and thus it can be adopted in industrial projects, if it is felt appropriate. The main advantage of the method we propose is the uniform treatment we give to event and entity types. The consequence is that most (if not all) language constructs available for entity types can be used also for event types. Event types may have constraints and derived characteristics, like entity types. Characteristics, constraints and effects shared by several event types may be defined in a single place. Event specialization allows the incremental definition of new event types, as refinements of their supertypes. Historical events ease the definition of constraints, derivation rules and event effects. In summary, we believe that the view of events as entities provides substantial benefits to behavioral modeling. Among the work that remains to be done, there is the integration of the proposed method with the state transition diagrams. These diagrams allow defining the kinds of events that were beyond the scope of this paper.

Acknowledgements I would like to thank Jordi Cabot, Jordi Conesa, Dolors Costal, Xavier de Palol, Cristina Gómez, Anna Queralt, Ruth Raventós, Maria Ribera Sancho and Ernest


Teniente for their help and many useful comments to previous drafts of this paper. This work has been partly supported by the Ministerio de Ciencia y Tecnologia and FEDER under project TIC2002-00744.

References
1. Abrial, J-R. The B-Book. Cambridge University Press, 1996, 779 p.
2. Bonner, A.J.; Kifer, M. "The State of Change: A Survey". LNCS 1472, 1998, pp. 1-36.
3. Borgida, A.; Greenspan, S. "Data and Activities: Exploiting Hierarchies of Classes". Workshop on Data Abstraction, Databases and Conceptual Modelling, 1980, pp. 98-100.
4. Bubenko, J.A. jr. "Information Modeling in the Context of System Development". Proc. IFIP 1980, North-Holland, 1980, pp. 395-411.
5. Cabot, J.; Olivé, A.; Teniente, E. "Representing Temporal Information in UML". Proc. UML'03, LNCS 2863, pp. 44-59.
6. Ceri, S.; Fraternalli, P. Designing Database Applications with Objects and Rules. The IDEA Methodology. Addison-Wesley, 1997, 579 p.
7. Coleman, D.; Arnold, P.; Bodoff, S.; Dollin, C.; Gilchrist, H.; Hayes, F.; Jeremaes, P. Object-Oriented Development. The Fusion Method. Prentice Hall, 1994, 316 p.
8. Cook, S.; Daniels, J. Designing Object Systems. Object-Oriented Modelling with Syntropy. Prentice Hall, 1994, 389 p.
9. Costal, D.; Olivé, A.; Sancho, M-R. "Temporal Features of Class Populations and Attributes in Conceptual Models". Proc. ER'97, LNCS 1331, Springer, pp. 57-70.
10. D'Souza, D.F.; Wills, A.C. Objects, Components and Frameworks with UML. The Catalysis Approach. Addison-Wesley, 1999, 785 p.
11. Dardenne, A.; van Lamsweerde, A.; Fickas, S. "Goal-directed requirements acquisition". Science of Computer Programming, 20 (1993), pp. 3-50.
12. Davis, A.M. Software Requirements. Objects, Functions and States. Prentice-Hall, 1993.
13. Embley, D.W.; Kurtz, B.D.; Woodfield, S.N. Object-Oriented System Analysis. A Model-Driven Approach. Yourdon Press, 1992, 302 p.
14. Frias, L.; Olivé, A.; Queralt, A. "EU-Rent Car Rentals Specification". UPC, Research Report LSI 03-59-R, 2003, 159 p., http://www.lsi.upc.es/dept/techreps/techreps.html.
15. Gamma, E.; Helm, R.; Johnson, R.; Vlissides, J. Design Patterns. Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995, 395 p.
16. Harel, D.; Gery, E. "Executable Object Modeling with Statecharts". IEEE Computer, July 1997, pp. 31-42.
17. IEEE. IEEE Standard for Conceptual Modeling Language Syntax and Semantics for IDEF1X97 (IDEFobject). IEEE Std 1320.2-1998, 1999.
18. ISO/TC97/SC5/WG3. "Concepts and Terminology for the Conceptual Schema and the Information Base", J.J. van Griethuysen (ed.), March, 1982.
19. Jungclaus, R.; Saake, G.; Hartmann, T.; Sernadas, C. "TROLL – A Language for Object-Oriented Specification of Information Systems". ACM TOIS, 14(2), 1996, pp. 175-211.
20. Larman, C. Applying UML and Patterns. Prentice Hall, 2002, 627 p.
21. Martin, J.; Odell, J.J. Object-Oriented Methods: A Foundation. Prentice Hall, 1995, 412 p.
22. Martin, R.C. Agile Software Development, Principles, Patterns and Practices. Prentice Hall, 2003, 529 p.
23. Mellor, S.J.; Balcer, M.J. Executable UML. A Foundation for Model-Driven Architecture. Addison-Wesley, 2002, 368 p.
24. Mylopoulos, J.; Bernstein, P.A.; Wong, H.K.T. "A Language Facility for Designing Database-Intensive Applications". ACM TODS, 5(2), pp. 185-207, 1980.


25. Olivé, A. "Time and Change in Conceptual Modeling of Information Systems". In Brinkkemper, S.; Lindencrona, E.; Solvberg, A. "Information Systems Engineering. State of the Art and Research Themes", Springer, 2000, pp. 289-304.
26. Olivé, A. "Derivation Rules in Object-Oriented Conceptual Modeling Languages". Proc. CAiSE 2003, LNCS 2681, pp. 404-420.
27. Olivé, A. "Integrity Constraints Definition in Object-Oriented Conceptual Modeling Languages". Proc. ER 2003, LNCS 2813, pp. 349-362, 2003.
28. Olivé, A.; Teniente, E. "Derived types and taxonomic constraints in conceptual modeling". Information Systems, 27(6), 2002, pp. 391-409.
29. OMG. UML Superstructure 2.0 Final Adopted Specification, 2003, http://www.omg.org/cgi-bin/doc?ptc/2003-08-02.
30. Robinson, K.; Berrisford, G. Object-oriented SSADM. Prentice Hall, 1994, 524 p.
31. Rumbaugh, J.; Jacobson, I.; Booch, G. The Unified Modeling Language Reference Manual. Addison-Wesley, 1999, 550 p.
32. Selic, B.; Gullekson, G.; and Ward, P.T. Real-Time Object-Oriented Modeling. John Wiley & Sons, 1994, 525 p.
33. Teisseire, M.; Poncelet, P.; Cichetti, R. "Dynamic Modelling with Events", Proc. CAiSE'94, LNCS 811, pp. 186-199, 1994.
34. Wieringa, R. "A survey of structured and object-oriented software specification methods and techniques". ACM Computing Surveys, 30(4), December 1998, pp. 459-527.

Enterprise Modeling with Conceptual XML

David W. Embley¹, Stephen W. Liddle², and Reema Al-Kamha¹

¹ Department of Computer Science, Brigham Young University, Provo, Utah 84602, USA {embley,reema}@cs.byu.edu
² School of Accountancy and Information Systems, Brigham Young University, Provo, Utah 84602, USA [emailprotected]

Abstract. An open challenge is to integrate XML and conceptual modeling in order to satisfy large-scale enterprise needs. Because enterprises typically have many data sources using different assumptions, formats, and schemas, all expressed in – or soon to be expressed in – XML, it is easy to become lost in an avalanche of XML detail. This creates an opportunity for the conceptual modeling community to provide improved abstractions to help manage this detail. We present a vision for Conceptual XML (C-XML) that builds on the established work of the conceptual modeling community over the last several decades to bring improved modeling capabilities to XML-based development. Building on a framework such as C-XML will enable better management of enterprise-scale data and more rapid development of enterprise applications.

1 Introduction

A challenge [3] for modern enterprise modeling is to produce a simple conceptual model that: (1) works well with XML and XML Schema; (2) abstracts well for conceptual entities and relationships; (3) scales to handle both large data sets and complex object interrelationships; (4) allows for queries and defined views via XQuery; and (5) accommodates heterogeneity. The conceptual model must work well with XML and XML Schema because XML is rapidly becoming the de facto standard for business data. Because conceptualizations must support both high-level understanding and high-level program construction, the conceptual model must abstract well. Because many of today’s huge industrial conglomerations have large, enterprise-size data sets and increasingly complex constraints over their data, the conceptual model must scale up. Because XQuery, like XML, is rapidly becoming the industry standard, the conceptual model must smoothly incorporate both XQuery and XML. Finally, because we can no longer assume that all enterprise data is integrated, the conceptual model must accommodate heterogeneity. Accommodating heterogeneity also supports today’s rapid acquisitions and mergers, which require fast-paced solutions to data integration. We call the answer we offer for this challenge Conceptual XML (C-XML). C-XML is first and foremost a conceptual model, being fundamentally based on object-set and relationship-set constructs. As a central feature, C-XML supports P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 150–165, 2004. © Springer-Verlag Berlin Heidelberg 2004


high-level object- and relationship-set construction at ever higher levels of abstraction. At any level of abstraction the object and relationship sets are always first class, which lets us address object and relationship sets uniformly, independent of level of abstraction. These features of C-XML make it abstract well and scale well. Secondly, C-XML is “model-equivalent” [9] with XML Schema, which means that C-XML can represent each component and constraint in XML Schema and vice versa. Because of this correspondence between C-XML and XML Schema, XQuery immediately applies to populated C-XML model instances and thus we can raise the level of abstraction for XQuery by applying it to high-level model instances rather than low-level XML documents. Further, we can define high-level XQuery-based mappings between C-XML model instances over in-house, autonomous databases, and we can declare virtual views over these mappings. Thus, we can accommodate heterogeneity at a higher level of abstraction and provide uniform access to all enterprise data. Besides enunciating a comprehensive vision for the XML/conceptual-modeling challenge [3], our contributions in this paper include: (1) mappings to and from C-XML and XML Schema, (2) defined mechanisms for producing and using firstclass, high-level, conceptual abstractions, and (3) XQuery view definitions over both standard and federated conceptual-model instances that are themselves conceptual-model equivalent. As a result of these contributions, C-XML and XML Schema can be fully interchangable in their usage over both standard and heterogeneous XML data repositories. This lets us leverage conceptual model abstractions for high-level understanding while retaining all the complex details involved with low-level XML Schema intricacies, view mappings, and integration issues over heterogeneous XML repositories. We present the details of our contributions as follows. Section 2 describes C-XML. Section 3 shows that C-XML is “model-equivalent” with XML Schema by providing mappings between the two. Section 4 describes C-XML views. We report the status of our implementation and conclude in Section 5.

2 C-XML: Conceptual XML

C-XML is a conceptual model consisting of object sets, relationship sets, and constraints over these object and relationship sets. Graphically a C-XML model instance M is an augmented hypergraph whose vertices and edges are respectively the object sets and relationship sets of M, and whose augmentations consist of decorations that represent constraints. Figure 1 shows an example. In the notation boxes represent object sets – dashed if lexical and not dashed if nonlexical because their objects are represented by object identifiers. With each object set we can associate a data frame (as we call it) to provide a rich description of its value set and other properties. A data frame lets us specify, for example, that OrderDate is of type Date or that ItemNr values must satisfy the value pattern “[A-Z]{3}-\d{7}”. Lines connecting object sets are relationship sets; these lines may be hyper-lines (hyper-edges in hyper-graphs) with diamonds when they have more than two connections to object sets. Optional


Fig. 1. Customer/Order C-XML Model Instance.

or mandatory participation constraints respectively specify whether objects in a connected relationship may or must participate in a relationship set (an “o” on a connecting relationship-set line designates optional while the absence of an “o” designates mandatory). Thus, for example, the C-XML model instance in Figure 1 declares that an Order must include at least one Item but that an Item need not be included in any Order. Arrowheads on lines specify functional constraints. Thus, Figure 1 declares that an Item has a Price and a Description and is in a one-to-one correspondence with ItemNr and that an Item in an Order has one Qty and one SalePrice. In cases when optional and mandatory participation constraints along with functional constraints are insufficient to specify minimum and maximum participation, explicit min..max constraints may be specified. Triangles denote generalization/specialization hierarchies. We can constrain ISA hierarchies by partition union or mutual exclusion (+) among specializations. Any object-set/relationship-set connection may have a role, but a role is simply a shorthand for an object set that denotes the subset consisting of the objects that actually participate in the connection.

3 Translations Between C-XML and XML Schema

Many translations between C-XML and XML Schema are possible. In recent ER conferences, researchers have described varying conceptual-model translations to and/or from XML or XML DTDs or XML-Schema-like specifications (see, for example, [4, 6, 10]). It is not our purpose here to argue for or against a particular translation. Indeed, we would argue that a variety of translations may be desirable. For any translation, however, we require information and constraint preservation. This ensures that an XML Schema and a conceptual instantiation of an XML Schema as a C-XML model instance correspond and that a system can reflect manipulations of the one in the other. To make our correspondence exact, we need information- and constraint-preserving translations in both directions. We do not, however, require that translations be inverses of one another – translations that generate members of an equivalence class of XML Schema specifications and C-XML model instances are sufficient. In Section 3.1 we present our C-XML-to-XML-Schema translation, and in Section 3.2 we present an XML-Schema-to-C-XML translation. In Section 3.3 we formalize the notions of information and constraint preservation and show that the translations we propose preserve information and constraints.

3.1 Translation from C-XML to XML Schema

We now describe our process for translating a C-XML model instance C to an XML Schema instance. We illustrate our translation process with the C-XML model instance of Figure 1 translated to the corresponding XML Schema excerpted in Figure 2. Fully automatic translation from C is not only possible, but can be done with certain guarantees regarding the quality of the result. Our approach is based on our previous work [8], which for C generates a forest of scheme trees such that (1) the forest has a minimal number of scheme trees, and (2) XML documents conforming to the forest have no redundant data with respect to the functional and multivalued constraints of C. For our example in Figure 1, the algorithms in [8] will generate the following two nested scheme trees.

Observe that the XML Schema in Figure 2 satisfies these nesting specifications. Item in the second scheme tree appears as an element on Line 8 with ItemNr, Description, and Price defined as its attributes on Lines 28–30. PreviousItem is nested, by itself, underneath Item, on Line 18, and Manufacturer, RequestDateTime, and Qty are nested underneath Item as a group on Lines 13–15. The XML Schema notation that accompanies these C-XML object-set names obscures the nesting to some extent, but this additional notation is necessary either to satisfy the syntactic requirements of XML Schema or to allow us to specify the constraints of the C-XML model instance. As we continue, recall first that each C-XML object set has an associated data frame that contains specifications such as type declarations, value restrictions, and any other annotations needed to specify information about objects in the object set.

Fig. 2. XML Schema Excerpt for the C-XML Model Instance in Figure 1.

For our work here, we let the kind of information that appears in a data frame correspond exactly to the kind of data constraint information specifiable in XML Schema. One example we point out explicitly is order information, which is usually absent in conceptual models but is unavoidably present in XML. Thus, if we wish to say that CustomerName precedes CustomerAddr, we add an annotation referencing CustomerAddr to the CustomerName data frame and a corresponding annotation referencing CustomerName to the CustomerAddr data frame. In our discussion, we assume that these annotations are in the data frames that accompany the object sets CustomerName and CustomerAddr in Figure 1. Our conversion algorithm preserves all annotations found in C-XML data frames. This is where we obtain all the type specifications in Figure 2. We capture the order specification that CustomerName precedes CustomerAddr by making CustomerName and CustomerAddr elements (rather than attributes) and placing them, in order, in their proper place in the nesting – for our example, in Lines 58 and 59 nested under CustomerDetails.

In the conversion from C-XML to XML Schema we use attributes instead of elements where possible. An object set can be represented as an attribute of an element if it is lexical, is functionally dependent on the element, and has no order annotations. The object sets OrderID and OrderDate, for example, satisfy these conditions and appear as attributes of an Order element on Lines 75 and 76. Both attributes are also marked as “required” because of their mandatory connection to Order, as specified by the absence of an “o” on their connection to Order in Figure 1. When an object set is lexical but not functional and order constraints do not hold, the object set becomes an element with minimum and maximum participation constraints. PreviousItem in Line 18, for example, has a minimum participation constraint of 0 and a maximum of unbounded. Because XML Schema will not let us directly specify n-ary relationship sets, we convert them all to binary relationship sets by introducing a tuple identifier. We can think of each diamond in a C-XML diagram as being replaced by a nonlexical object set containing these tuple identifiers. To obtain a name for the object set containing the tuple identifiers, we concatenate the names of the non-functionally-dependent object sets. For example, given the relationship set for Order, Item, SalePrice, and Qty, we generate an OrderItem element (Line 63). If names become too long, we abbreviate using only the first letter of some object-set names. Thus, for example, we generate ItemMR (Line 11) for the relationship set connecting Item, Manufacturer, RequestDateTime, and Qty. When a lexical object set has a one-to-one relationship with a nonlexical object set, we use the lexical object set as a surrogate for the nonlexical object set and generate a key constraint. In our example, this generates key constraints for Order/OrderID in Line 35 and Item/ItemNr in Line 39. We also use these surrogate identifiers, as needed, to maintain explicit referential integrity. Observe that in the scheme trees above, Item in the first tree references Item in the root of the second scheme tree, and also that PreviousItem in the second scheme tree is a role and therefore a specific specialization (or subset) of Item in the root. Thus, we generate keyref constraints, one in Lines 69–72 to ensure the referential integrity of ItemNr in the OrderItem element and another in Lines 22–25 for the PreviousItem element. Another construct in C-XML we need to translate is generalization/specialization. XML Schema uses the concept of substitution groups to allow the use of multiple element types in a given context. Thus, for example, we generate an abstract element for Customer in Line 44, but then specify in Lines 45–55 a substitution group for Customer that allows RegularCustomer and PreferredCustomer to appear in a Customer context. We model content that would normally be associated with the generalization by generating a group that is referenced in each specialization (in Lines 47 and 52). In our example, we generate the group CustomerDetails and nest the details of Customer such as CustomerName, CustomerAddr, and Orders under CustomerDetails, as we do beginning in Line 56.

Further, we can nest any information that applies to only one of the specializations directly with that specialization; thus, in Line 48 we nest Discount under PreferredCustomer. Finally, XML documents need to have a single content root node. Thus, we assume the existence of an element called Document (Line 4) that serves as the universal content root.
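The attribute-versus-element choice and the occurrence constraints described in this section can be summarized as a small decision rule. The following Python sketch is ours and only illustrative; the data-class fields are assumptions standing in for information read off a C-XML model instance.

```python
from dataclasses import dataclass

@dataclass
class Connection:
    """Hypothetical summary of one object set connected to a parent element."""
    name: str
    lexical: bool               # has a value representation (dashed box in C-XML)
    functional: bool            # functionally dependent on the parent element
    has_order_annotation: bool  # data frame carries an ordering annotation
    mandatory: bool             # no "o" on the connection in the C-XML diagram

def xml_schema_representation(conn: Connection) -> dict:
    """Decide how a connected object set is rendered, following the rules above:
    attribute if lexical, functional, and order-free; element otherwise."""
    if conn.lexical and conn.functional and not conn.has_order_annotation:
        return {"kind": "attribute",
                "use": "required" if conn.mandatory else "optional"}
    return {"kind": "element",
            "minOccurs": 1 if conn.mandatory else 0,
            "maxOccurs": 1 if conn.functional else "unbounded"}

# Examples mirroring the discussion of Figure 1 and Figure 2.
print(xml_schema_representation(Connection("OrderID", True, True, False, True)))
print(xml_schema_representation(Connection("PreviousItem", False, False, False, False)))
```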

3.2 Translation from XML Schema to C-XML

We translate XML Schema instances to C-XML by separating structural XML Schema concepts (such as elements and attributes) from non-structural XML Schema concepts (such as attribute types and order constraints). Then we generate C-XML constructs for the structural concepts and annotate the generated C-XML object sets with the non-structural information. We can convert an XML Schema S to a C-XML model instance by generating object sets for each element and attribute type, connected by relationship sets according to the nesting structure of S. Figure 3 shows the result of applying our conversion process to the XML Schema instance of Figure 2. Note that we nest object and relationship sets inside one another corresponding to the nested element structure of the XML Schema instance. Whether we display C-XML object sets inside or outside one another has no semantic significance. The nested structure, however, is convenient because it corresponds to the natural XML Schema instance structure. The initial set of generated object and relationship sets is straightforward. Each element or attribute generates exactly one object set, and each element that is nested inside another element generates a relationship set connecting the two. Each attribute associated with an element always generates a corresponding object set and a relationship set connecting it to the object set generated by the element. Participation constraints for attribute-generated relationship sets are fixed on the attribute side and are either 1 or 0..1 on the element side. Participation constraints for relationship sets generated by element nesting require a bit more work. If the element is in a sequence or a choice, there may be specific minimum/maximum occurrence constraints we can use directly. For example, according to the constraints on Line 60 in Figure 2, a CustomerDetails element may contain a list of 0 or more Order elements. However, an Order element must be nested inside a CustomerDetails element. Thus, for the relationship set connecting CustomerDetails and Order, we place a participation constraint of 0..* on the CustomerDetails side and 1 on the Order side. In order to make the generated C-XML model instance less redundant, we look for certain patterns and rewrite the generated model instance when appropriate. For example, since ItemNr has a key constraint, we infer that it is one-to-one with Item. Further, the keyref constraints on ItemNr for PreviousItem and OrderItem indicate that rather than create two additional ItemNr object sets, we can instead relate PreviousItem and OrderItem to the ItemNr nested in Item.
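A minimal Python sketch of how occurrence bounds from the schema might be mapped to participation constraints, using the CustomerDetails/Order example just discussed; the function and key names are ours and the mapping is an assumption-laden simplification, not the paper's exact procedure.

```python
def nesting_participation(min_occurs: int, max_occurs) -> dict:
    """Derive participation constraints for a relationship set generated from an
    element nested inside a parent element, as in the CustomerDetails/Order
    example above: the child's occurrence bounds constrain the parent side,
    and the child must appear inside its parent exactly once."""
    upper = "*" if max_occurs == "unbounded" else str(max_occurs)
    return {"parent_side": f"{min_occurs}..{upper}", "child_side": "1"}

# Line 60 of Figure 2: CustomerDetails contains 0 or more Order elements.
print(nesting_participation(0, "unbounded"))  # {'parent_side': '0..*', 'child_side': '1'}
```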

Fig. 3. C-XML Model Instance Translated from Figure 2.

Another optimization is the treatment of substitution groups. In our example, since RegularCustomer and PreferredCustomer are substitutable for Customer, we construct a generalization/specialization for the three object sets and factor out the common substructure of the specializations into the generalization. Thus, CustomerDetails exists in a one-to-one relationship with Customer. Another complication in XML Schema is the presence of anonymous types. For example, the complex type in Line 5 of Figure 2 is a choice of 0 or more Customer or Item elements. We need a generalization/specialization to represent this, and since C-XML requires names for object sets, we simply concatenate all the top-level names to form the generalization name CustomerItem. There are striking differences between the C-XML model instances of Figures 1 and 3. The translation to XML Schema introduced the new elements Document, CustomerDetails, OrderItem, and ItemMR in order to represent a top-level root node, generalization/specializations, and decomposed relationship sets.

If we knew that a particular XML Schema instance was generated from an original C-XML model instance, we could perform additional optimizations. For example, if we knew CustomerDetails was fabricated by the translation to XML Schema, we could observe that in the reverse translation to C-XML it is superfluous because it is one-to-one with Customer. Similarly, we could recognize that Document is a fabricated top-level element and omit it from the reverse translation; this would also eliminate the need for CustomerItem and its generalization/specialization. Finally, we could recognize that relationship sets have been decomposed and, in the reverse translation, reconstitute them. The original C-XML-to-XML-Schema translation could easily place annotation objects in the generated XML Schema instance, marking elements for this sort of optimization.

3.3 Information and Constraint Preservation

To formalize information and constraint preservation for schema translations, we use first-order predicate calculus. We represent any schema specification in predicate calculus by generating a predicate for each tuple container and a closed formula for each constraint [7]. Using the closed-world assumption, we can then populate the predicates to form an interpretation. If all the constraints hold over the populated predicates, the interpretation is valid. For any populated schema specification of type A there is a corresponding valid interpretation. We can guarantee that a translation T translates a schema specification of type A to a constraint-equivalent schema specification of type B by checking whether the constraints of the generated predicate calculus for the specification of type B imply the constraints of the generated predicate calculus for the specification of type A. A translation T from schema specifications of type A to schema specifications of type B also induces a translation from an interpretation for a schema of type A to an interpretation for a schema of type B. We can guarantee that a T-induced translation translates any valid interpretation into an information-equivalent valid interpretation by translating both of the corresponding valid interpretations to predicate-calculus interpretations and checking for information equivalence.

Definition 1. A translation T from a schema specification of type A to a schema specification of type B preserves information if there exists a procedure P that, for any valid interpretation corresponding to the type-A specification, computes that interpretation from the interpretation induced by T for the type-B specification.

Definition 2. A translation T from a schema specification of type A to a schema specification of type B preserves constraints if the constraints of the type-B specification imply the constraints of the type-A specification.

Lemma 1. Let a C-XML model instance be populated, giving a valid interpretation. There exists a translation that correctly represents this interpretation as a valid interpretation in predicate calculus.¹

Due to space constraints, we have omitted all proofs in this paper.

Lemma 2. Let an XML document conform to an XML Schema instance. There exists a translation that correctly represents the document as a valid interpretation in predicate calculus.

Theorem 1. Let T be the translation described in Section 3.1 that translates a C-XML model instance to an XML Schema instance. Then T preserves information and constraints.

Theorem 2. Let T be the translation described in Section 3.2 that translates an XML Schema instance to a C-XML model instance. Then T preserves information and constraints.
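To make the notion of a populated, valid interpretation concrete, here is a minimal Python sketch (our own illustration, not the formalization used in the omitted proofs): each tuple container becomes a predicate represented as a set of tuples, and each constraint becomes a closed formula evaluated over that population.

```python
# A toy populated interpretation: one predicate (set of tuples) per tuple container.
interpretation = {
    "Item": {("XYZ-0000001", "Nitrogen fertilizer", 19.99),
             ("XYZ-0000002", "Potash fertilizer", 24.50)},
}

def itemnr_is_key(interp) -> bool:
    """Closed formula for the key constraint on Item/ItemNr:
    no two Item tuples share the same ItemNr value."""
    item_nrs = [t[0] for t in interp["Item"]]
    return len(item_nrs) == len(set(item_nrs))

CONSTRAINTS = [itemnr_is_key]

def is_valid(interp) -> bool:
    """An interpretation is valid iff all constraints hold over the populated predicates."""
    return all(constraint(interp) for constraint in CONSTRAINTS)

print(is_valid(interpretation))  # True for the sample population above
```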

4 C-XML Views

This section describes three types of views – simple views that help us scale up to large and complex XML schemas, query-generated views over a single XML schema, and query-generated views over heterogeneous XML schemas.

4.1 High-Level Abstractions in C-XML

We create simple views in two ways. Our first way is to nest and hide C-XML components inside one another [7]. Figure 3 shows how we can nest object sets inside one another. We can pull any object set inside any other connected object set, and we can pull any object set inside any connected relationship set so long as we leave at least two object sets outside (e.g., in Figure 1 we can pull Qty and/or SalePrice inside the diamond). Whether an object set appears on the inside or outside has no effect on the meaning. Once we have object sets on the inside, we can implode the object set or relationship set and thus remove the inner object sets from the view. We can, for example, implode Customer, Item, and PreferredCustomer in Figure 3, presenting a much simpler diagram showing only five object sets and two generalization/specialization components nested in Document. To denote an imploded object or relationship set, we shade the object set or the relationship-set diamond. Later, we can explode object or relationship sets and view all details. Since we allow arbitrary nesting, it is possible that relationship-set lines may cross object- or relationship-set boundaries. In this case, when we implode, we connect the line to the imploded object or relationship set and make the line dashed to indicate that the connection is to an interior object set. Our second way to create simple views is to discard C-XML components that are not of interest. We can discard any relationship set, and we can discard all but any two connections of an n-ary relationship set. We can also discard any object set, but then we must discard (1) any connecting binary relationship sets, (2) any of its connections to n-ary relationship sets, and (3) any specializations and the relationship sets or relationship-set connections to these specializations. Figure 4 shows an example of a high-level abstraction of Figure 1.

Fig. 4. High-Level View of Customer/Order C-XML Model Instance.

In Figure 4 we have discarded Price and its associated binary relationship set, the relationship set for PreviousItem, and the connections to RequestDateTime and Qty in the relationship set involving Manufacturer. We have also hidden OrderID, OrderDate, and all customer information except CustomerName inside Order, and we have hidden SalePrice and Qty inside the Order-Item relationship set. Note that both the Order object set and the Order-Item relationship set are shaded, indicating the inclusion of C-XML components; that neither the Item object set nor the Item-Manufacturer relationship set is shaded, indicating that the original connecting information has been discarded rather than hidden within; and that the line between CustomerName and Order is dashed, indicating that CustomerName connects, not to Order directly, but rather to an object set inside Order.

Theorem 3. Simple, high-level views constructed by properly discarding C-XML components are valid C-XML model instances.

Corollary 1. Any simple, high-level view can be represented by an XML Schema.
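As an illustration of the discard rules, the following Python sketch encodes a fragment of Figure 1 as a hypergraph and discards an object set the way the view in Figure 4 discards Price; the encoding and names are ours, and specializations are omitted from the sketch.

```python
# A hypothetical encoding of a fragment of Figure 1: relationship sets are
# named edges listing the object sets they connect.
model = {
    "object_sets": {"Order", "Item", "Price", "SalePrice", "Qty",
                    "Manufacturer", "RequestDateTime"},
    "relationship_sets": {
        "Item-Price": ["Item", "Price"],
        "Order-Item": ["Order", "Item", "SalePrice", "Qty"],
        "Item-Manufacturer": ["Item", "Manufacturer", "RequestDateTime", "Qty"],
    },
}

def discard_object_set(m: dict, name: str) -> dict:
    """Discard an object set: binary relationship sets that connect it are dropped,
    and its connections to n-ary relationship sets are removed (specializations
    are omitted from this sketch)."""
    remaining = {}
    for rel, members in m["relationship_sets"].items():
        if name not in members:
            remaining[rel] = list(members)
        elif len(members) > 2:
            remaining[rel] = [x for x in members if x != name]
        # binary relationship sets containing the discarded object set disappear
    return {"object_sets": m["object_sets"] - {name}, "relationship_sets": remaining}

# Discarding Price removes the binary Item-Price relationship set, as in Figure 4.
print(sorted(discard_object_set(model, "Price")["relationship_sets"]))
```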

4.2 C-XML XQuery Views

We now consider the use of C-XML views to generate XQuery views. As other researchers have pointed out [2, 5], XQuery can be hard for users to understand and manipulate. One reason XQuery can be cumbersome is that it must follow the particular hierarchical structure of an underlying XML schema, rather than the simpler, logical structure of an underlying conceptual model. Further, different XML sources might specify conflicting hierarchical representations of the same conceptual relationship [2]. Thus, it is highly desirable to be able to construct XQuery views by generating them from a high-level, conceptual-model-based description. [5] describes an algorithm for generating XQuery views from ORA-SS descriptions. [2] also describes how to specify XQuery views by writing conceptual XPath expressions over a conceptual schema and then automatically generating the corresponding XQuery specifications.

Fig. 5. C-XQuery View of Customers Nested within Items Ordered.

In a similar fashion, we can generate XQuery views directly from high-level C-XML views. In some situations a graphical query language would be an excellent choice for creating C-XML views [9], but in keeping with the spirit of C-XML we define an XQuery-like textual language called C-XQuery. Figure 5 shows a high-level view written in C-XQuery over the model instance of Figure 1. We introduce a view definition with the phrase define view, and specify the contents of the view with FLWOR (for, let, where, order by, return) expressions [14]. The first for $item in Item phrase creates an iterator over objects in the Item object set. Since there is no top-level where clause, we iterate over all the items. Also, since C-XML model instances do not have “root nodes,” the idea of context is different. In this case, Item defines the Item object set as the context of the path expression. For each such item, we return an element structure populated according to the nested expressions. C-XQuery is much like ordinary XQuery, with the main distinguishing factor being that our path expressions are conceptual, and so, for example, they are not concerned with the distinction between attributes and elements. Note particularly that for the data fields, such as ItemNr, CustomerName, and OrderDate, we do not care whether the generated XML treats them as attributes or elements. A more subtle characteristic of our conceptual path expressions is that since they operate over a flat C-XML structure, we can traverse the conceptual-model graph more flexibly, without regard for hierarchical structure. Thus, we generalize the notion of a path expression so that the expression A//B designates the path from A to B regardless of hierarchy or the number of intervening steps in the path [9]. This can lead to ambiguity in the presence of cycles or multiple paths between nodes, but we can automatically detect ambiguity and require the user to disambiguate the expression (say, by designating an intermediate node that fixes a unique path).
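The ambiguity check just mentioned can be realized with an ordinary graph search. The following Python sketch is our own, with an invented edge list; it counts simple paths between two object sets and flags A//B as ambiguous when more than one exists.

```python
from collections import defaultdict

def ambiguous_path(edges, start, goal) -> bool:
    """Return True if more than one simple path connects two object sets, i.e.,
    if the conceptual path expression start//goal would be ambiguous."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    paths_found = 0

    def dfs(node, visited):
        nonlocal paths_found
        if paths_found > 1:          # stop early once ambiguity is established
            return
        if node == goal:
            paths_found += 1
            return
        for nxt in graph[node]:
            if nxt not in visited:
                dfs(nxt, visited | {nxt})

    dfs(start, {start})
    return paths_found > 1

# A tree-shaped fragment has unique paths, so Customer//Manufacturer is unambiguous.
edges = [("Customer", "Order"), ("Order", "Item"), ("Item", "Manufacturer")]
print(ambiguous_path(edges, "Customer", "Manufacturer"))  # False
```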

Fig. 6. C-XQuery over the View of Customers Nested within Items Ordered.

Given a view definition, we can write queries against the view. For the view in Figure 5, for example, the query in Figure 6 finds customers who have purchased more than $300 worth of nitrogen fertilizer within the last 90 days. To execute the query, we unfold the view according to the view definition and minimize the resulting XQuery. See [13] for a discussion of the underlying principles. The view in Figure 6 illustrates the use of views over views. Indeed, applications can use views as first-class data sources, just like ordinary sources, and we can write queries against the conceptual model and views over that model. In any case, we translate the conceptual queries to XQuery specifications over the XML Schema instance generated for the C-XML conceptual model.

Theorem 4. A C-XQuery view Q over a C-XML model instance C can be translated to an XQuery query over a corresponding XML Schema instance.

Observe that by the definition of XQuery [14], any valid XQuery instance generates an underlying XML Schema instance. By Theorem 4, we thus know that for any C-XQuery view we retain a correspondence to XML Schema. In particular, this means we can compose views of views to an arbitrary depth and still retain a correspondence to XML Schema.

4.3 XQuery Integration Mappings

To motivate the use of views in enterprise conceptual modeling, suppose that through mergers and acquisitions we acquire the catalog inventory of another company. Figure 7 shows the C-XML for this assumed catalog. We can rapidly integrate this catalog into the full inventory of the parent company by creating a mapping from the acquired company’s catalog to the parent company’s catalog. Figure 8 shows such a mapping. In order to integrate the source (Figure 7) with the target (Figure 1), the mapping needs to generate target names in the source.

Fig. 7. C-XML Model Instance for the Catalog of an Acquired Company.

Fig. 8. C-XQuery Mapping for Catalog Integration.

In this example, CatalogItem, CatalogNr, and ShortName correspond respectively to Item, ItemNr, and Description. We must compute Price in the target from the MSRP and MarkupPercent values in the source, as Figure 8 shows. We assume the function CatalogNr-to-ItemNr is either a hand-coded lookup table or a manually programmed function that translates source catalog numbers to item numbers in the target. The underlying structure of this mapping query corresponds directly to the relevant section of the C-XML model instance in Figure 1, so integration is now immediate. The mapping in Figure 8 creates a target-compatible C-XQuery view over the acquired company’s catalog in Figure 7. When we now query the parent company’s items, we also query the acquired company’s catalog. Thus, the previous examples are immediately applicable. For example, we can find those customers who have ordered more than $300 worth of nitrogen fertilizer from either the inventory of the parent company or the inventory of the acquired company by simply issuing the query in Figure 6. With the acquired company’s catalog integrated, when the query in Figure 6 iterates over customer orders, it iterates over data instances for both Item in Figure 1 and CatalogItem in Figure 8. (Now, if a potential terrorist has purchased, say, $200 worth of nitrogen fertilizer from the original company and $150 worth from the acquired company, that customer will appear on the list, whereas before the integration the customer would have appeared on neither list.) We could also write a mapping query going in the opposite direction, with Figure 1 as the source and Figure 7 as the target. Such bidirectional integration is useful in circumstances where we need to shift between perspectives, as is often the case in enterprise application development. This is especially true because enterprise data is rarely fully integrated.
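The arithmetic behind such a mapping is not reproduced in this excerpt, so the following Python sketch only illustrates the kind of computation involved; the markup formula and the lookup-table contents are assumptions, not taken from Figure 8.

```python
# Hypothetical hand-coded lookup table standing in for CatalogNr-to-ItemNr.
CATALOG_NR_TO_ITEM_NR = {"C-1001": "XYZ-0000001", "C-1002": "XYZ-0000002"}

def target_price(msrp: float, markup_percent: float) -> float:
    """Compute the target Price from the source MSRP and MarkupPercent.
    The exact formula of Figure 8 is not shown here; a simple percentage
    markup is assumed."""
    return round(msrp * (1 + markup_percent / 100.0), 2)

def map_catalog_item(source: dict) -> dict:
    """Map one acquired-catalog item to the parent company's Item structure,
    using the correspondences named in the text."""
    return {"ItemNr": CATALOG_NR_TO_ITEM_NR[source["CatalogNr"]],
            "Description": source["ShortName"],
            "Price": target_price(source["MSRP"], source["MarkupPercent"])}

print(map_catalog_item({"CatalogNr": "C-1001",
                        "ShortName": "Nitrogen fertilizer",
                        "MSRP": 18.00,
                        "MarkupPercent": 10.0}))
```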

In general it would be nice to have a mostly automated tool for generating integration mappings. In order to support such a tool, we require two-way mappings between both schemas and data elements. Sometimes we can use automated element matchers [1, 12] to help us with the mapping. However, in other cases the mappings are intricate and require programmer intervention (e.g., calculating Price from MSRP plus a MarkupPercent or converting CatalogNr to ItemNr). In any case, we can write C-XQuery views describing each such mapping, with or without the aid of tools (e.g., [11]), and we can compose these views to provide larger C-XQuery schema mappings. Of course there are many integration details we do not address here, such as handling dirty data, but the approach of integrating by composing C-XQuery views is sound.

Theorem 5. A C-XQuery view Q over a C-XML model instance C of an external, federated XML Schema can be translated to an XQuery query over a corresponding XML Schema instance.

5 Concluding Remarks

We have offered Conceptual XML (C-XML) as an answer to the challenge of modern enterprise modeling. C-XML is equivalent in expressive power to XML Schema (Theorems 1 and 2). In contrast to XML Schema, however, C-XML provides for high-level conceptualization of an enterprise. C-XML allows users to view schemas at any level of abstraction and at various levels of abstraction in the same specification (Theorem 3), which goes a long way toward mitigating the complexity of large data sets and complex interrelationships. Along with C-XML, we have provided C-XQuery, a conceptualization of XQuery that relieves programmers from concerns about the often arbitrary choice of nesting and the arbitrary choice of whether to represent values with attributes or with elements. Using C-XQuery, we have shown how to define views and automatically translate them to XQuery (Theorem 4). We have also shown how to accommodate heterogeneity by defining mapping views over federated data repositories and automatically translating them to XQuery (Theorem 5). Implementing C-XML is a huge undertaking. Fortunately, we have a foundation on which to build. We have already implemented tools relevant to C-XML, including graphical diagram editors, model checkers, textual model compilers, a model execution engine, and several data integration tools. We are actively continuing development of an Integrated Development Environment (IDE) for modeling-related activities. Our strategy is to plug new tools into this IDE rather than develop stand-alone programs. Our most recent implementation work consists of tools for automatic generation of XML normal-form schemes. We are now working on the implementation of the algorithms to translate C-XML to XML Schema, XML Schema to C-XML, and C-XQuery to XQuery.

Acknowledgements

This work is supported in part by the National Science Foundation under grant IIS-0083127 and by the Kevin and Debra Rollins Center for eBusiness at Brigham Young University.

References

1. J. Biskup and D. Embley. Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3):169–212, 2003.
2. S. Camillo, C. Heuser, and R. dos Santos Mello. Querying heterogeneous XML sources through a conceptual schema. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, pages 186–199, Chicago, Illinois, October 2003.
3. M. Carey. Enterprise information integration – XML to the rescue! In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, page 14, Chicago, Illinois, October 2003.
4. Y. Chen, T. Ling, and M. Lee. Designing valid XML views. In Proceedings of the 21st International Conference on Conceptual Modeling (ER2002), pages 463–477, Tampere, Finland, October 2002.
5. Y. Chen, T. Ling, and M. Lee. Automatic generation of XQuery view definitions from ORA-SS views. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, pages 158–171, Chicago, Illinois, October 2003.
6. R. Conrad, D. Scheffner, and J. Freytag. XML conceptual modeling using UML. In Proceedings of the Nineteenth International Conference on Conceptual Modeling (ER2000), pages 558–571, Salt Lake City, Utah, October 2000.
7. D. Embley, B. Kurtz, and S. Woodfield. Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
8. D. Embley and W. Mok. Developing XML documents with guaranteed ‘good’ properties. In Proceedings of the 20th International Conference on Conceptual Modeling (ER2001), pages 426–441, Yokohama, Japan, November 2001.
9. S. Liddle, D. Embley, and S. Woodfield. An active, object-oriented, model-equivalent programming language. In M. Papazoglou, S. Spaccapietra, and Z. Tari, editors, Advances in Object-Oriented Data Modeling, pages 333–361. MIT Press, Cambridge, Massachusetts, 2000.
10. M. Mani, D. Lee, and R. Muntz. Semantic data modeling using XML schemas. In Proceedings of the 20th International Conference on Conceptual Modeling (ER2001), pages 149–163, Yokohama, Japan, November 2001.
11. R. Miller, L. Haas, and M. Hernandez. Schema mapping as query discovery. In Proceedings of the 26th International Conference on Very Large Databases (VLDB'00), pages 77–88, Cairo, Egypt, September 2000.
12. E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334–350, 2001.
13. I. Tatarinov and A. Halevy. Efficient query reformulation in peer data management systems. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004 (to appear).
14. XQuery 1.0: An XML Query Language, November 2003. URL: http://www.w3.org/TR/xquery/.

Graphical Reasoning for Sets of Functional Dependencies

János Demetrovics¹, András Molnár², and Bernhard Thalheim³

¹ MTA SZTAKI, Computer and Automation Institute of the Hungarian Academy of Sciences, Kende u. 13-17, H-1111 Budapest, Hungary
² Department of Information Systems, Faculty of Informatics, Eötvös Loránd University Budapest, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary
³ Computer Science and Applied Mathematics Institute, University Kiel, Olshausenstrasse 40, 24098 Kiel, Germany

Abstract. Reasoning on constraint sets is a difficult task. Classical database design is based on a step-wise extension of the constraint set and on a consideration of constraint sets through generation by tools. Since the database developer must master semantics acquisition, tools and approaches are still sought that support reasoning on sets of constraints. We propose novel approaches for presentation of sets of functional dependencies based on specific graphs. These approaches may be used for the elicitation of the full knowledge on validity of functional dependencies in relational schemata.

1 Design Problems During Database Semantics Specification and Their Solution

Specification of database structuring is based on three interleaved and dependent parts [9]:

Syntactics: Inductive specification of structures uses a set of base types, a collection of constructors and a theory of construction limiting the application of constructors by rules or by formulas in deontic logics. In most cases, the theory may be dismissed.

Semantics: Specification of admissible databases on the basis of static integrity constraints describes those database states which are considered to be legal.

Pragmatics: Description of context and intension is based either on explicit reference to the enterprise model, to enterprise tasks, to enterprise policy, and environments, or on intensional logics used for relating the interpretation and meaning to users depending on time, location, and common sense.

Specification of syntactics is based on the database modeling language. Specification of semantics requires a logical language for specification of classes of constraints. Typical constraints are dependencies such as functional, multivalued, and inclusion dependencies, or domain constraints. Specification of pragmatics is often not explicit. The specification of semantics is often rather difficult due to its complexity. For this reason, it must be supported by a number of solutions supporting acquisition of and reasoning on constraints.

Prerequisites of Database Design Approaches. Results obtained during database structuring are evaluated on two main criteria: completeness [8] and unambiguity of specification. Completeness requires that all constraints that must be specified are found. Unambiguity is necessary in order to provide a reasoning system. Both criteria have found their theoretical and pragmatical solution for most of the known classes of constraints. Completeness is, however, restricted by the human ability to survey large constraint sets and to understand all possible interactions among constraints.

Theoretical Approaches to Problem Solution: A number of normalization and restructuring algorithms have been developed for functional dependencies. We do not yet know simple representation systems for surveying constraint sets and for detecting missing constraints beyond functional dependencies.

Pragmatical Approaches to Problem Solution: A step-wise constraint acquisition procedure has been developed in [7,10,12]. The approach is based on the separation of constraints into:

The set of valid functional dependencies: all dependencies that are known to be valid and all those that can be implied from the set of valid and excluded functional dependencies.

The set of excluded functional dependencies: all dependencies that are known to be invalid and all those that can be implied to be invalid from the set of valid and excluded functional dependencies.

This approach leads to the following simple elicitation algorithm (a code sketch follows below):
1. Basic step: Specify the obvious constraints.
2. Recursion step: Repeat until the sets of valid and excluded dependencies do not change: find a functional dependency that is neither in the valid set nor in the excluded set; if it is valid, add it to the valid set; if it is invalid, add it to the excluded set; generate the logical closures of both sets.

This algorithm can be refined in various ways. Elicitation algorithms known so far are all variations of this simple elicitation algorithm. However, neither the theoretical solutions nor the pragmatical approach provides a solution to Problem 1: Define a pragmatical approach that allows simple representation of and reasoning on database constraints. This problem becomes more severe in association with the following problems.
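Read operationally, the elicitation loop might look like the following Python sketch; the oracle and closure procedures are assumed to be supplied (by the designer and by an implication engine, respectively), and all names are ours.

```python
from itertools import combinations

def next_unclassified_fd(attributes, valid, excluded):
    """Return a canonical candidate (lhs, rhs) that is in neither set, or None."""
    for rhs in attributes:
        for size in range(1, len(attributes)):
            for lhs in combinations(sorted(set(attributes) - {rhs}), size):
                if (lhs, rhs) not in valid and (lhs, rhs) not in excluded:
                    return (lhs, rhs)
    return None

def elicit(attributes, obvious_valid, obvious_excluded, oracle, closure):
    """Step-wise elicitation loop.  `oracle(fd)` answers whether a candidate is
    valid in the application; `closure(valid, excluded)` returns both sets
    extended by all implied constraints (supplied by an implication engine)."""
    valid, excluded = closure(set(obvious_valid), set(obvious_excluded))  # basic step
    while True:                                                            # recursion step
        fd = next_unclassified_fd(attributes, valid, excluded)
        if fd is None:
            return valid, excluded
        (valid if oracle(fd) else excluded).add(fd)
        valid, excluded = closure(valid, excluded)
```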

Complexity of Semantics. Typical algorithms such as normalization algorithms can only generate a correct result if the specification is complete. Such completeness is not harmful as long as constraint sets are small. The number of constraints may, however, be exponential in the number of attributes [3]. Therefore, specification of the complete set of functional dependencies may be an infeasible task. This problem is closely related to another well-known combinatorial problem presented by János Demetrovics during MFDBS’87 [11] that is still only partially solved:

Problem 2. What is the size of sets of independent functional dependencies for an n-ary relation schema?

Inter-dependence Within a Constraint Set. Constraints such as functional dependencies are not independent from each other. Typical axiomatizations use rules such as the union, transitivity and path rules. Developers do not reason this way. Therefore, the impact of adding, deleting or modifying a constraint within a constraint set is not easy to capture, and we need a system for reasoning on constraint sets.

Theoretical Approaches to Problem Solution: [14] and [1] propose to use a graph-based representation of sets of functional dependencies. This solution provides a simple survey as long as constraints are simple, i.e., use singleton sets for the left sides. [13] proposes to use a schema architecture by developing first elementary schema components and constructing the schema by application of composition operations which use these components. [4] proposes to construct a collection of interrelated lattices of functional dependencies. Each lattice represents a component of [13]. The set of functional dependencies is then constructed through folding of the lattices.

Pragmatical Approaches to Problem Solution: [6] proposes to use a fact-based approach instead of modeling of attributes. Elementary facts are ‘small’ objects that cannot be decomposed without losing meaning.

We thus must solve Problem 3: Develop a reasoning system that supports easy maintenance and development of constraint sets and highlights logical inter-dependence among constraints.

Instability of Normalization. Normalization is based on the completeness of constraint sets. This is impractical. Although database design tools can support completeness, incompleteness of specification should be considered the normal situation. Therefore, normalization approaches should be robust with regard to incompleteness.

Problem 4. [12] Find a normalization theory which is robust for incomplete constraint sets or robust according to a class of changes in constraint sets.

Problems That Currently Defy Solution. Dependency theory consists of work on about 95 different classes of dependencies, with only a very few classes that have been treated together. Moreover, many properties of sets of functional dependencies remain unknown. In most practical cases the negative results obtained in dependency theory do not restrict the common utilization of several classes. The reason for this is that the constraint sets used do not have these properties. Therefore, we need other classification principles for describing ‘real life’ constraint sets.

Problem 5. [12] Classify ‘real life’ constraint sets which can be easily maintained and specified.

This problem is related to one of the oldest problems in database research, expressed by Joachim Biskup in the open problems session [11] of MFDBS’87:

Problem 6. Develop a specification method that supports consideration of sets of functional dependencies and derivation of properties of those sets.

Outline of the Paper¹ and the Kernel Problem Behind Open Problems. The six problems above can be solved on one common basis: find a simple and sophisticated representation of sets of constraints that supports reasoning on constraints. This problem is infeasible in general. Therefore, we first provide a mechanism to reason on sets of functional dependencies defined on small sets of attributes. Geometrical figures such as polygons or tetrahedra nicely support reasoning on constraints. Next we demonstrate the representation for attribute sets consisting of three or four attributes. Finally we introduce the implication system for the graphical representations and show how these representations lead to a very simple and sophisticated treatment of sets of functional dependencies.

2 Sets of Functional Dependencies for Small Relation Schemata

2.1 Universes of Functional Constraints

Besides functional dependencies (FDs), we use excluded functional constraints (also called negated functional dependencies): such a constraint states that the corresponding functional dependency is not valid. Treating sets of functional constraints becomes simpler if we avoid dealing with obviously redundant constraints. In our notation, a trivial constraint (a functional dependency or an excluded functional constraint) is a constraint that has at least one attribute of its left-hand side and right-hand side in common or has the empty set as its right-hand side.

Due to the lack of space, this paper does not contain proofs or the representation of all possible sets of functional dependencies. All technical details as well as some other means of representation can be read in a technical report available under [5].

Furthermore, a canonical (singleton) functional dependency or a singleton excluded functional constraint has exactly one attribute on its right-hand side. We introduce notations for the universes of functional dependencies, of non-trivial functional dependencies and of non-trivial canonical (singleton) functional dependencies over a fixed underlying domain of attribute symbols. Similarly, we use notations for the universes of excluded functional constraints, of non-trivial excluded constraints and of non-trivial singleton excluded functional constraints (negated non-trivial, canonical dependencies) over the same set of attribute symbols. The traditional universe of functional constraints includes all functional dependencies and excluded constraints, while our graphical representations deal with sets of constraints over the restricted universe of non-trivial canonical functional dependencies and non-trivial singleton excluded functional constraints only. It will be shown that we do not lose relevant deductive power by applying this restriction to the universe of functional constraints. In most of the cases, we focus on closed sets of functional dependencies. A finite set of constraints is closed (over the restricted universe) iff it equals its closure, i.e., the set of all constraints of the restricted universe implied by it.
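For concreteness, the following Python sketch encodes these notions for constraints given as (left-hand side, right-hand side) pairs of attribute sets; the helper names are ours.

```python
def is_trivial(lhs: frozenset, rhs: frozenset) -> bool:
    """Trivial: the two sides share an attribute or the right-hand side is empty."""
    return bool(lhs & rhs) or not rhs

def is_canonical(rhs: frozenset) -> bool:
    """Canonical (singleton): exactly one attribute on the right-hand side."""
    return len(rhs) == 1

def to_nontrivial_canonical(lhs: frozenset, rhs: frozenset):
    """Decompose a functional dependency into the equivalent non-trivial canonical
    dependencies.  (As noted in the text, excluded constraints with several
    right-hand-side attributes cannot be split this way.)"""
    return {(lhs, frozenset({a})) for a in rhs - lhs}

fd = (frozenset({"A", "B"}), frozenset({"B", "C"}))
print(is_trivial(*fd))               # True: the two sides share attribute B
print(to_nontrivial_canonical(*fd))  # only AB -> C remains after decomposition
```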

2.2 The Notion of Dimension

For the classification of functional constraints and the attributes they refer to, we first introduce the notion of dimension. The dimension of a constraint is simply the size of its left-hand side, i.e., the number of attributes on its left-hand side (the dimension of an excluded functional constraint is defined in the same way). For a single attribute A, given a set of functional dependencies, the dimension of A, denoted simply by [A], is the smallest dimension of a functional dependency in the set that has A as its right-hand side. This definition is extended with an infinite dimension for the case when no such dependency exists in the set. The dimensions of attributes classify the sets of functional dependencies.
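A minimal Python sketch of these definitions follows. Whether the minimum is taken over the given set or over its closure is not recoverable from this excerpt; the sketch assumes the given set, and `None` stands in for the infinite-dimension case.

```python
def constraint_dimension(lhs) -> int:
    """Dimension of a constraint: the number of attributes on its left-hand side."""
    return len(lhs)

def attribute_dimension(attr, fds):
    """Dimension of an attribute with respect to a set of canonical functional
    dependencies: the smallest dimension of a dependency having the attribute
    as its right-hand side, or None when no such dependency exists."""
    dims = [constraint_dimension(lhs) for lhs, rhs in fds if rhs == attr]
    return min(dims) if dims else None

fds = {(frozenset({"A", "B"}), "C"), (frozenset({"A"}), "B")}
print(attribute_dimension("C", fds))  # 2
print(attribute_dimension("B", fds))  # 1
print(attribute_dimension("A", fds))  # None (the 'no such dependency' case)
```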

2.3 Summary of the Number of Closed Sets

Let n be the number of attributes of the considered relation schema. Consider the set of closed sets of (singleton, non-trivial) functional dependencies for such a schema (with constant attributes disallowed). Defining an equivalence relation that classifies these sets into different types or cases (two sets are equivalent if there exists a permutation of the attributes transforming one set into the other), we focus on the resulting classes and the size of this family of classes. Another possibility is to allow attributes to be stated as constants. Performing this extension, we get a larger set of closed sets, and the different cases (types) of functional dependency sets taking these zero-dimensional constraints into account form a correspondingly larger family of classes; it can easily be verified that the former family is contained in the latter for each n. With these notations, Table 1 shows the number of closed sets of functional dependencies for unary, binary, ternary, quaternary and quinary relational schemata and demonstrates the combinatorial explosion of the search space.

3 The Graphical Representation of Sets of Functional Dependencies

There have been several proposals for graphical representation of sets of functional dependencies. Well-known books such as [1] and [14] have used a graph-theoretic notion. Nevertheless, these graphical notations have not made their way into practice and education. The main reason for this failure is the complexity of representation. Graphical representations are simple as long as the sets of functional dependencies are not too complex. [2] has proposed a representation for the ternary case based on either assigning an N-notation to an edge from X to Y if nothing is known or assigning a 1-notation to the edge at the Y end if the functional dependency from X to Y is valid. This representation is simple enough but already redundant in the case of ternary relationship types. Moreover, it is not generalizable to n-ary relationship types with more than three components. We use a simpler notation which reflects the validity of functional dependencies in a simpler and better understandable fashion. We distinguish two kinds of functional dependencies for the ternary case:

One-dimensional (singleton left sides): Functional dependencies with a singleton left-hand side can be decomposed into canonical functional dependencies with singleton right-hand sides. They are represented by endpoints of binary edges (1D shapes) in the triangular representation.

Two-dimensional (two-element left sides): Functional dependencies with two-element left-hand sides cannot be decomposed in this way. They are represented in the triangle (2D shape) on the node relating their right side to the corner.

Fig. 1. Triangular representation of sets of functional dependencies for the ternary case

We may also represent candidates for excluded functional dependencies, by crossed circles for the case that we know that the corresponding functional dependency is not valid in the application, or by small circles for the case that we do not know whether the functional dependency holds. We now use the following notations in the figures: basic functional dependencies are denoted by filled circles; implied functional dependencies are denoted by circles; negated basic functional dependencies are denoted either by dots or by crossed filled circles; implied negated functional dependencies are denoted either by dots or by crossed circles.

Fig. 2. Examples of the triangular representation

Figure 2 shows some examples of the triangular representation. The left part shows a basic functional dependency together with the functional dependency it implies. The middle triangle pictures a set of basic functional dependencies and their implied functional dependencies. The right picture gives a negated functional dependency and the negated functional dependencies it implies. As mentioned above, the triangular representation can be generalized to higher numbers of attributes. Generalization can be performed in two directions: representation in a higher-dimensional space (3D in the case of 4 attributes, resulting in the tetrahedral representation) or construction of a planar (2D, quadratic) representation. We use the same approach as before in the case of three attributes. An example is displayed in Figure 3 (implication is explained later). In this paper we concentrate on the 2D representation.

Fig. 3. The tetrahedral and quadratic representations of the generated set of functional dependencies.

This representation can be generalized to the case of 5 attributes.

4 Implication Systems for the Graphical Representations

Excluded functional constraints and functional dependencies are axiomatizable by the following formal system [12].

The universe of the extended Armstrong implication system is the traditional universe of functional constraints (see Section 2.1), while our graphical and spreadsheet representations deal with sets of constraints over the restricted universe of non-trivial singleton constraints.² However, the axiom and rules of the extended Armstrong implication system do not correspond to this restriction. It will be shown that an equivalent implication system can be constructed if these restrictions are applied to the universe of constraints. We develop a new implication system for graphical reasoning:

The rules presented here can be applied directly for deducing consequences of a set of constraints given in terms of the graphical or spreadsheet representation. We use the following two implication systems: the ST implication system, over non-trivial singleton functional dependencies, with rules (S) and (T) and no axioms; and the PQRST implication system, over non-trivial singleton functional dependencies and excluded functional constraints, with all the presented rules and a symbolic axiom which is used for indicating contradiction. These systems are sound and complete for deducing non-trivial, singleton constraints.

Theorem 1. The ST system is sound and complete over the universe of non-trivial singleton functional dependencies, i.e., for each finite set of such dependencies it derives exactly the non-trivial singleton dependencies implied by that set.

Theorem 2. The PQRST system without the contradiction axiom is sound over the universe of non-trivial singleton functional dependencies and excluded functional constraints, and it is complete with the restriction that the given constraint set cannot be contradictory. Moreover, the contradiction axiom can be derived iff the given constraint set is contradictory.

The implication systems introduced above have the advantage that there exists a specific order of rules which provides a complete algorithmic method for obtaining all the implied functional dependencies and excluded functional constraints starting with an initial set, allowing one to determine the possible types of relationships the initial set of dependencies defines.

Theorem 3.³
1. If a finite set of non-trivial singleton functional dependencies implies a non-trivial singleton dependency, then that dependency can be deduced starting with the set by using the rules (S) and (T) in such a way that no application of (T) precedes any application of (S).
2. If a finite set of non-trivial singleton functional dependencies and excluded functional constraints implies a non-trivial singleton constraint, then that constraint can be deduced starting with the set by using the rules (S), (T), (R), (P) and (Q) in such a way that no application of (T) precedes any application of (S), no application of (R) precedes any application of (T), and no application of (P) or (Q) precedes any application of (R). The order of (P) and (Q) is arbitrary. Furthermore, (R) needs to be applied at most once if the derived constraint is an excluded functional constraint.

² For example, a functional dependency with several attributes on its right-hand side is represented by the corresponding singleton dependencies. Excluded functional constraints with more than one attribute on their right-hand sides cannot be eliminated this way; however, omitting these can also be achieved (see [5]).
³ Proofs of the theorems are given in [5].

5 Graphical Reasoning

Rules of the PQRST implication system support graphical reasoning. We first discuss the case of three attributes.

Fig. 4. Graphical versions of rules (S), (T) and (P), (Q), (R)

Graphical versions of the rules are shown in Figure 4 for the triangular representation (the case Y = {C}). The small black arrows indicate support (necessary context), while the large grey arrows show the implication effects. Rule (S) is a simple extension rule, and rule (T) can be called the “rotation rule” or “reduction rule”. We may call the left-hand side of a functional dependency its determinant and the right-hand side its determinate. Rule (S) can be used to extend the determinant of a dependency, resulting in another dependency of one dimension higher, while rule (T) is used for rotation, that is, to replace the determinate of a functional dependency with the support of another functional dependency of one dimension higher (the small black arrow at B indicates this support). Another possible way to interpret rule (T) is as reduction of the determinant of a higher-dimensional dependency by omitting an attribute if a dependency holds among the attributes of the determinant. For excluded functional constraints, rule (Q) acts as the extension rule (it needs the support of a positive constraint, i.e., a functional dependency) and (R) as the rotation rule (it needs a positive support too). These two rules can also be viewed as negations of rule (T). Rule (P) is the reduction rule for excluded functional constraints, with the opposite effect of rule (Q) (but without the need of support). Rule (Q) can also be viewed as the negation of rule (S). These graphical rules can be generalized to higher-dimensional cases, where the number of attributes is more than 3. Figure 5 shows the patterns of rules (S) and (T) for the case of four attributes. We use two or three patterns for a single case since we need a way to survey constraint derivation by (not completely symmetric) 2D diagrams. We differentiate between the case that the rules (S) and (T) use functional dependencies with singleton left-hand sides and the case that the minimum dimension of the functional dependencies is two.

Fig. 5. Patterns of graphical rules (S) and (T) for the quadratic representation

Theorem 3 in Section 4 shows that, for positive dependencies, using (S) first as many times as possible and then using (T) as many times as possible is a complete method for obtaining all non-trivial positive consequences of a given set of constraints. We may call it the ST algorithm.⁴ This can be extended to the case with excluded functional constraints. We now present it as an algorithm for FD derivation based on the graphical representation.

The STRPQ Algorithm for Sets of Both Positive and Negative Constraints. Rules (P), (Q) and (R) can be applied as complements of rules (S) and (T), resulting in the following algorithm, called the STRPQ algorithm (based on part 2 of Theorem 3):

With some modifications, this algorithm has been used for generating and counting all sets of functional dependencies (see Section 2.3) with a PROLOG program.

1. Start with the given initial set of non-trivial, singleton functional dependencies and excluded functional constraints as input.
2. Extend the determinants of each dependency using rule (S) as many times as possible.
3. Apply rule (T) until no changes occur.
4. Apply rule (R) until no changes occur.
5. Reduce and extend the determinants of excluded constraints using rules (P) and (Q) as many times as possible.
6. Output the generated set.

The positive portion of this procedure (steps 2 and 3) is sketched in code below.
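The following Python sketch implements steps 2 and 3 of the STRPQ algorithm above for positive dependencies only. The concrete forms of rules (S) and (T) are taken from the prose description in Section 5 – (S) extends a determinant by one attribute, (T) drops an attribute from a determinant when the rest of the determinant functionally determines it – so treat this as an assumption-laden sketch rather than the paper's exact formal rules.

```python
def st_closure(fds, attributes):
    """Steps 2-3 of the algorithm above for positive dependencies: exhaust rule (S),
    then exhaust rule (T).  A dependency is a pair (frozenset lhs, rhs attribute);
    (S) extends a determinant by one attribute, (T) drops an attribute B from a
    determinant when the rest of the determinant functionally determines B."""
    derived = set(fds)

    changed = True                      # rule (S): determinant extension
    while changed:
        changed = False
        for lhs, a in list(derived):
            for b in attributes:
                if b not in lhs and b != a and (lhs | {b}, a) not in derived:
                    derived.add((frozenset(lhs | {b}), a))
                    changed = True

    changed = True                      # rule (T): determinant reduction
    while changed:
        changed = False
        for lhs, a in list(derived):
            for b in lhs:
                reduced = lhs - {b}
                if (reduced and (reduced, b) in derived
                        and a not in reduced and (reduced, a) not in derived):
                    derived.add((reduced, a))
                    changed = True
    return derived

attrs = {"A", "B", "C"}
fds = {(frozenset({"A"}), "B"), (frozenset({"B"}), "C")}
print((frozenset({"A"}), "C") in st_closure(fds, attrs))  # True: A -> C is implied
```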

The algorithm just presented can be used for reasoning on sets of functional constraints, especially in terms of the graphical representations. The structure of the generalized triangular representations (2D–triangular, 3D–tetrahedral, etc.) may also be used for designing a data structure representing sets of functional constraints for the algorithms.

6

Applying Graphical Reasoning to Sets of Functional Dependencies

Let us consider a more complex example discussed in [12]. We are given a part of the Berlin airport management database for scheduling flights and pilots at one of its airports. Flights depart at one departure time and to one destination. A flight can get different pilots and can be assigned to different gates on each day of a week. In the given application we observe the following functional dependencies for the attributes Flight#, (Chief)Pilot#, Gate#, Day, Hour, Destination:

{ Flight#, Day } → { Pilot#, Gate#, Hour }
{ Flight# } → { Destination, Hour }
{ Day, Hour, Gate# } → { Flight# }
{ Pilot#, Day, Hour } → { Flight# }

As noticed in [12], we can model this database in five very different ways. Figure 6 displays one of the solutions. All types in Figure 6 are in third normal form. Additionally, the following constraints are valid for the solution in Figure 6: flies: { GateSchedule.Time, Pilot.# } → { GateSchedule }. The two schemata additionally have transitive path constraints, e.g.: flies: { GateSchedule.Time, Day, Flight.# } → { GateSchedule.Gate.# }. But the types are still in third normal form since for each functional dependency X → Y defined for the types either X is a key or Y is a part of a key. The reason for the existence of rather complex constraints is the twofold usage of Hour. For instance, in our solution we find the equality constraint: flies.Flight.Hour = flies.GateSchedule.Time.Hour. We must now know whether the set of functional dependencies is complete. The combinatorial complexity of brute-force consideration of dependency sets is overwhelming. Let us now apply our theoretical findings to cope with the complexity and to reason on the sets of functional dependencies. We may use the following algorithm:


Fig. 6. An extended ER Schema for the airline database with transitive path constraints

1. Consider whether the attributes that do not occur on any left-hand side of a functional dependency are really dangling. This is done by using the STRPQ algorithm with each such attribute and the rest of the attributes. We may strip out dangling attributes without losing reasoning power. In the example we strip out Destination.
2. Combine attributes into groups such that they appear together in left-hand sides of functional dependencies. Consider first the relations among those attribute groups using the STRPQ algorithm. In our example we consider the groups (A) Day, Hour; (B) Flight#, Day; (C) (Chief)Pilot#; and (D) Gate#. The result is shown in Figure 3.
3. Now apply the STRPQ algorithm recursively to decompositions of the attribute groups.

The example shows how graphical reasoning can be applied directly to larger sets of attributes which have complex relations among them that can be expressed through functional dependencies.
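As a cross-check of such derivations, the classical attribute-set closure computation (the textbook counterpart of the graphical rules used above, not the method proposed in this paper) can be used to test whether a candidate dependency follows from a given set. The Python sketch below is illustrative only; the attribute names follow the airport example, and the helper functions are our own.

```python
# Minimal sketch: classical attribute-set closure, used here only to
# cross-check consequences of the airport FD set from Section 6.

FDS = [
    ({"Flight#", "Day"}, {"Pilot#", "Gate#", "Hour"}),
    ({"Flight#"}, {"Destination", "Hour"}),
    ({"Day", "Hour", "Gate#"}, {"Flight#"}),
    ({"Pilot#", "Day", "Hour"}, {"Flight#"}),
]

def closure(attrs, fds):
    """Compute the closure of an attribute set under a list of FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def implies(lhs, rhs, fds=FDS):
    """Does lhs -> rhs follow from the given FDs?"""
    return set(rhs) <= closure(lhs, fds)

# Example: { Pilot#, Day, Hour } -> { Gate# } is a derived dependency.
print(implies({"Pilot#", "Day", "Hour"}, {"Gate#"}))  # True
```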

7 Conclusion

The problem of whether there exists a simple and sophisticated representation of sets of constraints that supports reasoning on constraints is solved in this paper by introducing a more surveyable means for the representation of constraint sets: the graphical representation. It requires a different implication system than the classical Armstrong system. We therefore introduced another system and showed its soundness and completeness (Theorems 1 and 2). This system has a further useful property (Theorem 3): constraint derivation may be ordered on the basis of sequences of rules, so that rule application can be described by a regular expression over the rules. This order of rule application is extremely useful whenever we want to know whether the set of generated functional constraints is full (closed), i.e., consists of all (positive, or both positive and negative) dependencies that follow from the given initial system of functional constraints. Based on this, we were able to generate all possible sets of initial functional dependencies. Graphical reasoning supports a simpler style of reasoning on constraint sets. Completeness and soundness of systems of functional dependencies and excluded


functional dependencies become surveyable. Since database design approaches rely on the completeness and soundness of constraint sets, our approach enables database designers to obtain better database design results.

Acknowledgements
We would like to thank Tibor Ásványi for his help in improving the efficiency of our PROLOG program, which generates the sets of functional dependencies, and Zoltán Csaba Regéci for his assistance in running the program at MTA SZTAKI. We are also grateful to Andrea Molnár for her valuable comments on the illustration of the graphical rules and the tetrahedral representation.

References
1. P. Atzeni and V. De Antonellis. Relational database theory. Addison-Wesley, Redwood City, 1993.
2. R. Camps. From ternary relationship to relational tables: A case against common beliefs. ACM SIGMOD Record, 31(2), pages 46–49, 2002.
3. J. Demetrovics and G. O. H. Katona. Combinatorial problems of database models. In Colloquia Mathematica Societatis Janos Bolyai 42, Algebra, Combinatorics and Logic in Computer Science, pages 331–352, Györ, Hungary, 1983.
4. J. Demetrovics, L. O. Libkin, and I. B. Muchnik. Functional dependencies and the semilattice of closed classes. In Proc. MFDBS’89, LNCS 364, pages 136–147, 1989.
5. J. Demetrovics, A. Molnar, and B. Thalheim. Graphical and spreadsheet reasoning for sets of functional dependencies. Technical Report 0402, Kiel University, Computer Science Institute, http://www.informatik.uni-kiel.de/reports/2004/0402.html, 2004.
6. T. A. Halpin. Conceptual schema and relational database design. Prentice-Hall, Sydney, 1995.
7. M. Klettke. Akquisition von Integritätsbedingungen in Datenbanken. DISBIS 51, infix-Verlag, St. Augustin, 1998.
8. O. I. Lindland, G. Sindre, and A. Solvberg. Understanding quality in conceptual modeling. IEEE Software, 11(2):42–49, 1994.
9. C. W. Morris. Foundations of the theory of signs. In International Encyclopedia of Unified Science. University of Chicago Press, 1955.
10. V. C. Storey, H. L. Yang, and R. C. Goldstein. Semantic integrity constraints in knowledge-based database design systems. Data & Knowledge Engineering, 20:1–37, 1996.
11. B. Thalheim. Open problems in relational database theory. Bull. EATCS, 32:336–337, 1987.
12. B. Thalheim. Entity-relationship modeling – Foundations of database technology. Springer, Berlin, 2000. See also http://www.informatik.tu-cottbus.de/~thalheim/HERM.htm.
13. B. Thalheim. Component construction of database schemes. In Proc. ER’02, LNCS 2503, pages 20–34, 2002.
14. C.-C. Yang. Relational Databases. Prentice-Hall, Englewood Cliffs, 1986.

ER-Based Software Sizing for Data-Intensive Systems

Hee Beng Kuan Tan and Yuan Zhao

School of Electrical and Electronic Engineering, Block S2, Nanyang Technological University, Nanyang Avenue, Singapore 639798

Abstract. Despite the existence of well-known software sizing methods such as the Function Point method, many developers still continue to use ad-hoc or so-called “expert” approaches. This is mainly because the existing methods require much implementation information that is difficult to identify or estimate in the early stage of a software project, while the accuracy of ad-hoc and “expert” methods is also problematic. The entity-relationship (ER) model is widely used in conceptual modeling (requirements analysis) for data-intensive systems. From our observation, the characteristics of a data-intensive system, and therefore of its source code, are well captured by the ER diagram that models its data. Based on this observation, this paper proposes a method for building software size models from extended ER diagrams through the use of regression models. We have collected real data from industry to do a preliminary validation of the proposed method, and the result of the validation is very encouraging. As software sizing is an important key to software cost estimation and therefore vital to the industry for managing software projects, we hope that the research and industry communities can further validate the proposed method.

1 Introduction
Estimating project size is a crucial task in any software project. Overestimates may lead to the abortion of projects or the loss of projects to competitors. Underestimates put pressure on project teams and may also adversely affect the quality of projects. Despite the existence of well-known software sizing methods such as the Function Point method [1], [10] and the more recent Full Function Point method [7], many practitioners and project managers continue to produce estimates based on ad-hoc or so-called “expert” approaches [2], [8], [15]. This is mainly because existing sizing methods require much implementation information that is not available in the earlier stages of a software project. However, the accuracy of ad-hoc and expert approaches is also problematic, resulting in questionable project budgets and schedules. The entity-relationship (ER) model originally proposed by Chen [5] is generally regarded as the most widely used tool for the conceptual modeling of data-intensive systems. An ER model is constructed to depict the ideal organization of data, independent of the physical organization of the data and of where and how the data are used. Indeed, many requirements of data-intensive systems are reflected in the ER models that depict their data conceptually. This paper proposes a novel method for


building a software size model to estimate the size of the source code of a data-intensive system based on its extended ER diagram. It also discusses the validation we conducted of the proposed method, for data-intensive systems written in the Visual Basic and Java languages. The paper is organized as follows. Section 2 gives the background information of the paper. Section 3 discusses our observation and its rationale. Section 4 presents the proposed method for building software size models to estimate the sizes of source code for data-intensive systems. Section 5 discusses our preliminary validation of the proposed method. Section 6 concludes the paper and compares the proposed method with related methods.

2 Background
The entity-relationship (ER) model was originally proposed by Chen [5] for data modeling and has subsequently been extended by Chen and others [17]. In this paper, we refer to the extended ER model that has the same set of concepts as the class diagram in terms of data modeling. In summary, the extended ER model uses the concepts of entity, attribute and relationship to model the conceptual data for a problem. Each entity has a set of attributes, each of which is a property or characteristic of the entity that is relevant to the problem. Relationships can be classified into three types: association, aggregation and generalization. There are four main stages in developing software systems: requirements capture, requirements analysis, design and implementation. The requirements are studied and specified in the requirements capture stage. They are realized conceptually in the requirements analysis stage. The design for implementing the requirements, with the target environment taken into consideration, is constructed in the design stage. In the implementation stage, the design is coded using the target programming language and the resulting code is tested to ensure its correctness. Though UML (Unified Modeling Language) has gained popularity as a standard software modeling language, many data-intensive systems are still developed in industry through some form of data-oriented approach. In such an approach, some form of extended entity-relationship (ER) model is constructed to model the data conceptually in the requirements capture and analysis stages, and the subsequent design and implementation activities are very much based on the extended ER model. For projects that use UML, a class diagram is usually constructed in the requirements analysis stage. Indeed, for a data-intensive system, the class diagram constructed can be viewed as an extended ER model extended with behavioral properties (processing). Therefore, in the early stage of software development, some form of extended ER model is more readily available than information such as external inputs, outputs and inquiries, and internal logical files and external interface files, which are required for the computation of function points.

3 Our Observation
Data-intensive systems constitute one of the largest domains in software. These systems usually maintain a large amount of structured data in a database built using a


database management system (DBMS), and they provide operational, control and management support to end-users through referencing and analyzing these data. The support is usually accomplished through accepting inputs from users, processing inputs, updating databases, printing reports, and providing inquiries to help users in management and decision-making processes. The proposed method for building software size models for data-intensive systems is based on our observation of these systems. Next, we shall discuss the observation and its rationale.

The Observation: Under the same development environment (that is, a particular programming language and tool used), the size of the source code for a data-intensive system usually depends on the extended ER diagram that models its data.

Rationale: The constituents of a data-intensive system can be classified as follows:
1) Support business operations by accepting inputs to maintain entities modeled in the ER diagram.
2) Support decision-making processes by producing outputs from information possessed by entities modeled in the ER diagram.
3) Implement business logic to support business operation and control.
4) Reference entities modeled in the ER diagram to support the first three constituents.

Since the first two and the last constituents are based on the ER diagram, they depend on the ER diagram. At first glance, it seems that the third constituent may not depend on the ER diagram. However, since a data-intensive system usually does not perform complex computation within its source code (any complex computation is usually achieved by calling pre-developed functions), the business logic in the source code mainly consists of navigation between entities via relationship types, with simple computation. For example, for the business rule that if a customer has two overdue invoices then no further orders will be processed, the source code implementing the rule retrieves the overdue invoices in the Invoice entity type for the customer in the Customer entity type via the relationship type that associates a customer with its invoices. There is no complex computation involved. Therefore, it is reasonable to assume that the implementation of business logic in a data-intensive system also usually depends on the ER diagram. This completes the rationale of the observation.

4 The Proposed Software Sizing
From the observation discussed in the previous section, the size of the source code for a data-intensive system usually depends, and only depends, on the structure and size of an extended ER diagram that models its data. Furthermore, ER diagrams are widely and well used in the requirements modeling and analysis stages. Thus, it is suitable to base the estimation of the size of the source code for a data-intensive system on its extended ER diagram. Therefore, we propose a novel method for building software size models based on extended ER diagrams. This section discusses the method.


The proposed method builds software size models through well-known linear regression models. For a data-intensive system, the variables that sufficiently characterize the extended ER diagram for the system form the independent variables. The dependent variable is the size of its source code in thousands of lines of code (KLOC). Note that in this case the extended ER diagram is implemented by, and only by, the system; that is, the extended ER diagram and the system must coincide and have a one-to-one correspondence. As such, any source code that references or updates the database designed from the extended ER diagram must be included as part of the source code. In the proposed approach, a separate software size model should be built for each different development environment (that is, each programming language and tool used). For example, different software size models should be built for systems written in Visual Basic with VB Script and SQL, and for systems written in Java with JSP, Java Script and SQL. In the most precise case, the independent variables that characterize the extended ER diagram comprise the following:
1) Total number of entity types.
2) Total number of attributes.
3) Total numbers of association types, classified by their degrees and multiplicities: usually, the degrees can be classified exactly for degrees below an upper limit, and the remaining ones can all be lumped into one class. Multiplicities can be classified into zero-or-one, one and many. More precise classifications can also be tried.
4) Total numbers of aggregation types, classified by their degrees and multiplicities, as for the association types.
5) Total numbers of generalization types, classified by the number of sub-classes: usually, the number of sub-classes can be classified exactly for those below an upper limit, and the remaining ones can all be lumped into one class.

However, we do not propose to build a software size model based on a fixed set of independent variables; it all depends on the kind of ER diagrams used in the organizations for which we develop the software size model. Note that the above-mentioned association refers to association that is not aggregation. The separation of relationship types into associations, aggregations and generalizations is due to the differences in their semantics; these differences may result in differences in navigation and updating needs in the database. We propose that the independent variables be defined according to the type of ER diagram constructed during the requirements modeling and analysis stages, so that at least the data required for software sizing is readily available in the early stage of requirements analysis. From our experience in building the proposed software size models using data collected from industry, hardly any relationship type is ternary or of higher order, and most of the ER diagrams do not classify their relationship types into association, aggregation and generalization. The precision of the independent variables depends on the type of extended ER diagram constructed in the requirements modeling and analysis stages in the organization. However, a larger set of independent variables will require a larger set of data for building and evaluating the model.


The steps for building the proposed software size models are as follows:
1) Independent variables identification: Based on the type of data model (a class or ER diagram) constructed during requirements modeling and analysis, identify a set of independent variables that sufficiently characterize the diagram.
2) Data collection: Collect the ER diagrams and sizes of source code (in KLOC) of a sufficient number of data-intensive systems. A larger set of independent variables will require a larger set of data. Many free tools are available for the automated extraction of source code size.
3) Model building and evaluation: There are quite a number of commonly used regression models [16]; both linear and non-linear models can be considered. The size of the source code (in KLOC) and the independent variables identified in the first step form the dependent and the independent variables, respectively, of the model. Statistical packages (e.g., SAS) should be used for the model building. Ideally, we should have separate data sets for model building and evaluation; however, if the data is limited, the same data set may be used for both.

Let $n$ be the number of data points and $k$ the number of independent variables. Let $y_i$ and $\hat{y}_i$ be the actual and the estimated values, respectively, of project $i$, and let $\bar{y}$ be the mean of all $y_i$. The goodness of the model can be evaluated by examining the following parameters.

Magnitude of relative error, MRE, and mean magnitude of relative error, MMRE: They are defined as follows:

$$\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad \mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MRE}_i.$$

If the MMRE is small, then we have a good set of predictions; a usual criterion for accepting a model as good is $\mathrm{MMRE} \le 0.25$.

Prediction at level $l$, Pred($l$), where $l$ is a percentage: It is defined as the number of cases in which the estimates are within $l$ of the actual values, divided by the total number of cases. A standard criterion for considering a model acceptable is $\mathrm{Pred}(0.25) \ge 0.75$.

Multiple coefficient of determination, $R^2$, and adjusted multiple coefficient of determination, $R_a^2$: These are the usual measures in regression analysis, denoting the percentage of variance accounted for by the independent variables used in the regression equations. They are computed as follows:

$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SS}_{yy}} \qquad \text{and} \qquad R_a^2 = 1 - \frac{n-1}{n-(k+1)}\,\frac{\mathrm{SSE}}{\mathrm{SS}_{yy}},$$

where the sum of squared errors is $\mathrm{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $\mathrm{SS}_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$. In general, the larger the values of $R^2$ and $R_a^2$, the better the fit of the data; $R^2 = 1$ implies a perfect fit, with the model passing through every data point. However, these measures can only be used to assess the usefulness of the model if the number of data points is substantially larger than the number of independent variables.

If the same data set is used for both model building and evaluation, we can further examine the following parameters to evaluate the model goodness.

Relative root mean squared error, RRMS, defined as follows [6]:

$$\mathrm{RRMS} = \frac{\mathrm{RMS}}{\bar{y}}, \qquad \text{where} \quad \mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}.$$

A model is considered acceptable if $\mathrm{RRMS} \le 0.25$.

Prediction sum of squares, PRESS [16]: PRESS is a measure of how well the fitted values of a subset model can predict the observed responses $y_i$. The error sum of squares, SSE, is also such a measure. PRESS differs from SSE in that each fitted value for PRESS is obtained by deleting the $i$th case from the data set, estimating the regression function of the subset model from the remaining $n-1$ cases, and then using the fitted regression function to obtain the predicted value $\hat{y}_{i(i)}$ for the $i$th case. That is,

$$\mathrm{PRESS} = \sum_{i=1}^{n}\bigl(y_i - \hat{y}_{i(i)}\bigr)^2.$$

Models with smaller PRESS values are considered good candidate models. The PRESS value is always larger than SSE because the regression fit used to predict the $i$th case does not include that case. A smaller PRESS value supports the validity of the model built.
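To make these evaluation measures concrete, the following Python sketch computes MMRE, Pred(0.25) and RRMS for a list of actual and estimated sizes. It is an illustrative helper only; the function and variable names are our own, and the KLOC figures are invented placeholders rather than data from the paper.

```python
import math

def mmre(actual, estimated):
    """Mean magnitude of relative error."""
    return sum(abs(y - yhat) / y for y, yhat in zip(actual, estimated)) / len(actual)

def pred(actual, estimated, level=0.25):
    """Fraction of cases whose relative error is within `level` of the actual value."""
    hits = sum(1 for y, yhat in zip(actual, estimated) if abs(y - yhat) / y <= level)
    return hits / len(actual)

def rrms(actual, estimated):
    """Relative root mean squared error: RMS of the residuals divided by the mean actual value."""
    n = len(actual)
    rms = math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, estimated)) / n)
    return rms / (sum(actual) / n)

# Hypothetical KLOC figures, for illustration only.
actual = [12.0, 30.5, 8.2, 45.0]
estimated = [11.0, 33.0, 9.0, 41.0]
print(mmre(actual, estimated), pred(actual, estimated), rrms(actual, estimated))
```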

5 Preliminary Validation
As the ER diagrams constructed in most projects in the industry do not classify relationship types into associations, aggregations and generalizations, a complete validation of the proposed method is not possible. We spent much effort persuading organizations in the industry to supply us with their project data for the validation of the proposed software sizing method; as such, the whole validation took about one and a half years. This section discusses our validation.


Due to the above-mentioned constraint, the independent variables characterizing an ER diagram in our validation are simplified as follows:
1) Number of entity types (E)
2) Number of attributes (A)
3) Number of relationship types (R)
These variables provide a reasonable and concise characterization of the ER diagram. Our validation is based on linear regression models [14]; in particular, the first-order model has the form

$$\mathrm{Size} = \beta_0 + \beta_1 E + \beta_2 A + \beta_3 R + \varepsilon,$$

where Size is the total KLOC (thousands of lines of code) of all the source code developed based on the ER diagram and each $\beta_i$ is a coefficient to be determined.
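As an illustration of how such a model can be fitted, the sketch below estimates the coefficients by ordinary least squares with NumPy. The project counts and KLOC values are invented placeholders, not the datasets reported in Tables 1–3.

```python
import numpy as np

# Hypothetical training data: one row per project.
# Columns: E (entity types), A (attributes), R (relationship types); target: KLOC.
E = np.array([10, 25, 8, 40, 15], dtype=float)
A = np.array([80, 200, 60, 350, 120], dtype=float)
R = np.array([12, 30, 9, 55, 18], dtype=float)
kloc = np.array([14.0, 38.0, 10.5, 70.0, 22.0])

# Design matrix with an intercept column: Size = b0 + b1*E + b2*A + b3*R
X = np.column_stack([np.ones_like(E), E, A, R])
coeffs, *_ = np.linalg.lstsq(X, kloc, rcond=None)
print("Fitted coefficients (b0, b1, b2, b3):", coeffs)

# Estimate the size of a new project from its ER diagram counts.
new_project = np.array([1.0, 20, 150, 25])
print("Estimated size (KLOC):", new_project @ coeffs)
```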

5.1 The Dataset
We collected three datasets from multiple organizations in the industry, including software houses and end-users such as public organizations and insurance companies. The projects cover a wide range of application domains, including freight management, administrative and financial systems. The first dataset comprises 14 projects developed using Visual Basic with VB Scripts and SQL. The second dataset comprises 10 projects developed using Java with JSP, Java Script and SQL. Tables 1 and 2 show the details of these two datasets, which are used for building the software size models for the respective development environments. The third dataset comprises 8 projects developed using the same Visual Basic development environment as the first dataset; Table 3 shows its details.

5.2 The Resulting Models
From the Visual Basic based project dataset (Table 1), we built a first-order model for estimating the size of source code (in KLOC) developed using Visual Basic with VB Script and SQL. The adjusted multiple coefficient of determination $R_a^2$ for this model is 0.84, which is reckoned as good. From the Java based project dataset (Table 2), we built a first-order model for estimating the size of source code (in KLOC) developed using Java with JSP, Java Script and SQL. The adjusted multiple coefficient of determination $R_a^2$ for this model is 0.99, which is reckoned as very good.


5.3 Model Evaluation
For the first-order model built for estimating the size of source code (in KLOC) developed using Visual Basic with VB Script and SQL, we managed to collect a separate dataset for the evaluation of the model. Note that $R_a^2$ for this model had already been computed during model building and is 0.84, which is reckoned as good. MMRE and Pred(0.25) computed from the evaluation dataset are 0.16 and 0.88, respectively; these values fall well within the acceptable levels. The detailed results of the evaluation are shown together with the evaluation dataset in Table 3. Therefore, the evaluation results support the validity of the model built. For the first-order model built for estimating the size of source code (in KLOC) developed using Java with JSP, Java Script and SQL, we did not manage to collect a separate dataset for the evaluation of the model. As such, we used the same


dataset for the evaluation. Note that $R_a^2$ for this model had already been computed during model building and is 0.99, which is reckoned as very good. MMRE, Pred(0.25), SSE and PRESS computed from the same dataset are 0.07, 1.00, 10.04 and 556.84, respectively. The detailed results of the evaluation are shown in Table 4. Both MMRE and Pred(0.25) fall well within the acceptable levels. Although there is a difference between SSE and PRESS, the difference is not too substantial either. Note that RRMS computed from SSE in this case is 0.02; if we replace SSE by PRESS in the computation of RRMS, the resulting value is 0.18. Both of these values fall well below the acceptable level of 0.25. Therefore, the evaluation results support the validity of the model built.
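The PRESS statistic reported above can be computed by refitting the model once per project with that project left out. The following Python sketch shows the leave-one-out procedure on placeholder data; it is not the computation performed on the paper's datasets, and the values are invented for illustration.

```python
import numpy as np

def press(X, y):
    """Prediction sum of squares: refit without case i, then predict case i."""
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i              # all cases except the i-th
        coeffs, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        y_pred_i = X[i] @ coeffs              # predict the held-out case
        total += (y[i] - y_pred_i) ** 2
    return total

# Placeholder design matrix (intercept, E, A, R) and sizes in KLOC.
X = np.array([[1, 10, 80, 12], [1, 25, 200, 30], [1, 8, 60, 9],
              [1, 40, 350, 55], [1, 15, 120, 18], [1, 22, 170, 26]], dtype=float)
y = np.array([14.0, 38.0, 10.5, 70.0, 22.0, 30.0])
print("PRESS =", press(X, y))
```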

Though we managed to build only simplified software size models with the proposed approach, due to limitations in industry practice, the evaluation results already support the validity of the models built. As such, our empirical validation supports the validity of the proposed method for building software size models.


6 Comparative Discussion
We have proposed a novel method for building software size models for data-intensive systems. Due to the lack of complete data from completed projects in the industry for validating the proposed method, we only managed to perform a validation based on building and evaluating simplified versions of the proposed software size models. The statistical evaluation supports the validity of the proposed method. Due to the above-mentioned simplification and the limited size of our dataset, we do not claim that the models built in this paper are ready for use. However, we believe that our work shows enough promise to study the proposed method for software sizing further. Software size estimation is an important key to project estimation, which in turn is vital for project control and management [3], [4], [11]. Existing software size estimation methods have many problems. As the software estimation community requires totally new datasets for building and evaluating software size models with the proposed method, we call for collaboration between the industry and the research communities to validate the proposed method further and more comprehensively. The history of establishing the Function Point method suggests that, without such effort, building a usable software size model is not likely to succeed. As discussed in [15], most of the existing software sizing methods [9], [12], [13], [18] require much implementation information that is not available, and is difficult to predict, in the early stage of a software project. The information is not even available after the requirements analysis stage; it only becomes available in the design or implementation stage. For example, the Function Point method is based on external inputs, outputs and inquiries, and internal logical files and external interface files. Such implementation details are not even available at the end of the requirements analysis stage. The ER diagram is well established in the conceptual modeling of data-intensive systems, and some proposals for software projects have also included ER diagrams as part of the project requirements. As such, ER diagrams are at least practically available after the requirements analysis stage. Once the ER diagram is constructed, the proposed software size model can be applied without much difficulty. Therefore, in the worst case, we can apply the proposed approach after the requirements analysis stage. Ideally, a brief extended ER model should be constructed during the project proposal or planning stage, and the proposed software size model applied to estimate the software size as an input for project effort estimation. Subsequently, when a more accurate extended ER model is available, the model can be reapplied for more accurate project estimation. A final revision of the project estimation should be carried out at the end of the requirements analysis stage, when an accurate extended ER diagram should be available. The well-known Function Point method is also mainly for data-intensive systems; as such, the domain of application of the proposed sizing method is similar to that of the Function Point method.


Acknowledgement
We would like to thank IPACS E-Solution (S) Pte Ltd, Singapore Computer Systems Pte Ltd, NatSteel Ltd, Great Eastern Life Assurance Co. Limited, JTC Corporation and National Computer Systems Pte Ltd for providing the project data. Without their support, this work would not have been possible.

References
1. A. J. Albrecht and J. E. Gaffney, Jr., “Software function, source lines of code, and development effort prediction: a software science validation,” IEEE Trans. Software Eng., vol. SE-9, no. 6, Nov. 1983, pp. 639–648.
2. P. Armour, “Ten unmyths of project estimation: reconsidering some commonly accepted project management practices,” Comm. ACM 45(11), Nov. 2002, pp. 15–18.
3. B. W. Boehm and R. E. Fairley, “Software estimation perspectives,” IEEE Software, Nov./Dec. 2000, pp. 22–26.
4. B. W. Boehm et al., Software Cost Estimation with COCOMO II, Prentice Hall, 2000.
5. P. P. Chen, “The entity-relationship model – towards a unified view of data,” ACM Trans. Database Syst. 1(1), Mar. 1976, pp. 9–36.
6. S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, 1986.
7. COSMIC-Full Functions – Release 2.0, September 1999.
8. J. J. Dolado, “A validation of the component-based method for software size estimation,” IEEE Trans. Software Eng., vol. SE-26, no. 10, Oct. 2000, pp. 1006–1021.
9. D. V. Ferens, “Software size estimation techniques,” in Proc. of the IEEE NAECON, 1988, pp. 701–705.
10. D. Garmus and D. Herron, Function Point Analysis: Measurement Practices for Successful Software Projects, Addison Wesley, 2000.
11. C. F. Kemerer, “An empirical validation of software project cost estimation models,” Comm. ACM 30(5), May 1987, pp. 416–429.
12. R. Lai and S. J. Huang, “A model for estimating the size of a formal communication protocol application and its implementation,” IEEE Trans. Software Eng., vol. 29, no. 1, Jan. 2003, pp. 46–62.
13. L. A. Laranjeira, “Software size estimation of object-oriented systems,” IEEE Trans. Software Eng., vol. 16, no. 5, May 1990, pp. 510–522.
14. J. T. McClave and T. Sincich, Statistics, Prentice Hall, 2003.
15. E. Miranda, “An evaluation of the paired comparisons method for software sizing,” in Proc. Int. Conf. on Software Eng., 2000, pp. 597–604.
16. J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied Linear Regression Models, Irwin, 1996.
17. T. J. Teorey, D. Yang, and J. P. Fry, “A logical design methodology for relational databases using the extended entity-relationship model,” ACM Computing Surveys 18(2), June 1986, pp. 197–222.
18. J. Verner and G. Tate, “A software size model,” IEEE Trans. Software Eng., vol. SE-18, no. 4, Apr. 1992, pp. 265–278.

Data Mapping Diagrams for Data Warehouse Design with UML

Sergio Luján-Mora 1, Panos Vassiliadis 2, and Juan Trujillo 1

1 Dept. of Software and Computing Systems, University of Alicante, Spain
{slujan,jtrujillo}@dlsi.ua.es
2 Dept. of Computer Science, University of Ioannina, Hellas

Abstract. In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity, including transformation rules at the attribute level. Specifically, we present a disciplined framework for the modeling of the relationships between sources and targets at different levels of granularity (ranging from coarse mappings at the database and table levels to detailed inter-attribute mappings at the attribute level). In order to accomplish this goal, we extend UML (Unified Modeling Language) to model attributes as first-class citizens. In our attempt to provide complementary views of the design artifacts at different levels of detail, our framework is based on a principled approach to the usage of UML packages, which allows zooming in and out of the design of a scenario. Keywords: data mapping, ETL, data warehouse, UML

1 Introduction
In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. DWs are usually populated with data from different and heterogeneous operational data sources such as legacy systems, relational databases, COBOL files, the Internet (XML, web logs) and so on. It is well recognized that the design and maintenance of these ETL processes (also called the DW back stage) is a key factor of success in DW projects for several reasons, the most prominent of which is their critical mass; in fact, ETL development can take up as much as 80% of the development time in a DW project [1,2]. Despite the importance of designing the mapping of the data sources to the DW structures, along with any necessary constraints and transformations,


unfortunately, there are few models that can be used by designers to this end. The front end of the DW has monopolized the research on the conceptual part of DW modeling, while few attempts have been made towards the conceptual modeling of the back stage [3,4]. Still, to this day, there is no model that can combine (a) the desired detail of modeling data integration at the attribute level and (b) a widely accepted modeling formalism such as the ER model or UML. One particular reason for this is that both these formalisms are simply not designed for this task; on the contrary, they treat attributes as second-class, weak entities with a descriptive role. Of particular importance is the problem that in both models attributes cannot serve as an end in an association or any other relationship. One might argue that the current way of modeling is sufficient and that there is no real need to extend it in order to capture mappings and transformations at the attribute level. There are certain reasons that we can list against this argument:

– The design artifacts act as blueprints for the subsequent stages of the DW project. If the important details of this design (e.g., attribute interrelationships) are not documented, the blueprint is problematic. Actually, one of the current issues in DW research involves the efficient documentation of the overall process.
– Since design artifacts are means of communicating ideas, it is best if the formalism adopted is a widely used one (e.g., UML or ER).
– The design should reflect the architecture of the system in a way that is formal and consistent and that allows the what-if analysis of subsequent changes.

Capturing attributes and their interrelations as first-class modeling elements (FCME, also known as first-class citizens) improves the design significantly with respect to all these goals. At the same time, the way this issue is handled now would involve a naive, informal documentation through UML notes. In previous lines of research [5], we have shown that by modeling attribute interrelationships, we can treat the design artifact as a graph and actually measure the aforementioned design goals. Again, this would be impossible with the current modeling formalisms. To address all the aforementioned issues, in this paper we present an approach that enables the tracing of the DW back-stage (ETL processes) particularities at various levels of detail, through a widely adopted formalism (UML). This is enabled by an additional view of a DW, called the data mapping diagram. In this new diagram, we treat attributes as FCME of the model. This gives us the flexibility of defining models at various levels of detail. Naturally, since UML is not initially prepared to support this behavior, we solve this problem thanks to the extension mechanisms that it provides. Specifically, we employ a formal, strict mechanism that maps attributes to proxy classes that represent them. Once mapped to classes, attributes can participate in associations that determine the inter-attribute mappings, along with any necessary transformations and constraints. We adopt UML as our modeling language due to its wide acceptance and the possibility of using various complementary diagrams for modeling different system aspects. Actually, from our point of view, one of the main advantages of the approach presented in this paper is that it is totally integrated


in a global approach that allows us to accomplish the conceptual, logical and the corresponding physical design of all DW components by using the same notation ([6–8]). The rest of the paper is structured as follows. In Section 2, we briefly describe the general framework for our DW design approach and introduce a motivating example that will be followed throughout the paper. In Section 3, we show how attributes can be represented as FCME in UML. In Section 4, we present our approach to model data mappings in ETL processes at the attribute level. In Section 5, we review related work and finally, in Section 6 we present the main conclusions and future work.

2 Framework and Motivation

In this section we discuss our general assumptions around the DW environment to be modelled and briefly give the main terminology. Moreover, we define a motivating example that we will consistently follow through the rest of the paper. The architecture of a DW is usually depicted as various layers of data in which data from one layer is derived from data of the previous layer [9]. Following this consideration, we consider that the development of a DW can be structured into an integrated framework with five stages and three levels that define different diagrams for the DW model, as explained in Table 1.

In previous works, we have presented some of the diagrams (and the corresponding profiles), such as the Multidimensional Profile [6,7] and the ETL Profile [4]. In this paper, we introduce the Data Mapping Profile. To motivate our discussion we will introduce a running example where the designer wants to build a DW from the retail system of a company. Naturally, we


consider only a small part of the DW, where the target fact table has to contain only the quarterly sales of the products belonging to the computer category, whereas the rest of the products are discarded. In Fig. 1, we zoom in on the definition of the SCS (Source Conceptual Schema), which represents the sources that feed the DW with data. In this example, the data source is composed of four entities represented as UML classes: Cities, Customers, Orders, and Products. The meaning of the classes and their attributes, as depicted in Fig. 1, is straightforward. The “...” shown in this figure simply indicates that other attributes of these classes exist but are not displayed for the sake of simplicity (this use of “...” is not a UML notation).

Fig. 1. Source Conceptual Schema (SCS)

Fig. 2. Data Warehouse Conceptual Schema (DWCS)

Finally, the DWCS (Data Warehouse Conceptual Schema) of our motivating example is shown in Fig. 2. The DW is composed of one fact (ComputerSales) and two dimensions (Products and Time). In this paper, we present an additional view of a DW, called the Data Mapping diagram, which shows the relationships between the data sources and the DW and between the DW and the clients’ structures. In this new diagram, we need to treat attributes as FCME of the models, since we need to depict their relationships at the attribute level. Therefore, we also propose a UML extension to accomplish this goal in this paper. To the best of our knowledge, this is the first proposal for representing attributes as FCME in UML diagrams.

3 Attributes as First-Class Modeling Elements in UML

Both in the Entity-Relationship (ER) model and in UML, attributes are embedded in the definition of their comprising “element” (an entity in the ER or a class in UML), and it is not possible to create a relationship between two attributes. As we have already explained in the introduction, in some situations (e.g., data integration, constraints over attributes, etc.) it is desirable to represent attributes as FCME. Therefore, in this section we will present an extension of UML to accommodate attributes as FCME. We have chosen UML instead of


ER on the grounds of its higher flexibility in terms of employing complementary diagrams for the design of a certain system. Throughout this paper, we frequently use the term first-class modeling elements, or first-class citizens, for elements of our modeling languages. Conceptually, FCME refer to fundamental modeling concepts, on the basis of which our models are built. Technically, FCME have an identity of their own and are possibly governed by integrity constraints (e.g., relationships must have at least two ends referring to classes). In a UML class diagram, two kinds of modeling elements are treated as FCME. Classes, as abstract representations of real-world entities, are naturally found at the center of the modeling effort. Being FCME, classes are stand-alone entities that also act as attribute containers. The relationships between classes are captured by associations. Associations can also be FCME, called association classes. Even though an association class is drawn as an association and a class, it is really just a single model element [10]. An association class can contain attributes or can be connected to other classes. However, the same is not possible with attributes. Naturally, in order to allow attributes to play the same role in certain cases, we propose the representation of attributes as FCME in UML. In our approach, classes and attributes are defined as usual in UML. However, in those cases where it is necessary to treat attributes as FCME, classes are imported into the attribute/class diagram, where attributes are automatically represented as classes; in this way, the user only has to define the classes and the attributes once. In the importing process from the class diagram to the attribute/class diagram, we refer to the class that contains the attributes as the container class and to the class that represents an attribute as the attribute class. In Table 2, we formally define attribute/class diagrams, along with the corresponding new stereotypes for container and attribute classes.

4 The Data Mapping Diagram

Once we have introduced the extension mechanism that enables UML to treat attributes as FCME, we can proceed to define a framework for its usage. In


this section, we introduce the data mapping diagram, a new kind of diagram particularly customized for tracing the data flow, at various degrees of detail, in a DW environment. Data mapping diagrams are complementary to the typical class and interaction diagrams of UML and focus on the particularities of the data flow and the interconnections of the involved data stores. A special characteristic of data mapping diagrams is that a certain DW scenario is practically described by a set of complementary data mapping diagrams, each defined at a different level of detail. In this section, we introduce a principled approach to deal with such complementary data mapping diagrams. To capture the interconnections between design elements, in terms of data, we employ the notion of mapping. Broadly speaking, when two design elements (e.g., two tables, or two attributes) share the same piece of information, possibly through some kind of filtering or transformation, this constitutes a semantic relationship between them. In the DW context, this relationship involves three logical parties: (a) the provider entity (schema, table, or attribute), responsible for generating the data to be further propagated, (b) the consumer, which receives the data from the provider, and (c) their intermediate matching, which captures the way the mapping is done, along with any transformation and filtering. Since a data mapping diagram can be very complex, our approach offers the possibility of organizing it in different levels thanks to the use of UML packages. Our layered proposal consists of four levels (see Fig. 3), as explained in Table 3.
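To make the provider/consumer/matching triad concrete outside UML, the following Python sketch represents one mapping as a small record; the class and field names, the filter and the example values are illustrative assumptions, not notation taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Mapping:
    """One data mapping: a provider element, a consumer element, and the
    matching between them (a transformation plus an optional filter)."""
    provider: str                      # e.g. "Orders.prod_list"
    consumer: str                      # e.g. "DividedOrders.prod_id"
    transform: Callable[[Any], Any]    # how provider values become consumer values
    condition: Callable[[Any], bool] = lambda value: True  # optional filter

    def apply(self, value):
        """Propagate one provider value to the consumer, if the filter admits it."""
        return self.transform(value) if self.condition(value) else None

# Example: keep only products whose id looks like a computer product (assumed prefix).
keep_computers = Mapping(
    provider="FilteredOrders.prod_id",
    consumer="ComputerSales.product",
    transform=lambda prod_id: prod_id,
    condition=lambda prod_id: str(prod_id).startswith("PC"),
)
print(keep_computers.apply("PC-042"))  # "PC-042"
print(keep_computers.apply("TV-007"))  # None
```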

At the leftmost part of Fig. 3, a simple relationship between the DWCS and the SCS exists: this is captured by a single Data Mapping package, and these three design elements constitute the data mapping diagram of the database level (or Level 0). Assuming that there are three particular tables in the DW that we


would like to populate, this particular Data Mapping package abstracts the fact that there are three main scenarios for the population of the DW, one for each of these tables. At the dataflow level (or Level 1) of our framework, the data relationships among the sources and the targets in the context of each of these scenarios are modeled by the respective packages. If we zoom in on one of these scenarios, e.g., Mapping 1, we can observe its particularities in terms of data transformation and cleaning: the data of Source 1 are transformed in two steps (i.e., they undergo two different transformations), as shown in Fig. 3. Observe also that an Intermediate data store is employed to hold the output of the first transformation (Step 1) before it is passed on to the second one (Step 2). Finally, at the lower right part of Fig. 3, the way the attributes are mapped to each other for the data stores Source 1 and Intermediate is depicted. Let us point out that when we are modeling a complex and huge DW, the attribute transformations modeled at Level 3 are hidden within a package definition, thereby avoiding cluttered diagrams.

Fig. 3. Data mapping levels

The constructs that we employ for the data mapping diagrams at the different levels are as follows:

– The database and dataflow diagrams (Levels 0 and 1) use traditional UML structures. Specifically, in these diagrams we employ (a) packages for the modeling of data relationships and (b) simple dependencies among the involved entities. The dependencies state that the mapping packages are dependent upon changes of the employed data stores.
– The table level (Level 2) diagram extends UML with three stereotypes: (a) a stereotyped package that encapsulates the data interrelationships among data stores, and (b) two stereotyped dependencies that indicate the roles of data provider and data consumer for such a package.
– The diagram at the attribute level (Level 3) also uses several newly introduced stereotypes for the definition of data mappings.

198

Sergio Luján-Mora, Panos Vassiliadis, and Juan Trujillo

We will detail the stereotypes of the table level in the next section and defer the discussion for the stereotypes of the attribute level to subsection 4.2.

4.1 The Data Mapping Diagram at the Table Level

During the integration process from the data sources into the DW, source data may undergo a series of transformations, which may vary from simple algebraic operations or aggregations to complex procedures. In our approach, the designer can segment a long and complex transformation process into simple and small parts, represented by means of UML packages that are materializations of a mapping stereotype and contain an attribute/class diagram. Moreover, the packages are linked by stereotyped dependencies that represent the flow of data. During this process, the designer can create intermediate classes, represented by a dedicated stereotype, in order to simplify or clarify the models. These classes represent intermediate storage that may or may not actually exist, but they help to understand the mappings. In Fig. 4, a schematic representation of a data mapping diagram at the table level is shown. This level specifies the data sources and the targets to which these data are directed. At this level, the classes are represented as usual in UML, with the attributes depicted inside the container class. Since all the classes are imported from other packages, the legend (from ...) appears below the name of each class. The mapping diagram is shown as a package decorated with the mapping stereotype; it hides the complexity of the mapping, because a vast number of attributes can be involved in a data mapping. This package presents two kinds of stereotyped dependencies: to the data providers (i.e., the data sources) and to the data consumers (i.e., the tables of the DW).

4.2 The Data Mapping Diagram at the Attribute Level

As already mentioned, at the attribute level the diagram includes the relationships between the attributes of the classes involved in a data mapping. At this level, we offer two design variants:

– Compact variant: the relationship between the attributes is represented as an association, and the semantics of the mapping is described in a UML note attached to the target attribute of the mapping.
– Formal variant: the relationship between the attributes is represented by means of a mapping object, and the semantics of the mapping is described in a tag definition of the mapping object.

With the first variant, the data mapping diagrams are less cluttered, with fewer modeling elements, but the data mapping semantics are expressed as UML notes, which are simple comments that have no semantic impact. On the other hand, the size of the data mapping diagrams obtained with the second variant is larger, with more modeling elements and relationships, but the semantics are better defined, as tag definitions. Due to the lack of space, in this paper we


will only focus on the compact variant. In this variant, the relationship between the attributes is represented as an association decorated with a mapping stereotype, and the semantics of the mapping is described in a UML note attached to the target attribute of the mapping.

Fig. 4. Level 2 of a data mapping diagram

The content of the package Mapping diagram from Fig. 4 is defined in the following way (recall that Mapping diagram is a package that contains an attribute/class diagram):

– The classes DS1, DS2, ..., and Dim1 are imported into Mapping diagram. The attributes of these classes are suppressed because they are shown as classes in this package.
– The classes are connected by means of association relationships, and we use the navigability property to specify the flow of data from the data sources to the DW.
– The association relationships are adorned with the mapping stereotype to highlight the meaning of the relationship.
– A UML note can be attached to each of the target attributes to specify how the target attribute is obtained from the source attributes. The language for the expression is a choice of the designer (e.g., a LAV or a GAV approach [11] can equally be followed).

4.3 Motivating Example Revisited

From the DW example shown in Figs. 1 and 2, we define the corresponding data mapping diagram shown in Fig. 5. The goal of this data mapping is to calculate the quarterly sales of the products belonging to the computer category. The result of this transformation is stored in ComputerSales from the DWCS. The transformation process has been segmented into three parts, Dividing, Filtering, and Aggregating; moreover, two intermediate classes, DividedOrders and FilteredOrders, have been defined.


Fig. 5. Level 2 of a data mapping diagram

Following the data mapping example shown in Fig. 5, the attribute prod_list of the Orders table contains the list of ordered products, with a product ID and a (parenthesized) quantity for each. Therefore, Dividing splits each input order according to its prod_list into multiple orders, each with a single ordered product (prod_id) and quantity (quantity), as shown in Fig. 6. Note that in a data mapping diagram the designer does not specify the processes, but only the data relationships. We use the one-to-many cardinality in the association relationships between Orders.prod_list and DividedOrders.prod_id and DividedOrders.quantity to indicate that one input order produces multiple output orders. We do not attach any note in this diagram because the data are not transformed, so the mapping is direct.

Fig. 6. Dividing Mapping

Filtering (Fig. 7) filters out products not belonging to the computer category. We indicate this action with a UML note attached to the prod_id mapping, because this attribute is assumed to be used in the filtering process. Finally, Aggregating (Fig. 8) computes the quarterly sales for each product. We use the many-to-one cardinality to indicate that many input items are needed to calculate a single output item. Moreover, a UML note indicates how the ComputerSales.sales attribute is calculated from FilteredOrders.quantity


Fig. 7. Filtering Mapping

Fig. 8. Aggregating Mapping

and Products.price. The cardinality of the association relationship between Products.price and ComputerSales.sales is one-to-many because the same price is used in different quarters, but to calculate the total sales of a particular product in a quarter we only need one price (we assume that the price of a product never changes over time).
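For readers who prefer to see the three mappings operationally, the following Python sketch mimics Dividing, Filtering and Aggregating on a few invented order records. The attribute names follow Figs. 5–8, but the record layout, the product-category test and the data values are assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Invented sample data; prod_list holds "product(quantity)" pairs as in the example.
orders = [
    {"order_id": 1, "quarter": "2004-Q1", "prod_list": "PC1(2) TV9(1)"},
    {"order_id": 2, "quarter": "2004-Q1", "prod_list": "PC1(1) PC7(3)"},
]
products = {"PC1": {"category": "computer", "price": 900.0},
            "PC7": {"category": "computer", "price": 1500.0},
            "TV9": {"category": "tv", "price": 400.0}}

# Dividing: split each order's prod_list into one row per (prod_id, quantity).
divided_orders = [
    {"quarter": o["quarter"], "prod_id": prod, "quantity": int(qty)}
    for o in orders
    for prod, qty in re.findall(r"(\w+)\((\d+)\)", o["prod_list"])
]

# Filtering: keep only products of the computer category.
filtered_orders = [row for row in divided_orders
                   if products[row["prod_id"]]["category"] == "computer"]

# Aggregating: quarterly sales per product = sum(quantity * price).
computer_sales = defaultdict(float)
for row in filtered_orders:
    key = (row["prod_id"], row["quarter"])
    computer_sales[key] += row["quantity"] * products[row["prod_id"]]["price"]

print(dict(computer_sales))
# {('PC1', '2004-Q1'): 2700.0, ('PC7', '2004-Q1'): 4500.0}
```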

5 Related Work

There is a relatively small body of research efforts around the issue of conceptual modeling of the DW back-stage.


In [12,13], model management, a framework for supporting meta-data related applications in which models and mappings are manipulated, is proposed. In [13], two scenarios related to loading DWs are presented as case studies: on the one hand, the mapping between the data sources and the DW; on the other hand, the mapping between the DW and a data mart. In this approach, a mapping is a model that relates the objects (attributes) of two other models; each object in a mapping is called a mapping object and has three properties: domain and range, which point to objects in the source and the target respectively, and expr, which is an expression that defines the semantics of that mapping object. This is an isolated approach in which the authors propose their own graphical notation for representing data mappings; therefore, from our point of view, it lacks integration with the design of the other parts of a DW. In [3] the authors attempt to provide a first model towards the conceptual modeling of the DW back-stage, and the notion of provider mappings among attributes is introduced. In order to avoid the problems caused by the specific nature of ER and UML, the authors adopt a generic approach. The static conceptual model of [3] is complemented in [5] with the logical design of ETL processes as data-centric workflows: ETL processes are modeled as graphs composed of activities that include attributes as FCME, and different kinds of relationships capture the data flow between the sources and the targets. Regarding data mapping, in [14] the authors discuss issues related to data mapping in the integration of data. A set of mapping operators is introduced and a classification of possible mapping cases is presented. However, no graphical representation of data mapping scenarios is provided, which makes it difficult to use in real-world projects. The issue of treating attributes as FCME has generated several debates from the beginning of the conceptual modeling field [15]. More recently, some object-oriented modeling approaches such as OSM (Object Oriented System Model) [16] or ORM (Object Role Modeling) [17] reject the use of attributes (attribute-free models), mainly because of their inherent instability. In these approaches, attributes are represented with entities (objects) and relationships. Although an ORM diagram can be transformed into a UML diagram, our data mapping diagram is coherently integrated into a global approach for the modeling of DWs [6,7], and particularly of ETL processes [4]. In this approach, we have used the extension mechanisms provided by UML to adapt it to our particular needs for the modeling of DWs, and we always use formal extensions of UML for modeling all parts of DWs.

6 Conclusions and Future Work

In this paper, we have presented a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity. Specifically, we have presented a disciplined framework for the modeling of the relationships between sources and targets at


different levels of granularity (i.e., from coarse mappings at the database level to detailed inter-attribute mappings at the attribute level). Unfortunately, standard modeling languages like the ER model or UML are fundamentally handicapped in treating low-granularity entities (i.e., attributes) as FCME. Therefore, in order to formally accomplish the aforementioned goal, we have extended UML to model attributes as FCME. In our attempt to provide complementary views of the design artifacts at different levels of detail, we have based our framework on a principled usage of UML packages that allows zooming in and out of the design of a scenario. Although we have developed the representation of attributes as FCME in UML in the context of DWs, we believe that our solution can be applied in other application domains as well, e.g., the definition of indexes and materialized views in databases, the modeling of XML documents, the specification of web services, etc. Currently, we are extending our proposal in order to represent attribute constraints such as uniqueness or disjunctive values.

References
1. SQL Power Group: How do I ensure the success of my DW? Internet: http://www.sqlpower.ca/page/dw_best_practices (2002)
2. Strange, K.: ETL Was the Key to this Data Warehouse's Success. Technical Report CS-15-3143, Gartner (2002)
3. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual Modeling for ETL Processes. In: Proc. of the 5th Intl. Workshop on Data Warehousing and OLAP (DOLAP 2002), McLean, USA (2002) 14–21
4. Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: Proc. of the 22nd Intl. Conf. on Conceptual Modeling (ER'03). Volume 2813 of LNCS., Chicago, USA (2003) 307–320
5. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL Activities as Graphs. In: Proc. of the 4th Intl. Workshop on the Design and Management of Data Warehouses (DMDW'02), Toronto, Canada (2002) 52–61
6. Luján-Mora, S., Trujillo, J., Song, I.: Extending UML for Multidimensional Modeling. In: Proc. of the 5th Intl. Conf. on the Unified Modeling Language (UML'02). Volume 2460 of LNCS., Dresden, Germany (2002) 290–304
7. Luján-Mora, S., Trujillo, J., Song, I.: Multidimensional Modeling with UML Package Diagrams. In: Proc. of the 21st Intl. Conf. on Conceptual Modeling (ER'02). Volume 2503 of LNCS., Tampere, Finland (2002) 199–213
8. Luján-Mora, S., Trujillo, J.: A Comprehensive Method for Data Warehouse Design. In: Proc. of the 5th Intl. Workshop on Design and Management of Data Warehouses (DMDW'03), Berlin, Germany (2003) 1.1–1.14
9. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P.: Fundamentals of Data Warehouses. 2nd edn. Springer-Verlag (2003)
10. Object Management Group (OMG): Unified Modeling Language Specification 1.4. Internet: http://www.omg.org/cgi-bin/doc?formal/01-09-67 (2001)
11. Lenzerini, M.: Data Integration: A Theoretical Perspective. In: Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, USA (2002) 233–246


12. Bernstein, P., Levy, A., Pottinger, R.: A Vision for Management of Complex Models. Technical Report MSR-TR-2000-53, Microsoft Research (2000)
13. Bernstein, P., Rahm, E.: Data Warehouse Scenarios for Model Management. In: Proc. of the 19th Intl. Conf. on Conceptual Modeling (ER'00). Volume 1920 of LNCS., Salt Lake City, USA (2000) 1–15
14. Dobre, A., Hakimpour, F., Dittrich, K.R.: Operators and Classification for Data Mapping in Semantic Integration. In: Proc. of the 22nd Intl. Conf. on Conceptual Modeling (ER'03). Volume 2813 of LNCS., Chicago, USA (2003) 534–547
15. Falkenberg, E.: Concepts for modelling information. In: Proc. of the IFIP Conference on Modelling in Data Base Management Systems, Amsterdam, Holland (1976) 95–109
16. Embley, D., Kurtz, B., Woodfield, S.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice-Hall (1992)
17. Halpin, T., Bloesch, A.: Data modeling in UML and ORM: a comparison. Journal of Database Management 10 (1999) 4–13

Informational Scenarios for Data Warehouse Requirements Elicitation

Naveen Prakash (1), Yogesh Singh (2), and Anjana Gosain (2)

(1) JIIT, A10, Sector 62, Noida 201307, India
(2) USIT, GGSIP University, Kashmere Gate, Delhi 110006, India

Abstract. We propose a requirements elicitation process for a data warehouse (DW) that identifies its information contents. These contents support the set of decisions that can be made. Thus, if the information needed to take every decision is elicited, then the total information determines DW contents. We propose an Informational Scenario as the means to elicit information for a decision. An informational scenario is written for each decision and is a sequence of pairs of the form < Query, Response >. A query requests information necessary to take a decision and the response is the information itself. The set of responses for all decisions identifies DW contents. We show that informational scenarios are merely another subclass of the class of scenarios.

1 Introduction

In the last decade, great interest has been shown in the development of data warehouses (DWs). We can look at data warehouse development at the design, the conceptual, and the requirements engineering levels. Two different approaches for the development of DWs have been proposed at the design level: the data-driven [9] and the requirements-driven [2,12,8,19] approaches. Given data needs, these approaches identify the logical structure of the DW. Jarke et al. [11] propose to add a conceptual layer on top of the logical layer. Whereas they propose the basic notion of the conceptual layer, it is assumed that the conceptual objects represented in the Enterprise Model can be determined; the question of what the useful conceptual objects for a DW are and how they are to be determined is not addressed. Thus, the conceptual level does not take into account the larger context in which the DW is to function. A relationship of the Data Warehouse to the organizational context is established at the requirements level. Fabio Rilson and Jaelson Freire [7] adapt traditional requirements engineering techniques to Data Warehouses. This approach starts with a Requirements Management Planning phase, for which the authors propose guidelines concerning the acquisition, documentation and control of selected requirements. The second phase covers a) Requirements Specification, which includes requirements elicitation through interviews, workshops,


prototyping and scenarios, b) Requirements Analysis, and c) Requirements Documentation. In the third and fourth phases, requirements are conformed and validated respectively. The proposal of [7] is a "top" level proposal that builds a framework for DW requirements engineering. While providing pointers to RE approaches that may be applicable, this proposal does not establish their feasibility and does not consider any detailed technical solutions.

Boehnlein et al. [3] present a goal-driven approach that is based on the SOM (Semantic Object Model) process modeling technique. It starts with the determination of two kinds of goals: one specifies the products and services to be provided, whereas the other determines the extent to which the goal is to be pursued. Information requirements are derived by analyzing business processes in increasing detail and by transforming relevant data structures of business processes into data structures of the data warehouse. According to [19], since data warehouse systems are developed exclusively to support decision processes, a detailed business process analysis is not feasible because the respective tasks are unique and often not structured; moreover, knowledge workers sometimes refuse to disclose their process in detail.

The proposal of [14] aims to identify the decisional information to be kept in the Data Warehouse. This process starts with the determination of the goals of an organization, uses these to arrive at its decision-making needs, and finally identifies the information needed for the decisions to be supported. The requirements engineering product is therefore a Goal-Decision-Information (GDI) diagram that uses two associations: 1) goal-decision coupling and 2) decision-information coupling. Whereas this proposal relates DW information contents to its decision-making capability as embedded in organizational goals, it is not backed up by a requirements elicitation process.

In this paper, we look at the requirements elicitation process for arriving at the GDI diagram. The total process has two parts. In the first part, the goal-decision coupling is elicited; that is, the set of decisions that can fulfill the goals of an organization is elicited. Thereafter, in the second part, the decision-information coupling applied to the elicited decisions yields decisional information. Here, we deal with the second part of this process.

We base our proposal on the notion of scenarios [13,16,6,10,18]. A scenario has been considered as a typical interaction between a user and the system to be developed. We treat this as the generic notion of a scenario; it is shown in Fig. 1 as the root node of the scenario typology hierarchy. We refer to traditional scenarios as transactional scenarios since they reveal the system functionality needed in the new system. We propose a second kind of scenarios called Data Warehouse scenarios. In consonance with our two-part process, for goal-decision coupling we propose decisional scenarios and for decision-information coupling we postulate informational scenarios (see Fig. 1). As mentioned earlier, our interest in this paper is in informational scenarios.

Informational scenarios reveal the information contents of a system. An informational scenario represents a typical interaction between a decision-maker


Fig. 1. Scenario Typology.

and the decisional system. This interaction is a sequence of pairs < Q, R >, where Q represents the query input to the system by the decision-maker and R represents the response obtained. This response yields the information to be kept in the decisional system, the data warehouse. In the next section we present the GDI model. In section 3, we define and illustrate informational scenarios. In subsection 3.1, we position them in the 4-dimensional classification system of scenarios. In subsection 3.2, we show elicitation of decisions from an informational scenario. The paper ends with a conclusion.

2 The GDI

The Goal-Decision-Information (GDI) model is shown in Fig. 2. In accordance with goal-orientation [1,4], we view a goal as an aim or objective that is to be met. A goal is a passive concept: unlike an activity/process/event, it cannot perform or cause any action to be performed. A goal is set, and once so defined it needs an active component to realize it. The active component is the decision. Further, to fulfil decisions, appropriate information is required.

Fig. 2. GDI Diagram.


As shown in Fig. 2, a goal can be either simple or complex. A simple goal cannot be decomposed into simpler ones. A complex goal is built out of other goals, which may themselves be simple or complex; this makes a goal hierarchy. The component goals of a complex one may be mandatory or optional. A decision is a specification of an active component that causes goal fulfillment. It is not the active component itself: when a decision is selected for implementation, one or more actions may be performed to give effect to it. In other words, a decision is the intention to perform the actions that cause its implementation. Decision-making is an activity that results in the selection of the decision to be implemented. It is while performing this activity that information to select the right decision is needed. As shown in Fig. 2, a decision can be either simple or complex. A simple decision cannot be decomposed into simpler ones, whereas a complex decision is built out of other simple or complex decisions.

Fig. 2 shows that there is an association ‘is satisfied by’ between goals and decisions. This association identifies the decisions which, when taken, can lead to goal satisfaction. Knowledge necessary to take decisions is captured in the notion of decisional information shown in Fig. 2. This information is a specification of the data that will eventually be stored in the Data Warehouse. Fig. 2 shows that there is an association ‘is required for’ between decisions and decisional information. This association identifies the decisional information required to take a decision.

An instance of the GDI diagram, the GDI schema, is shown in Fig. 4. It shows a goal hierarchy (solid lines between ‘Maximize profit’ and ‘increase the no. of customers’ and ‘increase sales’) and a decision hierarchy (solid lines between ‘improve the quality of the product’ and ‘introduce statistical quality control techniques’ and ‘use better quality material’) for a given set of goals and decisions. The figure shows the ‘is satisfied by’ relationship between the goal ‘increase sales’ and the decisions ‘open new sales counter’ and ‘improve the quality of the product’ by dashed lines. The ‘is required for’ relationship between decisions and associated information is shown by dotted lines.

The dynamics of the interaction between goals, decisions and information is shown in Fig. 3. A goal suggests a set of decisions that lead to its satisfaction. A decision can be taken after consulting the information relevant to it and available in the decisional system. In the reverse direction, information helps in selecting a decision, which in turn satisfies a goal. For example, the goal ‘increase sales’ suggests the decisions ‘improve the quality of the product’ and ‘open new sales counter’. These decisions may modify

Fig. 3. The Interaction Cycle.


Fig. 4. A GDI Schema.

the goal state. To implement the decision ‘open new sales counter’, the decision-maker consults the information ‘Existing product demand’ and ‘Existing service/customer centers’.
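To make the couplings of the GDI schema in Fig. 4 concrete, the following is a minimal, purely illustrative relational encoding with sample rows taken from the figure's description; all table and column names are our own assumptions and are not part of the GDI proposal itself.

-- Hypothetical relational encoding of a GDI schema (illustrative only)
CREATE TABLE goal     (goal_id INTEGER PRIMARY KEY, name VARCHAR(100), parent_goal INTEGER REFERENCES goal(goal_id));
CREATE TABLE decision (dec_id  INTEGER PRIMARY KEY, name VARCHAR(100), parent_dec  INTEGER REFERENCES decision(dec_id));
CREATE TABLE info     (info_id INTEGER PRIMARY KEY, name VARCHAR(100));
-- 'is satisfied by' links goals to decisions; 'is required for' links information to decisions
CREATE TABLE is_satisfied_by (goal_id INTEGER REFERENCES goal(goal_id), dec_id INTEGER REFERENCES decision(dec_id));
CREATE TABLE is_required_for (info_id INTEGER REFERENCES info(info_id), dec_id INTEGER REFERENCES decision(dec_id));

-- Sample rows corresponding to the GDI schema of Fig. 4
INSERT INTO goal     VALUES (1, 'Maximize profit', NULL);
INSERT INTO goal     VALUES (2, 'increase sales', 1);
INSERT INTO decision VALUES (1, 'open new sales counter', NULL);
INSERT INTO decision VALUES (2, 'improve the quality of the product', NULL);
INSERT INTO info     VALUES (1, 'Existing product demand');
INSERT INTO info     VALUES (2, 'Existing service/customer centers');
INSERT INTO is_satisfied_by VALUES (2, 1);
INSERT INTO is_satisfied_by VALUES (2, 2);
INSERT INTO is_required_for VALUES (1, 1);
INSERT INTO is_required_for VALUES (2, 1);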

3 Informational Scenario

In this section, we show the elicitation of decisional information. The decision-information coupling suggests that the information needed to select a decision can be obtained from the knowledge of the decision itself. Thus, if we focus attention on a decision, then through a suitable elicitation mechanism we can obtain the information relevant to it. Our informational scenario is one such elicitation mechanism. It can be seen that the informational scenario is an expression of the ‘is required for’ relationship between a simple decision and information of the GDI diagram (see Fig. 2). An informational scenario is a typical interaction between the decision-maker and the decisional system. An informational scenario is written for each simple decision of the GDI diagram, and is a sequence of pairs < Q, R >, where Q represents the query input to the decisional system and R represents the response of the decisional system. An informational scenario is thus of the form

< Q1, R1 >, < Q2, R2 >, ..., < Qn, Rn >.

The set of queries, Q1 through Qn, is an expression of the information relevant to the decision of the scenario. The information contents of the data warehouse can be derived from the set of responses R1 through Rn. We represent a query in SQL, and a response is represented as a relation. Once a response has been received, it can be used in two ways: (a) the relation attributes identify the information type to be maintained in the warehouse, and


(b) the tuple values can suggest the formulation of supplementary queries to elicit additional information. It is possible that all values in all tuples are non-null. In that case there is full knowledge of the data, and a certain supplementary query sequence follows from this. We refer to such a < Q, R > sequence as a normal scenario (see Fig. 6). In case a tuple contains a null value, this ‘normal’ sequence will not be followed and the next query may be formulated to explore the null value more deeply. This breaks the normal thought process and results in a different sequence of < Q, R >. We call this sequence an exceptional scenario. Fig. 5 shows these two types of informational scenarios.

Let us illustrate the notion of normal and exceptional scenarios. Let us be given the decision “Open new sales counter”. In order to make this decision, the decision-maker would like to know the units sold for different products at the various sales counters of each region. After all, a new sales counter makes sense in an area where (a) units sold are so high that existing counters are overloaded, or (b) units sold are very low and this could be merely due to the absence of a sales outlet. So, the first query is formulated to reveal this information: How many units of different products have been sold at various sales counters in each region? This query shows that Region, Sales counter, Product and Number of units sold must be made available in the data warehouse.

    Select regions, sales counter, product, units sold
    From sales, region

Let the response be as follows:

    Regions  Sales Counter  Product  Units Sold
    NR       Null           Radio    Null
    NR       Null           TV       Null
    SR       Lata           Radio    90
    SR       Lata           TV       200
    SR       Lata           Fridge   200
    SR       Kanika         Radio    Null
    SR       Kanika         TV       110
    SR       Kanika         Fridge   110
    ER       Rubina         Radio    80
    ER       Rubina         TV       250
    ER       Rubina         Fridge   230
    CR       Null           Radio    Null
    CR       Null           TV       Null

Let it be that the decision-maker is not interested in exploring ‘null’ for the moment. Instead, he wishes to see if unsold stock exists in some large quantity; if so, then the opening of a sales counter might help in clearing unsold stock. So, the decision-maker may ask for the number of units manufactured. If the manufactured quantity is not sold, then he may think of opening a new sales counter in a particular region. This query and its response are shown in Fig. 7a (a possible formulation is sketched below). This results in the normal scenario shown in Fig. 6.
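Since Fig. 7a itself is not reproduced in this text, the following is only a hedged sketch of how the supplementary query might be formulated, in the same informal SQL style as the first query; the table name production and the column units manufactured are hypothetical, not taken from the original figure.

    -- units manufactured per region and product (hypothetical production table)
    Select regions, product, units manufactured
    From production, region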


Fig. 5. An Information Scenario.

Fig. 6. Normal and Exceptional Scenario.

Suppose instead that the decision-maker wishes to explore the ‘null’ values found in the sales counter column of some regions. The reasoning followed is that if there are service centers in the regions NR and CR which are, in fact, servicing a number of company products, then there is sufficient demand in these regions. This may call for the opening of sales counters. This query and its response are shown in Fig. 7b. This sequence and any further < Q, R > pairs following from it constitute the exceptional scenario shown in Fig. 6. If, in Fig. 7b, the response contains null values for service centers for some region, the decision-maker may again wish to explore the ‘null’ values found in the service centers of regions. The reasoning followed is that if there are sales counters and no service centers in the region CR, then, to take the decision ‘open new sales counter’ in CR, he may ask for the number of sales counters of other companies manufacturing the same products. This query and its response are shown in Fig. 7c. This sequence and any further < Q, R > pairs following from it constitute another exceptional scenario. It also shows that an exceptional scenario can lead to another exceptional scenario, and so on. (Possible formulations of these two queries are sketched below.)
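Figs. 7b and 7c are likewise not reproduced here, so the following are hedged sketches of the two exploratory queries described above, again in the informal SQL style of the paper; the table and column names (service center, competitor counter, company) are our own assumptions.

    -- Fig. 7b (sketch): service centers per region and the products they service
    Select regions, service center, product
    From service center, region

    -- Fig. 7c (sketch): sales counters of other companies selling the same products
    Select regions, company, sales counter, product
    From competitor counter, region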

3.1 Positioning of Informational Scenario

Here we show how an informational scenario, as a subclass of the class of scenarios, is positioned in the 4-dimensional scenario classification framework proposed by [17].


Fig. 7a

Fig. 7b

Fig. 7c

The 4-dimensional framework considers scenarios along four different views, each view allowing us to capture a particular relevant aspect of the scenarios. The


Form view deals with the expression mode of a scenario. The Contents view concerns the kind of knowledge which is expressed in a scenario. The Purpose view is used to capture the role that a scenario is aiming to play in the requirements engineering process. The Lifecycle view suggests considering scenarios as artifacts existing and evolving in time through the execution of operations during the requirements engineering process. A set of facets is associated with each view; facets are considered as viewpoints suitable to characterize and classify a scenario according to that view. A facet has a metric attached to it, and each facet is measured by a set of relevant attributes. Table 1 shows the views, facets, attributes and possible values of these attributes in the 4-dimensional framework, together with the attribute values that our informational scenario takes on. Consider, for instance, the level of formalism attribute of the Form view. This takes on the value Formal because of the use of SQL and relations in the scenario expression. It is possible to express a scenario less formally by using a free format; were such a scenario to exist, its level of formalism would have the value Informal. The informational scenario proposed by us is thus characterized according to these four views.

3.2 Elicitation of Decisions

In this section we show that informational scenarios can help in eliciting decisions as well. These decisions are suggested by an analysis of the < Q, R > sequence of the scenario. Let us consider the decision “Open new sales counter” again. Let the decision-maker make a query as follows: What are the units sold for different products at various sales counters in each region? Let the response be as follows:

    Regions  Sales Counter  Product  Units Sold
    SR       Lata           Radio    30
    SR       Lata           TV       100
    SR       Lata           Fridge   90
    SR       Kanika         Radio    25
    SR       Kanika         TV       90
    SR       Kanika         Fridge   90
    ER       Rubina         Radio    40
    ER       Rubina         TV       120
    ER       Rubina         Fridge   100

The response shows that the number of units sold for the different products is very low. The decision-maker may therefore no longer be interested in continuing with the decision “open new sales counter”. Since the number of units sold is low, the decision-maker may now be interested in improving product sales. This leads to the elicitation of the new decision ‘Improve Product Sales’, and an informational scenario is now written out for this decision (a query that makes the low-sales observation explicit is sketched below).
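As a hedged illustration of how such an analysis of a response can be made explicit, the query below totals the units sold per region and sales counter from the response above; the threshold of 150 units is purely hypothetical and only serves to show how a low-sales observation could trigger the new decision. The join between sales and region is left implicit, as in the paper's own query.

    Select regions, sales counter, sum(units sold) as total units
    From sales, region
    Group by regions, sales counter
    Having sum(units sold) < 150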



Thus it is possible to move in both directions along the decision-information coupling. An informational scenario is written for a given decision, which may lead to the elicitation of other decisions, which in turn lead to further informational scenarios.

4 Conclusions

Information Systems/Software Engineering moved from early ‘code and fix’ approaches through design to requirements engineering, so that considerable exploration of the problem space is performed before implementation. We can see the same evolution in DW engineering: as mentioned in the Introduction, attempts have been made to introduce the design and conceptual layers. This evolution has the same expectations as before, namely, developing systems that better fit organization needs and user requirements. Thus, we expect that today's practice, where analysts understand DW use after it has been implemented/used, will give way to a systematic approach satisfying the various stakeholders. Analysts will understand DW use partly through the argumentation and reasoning process of requirements engineering and partly through the use of the prototyping process model.

Just as traditional scenarios elicit the functional requirements of transactional systems, informational scenarios elicit the informational requirements of decisional systems. Both belong to the general class of scenarios and represent typical interactions between the user and the system to be developed. In traditional scenarios the interest is in functional interaction: if the user does this, then the system does that. In informational scenarios the interest is in obtaining information, and we have an information-seeking interaction: if I ask for this information, what do I get? Information may be missing or available; depending on this, the user may formulate other information-seeking interactions. We have used this to classify scenarios as normal or exceptional. Finally, it is possible that informational scenarios may suggest new decisions, thus helping in decision elicitation.

We are working on framing guidelines for informational scenarios. Future work also concerns decision elicitation by exploiting the goal-decision coupling.

References
1. Anton, A.I.: Goal based requirements analysis. Proceedings of the 2nd International Conference on Requirements Engineering ICRE'96 (1996) 136–144.
2. Ballard, C., Herreman, D., Schau, D., Bell, R., Kim, E., Valencic, A.: Data Modeling Techniques for Data Warehousing, redbooks.ibm.com (1998).
3. Boehnlein, M., Ulbrich vom Ende, A.: Business Process Oriented Development of Data Warehouse Structures. Proceedings of Data Warehousing 2000, Physica Verlag (2000).
4. Bubenko, J., Rolland, C., Loucopoulos, P., De Antonellis, V.: Facilitating ‘fuzzy to formal’ requirements modelling. IEEE 1st Conference on Requirements Engineering, ICRE'94 (1994) 154–158.


5. Cockburn, A.: Structuring use cases with goals. Technical report, Human and Technology, 7691 Dell Rd, Salt Lake City, UT 84121, HaT.TR.95.1 (1995).
6. Dano, B., Briand, H., Barbier, F.: A use case driven requirements engineering process. Third IEEE International Symposium on Requirements Engineering RE'97, Annapolis, Maryland, IEEE Computer Society Press (1997).
7. Rilson, F., Freire, J.: DWARF: An Approach for Requirements Definition and Management of Data Warehouse Systems. Proceedings of the 11th IEEE International Requirements Engineering Conference, September 8–12 (2003) 1090–1099.
8. Golfarelli, M., Maio, D., Rizzi, S.: Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the 31st HICSS, IEEE Press (1998).
9. Inmon, W.H.: Building the Data Warehouse. John Wiley and Sons (1996).
10. Jacobson, I.: The use case construct in object-oriented software engineering. In: Scenario-based design: envisioning work and technology in system development, J.M. Carroll (ed.), John Wiley and Sons (1995) 309–336.
11. Jarke, M., Jeusfeld, A., Quix, C., Vassiliadis, P.: Architecture and Quality in Data Warehouses. Proceedings of the 10th CAiSE Conference (1998) 93–113.
12. Poe, V.: Building a Data Warehouse for Decision Support. Prentice Hall (1996).
13. Potts, C., Takahashi, K., Anton, A.I.: Inquiry-based requirements analysis. IEEE Software 11(2) (1994) 21–23.
14. Prakash, N., Gosain, A.: Requirements Engineering for Data Warehouse Development. Proceedings of the CAiSE'03 Forum (2003).
15. Bruckner, R.M., List, B.: Developing requirements for data warehouses using use cases. Seventh Americas Conference on Information Systems (2003).
16. Rolland, C., Souveyet, C., Achour, C.B.: Guiding goal modelling using scenarios. IEEE Transactions on Software Engineering, Special Issue on Scenario Management 24(12) (1998).
17. Rolland, C., Grosz, G., Kla, R.: A proposal for a scenario classification framework. Journal of Requirements Engineering (RE'98) (1998).
18. Rubin, K.S., Goldberg, A.: Object behavior analysis. Communications of the ACM 35(9) (1992) 48–62.
19. Winter, R., Strauch, B.: A Method for Demand-driven Information Requirements Analysis in Data Warehouse Projects. Proceedings of the Hawaii International Conference on System Sciences, January 6–9 (2003).

Extending UML for Designing Secure Data Warehouses

Eduardo Fernández-Medina (1), Juan Trujillo (2), Rodolfo Villarroel (3), and Mario Piattini (1)

(1) Dep. Informática, Univ. Castilla-La Mancha, Spain
    {Eduardo.FdezMedina,Mario.Piattini}@uclm.es
(2) Dept. Lenguajes y Sistemas Informáticos, Univ. Alicante, Spain
(3) Dept. Comput. e Informática, Univ. Católica del Maule, Chile

Abstract. Data Warehouses (DW), Multidimensional (MD) Databases, and On-Line Analytical Processing Applications are used as a very powerful mechanism for discovering crucial business information. Considering the extreme importance of the information managed by these kinds of applications, it is essential to specify security measures from the early stages of DW design in the MD modeling process, and to enforce them. In the past years, there have been some proposals for representing the main MD modeling properties at the conceptual level. Nevertheless, none of these proposals considers security measures as an important element in their models, so they do not allow us to specify confidentiality constraints to be enforced by the applications that will use these MD models. In this paper, we discuss the confidentiality problems regarding DWs and we present an extension of the Unified Modeling Language (UML) that allows us to specify the main security aspects in conceptual MD modeling, thereby allowing us to design secure DWs. Then, we show the benefit of our approach by applying this extension to a case study. Finally, we also sketch how to implement the security aspects considered in our conceptual modeling approach in a commercial DBMS.

Keywords: Secure data warehouses, UML extension, multidimensional modeling, OCL

1 Introduction

Multidimensional (MD) modeling is the foundation of Data Warehouses (DW), MD Databases and On-Line Analytical Processing (OLAP) Applications. These systems are used as a very powerful mechanism for discovering crucial business information in strategic decision-making processes. Considering the extreme importance of the information that a user can discover by using these kinds of applications, it is crucial to specify confidentiality measures in the MD modeling process, and to enforce them. On the other hand, information security is a serious requirement which must be carefully considered, not as an isolated aspect, but as an element present in all stages of the development lifecycle, from requirements analysis to implementation and maintenance [4, 6]. To achieve this goal, different ideas for integrating security in the system development process have been proposed [2, 8], but they only consider information security from a cryptographic point of view, without considering database- and DW-specific issues. There are some proposals that try to integrate security into conceptual modeling. UMLsec [9], where UML is extended to develop secure systems, is probably the most


relevant one. This approach is very interesting, but it only deals with information systems (IS) in general, whilst conceptual database and DW design are not considered. A methodology and a set of models have recently been proposed [5] in order to design secure databases to be implemented with Oracle9i Label Security (OLS) [11]. This approach, based on the UML, is important because it considers security aspects in all stages of the development process, from requirements gathering to implementation. Together with the previous methodology, the proposed Object Security Constraint Language (OSCL) [14], based on the Object Constraint Language (OCL) [19] of UML, allows us to specify security constraints in the conceptual and logical database design process, and to implement these constraints in a concrete database management system (DBMS) such as OLS. Nevertheless, this methodology and these models do not consider the design of secure MD models for DWs.

In the literature, we can find several initiatives to include security in DWs [15, 16]. Many of them are focused on interesting aspects related to access control, multilevel security, its application to federated databases, applications using commercial tools, and so on. These initiatives refer to specific aspects that allow us to improve DW security in acquisition, storage, and access. However, none of them considers security aspects comprising all stages of the system development cycle, nor considers security in MD conceptual modeling.

Regarding the conceptual modeling of DWs, various approaches have been proposed to represent the main MD properties at the conceptual level (due to space constraints, we refer the reader to [1] for a detailed comparison of the most relevant ones). These proposals provide their own non-standard graphical notations, and none of them has been widely accepted as a standard conceptual model for MD modeling. Recently, another approach [12, 18] has been proposed as an object-oriented conceptual MD modeling approach. This proposal is a profile of the UML [13], which uses its standard extension mechanisms (stereotypes, tagged values and constraints). However, none of these approaches considers security as an important issue in their conceptual models, so they do not solve the problem of security in DWs.

In this paper, we present an extension of the UML (a profile) that allows us to represent the main security information of data and their constraints in MD modeling at the conceptual level. The proposed extension is based on the profile presented in [12] for conceptual MD modeling, because it allows us to consider the main MD modeling properties and it is based on the UML (so designers avoid learning a new specific notation or language). We consider the multilevel security model [17], but focus on aspects regarding read operations because this is the most common operation for final user applications. This model allows us to classify both information and users into security classes, and to enforce mandatory access control [17]. By using this approach, we are able to implement secure MD models with any commercial DBMS that is able to implement multilevel databases, such as OLS [11] or DB2 Universal Database (UDB) [3].

The remainder of this paper is structured as follows: Section 2 briefly summarizes the conceptual approach for MD modeling on which we base our work. Section 3 proposes the new UML extension for secure MD modeling. Section 4 presents a case study and applies our UML extension for secure MD modeling. Section 5 sketches some further implementation issues. Finally, Section 6 presents the main conclusions and introduces our immediate future work.


2 Object-Oriented Multidimensional Modeling

In this section, we outline our approach, based on the UML [12, 18], for DW conceptual modeling. This approach has been specified by means of a UML profile that contains the necessary stereotypes to represent all the main features of MD modeling at the conceptual level [7]. In this approach, structural properties are specified by a UML class diagram in which information is organized into facts and dimensions. Facts and dimensions are represented by means of fact classes and dimension classes respectively. Fact classes are defined as composite classes in shared aggregation relationships of n dimension classes. The many-to-many relations between a fact and a specific dimension are specified by means of the multiplicity on the role of the corresponding dimension class. In our example in Fig. 1, we can see how the Sales fact class has a many-to-many relationship with the Product dimension.

A fact is composed of measures or fact attributes. By default, all measures are considered to be additive. For non-additive measures, additive rules are defined as constraints. Moreover, derived measures can also be explicitly represented (by /) and their derivation rules are placed between braces near the fact class. Our approach also allows the definition of identifying attributes in the fact class (stereotype OID). In this way degenerated dimensions can be considered [10], thereby representing other fact features in addition to the measures for analysis. For example, we could store the ticket number (ticket_number) as a degenerated dimension, as reflected in Fig. 1.

Fig. 1. Multidimensional modeling using the UML

Regarding dimensions, each level of a classification hierarchy is specified by a base class (stereotype Base). An association of base classes specifies the relationship between two levels of a classification hierarchy. These classes must define a Directed Acyclic Graph (DAG) rooted in the dimension class (DAG constraint). The DAG structure can represent both multiple and alternative path hierarchies. Every base class must also contain an identifying attribute (OID) and a descriptor attribute (stereotype D). These attributes are necessary for an automatic generation process into commercial OLAP tools, as these tools store this information in their metadata.

A descriptor attribute will be used as the default label in the data analysis in OLAP tools.


We can also consider non-strict hierarchies (an object at a hierarchy's lower level belongs to more than one higher-level object) and complete hierarchies (all members belong to one higher-class object and that object consists of those members only). These characteristics are specified by means of the multiplicity of the roles of the associations and by defining the constraint {completeness} in the target associated class role, respectively. See the Store dimension in Fig. 1 for an example of all kinds of classification hierarchies. Lastly, the categorization of dimensions is considered by means of the generalization/specialization relationships of UML.
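As a purely illustrative aside (not part of the profile itself), the following sketches one possible relational counterpart of the Fig. 1 example; apart from ticket_number, Sales, Product and Store, which come from the text, the column names are our own assumptions.

-- One possible relational counterpart of the Fig. 1 example (illustrative only)
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name VARCHAR(100));
CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, name VARCHAR(100), city VARCHAR(100));
CREATE TABLE sales (
  ticket_number INTEGER,                              -- degenerated dimension (stereotype OID)
  product_id    INTEGER REFERENCES product(product_id),
  store_id      INTEGER REFERENCES store(store_id),
  quantity      INTEGER,                              -- an additive measure (hypothetical name)
  PRIMARY KEY (ticket_number, product_id, store_id)
);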

3 UML Extension for Secure Multidimensional Modeling

The goal of this UML extension is to allow us to design MD conceptual models while classifying the information in order to define which properties users must have to be entitled to access the information. Therefore, we have to consider three main stages:

1. Defining precisely the organization of the users that will have access to the MD system. We can define a precise level of granularity by considering three ways of organizing the users: security hierarchy Levels (which indicate the clearance level of the user), user Compartments (which indicate a horizontal classification of users), and user Roles (which indicate a hierarchical organization of users according to their roles or responsibilities within the organization).

2. Classifying the information in the MD model. We can define the security information for each element of the model (fact class, dimension class, etc.) by using a tuple composed of a sequence of security levels, a set of user compartments, and a set of user roles. We can also specify security constraints considering this security information. This security information and these constraints indicate the security properties that users must have to be able to access the information.

3. Enforcing the mandatory access control (AC). The typical operations executed by final users in this type of system are query operations, so the mandatory access control has to be enforced for read operations, whose access control rule is as follows: a user can access a piece of information only if a) the security level of the user is greater than or equal to the security level of the information, b) all the user compartments that have been defined for the information are owned by the user, and c) at least one of the user roles defined for the information is played by the user (a sketch of this check as a query predicate is given below).

In this paper, we will only focus on the second stage, by defining a UML extension that allows us to classify the security elements in a conceptual MD model and to specify security constraints. Furthermore, in Section 5, we sketch how to deal with the third stage by generating the needed structures in the target DBMS to consider all security aspects represented in the conceptual MD model. Finally, let us point out that the first stage concerns security policies defined in the organization by managers, and it is out of the scope of this paper.

We define our UML extension for secure conceptual MD modeling following a schema composed of these elements: description, prerequisite extensions, stereotypes/tagged values, well-formedness rules, and comments. For the definition of the stereotypes, we consider a structure that is composed of a name, the base metaclass, the description, the tagged values and a list of constraints defined by means of OCL. For the definition of tagged values, the type of the tagged values, the multiplicity, the description, and the default value are defined.
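The following is a minimal sketch of the read access control rule of stage 3 expressed as a SQL filter. The tables (user_profile, user_compartment, user_role, and the per-row security columns and auxiliary tables on the fact) are hypothetical and only illustrate conditions a)–c); they are not part of the proposed profile, and the security level is assumed to be encoded as an ordered number.

-- rows of the fact visible to a given user :usr (illustrative encoding)
Select f.*
From fact_instance f, user_profile u
Where u.user_name = :usr
  -- a) the user's clearance dominates the row's security level
  and u.security_level >= f.security_level
  -- b) every compartment required by the row is owned by the user
  and not exists (select * from fact_compartment fc
                  where fc.fact_id = f.fact_id
                    and fc.compartment not in
                        (select uc.compartment from user_compartment uc
                         where uc.user_name = u.user_name))
  -- c) at least one role required by the row is played by the user
  and exists (select * from fact_role fr, user_role ur
              where fr.fact_id = f.fact_id
                and ur.user_name = u.user_name
                and fr.role_name = ur.role_name)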


3.1 Description

This UML extension reuses a set of stereotypes previously defined in [12], and defines new tagged values, stereotypes, and constraints, which enable us to define secure MD models. The 20 tagged values we have defined are applied to certain components that are particular to MD modeling, allowing us to represent them in the same model and in the same diagrams that describe the rest of the system. These tagged values will represent the sensitive information of the different elements of MD modeling (fact class, dimension class, etc.), and they will allow us to specify security constraints depending on this security information and on the value of certain attributes of the model. The stereotypes will help us identify a special class that will define the profile of the system users. A set of inherent constraints is specified in order to define well-formedness rules. The correct use of our extension is assured by the definition of constraints in both natural language and OCL [19].

Fig. 2. Extension of the UML with stereotypes

Thus, we have defined 7 new stereotypes: one specializes the Class model element, two specialize the Primitive model element and four specialize the Enumeration model element. In Fig. 2, we have represented portions of the UML metamodel to show where our stereotypes fit. We have only represented the specialization hierarchies, as the most important fact about a stereotype is the base class that the stereotype specializes. In these figures, new stereotypes are colored in dark grey, whereas stereotypes we reuse from our previous profile [12] are in light grey and classes from the UML metamodel remain white.

3.2 Prerequisite Extensions

This UML profile reuses stereotypes previously defined in another UML profile [12]. This profile provided the needed stereotypes, tagged values, and constraints to accomplish

All the metaclasses come from the Core Package, a subpackage of the Foundation Package. We based our extension on the UML 1.5 as this is the current accepted standard. To the best of our knowledge, the current UML 2.0 is not the final accepted standard yet.


the MD modeling properly, allowing us to represent the main MD properties at the conceptual level. To facilitate the comprehension of the UML profile we present and use in this paper, we provide a brief description of these stereotypes in Table 1.

3.3 Datatypes

First of all, we need to define some new data types to be used in our tagged value definitions. The type Level (Fig. 3 (a)) will be the ordered enumeration composed of all security levels that have been considered (these values are typically unclassified, confidential, secret and top secret, but they could be different). The type Levels (Fig. 3 (b)) will be an interval of levels composed of a lower level and an upper level. The type Role (Fig. 3 (c)) will represent the hierarchy of user roles that can be defined for the organization. The type Roles is a set of role trees or subtrees. The type Compartment (Fig. 3 (d)) is the enumeration composed of all user compartments that have been considered for the organization. The type Compartments is a set of user compartments. The type Privilege (Fig. 3 (e)) will be an ordered enumeration composed of all the different privileges that have been considered (these values are typically read, insert, delete, update, and all). The type Attempt (Fig. 3 (f)) will be an ordered enumeration composed of all the different access attempts that have been considered (these values are typically none, all, frustratedAttempt and successfulAccess, but they could be different).

Fig. 3. New Data types

In Fig. 2 we can see the base classes from which these new stereotypes are specialized. All the information comprised by these new stereotypes has to be defined for each


MD model depending on its confidentiality properties, and on the number of users and complexity of the organization in which the MD model will be operative. Finally, we need some syntactic definitions that are not considered in the standard OCL. Particularly, we need the new collection type Tree with its typical operations.

3.4 Tagged Values

In this section, we provide the definition of several tagged values for the model, classes, attributes, instances and constraints.


Table 2 shows the tagged values of all elements in this extension. All default values of the security tagged values of the model are empty collections. On the other hand, the default value of the security tagged values for each class is the least restrictive one (the lowest security level, the security role hierarchy that has been defined for the model, and the empty set of compartments). The default value of the security tagged values for attributes is inherited from the class they belong to. If we need to specify the situation in which accesses to the information of a class have to be recorded in a log file for future audit, we should use the LogType and LogCond tagged values together in that class. By default, the value of LogType is none, so no audit is performed by default. On the other hand, if we need to specify a security constraint, we can use OCL and the InvolvedClasses tagged value to specify in which situation the constraint has to be enforced. By default, the value of this tagged value is the class to which the constraint is associated. Finally, if we need to specify a special security constraint in which one or more users (depending on a condition) can or cannot access the corresponding class, independently of the security information of that class, we should use exceptions together with the following tagged values: InvolvedClasses, ExceptSign, ExceptPrivilege and ExceptCond. The default value of InvolvedClasses is the class itself. The default value for ExceptSign is +, and for ExceptPrivilege it is Read.

3.5 Stereotypes

By using all these tagged values, we can specify security constraints on an MD model depending on the values of attributes and tagged values. In this extension, we need to define one stereotype in order to specify other types of security constraints (Table 3). The stereotype UserProfile can be necessary to specify constraints depending on particular information about a user or a group of users, e.g., depending on citizenship, age, etc. Then, the previously defined data types and tagged values will be used on the fact, dimension and base stereotypes in order to consider other security aspects.

3.6 Well-Formedness Rules

We can identify some well-formedness rules and specify them in both natural language and OCL constraints. These rules are grouped in Table 4.



3.7 Comments

Many of the previous constraints are very intuitive, but we have to ensure their fulfillment; otherwise the system can be inconsistent. Moreover, the designer can specify security constraints with OCL. If the security information of a class or an attribute depends on the value of an instance attribute, it can be specified as an OCL expression (Fig. 4). Normally, security constraints defined for stereotypes of classes (fact, dimension and base) will be defined by using a UML note attached to the corresponding class instance. We do not impose any restrictions on the content of these notes, in order to allow the designer the greatest flexibility, except those imposed by the


tagged values definitions. The connection between a note and the element it applies to is shown by a dashed line without an arrowhead as this is not a dependency [13].

4 A Case Study Applying Our Extension for Secure MD Modeling

In this section, we apply our extension to the conceptual design of a secure DW in the context of a reduced health-care system. The simplified hierarchy of the system user roles is as follows: HospitalEmployees are classified into health and non-health users; health users can be Doctors or Nurses, and non-health users can be Maintenance or Administrative. The defined security levels are unclassified, secret and topSecret.

1. Fig. 4 shows an MD model that includes a fact class (Admission), two dimensions (Diagnosis and Patient), two base classes (Diagnosis_group and City), and a class (UserProfile). The UserProfile class (stereotype UserProfile) contains the information of all users who will have access to this MD model. The Admission fact class (stereotype Fact) contains all individual admissions of patients in one or more hospitals, and can be accessed by all users who have secret or topSecret security levels (tagged value SecurityLevels (SL) of classes) and play health or administrative roles (tagged value SecurityRoles (SR) of classes). Note that the cost attribute can only be accessed by users who play the administrative role (tagged value SR of attributes). The Patient dimension contains the information of hospital patients, and can be accessed by all users who have the secret security level (tagged value SL) and play health or administrative roles (tagged value SR). The Address attribute can only be accessed by users who play the administrative role (tagged value SR of attributes). The City base class contains the information of cities, and it allows us to group patients by cities. Cities can be accessed by all users who have the confidential security level (tagged value SL). The Diagnosis dimension contains the information of each diagnosis, and can be accessed by users who play the health role (tagged value SR) and have the secret security level (tagged value SL). Finally, Diagnosis_group contains a set of general groups of diagnoses. Diagnosis groups can be accessed by all users who have the confidential security level (tagged value SL).

Several security constraints have been specified by using the previously defined constraints, stereotypes and tagged values (the number of each numbered paragraph corresponds to the number of each note in Fig. 4):

2. The security level of each instance of Admission is defined by a security constraint specified in the model. If the value of the description attribute of the Diagnosis_group which belongs to the diagnosis that is related to the Admission is cancer or AIDS, the security level (tagged value SL) of this admission will be topSecret, otherwise secret. This constraint is only applied if the user makes a query in which the information comes from the Diagnosis dimension or the Diagnosis_group base class together with the Patient dimension (tagged value involvedClasses).

3. The security level (tagged value SL) of each instance of Admission can also depend on the value of the cost attribute, which indicates the price of the admission service. In this case, the constraint is only applicable for queries that contain information of the Patient dimension (tagged value involvedClasses).

4. The tagged value logType has been defined for the Admission class, specifying the value frustratedAttempts. This tagged value specifies that the system has to record, for future audit, the situations in which a user tries to access information of this fact class and the system denies it because of a lack of permissions.


5. For confidentiality reasons, we could deny access to admission information to users whose working area is different from the area of a particular admission instance. This is specified by another exception in the Admission fact class, considering the tagged values involvedClasses, exceptSign and exceptCond. If patients are special users of the system, they could access their own information as patients (e.g., for querying their personal data). This constraint is specified by using the exceptSign and exceptCond tagged values in the Patient class.

Fig. 4. Example of multidimensional model with security information and constraints

5 Implementation

Oracle9i Label Security [11] allows us to implement multilevel databases. It defines labels that are assigned to the rows and users of a database and that contain confidentiality information and authorization information for rows and users respectively. Moreover, OLS allows us to specify labeling functions and predicates that are triggered when an operation is executed, and which define the value of the security labels. A secure MD model can be implemented with OLS. The two main security elements that we include in this UML extension are the confidentiality information of data and the security constraints. The basic concepts of an MD model (fact, dimension and base classes) are implemented as tables in a relational database. The security information

Version 2 of OCL considers a special syntax for enumerations (EnumTypeName::EnumLiteralValue), but in this example, for the sake of readability, we consider only EnumLiteralValue.


of the MD model can be implemented by the security labels of OLS, and the security constraints can be implemented by the labeling functions and predicates of OLS. For instance, we could consider the table Admission with the columns CodeAdmission, Type, Cost, CodeDiagnosis and PatientSSN. This table will have a special column to store the security label of each instance. For each instance, this label will contain the security information that has been specified in Fig. 4 (Security Level = Secret..TopSecret; SecurityRoles = Health, Admin). But this security information depends on several security constraints that can be implemented by labeling functions. Table 5 (1) shows an example in which we implement the security constraints labeled with number 2 in Fig. 4: if the value of the Cost column is greater than 10000, then the security label will be composed of the TopSecret security level and the Health and Admin user roles; otherwise, the security label will be composed of the Secret security level and the same user roles. Table 5 (2) shows how to link this labeling function with the Admission table.
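Since Table 5 itself is not reproduced in this text, the following is only a hedged sketch of what such an OLS labeling function and its attachment to the table could look like, written to the best of our knowledge of the OLS PL/SQL interface (TO_LBAC_DATA_LABEL, SA_POLICY_ADMIN.APPLY_TABLE_POLICY). The policy name ADMISSION_POLICY, the schema DW, the function name admission_label, the label strings, and the encoding of the Health and Admin roles as OLS groups are all our own assumptions, not taken from the paper; treat the exact calls as indicative.

-- Table 5 (1), sketched: labeling function that assigns a label based on Cost
CREATE OR REPLACE FUNCTION admission_label (cost NUMBER)
  RETURN LBACSYS.LBAC_LABEL
AS
BEGIN
  IF cost > 10000 THEN
    -- TopSecret level; Health and Admin roles modeled here as OLS groups (assumption)
    RETURN TO_LBAC_DATA_LABEL('ADMISSION_POLICY', 'TS::HEALTH,ADMIN');
  ELSE
    RETURN TO_LBAC_DATA_LABEL('ADMISSION_POLICY', 'S::HEALTH,ADMIN');
  END IF;
END;
/

-- Table 5 (2), sketched: attach the policy and the labeling function to the Admission table
BEGIN
  SA_POLICY_ADMIN.APPLY_TABLE_POLICY(
    policy_name    => 'ADMISSION_POLICY',
    schema_name    => 'DW',
    table_name     => 'ADMISSION',
    table_options  => 'READ_CONTROL',
    label_function => 'dw.admission_label(:new.cost)',
    predicate      => NULL);
END;
/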

6 Conclusions and Future Work

In this paper, we have presented an extension of the UML that allows us to represent the main security aspects in the conceptual modeling of Data Warehouses. This extension contains the stereotypes, tagged values and constraints needed for a complete and powerful secure MD modeling. These new elements allow us to specify security aspects, such as security levels on data, compartments and user roles, on the main elements of MD modeling such as facts, dimensions and classification hierarchies. We have used the OCL to specify the constraints attached to these newly defined elements, thereby avoiding an arbitrary use of them. We have also sketched how to implement a secure MD model designed with our approach in a commercial DBMS. The main advantage of this approach is that it uses the UML, a widely accepted object-oriented modeling language, which saves developers from learning a new model and its corresponding notations for specific MD modeling. Furthermore, the UML allows us to represent some MD properties that are hardly considered by other conceptual MD proposals. Our immediate future work is to extend the implementation issues presented in this paper to allow us to use the considered security aspects when querying an MD model from OLAP tools. Moreover, we also plan to extend the set of privileges considered in this paper to allow us to specify security aspects in the ETL processes for DWs.

Acknowledgements

This research is part of the CALIPO and RETISTIC projects, supported by the Dirección General de Investigación of the Ministerio de Ciencia y Tecnología.


References
1. Abelló, A., Samos, J., and Saltor, F., A Framework for the Classification and Description of Multidimensional Data Models. 12th International Conference on Database and Expert Systems Applications. LNCS 2113, 2001: pp. 668-677.
2. Chung, L., Nixon, B., Yu, E., and Mylopoulos, J., Non-functional requirements in software engineering. 2000, Boston/Dordrecht/London: Kluwer Academic Publishers.
3. Cota, S., For Certain Eyes Only. DB2 Magazine, 2004. 9(1): pp. 40-45.
4. Devanbu, P. and Stubblebine, S., Software engineering for security: a roadmap, in The Future of Software Engineering, Finkelstein, A., Editor. 2000, ACM Press. pp. 227-239.
5. Fernández-Medina, E. and Piattini, M., Designing Secure Database for OLS, in Database and Expert Systems Applications: 14th International Conference (DEXA 2003), Marik, V., Retschitzegger, W., and Stepankova, O., Editors. 2003, Springer. LNCS 2736: Prague, Czech Republic. pp. 886-895.
6. Ferrari, E. and Thuraisingham, B., Secure Database Systems, in Advanced Databases: Technology Design, Piattini, M. and Díaz, O., Editors. 2000, Artech House: London.
7. Gogolla, M. and Henderson-Sellers, B., Analysis of UML Stereotypes within the UML Metamodel, in UML'02. Springer, LNCS 2460. pp. 84-99. Dresden, Germany.
8. Hall, A. and Chapman, R., Correctness by Construction: Developing a Commercial Secure System. IEEE Software, 2002. 19(1): pp. 18-25.
9. Jürjens, J., UMLsec: Extending UML for secure systems development, in UML'02. Springer, LNCS 2460: Dresden, Germany. pp. 412-425.
10. Kimball, R., The data warehousing toolkit. 2nd edn. 1996: John Wiley.
11. Levinger, J., Oracle Label Security. Administrator's Guide. Release 2 (9.2). 2002: http://www.csis.gvsu.edu/GeneralInfo/Oracle/network.920/a96578.pdf.
12. Luján-Mora, S., Trujillo, J., and Song, I.Y., Extending the UML for Multidimensional Modeling, in UML'02. Springer, LNCS 2460. pp. 290-304. Dresden, Germany.
13. OMG, Object Management Group: Unified Modeling Language Specification 1.5. 2004.
14. Piattini, M. and Fernández-Medina, E., Specification of Security Constraint in UML, in 35th Annual 2001 IEEE Intl. Carnahan Conf. on Security Technology. London. pp. 163-171.
15. Priebe, T. and Pernul, G., Towards OLAP Security Design - Survey and Research Issues, in 3rd ACM International Workshop on Data Warehousing and OLAP (DOLAP'00). Washington DC, USA. pp. 33-40.
16. Rosenthal, A. and Sciore, E., View Security as the Basis for Data Warehouse Security, in 2nd International Workshop on Design and Management of Data Warehouses (DMDW'00). Sweden. pp. 8.1-8.8.
17. Samarati, P. and De Capitani di Vimercati, S., Access control: Policies, models, and mechanisms, in Foundations of Security Analysis and Design, Focardi, R. and Gorrieri, R., Editors. 2000, Springer: Bertinoro, Italy. pp. 137-196.
18. Trujillo, J., Palomar, M., Gómez, J., and Song, I.Y., Designing Data Warehouses with OO Conceptual Models. IEEE Computer, special issue on DWs, 2001(34): pp. 66-75.
19. Warmer, J. and Kleppe, A., The Object Constraint Language, Second Edition. Getting Your Models Ready for MDA. 2003: Addison Wesley.

Data Integration with Preferences Among Sources*

Gianluigi Greco¹ and Domenico Lembo²

¹ Dip. di Matematica, Università della Calabria, Italy
² Dip. di Informatica e Sistemistica, Università di Roma "La Sapienza", Italy

* This work has been partially supported by the European Commission FET Programme Project IST-2002-33570 INFOMIX.

Abstract. Data integration systems are today a key technological infrastructure for managing the enormous amount of information that is increasingly distributed over many data sources, often stored in heterogeneous formats. Several approaches providing transparent access to the data by means of suitable query answering strategies have been proposed in the literature. These approaches often assume that all the sources have the same level of reliability and that there is no reason to prefer values "extracted" from a given source. This is mainly due to the difficulty of properly translating and reformulating source preferences in terms of properties expressed over the global view supplied by the data integration system. Nonetheless, preferences are very important auxiliary information that can be profitably exploited for refining the way in which integration is carried out. In this paper we tackle the above difficulties and propose a formal framework for both specifying and reasoning with preferences among the sources. The semantics of the system is restated in terms of preferred answers to user queries, and the computational complexity of identifying these answers is investigated as well.

1 Introduction

The enormous amount of information that is increasingly distributed over many data sources, often stored in heterogeneous formats, has boosted in recent years the interest in data integration systems. Roughly speaking, a data integration system offers transparent access to the data by providing users with the so-called global schema, which they can query in order to extract data relevant for their aims. The system is then in charge of accessing each source separately and combining the local results into the global answer. The means that the system exploits to answer users' queries is the mapping, which specifies the relationship between the sources and the global schema [16]. However, data at the sources may be mutually inconsistent, because of the presence of integrity constraints specified on the global schema in order to


enhance its expressiveness. To remedy this problem, several papers (see, e.g., [3, 6,4,11]) proposed to handle the inconsistency by suitably “repairing” retrieved data. Basically, such papers extend to data integration systems previous studies focused on a single inconsistent database or on the merging of mutually inconsistent databases in a single consistent theory [2,14,17]. Intuitively, one aspect deserving particular care, which characterizes the inconsistency problem in data integration with respect to the latter works, is the presence of the mapping relating data stored at the sources with the elements of the (virtual) global schema, over which constraints of interest for the integration application are issued. Here, the suitability of a possible repair depends on the underlying semantic assumptions which are adopted for the mapping and on the type of constraints on the global schema. Roughly speaking, the assumptions for the mapping provide the means for interpreting data at the sources with respect to the intended extension of the global schema. In this respect, mappings are in general considered sound, i.e., data that it is possible to retrieve from the sources through the mapping are assumed to be a subset of the intended data of the corresponding global elements [16]. This is for example the mapping interpretation adopted in [3,6,4], where soundness is exploited for constructing those database extensions for the global schema that are enforced by the data stored at the sources and the mapping. Since obtained global databases may result inconsistent with respect to global constraints, suitable repairs (basically deletions and additions of tuples) are performed to restore consistency. None of the above mentioned works takes into account preference criteria when trying to solve inconsistencies among data sources. We could say that they implicitly assume that all the sources have the same level of reliability, and that there is no reason for preferring values coming from a source with respect to data retrieved from another source. On the other hand, in practical applications it often happens that some sources are known to be more reliable than others, thus determining some potentially useful criteria exploitable to establish the suitability of a repair. In other words, besides the semantic assumption on the mapping, also preference criteria expressed among sources should be taken into account when solving inconsistency. Despite the wide interest in this field, few efforts have been paid for enriching the data integration setting with qualitative or quantitative descriptions of the sources. The first (and almost isolated) attempt is in [18], where the authors introduce two parameters for characterizing each source: the soundness, which is used for assessing the confidence we can place in the answers provided by the source, and the completeness, which is used for measuring how many relevant information is stored in the source. However, the framework proposed does not fit the requirements of typical data integration systems, since it does not admit constraints over the global schema, and since it is only focused on the consistency problem, i.e., determining whether a global database exists that is consistent with all the claims of soundness and completeness of individual sources. Other works (see, e.g., [14,10,15]) deal instead with special cases, where preferences are defined among repairs of a single database, and, hence, they do


not capture the many facets of the data integration setting. In other words, such approaches do not tackle inconsistency in the presence of a mapping between the database schema, that has to be maintained consistent, and information sources that provide possibly inconsistent data. This is instead the challenging setting when tackling inconsistency in data integration in the presence of source preferences, which calls for suitable translations and reformulations, in which preferences between sources are mapped into preferences between repairs. In this paper, we face this problem by proposing a formal framework for both specifying and reasoning on preferences among sources. Specifically, the main contributions of this paper are the following. We introduce a new semantics which is based on the repair of data stored at the sources in the presence of global inconsistency, rather than considering the repair of global database instances constructed according to the mapping. This approach is essentially a form of abductive reasoning [19], since it directly resolves the conflicts by isolating their causes at the sources. This part is described in Section 3. We show that our novel repair semantics allow us to properly take into account source preferences. Following the extensive literature (see, e.g., [8,7] and the references therein) from database community, prioritized logics, logic programming, and decision theory, we exploit two different approaches for specifying preferences among sources. Specifically, we consider unary and binary constraints for defining quantitative properties and relationships between sources, respectively. We show how preferences expressed over the sources can be exploited for refining the way of answering queries in data integration systems. To this aim, we introduce the concept of strongly preferred answers, characterizing the answers that can be obtained after the system is repaired according to users’ preferences. Actually, we also investigate a weaker semantics that looks for weakly preferred answers, i.e., answers that are as close as possible to any strong preferred one. This part and the above one are described in Section 4. Finally, the computational complexity of computing both strongly and weakly preferred answers is studied, by considering the most common integrity constraints that can be issued on relational databases. We show that computing strongly preferred answers is co-NP-compete, and hence it is as difficult as computing answers without additional constraints [5]. However, while turning to the weak semantics, we evidence a small increase in complexity that does not lift the problem to higher levels of the polynomial hierarchy. Indeed, the problem is complete for the class Computational complexity is treated in Section 5.

2 Relational Databases

In this section we recall the basic notions of the relational model with integrity constraints. For further background on relational database theory, we refer the reader to [1].


We assume a (possibly infinite) fixed database domain whose elements can be referenced by constants under the unique name assumption, i.e. different constants denote different objects. A relational schema is a pair where: is a set of relation symbols, each with an associated arity that indicates the number of its attributes, and is a set of integrity constraints, i.e., assertions that have to be satisfied by each database instance. We deal with quantified constraints [1], i.e., first order formulas of the form:

where and are positive literals, are built-in literals, is a list of distinct variables, is a list of variables occurring in only. Notice that classical constraints issued on a relational schema, as functional, exclusion, or inclusion dependencies, can be expressed in the form 1. Furthermore, they are also typical of conceptual modelling languages. A database instance (or simply database) for a schema is a set of facts of the form where is a relation of arity in and is an of constants of We denote as the set A database for a schema is said to be consistent with if it satisfies (in the first order logic sense) all constraints expressed on A relational query (or simply query) over is a formula that is intended to extract tuples of elements of We assume that queries over are Union of Conjunctive Queries (UCQs), i.e., formulas of the form where, for each is a conjunction of atoms whose predicate symbols are in and involve and where is the arity of the query, and each and each is either a variable or a constant of is called the head of Given a database for the answer to a UCQ over denoted is the set of of constants such that, when substituting each with the formula evaluates to true in
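Since the formulas of this section did not survive in this copy, the sketch below shows operationally what evaluating a union of conjunctive queries over a set of ground facts means; the relation names, the "?"-prefixed variable convention and all function names are illustrative assumptions, not notation from the paper.

    # Minimal sketch of UCQ evaluation over a set of ground facts.
    # A conjunctive query is a list of atoms (relation, terms); terms starting
    # with "?" are variables, everything else is a constant.
    from itertools import product

    def is_var(t):
        return isinstance(t, str) and t.startswith("?")

    def eval_cq(head_vars, body, facts):
        """Answers of one conjunctive query: naive evaluation over all fact combinations."""
        answers = set()
        for combo in product(facts, repeat=len(body)):
            subst, ok = {}, True
            for (rel, terms), (frel, fargs) in zip(body, combo):
                if frel != rel or len(fargs) != len(terms):
                    ok = False; break
                for t, c in zip(terms, fargs):
                    if is_var(t):
                        if subst.setdefault(t, c) != c:
                            ok = False; break
                    elif t != c:
                        ok = False; break
                if not ok:
                    break
            if ok:
                answers.add(tuple(subst[v] for v in head_vars))
        return answers

    def eval_ucq(head_vars, disjuncts, facts):
        """A UCQ is answered by taking the union of its conjunctive queries."""
        return set().union(*(eval_cq(head_vars, cq, facts) for cq in disjuncts))

    facts = {("employee", ("Ann", "D1")), ("boss", ("Ann", "Mary"))}
    # q(?x) :- employee(?x, ?d), boss(?x, ?m)   -- employees that have a boss
    q = [[("employee", ("?x", "?d")), ("boss", ("?x", "?m"))]]
    print(eval_ucq(["?x"], q, facts))   # {('Ann',)}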

3 Data Integration Systems

Framework. A data integration system is a triple where is the global (relational) schema of the form is the source (relational) schema of the form i.e., there are no integrity constraints on the sources, and is the mapping between and We assume that the mapping is specified in the global-as-view (GAV) approach [16], where every relation of the global schema is associated with a view, i.e., a query, over the source schema. Therefore, is a set of UCQs expressed over where the predicate symbol in the head is a relation symbol of Example 1 Consider the data integration system where the global schema consists of the relation predicates employee(Name, Dept) and


boss(Employee, Manager). The associated set of constraints contains the two following assertions (quantifiers are omitted)

stating that managers are never employees. The source schema comprises the relation symbols and the mapping contains the following UCQs
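The view definitions announced here did not survive in this copy. Purely as an illustration of the shape GAV views take, the sketch below assumes two source relations s1(Name, Dept) and s2(Emp, Mgr) and defines each global relation as a view over them; these source relations, their arities and the function name are our assumptions, not the paper's.

    # Hypothetical GAV mapping for Example 1: every global relation is defined by a
    # view (here a Python comprehension) over assumed source relations s1 and s2.
    def retrieved_global_db(source_db):
        # employee(N, D) :- s1(N, D)
        employee = {("employee", t) for t in source_db.get("s1", set())}
        # boss(E, M) :- s2(E, M)
        boss = {("boss", t) for t in source_db.get("s2", set())}
        return employee | boss   # ret(I, D): union of all view evaluations

    D = {"s1": {("Mary", "D1"), ("John", "D1")}, "s2": {("John", "Mary")}}
    print(retrieved_global_db(D))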

We call any database for the source schema a source database for Based on we specify the semantics of which is given in terms of database instances for called global databases for In particular, we construct a global database by evaluating each view in the mapping over Such a database is called retrieved global database, and is denoted by Example 1 (contd.) Let be a source database. Then, the evaluation of each view in the mapping over is In general, the retrieved global database is not the only database that we consider to specify the semantics of w.r.t. but we account for all global databases that contain This means considering sound mappings: data retrieved from the sources by the mapping views are assumed to be a subset of the data that satisfy the corresponding global relation. This is a classical assumption in data integration, where sources in general do not provide all the intended extensions of the global schema, hence extracted data are to be considered sound but not necessarily complete. Next, we formalize the notion of mapping satisfaction. Definition 1. Given a data integration system and a source database for a global database for satisfies the mapping w.r.t. if Notice that databases that satisfy the mapping might be inconsistent with respect to dependencies in since data stored in local and autonomous sources are not in general required to satisfy constraints expressed on the global schema. Furthermore, cases might arise in which no global database exists that satisfies both mapping and constraints over (for example when a key dependency on is violated by data retrieved from the sources). On the other hand, constraints issued over the global schema must be satisfied by those global databases that we want to consider “legal” for the system [16]. Repairing global databases. In order to solve inconsistency, several approaches have been recently proposed, in which the semantics of a data integration system is given in terms of the repairs of the global databases that the mapping forces to be in the semantic of the system [3,5,4]. Such papers extend to data integration previous proposals given in the field of inconsistent


databases [12,2,14], by considering a sound interpretation of the mapping. In this context, repairs are obtained by means of addition and deletion of tuples over the inconsistent database. Modifications are performed according to minimality criteria that are specific for each approach. Analogously, works on inconsistency in data integration basically propose to properly repair the global databases that satisfy the mapping in order to make them satisfy constraints on the global schema. In this respect, we point out that [3,4] consider local-as-view (LAV) mappings, where, conversely to the GAV approach, each source relation is associated with a query over the global schema. In such papers, the notion of retrieved global database is replaced with the notion of minimal global databases that can be constructed according to the mapping specification and data stored at the sources. Then, a global database satisfies the mapping if it contains at least one minimal global database. Repairs computed in [3,5,4] are in general global databases that do not satisfy the mapping. Furthermore, they cannot always be retrieved through the mapping from a source database instance. According to [16], we could say that in these approaches, constraints are considered strong, whereas the mapping is considered soft. Example 2 Consider the simple situation in which the global schema of a data integration system contains two relation symbols and both of arity 1, that are mutually disjoint, i.e., the constraint is issued over Assume that the mapping comprises the queries and where is a unary source relation symbol. Let be the source database for Then, is inconsistent w.r.t. the global constraint. In this case, the above mentioned approaches propose to repair by eliminating from each database satisfying the mapping either or (but not both), thus producing in the two cases two different classes of global databases that are in the semantics of the system. Notice, however, that each global database that contains only or only does not satisfy the soundness of the mapping, and cannot be retrieved from any source database for Even if effective for repairing global database instances in the presence of inconsistency, the above approaches do not seem appropriate when preferences specified over the sources should be taken into account for solving inconsistency. Indeed, in these cases, one would prefer, for example, to drop tuples coming from less reliable source relations rather than considering all possible repairs to be at the same level of reliability. Nonetheless, it is not always easy to understand how preferences over tuples stored at the sources could be mapped on preferences over tuples of the global schema. Example 3 Consider for example the simple data integration system in which the mapping contains the query and a constraint stating that the first component of the global relation is the key of Assume to have the source database and to know that source relation is more reliable than source relation Then, violates the key constraint on and it seems


reasonable to prefer dropping the fact in order to guarantee consistency, rather than according to source preferences. However, we do not have a preference specified between this two global facts in such a way that we can adopt this choice. The above example shows that we should need some mechanism to infer preferences over tuples of the global schema starting from preferences at the sources. On the other hand, it is not always obvious or easy to define such a mechanism. A different solution could be to move the focus, when repairing, from tuples of the global schema to tuples of the sources, i.e., minimally modify the source database. In this way, we could compare two repairs (at the sources) on the basis of the preferences established over the source relations. Repairing the sources. The idea at the basis of our approach consists in finding the proper set of facts at the sources that imply as a consequence a global database that satisfy the integrity constraints. Basically, such a way of proceeding is a form of abductive reasoning [19]. Notice also that, according to this approach we consider “strong” both the mapping and the constraints, i.e., we take into account only global databases that satisfy both the mapping and the constraints on the global schema. Furthermore, each global database that we consider, can be computed by means of the mapping from a suitable source database. Let us now precisely characterize the ideas informally described above. Definition 2. Given a data integration system and a source database for is satisfiable w. r. t. global database for such that

where if there exists a

satisfies w.r.t. and satisfies the mapping We next introduce a partial order between source databases for which the system in satisfiable. Definition 3. Given a data integration system where Given two source databases for such that is satisfiable w.r.t. and Then, we say that if Furthermore, if and does not hold We say that a source database is minimal w.r.t. if there does not exist such that Furthermore, we indicate with the set of source databases that are minimal w.r.t. Example 1 (contd.) The retrieved global database violates the constraints on the global schema witnessed by the facts employee (Mary, D1) and boss (John, Mary) for which Mary is both an employee and a manager. ThereThen, where is not satisfiable w.r.t. fore, and We are now able to define the semantics of a data integration system.


Definition 4. Given a data integration system and given a source database for a global database w.r.t. if satisfies satisfies the mapping i.e., there exists

w.r.t. a minimal source database such that

where is legal for

w.r.t.

The set of all the legal databases is denoted by We point out that, under the standard cautious semantics, answering a query posed over the global schema amounts to evaluate it on every legal database Example 1 (contd.) The set contains all global databases that satisfy the global schema and that contain either or Then, the answer to the user query which asks for all employees that have a boss, is Summarizing, our approach consists in properly repairing the source database in order to obtain another source database such that is satisfiable w.r.t. Obviously, if is satisfiable w.r.t. we do not need to repair Before concluding, we point out that the set is in general different from the set of global databases that can be obtained by repairing the retrieved global database instead of repairing the source database This results evident in Example 2, in which repairing is performed by dropping at the sources, therefore legal databases exist that neither contain nor We conclude this section by considering the difficulty of checking whether a global database is indeed a repair. Such a difficulty will be evaluated by following the data complexity approach [20], i.e., by considering a given problem instance having as its input the source database — this is, in fact, the approach we shall follow in all the complexity results presented in the paper. Theorem 4 (Repair Checking). Let be a data integration systems with a source database for and a global database for Then, checking whether is legal is feasible in polynomial time.
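To make the repair-at-the-sources semantics concrete, the sketch below enumerates deletion-only candidate source databases and keeps the subset-maximal ones whose retrieved global database satisfies the global constraint of Example 1 (managers are never employees). The mapping, the source relations s1 and s2 and all function names are illustrative assumptions, not the paper's formal definitions.

    # Sketch: repairing at the sources. Candidate repairs delete source facts only;
    # we keep the subset-maximal candidates whose retrieved global database is
    # consistent, i.e. those obtained with a minimal set of deletions.
    from itertools import combinations

    def retrieve(source_db):
        # assumed GAV mapping: employee(N, D) :- s1(N, D);  boss(E, M) :- s2(E, M)
        return ({("employee", t) for t in source_db.get("s1", set())} |
                {("boss", t) for t in source_db.get("s2", set())})

    def consistent(global_db):
        # global constraint of Example 1: managers are never employees
        employees = {t[0] for rel, t in global_db if rel == "employee"}
        managers = {t[1] for rel, t in global_db if rel == "boss"}
        return not (employees & managers)

    def source_repairs(D):
        facts = sorted((rel, t) for rel, ts in D.items() for t in ts)
        def as_db(kept):
            db = {}
            for rel, t in kept:
                db.setdefault(rel, set()).add(t)
            return db
        good = [set(kept) for k in range(len(facts) + 1)
                for kept in combinations(facts, k)
                if consistent(retrieve(as_db(kept)))]
        # keep only the subset-maximal consistent source databases
        return [s for s in good if not any(s < t for t in good)]

    D = {"s1": {("Mary", "D1"), ("John", "D1")}, "s2": {("John", "Mary")}}
    for r in source_repairs(D):
        print(sorted(r))
    # one repair drops s1(Mary, D1), the other drops s2(John, Mary)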

4 Formalizing Additional Properties of the Sources

In many real world applications, users often have some additional knowledge about data sources besides the mapping with the global schema, which can be modelled in terms of preference constraints specified over source relations. In this scenario, the aim is to exploit preference constraints for query answering in the presence of inconsistency. The framework we have introduced in Section 3 allows us to easily take into account information on such preferences when trying to solve inconsistency, since


repairing is performed by directly focusing on the sources, whose integration has caused inconsistency. Intuitively, when a data integration system is equipped with some additional preference constraints, we can easily exploit these further requirements for identifying, given a source database those elements of which are preferred for answering a certain query. In this respect, we distinguish between unary constraints, i.e., properties which characterize a given data source, and binary constraints, i.e., properties which are expressed over pairs of relations in the source schema

4.1 Unary and Binary Constraints

As already evidenced in [18], in order to provide accurate query answering, each relation can be equipped with two parameters: the soundness measure and the completeness measure. The former is used for assessing the confidence that we place in the answers provided by whereas the latter is used for evaluating how much relevant information is contained in In [18], the problem of querying partially sound and complete data sources has been studied in the context of data integration systems with LAV mapping and without integrity constraints on the global schema. In such a setting, it has been shown that deciding the existence of a global database satisfying some assumptions is NP-complete. Here, we extend such analysis for sound GAV mappings, in our repair semantics for data integration systems. In this framework, we observe that the completeness measure is of no practical interest, since each is such that Therefore, constraints that can be satisfied by adding tuples to can be seen as “automatically repaired”. Indeed, in our repair semantics we do consider addition of tuples at the sources in order to repair constraints on the global schema. Therefore, we are actually interested in bounding only the number of tuple deletions required at the sources in order to repair the system. Then, for each source relation we denote by the value of such bound, also called soundness constraint, whose semantics is as follows. Definition 5. Let be a data integration system, a source database for a relation symbol in and a soundness constraint for Then, a source database satisfies if

Even though in several situations the soundness measure is not directly available for characterizing a source relation in an absolute way, the user might be able to compare the soundness of two different sources. For instance, he might not know the soundness constraint for source relations and but he might have observed that is more reliable than Such intuition is formalized by the notion of binary constraints. Let and be two relation symbols of and let A denote a set of pairs such that with and attributes of and


respectively. Any expression of this form is a binary constraint over the source schema, and its semantics is as follows.

Definition 6. Let be a data integration system and a source database for Then, a source database satisfies a binary constraint of the form with for if

where the projection notation indicates the projection of a tuple on the attributes in A.

Roughly speaking, satisfies if for each tuple that has been deleted from in order to obtain a tuple sharing the same values on the attributes in A has been deleted from This behavior guarantees that is modified only if there is no way for repairing the data integration system by modifying only. Example 1 (contd.) Assume now to specify the binary constraint over the source schema where $2 indicates the second attribute of and $1 the first attribute of Then, violates the constraint, since it is obtained by dropping from the fact whose second component coincides with the first component of which conversely has not been dropped. On the contrary, it is easy to see that satisfies the constraint.
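As an illustration of how such preference constraints can be checked against a repair, the sketch below tests a soundness bound on the number of deletions from one source relation and a binary constraint stating that a tuple may be dropped from the more reliable relation only if a matching tuple was dropped from the less reliable one; the relation names, the attribute pairing and the direction of the comparison are assumptions made for the example, not the paper's exact definitions.

    # Sketch: checking a unary (soundness) bound and a binary preference constraint
    # on a deletion-only repair. D is the original source database, R the repaired one.

    def deleted(D, R, rel):
        return D.get(rel, set()) - R.get(rel, set())

    def satisfies_soundness(D, R, rel, bound):
        """Unary constraint: at most `bound` tuples may be deleted from `rel`."""
        return len(deleted(D, R, rel)) <= bound

    def satisfies_binary(D, R, rel1, rel2, attr_pairs):
        """Binary constraint: a tuple may be dropped from rel2 (assumed more reliable)
        only if a tuple of rel1 agreeing with it on the paired attribute positions
        has been dropped as well."""
        dropped1, dropped2 = deleted(D, R, rel1), deleted(D, R, rel2)
        for t2 in dropped2:
            if not any(all(t1[i] == t2[j] for i, j in attr_pairs) for t1 in dropped1):
                return False
        return True

    D = {"s1": {("Mary", "D1"), ("John", "D1")}, "s2": {("John", "Mary")}}
    R1 = {"s1": {("John", "D1")}, "s2": D["s2"]}        # drops s1(Mary, D1)
    R2 = {"s1": D["s1"], "s2": set()}                   # drops s2(John, Mary) only
    print(satisfies_soundness(D, R1, "s1", 1))          # True
    print(satisfies_binary(D, R1, "s1", "s2", [(0, 1)]))  # True: nothing dropped from s2
    print(satisfies_binary(D, R2, "s1", "s2", [(0, 1)]))  # False: s2 tuple dropped alone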

4.2 Soft Constraints

As defined in the section above, unary and binary constraints often impose very severe restrictions on the possible ways repairs can be carried out. For instance, it might even happen that no minimal source database exists that satisfies such constraints, thereby leading to a data integration system with an empty semantics. In order to face such situations, whenever it is necessary we can also turn to “weak” semantics that looks for repairs as close as possible to the preferred ones. In this respect, preference constraints are interpreted in a soft version, and we aim at minimizing the number of violations, rather than imposing the absence of such violations. Definition 7 (Satisfaction Factor). Let be a data integration system, a source database for and be a source database in Then, the satisfaction factor for a constraint is if is of the form the number of tuples

and

does not satisfies such that if has the form

Finally, the satisfaction factor of a set of constraints

or with with is the value


4.3 Preferred Answers

After unary and binary constraints have been formalized, we next show how they can be practically used for pruning the set of legal databases of a data integration system. Specifically, we first focus on the definition of preferred legal databases, and we next show how this notion can be exploited for defining preferred answers. In the following, given a data integration system we denote by the set of preference constraints defined over Then, the pair is also said to be a constrained data integration system. The semantics of the system is provided in terms of those legal databases that are “retrieved” from source databases of that satisfy Definition 8. Let be a constrained data integration system with and and let be a source database for Then, a global database is a (weakly) preferred legal database for w.r.t. if satisfies where is a minimal source database w.r.t. that no minimal source database exists with If

then

is a strongly preferred legal database for

such w.r.t.

We next provide the notion of answers to a query posed to a constrained data integration system. Definition 9. Given a constrained data integration system with a source database for and a query the set of the weakly preferred answers to denoted

of arity over is

for each weakly preferred legal database The set of the strongly preferred answers to

denoted

is

for each strongly preferred legal database Example 1 (contd.) Consider again the constraint We have already observed that only satisfies such requirement. Then, the set of strongly preferred databases is Therefore, for the query We conclude this section by observing that the constraints we have defined can be evaluated in polynomial time on a given global database. However, they suffice for blowing up the intrinsic difficulty of identifying (preferred) global databases. Theorem 5 (Preferred Repair Checking). Let and let be a source database for Then, checking whether a global database is (strongly) preferred for w.r.t. is NP-hard. Proof (Sketch). NP-hardness can be proven by a reduction of three colorability problem to our problem. Indeed, given a graph G we can build a data integration system a source database and a legal database for such that is preferred G is 3-colorable.
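The distinction between strongly and weakly preferred repairs can be phrased operationally as follows: among the minimal repairs, keep those with the lowest total number of constraint violations, and call them strongly preferred only when that number is zero. The sketch below is an illustration of that reading, with each preference constraint modelled as a violation-counting function; it is not a transcription of Definitions 7-9.

    # Sketch: selecting preferred repairs. Each preference constraint is a function
    # returning its number of violations on a given repair; 0 means fully satisfied.

    def satisfaction_factor(repair, constraints):
        return sum(c(repair) for c in constraints)

    def preferred_repairs(minimal_repairs, constraints):
        """Return (strongly_preferred, weakly_preferred) among the minimal repairs."""
        scored = [(satisfaction_factor(r, constraints), r) for r in minimal_repairs]
        best = min(score for score, _ in scored)
        weakly = [r for score, r in scored if score == best]
        strongly = weakly if best == 0 else []
        return strongly, weakly

    # Toy usage: two repairs, one of which violates the single constraint once.
    r1, r2 = {"a"}, {"b"}
    constraints = [lambda r: 1 if "b" in r else 0]
    print(preferred_repairs([r1, r2], constraints))   # ([{'a'}], [{'a'}])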


5 Complexity Results

We next study the computational complexity of query answering in a constrained data integration system under the novel semantics proposed in the paper. Our aim is to point out the intrinsic difficulty of dealing with constraints at the sources. Specifically, given a source database for we shall face the following problems: StronglyAnswering: given a UCQ of arity over and an of constants of is WeaklyAnswering: given a UCQ of arity over and an of constants of is where are constant of the domain which occur also in tuples of We shall consider the (common) case in which is such that contains only key dependencies (KDs), functional dependencies (FDs) and exclusion dependencies (EDs). We recall that these are classical constraints issued on a relational schema, and that they can be expressed in the form (1) introduced in Section 2. We also point out that violations of constraints of such form, e.g., two global tuples violating a key constraint, lead always to inconsistency in our framework, since they can be repaired only by means of tuple deletions from the source database. We are now ready to provide the first result of this section. Theorem 6. Let be a constrained data integration system with where in which contains only FDs and EDs, and let be a source database for Then, the StronglyAnswering problem is co-NP-complete. Hardness holds even if contains either only KDs or only EDs, and if is empty. Proof (Sketch). As for the membership, we consider the dual problem of deciding whether and we show that it is feasible in NP. In fact, we can guess a source database obtained by removing tuples of only. Then, we can show how to verify that is minimal w.r.t. in polynomial time. Hardness for the general case can be derived from the results reported in [5] and in [9] (where the problem of query answering under different semantics in the presence of KDs is studied). Hardness for EDs only can be proven in an analogous way by a reduction from the three colorability problem to the complement of our problem. The above result suggests that adding constraints to the sources, enriches the representation features of a data integration systems, and it is well-behaved from a computational viewpoint. In fact, selecting preferred answers is as difficult as selecting answers without additional preference constraints, whose complexity has been widely studied in [5]. We next turn to the WeaklyAnswering problem, in which a weaker semantics is considered. Intuitively, this scenario provides an additional source of complexity, since finding weakly preferred global databases amount at solving an implicit (NP) optimization problem. Interestingly, the increase in complexity is rather small and does not lift the problem to higher levels of the polynomial hierarchy.


Actually, the problem stays within the polynomial time closure of NP, i.e., More precisely, it is complete for the class in which the NP oracle access is limited to queries, where is the size of the source database in input. Theorem 7. Let with where and let be a source database for

be a constrained data integration system in which contains only FDs and EDs, Then, the WeaklyAnswering problem is

Proof (Sketch). For the membership, we can preliminary compute the maximum value, say max, that the satisfaction factor for any source database may assume. Then, by a binary search in [0, max], we can compute the best satisfaction factor, say at each step of this search, we are given a threshold and we call an NP oracle to know whether there exists a source database such that Finally, we ask an other NP oracle for checking whether there exists a source database with satisfaction factor such that does not belong to for the minimal satisfying the constraints in Hardness can be proved by a reduction from the following problem: Given a formula in conjunctive normal form on the variables a subset and a variable decide whether is true in all the models, where a model M (satisfying assignment) is if it has the largest i.e., if the number of variables in the set that are true w.r.t. M is the maximum over all the satisfying assignments.
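The membership argument above amounts to a logarithmic number of calls to an NP oracle: binary-search the best achievable satisfaction factor, then make one further oracle call for the answer itself. The sketch below shows only the search skeleton; the oracle is a stub to be instantiated, and max_factor plays the role of the value max computed in the proof.

    # Sketch of the membership argument: find the smallest satisfaction factor with
    # O(log max) calls to an NP oracle that answers "is there a minimal repair with
    # satisfaction factor <= threshold?".

    def best_satisfaction_factor(max_factor, oracle):
        lo, hi = 0, max_factor
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle(mid):        # NP oracle call
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Stub oracle for illustration: pretend the best achievable factor is 3.
    print(best_satisfaction_factor(10, lambda threshold: threshold >= 3))   # 3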

6 Conclusions

In this paper we have introduced and formalized the problem of enriching data integration systems with preferences among sources. Our approach is based on a novel semantics which relies on repairing the data stored at the sources in the presence of global inconsistency. Repairs performed at the sources allow us to properly take into account preferences expressed over the sources when trying to solve inconsistency. Exploiting the presence of preference constraints, we have introduced the notion of (strongly and weakly) preferred answers. Finally, we have studied the computational complexity of computing both strongly and weakly preferred answers for classes of key, functional and exclusion dependencies, which are relevant classes of constraints for relational databases as well as for conceptual modelling languages. The complexity results given in this paper can be easily extended to the presence of inclusion dependencies on the global schema in the cases in which the problem of query answering is decidable, which have been studied in [5]. To the best of our knowledge, the present work is the first one that provides formalizations and complexity results for the problem of dealing with inconsistencies by taking into account preferences specified among data sources in a pure GAV data integration framework. Only recently, the same problem has been studied for LAV integration systems in [13].


References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley Publ. Co., Reading, Massachussetts, 1995.
2. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In Proc. of PODS'99, pages 68-79, 1999.
3. L. Bertossi, J. Chomicki, A. Cortes, and C. Gutierrez. Consistent answers from integrated data sources. In Proc. of FQAS'02, pages 71-85, 2002.
4. L. Bravo and L. Bertossi. Logic programming for consistently querying data integration systems. In Proc. of IJCAI'03, pages 10-15, 2003.
5. A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In Proc. of PODS'03, pages 260-271, 2003.
6. A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In Proc. of IJCAI'03, pages 16-21, 2003.
7. J. Chomicki. Preference formulas in relational queries. Technical Report cs.DB/0207093, arXiv.org e-Print archive. ACM Trans. on Database Systems.
8. J. Chomicki. Querying with intrinsic preferences. In Proc. of EDBT'02, pages 34-51, 2002.
9. J. Chomicki and J. Marcinkowski. On the computational complexity of consistent query answers. Technical Report cs.DB/0204010 v1, arXiv.org e-Print archive, Apr. 2002. Available at http://arxiv.org/abs/cs/0204010.
10. P. Dell'Acqua, L. M. Pereira, and A. Vitória. User preference information in query answering. In Proc. of FQAS'02, pages 163-173, 2002.
11. T. Eiter, M. Fink, G. Greco, and D. Lembo. Efficient evaluation of logic programs for querying data integration systems. In Proc. of ICLP'03, volume 2237 of Lecture Notes in Artificial Intelligence, pages 348-364. Springer, 2003.
12. R. Fagin, J. D. Ullman, and M. Y. Vardi. On the semantics of updates in databases. In Proc. of PODS'83, pages 352-365, 1983.
13. G. D. Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tackling inconsistencies in data integration through source preferences. In Proc. of the SIGMOD Int. Workshop on Information Quality in Information Systems, 2004.
14. G. Greco, S. Greco, and E. Zumpano. A logic programming approach to the integration, repairing and querying of inconsistent databases. In Proc. of ICLP'01, volume 2237 of Lecture Notes in Artificial Intelligence, pages 348-364. Springer, 2001.
15. S. Greco, C. Sirangelo, I. Trubitsyna, and E. Zumpano. Preferred repairs for inconsistent databases. In Proc. of IDEAS'03, pages 202-211, 2003.
16. M. Lenzerini. Data integration: A theoretical perspective. In Proc. of PODS'02, pages 233-246, 2002.
17. J. Lin and A. O. Mendelzon. Merging databases under constraints. Int. J. of Cooperative Information Systems, 7(1):55-76, 1998.
18. A. O. Mendelzon and G. A. Mihaila. Querying partially sound and complete data sources. In Proc. of PODS'01, pages 162-170, 2001.
19. C. S. Peirce. Abduction and induction. In Philosophical Writings of Peirce, pages 150-156, 1955.
20. M. Y. Vardi. The complexity of relational query languages. In Proc. of STOC'82, pages 137-146, 1982.

Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas

Qi He and Tok Wang Ling

School of Computing, National University of Singapore
{heqi,lingtw}@comp.nus.edu.sg

Abstract. In schema integration, schematic discrepancies occur when data in one database correspond to metadata in another. We define this kind of semantic heterogeneity in general using the paradigm of context that is the meta information relating to the source, classification, property etc of entities, relationships or attribute values in entity-relationship (ER) schemas. We present algorithms to resolve schematic discrepancies by transforming metadata into entities, keeping the information and constraints of original schemas. Although focusing on the resolution of schematic discrepancies, our technique works seamlessly with existing techniques resolving other semantic heterogeneities in schema integration.

1 Introduction

Schema integration involves merging several schemas into an integrated schema. More precisely, [4] defines schema integration as "the activity of integrating the schemas of existing or proposed databases into a global, unified schema". It is an important task in building a heterogeneous database system [6, 22] (also called a multidatabase system or federated database system), in integrating data in a data warehouse, and in integrating user views in database design. In schema integration, people have identified different kinds of semantic heterogeneities among component schemas: naming conflicts (homonyms and synonyms), key conflicts, structural conflicts [3, 15], and constraint conflicts [14, 21]. A less studied problem is schematic discrepancy, i.e., the same information is modeled as data in one database, but as metadata in another. This conflict arises frequently in practice [11, 19]. We adopt a semantic approach to solve this issue. One of the outstanding features of our proposal is that we preserve the cardinality constraints in the transformation/integration of ER schemas. Cardinality constraints, in particular functional dependencies (FDs) and multivalued dependencies (MVDs), are useful in verifying lossless schema transformation [10], schema normalization and semantic query optimization [9, 21] in multidatabase systems. The following example illustrates schematic discrepancy in ER schemas. To focus our contribution and simplify the presentation, in the example below, schematic discrepancy is the only kind of conflict among schemas. Example 1. Suppose we want to integrate supply information of products from several databases (Fig. 1). These databases record the same information, i.e., product numbers, product names, suppliers and supplying prices in each month, but have discrepant schemas. In DB1, suppliers and months are modeled as entity types. In DB2, months are modeled as meta-data of entity types, i.e., each entity type models


the products supplied in one month, and suppliers are modeled as meta-data of attributes, e.g., the attribute S1_PRICE records the supplying prices offered by supplier s1¹. In DB3, months are modeled as meta-data of relationship types, i.e., each relationship type models the supply relation in one month. We propose (in Section 4) to resolve the discrepancies by transforming the metadata into entities, i.e., transforming DB2 and DB3 into the form of DB1. The statements on the right side of Fig. 1 provide the semantics of the constructs of these schemas using an ontology, which will be explained in Section 3.

Fig. 1. Schematic discrepancy: months and suppliers modeled differently in DB1, DB2 and DB3

Paper organization. The rest of the paper is organized as follows. Section 2 is an introduction to the ER approach. Sections 3 and 4 present the main contributions of this paper. In Section 3, we first introduce the concepts of ontology and context, and the mappings from schema constructs of ER schemas onto types of the ontology. Then we define schematic discrepancy in general using the paradigm of context. In Section 4, we present algorithms to resolve schematic discrepancies in schema integration, without any loss of information or cardinality constraints. In Section 5, we compare our work with related work. Section 6 concludes the paper.

¹ Without causing confusion, we blur the difference between entities and identifiers of entities. E.g., we use supplier number s1 to refer to a supplier with identifier S# = s1, i.e., s1 plays both the roles of an attribute value of S# and of an entity of supplier.


2 ER Approach

In the ER model, an entity is an object in the real world that can be distinctly identified. An entity type is a collection of similar entities that have the same set of predefined common attributes. Attributes can be single-valued, i.e., 1:1 (one-to-one) or m:1 (many-to-one), or multivalued, i.e., 1:m (one-to-many) or m:m (many-to-many). A minimal set of attributes of an entity type E which uniquely identifies E is called a key of E. An entity type may have more than one key and we designate one of them as the identifier of the entity type. A relationship is an association among two or more entities. A relationship type is a collection of similar relationships that satisfy a set of predefined common attributes. A minimal set of attributes (including the identifiers of participating entity types) in a relationship type R that uniquely identifies R is called a key of R. A relationship type may have more than one key and we designate one of them as the identifier of the relationship type. The cardinality constraints of ER schemas incorporate FDs and MVDs. For example, given an ER schema below, let K1, K2 and K3 be the identifiers of E1, E2 and E3; we have: and as A1 is a 1:1 attribute of E1; as A2 is an m:1 attribute of E2; as A3 is an m:m attribute of E3; as the cardinality of E3 is 1 in R; and as B is an m:1 attribute of R.
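The dependency formulas in the sentence above did not survive extraction. Under the usual reading of the stated cardinalities (K1, K2, K3 the identifiers of E1, E2, E3, with R a relationship type over them), the intended dependencies are presumably the following; this reconstruction is ours, not a quotation of the paper.

\begin{align*}
& K_1 \rightarrow A_1, \quad A_1 \rightarrow K_1 && \text{($A_1$ is a 1:1 attribute of $E_1$)}\\
& K_2 \rightarrow A_2 && \text{($A_2$ is an m:1 attribute of $E_2$)}\\
& K_3 \twoheadrightarrow A_3 && \text{($A_3$ is an m:m attribute of $E_3$)}\\
& K_1 K_2 \rightarrow K_3 && \text{(the cardinality of $E_3$ in $R$ is 1)}\\
& K_1 K_2 K_3 \rightarrow B && \text{($B$ is an m:1 attribute of $R$)}
\end{align*}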

3 Ontology and Context In this section, we first represent the constructs of ER schemas using ontology, then define schematic discrepancy in general based on the schemas represented using ontology. In this paper, we treat ontology as the specification of a representational vocabulary for a shared domain of discourse which includes the definitions of types (representing classes, relations, and properties) and their values. We present ontology at a conceptual level, which could be implemented by an ontology language, e.g., OWL [20]. For example, suppose ontology SupOnto describes the concepts in the universe of product supply. It includes the following types: product, month, supplier, supply (i.e., the supply relations among products, months and suppliers), price (i.e. the supplying prices of products), p#, pname, s#, etc. It also includes the values of these types, e.g. jan, ..., dec for month, and s1, ..., sn for supplier. Note we use lower case italic words to represent types and values of ontology, in contrast to capitals for schema constructs of an ER schema. By use of OWL expression, product, month, supplier and supply would be declared as classes, p# and pname as properties of product, s# as a property of supplier, and price as a property of supply. Conceptual modeling is always done within a particular context. In particular, the context of an entity type, relationship type or attribute is the meta-information relating to its source, classification, property etc. Contexts are usually at four levels: database, object class, relationship type and attribute. An entity type may “inherit” a context from a database (i.e., the context of a database applies to the entities), and so on. In general, the inheritance hierarchy of contexts at different levels is:


We’ll give a formal representation of context below. Note as the context of a database would be handled in the object classes which inherit it, we will not care database level contexts any more in the rest of the paper. Definition 1. Given an ontology, we represent an entity type (relationship type, or attribute) E as: where T,

are types in the ontology, and each is a value of for respectively have a value of which are not explicitly given. This representation means that each instance of E is a value of T, and satisfies the conditions for each with the values constitute the context within which E is defined; we call them meta-attributes, and their values metadata of E. Furthermore, with the values are from the context at a higher level (i.e. the context of a database if E is an entity type, the contexts of entity types if E is a relationship type, or the context of an entity type/relationship type if E is an attribute). We call E inherits the meta-attributes with the values. If E inherits all the meta-attributes with values of the higher level context, we simply represent it as: For easy reference, we call the set the self context, and the inherited context of E. In the above representation of E, either self or inherited context could be empty. Specifically, when the context of E is empty, we have E = T. In the example below, we represent the entity types, relationship types and attributes in Fig. 1 using the ontology SupOnto. Example 2. In Fig. 1, using the ontology SupOnto, the entity type JAN_PROD of DB2 is represented as: That is, the context of JAN_PROD is month=‘jan’. This means that each entity of JAN_PROD is a product supplied in Jan. Also in DB2, given an attribute S1_PRICE of the entity type JAN_PROD, we represent it as: That is, the self context of S1_PRICE is supplier=‘s1’, and the inherited context (from the entity type) is month= ‘jan’. This means that each value of S1_PRICE of the entity type JAN_PROD is a price of a product supplied by supplier s1 in the month of Jan.
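A lightweight way to picture the elevated-schema representation used in Example 2 is to pair each schema construct with its ontology type, self context and inherited context. The encoding below is only an illustrative data structure built for this example, not part of the paper's formalism.

    # Illustrative encoding of elevated-schema constructs: an ontology type plus a
    # self context and an inherited context (meta-attribute -> metadata value).
    from dataclasses import dataclass, field

    @dataclass
    class ElevatedConstruct:
        name: str                      # schema construct in the ER schema
        ontology_type: str             # type in the ontology (e.g. SupOnto)
        self_context: dict = field(default_factory=dict)
        inherited_context: dict = field(default_factory=dict)

        def context(self):
            return {**self.inherited_context, **self.self_context}

    # Example 2, DB2: JAN_PROD = product[month = 'jan']
    jan_prod = ElevatedConstruct("JAN_PROD", "product", {"month": "jan"})

    # S1_PRICE = price[supplier = 's1'], inheriting month = 'jan' from JAN_PROD
    s1_price = ElevatedConstruct("S1_PRICE", "price",
                                 {"supplier": "s1"}, jan_prod.context())

    print(s1_price.context())   # {'month': 'jan', 'supplier': 's1'}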


In DB3, given a relationship type JAN_SUP, we represent it as: This means that each relationship of JAN_SUP is a supply relationship in the month of Jan. Also in DB3, given an attribute PRICE of the relationship type JAN_SUP, we represent it as: PRICE inherits the context month=‘jan’ from the relationship type. This means that each value of PRICE of the relationship type JAN_SUP is a supplying price in Jan. In contrast to original ER schemas, we call an ER schema whose schema constructs are represented using ontology symbols elevated schema, as the ER schemas with the statements given in Fig. 1. The mapping from an ER schema onto an elevated schema should be specified by users. Our work is based on elevated schemas. Now we can define schematic discrepancy in general as follows. Definition 2. Two elevated schemas are schematic discrepant, if metadata in one database correspond to attribute values or entities in the other. We call meta-attributes whose values correspond to attribute values or entities in other databases discrepant meta-attributes. For example, in Fig. 1, in DB2, month and supplier are discrepant meta-attributes as their values correspond to entities in DB1, so is the meta-attribute month in DB3. Before ending this section, we define the global identifier of a set of entity types. In general, two entity types (or relationship types) El and E2 are similar, if E1=T[Cnt1] and E2=T[Cnt2] with T an ontology type, and Cntl and Cnt2 two sets (possibly empty sets) of meta-attributes with values. Intuitively, a global identifier identifies the entities of similar entity types, independent of context. Definition 3. Given a set of similar entity types let K be an identifier of each entity type in We call K a global identifier of the entity types of provided that if two entities of the entity types of refer to the same real world object, then the values of K of the two entities are the same, and vice versa. For example, in Fig. 1, the PROD entity types of DB1 and DB3, and the entity types JAN_PROD, ..., DEC_PROD of DB2 are similar entity types, for they all correspond to the ontology type product without or with a context. Suppose P# is a global identifier of these entity types, i.e., P# uniquely identifies products from all the three databases. Similarly, we suppose S# is a global identifier of the SUPPLIER entity types of DB1 and DB3. In [13], Lee et al proposes an ER based federated database system where local schemas modeled in the relational, object-relational, network or hierarchical models are first translated into the corresponding ER export schemas before they are integrated. Our approach is an extension to theirs by using ontology to provide semantics necessary for schema integration. In general, local schemas could be in different data models. We first translate them into ER or ORASS schemas (ORASS is an ER-like model for semi-structured data [25]). Then map the schema constructs of ER schemas onto the types of ontology and get elevated schemas with the help of semi-automatic tools. Finally, integrate the elevated schemas using the semantics of ontology; semantic heterogeneities among elevated schemas are resolved in this step. Integrity constraints on the integrated schema are derived from the constraints on the elevated schemas at the same time.


4 Resolving Schematic Discrepancies in the Integration of ER Schemas In this section, we resolve schematic discrepancies in schema integration. In particular, we present four algorithms to resolve schematic discrepancies for entity types, relationship types, attributes of entity types and attributes of relationship types respectively. This is done by transforming discrepant meta-attributes into entity types. The transformations keep the cardinalities of attributes and entity types, and therefore preserve the FDs and MVDs. Note in the presence of context, the values of an attribute depend on not only the identifier of an entity type/relationship type, but also the metadata of the attribute. To simplify the presentation, we only consider the discrepant meta-attributes of entity types, relationship types and attributes, leaving the other meta-attributes out as they will not change in schema transformation. In the rest of this section, we first present Algorithm TRANS_ENT and TRANS_REL, the resolutions of discrepancies for entity types and relationship types in Section 4.1, and then TRANS_ENT_ATTR and TRANS_REL_ATTR, the resolutions for attributes of entity types and attributes of relationship types in Section 4.2. Examples are provided to understand each algorithm.

4.1 Resolving Schematic Discrepancies for Entity Types/Relationship Types In this sub-section, we first show how to resolve discrepancies for entity types using the schema of Fig. 1, then present Algorithm TRANS_ENT in general. Finally, we describe the resolution of discrepancies for relationship types by an example, omitting the general algorithm which is similar to TRANS_ENT. As an example to remove discrepancies for entity types, we transform the schema of DB2 in Fig. 1 below. Example 3 (Fig. 2). In Step 1, for each entity type of DB2, say JAN_PROD, we represent the meta-attribute month as an entity type MONTH consisting of the only entity jan that is the metadata of JAN_PROD. We change the entity type JAN_PROD into PROD after removing the context, and construct a relationship type R to associate the entities of PROD and the entity of MONTH. Then we handle the attributes of JAN_PROD. As PNAME has nothing to do with the context month = ‘jan’ of the entity type, it becomes an attribute of PROD. However, S1_PRICE, ..., SN_PRICE inherit the context of month; they become the attributes of the relationship type R. Then in Step 2, the corresponding entity types, relationship types and attributes are merged respectively. The merged entity type of MONTH consists of all the entities {jan, ..., dec} of the original MONTH entity types, so do the entity type PROD, relationship type R and their attributes. Then we give the general algorithm below. Algorithm TRANS_ENT Input: an elevated schema DB. Output: a schema DB’ transformed from DB such that all the discrepant meta-attributes of entity types are transformed into entity types.


Step 1: Resolve the discrepant meta-attributes of an entity type.
Let E be an entity type of DB elevated to a type in the ontology with a set of discrepant meta-attributes and their values, and let K be the global identifier of E.
Step 1.1: Transform discrepant meta-attributes into entity types.
Construct an entity type E' with the global identifier K; E' consists of the entities of E without any context. Construct an entity type, with its own identifier, for each discrepant meta-attribute; each such entity type contains the one entity given by the metadata of E.
// Construct a relationship type to represent the associations among the entities of E and the metadata values.
Construct a relationship type R connecting the entity type E' and the new entity types.
Step 1.2: Handle the attributes of E.
Let A be an attribute (not part of the identifier) of E, and let selfCnt, a set of meta-attributes with values, be the self context of A.
If A is an m:1 or m:m attribute, then
  Case 1: attribute A has nothing to do with the context of E. Then A becomes an attribute of E'.
  Case 2: attribute A inherits all the context from E. Then A becomes an attribute of R.
  Case 3: attribute A inherits some discrepant meta-attributes with their values from E. Then construct a relationship type connecting E' and the entity types for those meta-attributes; A becomes an attribute of this relationship type.
Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities of E in the context. In this case, A should be modeled as an entity type to preserve the cardinality constraint. We keep the discrepant meta-attributes of A and delay the resolution to Algorithm TRANS_ENT_ATTR, the resolution for attributes of entity types.
  Construct an attribute A' of E', with the (self and inherited) context of A as the (self) context of A'.
Step 1.3: Handle relationship types involving entity type E in DB.
Let R1 be a relationship type involving E in DB.
  Case 1: R1 has nothing to do with the context of E. Then replace E with E' in R1.
  Case 2: R1 inherits all the context from E. Then replace E with R (i.e., treat R as a high-level entity type) in R1.
  Case 3: R1 inherits some discrepant meta-attributes with their values from E. Then construct a relationship type R2 connecting E' and the entity types for those meta-attributes; replace E with R2 in R1.
Step 2: Merge the entity types, relationship types and attributes which correspond to the same ontology type with the same context, and union their domains.
In the resolution of schematic discrepancies for relationship types, we deal with a set of entity types (participating in a relationship type) instead of individual ones. The steps are similar to those of Algorithm TRANS_ENT, but without Step 1.3.
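As a rough illustration of Steps 1 and 2 (ours, not the authors' implementation; the dictionary-based representation and the function name are assumptions), the following Python sketch folds the month context of DB2's JAN_PROD, ..., DEC_PROD entity types back into data, producing a PROD entity type, a MONTH entity type and a relationship type R, as in Example 3.

def trans_ent(contextual_tables):
    """Fold the discrepant meta-attribute month, encoded in entity-type names
    such as JAN_PROD, back into data: PROD keeps the context-free attributes,
    MONTH holds the metadata values, and R carries the context-dependent ones."""
    prod, month, rel = {}, set(), []
    for month_value, rows in contextual_tables.items():      # e.g. {"jan": [...], ...}
        month.add(month_value)                                # Step 1: one MONTH entity per context
        for row in rows:
            # PNAME has nothing to do with the context, so it stays with PROD.
            prod.setdefault(row["P#"], {})["PNAME"] = row.get("PNAME")
            # S1_PRICE, ..., SN_PRICE inherit the context, so they move to R.
            prices = {k: v for k, v in row.items() if k.endswith("_PRICE")}
            rel.append({"P#": row["P#"], "MONTH": month_value, **prices})
    # Step 2 (merging) is implicit: every month contributes to the same PROD, MONTH and R.
    return prod, month, rel

# e.g. trans_ent({"jan": db2_jan_prod}) on the DB2-style rows sketched earlier.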


Fig. 2. Resolve schematic discrepancies for entity types

We omit the resolution algorithm TRANS_REL for lack of space, but explain it by an example below, i.e., transforming the schema of DB3 in Fig. 1.

Example 4 (Fig. 3). In Step 1, for each relationship type of DB3, say JAN_SUP, we represent the meta-attribute month as an entity type MONTH consisting of the only entity jan, which is the metadata of JAN_SUP. We change JAN_SUP into the relationship type SUP after removing the context, and relate the entity type MONTH to SUP. Attribute PRICE of JAN_SUP inherits the context month = 'jan' from the relationship type, and therefore it becomes an attribute of SUP in the transformed schema. Then in Step 2, the MONTH entity types are merged into one consisting of all the entities {jan, ..., dec}; the SUP relationship types are also merged, and we get the schema of DB1 in Fig. 1.

4.2 Resolving Schematic Discrepancies for Attributes

In this sub-section, we first show how to resolve discrepancies for attributes of entity types using an example, then present Algorithm TRANS_ENT_ATTR in general. Finally, we describe the resolution of discrepancies for attributes of relationship types by an example, omitting the general algorithm, which is similar to TRANS_ENT_ATTR.


Fig. 3. Resolve schematic discrepancies for relationship types

The following example shows how to resolve discrepancies for attributes of entity types. Note that the discrepancies of entity types should be resolved before this step.

Example 5 (Fig. 4). Suppose we have another database DB4 recording the supplying information, in which all the suppliers and months are modeled as contexts of the attributes in an entity type PROD. The transformation is given in Fig. 4. In Step 1, for each attribute with discrepant meta-attributes, say S1_JAN_PRICE, the meta-attributes supplier and month are represented as entity types SUPPLIER and MONTH consisting of one entity s1 and jan respectively. A relationship type SUP is constructed to connect PROD, MONTH and SUPPLIER. After removing the context, we

Fig. 4. Resolve schematic discrepancies for attributes of entity types


change S1_JAN_PRICE into PRICE, an attribute of the relationship type SUP. Then in Step 2, we merge all the corresponding entity types, relationship types and attributes, and get the schema of DB1 in Fig. 1.

Then we give the general algorithm below.

Algorithm TRANS_ENT_ATTR
Input: an elevated schema DB.
Output: a schema DB' transformed from DB such that all the discrepant meta-attributes of attributes of entity types are transformed into entity types.
Step 1: Resolve the discrepant meta-attributes of an attribute in an entity type.
Given an entity type E of DB, let A be an attribute (not part of the identifier) of E elevated to a type in the ontology, with a set of discrepant meta-attributes and their values.
// Note A has no inherited context, which has been removed in Algorithm TRANS_ENT if any.
// Represent the discrepant meta-attributes as entity types.
Construct an entity type, with its own identifier, for each discrepant meta-attribute; each such entity type contains the one entity given by the metadata of A.
If A is an m:1 or m:m attribute, then
  // Construct a relationship type to represent the associations among the entities of E and the metadata values.
  Construct a relationship type R connecting the entity type E and the new entity types. Attribute A becomes an attribute of R.
Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities of E in the context. A should be modeled as an entity type to preserve the cardinality constraint.
  Construct an entity type for A, with A as its identifier. Construct a relationship type R connecting the entity type E, the entity type for A, and the new entity types.
  Represent the FD from A together with the meta-attributes to the identifier of E as a cardinality constraint on R. If A is a 1:1 attribute, also represent the reverse FD on R.
Step 2: Merge the entity types, relationship types and attributes which correspond to the same ontology type with the same context, and union their domains.
The resolution of schematic discrepancies for the attributes of relationship types is similar to that for the attributes of entity types, as a relationship type can be treated as a high-level entity type. We omit the resolution algorithm TRANS_REL_ATTR for lack of space, but explain it by an example below.

Example 6 (Fig. 5). Given the transformed schema of Fig. 2, we transform the attributes of the relationship type R as follows. In Step 1, for each attribute of R, say S1_PRICE, we represent the meta-attribute supplier as an entity type SUPPLIER with one entity s1, and construct a relationship type SUP to connect the relationship type R


and entity type SUPPLIER. After removing the context, we change S1_PRICE into PRICE, an attribute of SUP. Then in Step 2, we merge the SUPPLIER entity types and SUP relationship types respectively. In the merged schema, the relationship type R is redundant as it is a projection of SUP and has no attributes. Consequently, we remove R and get the schema of DB1 in Fig. 1.
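The sketch below (again ours, with an assumed S<i>_<MONTH>_PRICE naming convention for DB4's attributes) replays the TRANS_ENT_ATTR transformation of Example 5: the supplier and month meta-attributes carried in attribute names become SUPPLIER and MONTH entities, and PRICE becomes a plain attribute of the relationship type SUP.

import re

def trans_ent_attr(prod_rows):
    """Turn discrepant meta-attributes hidden in attribute names such as
    S1_JAN_PRICE into SUPPLIER and MONTH entity types and a relationship
    type SUP(P#, S#, MONTH) carrying an ordinary PRICE attribute."""
    pattern = re.compile(r"^(S\d+)_([A-Z]+)_PRICE$")   # assumed naming convention for DB4
    suppliers, months, sup = set(), set(), []
    for row in prod_rows:
        for attr, value in row.items():
            match = pattern.match(attr)
            if match is None or value is None:
                continue
            supplier, month = match.group(1).lower(), match.group(2).lower()
            suppliers.add(supplier)                     # Step 1: SUPPLIER entities
            months.add(month)                           # Step 1: MONTH entities
            sup.append({"P#": row["P#"], "S#": supplier, "MONTH": month, "PRICE": value})
    # Step 2 (merging the per-attribute constructs) happens implicitly above.
    return suppliers, months, sup

# e.g. trans_ent_attr([{"P#": "p1", "S1_JAN_PRICE": 100}]) yields DB1-style SUP facts.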

Fig. 5. Resolve schematic discrepancies for attributes of relationship types

The transformations of the algorithms (in Sections 4.1 and 4.2) correctly preserve the FDs/MVDs in the presence of context, as shown in the following proposition.

Proposition 1. Consider a set of similar entity types (or relationship types) with the same set of discrepant meta-attributes, and let K be the global identifier of the set (or the set of global identifiers of the participating entity types, if it is a set of relationship types). Suppose each entity type (or relationship type) of the set has a corresponding attribute, and these attributes all have the same cardinality. Then in the transformed schema the discrepant meta-attributes are modeled as entity types, and the following FDs/MVDs hold:
Case 1: the attributes are m:1 attributes. Then they are merged into an attribute, and the FD from K together with the meta-attributes to this attribute holds.
Case 2: the attributes are m:m attributes. Then they are merged into an attribute, and the corresponding MVD holds.
Case 3: the attributes are 1:1 attributes. Then they are merged into an entity type identified by the attribute, and the FDs in both directions (between K with the meta-attributes and the attribute) hold.
Case 4: the attributes are 1:m attributes. Then they are merged into an entity type identified by the attribute, and the FD from the attribute together with the meta-attributes to K holds.

For lack of space, we only prove Case 1, when the set consists of entity types. In the transformed schema, consider two relationships with values on the merged attribute: one with global-identifier value k, meta-attribute values, and attribute value a, and another with global-identifier value k', meta-attribute values, and attribute value a'. If k = k' and the meta-attribute values coincide, then in the original schemas the two relationships correspond to the same entity and the same attribute. As that attribute is an m:1 attribute, we have a = a'. That is, the FD of Case 1 holds in the transformed schema.

In schema integration, schematic discrepancies of different schema constructs should be resolved in order: first for entity types, then relationship types, and finally for attributes of entity types and attributes of relationship types. The resolutions for most of the other semantic heterogeneities (introduced in Section 1) follow the resolution of schematic discrepancies.

5 Related Work

Context is the key component in capturing the semantics related to the definition of an object or association. The definition of context as a set of meta-attributes with values was originally adopted in [7, 23], but was used to solve different kinds of semantic heterogeneities. Our work complements rather than competes with theirs: their work is based on context at the attribute level only, whereas we consider contexts at different levels and the inheritance of context.

A special kind of schematic discrepancy has been studied in multidatabase interoperability, e.g., [2, 11, 12, 16, 17, 19]. These works dealt with the discrepancy in which schema labels (e.g., relation names or attribute names) in one database correspond to attribute values in another. In contrast, we use contexts to capture meta-information and solve a more general problem, in the sense that schema constructs may have multiple (instead of single) discrepant meta-attributes. Furthermore, their works are at the "structure level", i.e., they did not consider the constraint issue in the resolution of schematic discrepancies. However, the importance of constraints can hardly be overestimated in both individual and multidatabase systems. In particular, we preserve FDs and MVDs during schema transformation, which are expressed as cardinality constraints in ER schemas. The purposes are also different: previous works focused on developing a multidatabase language by which users can query across schematically discrepant databases, whereas we aim to develop an integration system which can detect and resolve schematic discrepancies automatically, given the meta-information on source schemas.

The issue of inferring view dependencies was introduced in [1, 8]. However, those works are based on views defined using relational algebra; in other words, they did not solve the inference problem for the transformations between schematically discrepant schemas. In [14, 21, 24], the derivation of constraints for integrated schemas from the constraints of component schemas has been studied. However, these works do not consider schematic discrepancy in schema integration. Our work complements theirs.


6 Conclusions and Future Work

Information integration provides a competitive advantage to businesses and has become a major area of investment by software companies today [18]. In this paper, we resolve a common problem in schema integration, schematic discrepancy in general, using the paradigm of context. We define context as a set of meta-attributes with values, which can be at the levels of databases, entity types, relationship types, and attributes. We design algorithms to resolve schematic discrepancies by transforming discrepant meta-attributes into entity types. The transformations preserve information and cardinality constraints, which are useful in verifying lossless schema transformation, in schema normalization, and in query processing in multidatabase systems. We have implemented a schema integration tool to semi-automatically integrate schematically discrepant schemas from several relational databases. Next, we will extend our system to integrate databases in different models and semi-structured data.

References
1. S. Abiteboul, R. Hull, and V. Vianu: Foundations of databases. Addison-Wesley, 1995, pp 216-235
2. R. Agrawal, A. Somani, Y. Xu: Storing and querying of e-commerce data. VLDB, 2001, pp 149-158
3. C. Batini, M. Lenzerini: A methodology for data schema integration in the Entity-Relationship model. IEEE Trans. on Software Engineering, 10(6), 1984
4. C. Batini, M. Lenzerini, S. B. Navathe: A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 1986, pp 323-364
5. P. P. Chen: The entity-relationship model: toward a unified view of data. TODS 1(1), 1976
6. A. Elmagarmid, M. Rusinkiewicz, A. Sheth: Management of heterogeneous and autonomous database systems. Morgan Kaufmann, 1999
7. C. H. Goh, S. Bressan, S. Madnick, and M. Siegel: Context interchange: new features and formalisms for the intelligent integration of information. ACM Transactions on Information Systems, 17(3), 1999, pp 270-293
8. G. Gottlob: Computing covers for embedded functional dependencies. SIGMOD, 1987
9. C. N. Hsu and C. A. Knoblock: Semantic query optimization for query plans of heterogeneous multidatabase systems. TKDE 12(6), 2000, pp 959-978
10. Qi He, Tok Wang Ling: Extending and inferring functional dependencies in schema transformation. Technical report TRA3/04, School of Computing, National University of Singapore, 2004
11. R. Krishnamurthy, W. Litwin, W. Kent: Language features for interoperability of databases with schematic discrepancies. SIGMOD, 1991, pp 40-49
12. V. Kashyap, A. Sheth: Semantic and schematic similarity between database objects: a context-based approach. The VLDB Journal 5, 1996, pp 276-304
13. Tok Wang Ling, Mong Li Lee: Issues in an entity-relationship based federated database system. CODAS, 1996, pp 60-69
14. Mong Li Lee, Tok Wang Ling: Resolving constraint conflicts in the integration of entity-relationship schemas. ER, 1997, pp 394-407
15. Mong Li Lee, Tok Wang Ling: A methodology for structural conflicts resolution in the integration of entity-relationship schemas. Knowledge and Information Sys., 5, 2003, pp 225-247
16. L. V. S. Lakshmanan, F. Sadri, S. N. Subramanian: On efficiently implementing SchemaSQL on SQL database system. VLDB, 1999, pp 471-482
17. L. V. S. Lakshmanan, F. Sadri, S. N. Subramanian: SchemaSQL – an extension to SQL for multidatabase interoperability. TODS, 2001, pp 476-519


18. N. M. Mattos: Integrating information for on demand computing. VLDB, 2003, pp 8-14
19. R. J. Miller: Using schematically heterogeneous structures. SIGMOD, 1998, pp 189-200
20. Web ontology language, W3C recommendation. http://www.w3.org/TR/owl-guide/
21. M. P. Reddy, B. E. Prasad, Amar Gupta: Formulating global integrity constraints during derivation of global schema. Data & Knowledge Engineering, 16, 1995, pp 241-268
22. A. P. Sheth and S. K. Gala: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3), 1990
23. E. Sciore, M. Siegel, A. Rosenthal: Using semantic values to facilitate interoperability among heterogeneous information systems. TODS, 19(2), 1994, pp 254-290
24. M. W. W. Vermeer and P. M. G. Apers: The role of integrity constraints in database interoperation. VLDB, 1996, pp 425-435
25. Xiaoying Wu, Tok Wang Ling, Mong Li Lee, and Gillian Dobbie: Designing semistructured databases using the ORA-SS model. WISE, 2001, pp 171-180

Managing Merged Data by Vague Functional Dependencies* An Lu and Wilfred Ng Department of Computer Science The Hong Kong University of Science and Technology Hong Kong, China {anlu,wilfred}@cs.ust.hk

Abstract. In this paper, we propose a new similarity measure between vague sets and apply vague logic in a relational database environment with the objective of capturing the vagueness of the data. By introducing a new vague Similar Equality for comparing data values, we first generalize the classical Functional Dependencies (FDs) into Vague Functional Dependencies (VFDs). We then present a set of sound and complete inference rules. Finally, we study the validation process of VFDs by examining the satisfaction degree of VFDs, and the merge-union and merge-intersection on vague relations.

1 Introduction

The relational data model [8] has been extensively studied for over three decades. This data model basically handles precise and exact data in an information source. However, many real life applications such as merging data from many sources involve imprecise and inexact data. It is well known that Fuzzy database models [11, 2], based on the fuzzy set theory by Zadeh [13], have been introduced to handle inexact and imprecise data. In [5], Gau et al. point out that the drawback of using the single membership value in fuzzy set theory is that the evidence for and the evidence against are in fact mixed together. (Here U is a classical set of objects, called the universe of discourse. An element of U is denoted by Therefore, they propose vague sets, which is similar to that of intuitionistic fuzzy sets proposed in [1]. A true membership function and a false membership function are used to characterize the lower bound on (Here V means a vague set and F means a fuzzy set.) The lower bounds are used to create a subinterval of the unit interval [0,1], where in order to generalize the membership function of fuzzy sets. There have been many studies which discuss the topic concerning how to measure the degree of similarity or distance between vague sets or intuitionistic fuzzy sets [3,4,7,9,12,6]. However, the proposed methods have some limitations. *

This work is supported in part by grants from the Research Grant Council of Hong Kong, Grant Nos HKUST6185/02E and HKUST6165/03E.



For example, Hong’s similarity measure in [7] means that the similarity measure between the vague value with the most imprecise evidence (the precision of the evidence is 0) and the vague value with the most precise evidence (the precision of the evidence is 1) is equal to 0.5. In this case, the similarity measure should be equal to 0. Our view is that the similarity measure should include two factors of vague values. One is the difference between the evidences contained by the vague values; another is the difference between the precisions of the evidences. However, the proposed measures or distances consider only one factor (e.g. in [3,4]) or do not combine both the factors appropriately (e.g. in [7,9,12,6]). Our new similarity measure is able to return a more reasonable answer. In this paper, we extend the classical relational data model to deal with vague information. Our first objective is to extend relational databases to include vague domains by suitably defining the Vague Functional Dependencies (VFDs) based on our notion of similarity measure. A set of sound and complete inference rules for VFDs is then established. We discuss the satisfaction degree of VFDs and apply VFDs in merged vague relations as the second objective. The main contributions of the paper are as follows: (1) A new similarity measure between vague sets is proposed to remedy some problems for similar definitions in literature. We argue that our measure gives a more reasonable estimation; (2) A VFD is proposed in order to capture more semantics in vague relations; (3) The satisfaction degree of VFDs in merged vague relations is studied. The rest of the paper is organized as follows. Section 2 presents some basic concepts related to databases and the vague set theory. In Section 3, we propose a new similarity measure between vague sets. In Section 4, we introduce the concept of a Vague Functional Dependency (VFD) and the associated inference rules. We then explain the validation process which determines the satisfaction degree of VFDs in vague relations. In Section 5, we give the definitions of merge operators of vague relations and discuss the satisfaction degree of VFDs after merging. Section 6 concludes the paper.

2 Preliminaries

In this section, some basic concepts related to the classical relational data model and the vague set theory are given.

2.1 Relational Data Model

We assume the reader is familiar with the basic concepts of the relational data model [8]. There are two operations on relations that are particularly relevant in the subsequent discussion: projection and natural join. The projection of a relation of R(XYZ) over the set of attributes X is obtained by taking the restriction of its tuples to the attributes in X and eliminating duplicate tuples in what remains. Given two relations of R(XY) and R(XZ), respectively, their natural join is the relation over R(XYZ) consisting of the tuples whose projections on XY and XZ belong to the two relations, respectively.

Functional Dependencies (FDs) are important integrity constraints in relational databases. An FD is a statement X → Y, where X and Y are sets of attributes. A relation satisfies the FD if, whenever two tuples agree on X, they also agree on Y.

2.2 Vague Data Model

Let U be a classical set of objects, called the universe of discourse, where an element of U is denoted by Definition 1. (Vague Set) A vague set V in a universe of discourse U is characterized by a true membership function, and a false membership function, as follows: and where is a lower bound on the grade of membership of derived from the evidence for and is a lower bound on the grade of membership of the negation of derived from the evidence against Suppose A vague set V of the universe of discourse U can be represented by where and This approach bounds the grade of membership of to a subinterval of [0,1]. In other words, the exact grade of membership of may be unknown, but is bounded by where We depict these ideas in Fig. 1. Throughout this paper, we simply use and for if no ambiguity of V arising.

Fig. 1. The true and false membership functions of a vague set

For a vague set we say that the interval is the vague value to the object For example, if then we can see that and It is interpreted as “the degree that object belongs to the vague set V is 0.6, the degree that object does not belong to the vague set V is 0.1.” In a voting process, the vague value [0.6,0.9] can be interpreted as “ the vote for resolution is 6 in favor, 1 against, and 3 neutral (abstentious).”


The precision of the knowledge about is characterized by the difference If this is small, the knowledge about is relatively precise; if it is large, we know correspondingly little. If is equal to the knowledge about is exact, and the vague set theory reverts back to fuzzy set theory. If and are both equal to 1 or 0, depending on whether does or does not belong to V, the knowledge about is very exact and the theory reverts back to ordinary sets. Thus, any crisp or fuzzy value can be regarded as a special case of a vague value. For example, the ordinary set can be presented as the vague set while the fuzzy set (the membership of is 0.8) can be presented as the vague set Definition 2. (Empty Vague Set) A vague set V is an empty vague set, if and only if, its true membership function and false membership function for all We use to denote it. Definition 3. (Complement) The complement of a vague set V is denoted by and is defined by and Definition 4. (Containment) A vague set A is contained in another vague set B, if and only if, and Definition 5. (Equality) Two vague sets A and B are equal, written as A = B, if and only if, and that is and Definition 6. (Union) The union of two vague sets A and B is a vague set C, written as whose true membership and false membership functions are related to those of A and B by and Definition 7. (Intersection) The intersection of two vague sets A and B is a vague set C, written as whose true membership and false membership functions are related to those of A and B by and Definition 8. (Cartesian Product) Let be the Cartesian product of m universes, and be the vague sets in their respectively, corresponding universe of discourse The Cartesian product is defined to be a vague set of where the memberships are defined as follows: and
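As a compact illustration of Definitions 3-7 (our own sketch; the class name and the (t, f) encoding of a vague value [t, 1 - f] are assumptions, and the operations are applied here to single vague values, whereas the definitions apply them pointwise to every element of U):

from dataclasses import dataclass

@dataclass(frozen=True)
class VagueValue:
    """A vague value [t, 1 - f]: t is the evidence for, f the evidence against (t + f <= 1)."""
    t: float
    f: float

    def complement(self):                      # Definition 3: swap evidence for and against
        return VagueValue(self.f, self.t)

    def contained_in(self, other):             # Definition 4: t_A <= t_B and 1 - f_A <= 1 - f_B
        return self.t <= other.t and self.f >= other.f

    def union(self, other):                    # Definition 6: max of true, min of false membership
        return VagueValue(max(self.t, other.t), min(self.f, other.f))

    def intersection(self, other):             # Definition 7: min of true, max of false membership
        return VagueValue(min(self.t, other.t), max(self.f, other.f))

# The voting interpretation from the text: [0.6, 0.9] means t = 0.6 and f = 0.1.
vote = VagueValue(t=0.6, f=0.1)
print(vote.complement(), vote.union(VagueValue(0.2, 0.5)))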

2.3 Vague Relations

Definition 9. (Vague Relation) A vague relation on a relation scheme is a vague subset of the Cartesian product of the attribute domains; a tuple of the relation is likewise a vague subset of this Cartesian product.

A relation scheme R is denoted by or simply by R if the attributes are understood. Corresponding to each attribute name the domain of is written as However, unlike classical and fuzzy relations, in vague relations, we define as a set of vague sets. Vague relations may be considered as an extension of classical relations and fuzzy relations, which can capture more information about imprecision. Example 1. Consider the vague relation over Product(ID, Weight, Price) given in Table 1. In Weight and Price are vague attributes. To make the attribute ID simple, we express it as the ordinary value. The first tuple in means the product with ID = 1 has the weight of [1,1]/10 and the price of [0.4, 0.6]/50 + [1,1]/80, which are vague sets. In the vague set [1,1]/10, [1,1] means the evidence in favor “the weight is 10” is 1 and the evidence against it is 0.

3 Similarity Measure Between Vague Sets

In this section, we review the notions of similarity measures between vague sets proposed by Chen [3, 4], Hong [7] and Li [9], together with distances between intuitionistic fuzzy sets proposed by Szmidt [12] and Grzegorzewski [6]. We show by some examples that these measures are not able to reflect our intuitions. A new similarity measure between vague sets is proposed to remedy the limitations.

3.1 Similarity Measure Between Two Vague Values

Let two vague values for the same object be given. In general, two factors should be considered in measuring the similarity between two vague values. One is the difference between the differences of the true and false membership values of the two vague values; the other is the difference between the sums of their true and false membership values. The first factor reflects the difference between the evidences contained by the vague values, and the second factor reflects the difference between the precisions of the evidences. In [3, 4], Chen defines a similarity measure between two vague values as follows:


which is equal to This similarity measure ignores the difference between the precisions of the evidences For example, consider

This means that and are equal. On the one hand, means and that is to say, we have no information about the evidence, and the precision of the evidence is zero. On the other hand, means and that is to say, we have some information about the evidence, and the precision of the evidence is not zero. So it is not intuitive to have the similarity measure of and being equal to 1. In order to solve this problem, Hong et al. [7] propose another similarity measure between vague values as follows:

However, this definition also has some problems. Here is an example. Example 2. The similarity measure between [0,1] and is equal to 0.5. This means that the similarity measure between the vague value with the most imprecise evidence (the precision of the evidence is equal to zero) and the vague value with the most precise evidence (the precision of the evidence is equal to one) is equal to 0.5. However, our intuition shows that the similarity measure in this case should be equal to 0. Li et al. in [9] also give a similarity measure in order to remedy the problems in Chen’s definition as follows:

It can be checked that This means Li’s similarity measure is just the arithmetic mean of Chen’s and Hong’s. So Li’s similarity measure still contains the same problems. [12, 6] adopt Hamming distance and Euclidean distance to measure the distances between intuitionistic fuzzy sets as follows: 1. Hamming distance is given by

2. Euclidean distance is given by

These methods also have some problems. Here is an example.


Example 3. We still consider the vague values and in Example 2. For the Hamming distance, it can be calculated that This means that the Hamming distance between and are equal to that between and In a voting process, as mentioned in Example 2, since both and have identical votes in favor and against, the Hamming distance between and should be less than that between and For the Euclidean distance, consider the Euclidean distance between [0,1] and which is equal to This means that the distance between the vague value with the most imprecise evidence and the vague value with the most precise evidence is not equal to 1. (Actually, the Euclidean distance in this case is in the interval However, our intuition shows that the distance in this case should always be equal to 1. In order to solve all the problems mentioned above, we define a new similarity measure between the vague values and as follows: Definition 10. (Similarity Measure Between Two Vague Values)

Furthermore, we define the distance between two vague values as one minus their similarity measure.

The similarity measure given in Definition 10 takes into account both the difference between the evidences contained by the vague values and the difference between the precisions of the evidences. Here is an example.

Example 4. We still consider the vague values of Example 2. It can be calculated that the similarity measure between the former pair of vague values is less than that between the latter pair; as mentioned in Example 2, this result accords with our intuition. Another example is the similarity measure between [0,1] and the vague value with the most precise evidence, which is equal to 0. This means that the similarity measure between the vague value with the most imprecise evidence and the vague value with the most precise evidence is equal to 0. This result also accords with our intuition.

From Definition 10, we can obtain the following theorem.

Theorem 1. The following statements are true:
1. The similarity measure is bounded between 0 and 1;
2. It equals 1 if and only if the two vague values are equal;
3. It equals 0 if and only if the two vague values are [0,0] and [1,1], or [0,1] and one of [0,0] and [1,1];
4. The similarity measure is commutative.
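For reference, the reviewed measures can be written out directly; the sketch below is ours (vague values are passed as plain (t, f) numbers), and it deliberately does not reproduce the paper's own measure of Definition 10, whose formula is not legible in this copy, although the text makes clear that it combines both the t - f difference and the precision difference.

def chen(tx, fx, ty, fy):
    """Chen's measure [3, 4]: compares only the differences t - f of the two values."""
    return 1 - abs((tx - fx) - (ty - fy)) / 2

def hong(tx, fx, ty, fy):
    """Hong and Kim's measure [7]: compares t and f componentwise."""
    return 1 - (abs(tx - ty) + abs(fx - fy)) / 2

def li(tx, fx, ty, fy):
    """Li and Xu's measure [9]: the arithmetic mean of the two measures above."""
    return (chen(tx, fx, ty, fy) + hong(tx, fx, ty, fy)) / 2

# The limitation discussed in Example 2: the most imprecise value [0, 1]
# (t = 0, f = 0) and the fully precise value [0, 0] (t = 0, f = 1) still get
# similarity 0.5 under Hong's measure, whereas the paper argues it should be 0.
print(hong(0.0, 0.0, 0.0, 1.0))   # 0.5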

3.2 Similarity Measure Between Two Vague Sets

We generalize the similarity measure to two given vague sets. Definition 11. (Similarity Measure Between Two Vague Sets) Let X and Y be two vague sets, where and The similarity measure between the vague sets X and Y can be evaluated as follows:

Similarly, we give the definition of distance between two vague sets as D(X, Y) = 1–M(X,Y). From Definition 11, we obtain the following theorem for vague sets, which is similar to Theorem 1. Theorem 2. The following statements related to M(X, Y) are true: 1. The similarity measure is bounded, i.e., 2. M(X, Y) = 1, if and only if, the vague sets X and Y are equal (i.e., X = Y); 3. M(X, Y) = 0, if and only if, all the vague values and are [0, 0] and [1, 1] or [0, 1] and 4. The similarity measure is commutative, i.e., M(X, Y) = M(Y, X}.

4 Vague Functional Dependencies and Inference Rules

In this section, we first give the definition of Similar Equality of vague relations, which can be used to compare vague relations. Then we present the definition of a Vague Functional Dependency (VFD). Next, we present a set of sound and complete inference rules for VFDs, which is an analogy to Armstrong's Axiom for classical FDs.

4.1 Similar Equality of Vague Relations

Similar Equality of vague relations, defined below, can be used as a vague similarity measure to compare elements of a given domain. Suppose two tuples of a relation over the scheme R are given.

Definition 12. (Similar Equality of Tuples) The Similar Equality of two vague tuples on an attribute in a vague relation is given by:

The Similar Equality of two vague tuples on a set of attributes in a vague relation is given by:

From Definition 12 and Theorem 2, we have the following theorem.

Theorem 3. The following properties of Similar Equality are true:
1. The similar equality is bounded between 0 and 1;
2. It equals 1 if and only if all the corresponding vague sets of the two tuples are equal;
3. It equals 0 if and only if all the corresponding pairs of vague values are [0, 0] and [1, 1], or [0, 1] and one of [0, 0] and [1, 1];
4. The similar equality is commutative.

4.2 Vague Functional Dependencies

Informally, a VFD captures the semantics that, for any two given tuples, the Y values should not be less similar than the X values. We now give the following definition of a VFD.

Definition 13. (Vague Functional Dependency) Given a relation over a relation schema whose attribute domains are sets of vague sets, a Vague Functional Dependency (VFD) X → Y holds over the relation if, for every pair of tuples, the Similar Equality on Y is not less than the Similar Equality on X.

In the database literature [8], a set of inference rules is generally used to derive new data dependencies from a given set of dependencies. We now present a set of sound and complete inference rules for VFDs, which is similar to Armstrong's Axiom for FDs.

Definition 14. (Inference Rules) Consider a relation scheme R and a set of VFDs F. Let X, Y, and Z be subsets of the relation scheme R. We define a set of inference rules as follows:
1. Reflexivity: If Y ⊆ X, then X → Y holds;
2. Augmentation: If X → Y holds, then XZ → YZ also holds;
3. Transitivity: If X → Y and Y → Z hold, then X → Z holds.

The following theorem follows by assuming that there are at least two elements and in each data domain such that Theorem 4. The inference rules given in Definition 14 are sound and complete. The Union, Decomposition, Pseudotransitivity rules follow from these three rules, as in the case of functional dependencies [8]. We skip the proof due to space limitation.


4.3 Validation of VFDs

In this section, we study the validation issues of VFDs. We relax the notion that a VFD fails as soon as it does not hold for some pair of tuples; instead, we allow a VFD to hold with a certain satisfaction degree over a relation. The validation process and the calculation of the satisfaction degree of a VFD are given as follows:
1. For every attribute in the left-hand side and the right-hand side of the VFD, we calculate the Similar Equality between every pair of tuples by constructing two upper triangular matrices X and A. Each row and column represents a comparison of two different tuples. We ignore the lower part of each matrix and the diagonal, since Similar Equality is commutative; thus we get n(n-1)/2 entries per matrix, where n is the cardinality of the relation, and each entry is the comparison of one pair of tuples;
2. We check, for every pair, whether the Similar Equality on the right-hand side is at least that on the left-hand side. If this is true for all pairs, then we say that the VFD holds (with a satisfaction degree of 1). We construct a matrix W = X - A to check this;
3. If the check in Step 2 fails, we count the number of entries in the matrix W (denoted by s) which are less than or equal to 0. The satisfaction degree SD of the VFD is then calculated as follows:

Obviously, if the inequality given in Definition 13 holds for all tuples in the satisfaction degree calculated by (11) is equal to 1. Suppose there are many VFDs hold over relation say with the satisfaction degrees respectively. We use a VFD set to present this. Then the satisfaction degree of the VFD set F over relation can be calculated by the arithmetic mean of the satisfaction degrees of F as follows:

Here is an example to illustrate the validation process and the calculation of the satisfaction degree of a VFD.

Example 5. Consider the vague relation presented in Table 1; it can be checked that the VFD Weight → Price holds to a certain satisfaction degree. In Step 1, we calculate the Similar Equality for the attributes X = Weight and A = Price, and the results are shown by matrices X and A in Tables 2 and 3. In Step 2, we check the condition by taking the difference between the two matrices X and A. The result is shown by matrix W in Table 4. Since the condition does not hold for every pair of tuples, we go to Step 3. In Step 3, we count the entries of W that are less than or equal to 0, and the satisfaction degree SD can be calculated as follows:


Therefore, the VFD Weight → Price holds over the relation with the satisfaction degree 0.5. Furthermore, for the zero entries in W, we check the corresponding values in the matrix X. If such a value is equal to 1, the corresponding vague sets compared in X are equal according to Theorem 3. Thus, we can remove some redundancies by decomposing the original relation into two relations. For instance, the value in position (3,2) of W above is 0. We check the corresponding value in position (3,2) of matrix X, and find that the value is 1. So the vague relation in Table 1 can be decomposed into two relations IW(ID, Weight) and WP(Weight, Price) (Tables 5 and 6), and some redundancies have been removed.
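A small sketch of the validation process (ours; it assumes formula (11) is the fraction of tuple pairs on which the VFD condition holds, which reproduces the satisfaction degree 0.5 of Example 5, and it leaves the Similar Equality computations to caller-supplied functions):

def satisfaction_degree(tuples, se_x, se_y):
    """Validate a VFD X -> Y over a vague relation.

    tuples : the tuples of the relation
    se_x, se_y : functions giving the Similar Equality of two tuples on X and on Y
    Builds the upper-triangular comparisons (matrices X, A and W = X - A in the
    text, flattened here) and returns the assumed satisfaction degree s / (n(n-1)/2).
    """
    n = len(tuples)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    w = [se_x(tuples[i], tuples[j]) - se_y(tuples[i], tuples[j]) for i, j in pairs]
    s = sum(1 for entry in w if entry <= 0)   # pairs on which SE on Y >= SE on X
    return s / len(pairs)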

5 Merge Operations of Vague Relations

In this section, we first give the definition of merge operators of vague relations and then discuss the evaluation of the satisfaction degree of VFDs over the merged vague relations.

5.1 Merge Operators

Generally speaking, when multiple data sources merge together, the result may contain objects of three cases [10]: (1) an attribute value is not provided; (2) an attribute value is provided by exactly one source; (3) an attribute value is provided by more than one source. When merging vague data, in the first case, we use an empty vague set to express the unavailable value; in the second case, we keep the original vague set; in the third case, we take the union of the vague sets provided by the source. We now define two new merge operators to serve our purpose. Definition 15. (Join Merge Operator) Let be a tuple in the vague relation over scheme and be a tuple in the vague relation over scheme and have a common ID attribute The attributes are common in both vague relations. Then we define the join merge of and denoted by as follows: with where means the union of two vague sets as defined in Definition 6. Definition 16. (Union Merge Operator) Let Then we define the union merge of and denoted by with with , where means an empty vague set.

as follows:

Since vague sets have the property of associativity given in [5], the join merge operator and the union merge operator also have the property of associativity. That is to say, and (recall that are vague relations). We can also generalize Definitions 15 and 16 to more than two data sources. Definition 16 guarantees that every tuple is contained in the new merged relation. For example, consider the following vague relations and given in Tables 7 and 8. We then have and as given in Tables 9 and 10.
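The following sketch (ours; a vague set over a domain is represented as a {value: (t, f)} map and None stands for the empty vague set) illustrates the union merge operator and the three cases listed above: a missing attribute value becomes the empty vague set, a value from a single source is kept, and values from several sources are unioned as in Definition 6.

def union_vague(a, b):
    """Union of two vague sets represented as {value: (t, f)} maps (Definition 6)."""
    if a is None:
        return b
    if b is None:
        return a
    out = dict(a)
    for value, (t, f) in b.items():
        if value in out:
            t0, f0 = out[value]
            out[value] = (max(t0, t), min(f0, f))   # max of true, min of false membership
        else:
            out[value] = (t, f)
    return out

def union_merge(r1, r2, key="ID"):
    """Union merge of two vague relations given as lists of {attribute: vague set} dicts:
    every ID from either source is kept; common attributes are unioned."""
    attrs = sorted({a for t in r1 + r2 for a in t} - {key})
    merged = {}
    for t in r1 + r2:
        row = merged.setdefault(t[key], {key: t[key], **{a: None for a in attrs}})
        for a in attrs:
            row[a] = union_vague(row[a], t.get(a))
    return list(merged.values())

# The join merge operator would instead keep only the IDs present in both sources.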

5.2 Satisfaction Degree of Merged Relations

Suppose we have several data sources represented by vague relations. Each relation has a set of VFDs, with

the satisfaction degree defined in (12). By the union merge operator, we get a new relation We can also get a new VFD set over For each VFD in F, we can calculate the new satisfaction degree over by the validation process proposed in Sect. 4. Then the satisfaction degree of the new VFD set F over relation can be calculated by (12). In the case of non-overlapping sources, we can simplify the calculation as follows. Assume two data sources represented by the vague relations, and which have the same VFD on a common schemas. We let the satisfaction degree be and and the cardinalities of and are and (As the sources are non-overlapping, there exists no tuple which has the same value of (the ID attribute) in both and This implies that the cardinality of is In order to calculate the new SD of over we need to construct two new matrices, and to calculate the of every pair of tuples between and Then we need to construct a matrix and count the number of entries (denoted by which are less than or equal to 0 in According to (11), the satisfaction degree SD of the VFD over where can be calculated as follows:

6 Conclusions

In this paper, we incorporate the notion of vagueness into the relational data model, with the objective of providing a generalized approach for treating imprecise data. We propose a new similarity measure between vague sets, which gives a more reasonable estimation than those proposed in the literature. We apply Similar Equality in vague relations; this equality measure can be used to compare elements of a given vague data domain. Based on the concept of similar equality of attribute values in vague relations, we develop the notion of Vague Functional Dependencies (VFDs), which is a simple and natural generalization of classical and fuzzy functional dependencies. In spite of this generalization, the inference rules for VFDs share the simplicity of Armstrong's axioms for classical FDs. We also present the validation process of VFDs and the formula that determines the satisfaction degree of VFDs. Finally, we give the definition of the merge operators of vague relations and discuss the satisfaction degree of VFDs over the merged vague data. As future work, we plan to extend the merge operations over vague data, which provide a flexible means to merge data in modern applications, such as querying Internet sources and merging the returned results. We are also studying the notion of Vague Inclusion Dependencies, which is useful for generalizing foreign keys in vague relations.

References 1. Atanassov, K.: Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems 20(1) (1986) 87–96 2. Buckles, B.P., Petry F.E.: A Fuzzy Representation of Data for Relational Databases. Fuzzy Sets and Systems 7 (1982) 213–226 3. Chen, S.M.: Similarity Measures Between Vague Sets and Between Elements. IEEE Transactions on System, Man and Cybernetics 27(1) (1997) 153–159 4. Chen, S.M.: Measures of Similarity Between Vague Sets. Fuzzy Sets and Systems 74(2) (1995) 217–223 5. Gau, W.L., Danied, J.B.: Vague Sets. IEEE Transactions on Systems, Man, and Cybernetics 23(2) (1993) 610–614 6. Grzegorzewski, P.: Distances Between Intuitionistic Fuzzy Sets and/or Intervalvalued Fuzzy Sets Based on the Hausdorff Metric. Fuzzy Sets and Systems (2003) 7. Hong, D.H., Kim, C.: A Note on Similarity Measures Between Vague Sets and Between elements. Information Sciences 115 (1999) 83–96 8. Levene, M., Loizou, G.: A Guided Tour of Relational Databases and Beyond. Springer-Verlag, Berlin Heidelberg New York (1999) 9. Li, F., Xu, Z.: Measures of Similarity Between Vague sets. Journal of Software 12(6) (2001) 922–927 10. Naumann, F., Freytag, J.C.: Completeness of Information Sources. Ulf Leser Workshop on Data Quality in Cooperative Information Systems 2003 (DQCIS) (2003) 11. Raju, K.V.S.V.N., Majumdar, A.K.: Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database Systems. ACM Transactions on Database Systems 13(2) (1988) 129–166 12. Szmidt, E., Kacprzyk, J.: Distances Between Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems 114 (2000) 505–518 13. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3) (1965) 338–353

Merging of XML Documents Wanxia Wei, Mengchi Liu, and Shijun Li School of Computer Science, Carleton University, Ottawa, Ontario, Canada, K1S 5B6 {wwei2,mengchi,shj_li}@scs.carleton.ca

Abstract. How to deal with the heterogeneous structures of XML documents, identify XML data instances, solve conflicts, and effectively merge XML documents to obtain complete information is a challenge. In this paper, we define a merging operation over XML documents that can merge two XML documents with different structures. It is similar to a full outer join in relational algebra. We design an algorithm for this operation. In addition, we propose a method for merging XML elements and handling typical conflicts. Finally, we present a merge template XML file that can support recursive processing and merging of XML elements.

1 Introduction

Information about real world objects may spread over heterogeneous XML documents. Moreover, it is critical to identify XML data instances representing the same real world object when merging XML documents, but each XML document may have different elements and/or attributes to identify objects. Furthermore, conflicts may emerge when merging these XML documents. In this paper, we present a new approach to merging XML documents. Our main contributions are as follows. First, we define a merging operation over XML documents that is similar to a full outer join in relational algebra. It can merge two XML documents with different structures. We design an algorithm for this operation. Second, we propose a method for merging XML elements and handling typical conflicts. Finally, we present a merge template XML file that can support recursive processing and merging of XML elements. The rest of the paper is organized as follows. Section 2 defines the merging operation and presents the algorithm for this operation. Section 3 studies the mechanism for identifying XML instances. Section 4 examines XML documents that this algorithm produces. Section 5 demonstrates the method for merging elements and handling conflicts. Section 6 describes the merge template XML file. Section 7 discusses related work. Finally, Section 8 concludes this paper.

2 Our Approach

The merging operation to be defined can merge two XML documents that have different structures, and create one single XML document.

Fig. 1. The first XML document to be merged.

We assume that two

XML documents to be merged share many tag names and also have some tags with different tag names. We also assume that two tags that share the same tag name in these two XML documents describe the same kind of objects in the real world but their corresponding elements may have different structures. This merging operation can be formally represented as: where and are two input XML documents to be merged and is the merged XML document; and are the DTDs of and and are absolute location paths (paths for short) in XPath that designate the elements to be merged in and respectively; and are Boolean expressions that are used to control merging of XML elements in and Boolean expression is used to identify XML instances when merging and Also, it is used for merging of XML elements and handling conflicts. It consists of a number of conditional expressions connected by Boolean operator Let be one of the elements whose path is in and one of the elements whose path is in determines if in and in describe the same object. As long as is true, in and in describe the same object and they are merged. We say that in and in are matching elements if they describe the same object. Boolean expression is used to determine if in that does not have a matching in will be incorporated into It consists of several conditional expressions connected by Boolean operator

Fig. 2. The second XML document to be merged.

Example 1. The two input XML documents and in Figures 1 and 2 have different structures. They describe employees by different elements: Employee elements in and Person elements in and are merged into shown in Figure 3. The merge conditions are as follows: /Factory /Department /Employees /Employee. / FactoryInfo /People /Person. (:: Department/@DName = WorkIn/ Unit) (Name = @PName). (:: Department/@DName = WorkIn/Unit). According to the above and Employee elements in and Person elements in are merged into the result XML document Thus, for this example, is any Employee element in and is any Person element in In the above :: Department/@ DName represents the attribute DName of the ancestor Department of Employee element in and WorkIn/ Unit denotes the child Unit of the child WorkIn of Person element in (:: and @ denote an ancestor and an attribute respectively). According to an Employee element in and a Person element in describe the same employee and are merged into an Employee element in if the value of the attribute DName of the ancestor Department of an Employee is the same as the content of the descendant Unit of a Person, and the content of the child Name of an Employee is the same as the value of the attribute PName of a Person. Note that the child Name of an Employee cannot identify an Employee in because two Department elements may have Employee descendants that have the same content for the child Name.

Fig. 3. The resulting single XML document.

According to if there exists a Department in that has an attribute DName whose value is the same as the content of the descendant Unit of a nonmatching Person, this non-matching Person is incorporated into Otherwise, this non-matching Person cannot be incorporated into because no element in can have this Person as a descendant. In relational algebra, a full outer join extracts the matching rows of two tables and preserves non-matching rows from both tables. Analogously, the merging operation defined merges XML documents and that have different structures and creates an XML document It merges in and its matching in according to and It incorporates each modified non-matching in and some modified non-matching elements in based on and Moreover, it incorporates the elements in that do not need merging. Path is the prefix path of path if is the left part of or is equal to For example, is the prefix path of It is obvious that the path of any

Fig. 4. The XML document procedure LeftOuterJoin produces for Example 1.

ancestor of an element is the prefix path of the path of this element. A path is the parent path of another path if it is a prefix path of the other and the other contains one more element name. For Example 1, /Factory/Department/Employees is the parent path of /Factory/Department/Employees/Employee. The algorithm for the merging operation is as follows.

Algorithm xmlmerge
Input: the two XML documents to be merged, their DTDs, the designated paths, and the two Boolean expressions.
Output: the merged XML document.
  Take the root element of the first document and call LeftOuterJoin on it; the result is the XML document generated by procedure LeftOuterJoin.
  Take the root element of that document and call FullOuterJoin on it.
End of algorithm xmlmerge

Algorithm xmlmerge merges the two documents and generates an XML document which contains every element merged from an element to be merged in the first document and its matching element in the second, each modified non-matching element of the first document, and some modified non-matching elements of the second. Also, it incorporates the elements of the first document that do not need merging.


Algorithm xmlmerge calls two recursive procedures, LeftOuterJoin and FullOuterJoin. We explain FullOuterJoin in Section 4. LeftOuterJoin is as follows.

Procedure LeftOuterJoin
  if the path of the current element is not equal to the designated path then
    output the start tag of the element; output all its attributes
    for each child element: if the path of the child is a prefix path of the designated path then call LeftOuterJoin on the child, else copy the child
    output the end tag of the element
  else if the element has a matching element in the second document then
    output the start tag of the element
    for every attribute of the element, call processa1
    for every attribute of the matching element, call processa2
    for every child element of the element, call processc1
    for every child element of the matching element, call processc2
    output the end tag of the element
  else
    output the start tag of the element; output all its attributes
    for every child element: if it has a semantically corresponding attribute that is an attribute of the corresponding elements in the other document, then output an attribute whose name is that of the corresponding attribute and whose value is the content of the child
    for every child element: if it does not have such a semantically corresponding attribute, then copy the child
    output the end tag of the element
End of procedure LeftOuterJoin

Procedure LeftOuterJoin merges in and its matching in and resolves conflicts by calling procedures processa1, processa2, processc1, and processc2, and produces an XML document which contains every element merged from in and its matching in every modified non-matching in and the elements in that do not need merging. For Example 1, XML document that LeftOuterJoin produces is presented in Figure 4.
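As a rough, much-simplified illustration of the matching-and-merging idea (ours, not the paper's LeftOuterJoin/FullOuterJoin, and without conflict handling or structural modification of non-matching elements), the sketch below performs a full-outer-join-style merge of two flat lists of child elements using Python's standard xml.etree; the match and merge callbacks stand in for the Boolean merge condition and for procedures such as processa1 and processc1.

import xml.etree.ElementTree as ET

def full_outer_merge(parent1, parent2, match, merge):
    """Merge the children of two parent elements: matching children are combined
    by the caller-supplied merge function, non-matching children from either
    side are kept unchanged (cf. a full outer join in relational algebra)."""
    result = ET.Element(parent1.tag, dict(parent1.attrib))
    used = set()
    for e1 in list(parent1):
        partner = next((e2 for e2 in parent2 if match(e1, e2)), None)
        if partner is not None:
            used.add(id(partner))
            result.append(merge(e1, partner))     # matching elements are merged
        else:
            result.append(e1)                     # non-matching element of the first document
    for e2 in list(parent2):
        if id(e2) not in used:
            result.append(e2)                     # non-matching element of the second document
    return result

# A match predicate in the spirit of Example 1 (element/attribute names assumed):
# match = lambda emp, per: emp.findtext("Name") == per.get("PName")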

3 Instance Identification

A Skolem function returns a value for an object as the identifier of this object [4]. The computation of the Boolean merge condition has the same effect as a Skolem function. For Example 1, the constructed Skolem function concatenates the attribute DName of the ancestor Department with the child Name of an Employee element, or the descendant Unit with the attribute PName of a Person element, and returns this concatenated value as the identifier of the object. As long as the identifiers of two objects described in the two documents are equivalent, the two objects are the same object.
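A minimal sketch of that identifier construction (ours; the separator and the function name are arbitrary, and any injective encoding of the two components would do):

def object_identifier(department, employee_name):
    """Concatenate the department name (DName / Unit) with the employee name
    (Name / PName), as described above, to obtain a Skolem-style identifier."""
    return f"{department}|{employee_name}"

# Two elements from the two documents describe the same employee exactly when
# their identifiers coincide:
print(object_identifier("Production", "Paul Smith") == object_identifier("Production", "Paul Smith"))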

4 The Generated XML Documents

In LeftOuterJoin is the currently processed element in and it always has the property: is one of the elements in that need merging, or does not need merging but some of the descendants of need merging. Assume is an element in that needs merging, and is an element in that does not need merging and is not a descendant of We consider the relationship between and in There are four cases: (1) is an ancestor of (2) is a sibling of an ancestor of (3) and are siblings. (4) is a descendant of a sibling of For these four cases, is merged with its matching element in and is incorporated into When does not have a matching element in is modified and incorporated into We consider Example 1. The Employee in that has “Paul Smith” as the attribute PName and “Production” as the attribute DName of the ancestor Department is merged from the Employee in that has “Paul Smith” as the child Name and “Production” as the attribute DName of the ancestor Department and the matching Person in that has “Paul Smith” as the attribute PName and “Production” as the descendant Unit. The Employee in that has “Paul Smith” as the child Name and “Sales” as the attribute DName of the ancestor Department is a non-matching Employee. It is modified and incorporated into Its child Name is changed to attribute PName to obey the structure of the merged Employee in Department and Employees do not need merging and they are incorporated into FullOuterJoin incorporates every element in into XML document and modifies some non-matching elements and inserts the modified nonmatching elements into as child elements of some elements whose path is the parent path of FullOuterJoin modifies some non-matching elements in order to resolve conflicts and make the non-matching elements obey the structure of the merged element in Let us examine Example 1. The Person in that has “Alice Bush” as the attribute PName and “Production” as the descendant Unit is a non-matching Person. This non-matching Person in and the Employees in that has the ancestor Department that has the attribute DName with value “Production” make Boolean expression true. Therefore,


this non-matching Person is modified and embodied into as a child element of this Employees. The element name of this non-matching Person is changed to Employee. The child WorkIn of this non-matching Person is modified.

5 Merging XML Elements and Handling Typical Conflicts

First, we rephrase the assumptions about the two XML documents to be merged:
(1) They share many tag names and also have some tags with different tag names.
(2) Two tags that share the same tag name describe the same kind of objects in the real world, but the corresponding elements can have the same structure or different structures.
(3) For tags with different tag names, some of them can still describe the same kinds of objects; in this case, the Boolean merge condition indicates that they describe the same kinds of objects.
(4) For two tags that describe the same kind of objects, the corresponding elements have the same cardinality.
(5) For two elements whose tags describe the same kind of objects, their two attributes have the same attribute type and the same default value if these two attributes have the same attribute name.

Then, we introduce several notions. Elements whose tags describe the same kind of objects in and can be classified into two categories: semantically identical elements and semantically corresponding elements. Two elements in and are semantically identical elements if their tags describe the same kind of objects and they have the same structure. Two semantically identical elements can have different element names. In this case, indicates they describe the same kind of objects. Two elements in and are semantically corresponding elements if their tags describe the same kind of objects but they have different structures. Also, two semantically corresponding elements can have different element names. In this case, indicates they describe the same kind of objects. It is true that in and in are semantically corresponding elements because they actually express the same kind of objects and they describe the same object if they make true. Two attributes in and are said to be semantically identical attributes if they have the same name, and one is an attribute of an element in and the other is an attribute of the semantically identical or corresponding element of this element in Similarly, two semantically identical attributes can have different names. In this case, they are specified as semantically identical attributes in An attribute in one XML file to be merged can have a semantically corresponding element in another XML file to be merged. An attribute and an element are a pair of semantically corresponding attribute and element if the name of this attribute is the same as the name of this element, this attribute is


an attribute of element in one XML file and this element is a child element of the semantically corresponding element of element in another XML file, this attribute is a required attribute and of type CDATA, and this element is specified as a parsed character data element with cardinality 1 and it does not have any attribute. Also, the attribute name and element name of a pair of semantically corresponding attribute and element can be different. In this case, indicates they are a pair of semantically corresponding attribute and element. We present the method for merging elements and handling conflicts. Conflicts may emerge when LeftOuterJoin merges in and its matching in into an element in Let be an attribute of and an attribute of Let be a child element of and a child element of Typical conflicts are: conflicts between and conflicts between and conflicts between and conflicts between or a descendant of and or a descendant of and conflicts between or a descendant of and an ancestor of If attribute of in has a semantically identical attribute that is an attribute of in and its semantically identical attribute should be merged into an attribute. If and its semantically identical attribute are consistent with each other, redundancy is eliminated by merging them into one attribute; otherwise, a conflict is indicated in the merged attribute. Similarly, if attribute of in has a semantically corresponding element that is a child element of in and its semantically corresponding element are merged into an attribute. Procedure processa1 accomplishes these tasks. In Example 1, the child element Name of Employee in and the attribute PName of Person in are semantically corresponding element and attribute because (Name = @PName) is specified in Boolean expression They are combined into the attribute PName of the merged Employee element in The relationship between a descendant of and a descendant of is illustrated in Figure 5 where (e) shows no correspondence of an element and its semantically identical or corresponding element is found. Assume that is a descendant of is a descendant of and and are semantically corresponding or identical elements. Based on the assumptions about the two XML documents to be merged, and have the same cardinality. If the cardinality is not greater than 1, and are merged into an element and conflicts between them are reported. Otherwise, and cannot be simply merged into an element. When and describe the same object, they are merged into an element; conversely, both and are incorporated into the merged element in Moreover, when and are semantically corresponding elements that represent the same object, if has some attributes and/or descendants that does not have, an element that has both the attributes and descendants of and the extra attributes and/or descendants of is incorporated into the merged element in as a descendant. Recursive procedure processc1 is responsible for completing the above tasks. We consider Example 1 again. The child Age of Employee in and the child Age of Person in are semantically identical elements with cardinality 1. They


Fig. 5. The relationship between a descendant of the element being merged and a descendant of its matching element

are combined into the child Age of the merged Employee in reports a conflict: the child Age of the merged Employee in has content The content is an or-value and it implies it is not clear which one is the correct one [7]. The child Contact of Employee in contains Phone, Address, Email child elements. The child Phone of Contact and the child Phone of Person are semantically identical elements. The cardinality of Phone is greater than 1, so the child Phone of the child Contact and the child Phone of Person are usually fused into the Phone child elements of the child Contact of the merged Employee element in The child Address of the child Contact of Employee and the child Address of Person are semantically corresponding elements with cardinality 1. There are no conflicts between them, and the child Address of Person has a child PostCode that the child Address of Contact does not have, and as a result, this child PostCode is added to the child Address of the child Contact of the merged Employee in The child Email of the child Contact of Employee is embodied in the child Contact of the merged Employee in Assume that descendant of has a semantically corresponding or identical element that is an ancestor of in To deal with two solutions are possible. One is to simply include into the merged element in This results in a typical conflict: a conflict between or a descendant of and an ancestor of Another is to simply exclude This also has a problem: if contains some descendants that are not semantically corresponding or identical elements of any ancestors of the information about these descendants of is lost in the merged element in It is appropriate to reconcile these two opposing solutions by modifying and incorporating this modified into the merged element in Recursive procedure processc2 carries out the tasks described above. Let us examine Example 1. The child WorkIn of Person in has Factory, Unit, and Group child elements. The child Factory of the child WorkIn of Person in and the ancestor Factory of Employee in are semantically corresponding


elements. Also, the child Unit of the child WorkIn of Person in and the ancestor Department of Employee are semantically corresponding elements because (::Department/@DName = WorkIn/ Unit) is specified in Boolean expression Consequently, WorkIn that has only child element Group is included into the merged Employee in as a child element.
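To make the conflict handling described above concrete, the following Python sketch illustrates, under our own assumptions rather than the authors' implementation, how two semantically identical leaf values or attributes could be merged, with an or-value recording a conflict. The names PName and Age come from Example 1; the "|" separator, the helper names, and the sample values are hypothetical.

# Illustrative sketch of conflict handling when two semantically identical
# leaf elements or attributes are merged (assumed or-value separator "|").
def merge_values(v1, v2):
    # Consistent values collapse into one; inconsistent values are kept
    # as an "or-value" so the conflict is visible in the merged document.
    if v1 == v2 or v2 is None:
        return v1
    if v1 is None:
        return v2
    return f"{v1}|{v2}"          # or-value: it is unclear which one is correct

def merge_attributes(attrs1, attrs2):
    # Semantically identical attributes (same name) are merged pairwise.
    merged = dict(attrs1)
    for name, value in attrs2.items():
        merged[name] = merge_values(merged.get(name), value)
    return merged

# In the spirit of Example 1: the child Age of Employee and the child Age of
# the matching Person disagree, so the merged Age becomes an or-value.
print(merge_values("30", "32"))                      # -> "30|32"
print(merge_attributes({"PName": "Paul Smith"},
                       {"PName": "Paul Smith", "Age": "32"}))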

6

A Merge Template XML File

In our implementation, a merge template XML file is created to express and Figure 6 shows an example merge template XML file for Example 1 where MergeTemplate has three child elements: P1, P2, and Key. P1 and P2 indicate the paths of elements to be merged in and respectively. Key gives the information for identifying XML instances and handling typical conflicts. The order of element names in or is significant. The first one is the name of the root element of the corresponding XML document and the last

Fig. 6. An example merge template XML file.

one indicates the name of the elements to be merged. Moreover, each pair of consecutive element names in a path is associated with a pair of a parent and a child in the corresponding XML document, and the child element in each pair of a parent and a child in associated with a pair of consecutive element names in is the only kind child that needs merging or has descendant elements that require merging. All these characteristics are used to support recursive processing of XML elements in and merging of designated elements in and Each child Factor of Key describes a conditional expression in and a Factor that has a Selected attribute with value “Yes” also describes a conditional expression in In Sections 2 and 3, we assume that XML data in and XML data in specified in each conditional expression in have the same representations. In fact, they may have different formats. We define Boolean functions to solve this problem. Consequently, the mechanism presented combines Skolem function and user-defined Boolean functions to identify XML instances. Boolean function samename specified in Figure 6 returns true if and actually refer to the same name although they have different formats.
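Since Figure 6 is not reproduced here, the sketch below only illustrates how such a merge template might be read in Python. The element names MergeTemplate, P1, P2, Key, Factor, the attribute Selected, and the function name samename come from the description above; the concrete template content, the path values, and the name-normalization rule are assumptions made for illustration.

import xml.etree.ElementTree as ET

# Hypothetical merge template with the structure described above: P1 and P2
# give the paths of the elements to be merged, Key lists the conditional factors.
TEMPLATE = """
<MergeTemplate>
  <P1>Departments/Department/Employees/Employee</P1>
  <P2>Persons/Person</P2>
  <Key>
    <Factor Selected="Yes">samename(Name, @PName)</Factor>
  </Key>
</MergeTemplate>
"""

def samename(n1, n2):
    # Assumed user-defined Boolean function: two names match even if their
    # textual formats differ (extra spaces, different letter case).
    norm = lambda s: " ".join(s.lower().split())
    return norm(n1) == norm(n2)

root = ET.fromstring(TEMPLATE)
path1 = root.findtext("P1").split("/")   # element names from the root to the merged element
path2 = root.findtext("P2").split("/")
factors = [f.text for f in root.find("Key").findall("Factor")
           if f.get("Selected") == "Yes"]
print(path1, path2, factors)
print(samename("Paul  Smith", "paul smith"))   # -> True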

7

Related Work

Bertino et al. point out that XML data integration involves reconciliation at data model level, data schema level, and data instance level [2]. In this paper, we


mainly focus on reconciliation at data instance level to merge XML documents that have different structures. A lot of research in semantic integration of XML data has been conducted [3, 10]. Castano et al. propose a semantic approach to integration of heterogeneous XML data by building a domain ontology [3]. Rodríguez-Gianolli et al. present a framework that can provide a tool to integrate DTDs into a common conceptual schema [10]. Several systems for processing XML or XML streams are developed [8, 9]. The Niagara system focuses on providing query capabilities for XML documents and can handle infinite streams [9]. Lore is a semi-structured data repository that builds a database system to query XML data [8]. The merging operation defined in this paper is not available in any of these works or systems. A lot of research in merging or integration of XML data that has similar or identical structures has been done [6, 7]. A data model for semi-structured data is introduced and an integration operator is defined in [7]. This operator integrates similarly structured XML data. Lindholm designs a 3-way merging algorithm for XML files that comply with an identical DTD [6]. The mechanism proposed in this paper can merge two XML documents that have different structures. Merge Templates that specify how to recursively combine two XML documents are introduced by Tufte et al. [12]. Our work is different from their work in several aspects. First, the Merge operation proposed by Tufte et al. combines two similarly structured XML documents to create aggregates over streams of XML fragments. Second, a method for merging XML elements and handling typical conflicts is proposed in this paper. When merging XML documents, it is critical to identify XML data instances representing the same object of the real world. Albert uses the term instance identification to refer to this problem [1]. This problem has been investigated [1, 5]. These papers propose different methods to deal with this problem. A universal key is used in [1]. Lim et al. define the union of keys of the data sources [5]. However, these works deal with databases and support typed data. Skolem function is introduced in [4]. It returns a value for an object as the identifier of this object. Saccol et al. present a proposal for instance identification based on Skolem function [11]. The mechanism presented in this paper combines Skolem function and Boolean functions defined for designers [11] to identify XML instances.

8

Conclusion

We have defined a merging operation over XML documents that is similar to a full outer join in relational algebra. It can merge two XML documents with different structures. We have implemented a prototype to merge XML documents. We plan to investigate other operations over XML documents, such as intersection and difference.


References
1. J. Albert. Data Integration in the RODIN Multidatabase System. In Proceedings of the First IFCIS International Conference on Cooperative Information Systems (CoopIS'96), pages 48–57, Brussels, Belgium, June 19-21 1996.
2. E. Bertino and E. Ferrari. XML and Data Integration. IEEE Internet Computing, 5(6):75–76, 2001.
3. S. Castano, A. Ferrara, G. S. Kuruvilla Ottathycal, and V. De Antonellis. Ontology-based Integration of Heterogeneous XML Datasources. In Proceedings of the 10th Italian Symposium on Advanced Database Systems - SEBD'02, pages 27–41, Isola d'Elba, Italy, June 2002.
4. J. L. Hein. Discrete Structures, Logic, and Computability. Jones and Bartlett Publishers, USA, 1995.
5. E. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity Identification in Database Integration. In Proceedings of the Ninth International Conference on Data Engineering, pages 294–301, Vienna, Austria, April 19-23 1993. IEEE Computer Society.
6. T. Lindholm. A 3-way Merging Algorithm for Synchronizing Ordered Trees – the 3DM Merging and Differencing Tool for XML. Master's thesis, Helsinki University of Technology, Department of Computer Science, 2001.
7. M. Liu and T. W. Ling. A Data Model for Semi-structured Data with Partial and Inconsistent Information. In Proceedings of the Seventh International Conference on Extending Database Technology (EDBT 2000), pages 317–331, Konstanz, Germany, March 27-31 2000. Springer.
8. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54–66, September 1997.
9. J. Naughton, D. DeWitt, D. Maier, et al. Niagara Internet Query System. IEEE Data Engineering Bulletin, 24(2):27–33, June 2001.
10. R. Rodríguez-Gianolli and J. Mylopoulos. A Semantic Approach to XML-based Data Integration. In Proceedings of the 20th International Conference on Conceptual Modelling (ER), pages 117–132, Yokohama, Japan, November 27-30 2001.
11. D. B. Saccol and C. A. Heuser. Integration of XML Data. In Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web: VLDB 2002 Workshop EEXTT and CAiSE 2002 Workshop DIWeb, pages 68–80. Springer, 2003.
12. K. Tufte and D. Maier. Merge as a Lattice-Join of XML Documents. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

Schema-Based Web Wrapping

Sergio Flesca and Andrea Tagarelli

DEIS, University of Calabria, Italy
{flesca,tagarelli}@deis.unical.it

Abstract. An effective solution to automate information integration is represented by wrappers, i.e. programs which are designed for extracting relevant contents from a particular information source, such as web pages. Wrappers allow such contents to be delivered through a self-describing and easily processable representation model. However, most existing approaches to wrapper design focus mainly on how to generate extraction rules, while they do not weigh the importance of specifying and exploiting the desired schema of the extracted information. In this paper, we propose a new wrapping approach which encompasses both extraction rules and the schema of the required information in wrapper definitions. We investigate the advantages of suitably exploiting extraction schemata, and we define a clean declarative wrapper semantics by introducing (preferred) extraction models for source HTML documents with respect to a given wrapper.

1

Introduction

Information available on the Web is mainly encoded in the HTML format. Typically, HTML pages follow source-native and fairly structured styles, and are thus ill-suited for automatic processing. However, the need for extracting and integrating information from different sources into a structured format has become a primary requirement for many information technology companies. For example, one would like to monitor appealing offers about books concerning specific topics; here, an interesting offer may consist of finding highly rated books. In this context, an effective solution for automating information integration is the exploitation of wrappers. Essentially, wrappers are programs designed for extracting relevant contents from a particular information source (e.g. HTML pages), and for delivering such contents through a self-describing and easily processable representation model. XML [19] is widely known as the standard for representing and exchanging data through the web, and therefore successfully fulfills the above requirements for a wrapping environment. Generally, a wrapper consists of a set of extraction rules which are used both to recognize relevant content portions within a document and to map them to specific semantics. Several wrapping technologies have been developed recently: we mention here TSIMMIS [8], FLORID [15], DEByE [13], W4F [18], XWrap [14], RoadRunner [2], and Lixto [1] as exemplary systems proposed by the research community. Traditional issues concerning wrapper systems are the


development of powerful languages for expressing extraction rules and the capability of generating these rules with the lowest human effort. Such issues can be addressed by a number of approaches, such as wrapper induction based on learning from annotated examples [6, 9, 12, 17] and the visual specification of wrappers [1]. The first approach suffers from negative theoretical results on the expressive power of learnable extraction rules, while visual wrapper generation allows the definition of more expressive rules [7]. However, although the schema of the required information should be carefully defined at the time of wrapper generation, most existing wrapper designing approaches focus mainly on how to specify extraction rules. Indeed, while generating wrappers, such approaches ignore the potential advantages coming from the specification and usage of the extraction schema, that is the desired schema of the documents to be created to contain the extracted information. A specific extraction schema can aid to recognize and discard irrelevant or noisy information from documents resulting from the data extraction, thus improving the accuracy of a wrapper. Furthermore, the extracted information can be straightforwardly used in the data integration process, since it follows a specific organization best reflecting user requirements. As a running example, consider an excerpt of Amazon page displayed in Fig.1, and suppose we would like to extract the title, the author (s), the customer rate (if available), the price proposed by the Amazon site, and the publication year, for any book listed in the page. The extraction schema for the above information can be suitably represented by the following DTD:

It is easy to see that such a schema allows the extraction of structured information with multi-value attributes (operator +), missing attributes (operator ?), and variant attribute permutations (operator As mentioned above, existing wrappers are not able to specify and exploit extraction schemata. Some full-fledged systems describe a hierarchical structure of the information to be extracted [1, 17], and they are mostly capable of specifying constraints on the cardinality of the extracted sub-elements. However, no such system allows complex constraints to be expressed: for instance, it is not possible to require that element customer_rate may occur alternatively to element no_rate. As a consequence, validating the extraction of elements with complex contents is not allowed. Two preliminary attempts of exploiting information on extraction schema have been recently proposed in the information extraction [10] and wrapping [16] research areas. In the former work, schemata represented as tree-like structures do not allow alternative subexpressions to be expressed. Moreover, a heuristic approach is used to make a rule fit to other mapping rule instances: as a con-


Fig. 1. Excerpt of a sample Amazon page from www.amazon.com

sequence, rule refinement based on user feedback is needed. In [10], DTD-style extraction rules exploiting enhanced content models are used in both learning and extracting phases. [3] is related to a particular direction of research: turning the schema matching problem into an extraction problem based on inferring the semantic correspondence between a source HTML table and a target HTML schema. The proposed approach differs from the previous ones related to schema mapping since it entails elements of table understanding and extraction ontologies. In particular, table understanding strategies are exploited to form attribute-value pairs, then an extraction ontology performs data extraction. It is worth noticing that all the above approaches lack a rigorous formalism for the specification of extraction rules. Moreover, they do not define any model for the construction of the documents into which the extracted information has to be inserted. Our contributions can be summarized as follows. We propose a novel wrapping approach which improves standard approaches based on hierarchical extraction by introducing the presence of extraction schema in the wrapper generation. Indeed, a wrapper is defined by specifying, besides a set of extraction rules, the desired schema of the XML documents to be built from the extracted information. The schema availability not only allows the extracted XML documents to be effectively used for further processing, but also allows the exploitation of simpler rules for extracting the desired information. For instance, to extract customer_rate from a book, a standard approach should express a rule extracting the third row of a book table only if this row contains an image displaying


the “rate”. The presence of the extraction schema allows the definition of two simple rules, one for customer_rate element and one for its rate subelement: the former extracts the third row of the book table, while the latter extracts an image. Moreover, our approach in principle does not rely on any particular form of extraction rules, that is any preexisting kind of rules can be easily plugged in; however, we show that XPath extraction rules are particularly suitable for our purposes. Finally, we define a clean declarative semantics of schema-based wrappers: this is accomplished by introducing the concept of extraction models for source documents with respect to a given wrapper, and by identifying a unique preferred model.
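As an illustration of why the schema makes the individual rules simpler, the following sketch expresses the two rules mentioned above as plain XPath strings and evaluates them with the lxml library; the HTML fragment, the exact paths, and the use of lxml are assumptions made for illustration, not the rules or the machinery used by the authors.

from lxml import html

# Hypothetical excerpt of a single book table, loosely modelled on Fig. 1.
BOOK_HTML = """
<table>
  <tr><td>Extraction Systems</td></tr>
  <tr><td>by A. Author</td></tr>
  <tr><td><img src="stars-4-5.gif"/> (12 customer reviews)</td></tr>
</table>
"""

book = html.fromstring(BOOK_HTML)

# Rule attached to customer_rate: simply take the third row of the book table.
customer_rate_rule = ".//tr[3]"
# Rule attached to its rate subelement: simply take an image inside that row.
rate_rule = ".//img"

for row in book.xpath(customer_rate_rule):
    print([img.get("src") for img in row.xpath(rate_rule)])   # -> ['stars-4-5.gif']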

2

Preliminaries

Any XML document can be associated with a document type definition (DTD) that defines the structure of the document and what tags might be used to encode the document. A DTD is a tuple where: i) El is a finite set of element names, ii) P is a mapping from El to element type definitions, and iii) is the root element name. An element type definition is a one-unambiguous regular expression defined as follows1:

where #PCDATA is an element whose content is composed of character data, EMPTY is an element without content, and ANY is an element with generic content. An element type definition specifies an element-content model that constrains the allowed types of the child elements and the order in which they are allowed to appear. A recursive DTD is a DTD with at least a recursive element type definition, i.e. an element whose definition refers to itself or an element that can be its ancestor. In other terms, a recursive DTD admits documents such that an element may contain (directly or indirectly) an element of the same type. For the sake of presentation clarity, we refer to DTDs which do not contain attribute lists. As a consequence, we consider a simplified version of XML documents, whose elements have no attributes. In our domain, the application of a wrapper to a source document can produce several candidate document results. A desirable property of a wrapping framework should be that of producing results that are ordered with respect to some criteria in order to identify a unique preferred extraction document. We accomplish this objective by exploiting partially ordered regular expressions [4], i.e. an extension of regular expressions where a partial order between strings holds. A partially ordered language over a given alphabet is a pair where L is a (standard) language over (a subset of and 1

The symbol denotes different productions with the same left part. Here we do not consider mixed content of elements [19].


is a partial order on the strings of L. Ordered regular expressions are defined by adapting classical operations for standard languages to partially ordered languages. In particular, a new set of strings and a partial order on this set can be defined for the operations of prioritized union, concatenation, and prioritized closure between languages [4]. Let an alphabet be given. The ordered regular expressions over it, and the sets that they denote, are defined recursively as follows: 1. the empty expression is a regular expression and denotes the empty language; 2. each symbol of the alphabet is a regular expression and denotes the corresponding singleton language; 3. if two ordered regular expressions denote two languages respectively, then i) their prioritized union denotes the prioritized union language, ii) their concatenation denotes the concatenation language, and iii) the prioritized closure of an expression denotes the prioritized closure language.

Proposition 1. Let be a one-unambiguous ordered regular expression. The language is linearly ordered.
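A DTD in the sense used here can be represented quite directly in code. The following sketch is a minimal illustration of the tuple view given above (a set of element names, a mapping to element type definitions, and a distinguished root); the book-related element names are borrowed from the Introduction, while the string notation for content models and the naive recursion check are assumptions of this sketch.

# A DTD as a triple: element names, mapping to element type definitions, root.
ELEMENT_NAMES = {"books", "book", "title", "author", "customer_rate",
                 "no_rate", "price", "year"}

# Content models written as plain strings; '+' = one or more, '?' = optional,
# '|' = alternative subexpressions, '#PCDATA' = character data.
CONTENT_MODELS = {
    "books":         "book+",
    "book":          "title, author+, (customer_rate | no_rate), price?, year",
    "title":         "#PCDATA",
    "author":        "#PCDATA",
    "customer_rate": "#PCDATA",
    "no_rate":       "EMPTY",
    "price":         "#PCDATA",
    "year":          "#PCDATA",
}

ROOT = "books"

DTD = (ELEMENT_NAMES, CONTENT_MODELS, ROOT)   # (El, P, root)

def is_recursive(dtd):
    # A DTD is recursive if some element can (directly or indirectly) contain
    # an element of the same type; this naive check only detects direct
    # self-references and is part of the sketch's simplifying assumptions.
    _, models, _ = dtd
    return any(name in model for name, model in models.items())

print(is_recursive(DTD))   # -> False for this sketch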

3

Schema-Based Wrapping Framework

In the following we describe our proposal to extend traditional hierarchical wrappers in such a way they can effectively benefit from exploiting extraction schemata. To this purpose, we do not focus on a particular extraction language, but investigate how to build documents, for the extracted information, that are valid with respect to a predefined schema. Indeed, our approach can profitably employ different kinds of extraction rules. Therefore, before describing the schema-based wrapping approach in more detail, we introduce a general notion of extraction rule. We assume any source HTML document is represented by its parse tree, also called as XHTML document. Generally, each extraction rule works on a sequence of nodes of an HTML parse tree, providing a sequence of sequences of nodes. Notice that working on a tree-based model for HTML data is not a strong requirement, and can be easily relaxed. However, for the sake of simplicity, we do not refer to string-based extraction rules like those introduced in [1, 11, 17]. Definition 1 (Extraction rule). Given an HTML parse tree doc and a sequence of nodes in doc, an extraction rule is a function associating with a sequence S of node sequences. Extraction rules so defined can be seen as a generalization of Lixto extraction filters. The main difference with respect to Lixto filters is that our rules allow the extraction of non-contiguous portions of an HTML document. However, an extraction rule is not able to contain references to elements extracted by different rules. Moreover, we define a special type of extraction rules which turn out to be particularly useful to address the problem of wrapper evaluation [5].
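In code, an extraction rule in the sense of Definition 1 is just a function from one node sequence to a sequence of node sequences. The sketch below fixes this shape with Python type hints; the tiny sample rule (grouping the children of each input node) and the .children attribute it assumes are purely hypothetical.

from typing import Callable, List

# A "node" is left abstract here; any parse-tree node object would do.
Node = object
NodeSequence = List[Node]

# Definition 1 as a type: a rule maps a sequence of nodes of the parse tree
# to a sequence of node sequences.
ExtractionRule = Callable[[NodeSequence], List[NodeSequence]]

def children_rule(nodes: NodeSequence) -> List[NodeSequence]:
    # Hypothetical rule: for each input node, return its children as one
    # extracted sequence (assumes nodes expose a .children attribute).
    return [list(getattr(n, "children", [])) for n in nodes]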


Definition 2 (Monotonic extraction rule). Given a sequence of nodes in an HTML parse tree doc, a monotonic extraction rule is a function associating with it a sequence S of node sequences such that, for each extracted sequence and for each node in it, there exists a node in the input sequence which is an ancestor of that node. Let us now introduce our notion of wrapper. A wrapper is essentially composed of: i) the desired schema of the information to be extracted from HTML documents, and ii) a set of extraction rules. As in most earlier approaches (such as [1,17]), the extraction of the desired information proceeds in a hierarchical way. The formal definition of a wrapper is provided below. Definition 3 (Wrapper). Let a DTD, a set of extraction rules, and a function associating each pair of elements with a rule be given. A wrapper is defined as the triple consisting of the DTD, the set of extraction rules, and this function.

In practice, a wrapper associates the root element of the DTD with the root of the HTML parse tree to be processed, then it recursively builds the content of by exploiting the extraction rules to identify the sequences of nodes that should be extracted. In other terms, once an element has been associated with a sequence of nodes of the source document, an extraction rule is applied to to identify the sequences that can be associated with the children of In order to devise a complete specification of a wrapper, we further propose an effective implementation of extraction rules based on the XPath language [20].
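The recursive construction just described can be outlined as follows. This is not the authors' evaluation algorithm (which is studied in [5]); it is a simplified sketch under the assumption that the wrapper supplies, for every (parent element, child element) pair of the DTD, the extraction rule to apply, via the hypothetical helpers dtd_children, rule_for and extract_text.

def build_element(element_name, node_seq, dtd_children, rule_for, extract_text):
    """Recursively build the content of an XML element from a node sequence.

    dtd_children(el)        -> child element names allowed by the DTD for el
    rule_for(el, child)     -> extraction rule for the (el, child) pair
    extract_text(node_seq)  -> textual content when an element is a leaf
    """
    children = dtd_children(element_name)
    if not children:                         # leaf element: keep the text
        return {element_name: extract_text(node_seq)}
    content = []
    for child in children:
        rule = rule_for(element_name, child)
        # The rule identifies the node sequences to associate with this child;
        # each of them is then processed recursively.
        for child_seq in rule(node_seq):
            content.append(build_element(child, child_seq,
                                         dtd_children, rule_for, extract_text))
    return {element_name: content}

# The wrapper itself would start from the DTD's root element and the root of
# the HTML parse tree, e.g.:
#   doc = build_element(root_el, [parse_tree_root], dtd_children, rule_for, text_of)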

3.1

XPath Extraction Rules

The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an ordered collection of nodes without duplicates, i.e. a sequence of nodes. In this work, we consider XPath expressions with variables. The evaluation of an XPath expression occurs with respect to a context and a variable binding. Variable bindings represent mappings from variable names to sequences of objects. Formally, given a variable binding and a variable name we denote with the sequence associated to by Moreover, given two disjoint variable bindings and we denote with a variable binding such that, for each (resp. if (resp. is defined, otherwise is undefined. Given an XPath expression an XHTML document doc, a sequence of nodes and a variable binding denotes the sequence of nodes provided by when is evaluated on doc, starting from and according to The relation between the result of an XPath expression and a variable is represented by the concept of XPath predicate, which is formally defined as follows. Definition 4 (XPath predicate). Given a set of variables and an XPath expression using the variables we denote an XPath predicate with Moreover, we denote a subsequence XPath predicate with


Given an XHTML document doc and a variable binding an XPath predicate is true with respect to if Analogously, a subsequence XPath predicate is true with respect to if is a subsequence of Moreover, we consider an order on node sequences which is defined according to the document order. Given two sequences and precedes if there exists an index such that and for each or is a prefix of Given an XHTML document doc, a variable binding and a subsequence XPath predicate we denote with the sequence of node sequences such that for each and is true with respect to for each XPath predicates are the basis of more complex concepts, such as extraction filters and extraction rules. An extraction filter is defined over both a target predicate and a set of other predicates which act as filter conditions. Definition 5 (XPath extraction filter). Given a set of variables an XPath extraction filter is defined as a tuple where: is a target predicate, that is a subsequence XPath predicate defining variable on the empty set of variables; is a conjunction of predicates defined on variables The application of an XPath extraction filter to a sequence of nodes yields a sequence of node sequences where: 1) for each 2) for each and 3) there exists a substitution which is disjoint with respect to such that each XPath predicate in is true with respect to We devise any extraction rule as a composition of two kinds of filters: extraction filters and external filters. The latter specify conditions on the size of the extracted sequences. In particular, we consider the following external filters: an absolute size condition filter as specified by bounds (min, max) on the size of a node sequence that is is true if a relative size condition filter rs specified by policies {minimize, maximize}, that is, given a sequence S of node sequences and a sequence is true if rs = minimize (resp. rs = maximize) and there not exists a sequence such that (resp. Definition 6 (XPath extraction rule). An XPath extraction rule is defined as where is a disjunction of extraction filters, as and rs are external filters. For any sequence of nodes, the application of an XPath extraction rule to yields a sequence of node sequences which is constructed as follows. Firstly, we build the ordered sequence that is the sequence obtained by merging the sequences produced by each extraction filter applied to Secondly, we derive the sequence of node

sequences obtained by removing from the merged sequence all the sequences for which the absolute size condition as is false. Finally, we obtain the resulting sequence by removing from it all the sequences for which the relative size condition rs is false.

Example 1. Suppose we are given an extraction rule where filters and are defined respectively as:

Consider now the document tree sketched below, and suppose we apply the rule to the sequence of nodes

The target predicate of the first filter returns the sequence [[5], [5,7], [5,7,8], [7], [7,8], [8]], which is turned into [[5], [5,8], [8]] by applying the conditions of its filter predicates. Analogously, the target predicate of the second filter returns the sequence [[11], [11,13], [11,13,14], [11,13,14,16], [13], [13,14], [13,14,16], [14], [14,16], [16]], which is simplified to [13,14]. The union of the two results is then computed. By applying the external filters it can be straightforwardly derived that the resulting sequence is [[5,8], [13,14]].
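The role of the external filters in this computation can be mimicked with ordinary list processing. The sketch below reproduces only the final step on the sequences quoted above; the particular bounds used for illustration (an absolute size filter requiring exactly two nodes) are an assumption, since the rule's actual filters are not shown here.

# Candidate node sequences after the two extraction filters of Example 1.
candidates = [[5], [5, 8], [8], [13, 14]]

def apply_absolute_size(seqs, min_size, max_size):
    # Keep only the sequences whose size lies within the given bounds.
    return [s for s in seqs if min_size <= len(s) <= max_size]

def apply_relative_size(seqs, policy):
    # Keep only the sequences of minimal (or maximal) size among the survivors.
    if not seqs:
        return seqs
    target = min(map(len, seqs)) if policy == "minimize" else max(map(len, seqs))
    return [s for s in seqs if len(s) == target]

# With an assumed absolute size filter (2, 2) the result of Example 1 is obtained.
result = apply_relative_size(apply_absolute_size(candidates, 2, 2), "minimize")
print(result)   # -> [[5, 8], [13, 14]]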

4

Wrapper Semantics

In this section we provide a clean declarative semantics for schema-based wrappers. This is accomplished by introducing the notion of extraction models for source HTML documents with respect to a given wrapper. Extraction models are essentially collections of extraction events. An extraction event models the extraction of a subsequence by means of an extraction rule which is applied to a context, that is a specific sequence of nodes. However, not all the extraction events turn out to be useful for the construction of the XML document dedicated to contain the extracted information: extraction models are able to identify those events that can be profitably exploited in building an XML document.


4.1

Extraction Events and Models

The notion of extraction model relies strictly on the notion of extraction event. An extraction event happens whenever an extraction rule is applied. We assume that each extraction event is associated with a unique identifier. Definition 7 (Extraction event). Given a target element name and an associated node sequence an extraction event is a tuple where id and pid denote the identifiers of the current and parent extraction event, respectively, and pos denotes the position of relative to event pid. In order to build an XML document to be extracted by a wrapper, we have to consider sets of extraction events. However, only some sets of extraction events correspond to a valid document. Therefore, we have to carefully characterize such sets of extraction events. To this purpose, let us introduce some preliminary definitions on properties of sets of extraction events. To begin with, a set of extraction events is said to be well-formed if the following conditions hold: there not exist two events and in such that i.e. an extraction event must have a unique identifier; there not exist two events and such that i.e. two sibling events cannot refer to the same position; there not exist two events and such that i.e. two identical node sequences cannot be associated to the same element. Notations for handling well-formed sets of extraction events are introduced next. Given a set of extraction events and a specific event identified by pid, we denote with the set containing all the extraction events which are children of i.e. We further describe two simple functions, namely elnames and linearize, that provide flat versions of a set of extraction events. Given an event identifier pid and a set of extraction events, we denote with the list of extraction events in such that the events are ordered by position. Moreover, we denote with the sequence of element names corresponding to formally, where Extraction events need to be characterized with respect to their conformance to a given regular expression specifying an element type. Given a regular expression on an alphabet of element names, and an event identifier pid, we say that is valid for if spells i.e. the string formed by concatenating element names in belongs to the language We are now able to characterize the validity of a set of extraction events with respect to the definition of an element. Let be a DTD and be a wrapper. We say that a well-formed set of extraction events is valid for an element name if the following conditions hold:


Fig. 2. Sketch of HTML parse tree of page in Fig.1

or or for each extraction event is valid for and for each event there not exist two extraction events in such that and and contains extraction events such that

and and does not precede

in

An extraction model is essentially a well-formed set of extraction events that conform to the definition of all the elements appearing in the DTD specified within a wrapper. Moreover, an extraction model can be represented by a tree of extraction events. Definition 8 (Extraction Model). Let be a DTD, be a wrapper, doc be an XHTML document, and be a well-formed set of extraction events. is said to be an extraction model of doc with respect to (for short, is an extraction model of if: corresponds to a tree where N is the set of extraction events, E is formed by pairs such that and is the parent event of and is a function associating an identifier to each extraction event; for each extraction event is valid for Example 2. Consider again the Amazon page displayed in Fig.1, and suppose that such a page is subject to a wrapper based on the DTD presented in the Introduction. The extraction rules used by this wrapper are reported on the third column of Table 1; we assume that (1,1) and minimize are adopted as default


external filters. The first column reports the target element names associated to each rule, whereas the parent element names can be deduced by the DTD. Extraction events occurring in the example model are reported on the second column of the table. For the sake of simplicity, we focus only on a portion of the document doc corresponding to the page of Fig.1; the parse tree associated with doc is sketched in Fig.2. Therefore, we consider only some events, according to the portion of page we have chosen. Event occurs implicitly in the model under consideration, thus it is not extracted by any rule. Offered books are stored into a unique table, which is extracted by event using filter This filter fulfills the requirement that the book table has to be preceded by a simpler table containing a selection list. Information about any book is stored into a separate table which consists of two parts: the first one contains a book picture, while the second one is another table divided into eight rows, one for each specific information about the book. Let us consider the first instance of book, whose subtree is rooted in node 25 of the parse tree. The book, which is identified by event using filter has information on title, (one) author, year, customer rate, and price. The set of events which are children of is built as Even though information on customer rate is available from the first instance of book, we can observe that event happens for element no_rate: however, such an event cannot appear in the model, because would not be a valid content for an element of type book. It is worth noting that rules for extracting information on both availability and unavailability of customer rate have been intentionally defined as identical in this example. However, both kinds of extraction events occur only in any book having customer rate, while only event for element no_rate is extracted from any book which has not customer rate. This happens since it is not possible that an event for rate occurs as a child of an event for no_rate. An extraction model is implicitly associated with a unique XML document, which is valid with respect to a previously specified schema. Given a DTD a wrapper an XHTML document doc, and an extraction model of we define the function buildDoc which takes and an event as input and returns the XML fragment relative to For any event is recursively defined as follows: if then if then if is a regular expression then where In the above definitions, denotes the concatenation of the string values of the nodes in and symbol‘+’ is used to indicate the concatenation of strings. Moreover, we denote with the application of buildDoc to the root extraction event in Definition 9 (Extracted XML document). Given a wrapper and an XHTML document doc, an XML document xdoc is extracted from doc


by applying the wrapper, if there exists an extraction model of doc such that the document built from that model by buildDoc is xdoc. Moreover, we denote by the set of extracted documents the set of all the XML documents xdoc that can be obtained in this way. Theorem 1. Let a wrapper and an XHTML document doc be given. If the DTD is not recursive and all the extraction rules are monotonic, then: 1. each extraction model of doc is finite, and its cardinality is bounded by a polynomial with respect to the size of doc; 2. the set of extracted documents is finite.

4.2

Preferred Extraction Models

Extraction models provide us a characterization of the set of XML documents that encode the information extracted by a wrapper from a given XHTML document doc, i.e. the set Each document in this set represents a candidate result of the application of to doc. However, this should not be a desirable property for a wrapping framework. In this section we investigate the requirements to identify a unique document which is preferred with respect to all the candidate XML extracted documents.


Firstly, we introduce an order relation between sets of extraction events having the same parent element type. Consider two extraction models and of and two events and We say that precedes (hereinafter referred to as if the following conditions hold: precedes in the language or is equal to and there exists a position pos such that for each if and then and and if and then precedes in or is equal to and there exists a position pos such that for each if and then and and and if and then The above order relation allows us to define an order relation between sets of extraction events, and consequently between extracted documents. Given two extraction models and of we have that precedes if Moreover, given two XML documents and generated from and respectively, we say that precedes if, for each model of there exists a model of such that Definition 10 (Preferred extracted document). Let be a DTD, be a wrapper, doc be an XHTML document and xdoc be an XML document in xdoc is preferred in if, for each document holds. Theorem 2. Let be a DTD, be a wrapper, and doc be an XHTML document. There exists a unique preferred extracted document pxdoc in

5

Conclusions and Future Work

In this work, we posed the theoretical basis for exploiting the schema of the information to be extracted in a wrapping process. We provided a clean declarative semantics for schema-based wrappers, through the definition of extraction models for source HTML documents with respect to a given wrapper. We also addressed the issue of wrapper evaluation, developing an algorithm which works in polynomial time with respect to the size of a source document; the reader is referred to [5] for detailed information. We are currently developing a system that implements the proposed wrapping approach. As ongoing work, we plan to introduce enhancements to extraction schema. In particular, we are interested in considering XSchema constraints and relaxing the one-unambiguous property.


References 1. R. Baumgartner, S. Flesca, and G. Gottlob. Visual Web Information Extraction with Lixto. In Proc. 27th VLDB Conf., pages 119–128, 2001. 2. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proc. 27th VLDB Conf., pages 109–118, 2001. 3. D. W. Embley, C. Tao, and S. W. Liddl. Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure. In Proc. 21st ER Conf., pages 322–337, 2002. 4. S. Flesca and S. Greco. Partially Ordered Regular Languages for Graph Queries. In Proc. 26th ICALP, pages 321–330, 1999. 5. S. Flesca and A. Tagarelli. Schema-based Web Wrapping. Tech. Rep., DEIS - University of Calabria, 2004. Available at http://www.deis.unical.it/tagarelli/. 6. D. Freitag and N. Kushmerick. Boosted Wrapper Induction. In Proc. 17th AAAI Conf., pages 577–583, 2000. 7. G. Gottlob and C. Koch. Monadic Datalog and the Expressive Power of Languages for Web Information Extraction. In Proc. 21st PODS Symp., pages 17–28, 2002. 8. J. Hammer, J. McHugh, and H. Garcia-Molina. Semistructured Data: The TSIMMIS Experience. In Proc. 1st ADBIS Symp., pages 1–8, 1997. 9. C.-H. Hsu and M.-T. Dung. Generating Finite-State Transducers for Semistructured Data Extraction from the Web. Information Systems, 23(8):521–538, 1998. 10. D. Kim, H. Jung, and G. Geunbae Lee. Unsupervised Learning of mDTD Extraction Patterns for Web Text Mining. Information Processing and Management, 39(4):623–637, 2003. 11. N. Kushmerick. Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence, 118(1–2):15–68, 2000. 12. N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In Proc. 15th IJCAI, pages 729–737, 1997. 13. A. H. F. Laender, B. A. Ribeiro-Neto, and A. S. da Silva. DEByE - Data Extraction By Example. Data & Knowledge Engineering, 40(2):121–154, 2002. 14. L. Liu, C. Pu, and W. Han. XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. In Proc. 16th ICDE, pages 611–621, 2000. 15. W. May, R. Himmeröder, G. Lausen, and B. Ludäscher. A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web. In ER Workshops, pages 307–320, 1999. 16. X. Meng, H. Lu, H. Wang, and M. Gu. Data Extraction from the Web Based on Pre-Defined Schema. JCST, 17(4):377–388, 2002. 17. I. Muslea, S. Minton, and C. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001. 18. A. Sahuguet and F. Azavant. Building Intelligent Web Applications Using Lightweight Wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001. 19. World Wide Web Consortium – W3C. Extensible Markup Language 1.0, 2000. 20. World Wide Web Consortium – W3C. XML Path Language 2.0, 2003.

Web Taxonomy Integration Using Spectral Graph Transducer

Dell Zhang (1,2), Xiaoling Wang (3), and Yisheng Dong (4)

1 Department of Computer Science, School of Computing, National University of Singapore, S15-05-24, 3 Science Drive 2, Singapore 117543
2 Computer Science Programme, Singapore-MIT Alliance, E4-04-10, 4 Engineering Drive 3, Singapore 117576
3 Department of Computer Science & Engineering, Fudan University, 220 Handan Road, Shanghai 200433, China
4 Department of Computer Science & Engineering, Southeast University, 2 Sipailou, Nanjing 210096, China

Abstract. We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. In this paper, we attempt to use a powerful transductive learning algorithm, Spectral Graph Transducer (SGT), to attack this problem. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.

1

Introduction

A taxonomy, or directory or catalog, is a division of a set of objects (documents, images, products, goods, services, etc.) into a set of categories. There are a tremendous number of taxonomies on the web, and we often need to integrate objects from a source taxonomy into a master taxonomy. This problem is currently pervasive on the web, given that many websites are aggregators of information from various other websites [1]. A few examples will illustrate the scenario. A web marketplace like Amazon1 may want to combine goods from multiple vendors’ catalogs into its own. A web portal like NCSTRL2 may want to combine documents from multiple libraries’ directories into its own. A company may want to merge its service taxonomy with its partners’. A researcher may want to 1 2

http://www.amazon.com/ http://www.ncstrl.org/



merge his/her bookmark taxonomy with his/her peers’. Singapore-MIT Alliance3, an innovative engineering education and research collaboration among MIT, NUS and NTU, has a need to integrate the academic resource (courses, seminars, reports, softwares, etc.) taxonomies of these three universities. This problem is also important to the emerging semantic web [2], where data has structures and ontologies describe the semantics of the data, thus better enabling computers and people to work in cooperation. On the semantic web, data often come from many different ontologies, and information processing across ontologies is not possible without knowing the semantic mappings between them. Since taxonomies are central components of ontologies, ontology mapping necessarily involves finding the correspondences between two taxonomies, which is often based on integrating objects from one taxonomy into the other and vice versa [3,4]. If all taxonomy creators and users agreed on a universal standard, taxonomy integration would not be so difficult. But the web has evolved without central editorship. Hence the correspondences between two taxonomies are inevitably noisy and fuzzy. For illustration, consider the taxonomies of two web portals Google4 and Yahoo5: what is “Arts/ Music/ Styles/” in one may be “Entertainment/ Music/ Genres/” in the other, category “Computers_and_Internet/ Software/ Freeware” and category “Computers/ Open_Source/ Software” have similar contents but show non-trivial differences, and so on. It is unclear if a universal standard will appear outside specific domains, and even for those domains, there is a need to integrate objects from legacy taxonomy into the standard taxonomy. Manual taxonomy integration is tedious, error-prone, and clearly not possible at the web scale. A straightforward approach to automating this process would be to formulate it as a classification problem which has being well-studied in machine learning area [5]. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. In this paper, we attempt to use a powerful transductive learning algorithm, Spectral Graph Transducer (SGT) [6], to attack this problem. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration. The rest of this paper is organized as follows. In §2, we review the related work. In §3, we give the formal problem statement. In §4, we present our approach in detail. In §5, we conduct experimental evaluations. In §6, we make concluding remarks.

2

Related Work

Most of the recent research efforts related to taxonomy integration are in the context of ontology mapping on semantic web. An ontology specifies a conceptualization of a domain in terms of concepts, attributes, and relations [7]. The concepts in an ontology 3 4 5

http://web.mit.edu/sma/ http://www.google.com/ http://www.yahoo.com/


are usually organized into a taxonomy: each concept is represented by a category and associated with a set of objects (called the extension of that concept). The basic goal of ontology mapping is to identify (typically one-to-one) semantic correspondences between the taxonomies of two given ontologies: for each concept (category) in one taxonomy, find the most similar concept (category) in the other taxonomy. Many works in this field use a variety of heuristics to find mappings [8-11]. Recently machine learning techniques have been introduced to further automate the ontology mapping process [3, 4, 12-14]. Some of them derive similarities between concepts (categories) based on their extensions (objects) [3, 4, 12], therefore they need to first integrate objects from one taxonomy into the other and vice versa (i.e., taxonomy integration). So our work can be utilized as a basic component of an ontology mapping system. As explained later in §3, taxonomy integration can be formulated as a classification problem. The Rocchio algorithm [15, 16] has been applied to this problem in [3]; and the Naïve Bayes (NB) algorithm [5] has been applied to this problem in [4], without exploiting information in the source taxonomy. In [1], Agrawal and Srikant proposed the Enhanced Naïve Bayes (ENB) approach to taxonomy integration, which enhances the Naïve Bayes (NB) algorithm [5]. In [17], Zhang and Lee proposed the CS-TSVM approach to taxonomy integration, which enhances the Transductive Support Vector Machine (TSVM) algorithm [18] by the distance-based Cluster Shrinkage (CS) technique. They later proposed another approach in [19], CB-AB, which enhances the AdaBoost algorithm [20-22] by the Co-Bootstrapping (CB) technique. In [23], Sarawagi, Chakrabarti and Godboley independently proposed the Co-Bootstrapping technique (which they named CrossTraining) to enhance the Support Vector Machine (SVM) [24, 25] for taxonomy integration, as well as an Expectation Maximization (EM) based approach EM2D (2Dimensional Expectation Maximization). This paper is actually an straightforward extension of [17]. Basically, the approach proposed in this paper is similar to ENB [1] and CS-TSVM [17], in the sense that they are all motivated by the same idea: to bias the learning algorithm against splitting source categories. In this paper, we compare these two state-of-the-art approaches with ours both analytically and empirically. Comparisons with other approaches are left for future work.

3

Problem Statement

Taxonomies are often organized as hierarchies. In this work, we assume for simplicity, that any objects assigned to an interior node really belong to a leaf node which is an offspring of that interior node. Since we now have all objects only at leaf nodes, we can flatten the hierarchical taxonomy to a single level and treat it as a set of categories [1]. Now we formally define the taxonomy integration problem that we are solving. Given two taxonomies: a master taxonomy with a set of categories each containing a set of objects, and a source taxonomy with a set of categories each containing a set of objects, we need to find the category in for each object in
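Flattening a hierarchical taxonomy to its leaf categories, as assumed above, is straightforward; the following sketch shows one way to do it, with a purely hypothetical dictionary-based representation of a taxonomy.

# Hypothetical taxonomy representation: each node has a list of objects and
# a dictionary of child nodes.
taxonomy = {
    "objects": [],
    "children": {
        "Arts": {
            "objects": [],
            "children": {
                "Music": {"objects": ["obj1", "obj2"], "children": {}},
                "Movies": {"objects": ["obj3"], "children": {}},
            },
        },
        "Computers": {"objects": ["obj4"], "children": {}},
    },
}

def flatten(node, name="root"):
    # Collect (leaf-category name, objects) pairs; interior nodes are assumed
    # to hold no objects of their own, as in the simplifying assumption above.
    if not node["children"]:
        return {name: list(node["objects"])}
    leaves = {}
    for child_name, child in node["children"].items():
        leaves.update(flatten(child, child_name))
    return leaves

print(flatten(taxonomy))
# -> {'Music': ['obj1', 'obj2'], 'Movies': ['obj3'], 'Computers': ['obj4']}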


To formulate taxonomy integration as a classification problem, we take as classes, the objects in as training examples, the objects in as test examples, so that taxonomy integration can be automatically accomplished by predicting the class of each test example. It is possible that an object in belongs to multiple categories in Besides, some objects in may not fit well in any existing category in so users may want to have the option to form a new category for them. It is therefore instructive to create an ensemble of binary (yes/no) classifiers, one for each category C in When training the classifier for C, an object in

is labeled as a positive example if it is

contained by C or as a negative example otherwise. All objects in are unlabeled and wait to be classified. This is called the “one-vs-rest” ensemble method.
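The one-vs-rest construction can be written down in a few lines. The sketch below builds, for each master category, the labelled training set and the unlabelled test set; the data structures and the toy data are hypothetical, and no particular learning algorithm is fixed here.

def one_vs_rest_tasks(master, source):
    """Build one binary classification task per master category.

    master: dict mapping each master-taxonomy category to its list of objects
    source: dict mapping each source-taxonomy category to its list of objects
    For every master category C the training examples are all master objects,
    labelled +1 if they belong to C and -1 otherwise; the test examples are
    all (unlabelled) source objects.
    """
    test_examples = [obj for objs in source.values() for obj in objs]
    tasks = {}
    for category in master:
        labelled = [(obj, +1 if cat == category else -1)
                    for cat, objs in master.items() for obj in objs]
        tasks[category] = (labelled, test_examples)
    return tasks

# Toy usage:
tasks = one_vs_rest_tasks({"Music": ["m1", "m2"], "Movies": ["v1"]},
                          {"Genres": ["s1", "s2"]})
print(tasks["Music"][0])   # -> [('m1', 1), ('m2', 1), ('v1', -1)]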

4

Our Approach

Here we present our approach in detail. In §4.1, we review transductive learning and explain why it is suitable for our task. In §4.2, we review the Spectral Graph Transducer (SGT). In §4.3, we propose the similarity-based Cluster Shrinkage (CS) technique to enhance SGT classifiers. In §4.4, we compare our approach with ENB and CS-TSVM.

4.1

Transductive Learning

Regular learning algorithms try to induce a general classifying function which has high accuracy on the whole distribution of examples. However, this so-called inductive learning setting is often unnecessarily complex. For the classification problem in taxonomy integration situations, the set of test examples to be classified are already known to the learning algorithm. In fact, we do not care about the general classifying function, but rather attempt to achieve good classification performance on that particular set of test examples. This is exactly the goal of transductive learning [26]. The transductive learning task is defined on a fixed array of n examples Each example has a desired classification where for binary classification. Given the labels for a subset of (training) examples, a transductive learning algorithm attempts to predict the labels of the remaining (test) examples in X as accurately as possible. Several transductive learning algorithms have been proposed. A famous one is Transductive Support Vector Machine (TSVM), which was introduced by [26] and later refined by [18, 27]. Why can transductive learning algorithms excel inductive learning algorithms? Transductive learning algorithms can observe the examples in the test set and potentially exploit structure in their distribution. For example, there usually exists a clustering structure of examples: the examples in same class tend to be close to each other in feature space, and such kind of knowledge is helpful to learning, especially when there are only a small number of training examples.


Most machine learning algorithms assume that both the training and test examples come from the same data distribution. This assumption does not necessarily hold in the case of taxonomy integration. Intuitively, transductive learning algorithms seem to be more robust than inductive learning algorithms to the violation of this assumption, since transductive learning algorithms take the test examples into account during learning. This interesting issue deserves further study in the future.

4.2

Spectral Graph Transducer

Recently, Joachims introduced a new transductive learning method, the Spectral Graph Transducer (SGT) [6], which can be seen as a transductive version of the k nearest-neighbor (kNN) classifier. SGT works in three steps. The first step is to build the k nearest-neighbor (kNN) graph G on the set of examples X. The kNN graph G is similarity-weighted and symmetrized: its adjacency matrix is defined as A' = A + A^T, where A_ij = sim(x_i, x_j) / Σ_{x_l ∈ knn(x_i)} sim(x_i, x_l) if x_j is among the k nearest neighbors of x_i, and A_ij = 0 otherwise.

The function sim(·,·) can be any reasonable similarity measure. In the following, we will use the common cosine similarity function sim(x_i, x_j) = cos(θ(x_i, x_j)), where θ(x_i, x_j) represents the angle between x_i and x_j. The second step is to decompose G into its spectrum; specifically, compute the 2nd to the (d+1)-th smallest eigenvalues and the corresponding eigenvectors of G's normalized Laplacian L = B^{-1/2}(B - A')B^{-1/2}, where B is the diagonal degree matrix with B_ii = Σ_j A'_ij. The third step is to classify the examples. Given a set of training labels, SGT makes predictions by solving the following optimization problem, which minimizes the normalized graph cut with constraints:

minimize cut(G+, G-) / (|G+| · |G-|), subject to the training examples keeping their given labels,

where G+ and G- denote the sets of examples (vertices) predicted to be positive and negative respectively, and the cut-value cut(G+, G-) is the sum of the edge weights across the cut (bi-partitioning) defined by G+ and G-. Although this optimization problem is known to be NP-hard, there are highly efficient methods based on the spectrum of the graph that give a good approximation to the globally optimal solution [6].
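As an illustration of the first two steps, the sketch below builds a symmetrized, similarity-weighted kNN graph with cosine similarity and computes the smallest nontrivial eigenpairs of its normalized Laplacian using NumPy. It is not the authors' implementation; the parameter names and the row normalization of edge weights are assumptions consistent with the description above.

```python
import numpy as np

def knn_graph_spectrum(X, k=10, d=80):
    """Build the symmetrized, similarity-weighted kNN graph of the row vectors
    in X (assumed unit length, so the dot product is the cosine similarity)
    and return the 2nd to (d+1)-th smallest eigenpairs of its normalized
    Laplacian."""
    n = X.shape[0]
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)            # never count an example as its own neighbor
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]        # indices of the k most similar examples
        w = np.clip(sim[i, nbrs], 0.0, None)  # keep weights nonnegative
        A[i, nbrs] = w / (w.sum() + 1e-12)    # row-normalized edge weights
    A = A + A.T                               # symmetrize the graph
    deg = A.sum(axis=1)
    B_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    L = B_inv_sqrt @ (np.diag(deg) - A) @ B_inv_sqrt   # normalized Laplacian
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return evals[1:d + 1], evecs[:, 1:d + 1]  # drop the trivial smallest one
```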


For example, consider a classification problem with 6 examples whose kNN graph G is shown in Fig. 1 (adapted from [6]), with line thickness indicating edge weight. Given training labels for one positive and one negative example, SGT predicts the examples that are more strongly connected to the positive training example to be positive and the remaining examples to be negative, because this bi-partitioning of G gives the minimal normalized cut-value while keeping the labeled examples on their given sides.

Fig. 1. SGT does classification through minimizing the normalized graph cut with constraints

Unlike most other transductive learning algorithms, SGT does not need any additional heuristics to avoid unbalanced splits [6]. Furthermore, since SGT has a meaningful relaxation that can be solved globally optimally with efficient spectral methods, it is more robust and promising than existing methods.

4.3

Similarity-Based Cluster Shrinkage

Applying SGT to taxonomy integration, we can effectively use the objects in N (test examples) to boost classification performance. However, thus far we have completely ignored the categorization of N. Although M and N are usually not identical, their categorizations often have some semantic overlap. Therefore the categorization of N contains valuable implicit knowledge about the categorization of M. For example, if two objects belong to the same category S in N, they are more likely to belong to the same category C in M rather than to be assigned to different categories. We hereby propose the similarity-based Cluster Shrinkage (CS) technique to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data.

4.3.1 Algorithm
Since SGT models the learning problem as a similarity-weighted kNN graph, it offers a large degree of flexibility for encoding prior knowledge about the relationship between individual examples in the similarity function. Our proposed similarity-based CS technique takes all categories as clusters and shrinks them by substituting the regular similarity function sim(·,·) with the CS similarity function cs-sim(·,·).

Definition 1. The center of a category S is the centroid of the feature vectors of the objects in S, c(S) = (1/|S|) Σ_{x ∈ S} x.


Definition 2. The CS similarity function cs-sim(·,·) for two examples x_i and x_j is defined as cs-sim(x_i, x_j) = λ · sim(c_i, c_j) + (1 - λ) · sim(x_i, x_j), where c_i and c_j are the centers of the categories containing x_i and x_j respectively, and 0 ≤ λ ≤ 1.

When an example x belongs to multiple categories, its corresponding category center in the above formula should be amended to the centroid of the centers of those categories. We name our approach that uses SGT classifiers enhanced by the similarity-based CS technique CS-SGT.

4.3.2 Analysis
Theorem 1. For any pair of examples x_i and x_j in the same category S, cs-sim(x_i, x_j) ≥ sim(x_i, x_j).

Proof: Suppose the center of S is c. Then cs-sim(x_i, x_j) = λ · sim(c, c) + (1 - λ) · sim(x_i, x_j) = λ + (1 - λ) · sim(x_i, x_j). Since sim(x_i, x_j) ≤ 1 and λ ≥ 0, we get λ + (1 - λ) · sim(x_i, x_j) ≥ sim(x_i, x_j), and therefore cs-sim(x_i, x_j) ≥ sim(x_i, x_j).

From the above theorem, we see that CS-SGT increases the similarity between examples that are known to be in the same category, and consequently puts more weight on the edge between them in the kNN graph. Since SGT seeks the minimum normalized graph cut, stronger connections among examples in the same category direct SGT to avoid splitting that category, in other words, to preserve the original categorization of the taxonomies to some degree while doing classification. By substituting the regular similarity function with the CS similarity function, the CS-SGT approach can not only make effective use of the objects in N, as SGT does, but also make effective use of the categorization of N. The CS similarity function is a linear interpolation of sim(c_i, c_j) and sim(x_i, x_j). The linear interpolation parameter λ controls the influence of the original categorization on the classification. When λ = 1, CS-SGT classifies all objects belonging to one category in N as a whole into a specific category in M. When λ = 0, CS-SGT is just the same as SGT. As long as the value of λ is set appropriately, CS-SGT should never be worse than SGT because it includes SGT as a special case. The optimal value of λ can be found using a tune set (a set of objects whose categories in both taxonomies are known). The tune set can be made available via random sampling or active learning, as described in [1].
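A minimal sketch of the CS similarity as reconstructed above follows; the interpolation form and the centroid definition are reconstructions rather than the paper's verbatim formulas, and λ = 0.2 mirrors the setting reported in Section 5.5.

```python
import numpy as np

def cosine(a, b):
    """Regular similarity function sim(.,.): cosine of the angle between a and b."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def center(category_vectors):
    """Center of a category: centroid of the feature vectors of its objects."""
    return np.mean(category_vectors, axis=0)

def cs_sim(x_i, x_j, c_i, c_j, lam=0.2):
    """Cluster-Shrinkage similarity: linear interpolation between the similarity
    of the category centers and the similarity of the examples themselves."""
    return lam * cosine(c_i, c_j) + (1.0 - lam) * cosine(x_i, x_j)
```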


4.4


Comparison with ENB and CS-TSVM

Both ENB and CS-TSVM outperform conventional machine learning methods in taxonomy integration, because they are able to leverage the source taxonomy data to improve classification. CS-SGT also follows this idea to enhance SGT for taxonomy integration. ENB [1] is based on NB [5], which is an inductive learning algorithm. In contrast, CS-TSVM is based on TSVM [18], which is a transductive learning algorithm. It has been shown that CS-TSVM is more effective than ENB [17] in taxonomy integration. However, CS-TSVM is not as efficient as ENB because TSVM runs much more slowly than NB. CS-SGT is based on the recently proposed transductive learning algorithm SGT [6]. We think CS-SGT should achieve performance similar to CS-TSVM, because in theory SGT connects to a simplified version of TSVM, and both of them attempt to incorporate the affinity information present in the taxonomy data into learning. This has been confirmed by our experiments. On the other hand, CS-SGT is much more efficient than CS-TSVM for the following three reasons. (1) CS-TSVM is based on TSVM, which uses a computationally expensive greedy search to find a locally optimal solution. In contrast, CS-SGT is based on SGT, which uses efficient spectral methods to solve its relaxation globally. (2) CS-TSVM must run SVM first to get a good estimate of the fraction of positive examples in the test set [17], because TSVM requires that fraction to be fixed a priori [18]. In contrast, CS-SGT does not need this kind of extra computation, thanks to the merit of SGT in automatically avoiding unbalanced splits [6]. (3) CS-TSVM requires training a TSVM classifier from scratch for each master category, using the "one-vs-rest" ensemble method for multi-class multi-label classification (as stated in §3). In contrast, CS-SGT (or SGT) needs to build and decompose the kNN graph only once for a specific set of examples (dataset), which saves a lot of time. It has been observed that the construction of the kNN graph is the most time-consuming step of SGT, but it can be sped up using appropriate data structures such as inverted indices or kd-trees [6]. The prominent efficiency advantage of the CS-SGT approach has also been confirmed by our experiments. In summary, the CS-SGT approach is able to achieve performance similar to CS-TSVM in taxonomy integration while retaining efficiency comparable to ENB.

5

Experiments

We conduct experiments with real-world web data, to demonstrate the advantage of our proposed CS-SGT approach to taxonomy integration. To facilitate comparison, we use exactly the same datasets and experimental setup as [17].

5.1

Datasets

We have collected 5 datasets from Google and Yahoo: Book, Disease, Movie, Music and News. Each dataset includes the slice of Google's taxonomy and the slice of Yahoo's taxonomy about websites on one specific topic.


In each slice of taxonomy, we take only the top-level directories as categories; e.g., the "Movie" slice of Google's taxonomy has categories like "Action", "Comedy", "Horror", etc. In each category, we take all items listed on the corresponding directory page and its sub-directory pages as its objects. An object (listed item) corresponds to a website on the World Wide Web, which is usually described by its URL, its title, and optionally a short annotation about its content. The set of objects occurring in both Google and Yahoo covers only a small portion (usually less than 10%) of the set of objects occurring in Google or Yahoo alone, which suggests the great benefit of automatically integrating them. This observation is consistent with [1]. The number of categories per object in these datasets is 1.54 on average. This observation confirms our previous statement in §3 that an object may belong to multiple categories, and justifies our strategy of building a binary classifier for each category in the master taxonomy. The category distributions in all these datasets are highly skewed. For example, in Google's Book taxonomy, the most common category contains 21% of the objects, but 88% of the categories contain less than 3% of the objects and 49% of the categories contain less than 1% of the objects. In fact, skewed category distributions have been commonly observed in real-world applications [28].

5.2

Tasks

For each dataset, we pose two symmetric taxonomy integration tasks: Y→G (integrating objects from Yahoo into Google) and G→Y (integrating objects from Google into Yahoo). As described in §3, we formulate each task as a classification problem. The objects appearing in both taxonomies can be used as test examples, because their categories in both taxonomies are known to us [1]. We hide the test examples' master categories but expose their source categories to the learning algorithm in the training phase, and then compare their hidden master categories with the predictions of the learning algorithm in the test phase. Suppose the number of test examples is n. For Y→G tasks, we randomly sample n objects from the set G-Y (objects occurring in Google but not in Yahoo) as training examples. For G→Y tasks, we randomly sample n objects from the set Y-G as training examples. This is to simulate the common situation that the sizes of M and N are roughly of the same magnitude. For each task, we do such random sampling 5 times, and report the classification performance averaged over these 5 random samplings.

5.3

Features

For each object, we assume that the title and annotation of its corresponding website summarize its content. So each object can be considered as a text document composed of its title and annotation (note that this differs from [1, 23], which take the actual web pages as objects). The most commonly used feature extraction technique for text data is to treat a document as a bag-of-words [18, 25]. For each document d in a collection of documents D, its bag-of-words is first pre-processed by removal of stop-words and


stemming. Then it is represented as a feature vector d = (w_1, w_2, ..., w_m), where w_i indicates the importance weight of term t_i (the i-th distinct word occurring in D). Following the TF×IDF weighting scheme, we set the value of w_i to the product of the term frequency tf(t_i, d) and the inverse document frequency idf(t_i), i.e., w_i = tf(t_i, d) × idf(t_i). The term frequency tf(t_i, d) is the number of occurrences of t_i in d. The inverse document frequency is defined as idf(t_i) = log(|D| / df(t_i)), where |D| is the total number of documents in D, and df(t_i) is the number of documents in which t_i occurs. Finally, all feature vectors are normalized to have unit length.
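A minimal sketch of the weighting just described follows; tokenization, stop-word removal and stemming are omitted, and all names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (one list per document).
    Returns one unit-length TF*IDF weight dictionary per document."""
    df = Counter()                       # document frequency of every term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)             # term frequency within this document
        w = {t: tf[t] * math.log(len(docs) / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

print(tfidf_vectors([["horror", "movie"], ["comedy", "movie"]]))
```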

5.4

Measures

As stated in §3, it is natural to accomplish a taxonomy integration task via an ensemble of binary classifiers, one for each category in the master taxonomy. To measure classification performance, we use the standard F-score (F1 measure) [15]. The F-score is defined as the harmonic average of precision (p) and recall (r), F = 2pr/(p + r), where precision is the proportion of correctly predicted positive examples among all predicted positive examples, and recall is the proportion of correctly predicted positive examples among all true positive examples. The F-scores can be computed for the binary decisions on each individual category first and then averaged over categories, or they can be computed globally over all the M×n binary decisions, where M is the number of categories under consideration (the number of categories in the master taxonomy) and n is the total number of test examples (the number of objects in the source taxonomy). The former way is called macro-averaging and the latter micro-averaging [28]. It is understood that the micro-averaged F-score (miF) tends to be dominated by the classification performance on common categories, while the macro-averaged F-score (maF) is more influenced by the classification performance on rare categories [28]. Since the category distributions are highly skewed (see §5.1), providing both kinds of scores is more informative than providing either alone.
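The two averaging schemes can be written down directly from per-category contingency counts, as in the following sketch (variable names are assumptions):

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one set of binary decisions."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(per_category_counts):
    """per_category_counts: list of (tp, fp, fn) triples, one per master category.
    Macro-averaging averages the per-category F-scores; micro-averaging pools
    the counts over all M x n binary decisions first."""
    macro = sum(f_score(*c) for c in per_category_counts) / len(per_category_counts)
    tp, fp, fn = (sum(c[i] for c in per_category_counts) for i in range(3))
    return macro, f_score(tp, fp, fn)
```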

5.5

Settings

We use the SGT software implemented by Joachims (available at http://sgt.joachims.org/) with the following parameters: "-k 10", "-d 100", "-c 1000 -t f -p s". We set the parameter λ of the CS similarity function to 0.2. Fine-tuning λ using tune sets would surely yield better results than sticking with a pre-fixed value. In other words, the performance superiority of CS-SGT is under-estimated in our experiments.

5.6

Results

The experimental results of SGT and CS-SGT are shown in Table 1. We see that CS-SGT really can achieve much better performance than SGT for taxonomy integration.



We think this is because CS-SGT makes effective use of the affinity information present in the taxonomy data.

In Figures 2 and 3, we compare the experimental results of CS-SGT with those of ENB and CS-TSVM, which are taken from [17]. We see that CS-SGT outperforms ENB consistently and significantly. We also find that CS-SGT's macro-averaged F-scores are slightly lower than those of CS-TSVM, while its micro-averaged F-scores are comparable to those of CS-TSVM. On the other hand, our experiments demonstrated that CS-SGT was much faster than CS-TSVM: CS-TSVM took about one or two days to run all the experiments while CS-SGT finished in several hours.

Fig. 2. Comparing the macro-averaged F-scores of ENB, CS-TSVM and CS-SGT

6

Conclusion

Our main contribution is to show how Spectral Graph Transducer (SGT) can be enhanced for taxonomy integration tasks. We have compared the proposed CS-SGT approach to taxonomy integration with two existing state-of-the-art approaches, and demonstrated that CS-SGT is both effective and efficient.


Fig. 3. Comparing the micro-averaged F-scores of ENB, CS-TSVM and CS-SGT

Future work may include: comparing with the approaches in [19, 23], incorporating commonsense knowledge and domain constraints into the taxonomy integration process, extending our method to fully functional ontology mapping systems, and so forth.

References 1. Agrawal, R., Srikant, R.: On Integrating Catalogs. In: Proceedings of the 10th International World Wide Web Conference (WWW). (2001) 603-612 2. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001) 3. Lacher, M. S., Groh, G.: Facilitating the Exchange of Explicit Knowledge through Ontology Mappings. In: Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS). (2001) 305-309 4. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to Map between Ontologies on the Semantic Web. In: Proceedings of the 11th International World Wide Web Conference (WWW). (2002) 662-673 5. Mitchell, T.: Machine Learning. international edn. McGraw Hill, New York (1997) 6. Joachims, T.: Transductive Learning via Spectral Graph Partitioning. In: Proceedings of the 20th International Conference on Machine Learning (ICML). (2003) 290-297 7. Fensel, D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer-Verlag (2001) 8. Chalupsky, H.: OntoMorph: A Translation System for Symbolic Knowledge. In: Proceedings of the 7th International Conference on Principles of Knowledge Representation and Reasoning (KR). (2000) 471-482 9. McGuinness, D. L., Fikes, R., Rice, J., Wilder, S.: The Chimaera Ontology Environment. In: Proceedings of the 17th National Conference on Artificial Intelligence (AAAI). (2000) 1123–1124 10. Mitra, P., Wiederhold, G., Jannink, J.: Semi-automatic Integration of Knowledge Sources. In: Proceedings of The 2nd International Conference on Information Fusion. (1999) 11. Noy, N. F., Musen, M. A.: PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In: Proceedings of the National Conference on Artificial Intelligence (AAAI). (2000) 450-455


12. Ichise, R., Takeda, H., Honiden, S.: Rule Induction for Concept Hierarchy Alignment. In: Proceedings of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 26-29 13. Noy, N. F., Musen, M. A.: Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In: Proceedings of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 63-70 14. Stumme, G., Maedche, A.: FCA-MERGE: Bottom-Up Merging of Ontologies. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 225-230 15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, New York, NY (1999) 16. Rocchio, J. J.: Relevance Feedback in Information Retrieval. In: G. Salton, (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall (1971) 313-323 17. Zhang, D., Lee, W. S.: Web Taxonomy Integration using Support Vector Machines. In: Proceedings of the 13th International World Wide Web Conference (WWW). (2004) 18. Joachims, T.: Transductive Inference for Text Classification using Support Vector Machines. In: Proceedings of the 16th International Conference on Machine Learning (ICML). (1999) 200-209 19. Zhang, D., Lee, W. S.: Web Taxonomy Integration through Co-Bootstrapping. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). (2004) 20. Freund, Y., Schapire, R. E.: A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences 55 (1997) 119-139 21. Schapire, R. E., Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning 39 (2000) 135-168 22. Schapire, R. E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37 (1999) 297-336 23. Sarawagi, S., Chakrabarti, S., Godbole, S.: Cross-Training: Learning Probabilistic Mappings between Topics. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). (2003) 177-186 24. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK (2000) 25. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of the 10th European Conference on Machine Learning (ECML). (1998) 137-142 26. Vapnik, V. N.: Statistical Learning Theory. Wiley, New York, NY (1998) 27. Bennett, K.: Combining Support Vector and Mathematical Programming Methods for Classification. In: B. Scholkopf, C. Burges, and A. Smola, (eds.): Advances in Kernel MethodsSupport Vector Learning. MIT-Press (1999) 28. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR). (1999) 42-49

Contextual Probability-Based Classification

Gongde Guo (1,2), Hui Wang (2), David Bell (3), and Zhining Liao (2)

1 School of Computer Science, Fujian Normal University, Fuzhou 350007, China
2 School of Computing and Mathematics, University of Ulster, BT37 0QB, UK
{G.Guo,H.Wang,Z.Liao}@ulster.ac.uk
3 School of Computer Science, Queen's University Belfast, BT7 1NN, UK

Abstract. The k-nearest-neighbor (kNN) method for classification is simple but effective in many cases. The success of kNN in classification depends on the selection of a "good value" for k. In this paper, we propose a contextual probability-based classification algorithm (CPC) which looks at multiple sets of nearest neighbors rather than just one set of nearest neighbors for classification, to reduce the bias of k. The proposed formalism is based on probability, and the idea is to aggregate the support of multiple neighborhoods for various classes to better reveal the true class of each new instance. To choose a series of more relevant neighborhoods for aggregation, three neighborhood selection methods (distance-based, symmetric-based, and entropy-based) are proposed and evaluated respectively. The experimental results show that CPC obtains better classification accuracy than kNN and is indeed less biased by k after saturation is reached. Moreover, the entropy-based CPC obtains the best performance among the three proposed neighborhood selection methods.

1

Introduction

kNN is a simple but effective method for classification [1]. For an instance x to be classified, its k nearest neighbors are retrieved, and these form a neighborhood of x. Majority voting among the instances in the neighborhood is commonly used to decide the classification for x, with or without consideration of distance-based weighting. Despite its conceptual simplicity, kNN performs as well as any other possible classifier when applied to non-trivial problems. Over the last 50 years, this simple classification method has been extensively used in a broad range of applications such as medical diagnosis, text categorization [2], pattern recognition [3], data mining [4], and e-commerce. However, to apply kNN we need to choose an appropriate value for k, and the success of classification is very much dependent on this value. In a sense, kNN is biased by k. There are many ways of choosing the k value, and a simple one is to run the algorithm many times with different k values and choose the one with the best performance. But this is not a pragmatic method in real applications.


In order for kNN to be less dependent on the choice of k, we propose to look at multiple sets of nearest neighbors rather than just one set of k nearest neighbors. As we know, for an instance x each neighborhood bears support for different possible classes. The proposed formalism is based on contextual probability [5], and the idea is to aggregate the support of multiple sets of nearest neighbors for various classes to give a more reliable support value, which better reveals the true class of x. However, in practice the given data set is usually only a sample of the underlying data space, so it is impossible to gather all the neighborhoods to aggregate the support for classifying a new instance. On the other hand, even if it were possible to gather all the neighborhoods of a given new instance for classification, the computational cost could be unbearable. In a sense, the classification accuracy of CPC depends on a given number of chosen neighborhoods, so the methods used to select the more relevant neighborhoods for aggregation are important. Having identified these problems of CPC, we propose three neighborhood selection methods in this paper, aimed at choosing a set of neighborhoods as informative as possible for classification, to further improve the classification accuracy of CPC. The rest of the paper is organized as follows: Section 2 describes the contextual probability-based classification method. Section 3 introduces the three neighborhood selection methods: distance-based, symmetric-based, and entropy-based. The experimental results are described and discussed in Section 4. Section 5 ends the paper with a summary, open problems, and further research directions.

2

Contextual Probability-Based Classification

Let Ω be a finite set called a frame of discernment. A mass function is a function m: 2^Ω → [0, 1] such that Σ_{X ⊆ Ω} m(X) = 1.

The mass function is interpreted as a representation (or measure) of knowledge or belief about Ω, and m(X) is interpreted as a degree of support for X [6, 7]. To extend our knowledge to an event A that we cannot evaluate explicitly from m, we define a new function G such that for any A ⊆ Ω:

G(A) = Σ_{X ⊆ Ω} m(X) · |A ∩ X| / |X|.

This means that the knowledge of event A may not be known explicitly in the representation of our knowledge, but we know explicitly some events X that are related to it (i.e., A overlaps with X, A ∩ X ≠ ∅). Part of the knowledge about X, m(X), should then be shared by A, and a measure of this part is m(X) · |A ∩ X| / |X|. The mass function can be interpreted in different ways. In order to solve the aggregation problem, one interpretation is made as follows.
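A small sketch of the function G as reconstructed above, with events represented as frozensets (illustrative only):

```python
def aggregate(mass, A):
    """mass: dict mapping known events (frozensets over the frame of discernment)
    to mass values summing to 1.  Returns G(A): each known event X shares the
    fraction |A & X| / |X| of its mass with event A."""
    A = frozenset(A)
    return sum(m * len(A & X) / len(X) for X, m in mass.items() if X)

# Toy frame {a, b, c}: knowledge is given on two overlapping events.
mass = {frozenset("ab"): 0.6, frozenset("bc"): 0.4}
print(aggregate(mass, "b"))   # 0.6 * 1/2 + 0.4 * 1/2 = 0.5
```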


Let S be a finite set of class labels, and D be a finite data set, each element of which has a class label in S. The labelling is denoted by a function l, so that for x ∈ D, l(x) is the class label of x. Consider a class q ∈ S. Let D_q = {x ∈ D : l(x) = q} and N_q = |D_q|. The mass function m_q for q is defined such that, for

clearly defined as

and if the distribution over is uniform, then Based on the mass function, the aggregation function for is such that, for

When A is singleton, denoted as

equation (4) can be changed to equation (5).

If the distribution over is uniform then, for represented as equation (6).

Let represent the number of ways of picking N possibilities, then,

and

can be

unordered outcomes from

Let x be an instance to be classified. If we knew the aggregated support for every class, we could assign x to the class that has the largest support. Since the given data set is usually only a sample of the underlying data space, we may never know the true support; all we can do is approximate it. Equation (6) shows the relationship between the support and the quantities that can be calculated from some given events. If the set of events is complete, we can calculate the support accurately; otherwise, if it is partial (a subset), we obtain an approximation. From equation (5) we know that the more events we know about, the more accurate the approximation will be. As a result, we should try to gather as many relevant events about x as possible. In the spirit of kNN, we can deem the neighborhoods of x relevant, and therefore take neighborhoods of x as events. But in practice, the more neighborhoods are chosen for classification, the higher the computational cost. With limited computing time, the choice of the more relevant neighborhoods is non-trivial. This is


one reason that motivated us to seek a series of more relevant neighborhoods over which to aggregate the support for classification. Also in the spirit of kNN, for an instance x to be classified, the closer an instance is to x, the more that instance contributes to classifying x. Based on this understanding, for a given number of neighborhoods chosen for aggregation, we choose a series of specific neighborhoods, which we think are relevant to the instance to be classified, for classification. Summarizing the above discussion, we propose the following procedure for CPC.
1. Determine N and N_q for every class q, and then calculate the quantities that depend only on them. These numbers are valid for any instance to be classified.
2. Select a number of neighborhoods of x.
3. Calculate the mass of every selected neighborhood for every class q.
4. Calculate the support that every selected neighborhood lends to every class q.
5. Calculate the aggregated support for every class q.
6. Classify x into the class q that has the largest aggregated support.
In its simplest form, kNN is majority voting among the k nearest neighbors of x. In our terminology, kNN can be described as follows: select one neighborhood A of x, calculate the mass of A for every class, and finally classify x by the class with the largest value. We can see that kNN considers only one neighborhood, and it does not take into account the proportion of instances in a class. In this sense, therefore, kNN is a special case of our classification procedure.
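Because the displayed formulas did not survive in this copy, the sketch below only illustrates the general flow of the procedure: it aggregates, over several chosen neighborhoods, the per-class support derived from each neighborhood's class distribution, and picks the class with the largest total. It is a simplification, not the authors' exact computation.

```python
from collections import Counter

def cpc_classify(neighborhoods, label_of):
    """neighborhoods: list of lists of training instances (e.g. nested kNN sets).
    label_of: function returning the class label of a training instance.
    Returns the class with the largest aggregated support."""
    support = Counter()
    for nb in neighborhoods:
        counts = Counter(label_of(inst) for inst in nb)
        total = sum(counts.values())
        for cls, cnt in counts.items():
            support[cls] += cnt / total   # each neighborhood contributes its class proportions
    return support.most_common(1)[0][0]

# Toy usage: three nested neighborhoods of labelled points.
data = {"a": "+", "b": "+", "c": "-", "d": "-", "e": "+"}
print(cpc_classify([["a"], ["a", "b", "c"], ["a", "b", "c", "d", "e"]], data.get))
```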

3

Neighborhood Selection

In practice, a given data set is usually only a sample of the underlying data space. It is impossible to gather all the neighborhoods to aggregate the support for classifying a new instance. On the other hand, even if it were possible to gather all neighborhoods for classification, the computational cost could be unbearable. So the methods used to select the more relevant neighborhoods for aggregation are quite important. In this section, we describe the three proposed neighborhood selection methods, distance-based, symmetric-based, and entropy-based, which have been implemented in our prototype.

3.1 Distance-Based Neighborhood Selection

For a new instance x to be classified, distance-based neighborhood selection proceeds by choosing sets of k nearest neighbors with different values of k as neighborhoods. One simple way, for example, is to let the i-th neighborhood consist of the k_i nearest neighbors of x for an increasing sequence k_1 < k_2 < ..., so that each neighborhood contains all the smaller ones. This is the simplest neighborhood selection method. Figure 1 demonstrates the first four neighborhoods obtained using the distance-based neighborhood selection method.


Fig. 1. The first four distance-based neighborhoods around x
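A sketch of the distance-based selection follows; the names and the particular k values are illustrative (the step-3 spacing mirrors the experiments in Section 4.2).

```python
import numpy as np

def distance_based_neighborhoods(X, x, ks=(1, 4, 7, 10)):
    """X: (n, m) array of training instances; x: the query instance.
    Returns one index array per k in ks, each holding the k nearest neighbors
    of x, so the neighborhoods are nested as described above."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    return [order[:k] for k in ks]
```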

3.2

Symmetric-Based Neighborhood Selection

Let S be a finite set of class labels and D be a finite data set; each instance in D has a class label in S. The labelling is denoted by a function l, so that for x ∈ D, l(x) is the class label of x. Firstly, we project the data set D into the feature space, so that each instance is represented as a point in the space. Then we partition the space into grids. The partitioning process proceeds as follows. For each dimension of the space: if the feature is ordinal, we partition its range into equal-width intervals whose width is determined by the standard deviation of the values occurring for that feature and an application-dependent parameter; if the feature is nominal, its discrete values provide a natural partition. At the end of the partitioning process, all the instances in data set D are distributed into the cells of the grid. For an instance x to be classified, the initial cell location of x can be calculated from its feature values: for an ordinal feature, the corresponding entry of the cell is an interval; for a nominal feature, it is a set of values.

All the instances covered by the initial cell make up the first neighborhood. Strictly speaking, each cell of the grid is a hypertuple. A hypertuple is a tuple whose entries are sets for nominal features and intervals for ordinal features, instead of single values [10]. Assume E_i is the current neighborhood and h_i is the corresponding hypertuple. To generate the next neighborhood, the hypertuple is expanded in the following way: an ordinal entry of h_i, represented as an interval, is widened by one cell on each side; a nominal entry of h_i, represented as a set, is expanded towards the set of all the values of that feature that occur in the training instances. All the instances covered by the newly generated hypertuple make up the next neighborhood. Figure 2 is an example of three symmetric hypertuples


around x, denoted h_1, h_2 and h_3 respectively, where x is covered by h_1 and each hypertuple is covered by the next larger one. The instances covered by hypertuples h_1, h_2 and h_3 make up the neighborhoods E_1, E_2 and E_3 respectively.

Fig. 2. Three symmetric neighborhoods

3.3

Entropy-Based Neighborhood Selection

We propose an entropy-based neighborhood selection method that selects a given number of neighborhoods carrying as much information for classification as possible, with the goal of improving the classification accuracy of CPC. This is a neighborhood-expansion method in which the next neighborhood is generated by expanding the previous one; obviously, the earlier neighborhood is covered by the later one. In each neighborhood expansion step, we calculate the entropy of each possible expansion (candidate) and select the one with minimal entropy as the next neighborhood. The smaller the entropy of a neighborhood, the more imbalanced the class distribution of the neighbors, and the more relevant the neighbors are to the instance to be classified. Assume E_i is the current neighborhood and h_i is the corresponding hypertuple in the feature space. Consider one feature. If it is ordinal, the corresponding entry of h_i is an interval; extending this interval by one cell to the left and extending it by one cell to the right give two candidates for the next neighborhood. If the feature is nominal, the corresponding entry of h_i is a set; for every feature value occurring in the training instances but not yet in this set, adding that value gives a candidate for the next neighborhood. We then calculate each candidate's entropy according to equation (7), and choose the candidate with minimal entropy as the next neighborhood. The entropy is defined as follows:

Ent(E) = - Σ_{q=1}^{s} p_q log2 p_q.  (7)

In equation (7), s is the number of classes and q ranges over the class labels. p_q in equation (8) is determined by counting the number of instances in the


candidate that belong to class q, expressed as a proportion of the total number of instances in the candidate. All the instances covered by the chosen candidate make up the next neighborhood. Suppose that a new instance x to be classified initially falls into a cell of the grid represented as a hypertuple h_0, i.e., x is covered by h_0. For hypertuple h_0, if a feature is ordinal, the corresponding entry represents an interval; if it is nominal, the corresponding entry is a set. All the instances covered by hypertuple h_0 make up a set E_0, which is the first neighborhood of our algorithm. The detailed entropy-based neighborhood selection algorithm is described as follows: 1. Set E_0 to the set of instances covered by h_0 (x is covered by h_0). 2. For i = 1 to N: find the neighborhood with minimal entropy among all the candidates expanding from h_{i-1}, and let it be E_i with corresponding hypertuple h_i.

Suppose that two candidate neighborhoods of x have the same amount of entropy. In that case we break the tie by their cardinality, preferring the candidate we believe to be more relevant to x; otherwise, we prefer the one with minimal entropy as the next neighborhood. According to equation (7), the smaller a neighborhood's entropy is, the more imbalanced its class distribution is, and consequently the more information it carries for classification. So, in our algorithm, we adopt equation (7) as the criterion for neighborhood selection: in each neighborhood expansion step, we select the candidate with minimal entropy as the next neighborhood. To illustrate the method, we consider an example here. For simplicity, we describe our entropy-based neighborhood selection method in 2-dimensional space. Suppose that an instance to be classified is located at cell [3, 3] in the leftmost graph of Figure 3. We collect all the instances covered by cell [3, 3] into a set E_0, the first neighborhood. Then we try to expand the cell one step in each of four different directions (up, down, left, and right) respectively and choose a candidate with minimal entropy as a new expanded area,

Fig. 3. Neighborhood expansion process (1)


Then we look up, down, left, and right again and select a new area with minimal entropy (the rightmost graph of Figure 3). All the instances covered by the expanded area make up the next neighborhood, and so on. At the end of the procedure, we obtain a series of neighborhoods, as shown in Figure 3 from left to right. If the instance to be classified is located at cell [2, 3] in the leftmost graph of Figure 4, the selection process of three neighborhoods is demonstrated by Figure 4 from left to right.

Fig. 4. Neighborhood expansion process (2)
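The entropy criterion of equation (7) that drives each expansion step can be sketched as follows (a simplified illustration; tie-breaking and the grid bookkeeping are omitted):

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the class distribution inside a candidate neighborhood;
    smaller values mean the candidate is dominated by fewer classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_next_neighborhood(candidates, label_of):
    """candidates: list of candidate neighborhoods (lists of instances).
    Returns the candidate with minimal class entropy."""
    return min(candidates, key=lambda cand: class_entropy(label_of(i) for i in cand))
```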

4

Experiment and Evaluation

One motivation of this work is the fact that kNN classification is heavily dependent on the choice of a 'good' value for k. The objective of this paper is therefore to come up with a method in which this dependence is reduced. A contextual probability-based classification method is proposed to solve this problem, which works in the same spirit as kNN but uses more neighborhoods. For simplicity we refer to our classification procedure presented in Section 2 as nokNN. To distinguish between the three different neighborhood selection methods, we refer to the distance-based neighborhood selection method as nokNN(d), the symmetric-based neighborhood selection method as nokNN(s), and the entropy-based neighborhood selection method as nokNN(e). Here we experimentally evaluate the classification procedures of nokNN(d), nokNN(s), and nokNN(e) with real-world data sets in order to verify our expectations and to see if and how aggregating different neighborhoods improves the classification accuracy of kNN.

4.1

Data Sets

In the experiments, we used fifteen public data sets available from the UC Irvine Machine Learning Repository. General information about these data sets is shown in Table 1. The data sets are relatively small, but scalability is not an issue when data sets are indexed. In Table 1, the meaning of the column headings is as follows: NF-Number of Features, NN-Number of Nominal features, NO-Number of Ordinal features, NB-Number of Binary features, NI-Number of Instances, and CD-Class Distribution.


4.2 Experiments

Experiment 1. kNN and nokNN(d) were implemented in our prototype. In the experiment, 30 neighborhoods were used for every data set. kNN was run with the number of neighbors k ranging from 1 to 88 with step 3, and nokNN(d) was run with the number of neighborhoods N ranging from 1 to 30 with step 1. Each set of k nearest neighbors (for kNN) makes up a neighborhood, so there are in total 30 neighborhoods corresponding to the different k values ranging from 1 to 88 with step 3. The comparison of kNN and nokNN(d) in classification accuracy is shown in Figure 5. Each value N on the horizontal axis represents the number of neighborhoods used for aggregation for nokNN(d) and the N-th neighborhood used for kNN; the k value for kNN with respect to the N-th neighborhood is k = 3N - 2. The detailed experimental results for kNN and nokNN(d) are presented in two separate tables: Table 2 for nokNN(d) and Table 3 for kNN, where N is varied from 1 to 10 for both kNN and nokNN(d).

Fig. 5. A comparison of nokNN(d) and kNN in average classification accuracy


In Table 2, the heading N represents the number of neighborhoods used for aggregation. In Table 3, the heading N represents the N-th neighborhood used for kNN; the N-th neighborhood contains k = 3N - 2 neighbors.


Fig. 6. Classification accuracy of nokNN(d) testing on Diabetes data set

Fig. 7. Classification accuracy of kNN testing on Diabetes data set

Figure 6 and Figure 7 show the full details of the performance of nokNN(d) and kNN on the Diabetes data set, where the number of neighborhoods varies from 1 to 30. We also give the worst and best performance of kNN together with the corresponding "N" values, and the performance of nokNN(d) when ten neighborhoods are used for aggregation, in Table 4. In this experiment, we use the 10-fold cross-validation method for evaluation. The experimental results show that the performance of kNN varies when different neighborhoods are used, while the performance of nokNN(d) improves with an increasing number of neighborhoods but stabilizes after a certain number of neighborhoods (see Figure 5). Furthermore, the stabilized performance of nokNN(d) is comparable (in fact slightly better in our experiment on fifteen data sets) to the best performance of kNN within 10 neighborhoods.

Experiment 2. In this experiment, our goal is to test whether or not the entropy-based neighborhood selection method can improve the classification accuracy of CPC. In the experiment, for each value of N, nokNN(e) represents the average classification accuracy obtained when N neighborhoods are used for aggregation, and kNN represents the average classification accuracy obtained when testing on the N-th neighborhood. A comparison of entropy-based nokNN(e) and kNN with respect to classification accuracy using 10-fold cross-validation is shown in Figure 8. To further verify our aggregation method, we also implemented a symmetric-based neighborhood selection method; refer to Section 3.2 for more details.


Fig. 8. A comparison of nokNN(e) and kNN in classification accuracy

Figure 9 shows that similar results are obtained using the symmetric-based neighborhood selection method. A comparison of entropy-based nokNN(e) with symmetric-based nokNN(s) and distance-based nokNN(d) in classification accuracy is shown in Figure 10. It is obvious that the entropy-based CPC obtains better classification accuracy than the symmetric-based CPC and the distance-based CPC, especially when the number of neighborhoods used for aggregation is relatively small. The experimental results justify our hypotheses: (1) the bias of k can be removed by CPC, and (2) the entropy-based neighborhood selection method indeed improves the classification accuracy of CPC.

5

Conclusions

In this paper we have discussed the issues related to the kNN method for classification. In order for kNN to be less dependent on the choice of k, we looked at


Fig. 9. A comparison of nokNN(s) and kNN in classification accuracy

Fig. 10. A comparison of nokNN(d), nokNN(s), and nokNN(e)

multiple sets of nearest neighbors rather than just one set of k nearest neighbors. A set of neighbors is called a neighborhood. For an instance x, each neighborhood bears support for different possible classes. We have presented a novel formalism based on probability to aggregate the support for various classes to give a more reliable support value, which better reveals the true class of x. Based on this idea, using specific neighborhoods in the spirit of kNN, which always surround the instance to be classified, we have proposed a contextual probability-based classification method together with three different neighborhood selection methods. To choose a given number of neighborhoods with as much information for classification as possible, the proposed entropy-based neighborhood selection method partitions a multidimensional data space into grids and expands the neighborhood each time to the candidate with minimal information entropy among all candidates in the grid. This method is independent of any "distance metric" or "similarity metric". Experiments on some public data sets have shown that using nokNN (whether nokNN(d), nokNN(s), or nokNN(e)) the classification accuracy increases as the number of neighborhoods increases, but stabilizes after a small number of neighborhoods; using kNN, however, the classification performance varies when different neighborhoods are used. Experiments have also shown that the stabilized performance of nokNN(d) is comparable to the best performance of kNN. The comparison of entropy-based, symmetric-based, and distance-based CPC has shown that the entropy-based CPC obtains the highest classification accuracy.


References 1. Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining, The MIT Press, 2001. 2. Sebastiani, F. Machine Learning in Automatic Text Categorization, In ACM Computing Survey, Vol.34, No. 1, pages.1-47, March 2002. 3. Ripley, B. Pattern Recognition and Neural Networks. Cambridge University Press, 1996. 4. Mitchell, T. Machine Learning, MIT Press and McGraw-Hill, 1997. 5. Wang, H. Contextual Probability, Journal of Telecommunications and Information Technology, 4(3), pages 92-97, 2003. 6. Guan, J. and Bell, D. Generalization of the Dempster-Shafer Theory. Proc. IJCAI93, pages 592-597, 1993. 7. Shafer, G. A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1976. 8. Feller, W. An Introduction to Probability Theory and Its Applications, Wiley, 1968. 9. Michie, D., Spiegelhalter, D. J., and Taylor, C. C. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. 10. Wang, H., Duntsch, I. and Bell, D. Data Reduction Based on Hyper Relations. Proc. of KDD98, New York, pages 349-353, 1998.

Improving the Performance of Decision Tree: A Hybrid Approach

LiMin Wang (1), SenMiao Yuan (1), Ling Li (2), and HaiJun Li (2)

1 College of Computer Science and Technology, JiLin University, ChangChun 130012, China
2 School of Computer, YanTai University, YanTai 264005, China

Abstract. In this paper, a hybrid learning approach named Flexible NBTree is proposed. Flexible NBTree uses the Bayes measure to select the proper test and applies a post-discretization strategy to construct the decision tree. The final decision tree's internal nodes contain univariate splits as in regular decision trees, but the leaf nodes contain General Naive Bayes, which is a variant of the standard Naive Bayesian classifier. Empirical studies on a set of natural domains show that Flexible NBTree has clear advantages with respect to generalization ability when compared against its counterpart, NBTree.

Keywords: Flexible NBTree; Bayes measure; General Naive Bayes; post-discretization

1

Introduction

Decision tree based methods of supervised learning represent one of the most popular approaches within the AI field for dealing with classification problems. They have been widely used for years in many domains such as web mining, data mining, pattern recognition, signal processing, etc. But the standard decision tree learning algorithm [1] has difficulty in capturing the relation between continuous-valued data points. It is a key research issue to learn from data consisting of both continuous and nominal variables. Some researchers indicate that hybrid approaches can take advantage of both symbolic and connectionist models to handle tough problems. Much research has addressed the issue of combining decision trees with other learning algorithms to construct hybrid models. Baldwin et al. [2] used mass assignment theory to translate attribute values to probability distributions over fuzzy partitions, then introduced probabilistic fuzzy decision trees in which fuzzy partitions were used to discretize continuous test universes. Tsang et al. [3] used a hybrid neural network to refine a fuzzy decision tree and extract a fuzzy decision tree with parameters, which is equivalent to a set of fuzzy production rules. Based on variable precision rough set theory, Zhang et al. [4] introduced a new concept of generalization and employed the variable precision rough sets (VPRS) model to construct multivariate decision trees.


By redefining the test selection measure, this paper proposes a novel hybrid approach, Flexible NBTree, which attempts to utilize the advantages of both decision trees and Naive Bayes. The final classifier resembles Kohavi's NBTree [5] but differs in two respects: 1. NBTree pre-discretizes the data set by applying an entropy-based algorithm, whereas Flexible NBTree applies a post-discretization strategy to construct the decision tree. 2. NBTree uses standard Naive Bayes at the leaf nodes to handle pre-discretized and nominal attributes, whereas Flexible NBTree uses General Naive Bayes (GNB), a variant of standard Naive Bayes, at the leaf nodes to handle continuous and nominal attributes in the subspace. The remainder of this paper is organized as follows: Sections 2 and 3 introduce the post-discretization strategy and GNB, respectively. Section 4 illustrates Flexible NBTree in detail. Section 5 presents experimental results comparing the performance of Flexible NBTree and NBTree. Section 6 concludes the paper.

2

The Post-discretization Strategy

When applying the post-discretization strategy to construct a decision tree, at each internal node of the tree we first select the test which is the most useful for improving classification accuracy, and then apply discretization to continuous tests.

2.1

Bayes Measure

In this discussion we use capital letters such as X, Y for variable names, and lower-case letters such as x, y to denote specific values taken by those variables. Let P(·) denote a probability and p(·) a probability density function. Suppose the training set T consists of predictive attributes X_1, ..., X_n and class attribute C. Each attribute is either continuous or nominal. The aim of decision tree learning is to construct a tree model which can describe the relationship between the predictive attributes and class attribute C; that is, the classification accuracy of the tree model on data set T should be the highest. Correspondingly, the Bayes measure, which is introduced in this section as a test selection measure, is also based on this criterion. Let X_i represent one of the predictive attributes. According to Bayes' theorem, if X_i is nominal then:

P(c | X_i = x_i) = P(c) P(X_i = x_i | c) / P(X_i = x_i).  (1)

Otherwise, if X_i is continuous, then:

P(c | X_i = x_i) = P(c) p(X_i = x_i | c) / p(X_i = x_i).  (2)


The aim of Bayesian classification is to choose the class that maximizes the posterior probability. When some instances satisfy X_i = x_ij, their class labels are most likely to be:

c*(x_ij) = argmax_c P(c | X_i = x_ij).  (3)

Definition 1. Suppose X_i has m distinct values x_i1, ..., x_im. We define the Bayes measure of X_i as:

BM(X_i) = (1/N) Σ_{j=1}^{m} |{ t ∈ T : t.X_i = x_ij and t.C = c*(x_ij) }|,

where N is the size of set T. Intuitively speaking, BM(X_i) is the classification accuracy obtained when the classifier consists of attribute X_i only. It describes the extent to which the model constructed from attribute X_i fits class attribute C. The predictive attribute which maximizes BM(X_i) is the one that is most useful for improving classification accuracy.
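Under the reading above (the Bayes measure of an attribute is the training accuracy of the classifier built from that attribute alone), a sketch looks like this; names are illustrative.

```python
from collections import Counter, defaultdict

def bayes_measure(attribute_values, class_labels):
    """attribute_values: the value of one predictive attribute for every training
    instance; class_labels: the corresponding class labels.
    Returns the fraction of instances whose class equals the majority class of
    the instances sharing their attribute value."""
    by_value = defaultdict(Counter)
    for v, c in zip(attribute_values, class_labels):
        by_value[v][c] += 1
    correct = sum(max(counter.values()) for counter in by_value.values())
    return correct / len(class_labels)

# Toy example: the attribute separates the classes almost perfectly.
print(bayes_measure(["a", "a", "b", "b"], ["+", "+", "-", "+"]))   # 0.75
```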

2.2

Discretization of Continuous Tests

The aim of discretization is to partition the values of a continuous test X_i into a nominal set of intervals. According to (3), we have:

c*(x_i) = argmax_c P(c) p(X_i = x_i | c),

where the conditional probability density function p(X_i | c) is continuous. Given arbitrary values x' and x'' of attribute X_i, when x' is close enough to x'' there will be p(x' | c) ≈ p(x'' | c) for every class c. So, the class labels inferred from (3) will not change within a small interval of the values of X_i. For clarification, suppose the relationship between the distribution of X_i and C is shown in Fig. 1.

Fig. 1. The relationship between the distribution of X_i and C


We can see from Fig. 1 that the class label inferred from (3) changes only at a few values of X_i. Note that these class labels are inferred from (3); they are not the true class labels of the training instances. In the current example, there are three candidate boundaries corresponding to the values of X_i at which the inferred value of C changes. If we use these boundaries to discretize attribute X_i, the classification accuracy after discretization will be equal to BM(X_i). So, the process of computing BM(X_i) is also the process of discretization: the Bayes measure can be used to automatically find the most appropriate boundaries for discretization and the number of intervals. Although this kind of discretization method can retain classification accuracy, it may lead to too many intervals. The Minimum Description Length (MDL) principle is used in our experimental study to control the number of intervals. Suppose we have sorted sequence S into ascending order by the values of X_i. Such a sequence is partitioned by a boundary B into two subsets S_1 and S_2. The class information entropy of the partition, denoted by E(X_i, B; S), is given by:

E(X_i, B; S) = (|S_1| / |S|) Ent(S_1) + (|S_2| / |S|) Ent(S_2),

where Ent(·) denotes the entropy function, computed from the proportions of the instances in a set that belong to each class. According to the MDL principle, the partitioning within S is reasonable iff

Gain(X_i, B; S) > log2(N - 1) / N + Δ(X_i, B; S) / N,

where Gain(X_i, B; S) = Ent(S) - E(X_i, B; S) is the information gain, which measures the decrease of the weighted average impurity of the partitions compared with the impurity of the complete set S, N is the number of instances in set S, Δ(X_i, B; S) = log2(3^s - 2) - [s · Ent(S) - s_1 · Ent(S_1) - s_2 · Ent(S_2)], and s, s_1, s_2 are the numbers of class labels represented in sets S, S_1, S_2 respectively. This approach can then be applied recursively to all adjacent partitions of attribute X_i, thus creating the final intervals.

3

General Naive Bayes (GNB)

Naive Bayes comes originally from work in pattern recognition and is based on the assumption that the predictive attributes are conditionally independent given the class attribute C, which can be expressed as follows:

P(X_1 = x_1, ..., X_n = x_n | c) = Π_{i=1}^{n} P(X_i = x_i | c).


But when the instance space contains continuous attributes, the situation is different. For clarity, we first consider just two attributes, X_1 and X_2, where X_1 is continuous. Suppose the values of X_1 have been discretized into a set of intervals, each corresponding to a nominal value. Then the independence assumption should be:

P(X_1 ∈ I, X_2 = x_2 | c) = P(X_1 ∈ I | c) P(X_2 = x_2 | c),

where I is an arbitrary interval of the values of attribute X_1. This assumption, which is the basis of GNB, supports very efficient algorithms for both classification and learning. By the definition of a derivative, letting the width of the interval around x_1 tend to zero, we obtain

p(X_1 = x_1, X_2 = x_2 | c) = p(X_1 = x_1 | c) P(X_2 = x_2 | c).  (9)

We now extend (9) to handle a much more common situation. Suppose the first k of the attributes are continuous and the remaining n - k attributes are nominal. Similar to the induction process of (9), we will have

p(x_1, ..., x_k, x_{k+1}, ..., x_n | c) = Π_{i=1}^{k} p(x_i | c) · Π_{j=k+1}^{n} P(x_j | c).  (10)

Then the classification rule of GNB is:

c* = argmax_c P(c) Π_{i=1}^{k} p(x_i | c) Π_{j=k+1}^{n} P(x_j | c).  (11)

The probabilities in (11) are estimated by using the Laplace-estimate and the M-estimate [6], respectively.


Kernel-based density estimation [7] is the most widely used non-parametric density estimation technique. Compared with parametric density estimation techniques, it does not make any assumption about the data distribution. In this paper we choose it to estimate the conditional probability densities in Eq. (11):

p(X = x | c) ≈ (1 / (N_c h)) Σ_{j=1}^{N_c} K((x - x_j) / h),

where x_j is the value of attribute X in the j-th training instance with class label c, K(·) is a given kernel function, h is the corresponding kernel width, and N_c is the number of training instances with class label c. This estimate converges to the true probability density function if the kernel function obeys certain smoothness properties and the kernel width is chosen appropriately. One way of measuring the difference between the true density and the estimated density is the expected cross-entropy; the kernel width h is chosen to minimize the estimated cross-entropy. In our experiments, we use an exhaustive grid search with grid width 0.01 over a fixed range of candidate widths.
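A sketch of a Gaussian-kernel estimate of p(X = x | c) and of a grid search for the kernel width follows. The Gaussian kernel, the leave-one-out likelihood used to stand in for the cross-entropy, and the search range are assumptions, since those details are elided in this copy.

```python
import numpy as np

def kde(x, samples, h):
    """Kernel density estimate of p(X = x | c) from the attribute values observed
    in the training instances of class c, using a Gaussian kernel of width h."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h)

def choose_width(samples, grid=np.arange(0.01, 1.01, 0.01)):
    """Pick the width whose leave-one-out log-likelihood is largest, scanning a
    grid of step 0.01 as mentioned above (the range itself is an assumption)."""
    samples = np.asarray(samples, dtype=float)
    def loo_loglik(h):
        return sum(np.log(kde(samples[i], np.delete(samples, i), h) + 1e-12)
                   for i in range(len(samples)))
    return max(grid, key=loo_loglik)
```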

4

Flexible NBTree

Kohavi proposed NBTree as a hybrid approach combining Naive Bayes and decision trees. It has been shown that NBTree frequently achieves higher accuracy than either a Naive Bayes classifier or a decision tree alone. Like NBTree, Flexible NBTree also uses a tree structure to split the instance space into subspaces and generates one Naive Bayes classifier in each subspace. However, it uses a different discretization strategy and a different version of Naive Bayes. The Flexible NBTree learning algorithm is shown as follows.


5


Experiments

In order to evaluate the performance of Flexible NBTree and compare it against its counterpart, NBTree, we conducted an empirical study on 18 data sets from the UCI machine learning repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases). Each data set consists of a set of classified instances described in terms of varying numbers of continuous and nominal attributes. For comparison purposes, the stopping criteria in our experiments are the same: the relative reduction in error for a split is less than 5% and there are no more than 30 instances in the node. The classification performance was evaluated by ten-fold cross-validation for all the experiments on each data set. Table 1 shows the classification accuracy and standard deviation for Flexible NBTree and NBTree, respectively. A marked entry indicates that the accuracy of Flexible NBTree is higher than that of NBTree at a significance level better than 0.05, using a two-tailed pairwise t-test on the results of the 20 trials on a data set. From Table 1, the significant advantage of Flexible NBTree over NBTree in terms of higher accuracy can be clearly seen. In order to investigate the reasons, we analyze the experimental results on the data set Breast-w in particular. Figure 2 shows the comparison of classification accuracy for Flexible NBTree and NBTree. When N (the training size of data set Breast-w) < 650, the tree structures learned by these two algorithms are almost the same.

ftp://ftp.ics.uci.edu/pub/machine-learning-databases


Fig. 2. Comparison of the classification accuracy

But when N ≥ 650, the decision node in the second layer of Flexible NBTree contains a univariate test that differs from the one learned by NBTree. Correspondingly, from Fig. 2 we can see that when N = 600 Flexible NBTree achieves 92.83% accuracy on the test set while NBTree reaches about 92.73%, and when N = 650 Flexible NBTree achieves 93.51% accuracy while NBTree reaches about 92.92%. The error reduction increases from 1.38% to 8.33%. We attribute this improvement to the effectiveness of the post-discretization strategy. Since no information-lossless discretization procedure is available, some helpful information may be lost in the transformation from an infinite numeric domain to finite subintervals. We conjecture that pre-discretization does not take full advantage of the information that continuous attributes supply; this may affect the cut points of continuous tests, or even test selection, during tree construction, and thus degrade classification performance to some extent. The post-discretization strategy, in contrast, applies discretization only when necessary, so the possibility of information loss is reduced to a minimum.
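The quoted error reductions can be reproduced from the accuracies above as relative reductions of the test error with respect to NBTree's error:

```python
# Relative error reduction of Flexible NBTree over NBTree on Breast-w,
# computed from the accuracies quoted above.
def relative_error_reduction(acc_flexible, acc_nbtree):
    err_f, err_n = 100.0 - acc_flexible, 100.0 - acc_nbtree
    return 100.0 * (err_n - err_f) / err_n

print(round(relative_error_reduction(92.83, 92.73), 2))  # ~1.38 (N = 600)
print(round(relative_error_reduction(93.51, 92.92), 2))  # ~8.33 (N = 650)
```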

6 Summary

Pre-discretization is a common choice for handling continuous attributes in machine learning, but the resulting information loss may affect classification performance negatively. In this paper, we propose a novel learning approach, Flexible NBTree, which is a hybrid of decision tree and Naive Bayes. Flexible NBTree applies a post-discretization strategy to mitigate the negative effect of information loss. Experiments with natural domains showed that Flexible NBTree generalizes much better than NBTree.


References

1. Quinlan, J.R.: Discovering rules from large collections of examples: A case study. In: Expert Systems in the Micro Electronic Age. Edinburgh University Press (1979)
2. Baldwin, J.F., Karale, S.B.: Asymmetric Triangular Fuzzy Sets for Classification Models. In: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES 2003), Oxford, UK (2003) 364-370
3. Tsang, E.C.C., Wang, X.Z., Yeung, Y.S.: Improving learning accuracy of fuzzy decision trees by hybrid neural networks. IEEE Transactions on Fuzzy Systems 8 (2000) 601-614
4. Zhang, L., Ye-Yun, M., Yu, S., Ma-Fan, Y.: A New Multivariate Decision Tree Construction Algorithm Based on Variable Precision Rough Set. In: Advances in Web-Age Information Management (2003) 238-246
5. Kohavi, R.: Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA (1996) 202-207
6. Cestnik, B.: Estimating probabilities: A crucial task in machine learning. In: Proceedings of the 9th European Conference on Artificial Intelligence (1990) 147-149
7. Chen, H., Meer, P.: Robust Computer Vision through Kernel Density Estimation. In: Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark (2002) 236-246
8. Smyth, P., Gray, A., Fayyad, U.: Retrofitting decision tree classifiers using kernel density estimation. In: Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann (1995) 506-514

Understanding Relationships: Classifying Verb Phrase Semantics*

Veda C. Storey1 and Sandeep Purao2

1 J. Mack Robinson College of Business, Georgia State University, Atlanta, GA 30302
2 School of Information Sciences & Technology, The Pennsylvania State University, University Park, PA 16801-3857

Abstract. Relationships are an essential part of the design of a database because they capture associations between things. Comparing and integrating relationships from heterogeneous databases is a difficult problem, partly because of the nature of the relationship verb phrases. This research proposes a multi-layered approach to classifying the semantics of relationship verb phrases to assist in the comparison of relationships. The first layer captures fundamental, primitive relationships based upon well-known work in data abstractions and conceptual modeling. The second layer captures the life cycle of natural progressions in the business world. The third layer reflects the context-dependent nature of relationships. Use of the classification scheme is illustrated by comparing relationships from various application domains with different purposes.

1 Introduction

Comparing and integrating databases is an important problem, especially in an increasingly networked world that relies on inter-organizational coordination and systems. With this comes the need to develop new methods to design and integrate disparate databases. Database integration, however, is a difficult problem and one for which semi-automated approaches would be useful. One of the main difficulties is comparing relationships because their verb phrases may be generic or dependent upon the application domain. Being able to compare the semantics of verb phrases in relationships would greatly facilitate database design comparisons. It would be even more useful if the comparison process could be automated. Fully automated techniques, however, are unlikely, so solutions to integration problems should aid integrators while requiring minimal work on their part [Biskup and Embley, 2003]. The objective of this research is to propose an ontology for understanding the semantics of relationship verb phrases by mapping the verb phrases to various categories that capture different interpretations. Doing so requires that a classification scheme be developed that captures both the domain-dependent and domain-independent nature of verb phrases. The contribution of this research is to provide a useful approach to classifying verb phrases so relationships can be compared in a semi-automated way.

* This research was partially supported by the J. Mack Robinson College of Business, Georgia State University, and Pennsylvania State University.


2 Related Work

The design of a database involves representing the universe of discourse in a structure in such a way that it accurately reflects reality. Conceptual modeling of databases is, therefore, concerned with things (entities) and associations among things (relationships) [Chen 1993; Wand et al. 1999]. A relationship R can be expressed as A verb phrase B (A vp B), where A and B are entities. Most database design practices use simple, binary associations that capture these relationships between entities. A verb phrase, which is selected by a designer with the application domain in mind, can capture much of the semantics of the relationship. Semantics, for this research, is defined as the meaning of a term or a mapping from the real world to a construct. Understanding a relationship, therefore, requires that one understand the semantics of the accompanying verb phrase. Consider the relationships from two databases:

Customer (entity) buys (verb) Product (entity)
Customer (entity) purchases (verb) Product (entity)

These relationships reflect the same aspect of the universe of discourse, and use synonymous verb phrases. Therefore, the two relationships may be mapped to a similar interpretation, recognized as identical, and integrated. Next, consider:

Customer reserves Car
Customer rents Car

These relationships reflect different concepts from the universe of discourse. The first captures the fact that a customer wants to do something; the second, that the customer has done it. These may be viewed as different states in a life cycle progression, but the two underlying relationships cannot be considered identical. Thus, they could not be mapped to the same semantic interpretation. Finally, consider:

Manager considers Agreement
Manager negotiates Agreement

The structures of the relationships suggest that both relationships represent an interaction. However, "negotiates" implies changing the status, whereas "considers" involves simply viewing the status. On the other hand,

Manager makes Agreement
Manager writes Agreement

may capture an identical notion of creation. These examples illustrate the importance of employing and understanding how a verb phrase captures the semantics of the application domain. The interpretation of verbs depends upon the nouns (entities) that surround them [Fellbaum, 1998].

Research has been carried out on defining and understanding ontology creation and use. There are different definitions and interpretations of ontologies [Weber 2002]. In general, though, ontologies deal with capturing, representing, and using surrogates for the meanings of terms. This research adopts the approach of Dahlgren [1988], who developed an ontology system as a classification scheme for speech understanding and implemented it as an interactive tool. Work on ontology development has also been carried out in database design (Embley et al. [1999], Kedad and Metais [1999], Dullea and Song [1999], Bergholtz and Johannesson [2001]).


These efforts provide useful insights and build upon data abstractions. However, no comprehensive ontology for classifying relationships has been proposed.

3 Ontology for Classifying Relationships

This section proposes an ontology for classifying the verb phrases of relationships. The ontology is of the type developed by Dahlgren [1988], which operates as an interactive system to classify things. The most important part is the classification scheme. It is the focus of this research and is divided into three layers (Figure 1). The layers were developed by considering: 1) prior research in data modeling, in particular, data abstractions and the inherent business life cycle; 2) the local context of the entities; and 3) the domain-dependent nature of verb phrases.

Fig. 1. Relationship classification levels

3.1 Fundamental Categories

The fundamental categories are primitives that reflect a natural division in the real world. This category has three general classes that form the basis of how things in the real world can be associated with each other: status, change in status, and interaction, as shown in Figure 2.

Fig. 2. Fundamental Categories


Status captures the fact that one thing has a status with respect to the other. These are primitives because they describe a permanent, or durable, association of one entity with another, expressing the fact that A has a certain status with respect to B.

Business applications follow a natural life cycle of conception or creation through to ownership and, eventually, destruction. The change of status category describes this transition from one status to another. Relationships in this category express the fact that A is transitioning from one status with respect to B to another status with respect to B.

An interaction does not necessarily lead to a change of status of either entity; a change of status results when the effect of an interaction is worth remembering. Consider the verb phrase 'create.' In some cases, it is useful to remember this as a status, as in Author writes Book. In other cases, the interaction itself is important, even if it does not result in a change of status. The interaction category, therefore, expresses the fact that A interacts in some way with B.

These fundamental categories are sufficiently coarse that all verb phrases will map to them. They are also coarse enough to warrant finer categories to distinguish among the large set of relationships in each category. Thus, further refinement is needed for each fundamental category.

3.1.1 Refining the Category: Status

The 'Status' category has been extensively studied by research on data abstractions, which focuses on the structure of relationships as a surrogate for understanding their semantics. Most data abstractions associate entities at different levels of abstraction (sub/superclass relationships) [Goldstein and Storey, 1999]. Since data abstractions infer semantics based on the structure of relationships, they provide a good starting point for understanding the semantics of relationships. Research on understanding natural language also provides verb phrase categories such as auxiliary, generic, and other types. The first layer captures fundamental differences between kinds of relationships and was built by considering prior, well-accepted research on data abstractions and other frequently used verb phrases whose interpretation is unambiguous. These are independent of context. This category, thus, captures the fundamental ways in which things in the real world are related, so the categories in this level can be used to distinguish among the fundamental types. Additional results from research on patterns [Coad, 1995] and linguistic analysis [Miller, 1990] result in a hierarchical classification with defined primitives at the leaves of the tree. Figure 3 shows this finer classification of the category 'Status.' Examples of primitive status relationships are shown in Table 1. There are two variations of one thing being assigned to another: is-assigned-to and is-subjected-to. In A is-subjected-to B, A does not have a choice with respect to its relationship with B, whereas it might in the former. Temporal relationships capture the sequence of when things happen and can be clearly categorized as before, during, and after.

3.1.2 Refining the Category: Change of Status

The change-of-status primitives, in conjunction with the status primitives, capture the lifecycle transitions for each status. Although the idea of a lifecycle has been alluded


Fig. 3. Primitives for the Category ‘Status’

to previously [Hay 1996], prior research has not systematically recognized the lifecycle concept. Our conceptualization of the 'Change of Status' category is based on an extension and understanding of each primitive in the 'Status' category during the business lifecycle. Consider verb phrases that deal with acquiring something, as is typical of business transactions related to the status primitive 'is-owner-of.' The lifecycle for this status primitive has the states shown in Figure 4.

Fig. 4. The Relationship Life Cycle

Each state may, in turn, be mapped to different status primitives. For example, the lifecycle starts with needing something (‘has-attitude-towards’ and ‘requires’) which is followed by intending to become an owner (‘acquire’ or ‘create’), owning (‘owner’ or ‘in-control-of’) and giving up ownership (‘seller’ or ‘destroyer’). The primitives therefore illustrate a lifecycle that goes through creation or acquisition, ownership, and destruction. The life cycle can be logically divided into: intent, attempt to acquire, transition to acquiring, intent to give up, attempt to give up, and transition to giving up. Table 2 shows this additional information superimposed on the different states within the lifecycle. The sub-column under the change-of-status primitives shows the meanings captured in each: intent, attempt and the actual transition.

3.1.3 Refining the Category: Interaction

'Interaction' describes communication of short duration between two entities or an operation of one entity on another. The interaction may cause a change in one of the entities. For example, one entity may 'manipulate' another [Miller, 1990], or cause movement of the other through time or space ('transmit,' 'receive'). Two entities may interact without causing change to either ('communicate with,' 'observe'). One entity may interact with another also by way of performance ('operate,' 'serve'). Figure 5 shows the primitives for 'Interaction' with examples given in Table 3.

Fig. 5. Primitives for the Category ‘Interaction’

3.2 The Local (Internal) Context The second category captures internal context by taking into account the nature of the entities surrounding the verb phrase, highlighting the need to understand the nouns that surround verb phrases [Fellbaum, 1998]. For this research, entities are classified as: actor, action, and artifact. Actor entities are capable of performing independent actions. Action represents the performance of an act. Artifact represents an inanimate object not capable of independent action. After entities have been classified, valid primitives can be specified for each pair of entity types. For example, it does not make sense to allow the primitive ‘perform’ for two entities of the kind ‘Actor.’ On the other hand, this primitive is appropriate when one of the entities is classified as ‘Actor’ and the other as ‘Action.’ The argument can be applied both to the ‘Status’ and ‘Interaction’ primitives. Because the ‘Change of Status’ primitives capture the lifecycle of ‘Status’ primitives, constraints identified for ‘Status’ primitives apply to the ‘Change of Status’ primitives as well. Table 4 shows these constraints for ‘Status’ primitives. Similar constraints have been developed for the ‘Interaction’ primitives.
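A small sketch of how such entity-type constraints might be encoded and checked; the particular pairings and primitive names listed below are illustrative assumptions, not the full constraint set developed in the paper's Table 4.

```python
# Allowed 'Status' primitives per (entity type, entity type) pair: a small,
# illustrative subset; the paper's Table 4 defines the complete constraints.
ENTITY_TYPES = {"Actor", "Action", "Artifact"}

ALLOWED_STATUS_PRIMITIVES = {
    ("Actor", "Action"): {"perform"},
    ("Actor", "Artifact"): {"is-owner-of", "in-control-of"},
    ("Artifact", "Artifact"): {"is-assigned-to"},
}

def candidate_primitives(a_type, b_type):
    """Primitives offered to the user for a relationship 'A vp B',
    given the classification of A and B as actor, action, or artifact."""
    if a_type not in ENTITY_TYPES or b_type not in ENTITY_TYPES:
        raise ValueError("entities must be classified as Actor, Action or Artifact")
    return ALLOWED_STATUS_PRIMITIVES.get((a_type, b_type), set())

# 'perform' is not offered for two actors, but is for an actor and an action:
print(candidate_primitives("Actor", "Actor"))    # set()
print(candidate_primitives("Actor", "Action"))   # {'perform'}
```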


3.3 Global (External) Context

The third level captures the external context, that is, the domain in which the relationship is used, reflecting the domain-dependent nature of verb phrases. Although attempts have been made to capture taxonomies of such domain-dependent verbs, a great deal of manual effort has been involved. This research takes a more pragmatic approach, where a knowledge base of domain-dependent verb phrases may be constructed over time while the implemented ontology is being used. When the user classifies a verb phrase, its classification and application domain should be stored. Consider the use of 'opens' in a theatre database versus a bank database. The relationship Character opens Door in the theatre domain maps to the interaction primitive <manipulates>. In the bank application, Teller opens Account and Customer opens Account map to status primitives. If a verb phrase has already been classified by a user, it can be suggested as a preliminary classification for additional users who are interested in classifying it. If a verb phrase has already been classified by a different user for the same application domain, then that classification should be displayed to the user, who would either agree with the classification or provide a new one. New classifications will also be stored. Ideally, consensus will occur over time. This way the knowledge base builds up, ensuring that the verbs important to different domains are captured appropriately. The following will be stored: [Relationship, Verb phrase classification, Application Domain, User]
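A minimal sketch of the knowledge-base lookup described above: each classification is stored as [Relationship, Verb phrase classification, Application Domain, User], and earlier classifications of the same verb phrase in the same domain are suggested to later users. The class and field names, and the example classification strings, are assumptions for illustration.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "relationship classification domain user")

class VerbPhraseKB:
    """Knowledge base of domain-dependent verb phrase classifications,
    built up over time as users classify relationships."""
    def __init__(self):
        self.entries = []

    def store(self, relationship, classification, domain, user):
        self.entries.append(Entry(relationship, classification, domain, user))

    def suggest(self, verb_phrase, domain):
        """Return previously stored classifications of this verb phrase in
        this domain, shown to the user as preliminary suggestions."""
        return [e.classification for e in self.entries
                if verb_phrase in e.relationship and e.domain == domain]

kb = VerbPhraseKB()
kb.store(("Character", "opens", "Door"), "interaction:<manipulates>", "theatre", "u1")
kb.store(("Teller", "opens", "Account"), "status", "bank", "u2")
print(kb.suggest("opens", "bank"))     # ['status']
print(kb.suggest("opens", "theatre"))  # ['interaction:<manipulates>']
```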

3.4 Use of the Ontology

The ontology can be used for comparing relationships across two databases by first comparing the entities, followed by classification of the verb phrases accompanying the relationships. Examples are shown in Table 5. The ontology consists of a verb phrase classification scheme, a knowledge base that stores the classified verb phrases, organized by user and application, and a user-questioning scheme as mentioned above. The user is instructed to classify the entities of a relationship as actor, action, or artifact. The next step is to classify the verb phrase. First, the user is asked to select one of the three categories: 'Status,' 'Interaction,' or 'Change of Status.' Based on this selection, and the constraints provided by the entity types, primitives within each category are presented to the user for an appropriate classification. Suppose a user classifies a relationship as 'Status.' Then, knowing the nature of the entities, only certain primitives are presented as possible for the classification of the relationship. Furthermore, identifying that a verb phrase is either status, change of status, or interaction restricts the subset of categories from which an appropriate classification can be obtained and, hence, the options presented to the user. If the verb phrase cannot be classified in this way, then the other levels are checked to see if they are needed.

4 Assessment

Assessing an ontology is a difficult task. A plausible approach to the assessment of an ontology is suggested by Gruninger and Fox [1995]. They suggest evaluating the 'competency' of an ontology. One way to determine this 'competency' is to identify a list of queries that a knowledge base built on the ontology should be able to answer (competency queries). Based on these queries, the ontology may be evaluated by posing questions such as: Does the ontology contain enough information to answer these types of queries? Do the answers require a particular level of detail or representation of a particular area? Noy and McGuinness [2001] suggest that the competency questions may be representative, but need not be exhaustive. Following our intent of classifying relationships for the purpose of comparison across databases, we attempted to assess whether the classification scheme of the ontology can provide a correct and complete classification of relationship verb phrases. To do so, a study was carried out which involved the following steps: 1) generation of the verb phrases to be classified; 2) generation of relationships using the verb phrases in different application domains; and 3) classification of all verb phrases.


Step 1: Generation of Verb Phrases

Only business-related verbs were used because the intent of the relationship ontology is use for business databases. Furthermore, this restricts the scope of the research. Since the SPEDE verbs [Cottam, 2000] were developed for business applications, these automatically became part of the sample set. The researchers independently selected business-related verbs from a set of 700 generated randomly from WordNet. The verbs that were common to the selections made by both researchers were added to the list from SPEDE. The same procedure was carried out on a set of 300 verbs that were randomly selected by people who support the online dictionary http://dictionary.cambridge.org/. This resulted in a total of 211 business verbs.

Step 2: Generation of Relationships Containing Verbs by Application Domain

For each verb, a definition was obtained from the on-line dictionary. Dictionaries provide examples for understanding and context, which helped to generate the relationships. Relationships were generated for seven application domains (approximately 30 verbs in each): 1) education, 2) business management, 3) manufacturing, 4) airline, 5) service, 6) marketing, and 7) retail. Examples are shown in Table 6.

After generating the relationships, the researchers independently classified them using the relationship ontology. First, 30 verbs were classified, and the researchers agreed on 80% of the cases. The remaining verbs were then classified. The next step involved assessing how many of the ontology classifications the set of 211 verbs covered, to test for completeness. The researchers generated additional relationships for ten subclasses, for a total of 225. Sample classifications are shown in Table 7. The results of this exercise were encouraging, especially given our focus on evaluating the competency of the ontology [Gruninger and Fox 1995]. The classification scheme worked well for these sample relationships. It allowed for the classification of all verb phrases. The biggest difficulty was in identifying whether to move from one level to the next. For example, Student acquires Textbook is immediately classifiable by the primitives. In other cases, the next layer was necessary. Further research is needed to design a user interface that can explain the use and categories to the user so they can be effectively applied. A preliminary version of a prototype has been developed. This will be completed and an empirical test carried out with typical end-users, most likely database designers.

5 Conclusion

A classification scheme for comparing relationship verb phrases has been presented. It is based upon results obtained from research on conceptual modeling, common-sense knowledge of a typical life cycle, and the domain-dependent nature of relationships. Further research is needed to complete the ontology system of which the classification scheme will be a part. Then, it needs to be expanded to allow for multiple classifications, and the user interface refined.

References

1. Bergholtz, M. and Johannesson, P., "Classifying the Semantics of Relationships in Conceptual Modelling by Categorization of Roles," Proceedings of the 6th International Workshop on Applications of Natural Language to Information Systems (NLDB'01), 28-29 June 2001, Madrid, Spain.
2. Biskup, J. and Embley, D.W., "Extracting Information from Heterogeneous Information Sources using Ontologically Specified Target Terms," Information Systems, Vol. 28, No. 3, 2003.
3. Brachman, R.J., "What IS-A Is and Isn't: An Analysis of Taxonomic Links in Semantic Networks," IEEE Computer, October 1983.
4. Brodie, M., "Association: A Database Abstraction," Proceedings of the Entity-Relationship Conference, 1981.
5. Chen, P., "The Entity-Relationship Approach," in Information Technology in Action: Trends and Perspectives, Englewood Cliffs: Prentice Hall, 1993, pp. 13-36.
6. Coad, P. et al., Object Models: Strategies, Patterns, & Applications. Prentice Hall, 1995.
7. Cottam, H., "Ontologies to Assist Process Oriented Knowledge Acquisition," http://www.spede.co.uk/papers/papers.htm, 2000.
8. Dahlgren, K., Naive Semantics for Natural Language Understanding, Kluwer Academic Publishers, Hingham, MA, 1988.


9. Dullea, J. and Song, I.-Y., "A Taxonomy of Recursive Relationships and Their Structural Validity in ER Modeling," in Akoka, J. et al. (eds.), Conceptual Modeling – ER'99, International Conference on Conceptual Modeling, Lecture Notes in Computer Science 1728, Paris, France, 15-18 November 1999, pp. 384-389.
10. Embley, D., Campbell, D.M., Jiang, Y.S., Ng, Y.K., Smith, R.D., Liddle, S.W. and Quass, D.W., "A Conceptual-Modeling Approach to Web Data Extraction," Data & Knowledge Engineering, 1999.
11. Fellbaum, C., "Introduction," in WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, Mass., 1998, pp. 1-19.
12. Gamma, E., Helm, R., Johnson, R. and Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1995.
13. Goldstein, R.C. and Storey, V.C., "Data Abstractions: Why and How," Data and Knowledge Engineering, Vol. 29, No. 3, 1999, pp. 1-18.
14. Gruninger, M. and Fox, M.S., "Methodology for the Design and Evaluation of Ontologies," in Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal.
15. Hay, D.C. and Barker, R., Data Model Patterns: Conventions of Thought. Dorset House, 1996.
16. Kedad, Z. and Metais, E., "Dealing with Semantic Heterogeneity During Data Integration," in Akoka, J. et al. (eds.), Conceptual Modeling – ER'99, International Conference on Conceptual Modeling, Lecture Notes in Computer Science 1728, Paris, France, 15-18 November 1999, pp. 325-339.
17. Larman, C., Applying UML and Patterns. Prentice-Hall, 1997.
18. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D. and Miller, K.J., "Introduction to WordNet: An On-line Lexical Database," International Journal of Lexicography, Vol. 3, No. 4, 1990, pp. 235-244.
19. Motschnig-Pitrik, R. and Mylopoulos, J., "Class and Instances," International Journal of Intelligent and Cooperative Systems, Vol. 1, No. 1, 1992, pp. 61-92.
20. Motschnig-Pitrik, R., "A Generic Framework for the Modeling of Contexts and its Applications," Data and Knowledge Engineering, Vol. 32, 2000, pp. 145-180.
21. Noy, N.F. and McGuinness, D.L., Ontology Development 101: A Guide to Creating Your First Ontology, 2001. Available at http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html. Accessed 15 March 2004.
22. Smith, J. and Smith, D., "Database Abstractions: Aggregation and Generalization," ACM Transactions on Database Systems, Vol. 2, No. 2, 1977, pp. 105-133.
23. Wand, Y., Storey, V.C. and Weber, R., "Analyzing the Meaning of a Relationship," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, pp. 494-528.
24. Weber, R., "Ontological Issues in Accounting Information Systems," in Sutton, S. and Arnold, V. (eds.), Researching Accounting as an Information Systems Discipline, 2002.

Fast Mining Maximal Frequent ItemSets Based on FP-Tree

Yuejin Yan*, Zhoujun Li, and Huowang Chen

School of Computer Science, National University of Defense Technology, Changsha 410073, China
Tel: 86-731-4532956
http://www.nudt.edu.cn

Abstract. Maximal frequent itemsets mining is a fundamental and important problem in many data mining applications. Since the MaxMiner algorithm introduced the enumeration trees for MFI mining in 1998, there have been several methods proposed to use depth-first search to improve performance. This paper presents FIMfi, a new depth-first algorithm based on FP-tree and MFI-tree for mining MFI. FIMfi adopts a novel item ordering policy for efficient lookaheads pruning, and a simple method for fast superset checking. It uses a variety of old and new pruning techniques to prune the search space. Experimental comparison with previous work reveals that FIMfi reduces the number of FP-trees created greatly and is more than 40% superior to the similar algorithms on average.

1 Introduction

Since the frequent itemsets mining problem (FIM) was first addressed [1], frequent itemsets mining in large databases has been an important problem, for it enables essential data mining tasks such as discovering association rules, data correlations, sequential patterns, etc. There are two types of algorithms to mine frequent itemsets. The first one is the candidate set generate-and-test approach [1]. The basic idea is to generate and test candidate itemsets. Each candidate itemset with k+1 items is only generated from frequent itemsets with k items. This process is repeated in a bottom-up fashion until no candidate itemset can be generated. At each level, the frequencies of all candidate itemsets are tested by scanning the database. But this method requires scanning the database several times; in the worst case, the number of scans equals the maximal length of the frequent itemsets. Besides this, many candidate itemsets are generated, most of which turn out to be infrequent. The other method is the data transformation approach [2, 4]: it avoids the cost of generating and testing a large number of candidate sets by growing a frequent itemset from its prefix. It constructs a sub-database related to each frequent itemset h such that all frequent itemsets that have h as a prefix can be mined using only the sub-database.

* Corresponding author.


The number of frequent itemsets increases exponentially with the length of the frequent itemsets, so mining all frequent itemsets becomes infeasible when the frequent itemsets are long. However, since every frequent itemset is a subset of some maximal frequent itemset, it is sufficient to discover only the maximal frequent itemsets. As a result, researchers now turn to finding MFI (maximal frequent itemsets) [5,6,9,10,4,7]. A frequent itemset is called maximal if it has no frequent superset. Given the set of MFIs, it is easy to analyze some interesting properties of the database, such as the longest pattern, the overlap of the MFIs, etc. Also, there are applications where the MFIs are adequate, for example, combinatorial pattern discovery in biological applications [3].

This paper focuses on the MFI mining problem based on the data transformation approach. We use an FP-tree to represent the sub-database containing all relevant frequency information, and an MFI-tree to store information about discovered MFIs that is useful for superset frequency pruning. With these two data structures, our algorithm takes a novel item ordering policy and integrates a variety of old and new pruning strategies. It also uses a simple but fast superset checking method along with some other optimizations.

The remainder of this paper is organized as follows. In Section 2, we briefly review the MFI mining problem and introduce the related work. Section 3 gives the MFI mining algorithm, FIMfi, which does the MFI mining based on FP-tree and MFI-tree. In this section we also introduce our novel item ordering policy, the pruning strategies we applied, and the simple but fast superset checking that is needed for efficient "lookaheads" pruning. In Section 4, we compare our algorithm with some previous work. Finally, Section 5 gives the conclusions.

2 Preliminaries and Related Work

This section formally describes the MFI mining problem and the set enumeration tree that represents the search space. The related work and the two important data structures used in our scheme, the FP-tree and the MFI-tree, are also introduced in this section.

2.1 Problem Revisit

Let I = {i1, i2, ..., im} be a set of m distinct items. Let D denote a database of transactions, where each transaction contains a set of items. A set X ⊆ I is also called an itemset. An itemset with k items is called a k-itemset. The support of an itemset X, denoted sup(X), is the number of transactions in which X occurs as a subset. For a given D and a threshold min_sup, itemset X is frequent if sup(X) ≥ min_sup. If X is frequent and for any Y ⊃ X we have sup(Y) < min_sup, then X is called a maximal frequent itemset. From the definitions we have two lemmas as follows:

Lemma 1: A proper subset of any frequent itemset is not a maximal frequent itemset.

Lemma 2: A subset of any frequent itemset is a frequent itemset; a superset of any infrequent itemset is not a frequent itemset.
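The definitions above can be made concrete with a small brute-force sketch (exponential in |I|, so only for tiny examples): it computes supports directly and keeps the frequent itemsets that have no frequent proper superset. The toy database below is an illustration, not one of the paper's figures.

```python
from itertools import combinations

def support(itemset, transactions):
    """sup(X): the number of transactions containing X as a subset."""
    return sum(1 for t in transactions if itemset <= t)

def maximal_frequent_itemsets(transactions, min_sup):
    items = sorted(set().union(*transactions))
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(frozenset(c), transactions) >= min_sup]
    # An MFI is frequent and has no frequent proper superset (Lemma 1).
    return [x for x in frequent if not any(x < y for y in frequent)]

transactions = [frozenset(t) for t in ["abc", "abcd", "abd", "acde", "bc"]]
print([set(m) for m in maximal_frequent_itemsets(transactions, min_sup=3)])
# e.g. {d}, {a, b}, {a, c}, {b, c}
```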


Given a transactional database D with item set I, any combination of the items in I could be frequent, and all these combinations compose the search space, which can be represented by a set enumeration tree [5]. For example, suppose I = {a,b,c,d,e,f} is sorted in lexicographic order; then the search tree is as shown in Figure 1. To keep the tree from growing too big, we use the subset infrequency pruning and superset frequency pruning techniques in the tree; the two pruning techniques are introduced in the next section. The root of the tree represents the empty itemset, and the nodes at level k contain all of the k-itemsets. The itemset associated with each node n will be referred to as the node's head, head(n). The possible extensions of the itemset are denoted con_tail(n), which is the set of items after the last item of head(n). The frequent extensions, denoted fre_tail(n), are the set of items that can be appended to head(n) to build longer frequent itemsets. In a depth-first traversal of the tree, fre_tail(n) contains only the frequent extensions of n. The itemset associated with each child node of node n is built by appending one item of fre_tail(n) to head(n). In the example of Figure 1, suppose node n is associated with {b}; then head(n) = {b} and con_tail(n) = {c,d,e,f}. We can see that {b,f} is not frequent, so fre_tail(n) = {c,d,e}. The child node of n, {b,e}, is built by appending e from fre_tail(n) to {b}.

Fig. 1. Search space tree

The problem of MFI mining can be thought of as finding a border through the tree: all the itemsets above the border are frequent, and the others are not. All MFIs lie near the border. In our example in Figure 1, the itemsets shown in ellipses are MFIs.

2.2 Related Work

Given the set enumeration tree, we can describe the most recent approaches to the MFI mining problem. MaxMiner [5] employs a breadth-first traversal policy for the search. To reduce the search space according to Lemma 1, it performs not only subset infrequency pruning, to skip over itemsets that have an infrequent subset, but also superset frequency pruning (also called lookaheads pruning). To increase the effectiveness of superset frequency pruning, MaxMiner dynamically reorders the children nodes, a technique used in all the MFI algorithms after it [4,6,7,9,10]. Normally a depth-first approach has better performance on lookaheads, but MaxMiner uses a breadth-first approach instead to limit the number of passes over the database.


DepthProject performs a mixed depth-first traversal and applies subset infrequency pruning and a variation of superset frequency pruning [6] to the tree. It also uses an improved counting method based on transaction projections along its branches. The original database and the projections are represented as a bitmap. The experimental results in [6] show that DepthProject outperforms MaxMiner by more than a factor of two.

Mafia [7] is another depth-first algorithm; it also uses a vector bitmap representation, where the count of an itemset is based on the corresponding column of the bitmap. Besides the two pruning methods mentioned above, another novel pruning technique called PEP (Parent Equivalence Pruning), from [8], is also used in Mafia. The experiments in [7] show that PEP prunes the search space greatly. Both DepthProject and Mafia mine a superset of the MFIs and require a post-pruning step to eliminate non-maximal frequent itemsets.

GenMax [9] integrates the pruning with the mining and finds the exact MFIs by using two strategies. First, just as the transaction database is projected on the current node, the discovered MFI set can also be projected on the node (Local MFI), which yields fast superset checking; second, GenMax uses Diffset propagation to do fast frequency computation.

AFOPT [4] uses a data structure called the AFOPT tree, in which items are ordered by ascending frequency, to store the transactions of the original database. It also uses subset infrequency pruning, superset frequency pruning, and PEP pruning to reduce the search space. And it employs LMFIs generated by a pseudo-projection technique to test whether a frequent itemset is a subset of one of them.

FPmax* is an extension of the FP-growth method, for MFI mining only. It uses an FP-tree to store the transaction projection of the original database for each node in the tree. In order to test whether a frequent itemset is a subset of any discovered MFI in lookaheads pruning, another tree structure (the MFI-tree) is utilized to keep track of all discovered MFIs; this makes superset checking effective. FPmax* uses an array for each node to store the counts of all 2-itemsets that are subsets of the frequent extensions itemset, which lets the algorithm scan each FP-tree only once for each recursive call emanating from it. The experimental results in [10] show that FPmax* has the best performance for almost all the tested databases.

2.3 FP-Tree and MFI-Tree

The FP-growth method [2] builds a data structure called the FP-tree (Frequent Pattern tree) for each node of the search space tree. The FP-tree is a compact representation of all relevant frequency information of the current node: each of its paths from the root to a node represents an itemset, and the nodes along the paths are stored according to the order of the items in fre_tail(n). Each node of the FP-tree also stores the number of transactions or conditional pattern bases containing the itemset represented by the path. Compression is achieved by building the tree in such a way that overlapping itemsets share prefixes of the corresponding branches. Each FP-tree is associated with a header table. The single items in the tail, together with the support of the itemset that is the union of the head and the item, are stored in the header table in decreasing order of support. The entry for an item also contains the head of a list that links to all the corresponding nodes of the FP-tree.


To construct the FP-tree of node n, the FP-growth method first finds all the frequent items in fre_tail(n) by an initial scan of the database, or of head(n)'s conditional pattern bases that come from the FP-tree of its parent node. These items are then inserted into the header table in the order of the items in fre_tail(n). In the next and last scan, each frequent itemset that is a subset of the tail is inserted into the FP-tree as a branch. If a new itemset shares a prefix with another itemset that is already in the tree, then the new itemset shares the branch representing the common prefix with the existing itemset. For example, for the database and min_sup shown in Figure 2(a), the FP-trees of the root and of itemset {f} are shown in Figure 2(b) and (c).

Fig. 2. Examples of FP-tree
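A compact sketch of the FP-tree structure and the insertion step just described: nodes carry an item, a count, and child links, and the header table keeps, for each item, its support and a list of its nodes. The field names and the toy transactions are illustrative; the paper's trees additionally store the node levels used in Section 3.2.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

class FPTree:
    def __init__(self, item_order):
        self.root = FPNode(None, None)
        self.item_order = item_order            # frequent items in header order
        self.header = {i: {"support": 0, "nodes": []} for i in item_order}

    def insert(self, transaction, count=1):
        """Insert one transaction (or conditional pattern base): keep only the
        frequent items, in header order, and share any existing prefix branch."""
        items = [i for i in self.item_order if i in transaction]
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:                   # new branch for this suffix
                child = FPNode(item, node)
                node.children[item] = child
                self.header[item]["nodes"].append(child)
            child.count += count
            self.header[item]["support"] += count
            node = child

# Toy usage: items are assumed to be already sorted into the header order.
tree = FPTree(item_order=["a", "c", "b", "d"])
for t in ["acd", "ab", "acb", "cd"]:
    tree.insert(set(t))
print(tree.header["a"]["support"], len(tree.header["c"]["nodes"]))  # 3 2
```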

FPmax* uses an array for each node, along with the FP-tree, to avoid the first scan of the conditional pattern bases. For each 2-itemset {a,b} in the frequent extensions itemset, an array entry is used to store the support of head(n) ∪ {a,b}; then, when extending the tree from a node to one of its children, we can build the header of the child's FP-tree according to the array and avoid scanning the FP-tree of the current node again.

Considering a given MFI M at node n in depth-first MFI mining, if head(n) ∪ fre_tail(n) ⊆ M, then all the children of n need not be considered, according to Lemma 1. This is the superset frequency pruning, also called lookaheads in [5]. Lookaheads needs access to some information about the discovered MFIs relevant to the current node for pruning. FPmax* uses another FP-tree-like structure (the MFI-tree) to meet this need. The differences between the MFI-tree and the FP-tree of the same node are as follows: first, the nodes do not record frequency information, but they store the length of the itemset represented by the path from the root to the current node; second, for each itemset S represented by a path, head(n) ∪ S is a subset of a certain discovered MFI. In addition, when considering an offspring node of a node, the MFI-tree of the node is updated as soon as a new MFI is found. Figure 3 shows several examples of MFI-trees.

3 Mining Maximal Frequent Itemsets by FIMfi

In this section, we discuss our algorithm FIMfi in detail and explain why it is faster than some previous schemes.


Fig. 3. Examples of MFI-tree

3.1 Pruning Techniques

Subset Infrequency Pruning: Suppose n is a node in the search space tree. For each item x in con_tail(n) that could become an item of fre_tail(n), we need to compute the support of the itemset head(n) ∪ {x}. If sup(head(n) ∪ {x}) < min_sup, then we do not add x to fre_tail(n), and the node identified by the itemset head(n) ∪ {x} is not considered any more. This is based on Lemma 2: all itemsets that are supersets of head(n) ∪ {x} are not frequent.

Superset Frequency Pruning: Superset frequency pruning is also called lookaheads pruning. Considering a node n, if the itemset head(n) ∪ fre_tail(n) is frequent, then all the children nodes of n should be pruned (Lemma 1). There are two existing methods for determining whether this itemset is frequent. The first is to count the support of head(n) ∪ fre_tail(n) directly; this method is normally used in breadth-first algorithms such as MaxMiner. The second is to check whether a superset of head(n) ∪ fre_tail(n) is already among the discovered MFIs; it is commonly used by the depth-first MFI algorithms [4,7,9,10]. There are also other techniques, such as LMFI and MFI projection, that are used to reduce the cost of checking. For example, in the MFI-tree situation, we can just check whether a superset of fre_tail(n) can be found among the conditional pattern bases of head(n), and then finish the superset checking.

Here we propose a new way to do lookaheads pruning based on the FP-tree. For a given node, we can get all the conditional pattern bases of head(n) from the FP-tree of its parent node, and then our algorithm tries to find a superset of fre_tail(n) among the conditional pattern bases whose last items' counts are no less than the minimum support. If we find one, S, then we know head(n) ∪ S is frequent, so head(n) ∪ fre_tail(n) is frequent based on Lemma 2. For example, when considering itemset {b}, the fre_tail of {b} is {a,c}, and there is a conditional pattern base of {b}, a:3, c:3 (Figure 2(b)); then we know {a,b,c} is frequent, and all the children of {b} will be pruned. If FIMfi finds such a superset S of fre_tail(n) in the FP-tree and head(n) ∪ S is an undiscovered MFI, FIMfi needs to update the MFI-trees with head(n) ∪ S as described before. In addition, we also do superset frequency pruning with the itemset head(n) ∪ con_tail(n): before generating fre_tail(n) from con_tail(n), our algorithm checks whether there is a superset of con_tail(n) in the FP-tree. This is feasible because our scheme uses a very simple and fast method to do the superset checking (see Section 3.2).


Parent Equivalence Pruning: FIMfi also uses PEP for its efficiency. Take any item x from fre_tail(n) such that sup(head(n) ∪ {x}) = sup(head(n)); then every transaction containing Y = head(n) also contains x. So any frequent itemset Z which contains Y but does not contain x has the frequent superset Z ∪ {x}. Since we only want to get the MFIs, it is not necessary to count itemsets which contain Y but do not contain x. Therefore, we can move item x from fre_tail(n) to head(n). From the experimental results we find that PEP can greatly reduce the number of FP-trees created, compared to FPmax*.

3.2 Superset Checking

As discussed before, superset checking is a main operation in lookaheads pruning. This is because each new MFI needs to be checked before being added to the set of MFIs. MaxMiner needs to scan all the discovered MFIs and tries to match item by item against each discovered MFI. Though GenMax uses the LMFI to store all the relevant MFIs, it also needs to match item by item. As for FPmax*, it only needs to match fre_tail(n) item by item against the conditional pattern bases of head(n) in the MFI-tree. The simple but fast superset checking method of FIMfi is based on the following lemmas:

Lemma 3: If there is a conditional pattern base of head(n) in the MFI-tree whose length is equal to the length of fre_tail(n), then head(n) ∪ fre_tail(n) is frequent.

Proof: Let S be the itemset represented by the base; then head(n) ∪ S is frequent. And for each item x in S, sup(head(n) ∪ {x}) ≥ min_sup, so x ∈ fre_tail(n) and hence S ⊆ fre_tail(n). For bases of the same length, S = fre_tail(n). Hence, we obtain the lemma.

Lemma 4: If there is a conditional pattern base of head(n) in the MFI-tree whose length is equal to the length of con_tail(n), then head(n) ∪ con_tail(n), and hence head(n) ∪ fre_tail(n), is frequent.

Proof: Let S be the itemset represented by the base; then head(n) ∪ S is frequent. Since con_tail(n) includes all possible extensions of head(n), S ⊆ con_tail(n). For bases of the same length, S = con_tail(n). Hence, we obtain the lemma.

Lemma 5: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the counter associated with the last item of y is no less than min_sup, and the length of y is equal to the length of fre_tail(n), then head(n) ∪ fre_tail(n) is frequent.

Proof: Similar to Lemma 3.

Lemma 6: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the counter associated with the last item of y is no less than min_sup, and the length of y is equal to the length of con_tail(n), then head(n) ∪ con_tail(n), and hence head(n) ∪ fre_tail(n), is frequent.

Proof: Similar to Lemma 4.

According to Lemma 3 and Lemma 4, the superset checking need not match item by item; it can be done just by checking the lengths of itemsets. Here the level of the last item in the base can be used as the length of the base. For more efficient length checking, the only change FIMfi makes to the MFI-tree is to store the node links of items in the header table in decreasing order of the bases' levels. Now the superset checking is very simple, for it only needs to compare the lengths of two itemsets. Similarly, the superset checking based on the FP-tree can also be made simple, according to Lemma 5 and Lemma 6. In this situation, we add a level to each node of the FP-tree, the level representing the length of the path from the node in question to the root. And the node links whose counts are no less than min_sup are stored in decreasing order of level. An example is shown in Figure 2(d). Therefore, this superset checking is also simple, for it needs only to compare the lengths of two itemsets. Let us revisit the example in Section 3.1: when doing the superset checking for {b} ∪ {a,c}, we need only compare the length of the conditional pattern base a:3, c:3 to the length of the itemset {a,c}.
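A minimal sketch of the length-based check implied by Lemmas 3-6: with node levels stored and node links kept in decreasing order of level (and, for the FP-tree, restricted to links whose count reaches min_sup), the check reduces to comparing two integers. The data-structure details below are assumptions for illustration.

```python
def superset_check_by_length(node_links, target_len, min_sup=None):
    """Return True if some conditional pattern base of head(n) is at least as
    long as the target length (|fre_tail(n)| or |con_tail(n)|).

    `node_links` is the header-table link list for the last item of head(n),
    assumed sorted in decreasing order of `level`; for an FP-tree the caller
    passes min_sup so only sufficiently supported bases are considered.
    Each link is a dict like {"level": int, "count": int}."""
    for link in node_links:
        if min_sup is not None and link["count"] < min_sup:
            continue
        # Links are in decreasing order of level, so the first admissible one
        # carries the longest base; a single integer comparison decides.
        return link["level"] >= target_len
    return False

# Revisiting the example above: the base a:3, c:3 has level 2 and count 3,
# and |fre_tail({b})| = |{a, c}| = 2, so {b} ∪ {a, c} is judged frequent.
links_for_b = [{"level": 2, "count": 3}]
print(superset_check_by_length(links_for_b, target_len=2, min_sup=3))  # True
```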

3.3 Item Ordering Policy

The item ordering policy first appeared in [5] and is used by almost all the following MFI algorithms, for it can increase the effectiveness of superset frequency pruning. As we know, items with higher frequency are more likely to be members of long frequent itemsets and subsets of some discovered MFIs. For node n, after fre_tail(n) is generated and before extending to the children, the traditional scheme sorts the items in the tail in decreasing order of support. This makes the most frequent items appear in more itemsets that are frequent extensions of some of node n's offspring. Therefore, there will be more such pruned offspring nodes. In general, this type of item ordering policy works better for lookaheads implemented by scanning the database to count the support of head(n) ∪ fre_tail(n), as in breadth-first algorithms such as MaxMiner. All the recently proposed depth-first algorithms do superset checking instead to implement the lookaheads pruning, for counting the support of head(n) ∪ fre_tail(n) is costly in a depth-first policy.

Since the superset checking of FIMfi is based on the MFI-tree and/or the FP-tree, we try to find an item ordering policy that makes use of the information in the MFI-tree and/or FP-tree. As we know, if S is a subset of the tail and head(n) ∪ S is frequent, then we can prune the nodes identified by the subsets of head(n) ∪ S; this is because the itemsets corresponding to those nodes and their offspring are not maximal (Lemma 1). Based on the FP-tree and MFI-tree, when a policy can let S be a maximal such subset of fre_tail(n), we can achieve maximal pruning at the node in question. Suppose there are two itemsets S1 and S2: S1 is represented by the conditional pattern base whose length is maximal in the MFI-tree, and S2 is represented by the conditional pattern base whose length is maximal among the conditional pattern bases in the FP-tree whose last item's count is no less than min_sup. Let S be the longer of S1 and S2; we put the items of S at the head of fre_tail(n), and then we can attain the maximal pruning.

For example, when considering the node n identified by {e}, we obtain such an S from Figure 2(b); the sorted items in fre_tail(n) are then in the sequence a, c, b, while the old decreasing order of supports is b, a, c. Using the old decreasing-order policy, we have to build FP-trees for the nodes {e}, {e,a}, and {e,c}, but FIMfi with the new ordering policy only needs to build FP-trees for the nodes {e} and {e,a}. Similarly, when considering the node {d}, we know fre_tail(n) = {a,c,b}, as in Figure 3(b), and the sorted items in fre_tail(n) are again in the sequence a, c, b, while the old decreasing order of supports is b, a, c. The experimental results for these two policies are compared in Section 4. Furthermore, the items in fre_tail(n) − S are also sorted in decreasing order of support.
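A sketch of the ordering step: the items of the longer of the two longest qualifying conditional pattern bases are moved to the head of fre_tail(n), and the remaining items keep the usual decreasing-support order. The representation of the bases and the concrete support values below are assumptions for illustration.

```python
def order_tail(fre_tail, support, longest_mfi_base, longest_fp_base):
    """Order fre_tail(n) so that S, the longer of the two longest qualifying
    conditional pattern bases, comes first; the rest follow in decreasing
    order of support (the traditional policy).
    `support` maps item -> sup(head(n) ∪ {item})."""
    s = max((longest_mfi_base, longest_fp_base), key=len)
    s = [i for i in s if i in fre_tail]                  # keep tail items only
    rest = sorted((i for i in fre_tail if i not in s),
                  key=lambda i: support[i], reverse=True)
    return s + rest

# The {e} example above: the new policy yields a, c, b instead of the
# plain support order b, a, c, so fewer child FP-trees need to be built.
print(order_tail(["a", "b", "c"],
                 support={"a": 3, "b": 4, "c": 2},       # illustrative values
                 longest_mfi_base=[],                    # no discovered MFI yet
                 longest_fp_base=["a", "c"]))            # ['a', 'c', 'b']
```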

3.4 Optimizations

FIMfi uses the same array technique for counting frequencies as FPmax*, but FIMfi does not count the whole triangular array as FPmax* does. Suppose that at node n the sorted fre_tail(n) is {i1, i2, ..., im} and head(n) ∪ {i1, ..., il} is frequent. When we extend the nodes corresponding to the items i1, ..., il, the superset checking will return true and those nodes will be pruned. So, for the 2-itemsets that are subsets of {i1, ..., il}, the corresponding cells will not be used any more. Therefore, FIMfi does not count those cells when building the array. In this way, FIMfi costs less than FPmax* does when counting the array, and it is obvious that the bigger l is, the more counting time is saved. We also use the memory management described in [10] to reduce the time consumed in allocating and deallocating space for FP-trees and MFI-trees.

Fig. 4. Pseudo-code of Algorithm FIMfi

3.5 FIMfi

Based on Sections 3.1-3.4, we show the pseudo-code of FIMfi in Figure 4. In each recursive call, each newly found MFI may be used in superset checking for ancestor nodes of the current node, so we use a parameter called M-trees to access the MFI-trees of the ancestor nodes. When the top-level call is over, all the MFIs to be mined are stored in the MFI-tree of the root of the search space tree.

From line (4) to line (6), FIMfi does superset frequency pruning for the itemset con_tail(n'). When x is the last item of the header, there is no need to do the pruning, for the pruning has already been done by the procedure that called the current one, at line (12) and/or line (22). Lines (7) to (10) use the optimized array technique. The PEP technique is used in line (11) and line (19). The superset frequency pruning for the itemset fre_tail(n') is done in lines (12) to (18); when the condition at line (17) is true, all the children nodes of n' are pruned and fre_tail(n') need not be inserted into n.MFI-tree any more. Line (20) uses our novel item ordering policy. Line (21) builds a new FP-tree, n'.FP-tree. Lines (22) to (25) do another superset frequency pruning for fre_tail(n') in that tree. The return statements in lines (4), (6), (13), (17) and (24) mean that all the children nodes after n' of n are pruned there. And the continue statements in lines (14), (18) and (25) indicate that node n' is pruned, so we can go on to consider the next child of n. After the construction of n'.FP-tree and n'.MFI-tree and the updating of M-trees, FIMfi is called recursively with the new node n' and the new M-trees. Note that the algorithm does not employ the single path trimming used in FPmax* and AFOPT. If, by constructing n'.FP-tree, we find that n'.FP-tree has only a single path, the superset checking at line (20) will return true, and there will be a superset frequency pruning instead of a single path trimming.

4 Experimental Evaluations

At the first Workshop on Frequent Itemset Mining Implementations (FIMI'03) [11], which took place at ICDM'03 (the Third IEEE International Conference on Data Mining), several recently presented algorithms that are good at mining MFIs, such as AFOPT, Mafia, FPmax*, etc., were reported; we now present performance comparisons of our FIMfi with them. All the experiments were conducted on a 2.4 GHz Pentium IV with 1024 MB of DDR memory running Microsoft Windows 2000 Professional. The codes of the other four algorithms were downloaded from [12], and the codes of all five algorithms were compiled using Microsoft Visual C++ 6.0. Due to the lack of space, only the results for three real dense datasets and one real sparse dataset are shown here. The datasets we used were selected from the 11 real datasets of FIMI'03 [12]; they are BMS-WebView-2 (sparse), Connect, Mushroom, and Pumsb_star, and their data characteristics can be found in [11].

4.1 Comparison of FP-Trees' Number

The item ordering policy and the PEP technique are the main improvements of FIMfi. To test their effect on pruning, we built two sub-algorithms, FIMfi-order and FIMfi-pep. Compared with FIMfi, FIMfi-order just does not use PEP for pruning, and FIMfi-pep discards our novel item ordering policy along with the optimized array technique. We take FPmax* as the benchmark algorithm, because it is also an MFI mining algorithm based on FP-tree and MFI-tree, and it does MFI mining best for almost all the datasets in FIMI'03 [11]. The numbers of FP-trees created by the four algorithms are shown in Figure 5. On the datasets Mushroom, Connect and Pumsb_star, FIMfi-order and FIMfi-pep both generate less than half the number of FP-trees that FPmax* does. The combination of the ordering policy and PEP in FIMfi creates the smallest number of FP-trees among the four algorithms. In fact, at the lowest support on Mushroom, FPmax* creates more than 3 times as many FP-trees as FIMfi does. Note that Figure 5 contains no result for BMS-WebView-2; this is because all four algorithms generate only one tree for BMS-WebView-2, so we omit it.

Fig. 5. Comparison of FP-trees’ Number

4.2 Performance Comparisons

The performance comparisons of FIMfi, FPmax*, AFOPT and Mafia on the sparse dataset BMS-WebView-2 are shown in Figure 6. FIMfi is faster than AFOPT at the higher supports (no less than 50%), while FPmax* is always defeated by AFOPT at not only lower but also higher supports. FIMfi outperforms FPmax* by about 20% to 40% at all supports, and Mafia by more than 20 times.


Fig. 6. Performance on Sparse Datasets

Figure 7 gives the results of comparing the four algorithms on dense data. For all supports on the dense datasets, FIMfi has the best performance. FIMfi runs around 40% to 60% faster than the benchmark on all of the dense datasets. AFOPT is the slowest algorithm on Mushroom and Pumsb_star and runs 2 to 10 times slower than FIMfi on all of the datasets across all supports. Mafia is the slowest algorithm on Connect; it runs between 2 and 5 times slower than FIMfi on Mushroom and Connect across all supports. On Pumsb_star, Mafia is outperformed by FIMfi at all supports, though it outperforms the benchmark at lower supports.

Fig. 7. Performance on Dense Datasets


5 Conclusions

Different from the traditional item ordering policy, in which the items are sorted in decreasing order of support, this paper introduces a novel item ordering policy based on FP-tree and MFI-tree. The policy guarantees maximal pruning at each node of the search space tree and therefore greatly reduces the number of FP-trees created. The experimental comparison of the number of FP-trees reveals that FIMfi generates less than half the number of FP-trees that the traditional policy does on dense datasets. We have also found a simple method for fast superset checking. The method reduces superset checking to checking the equality of two integers and therefore makes superset checking cheaper. Several old and new pruning techniques are integrated into FIMfi. Among the new ones, the superset frequency pruning based on FP-tree is introduced for the first time and makes the cutting of the search space more efficient. The PEP technique used in FIMfi greatly reduces the number of FP-trees created compared with the benchmark, as shown by the experimental results in Section 4.1. In FIMfi we also present a new optimization of the array technique and use memory management to further reduce the run time. Our experimental results demonstrate that FIMfi is better optimized for mining MFIs: it outperforms the benchmark by 40% on average, and on dense data it outperforms AFOPT and Mafia by more than 2 times and up to 20 times.

Acknowledgements

We would like to thank Jianfei Zhu for providing the executable and code of FPMax before the download website became available. We also thank Guimei Liu for providing the code of AFOPT and Doug Burdick for providing the website for downloading the code of Mafia.

References 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994. 2. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’00), Dallas, TX, May 2000. 3. L. Rigoutsos and A. Floratos: Combinatorial pattern discovery in biological sequences: The Teiresias algorithm.Bioinformatics 14,1 (1998), 55-67. 4. Guimei Liu, Hongjun Lu, Jeffrey Xu Yu, Wei Wang and Xiangye Xiao. AFOPT: An Efficient Implementation of Pattern Growth Approach. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003. 5. Roberto Bayardo. Efficiently mining long patterns from databases. In ACM SIGMOD Conference, 1998.


6. R. Agarwal, C. Aggarwal and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 2001. 7. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A Performance Study of Mining Maximal Frequent Itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003. 8. M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed association rule mining. TR 99-10, CS Dept., RPI, Oct. 1999. 9. K. Gouda and M. J. Zaki. Efficiently Mining Maximal Frequent Itemsets. Proc. of the IEEE Int. Conference on Data Mining, San Jose, 2001. 10. Gosta Grahne and Jianfei Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003. 11. Bart Goethals and M. J. Zaki. FIMI’03: Workshop on Frequent Itemset Mining Implementations. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003. 12. Codes and datasets available at http://fimi.cs.helsinki.fi/.

Multi-phase Process Mining: Building Instance Graphs

B.F. van Dongen and W.M.P. van der Aalst

Department of Technology Management, Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands
{b.f.v.dongen,w.m.p.v.d.aalst}@tm.tue.nl

Abstract. Deploying process-driven information systems is a time-consuming and error-prone task. Process mining attempts to improve this by automatically generating a process model from event-based data. Existing techniques try to generate a complete process model from the data acquired. However, unless this model is the ultimate goal of mining, such a model is not always required. Instead, a good visualization of each individual process instance can be enough. From these individual instances, an overall model can then be generated if required. In this paper, we present an approach which constructs an instance graph for each individual process instance, based on information in the entire data set. The results are represented in terms of Event-driven Process Chains (EPCs). This representation is used to connect our process mining to a widely used commercial tool for the visualization and analysis of instance EPCs. Keywords: Process mining, Event-driven process chains, Workflow management, Business Process Management.

1 Introduction

Increasingly, process-driven information systems are used to support operational business processes. Some of these information systems enforce a particular way of working. For example, Workflow Management Systems (WFMSs) can be used to force users to execute tasks in a predefined order. However, in many cases systems allow for more flexibility. For example, transactional systems such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management) and SCM (Supply Chain Management) are known to allow the users to deviate from the process specified by the system, e.g., in the context of SAP R/3 the reference models, expressed in terms of Event-driven Process Chains (EPCs, cf. [13,14,19]), are only used to guide users rather than to enforce a particular way of working. Operational flexibility typically leads to difficulties with respect to performance measurements. The ability to do these measurements, however, is what made companies decide to use a transactional system in the first place. To be able to calculate basic performance characteristics, most systems have their own built-in module. For the calculation of basic characteristics such as the average flow time of a case, no model of the process is required. However, for more complicated characteristics, such as the average time it takes to transfer work
from one person to the other, some notion of causality between tasks is required. This notion of causality is provided by the original model of the process, but deviations in execution can interfere with causalities specified there. Therefore, in this paper, we present a way of defining certain causal relations in a transactional system. We do so without using the process definition from the system, but only looking at a so called process log. Such a process log contains information about the processes as they actually take place in a transactional system. Most systems can provide this information in some form and the techniques used to infer relations between tasks in such a log is called process mining. The problem tackled in this paper has been inspired by the software package ARIS PPM (Process Performance Monitor) [12] developed by IDS Scheer. ARIS PPM allows for the visualization, aggregation, and analysis of process instances expressed in terms of instance EPCs (i-EPCs). An instance EPC describes the the control-flow of a case, i.e., a single process instance. Unlike a trace (i.e., a sequence of events) an instance EPC provides a graphical representation describing the causal relations. In case of parallelism, there may be different traces having the same instance EPC. Note that in the presence of parallelism, two subsequent events do not have to be causally related. ARIS PPM exploits the advantages of having instance EPCs rather than traces to provide additional management information, i.e., instances can be visualized and aggregated in various ways. In order to do this, IDS Scheer has developed a number of adapters, e.g., there is an adapter to extract instance EPCs from SAP R/3. Unfortunately, these adapters can only create instance EPCs if the actual process is known. For example, the workflow management system Staffware can be used to export Staffware audit trails to ARIS PPM (Staffware SPM, cf. [20]) by taking projections of the Staffware process model. As a result, it is very time consuming to build adapters. Moreover, the approaches used only work in environments where there are explicit process models available. In this paper, we do not focus on the visualization, aggregation, and analysis of process instances expressed in terms of instance EPC or some other notation capturing parallelism and causality. Instead we focus on the construction of instance graphs. An instance graph can be seen as an abstraction of the instance EPCs used by ARIS PPM. In fact, we will show a mapping of instance graphs onto instance EPCs. Instance graphs also correspond to a specific class of Petri nets known as marked graphs [17], T-systems [9] or partially ordered runs [8,10]. Tools like VIPTool allow for the construction of partially ordered runs given an ordinary Petri net and then use these instance graphs for analysis purposes. In our approach we do not construct instance graphs from a known Petri net but from an event log. This enhances the applicability of commercial tools such as ARIS PPM and the theoretical results presented in [8,10]. The mapping from instance graphs to these Petri nets is not given here. However, it will become clear that such a mapping is trivial. In the remainder of this paper, we will first describe a common format to store process logs in. Then, in Section 3 we will give an algorithm to infer causality at an instance level, i.e. a model is built for each individual case. In Section 4 we will provide a translation of these models to EPCs. Section 5 shows a concrete


example and demonstrates the link to ARIS PPM. Section 6 discusses related work followed by some concluding remarks.

2 Preliminaries

This section contains most definitions used in the process of mining for instance graphs. The structure of this section is as follows. Subsection 2.1 defines a process log in a standard format. Subsection 2.2 defines the model for one instance.

2.1 Process Logs

Information systems typically log all kinds of events. Unfortunately, most systems use a specific format. Therefore, we propose an XML format for storing event logs. The basic assumption is that the log contains information about specific tasks executed for specific cases (i.e., process instances). Note that unlike ARIS PPM we do not assume any knowledge of the underlying process. Experience with several software products (e.g., Staffware, InConcert, MQSeries Workflow, FLOWer, etc.) and organization-specific systems (e.g., Rijkswaterstaat, CJIB, and several hospitals) show that these assumptions are justified. Figure 1 shows the schema definition of the XML format. This format is supported by our tools, and mappings from several commercial systems are available. The format allows for logging multiple processes in one XML file (cf. element “Process”). Within each process there may be multiple process instances (cf. element “ProcessInstance”). Each “ProcessInstance” element is composed of “AuditTrailEntry” elements. Instead of “AuditTrailEntry” we will also use the terms “log entry” or “event”. An “AuditTrailEntry” element corresponds to a single event and refers to a “WorkflowModelElement” and an “EventType”. A “WorkflowModelElement” may refer to a single task or a subprocess. The “EventType” is used to indicate the type of event. Typical events are: “schedule” (i.e., a task becomes enabled for a specific instance), “assign” (i.e., a task

Fig. 1. XML schema for process logs.


instance is assigned to a user), “start” (the beginning of a task instance), “ complete” (the completion of a task instance). In total, we identify 12 events. When building an adapter for a specific system, the system-specific events are mapped on these 12 generic events. As Figure 1 shows the “WorkflowModelElement” and “EventType” are mandatory for each “AuditTrailEntry”. There are three optional elements “Data”, “Timestamp”, and “Originator”. The “Data” element can be used to store data related to the event of the case (e.g., the amount of money involved in the transaction). The “Timestamp” element is important for calculating performance metrics like flow time, service times, service levels, utilization, etc. The “Originator” refers to the actor (i.e., user or organization) performing the event. The latter is useful for analyzing organizational and social aspects. Although each element is vital for the practical applicability of process mining, we focus on the “WorkflowModelElement” element. In other words, we abstract from the “EventType”, “Data”, “Timestamp”, and “Originator” elements. However, our approach can easily be extended to incorporate these aspects. In fact, our tools deal with these additional elements. However, for the sake of readability, in this paper events are identified by the task and case (i.e., process instance) involved.
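
To make the format concrete, the sketch below builds a tiny log in the spirit of the schema in Figure 1 and reduces it to one task sequence per case, reproducing the SAB/SBA example of Table 1. The element names follow the text (Process, ProcessInstance, AuditTrailEntry, WorkflowModelElement, EventType); the exact attribute layout of the real schema may differ, so treat this as an illustrative assumption.

```python
# Illustrative sketch only: a miniature process log in the spirit of the
# XML format described above, reduced to one task sequence per case.
import xml.etree.ElementTree as ET

LOG = """
<WorkflowLog>
  <Process id="example">
    <ProcessInstance id="case1">
      <AuditTrailEntry>
        <WorkflowModelElement>S</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>B</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="case2">
      <AuditTrailEntry>
        <WorkflowModelElement>S</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>B</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement><EventType>complete</EventType>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
"""

def instances(xml_text):
    """Map each ProcessInstance id to its sequence of task names."""
    root = ET.fromstring(xml_text)
    cases = {}
    for pi in root.iter("ProcessInstance"):
        cases[pi.get("id")] = [
            ate.findtext("WorkflowModelElement").strip()
            for ate in pi.iter("AuditTrailEntry")
        ]
    return cases

print(instances(LOG))  # {'case1': ['S', 'A', 'B'], 'case2': ['S', 'B', 'A']}
```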

Table 1 shows an example of a small log after abstracting from all elements except for the “WorkflowModelElement” element (i.e., task identifier). The log shows two cases. For each case three tasks are executed. Case 1 can be described by the sequence SAB and case 2 can be described by the sequence SBA. In the remainder we will describe process instances as sequences of tasks where each element in the sequence refers to a “WorkflowModelElement” element. A process log is represented as a bag (i.e., multiset) of process instances.

Definition 2.1. (Process Instance, Process Log) Let T be a set of log entries, i.e., references to tasks. Let T+ denote the set of sequences of log entries with length at least 1. We call σ ∈ T+ a process instance (i.e., case) and W ∈ bag(T+) a process log.

If σ ∈ T+ is a process instance of length n, then each element σ_i (with 1 ≤ i ≤ n) corresponds to an “AuditTrailEntry” element in Figure 1. However, since we abstract from timestamps, event types, etc., one can think of σ_i as a reference to a task. |σ| denotes the length of the process instance and σ_i its i-th element. We assume process instances to be of finite length. bag(T+) denotes a bag, i.e., a multiset of process instances. W(σ) is the number of times a process instance of the form σ appears in the log. The total number of instances in a bag is finite. Since W is a bag, we use the normal set operators where convenient. For example, we use σ ∈ W as a shorthand notation for W(σ) ≥ 1.

2.2 Instance Nets

After defining a process log, we now define an instance net. An instance net is a model of one instance. Since we are dealing with an instance that has been executed in the past, it makes sense to define an instance net in such a way that no choices have to be made. As a consequence of this, no loops will appear in an instance net. For readers familiar with Petri nets it is easy to see that instance nets correspond to “runs” (also referred to as occurrence nets) [8]. Since events that appear multiple times in a process instance have to be duplicated in an instance net, we define an instance domain. The instance domain will be used as a basis for generating instance nets. Definition 2.2. (Instance domain) Let i.e., We define

be a process instance such that as the domain of

Using the domain of an instance, we can link each log entry in the process instance to a specific task, i.e., can be used to represent the element in In an instance net, the instance is extended with some ordering relation to reflect some causal relation. Definition 2.3. (Instance net) Let instance. Let be the domain of and let is irreflexive, asymmetric and acyclic,

such that is a process be an ordering on such that:

where relation satisfying:

if and only if

is the smallest

or

We call N an instance net. The definition of an instance net given here is rather flexible, since it is defined only as a set of entries from the log and an ordering on that set. An important feature of this ordering is that if then there is no set such that Since the set of entries is given as a log, and an instance mapping can be inferred for each instance based on textual properties, we only need to define the ordering relation based on the given log. In Section 3.1 it is shown how this can be done. In Section 4 we show how to translate an instance net to a model in a particular language (i.e., instance EPCs).

3 Mining Instance Graphs

As seen in Definition 2.3, an instance net consists of two parts. First, it requires a sequence of events as they appear in a specific instance. Second, an ordering on the domain of is required. In this section, we will provide a method


that infers such an ordering relation on T using the whole log. Furthermore, we will present an algorithm to generate instance graphs from these instance nets.

3.1 Creating Instance Nets

Definition 3.1. (Causal ordering) Let W be a process log over a set of log entries T, i.e., Let and be two log entries. We define a causal ordering on W in the following way: if and only if there is an instance and such that and and if and only if there is an instance and such that and and and and not if and only if and or or or The basis of the causal ordering defined here, is that two tasks A and B have a causal relation if in some process instance, A is directly followed by B and B is never directly followed by A. However, this can lead to problems if the two tasks are in a loop of length two. Therefore, also holds if there is a process instance containing ABA or BAB and A nor B can directly succeed themselves. If A directly succeeds itself, then For the example log presented in Table 1, T = {S,A,B} and causal ordering inferred on T is composed of the following two elements and By defining the relation, we defined an ordering relation on T. This relation is not necessarily irreflexive, asymmetric, nor acyclic. This relation however can be used to induce an ordering on the domain of any instance that has these properties. This is done in two steps. First, an asymmetric order is defined on the domain of some Then, we prove that this relation is irreflexive and acyclic. Definition 3.2. (Instance ordering) Let W be a process log over T and let be a process instance. Furthermore, let be a causal ordering on T. We define an ordering on the domain of in the following way. For all such that we define if and only if and or The essence of the relation defined here is in the final part. For each entry within an instance, we find the closest causal predecessor and the closest causal successor. If there is no causal predecessor or successor then the entry is in parallel with all its predecessors or successors respectively. It is trivial to see that this can always be done for any process instance and with any causal relation. In the example log presented in Table 1 there are two process instances, case 1 and case 2. From here on, we will refer to case 1 as and to case 2 as We know that and that Using the causal relation the relation is inferred such that and For this also applies. It is easily seen that the ordering relation is indeed irreflexive and asymmetric, since it is only defined on and for which Therefore, it can easily be concluded that it is irreflexive and acyclic. Furthermore, the third property holds as well. Therefore we can now define an instance net as
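
The causal ordering just described can be computed with a few lines of code. The sketch below follows the prose reading of Definition 3.1: a task a is causally followed by b when a is directly followed by b somewhere in the log and b is never directly followed by a, with the extra rule for length-two loops (ABA or BAB, with neither task directly succeeding itself) and for tasks that directly succeed themselves. It is an illustration under these assumptions, not the authors' implementation.

```python
# Simplified reading of the causal ordering of Definition 3.1 (sketch only).
from itertools import product

def directly_follows(log):
    """Pairs (a, b) such that a is directly followed by b in some instance."""
    return {(s[i], s[i + 1]) for s in log for i in range(len(s) - 1)}

def length_two_loop(log, a, b):
    """True if some instance contains the pattern a b a."""
    return any(
        s[i] == a and s[i + 1] == b and s[i + 2] == a
        for s in log for i in range(len(s) - 2)
    )

def causal_ordering(log):
    follows = directly_follows(log)
    tasks = {t for s in log for t in s}
    causal = set()
    for a, b in product(tasks, repeat=2):
        if a == b:
            if (a, a) in follows:          # a task that directly succeeds itself
                causal.add((a, a))
        elif (a, b) in follows and (b, a) not in follows:
            causal.add((a, b))
        elif (a, b) in follows and (a, a) not in follows and (b, b) not in follows \
                and (length_two_loop(log, a, b) or length_two_loop(log, b, a)):
            causal.add((a, b))             # length-two loop exception
    return causal

# The log of Table 1: case 1 = SAB, case 2 = SBA.
log = [("S", "A", "B"), ("S", "B", "A")]
print(sorted(causal_ordering(log)))        # [('S', 'A'), ('S', 'B')]
```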

3.2 Creating Instance Graphs

In this section, we present an algorithm to generate an instance graph from an instance net. An instance graph is a graph where each node represents one log entry of a specific instance. These instance graphs can be used as a basis to generate models in a particular language. Definition 3.3. (Instance graph) Consider a set of nodes N and a set of edges We call an instance graph of an instance net if and only if the following conditions hold. is the set of nodes. 1. 2. The set of edges E is defined as

where

An instance graph as described in Definition 3.3 is a graph that typically describes an execution path of some process model. This property is what makes an instance graph a good description of an instance. It not only shows causal relations between tasks but also parallelism if parallel branches are taken by the instance. However, choices are not represented in an instance graph. The reason for that is obvious, since choices are made at the execution level and do not appear in an instance. With respect to these choices, we can also say that if the same choices are made at execution, the resulting instance graph is the same. Note, that the fact that the same choices are made does not imply that the process instance is the same. Tasks that can be done in parallel within one instance can appear in any order in an instance without changing the resulting instance graph. For case 1 of the example log of Table 1 the instance graph is drawn in Figure 2. Note that in this graph, the nodes 1,2 and 3 are actually in the domain of and therefore, they refer to entries in Table 1. It is easily seen that for case 2 this graph looks exactly the same, although the nodes refer to different entries. In order to make use of instance graphs, we will show that an instance graph indeed describes an instance such that an entry in the log can only appear if all predecessors of that entry in the graph have already appeared in the instance.
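
One possible way of deriving the edges of an instance graph from an instance and the causal relation is sketched below. It follows our reading of Definitions 3.2 and 3.3: positions i and j of the instance are connected when σ_i is causally followed by σ_j and no causally intermediate position lies between them, and an artificial source node 0 and sink node |σ|+1 are attached as in Property 3.5. The exact formalisation in the paper may differ in details, so this is a sketch under stated assumptions.

```python
# Sketch: build an instance graph from one instance and the causal relation.
# Our reading of Definitions 3.2/3.3: node i gets an edge from position j < i
# when sigma[j] -> sigma[i] holds and no position k between them satisfies
# sigma[j] -> sigma[k] and sigma[k] -> sigma[i]; positions without
# predecessors/successors attach to the source 0 / sink n+1.

def instance_graph(sigma, causal):
    n = len(sigma)
    edges = set()
    for i in range(n):
        for j in range(i):
            if (sigma[j], sigma[i]) in causal and not any(
                (sigma[j], sigma[k]) in causal and (sigma[k], sigma[i]) in causal
                for k in range(j + 1, i)
            ):
                edges.add((j + 1, i + 1))            # 1-based positions, as in Figure 2
    nodes = set(range(1, n + 1))
    has_pred = {b for _, b in edges}
    has_succ = {a for a, _ in edges}
    edges |= {(0, v) for v in nodes - has_pred}      # artificial source node 0
    edges |= {(v, n + 1) for v in nodes - has_succ}  # artificial sink node n+1
    return nodes | {0, n + 1}, edges

# Case 1 of Table 1 with the causal relation derived above.
nodes, edges = instance_graph(("S", "A", "B"), {("S", "A"), ("S", "B")})
print(sorted(edges))  # [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)]
```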

Fig. 2. Instance graph for


Definition 3.4. (Pre- and postset) Let be an instance graph and let We define to be the preset of such that We define to be the postset of such that Property 3.5. (Instance graphs describe an instance) Every instance graph of some process instance describes that instance in such a way that for all holds that for all implies that This ensures that every entry in process entry occurs only after all predecessors in the instance graph have occurred in Proof. To prove that this is indeed the case for instance graph we consider Definition 3.3 which implies that for “internal nodes” we know that if and only if Furthermore, from the definition of we know that implies that For the source and sink nodes, it is also easy to show that implies that because 0 is the smallest element of N while is the largest. Property

3.6. (Strongly connectedness) For every instance graph of some process instance holds that the short circuited graph is strongly connected.1

Proof. From Definition 3.3 we know that for all such that there does not exist a such that holds that Furthermore, we know that for all such that there does not exist a such that holds that Therefore, the graph is strongly connected if the edge is added to E. In the remainder of this paper, we will focus on an application of instance graphs. In Section 4 a translation from these instance graphs to a specific model are given.

4 Instance EPCs

In Section 3 instance graphs were introduced. In this section, we will present an algorithm to generate instance EPCs from these graphs. An instance EPC is a special case of an EPC (Event-driven Process Chain, [13]). For more information on EPCs we refer to [13,14,19]. These instance EPCs (or i-EPCs) can only contain AND-split and AND-join connectors, and therefore do not allow for loops to be present. These i-EPCs serve as a basis for the tool ARIS PPM (Process Performance Monitor) described in the introduction. In this section, we first provide a formal definition of an instance EPC. An instance EPC does not contain any connectors other than AND-split and ANDjoins connectors. Furthermore, there is exactly one initial event and one final event. Functions refer to the entries that appear in a process log, events however do not appear in the log. Therefore, we make the assumption here that each 1

A graph is strongly connected if there is a directed path from any node to any other node in the graph.


event uniquely causes a function to happen and that functions result in one or more events. An exception to this assumption is made when there are multiple functions that are the start of the instance. These functions are all preceded by an AND-split connector. This connector is preceded by the initial event. Consequently, all other connectors are preceded by functions and succeeded by events. Definition 4.1. (Instance EPC) Consider a set of events E, a set of functions F, a set of connectors C and a set of arcs We call (E, F, C, A) an instance EPC if and only if the following conditions hold. 1. 2. Functions and events alternate in the presence of connectors:

where is acyclic. 3. The graph 4. There exists exactly one event such that there is no element such that We call the initial event. such that there is no element 5. There exists exactly one event such that We call the final event. 6. The graph is strongly connected. there are exactly two elements such 7. For each function that and Functions only have one input and one output. 8. For each event there are exactly two elements such that and Events only have one input and one output, except for the initial and the final event. For them the following holds. For there is exactly one element such that and for there is exactly one element such that

4.1 Generating Instance EPCs

Using the formal definition of an instance EPC from Definition 4.1, we introduce an algorithm that produces an instance EPC from an instance graph as defined in Definition 3.3. In the instance EPC generated it makes sense to label the functions according to the combination of the task name and event type as they appear in the log. The labels of the events however cannot be determined from the log. Therefore, we propose to label the events in the following way. The initial event will be labeled “initial”. The final event will be labeled “final”. All other events will be labeled in such a way that it is clear which function succeeds it. Connectors are labeled in such a way that it is clear whether it is a split or a join connector and to which function or event it connects with the input or output respectively. Definition 4.2. (Converting instance graphs to EPCs) Let W be a process log and let be an instance graph for some process instance


To create an instance EPC, we need to define the four sets E, F, C and A. The set of functions F is defined as In other words, for every entry in the process instance, a function is defined. The set of events E is defined as and In other words, for every function there is an event preceding it, unless it is a minimal element with respect to Furthermore, there is an initial event and a final event The set of connectors C is defined as where

Here, the connectors are constructed in such a way that connectors are always preceded by a function, except in case the process starts with parallel functions, since then the event is succeeded by a split connector. The set of arcs A is defined as where

It is easily seen that the instance EPC generated by Definition 4.2 is indeed an instance EPC, by verifying the result against Definition 4.1. In definitions 3.3 and 4.1 we have given an algorithm to generate an instance EPC for each instance graph. The result of this algorithm for both cases in the example of Table 1 can be found in Figure 3. In Section 5 we will show the practical use of this algorithm to ARIS PPM.
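
A simplified rendering of this conversion is sketched below. It creates one function per graph node, one event per function (plus the initial and final events), and AND-split connectors where a node has several successors; AND-join connectors are left implicit to keep the sketch short. The function name to_instance_epc and the labelling scheme are made up for illustration and do not follow Definition 4.2 in every detail.

```python
# Simplified sketch of Definition 4.2: derive EPC elements from an instance
# graph. Node 0 is the source, the largest node the sink; every other node
# becomes a function preceded by an event, and AND-splits are inserted where
# a node has more than one successor.

def to_instance_epc(nodes, edges, labels):
    source, sink = 0, max(nodes)
    functions = {i: "f_" + labels[i] for i in nodes if i not in (source, sink)}
    events = {i: "e_before_" + labels[i] for i in functions}
    events[source], events[sink] = "initial", "final"
    arcs = [(events[i], functions[i]) for i in functions]   # event -> function
    connectors = []
    for i in sorted(nodes):
        successors = sorted(j for (a, j) in edges if a == i)
        src = functions.get(i, events[i])
        if len(successors) > 1:                             # AND-split needed
            split = "AND_split_%d" % i
            connectors.append(split)
            arcs.append((src, split))
            src = split
        arcs.extend((src, events[j]) for j in successors)
    # AND-joins (several arcs entering one event) are omitted in this sketch;
    # a complete generator would insert them symmetrically to the splits.
    return functions, events, connectors, arcs

# Instance graph of case 1 (see the sketch in Section 3.2).
nodes = {0, 1, 2, 3, 4}
edges = {(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)}
labels = {0: "source", 1: "S", 2: "A", 3: "B", 4: "sink"}
funcs, evts, conns, arcs = to_instance_epc(nodes, edges, labels)
print(conns)   # ['AND_split_1']
```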

Fig. 3. Instance EPC for

5 Example

In this section, we present an example illustrating the algorithms described in sections 3 and 4. We will start from a process log with some process instances. Then, we will run the algorithms to generate a set of instance EPCs that can be imported into ARIS PPM.

5.1 A Process Log

Consider a process log consisting of the following traces.

The process log in Table 2 shows the execution of tasks for a number of different instances of the same process. To save space, we abstracted from the original names of tasks and named each task with a single letter. The subscript refers to the position of that task in the process instance. Using this process log, we will first generate the causal relations from Definition 3.1. Note that casual relations are to be defined between tasks and not between log entries. Therefore, the subscripts are omitted here. This definition leads to the following set of causal relations: Using these relations, we generate instance graphs as described in Section 3 for each process instance. Then, these instance graphs are imported into ARIS PPM and a screenshot of this tool is presented (cf. Figure 5).

5.2 Instance Graphs

To illustrate the concept of instance graphs, we will present the instance graph for the first instance, “case 1”. In order to do this, we will follow Definition 3.2 to generate an instance ordering for that instance. Then, using these orderings, an instance graph is generated. Applying Definition 3.2 to case 1 in the log presented in Table 2 using the casual relations given in Section 5.1 gives the


following instance ordering: Using this instance ordering, an instance graph can be made as described in Definition 3.3. The resulting graph can be found in Figure 4. Note that the instance graphs of all other instances are isomorphic to this graph. Only, the numbers of the nodes change.

Fig. 4. Instance graph for case 1.

For each process instance, such an instance graph can be made. Using the algorithm presented in Section 4 each instance can than be converted into an instance EPC. These instance EPCs can be imported directly into ARIS PPM for further analysis. Here, we would like to point out again that our tools currently provide an implementation of the algorithms in this paper, such that the instance EPCs generated can be imported into ARIS PPM directly. A screenshot of this tool can be found in Figure 5 where “case 1” is shown as an instance EPC. Furthermore, inside the boxed area, the aggregation of some cases is shown. Note that this aggregation is only part of the functionality of ARIS PPM. Using graphical representations of instances, a large number of analysis techniques is available to the user. However, creating instances without knowing the original process model is an important first step.

6 Related Work

The idea of process mining is not new [1, 3, 5–7, 11, 12, 15, 16, 18, 21] and most techniques aim at the control-flow perspective. For example, the allows for the construction of a Petri net from an event log [1,5]. However, process mining is not limited to the control-flow perspective. For example, in [2] we use process mining techniques to construct a social network. For more information on process mining we refer to a special issue of Computers in Industry on process mining [4] and a survey paper [3]. In this paper, unfortunately, it is impossible to do justice to the work done in this area. To support our mining efforts we have developed a set of tools including EMiT [1], Thumb [21], and MinSoN [2]. These tools share the XML format discussed in this paper. For more details we refer to www.processmining.org. The focus of this paper is on the mining of the control-flow perspective. However, instead of constructing a process model, we mine for instance graphs.


Fig. 5. ARIS PPM screenshot.

The result can be represented in terms of a Petri net or an (instance) EPC. Therefore, our work is related to tools like ARIS PPM [12], Staffware SPM [20], and VIPTool [10]. Moreover, the mining result can be used as a basis for applying the theoretical results regarding partially ordered runs [8].

7 Conclusion

The focus of this paper has been on mining for instance graphs. Algorithms are presented to describe each process instance in a particular modelling language. From the instance graphs described in Section 3, other models can be created as well. The main advantage of looking at instances in isolation is twofold. First, it can provide a good starting point for all kinds of analysis such as the ones implemented in ARIS PPM. Second, it does not require any notion of completeness of a process log to work. As long as a causal relation is provided between log entries, instance graphs can be made. Existing methods such as the [1,3,5] usually require some notion of completeness in order to rediscover the entire process model. The downside thereof is that it is often hard to deal with noisy process logs. In our approach noise can be filtered out before implying the causal dependencies between log entries, without negative implications on the result of the mining process. ARIS PPM allows for the aggregation of instance EPCs into an aggregated EPC. This approach illustrates the wide applicability of instance graphs. However, the aggregation is based on simple heuristics that fail in the presence of


complex routing structures. Therefore, we are developing algorithms for the integration of multiple instance graphs into one EPC or Petri net. Early experiments suggest that such a two-step approach alleviate some of the problems existing process mining algorithms are facing [3,4].

References 1. W.M.P. van der Aalst and B.F. van Dongen. Discovering Workflow Performance Models from Timed Logs. In Y. Han, S. Tai, and D. Wikarski, editors, International Conference on Engineering and Deployment of Cooperative Information Systems (EDCIS 2002), volume 2480 of Lecture Notes in Computer Science, pages 45–63. Springer-Verlag, Berlin, 2002. 2. W.M.P. van der Aalst and M. Song. Mining Social Networks: Uncovering interaction patterns in business processes. In M. Weske, B. Pernici, and J. Desel, editors, International Conference on Business Process Management, volume 3080 of Lecture Notes in Computer Science, pages 244–260. Springer-Verlag, Berlin, 2004. 3. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow Mining: A Survey of Issues and Approaches. Data and Knowledge Engineering, 47(2):237–267, 2003. 4. W.M.P. van der Aalst and A.J.M.M. Weijters, editors. Process Mining, Special Issue of Computers in Industry, Volume 53, Number 3. Elsevier Science Publishers, Amsterdam, 2004. 5. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process Models from Event Logs. QUT Technical report, FIT-TR-2003-03, Queensland University of Technology, Brisbane, 2003. (Accepted for publication in IEEE Transactions on Knowledge and Data Engineering.). 6. R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs. In Sixth International Conference on Extending Database Technology, pages 469–483, 1998. 7. J.E. Cook and A.L. Wolf. Discovering Models of Software Processes from EventBased Data. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998. 8. J. Desel. Validation of Process Models by Construction of Process Nets. In W.M.P. van der Aalst, J. Desel, and A. Oberweis, editors, Business Process Management: Models, Techniques, and Empirical Studies, volume 1806 of Lecture Notes in Computer Science, pages 110–128. Springer-Verlag, Berlin, 2000. 9. J. Desel and J. Esparza. Free Choice Petri Nets, volume 40 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1995. 10. J. Desel, G. Juhas, R. Lorenz, and C. Neumair. Modelling and Validation with VipTool. In W.M.P. van der Aalst, A.H.M. ter Hofstede, and M. Weske, editors, International Conference on Business Process Management (BPM 2003), volume 2678 of Lecture Notes in Computer Science, pages 380–389. Springer-Verlag, 2003. 11. J. Herbst. A Machine Learning Approach to Workflow Management. In Proceedings 11th European Conference on Machine Learning, volume 1810 of Lecture Notes in Computer Science, pages 183–194. Springer-Verlag, Berlin, 2000. 12. IDS Scheer. ARIS Process Performance Manager (ARIS PPM), http://www.idsscheer.com, 2002.


13. G. Keller, M. Nüttgens, and A.W. Scheer. Semantische Processmodellierung auf der Grundlage Ereignisgesteuerter Processketten (EPK). Veröffentlichungen des Instituts für Wirtschaftsinformatik, Heft 89 (in German), University of Saarland, Saarbrücken, 1992. 14. G. Keller and T. Teufel. SAP R/3 Process Oriented Implementation. AddisonWesley, Reading MA, 1998. 15. A.K.A. de Medeiros, W.M.P. van der Aalst, and A.J.M.M. Weijters. Workflow Mining: Current Status and Future Directions. In R. Meersman, Z. Tari, and B.C. Schmidt, editors, On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, volume 2888 of Lecture Notes in Computer Science, pages 389–406. Springer-Verlag, Berlin, 2003. 16. M. zur Mühlen and M. Rosemann. Workflow-based Process Monitoring and Controlling - Technical and Organizational Issues. In R. Sprague, editor, Proceedings of the 33rd Hawaii International Conference on System Science (HICSS-33), pages 1–10. IEEE Computer Society Press, Los Alamitos, California, 2000. 17. T. Murata. Petri Nets: Properties, Analysis and Applications. Proceedings of the IEEE, 77(4):541–580, April 1989. 18. M. Sayal, F. Casati, and M.C. Shan U. Dayal. Business Process Cockpit. In Proceedings of 28th International Conference on Very Large Data Bases (VLDB’02), pages 880–883. Morgan Kaufmann, 2002. 19. A.W. Scheer. Business Process Engineering, Reference Models for Industrial Enterprises. Springer-Verlag, Berlin, 1994. 20. Staffware. Staffware Process Monitor (SPM). http://www.staffware.com, 2002. 21. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2): 151–162, 2003.

A New XML Clustering for Structural Retrieval*

Jeong Hee Hwang and Keun Ho Ryu

Database Laboratory, Chungbuk National University, Korea
{jhhwang,khryu}@dblab.chungbuk.ac.kr

Abstract. XML is becoming increasingly important in data exchange and information management. The starting point for retrieving information and integrating documents efficiently is to cluster the documents that have similar structure. Thus, in this paper, we propose a new XML document clustering method based on structural similarity. Our approach first extracts the representative structures of XML documents by sequential pattern mining. We then cluster XML documents of similar structure using a clustering algorithm for transactional data, treating an XML document as a transaction and the frequent structures of a document as the items of that transaction. We also apply our technique to XML retrieval. Our experiments show the efficiency and good performance of the proposed clustering method. Keywords: Document Clustering, XML Document, Sequential Pattern, Structural Similarity, Structural Retrieval

1 Introduction XML(eXtensible Markup Language) is a standard for data representation and exchange on the Web, and we will find large XML document collection on the Web in the near future. Therefore, it has become crucial to address the question of how we can efficiently query and search XML documents. Meanwhile, the hierarchical structure of XML has a great influence on the information retrieval, the document management system, and data mining[l,2,3,4]. Since an XML document is represented as a tree structure, one can explore the relationship among XMLs using various tree matching algorithms[5,6]. A closely related problem is to find trees in a database that “match” a given pattern or query tree[7]. This type of retrieval often exploits various filters that eliminate unqualified data trees from consideration at an early stage of retrieval. The filters accelerate the retrieval process. Another approach to facilitating a search is to cluster XMLs into appropriate categories. We propose a new XML clustering technique based on similar structure in this paper. We first extract the representative structures of frequent patterns including hierarchical structure information from XML documents by the sequential pattern mining method[8]. And then we perform the document clustering by considering both the CLOPE algorithm[9] and large items [10], assuming that an XML document as a transaction and the extracted frequent structures from documents as the items of the transaction. We also apply our method to structural retrieval of XML documents in order to verify the efficiency of proposed technique. * This work was supported by University IT Research Center Project and ETRI in Korea.



The remaining of the paper is organized as follows. Section 2 reviews the previous researches related to the structure of XML documents. Section 3 describes the method extracting the representative structures of XML documents. In section 4, we define our clustering criterion using large items, and we describe about updating the cluster, and section 5 explains how to apply our clustering method to XML retrieval. Section 6 shows the experiment results of clustering algorithm and the result of XML retrieval, section 7 concludes the paper.

2 Related Works Recently, as XML documents with various structures are increasing, it is needed to study the method that classifies the similar structure documents and retrieves the documents [3,4]. [11] considered XML as a tree and analyzed the similarity among the documents by taking account of semantics. [12] referred the necessity to manage the increasing XML documents and proposed the clustering method about element tags and the text of XML documents using k-means algorithm. In [3,4,13], they say that there are two kinds of structure mining technique for extracting the XML document structure; intra-structured mining for one document and inter-structured mining for various documents. But the concrete algorithm is not described. [14] proposed the clustering method about the DTD based on the similarity of elements as the way to find out the mediate DTD to integrate DTDs. But it can just be applied to the DTDs with the same application domain. [15] concentrated on finding out the common structure of the tree, but not cosidering the document clustering. [16] grouped trees about the same pairs of labels occurring frequently, and then finds a subset of the frequent trees. But the multi relational tree structure can’t be detected, because it is based on the label pairs. [17] proposed the method for clustering the XML documents using the bit map indexing, but it requires too much space for a large amount of documents. In this paper, we use the CLOPE algorithm[9] adding the notion of large items for document clustering. The CLOPE algorithm uses only the rate of the common items, not considering individual items in a cluster. Therefore, it can have some problems that the similarity between clusters may be higher, and it mayn’t control the number of clusters. In order to address this problem, we add the notion of large items about a cluster to CLOPE algorithm.

3 Extracting the Representative Structure of XML Documents XML document has sequential and hierarchical structure of elements. Therefore, the orders of the elements and the elements themselves have the feature that can distinguish the XML documents [11,13]. Thus, we use the sequential pattern mining that considers both the frequency and the order of elements.

3.1 Element Path Sequences We first extract representative structures of each document based on the path from the root to the element about elements having content value. Figure 1 is an example XML document to show how to find out the representative structures from the documents.


Fig. 1. An XML document


Fig. 2. Element mapping table

We rename each element with alphabet to easily distinguish elements using the element mapping table, as shown in Figure 2. Based on the renamed element by Figure 2, the element paths having contents value is represented as Figure 3, in which element paths are regarded as the sequences and each element contained in sequence is considered to be the items. And then we find out the frequent sequence structures that satisfy the given minimum support by the sequential pattern mining algorithm.

3.2 The Sequential Pattern Algorithm to Extract The Frequent Structure To extract the frequent structures, we use the PrefixSpan Fig. 3. Element path sealgorithm[8] about Figure 3. To do this, we define the quences frequent structure minimum support as follows. Definition 1 (Frequent Structure Minimum Support). Frequent structure minimum support is the least frequency that satisfies the rate of the frequent structure among the whole paths in a document, and the path sequences that satisfy this condition are the frequent structures. The formula of this is as follows. FFMS = frequent structure rate the number of path of the whole documents If the frequent structure rate is 0.2, FFMS of sequence set of Figure 3 is 2 And the element frequency of the length-1 satisfying the FFMS is a: 6, b: 6, c2: 4. Starting from this length-1 sequential pattern, we extract the frequent pattern structures using the projected DB(refer to [8] for the detail algorithm). According to this method, the maximal frequent structure in Figure 3 is , and this path is occurred at the rate of about 66%(4/6) to the whole document. We also include the structures of length over the regular rate to the maximal frequent structure (e.g. the most frequent length 5 80% = the frequent structure length 4) to the representative structures as the input data for clustering. The reason is that it can avoid frequent structures missing, in case there are various subjects in a document.
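
The FFMS threshold and the frequent path structures it selects can be illustrated with the sketch below. It assumes that FFMS is rounded up, as the example in the text suggests (0.2 × 6 paths gives FFMS = 2), and it uses a brute-force enumeration of ordered subsequences instead of PrefixSpan [8]. The six path sequences are illustrative, chosen so that they reproduce the figures quoted in the text (items a:6, b:6, c2:4 frequent; maximal frequent structure <a, b, c2> with support 4, i.e. about 66%).

```python
# Sketch of the FFMS threshold and a brute-force stand-in for PrefixSpan.
import math
from itertools import combinations
from collections import Counter

# Illustrative element-path sequences (cf. Figure 3).
paths = [("a", "b", "c2")] * 4 + [("a", "b", "d"), ("a", "b", "e")]

def frequent_structures(paths, rate):
    ffms = math.ceil(rate * len(paths))        # frequent structure minimum support
    support = Counter()
    for p in paths:
        subseqs = set()
        for n in range(1, len(p) + 1):         # every ordered subsequence of p
            subseqs.update(combinations(p, n))
        support.update(subseqs)
    return ffms, {s: c for s, c in support.items() if c >= ffms}

ffms, frequent = frequent_structures(paths, rate=0.2)
print(ffms)                                    # 2
longest = max(frequent, key=len)
print(longest, frequent[longest])              # ('a', 'b', 'c2') 4
```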


4 Large Item Based Document Cluster The frequent structures of each XML document are basic data for clustering. We assume the XML documents as a transaction, the frequent structures extracted from each document as the items of the transaction, and then we perform the document clustering using the notion of large items.

4.1 A New Clustering Criterion The item set included all the transaction is defined as cluster set as and transaction set that represents the document as As a criterion to allocate a transaction to the appropriate cluster, we define the cluster allocation gain. Definition 2 (Cluster Allocation Gain). The cluster allocation gain is the sum of the ratio of the total occurrences to the individual items in every cluster. The following equation expresses this.

where G is the occurrence rate(H) to individual item(W) in a cluster, H = T (the total occurrence of the individual items) / W (the number of the individual items), Gain is a criterion function for cluster allocation of the transaction, and the higher the rate of the common items, the more the cluster allocation gain. Therefore we allocate a transaction to the cluster to be the largest Gain. However if we use only the rate of the common items, not considering the individual items like CLOPE, it causes some problems as follows. Example 1. Assume that transaction t4 = {f, c} is to be inserted, under the condition of the cluster C1 = {a:3, b:3, c:1}, C2 = {d:3, e:1, c:3} including three transactions respectively. If t4 is allocated to C1 or C2, then Gain is

Other

while, if t4 is allocated to a new cluster, then Gain is Thus, t4 is allocated a new cluster by Definition 2. As you see in this example, we can get the considerably higher allocation grain about a new cluster, because Gain about a new cluster equals Due to this, it causes the production of many clusters over the regular size, so that it may reduce cluster cohesion. In order to address this problem, we define the large items and the cluster participation as follows. Definition 3 (Large Items). Item support of cluster Ci is defined as the number of the transactions including item (j and its closing tag is placed after the tag . Based on the IR annotations the intrasite search engine can improve the quality and accuracy of query results. Only pages which have IRDisplayContentType equal to “content” are indexed. Pages defined as “entry” represent those pages which are entry points to the Web site. Notice that the distinction between content and entry pages is necessary because user queries can also be classified in two categories [22,29]: (1) bookmark queries, which refer to locating an entry page for a specific site portion. For example, searching for the entry page of the economy section of a newspaper Web site. (2) content queries, which are the most common sort, denote user queries that result in single content pages. For example, finding a page that describes how the stock market operates.


Pages with IRDisplayContentType “irrelevant” are automatically discarded. Annotating irrelevant pages is important as it makes the system index pages with relevant content only, making the resulting search engine more efficient and accurate. Each piece of information has a level of significance defined by the value given to the attribute IRSignificance. This feature specifies the importance of a particular piece of information with respect to the page where it is placed. This means that the same piece of information might have a different level of significance when placed in another page. In the indexing process the system stores each piece of information, its location and its IRSignificance value. In addition, the number of occurrences of each term present in the piece of information is also stored. This information is used by our information retrieval model to compute the ranking of documents for each user query submitted to the intrasite search engine. The information retrieval model adopted here is an extension of the well-known Vector Space Model [27]. This model is based on identifying how related each term (word) t_i is to each document (page) d_j, which is expressed as a function w(t_i, d_j). The queries are modelled in the same way, and the function w is used to represent each element as a vector in a space determined by the set of all distinct terms. The ranking in this model is computed by the score function sim(d_j, q) for each document d_j in the collection and a given query q, as in the equation below.

sim(d_j, q) = (Σ_i w(t_i, d_j) · w(t_i, q)) / (‖d_j‖ · ‖q‖),   (1)

which is the cosine between the vectors representing d_j and q and expresses how similar document d_j is to the query q. The documents which have a similarity higher than zero are presented to the users in descending order. The function w in Equation 1 gives a measurement of how related term t_i and document d_j are. This value is usually computed as w(t_i, d_j) = tf_{i,j} × idf_i, where idf_i is the inverse document frequency and measures the importance of term t_i for the whole set of documents, while tf_{i,j} expresses the importance of term t_i for document d_j. The idf value is usually computed as

idf_i = log(#docs / n_i),   (2)

where #docs is the number of documents (pages) in the collection and n_i is the number of documents where term t_i occurs. The tf value can be computed in several ways. However, it is always a function of the term frequency in the document. Common formulae directly compute the number of occurrences of t_i in d_j [16]. We here propose the use of information provided by the Web site modelling to define the function tf based not only on the term frequency, but also on the IRSignificance described during the Web site modelling. Given a Web page d_j composed of m different pieces of information f_1, ..., f_m, we define

tf_{i,j} = Σ_{k=1..m} freq(t_i, f_k) × irsig(f_k),   (3)

where freq(t_i, f_k) gives the number of occurrences of term t_i in the piece of information f_k and irsig(f_k) assigns the values 0, 1, 2 or 3, corresponding to irrelevant, low, medium or high, respectively, to piece of information f_k as derived from the Web site modelling. By using this equation, the system assigns to each piece of information a precise importance value, allowing the pages to be ranked according to whether the terms used in a query match the most significant pieces of information.
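
A minimal sketch of this weighted ranking is given below. The significance scale {0, 1, 2, 3} and the log-based idf come from the text; the page names, the data layout and the tokenisation are assumptions made only for illustration.

```python
# Sketch: ranking with the IRSignificance-weighted tf combined with a
# standard idf and a cosine-style score. Data layout is illustrative.
import math
from collections import defaultdict

SIG = {"irrelevant": 0, "low": 1, "medium": 2, "high": 3}

# Each page is a list of (significance, text) pieces of information.
pages = {
    "news1": [("high", "stock market falls"), ("low", "related news links")],
    "news2": [("high", "election results"), ("medium", "stock analysis")],
}

def tf(page):
    weights = defaultdict(int)
    for sig, text in page:
        for term in text.lower().split():
            weights[term] += SIG[sig]          # occurrences weighted by IRSignificance
    return weights

tfs = {name: tf(p) for name, p in pages.items()}
ndocs = len(pages)
df = defaultdict(int)
for w in tfs.values():
    for term in w:
        df[term] += 1
idf = {t: math.log(ndocs / n) for t, n in df.items()}

def score(query):
    terms = query.lower().split()
    ranking = {}
    for name, w in tfs.items():
        dot = sum(w.get(t, 0) * idf.get(t, 0.0) for t in terms)
        norm = math.sqrt(sum((wt * idf[t]) ** 2 for t, wt in w.items())) or 1.0
        ranking[name] = dot / norm             # cosine-style normalisation
    return sorted(ranking.items(), key=lambda kv: -kv[1])

print(score("stock market"))                   # news1 ranks first
```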

4.1 Generating a Web Site and Its Associated Intrasite Search Engine

A high-level specification of an application must be provided as the starting point of the development process. Existing conceptual models can be used for this task, as discussed earlier. Since issues related to mapping data model constructs to our intermediate representation language are not in the scope of this paper, we assume that this task has already been carried out. Details of mapping procedures from an ER schema to our intermediate representation can be found in [4]. Once an intermediate representation of the Web site application is provided, the next step is the generation of the pages and the search engine. The steps to perform a complete generation are:
1. Instantiation of pieces of information, which usually involves access to databases.
2. Creation of pages. For each display unit a number of corresponding pages are created, depending on the number of instances of its pieces of information.
3. Instantiation of links.
4. Translation of the whole intermediate representation to a target language, such as HTML.
5. Application of visualization styles to all pieces of information and pages, based on style and page template definitions.
6. Creation of additional style sheets, as CSS specifications.
7. Creation of the intrasite search engine.

Visualization is described by individual pieces of information styles, page styles (stylesheet) and page templates. A suitable interface should be offered to the designer in order to input all necessary. Currently, a standard CSS stylesheet is automatically generated including definitions provided by the designer. The reason to make use of stylesheets is to keep the representation for our visualization styles simple. Without a stylesheet, all visual details would have to be included as arguments to the mapping procedure which translates a visualization style to HTML.


As for the creation of the intrasite search engine, its code is automatically incorporated as part of the resulting Web site. Furthermore its index is generated along with the Web site pages, according to the IRDisplayContentType and IRSignificance specifications.
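
How the index generation can honour these annotations is sketched below: pages marked “irrelevant” are skipped entirely, “entry” pages are kept but flagged so that bookmark-style queries can be served, and each indexed piece of information carries its IRSignificance. Only the two attribute names come from the text; the data layout, URLs and helper name are illustrative assumptions.

```python
# Sketch: building the intrasite index from annotated pages (illustrative).
pages = [
    {"url": "/news/1", "IRDisplayContentType": "content",
     "pieces": [("high", "stock market falls"), ("irrelevant", "share print login")]},
    {"url": "/economy", "IRDisplayContentType": "entry",
     "pieces": [("medium", "economy section")]},
    {"url": "/banner", "IRDisplayContentType": "irrelevant", "pieces": []},
]

def build_index(pages):
    index = {}                                  # term -> list of (url, significance, page kind)
    for page in pages:
        kind = page["IRDisplayContentType"]
        if kind == "irrelevant":
            continue                            # never indexed
        for sig, text in page["pieces"]:
            if sig == "irrelevant":
                continue                        # navigation widgets, banners, ...
            for term in text.lower().split():
                index.setdefault(term, []).append((page["url"], sig, kind))
    return index

index = build_index(pages)
print(index["economy"])    # [('/economy', 'medium', 'entry')]
print("/banner" in {u for posts in index.values() for u, _, _ in posts})  # False
```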

5 Experiments

In this section we present experiments to evaluate the impact of our new integrated strategy for designing Web sites and intrasite search engines. For these experiments we have constructed two intrasite search systems for the Brazilian Web portal “ultimosegundo”, indexing 12,903 news Web pages. The first system was constructed without considering information provided by the application data model and it has been implemented using the traditional vector space model [27]. The second system was constructed using our IR-aware data model described in Section 4. To construct the second system we first modelled the Web site using our IR-aware methodology, generating a new version where the IRDisplayContentType of each page and the IRSignificance of the semantic pieces of information that compose each page are available. Figure 6 illustrates a small portion of the intermediate representation of the Web site modelled using our modelling language. The structure and content of this new site is equal to the original version, preserving all pages and keep them with the same content. The first side effect of our methodology is that only pages with useful content are indexed. In the example only pages derived from the display unit NewsPage

Fig. 6. Example of a Partial Intermediate Representation of a Web Site


In the example, only pages derived from the display unit NewsPage are indexed. Furthermore, pieces of information that do not represent useful information are also excluded from the search system. For instance, each news Web page in the site has links to related news (OtherNews); these links are considered non-relevant pieces of information because they are included in the page as a navigation facility, not as content. As a result, the final index size was only 43% of the index file created for the original site, which means our intrasite search version uses less storage space and is faster when processing user queries. The experiments evaluating the quality of results were performed using a set of 50 queries extracted from a log of queries on news Web sites. The queries were randomly selected from the log and have an average length of 1.5 terms, as the majority of queries are composed of one or two terms. In order to evaluate the results, we used a precision-recall curve, which is the most widely applied method for evaluating information retrieval systems [1]. The precision at any point of this curve is computed using the set of relevant answers for each query (N) and the set of answers given by each system for this query (R). The formulae for computing precision and recall are described in Equation 4. For further details about precision-recall curves the interested reader is referred to [1, 32].

To obtain the precision-recall curve we need human judgment to determine the set of relevant answers for each evaluated query. This set was determined here using the pooling method used for the Web-based collections of TREC [19]. This method consists of retrieving a fixed number of top answers from each of the evaluated systems and then forming a pool of answers which is used for determining the set of relevant documents. Each answer in the pool is analyzed by humans and classified as relevant or non-relevant for the given user query. After analyzing the answers in the pool, we use the relevant answers identified by humans as the set N in Equation 4. For each of the 50 queries of our experiments, we composed a query pool formed by the top 50 documents returned by each of the two intrasite search systems evaluated. The query pools contained an average of 62.2 pages (some queries had fewer than 50 documents in the answer). All documents in each query pool were submitted to a manual evaluation. The average number of relevant pages per query pool is 28.5. Figure 7 shows the precision-recall curves obtained in our experiment for both systems. Our modelling-aware intrasite search is labelled in the figure as “Modelling-aware”, while the original vector space model is labelled as “Conventional”. The figure shows that the quality of the ranking produced by our system was superior at all recall points. The precision at the first points of the curve was roughly 96% for our system, against 86.5% for the conventional one, which means an improvement of almost 11% in precision. For higher levels of recall the difference becomes even larger, being roughly 20% at 50% recall and 50% at 100% recall. This last result indicates that our system found on average 50% more relevant documents in this experiment.


Fig. 7. Comparison of average precision versus recall curves obtained when processing the 50 queries using the original vector space model and the IR-aware model

The average precision over the 11 recall points was 56% for the conventional system and 84% for the Modelling-aware system, which represents an improvement of 48%. Another important observation about the experiment is that our system returned on average only 209.8 documents per query (from these, we selected 50 for evaluation), while the original system returned 957.66 results on average. This difference is again due to the elimination of non-relevant information from the index. To give an example, the original system returned almost all pages as a result for the query “september 11th”, while our system returns fewer than 300 documents. This difference happened because almost all pages in the site had a footnote text linking to a special section about this topic in the site.
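The evaluation procedure (pooling plus an 11-point interpolated precision-recall average, as in [1]) can be sketched as follows; the helper names are ours and the snippet assumes each system returns a ranked list of document identifiers.

```python
# Sketch of the evaluation: pool the top answers of all systems, judge the
# pool manually (here: a given set of relevant ids), then compute the
# 11-point interpolated average precision of one ranking.
def pool(rankings, depth=50):
    """Union of the top `depth` answers of every evaluated system."""
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:depth])
    return pooled

def eleven_point_average_precision(ranking, relevant):
    hits, recall_precision = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            recall_precision.append((hits / len(relevant), hits / rank))
    points = []
    for level in (i / 10 for i in range(11)):
        attainable = [p for r, p in recall_precision if r >= level]
        points.append(max(attainable, default=0.0))
    return sum(points) / 11
```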

6

Conclusions

We have presented a new modelling technique for Web site design that transfers information about the model to the generated Web pages. We also presented a new intrasite search model that uses this information to improve the quality of the results presented to users and to reduce the size of the indexes generated for processing queries. In our experiments we presented one particular example of application of our method that illustrates its viability and effectiveness. The gains obtained in precision and storage space reduction may vary for different Web sites. However, this example gives a good indication that our method can be effectively deployed to solve the problem of intrasite and intranet search. For the site modelled we obtained an improvement of 48% in average precision and, at the same time, a reduction in index size, occupying only 43% of the space used by the traditional implementation. This means our method produces faster and more precise intrasite search systems. As future work we are planning to study the application of our method to other Web sites in order to evaluate the gains in more detail and to refine our approach.


We are also studying strategies to automatically compute the IRSignificance of pieces of information and to automatically determine the weights of each piece of information for each display unit. These automatic methods will allow our approach to be used for non-modelled Web sites, which may extend its benefits to global search engines. The paradigm described here opens new possibilities for designing better intrasite search systems. Another future research direction is defining new modelling characteristics that can be useful for intrasite search systems. For instance, we are interested in finding ways of determining the semantic relations between Web pages during the modelling phase and using this information to cluster these pages in a search system. The idea is to use the cluster properties to improve the knowledge about the semantic meaning of each Web page in the site.

Acknowledgements

This paper is the result of research work done in the context of the SiteFix and Gerindo projects sponsored by the Brazilian Research Council - CNPq, grants no. 55.2197/02-5 and 55.2087/05-5. The work was also supported by an R&D grant from Philips MDS-Manaus. The second author was sponsored by the Amazonas State Research Foundation - FAPEAM. The fourth author is sponsored by the Brazilian Research Council - CNPq, grant no. 300220/2002-2.

References 1. BAEZA-YATES, R., AND RIBEIRO-NETO, B. Modern Information Retrieval, 1st ed. Addison-Wesley-Longman, 1999. 2. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference (Brisbane, Australia, April 1998), pp. 107–117. 3. CAVALCANTI, J., AND ROBERTSON, D. Synthesis of Web Sites from High Level Descriptions. In Web Engineering: Managing Diversity and Complexity in Web Application Development, S. M. . Y. Deshpande, Ed., vol. 2016 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, Germany, 2001, pp. 190–203. 4. CAVALCANTI, J., AND ROBERTSON, D. Web Site Synthesis based on Computational Logic. Knowledge and Information Systems Journal (KAIS) 5, 3 (Sept. 2003), 263– 287. 5. CAVALCANTI, J., AND VASCONCELOS, W. A Logic-Based Approach for Automatic Synthesis and Maintenance of Web Sites. In Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering - SEKE’02 (Ischia, Italy, July 2002), pp. 619–626. 6. CERI, S., FRATERNALI, P., AND BONGIO, A. Web Modeling Language (WebML): a Modeling Language for Designing Web Sites. In Proceedings of the WWW9 conference (Amsterdam, the Netherlands, May 2000), pp. 137–157. 7. CHEN, M., HEARST, M., HONG, J., AND LIN, J. Cha-cha: a system for organizing intranet search results. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (Boulder,USA, October 1999). 8. CHEN, P. The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems 1, 1 (1976).


9. CRASWELL, N., HAWKING, D., AND ROBERTSON, S. Effective site finding using link anchor information. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, USA, September 2001), pp. 250–257. 10. EIRON, N., AND MCCURLEY, K. S. Analysis of anchor text for web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, Canada, July 2003), pp. 459– 460. 11. FERNÁNDEZ, M., FLORESCU, D., KANG, J., LEVY, A., AND SUCIU, D. Catching the Boat with Strudel: Experience with a A Web-site Management System. SIGMOD Record 27, 2 (June 1998), 414–425. 12. FLORESCU, D., LEVY, A., AND MENDELZON, A. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record 27, 3 (Sept. 1998), 59–74. 13. GARZOTTO, G., PAOLINI, P., AND SCHWABE, D. HDM - A Model-Based Approach to Hypertext Application Design. TOIS 11, 1 (1993), 1–26. 14. GEVREY, J., AND RÜGER, S. M. Link-based approaches for text retrieval. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 279–285. 15. GÓMEZ, J., CACHERO, C., AND PASTOR, O. Conceptual Modeling of DeviceIndependent Web Applications. IEEE Multimedia 8, 2 (Apr. 2001), 26–39. 16. G.SALTON, AND BUCKLEY, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 5, 24 (1988), 513–523. 17. HAGEN, P., MANNING, H., AND PAUL, Y. Must Search Stink ? The Forrester Report, June 2000. 18. HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. B. Overview of TREC-7 very large collection track. In The Seventh Text REtrieval Conference (TREC-7) (Gaithersburg, Maryland, USA, November 1998), pp. 91–104. 19. HAWKING, D., CRASWELL, N., THISTLEWAITE, P. B., AND HARMAN, D. Results and challenges in web search evaluation. Computer Networks 31, 11–16 (May 1999), 1321–1330. Also in Proceedings of the 8th International World Wide Web Conference. 20. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of trec-8 web track. In Proc. of TREC-8 (Gaithersburg MD, November 1999), pp. 131–150. 21. JIN, Y., DECKER, S., AND WIEDERHOLD, G. OntoWebber: Model-driven ontologybased Web site management. In Proceedings of the first international semantic Web working symposium (SWWS’01) (Stanford, CA, USA, July 2001). 22. KANG, I.-H., AND KIM, G. Query type classification in web document retrieval. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, July 2003), pp. 64–71. 23. KANUNGO, T., AND ZIEN, J. Y. Integrating link structure and content information for ranking web documents. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 237–239. 24. KLEINBERG, J. M. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, USA, January 1998), pp. 668–677. 25. MAEDCHE, A., STAAB, S., STOJANOVIC, N., STUDER, R., AND SURE, Y. SEAL - A framework for developing semantic Web portals. In Proceedings of the 18th British national conference on databases (BNCOD 2001) (Oxford, England, UK, July 2001).


26. MECCA, G., ATZENI, P., MASCI, A., MERIALDO, P., AND SINDONI, G. The ARANEUS Web-Base Management System. SIGMOD Record (ACM Special Interest Group on Management of Data) 27, 2 (1998), 544. 27. SALTON, G., AND MCGILL, M. J. Introduction to Modern Information Retrieval, 1st ed. McGraw-Hill, 1983. 28. SCHWABE, D., AND ROSSI, G. The Object-oriented Hypermedia Design Model. Communications of the ACM 38, 8 (Aug. 1995), 45–46. 29. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence in home page finding. ACM Transactions on Information Systems - ACM TOIS 21, 3 (2003), 286–313. 30. VASCONCELOS, W., AND CAVALCANTI, J. An Agent-Based Approach to Web Site Maintenance. In Proceedings of the 4th International Conference on Web Engineering and Knowledge Engineering - ICWE 2004- To Appear. (Munich, Germany, July 2004). 31. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using content, links, URLs and anchors. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 663–672. 32. WITTEN, I., MOFFAT, A., AND BELL, T. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York, 1999. 33. XUE, G.-R., ZENG, H.-J., CHEN, Z., MA, W.-Y., ZHANG, H.-J., AND LU, C.-J. Implicit link analysis for small web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto - Canada, July 2003), pp. 56–63.

Expressive Profile Specification and Its Semantics for a Web Monitoring System*
Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy
Information Technology Laboratory, Computer Science and Engineering, UT Arlington, Texas, USA
{eppili,jacob,sachde,sharma}@cse.uta.edu

Abstract. The World Wide Web has gained a lot of prominence with respect to information retrieval and data delivery. With such prolific growth, a user interested in a specific change has to continuously retrieve/pull information from the web and analyze it. This wastes resources and, more importantly, places the burden on the user. Pull-based retrieval needs to be replaced with a push-based paradigm for efficiency and for notification of relevant information in a timely manner. WebVigiL is an efficient profile-based system to monitor, retrieve, detect and notify specific changes to HTML and XML pages on the web. In this paper, we describe the expressive profile specification language along with its semantics. We also present an efficient implementation of these profiles. Finally, we present the overall architecture of the WebVigiL system and its implementation status.

1 Introduction

Information on the Internet, growing at a rapid rate, is spread over multiple repositories. This has greatly affected the way information is accessed, delivered and disseminated. Users, at present, are not only interested in the new information available on web pages but also in retrieving changes of interest in a timely manner. More specifically, users may only be interested in particular changes (such as keywords, phrases, links etc.). Push and pull paradigms [1] are traditionally used for monitoring pages of interest. The pull paradigm is an approach where the user performs an explicit action, in the form of a query or transaction execution, on a periodic basis on the pages of interest. Here, the burden of retrieving the required information is on the user, and changes may be missed when a large number of web sites need to be monitored. In the push paradigm, the system is responsible for accepting user needs and informs the user (or a set of users) when something of interest happens. Although this approach reduces the burden on the user, naive use of a push paradigm results in informing users about changes to web pages irrespective of the user's interest. At present, most systems use a mailing list to send the same compiled changes to all their subscribers.

* This work was supported, in part, by the Office of Naval Research, the SPAWAR System Center–San Diego, the Rome Laboratory (grant F30602-01-2-05430), and by NSF (grant IIS-0123730). P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 420–433, 2004. © Springer-Verlag Berlin Heidelberg 2004


Hence, an approach is needed which replaces periodic polling and notifies the user of the relevant changes in a timely manner. The emphasis in WebVigiL is on selective change notification. This entails notifying the user about changes to web pages based on a user-specified interest/policy. WebVigiL is a web monitoring system which uses an appropriate combination of the push and intelligent pull paradigms, with the help of active capability, to monitor customized changes to HTML and XML pages. WebVigiL intelligently pulls the information from the web server using a learning-based algorithm [2], based on the user profile, and propagates/pushes only the relevant information to the end user. In addition, WebVigiL is a scalable system, designed to detect even composite changes for a large number of users. An overview of the paradigm used and the basic approach taken for effective monitoring is discussed in [3]. This paper concentrates on the expressiveness of change specification, its semantics, and its implementation. In order for the user to specify notification and monitoring requirements, an expressive change specification language is needed. The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the syntax and semantics of the change specification language, which captures the monitoring requirements of the user and in addition supports inheritance, event-based duration and composite changes. Section 4 gives an overview of the current architecture and status of the system. Section 5 concludes the paper with an emphasis on future work.

2 Related Work

Many research groups have been working on detecting changes to documents. GNU diff [4] detects changes between any two text files. Most of the previous work in change detection has dealt only with flat files [5] and not with structured or unstructured web documents. Several tools have been developed to detect changes between two versions of unstructured HTML documents [6]. Some change-monitoring tools, such as ChangeDetection.com [7], have been developed using the push-pull paradigm. But these tools detect changes to the entire page instead of user-specified components, and changes can be tracked only on a limited number of pages.

2.1 Approaches for User Specification

Present-day users are interested in monitoring changes to pages and want to be notified based on their profiles. Hence, an expressive language is necessary to specify user intent on fetching, monitoring and propagating changes. WebCQ [8] detects customized changes between two given HTML pages and provides an expressive language for the user to specify his/her interests. But WebCQ only supports changes between the last two versions of the page of interest. As a result, flexible and expressive compare options are not provided to the user. The AT&T Internet Difference Engine [9] views an HTML document as a sequence of sentences and sentence-breaking markups. This approach may be computationally expensive, as each sentence may need to be compared with all sentences in the document. WYSIGOT [10] is a commercial application that can be used to detect changes to HTML pages.


It has to be installed on the local machine, which is not always possible. This system gives an interface to specify the requirements for monitoring a web page. It can monitor an HTML page and also all the pages that it points to, but the granularity of change detection is at the page level. In [11], the authors allow the user to submit monitoring requests and continuous queries on the XML documents stored in the Xyleme repository. WebVigiL supports a life-span for a change monitoring request, which is akin to a continuous query. Change detection is continuously performed over the life-span. To the best of our knowledge, customized changes, inheritance, different reference selections or correlated specifications cannot be specified in Xyleme.

3 Change Specification Language

The present-day web user's interest has evolved from mere retrieval of information to monitoring the changes on web pages that are of interest. As the web pages are distributed over large repositories, the emphasis is on selective and timely propagation of information/changes. Changes need to be notified to users in different ways based on their profiles/policies. In addition, the notification of these changes may have to be sent to different devices that have different storage and communication bandwidths. The language for establishing the user policies should be able to accommodate the requirements of a heterogeneous, distributed, large network-centric environment. Hence, there is a need to define an expressive and extensible specification language wherein the user can specify details such as the web page(s) to be monitored, the type of change (keywords, phrases etc.) and the interval for comparing occurrences of changes. The user should also be able to specify how, when, and where to be notified, taking into consideration quality of service factors such as timeliness and size vs. quality of notification. WebVigiL provides an expressive language with well-defined semantics for specifying the monitoring requirements of a user pertaining to the web [12]. Each monitoring request is termed a sentinel. The change specification language developed for this purpose allows the user to create a monitoring request based on his/her requirements. The semantics of this language for WebVigiL have been formalized. The complete syntax of the language is shown in Fig. 1. Following are a few monitoring scenarios that can be represented using this sentinel specification language.

Example 1: Alex wants to monitor http://www.uta.edu/spring04/cse/classes.htm for the keyword "cse5331" in order to decide whether to register for the course cse5331. The sentinel runs from May 15, 2004 to August 10, 2004 (the summer semester), and she wants to be notified as soon as a change is detected. The sentinel (s1) for this scenario is as follows:

Create Sentinel s1 Using http://www.uta.edu/spring04/cse/classes.htm
Monitor keyword (cse5331)
Fetch 1 day
From 05/15/04 To 08/10/04
Notify By email [emailprotected] Every best effort
Compare pairwise
ChangeHistory 3
Presentation dual frame

Example 2: Alex wants to monitor the same URL as in Example 1 for regular updates on new courses being added, but is not interested in changes to images. As it is correlated with sentinel s1, the duration is specified between the start of s1 and the end of s1. The sentinel (s2) for the above scenario is:

Create Sentinel s2 Using s1
Monitor Anychange AND (NOT) images
Presentation only change

Fig. 1. Sentinel Syntax

3.1 Sentinel Name

This specifies a name for the user's request. The syntax of the sentinel name is Create Sentinel <sentinel-name>.


For every sentinel, the WebVigiL system generates a unique identifier. In addition, the system also allows the user to specify a sentinel name. The user is required to specify a distinct name for all his sentinels. This name identifies a request uniquely. Further, it facilitates the user in specifying another sentinel in terms of his/her previously defined sentinels.

3.2 Sentinel Target

The syntax of the sentinel target is Using <sentinel-target>. The sentinel target can be either a URL or a previously defined sentinel. If the new sentinel specifies a previously defined sentinel as its target, it inherits all properties of that sentinel unless the user overrides those properties in the current specification. In Example 1, Alex is interested in monitoring the course web page for the keyword 'cse5331'. Alex should be able to specify this URL as the target on which the system monitors changes on the keyword cse5331. Later, Alex wants to get updates on the new classes being added to the page, as this may affect her decision about registering for the course cse5331. She should place another sentinel for the same URL but with different change criteria. As the second case is correlated with the first, Alex can specify s1 as the sentinel target with a different change type. Sentinels are correlated if they inherit run-time properties such as the start and end time of a sentinel. Otherwise, they merely inherit static properties (e.g., the URL, change type, etc. of the sentinel). The language allows the user to specify either the reference web page or a previously placed sentinel as the target.

3.3 Sentinel Type

WebVigiL allows the detection of customized changes in the form of a sentinel type and provides explicit semantics for the user to specify his/her desired type of change. The syntax of the sentinel type is given as Monitor <sentinel-type>, where the sentinel type is a combination of change types and the operators defined below (the full grammar is given in Fig. 1). In Example 1, Alex is interested in 'cse5331'. Detecting changes to the entire page leads to wasteful computations and, further, sends unwanted information to Alex. In Example 2, Alex is interested in any change to the class web page but is not interested in changes pertaining to images. WebVigiL handles such requests by introducing change types and operators in its change specification language. The contents of a web page can be any combination of objects such as sets of words, links and images. Users can specify such objects using change types and use operators over these objects. The change specification language defines primitive changes and composite changes for a sentinel type.

Primitive change: The detection of a single type of change between two versions of the same page. For a keyword change, the user must specify a set of words. An exception list can also be given for any change. For a phrase change, a set of phrases is specified. For regular expressions, a valid regular expression is given.

Composite change: A combination of distinct primitive changes specified on the same page, using one of the binary operators AND and OR.


The semantics of a composite change formed by the use of an operator can be defined as follows (note that ∧, ∨ and ~ denote the Boolean AND, OR and NOT operators, respectively).

3.4 Change Type

If Vi and Vj are two different versions of the same page, then the change of type t on Vj with reference to Vi is True if the change of type t is detected as an insert in Vj or a delete in Vi, and False otherwise. The sentinel type is the change type t selected from the set T, where

T = {any change, links, images, all words except <set of words>, phrase: <set of phrases>, keywords: <set of words>, table: <table id>, list: <list id>, regular expression: <exp>}. Based on the form of information that is usually available on web pages, change types are classified as links, images, keywords, phrases, all words, table, list, regular expression and any change.

Links: Corresponds to a set of hypertext references. In HTML, links are presentation-based objects represented between the hypertext tags (<A href="...">...</A>). Given two versions of a page, if any of the old links are deleted in the new version or new links are inserted, a change is flagged.

Images: Corresponds to a set of image references extracted from the image source. In HTML, images are represented by the image source tag (<IMG src="...">). The changes detected are similar to those for links, except that images are monitored.

Keywords <set of words>: Corresponds to a set of unique words from the page. A change is flagged when any of the keywords (mentioned in the set of words) appears or disappears in a page with respect to the previous version of the same page.

Phrase <set of phrases>: Corresponds to a set of contiguous words from the page. A change is flagged on the appearance or disappearance of a given phrase in a page with respect to the previous version of the same page. An update to a phrase is also flagged, depending on the percentage of words that have been modified in the phrase. If the number of changed words exceeds a threshold, it is deemed a delete (or disappearance).

Table: Corresponds to the content of the page represented in a tabular format. Though the table is a presentation object, the changes are tracked on the contents of the table. Hence, whenever the table contents are changed, a table change is flagged.

List: Corresponds to the contents of a page represented in a list format. The list format can be bullets or numbers. Any change detected on the set of words represented in a list format is flagged as a change.

Regular expression <exp>: Expressed in valid regular expression syntax for querying and extracting specific information from the document data.


All words: A page can be divided into a set of words, links and images. Any change to the set of words between two versions of the same page is detected as an all words change. All words encompasses phrases, keywords and the words in tables and lists. When considering changes to all words, presentation objects such as tables and lists are not considered as such; only the content of these presentation objects is taken into consideration.

Anychange: Anychange encompasses all the above types of changes. A change to any of the defined sets (i.e., all words, all links and all images) is flagged as anychange. Hence, the granularity of anychange is limited to a page. Anychange is the superset of all changes.
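A few of these primitive change types can be sketched as set comparisons between two versions of a page. The snippet below is only an illustration; the real system works on parsed HTML/XML documents, and the extraction of words and links from a version is assumed to be done elsewhere.

```python
# Sketch of primitive change detection between an old and a new version,
# each reduced beforehand to a set of words and a set of link targets.
def keyword_change(old_words, new_words, keywords):
    """Keywords that appeared in or disappeared from the page."""
    appeared = {k for k in keywords if k in new_words and k not in old_words}
    disappeared = {k for k in keywords if k in old_words and k not in new_words}
    return appeared, disappeared

def link_change(old_links, new_links):
    """Inserted and deleted hypertext references."""
    return new_links - old_links, old_links - new_links

def all_words_change(old_words, new_words):
    """True if the set of words differs between the two versions."""
    return old_words != new_words
```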

3.5 Operators

Users may want to detect more than one type of change on a given page, or the non-occurrence of a type of change. To facilitate such detections, the change specification language includes unary and binary operators.

NOT: A unary operator which detects the non-occurrence of a change type t on a version of a page with reference to another version of the same page.

OR: A binary operator representing the disjunction of change types. For two primitive changes of types t1 and t2 (t1 ≠ t2) specified on a version with reference to another version of the same page, it is denoted by t1 OR t2; a change is detected if either of the two is detected.

AND: A binary operator representing the conjunction of change types. For two primitive changes of types t1 and t2 (t1 ≠ t2) specified on a version with reference to another version of the same page, it is denoted by t1 AND t2; a change is detected when both are detected.

The unary operator NOT can be used to specify a constituent primitive change in a composite change. For example, for a page containing a list of fiction books, a user can specify a change type as: All words AND NOT phrase {"Lord of the Rings"}. Given two versions of the page, a change will be flagged only if some words change (such as the insertion of a new book and author) but the phrase "Lord of the Rings" has not changed. Hence, the user is interested in monitoring the arrival of new books or the removal of old books only as long as the book "Lord of the Rings" is available.
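The composite operators can be sketched as Boolean combinators over primitive change detectors. This is only an illustration of the semantics above; in the sketch a detector is any function taking the old and new version and returning True when its change type is detected, and the detector names in the comment are assumed.

```python
# Composite change operators as combinators over primitive detectors.
def AND(d1, d2):
    return lambda old, new: d1(old, new) and d2(old, new)

def OR(d1, d2):
    return lambda old, new: d1(old, new) or d2(old, new)

def NOT(d):
    return lambda old, new: not d(old, new)

# The book example above would then read (detector names assumed):
# composite = AND(all_words, NOT(phrase({"Lord of the Rings"})))
```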


3.6 Fetch

Changes can be detected for a web page only when a new version of the page is fetched. New versions can be fetched based on the freshness of the page. The page properties (or meta-data) of a web page, such as the last modified date for static pages or a checksum for dynamic pages, define whether a page has been modified or not. The syntax of fetch is Fetch <time interval> | on change. The user can specify a time interval indicating how often a new page should be fetched, or can specify 'on change' to indicate that he/she is unaware of the change frequency of the page.

On change: This option relieves the user of having to know when the page changes. WebVigiL's fetch module uses a heuristic-based fetch algorithm called the Best Effort Algorithm [13] to determine the interval with which a page should be fetched. This algorithm uses the change history and meta-data of the page.

Fixed interval: The user can specify a fixed, user-defined interval at which a page is fetched by the system; it can be given in terms of minutes, hours, days or weeks (a non-negative integer).
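For intuition only, the fragment below shows a very simple adaptive polling heuristic. It is not the Best Effort Algorithm of [13], which uses the page's change history and meta-data; this stand-in merely shrinks the fetch interval when a change was observed and grows it otherwise, within fixed bounds.

```python
# Illustrative adaptive fetch interval (in seconds); not WebVigiL's algorithm.
def next_fetch_interval(current, page_changed,
                        min_interval=60, max_interval=7 * 24 * 3600):
    candidate = current / 2 if page_changed else current * 2
    return max(min_interval, min(max_interval, candidate))

# Example: a page that keeps changing is polled more and more frequently.
interval = 3600.0
for changed in (True, True, False):
    interval = next_fetch_interval(interval, changed)
print(interval)   # 3600 -> 1800 -> 900 -> 1800
```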

3.7 Sentinel Duration

WebVigiL monitors a web page for changes during the lifespan of the sentinel. The lifespan of a sentinel is the closed interval formed by the start time and end time of the sentinel, specified as From <start> To <end>. Let the timeline be an equidistant discrete time domain having "0" as the origin and each time point as a positive integer, as defined in [14]. In terms of this timeline, occurrences of the created sentinel S are specific points on the time line, and the duration (lifespan) defines the closed interval within which S occurs. The 'From' modifier denotes the start of a sentinel S and the 'To' modifier denotes the end of S. The start and end times of a sentinel can be specific times or can depend upon the attributes of other correlated sentinels. The user has the flexibility to specify the duration as one of the following: (a) Now, (b) absolute time, (c) relative time, (d) event-based time.

Now: A system-defined variable that keeps track of the current time.

Absolute time: Denoted as a time point T, it can be specified as a definite point on the time line. The format for specifying the time point is MM/DD/YYYY.

Relative time: Defined as an offset from a time point (either absolute or event-based). The offset can be specified by the time interval defined in Section 3.6.

Event-based time: Events, such as the start and end of a sentinel, can be mapped to specific time points and can be used to trigger the start or end of a new sentinel. The start of a sentinel can also depend on the active state of another sentinel and is specified by the event 'during'.


During defines that the sentinel should be started within the closed interval formed by the start and end of the referenced sentinel, and its start should be mapped to Now. When a sentinel inherits from another sentinel having a start time of Now, as the properties are inherited, the start time of the current sentinel will be mapped to the current time.

3.8 Notification

Users need to be notified of detected changes. How, when and where to notify are important criteria for notification and should be resolved by the change specification semantics.

Notification mechanism: The mechanism selected for notification is important, especially when multiple types of devices with varying capabilities are involved. The syntax for specifying the notification mechanism is Notify By <mechanism>. This allows the user to select the appropriate mechanism for notification from the set of options O = {email, fax, PDA}. The default is email.

Notification frequency: The notification module has to ensure that the detected changes are presented to the user at the specified frequency. The system should incorporate the flexibility to allow users to specify the desired frequency of notification. The syntax of the notification frequency is Every <time interval> | best effort | immediate | interactive, where <time interval> is as defined in Section 3.6. Immediate denotes immediate (without delay) notification on change detection. Best effort is defined as notify as soon as possible after change detection; hence, best effort is equivalent to immediate but has a lower priority than immediate for notification. Interactive is a navigational-style notification approach where the user visits the WebVigiL dashboard to retrieve the detected changes at his/her convenience.

3.9 Compare Options

One of the unique aspects of WebVigiL is its compare options and their efficient implementation. Changes are detected between two versions of the same page. Each fetch of the same page is given a version number; the first version of the page is the first page fetched after a sentinel starts. Given a sequence of versions of the same page, the user may be interested in knowing changes with respect to different references. To facilitate this, the change specification language allows users to specify three types of compare options. The syntax is Compare <compare option>, where the compare option is selected from the set P = {pairwise, moving n, every n}.

Pairwise: The default is pairwise, which allows change comparison between two chronologically adjacent versions, as shown in Fig. 2.

Every n: Consider a user who is aware of the changes occurring on a page, such as a web developer or administrator, and is interested only in the cumulative changes between every n versions.


This compare option detects changes between a reference version and the nth version that follows it; for the next comparison, that nth page becomes the new reference page. For example, if a user wants to detect changes between every 4 versions of the page, the versions for comparison are selected as shown in Fig. 2.

Fig. 2. Compare Options

Moving n: This is a moving window concept for tracking changes. Consider a user who wants to monitor the trend of a particular stock, where meaningful change detection is only possible between pages occurring in a moving window. If a user specifies the compare option moving n with n = 4, as shown in Fig. 2, the first version of the current window serves as the reference page for the fourth; after each comparison the window slides forward by one version, so the next comparison involves the next pair of versions. WebVigiL believes in giving users more flexibility and options for change detection and hence has incorporated several compare options in the change specification, along with efficient change detection algorithms. By default, the previous page (based on the user-defined fetch interval where appropriate) and the current page are used for change detection.
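The three compare options can be summarised by which (reference, current) version pairs get compared. The sketch below is our reading of Fig. 2 rather than the authors' code, and the exact indices chosen for every n and moving n are assumptions.

```python
# Sketch: version pairs compared under each compare option (versions 1..m).
def compare_pairs(option, m, n=4):
    if option == "pairwise":                     # adjacent versions
        return [(v - 1, v) for v in range(2, m + 1)]
    if option == "every n":                      # v1 vs v4, v4 vs v7, ... for n = 4
        refs = range(1, m, n - 1)
        return [(r, r + n - 1) for r in refs if r + n - 1 <= m]
    if option == "moving n":                     # sliding window of size n
        return [(v - n + 1, v) for v in range(n, m + 1)]
    raise ValueError("unknown compare option")

print(compare_pairs("every n", 10))    # [(1, 4), (4, 7), (7, 10)]
print(compare_pairs("moving n", 6))    # [(1, 4), (2, 5), (3, 6)]
```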

3.10 Change History

The syntax of change history is ChangeHistory <n>. The change specification language allows the user to specify the number of previous changes to be maintained by the system; the user can then view the last n changes detected for a particular request (sentinel). WebVigiL provides an interface for users to view and manage the sentinels they have placed; a user dashboard is provided for this purpose. The interactive option is a navigational-style notification approach where users visit the WebVigiL dashboard to retrieve the detected changes at their convenience. Through the WebVigiL dashboard, users can view and query the changes generated by their sentinels. The change history specified by the user is used by the system to decide how many detected changes to maintain.

3.11 Presentation

Presentation semantics are included in the language to present the detected changes to users in a meaningful manner. In Example 1, Alex is interested in viewing the content cse5331 along with its context.


In Example 2, however, she is interested in getting a brief overview of the changes occurring to the page. To support both, the change specification language provides two types of presentation. In the change-only approach, the changes to the page, along with the type of change (insert/delete/update), are displayed in an HTML file using a tabular representation. The dual-frame approach shows both documents involved in the change side by side in different frames on the same page, highlighting the changes between them. The syntax is Presentation <presentation option>, where the presentation option is either change-only or dual-frame.

3.12 Desiderata

All of the above expressiveness is of little use if it is not implemented efficiently. One of the focuses of WebVigiL was to design efficient ways of supporting the sentinel specification, to provide a truly asynchronous way of notification, and to manage the sentinels using the active capability developed by the team earlier. In the following sections, we describe the overall WebVigiL architecture and the current status of the working system. The reader is welcome to access the system at http://berlin.uta.edu:8081/webvigil/ and test it.

4 WebVigiL Architecture and Current Status

WebVigiL is a profile-based change detection and notification system. The high-level block diagram shown in Fig. 3 details the architecture of WebVigiL. WebVigiL aims at investigating the specification, management and propagation of changes as requested by the user in a timely manner, while meeting the quality of service requirements [15]. All the modules shown in the architecture (Fig. 3) have been implemented.

Fig. 3. WebVigiL Architecture


The user specification module provides an interface for first-time users to register with the system and a dashboard for registered users to place, view, and manage their sentinels. A sentinel captures the user's specification for monitoring a web page. The verification module is used to validate user-defined sentinels before sending the information to the knowledge base. The knowledge base is used to persist meta-data about each user and his/her sentinels. The change detection module is responsible for generating ECA rules [16] for the run-time management of a validated sentinel. The fetch module is used to fetch pages for all active or enabled sentinels. Currently, the fetch module supports the fixed-interval and best-effort approaches for fetching web pages. The version management module deals with a centralized server-based repository service that retrieves, archives, and manages versions of pages. A page is saved in the repository only if the latest copy in the repository is older than the fetched page. Subsequent requests for the web page can access the page from the cache instead of repeatedly invoking the fetch procedure. [3] discusses how each URL is mapped to a unique directory and how all the versions of this URL are stored in this directory. Versions are checked for deletion periodically, and versions no longer needed are deleted.
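The version-management policy just described (one directory per URL, save a fetched page only when it is newer than the latest stored copy) can be sketched as follows. The file layout and function names are assumptions of this sketch, not WebVigiL's actual repository code.

```python
# Sketch of the save-if-newer policy of the version repository.
import os

def archive_version(url_dir, content: bytes, fetched_mtime: float):
    """Store the fetched page in the per-URL directory unless the latest
    stored version is already at least as recent."""
    os.makedirs(url_dir, exist_ok=True)
    versions = sorted(f for f in os.listdir(url_dir) if f.startswith("v"))
    if versions:
        latest = os.path.join(url_dir, versions[-1])
        if os.path.getmtime(latest) >= fetched_mtime:
            return latest                      # repository copy is already fresh
    path = os.path.join(url_dir, f"v{len(versions) + 1:06d}.html")
    with open(path, "wb") as f:
        f.write(content)
    os.utime(path, (fetched_mtime, fetched_mtime))
    return path
```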

Fig. 4. Presentation using the dual-frame approach for an HTML page

The change detection module [17] builds a change detection graph to efficiently detect and propagate the changes.


The graph captures the relationship between pages and sentinels, and groups the sentinels based on the change type and target web page. Change detection is performed over the versions of the web page, and the sentinels associated with the groups are informed about the detected changes. Currently, grouping is performed only for sentinels that follow the best-effort approach for fetching pages. The WebVigiL architecture can support various page types, such as XML, HTML, and TEXT, in a uniform way. Currently, changes are detected between HTML pages using the specifically developed CH-Diff [2] module and between XML pages using the CX-Diff [18] module. Change detection modules for new page types can be added, or the current modules for the HTML and XML page types can be replaced by more efficient modules, without disturbing the rest of the architecture. Among the change types discussed in Section 3.4, all change types except table, list and regular expression are currently supported by WebVigiL. Currently, the notification module propagates the detected changes to users via email. The presentation module supports both the change-only and dual-frame approaches for presenting the detected changes. A screenshot of a notification using the dual-frame approach for HTML pages is shown in Fig. 4. This approach is visually intuitive and enhances user interpretation, since changes are presented along with their context.

5 Conclusion and Future Work

In this paper we have discussed the rationale for an expressive change specification language, as well as its syntax and semantics. We have given a brief overview of the WebVigiL architecture and have discussed the current status of the system, which includes a complete implementation of the language presented. We are currently working on several extensions. The change specification language can be extended to provide the capability of supporting sentinels on multiple URLs. The current fetch module is being extended to a distributed fetch module to reduce network traffic. The deletion algorithm for the cached versions discussed in Section 4 is being improved to delete no-longer-needed pages as soon as possible, instead of the slightly conservative approach used currently.

References 1. Deolasee, P., et al., Adaptive Push-Pull: Disseminating Dynamic Web Data, in Proceeding of the 10th International WWW Conference. Hong Kong: p. 265-274, 2001. 2. Pandrangi, N., et al., WebVigiL: User Profile-Based Change Detection for HTML/XML Documents, in Twentieth British National Conference on Databases. Coventry, UK. pages 38 - 55, 2003, 3. Chakravarthy, S., et al., WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments. in Second International Workshop on Web Dynamics. Hawaii, 2002. 4. GNUDiff. http://www.gnu.org/software/diffutils/diffutils.html, 5. Hunt, J.W. and Mcllroy, M.D. An algorithm for efficient file comparison, in Technical Report. Bell Laboratories. 1975. 6. Zhang, K., A New Editing based Distance between Unordered Labeled Trees. Combinatorial Pattern Matching, vol. 1 p. 254-265, 1993.


7. Changedetection. http://www.changedetection.com, 8. Liu, L., et al., Information Monitoring on the Web: A Scalable Solution, in World Wide Web: p. 263-304, 2002. 9. Douglis, F., et al., The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web, in World Wide Web. Baltzer Science Publishers, p. 27-44. 1998. 10. WYSIGOT. http://www.wysigot.com/, 11. Nguyen, B., et al., Monitoring XML Data on the Web. in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001. 12. Jacob, J. WebVigiL: Sentinel specification and user-intent based change detection for Extensible Markup Language (XML), in MS Thesis. The University of Texas at Arlington. 2003. 13. Chakravarthy, S., et al., A Learning-Based Approach to fetching Pages in WebVigiL. in Proc of the 19th ACM Symposium on Applied Computing, March 2004. 14. Chakravarthy, S. and Mishra, D., Snoop: An Expressive Event Specification Language for Active Databases. Data and Knowledge Engineering, vol. 14(10): p. 1--26, 1994. 15. Jacob, J., et al., WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments(to be published), in Web Dynamics Book. SpringerVerlag. 2003. 16. Chakravarthy, S., et al., Composite Events for Active Databases: Semantics, Contexts and Detection, in Proc. Int’l. Conf. on Very Large Data Bases VLDB: Santiago, Chile. p. 606-617. 1994. 17. Sanka, A. A Dataflow Approach to Efficient Change Detection of HTML/XML Documents in webVigiL, in MS Thesis. The University of Texas at Arlington, August 2003. 18. Jacob, J., Sachde, A., and Chakravarthy, S., CX-DIFF: A Change Detection Algorithm for XML Content and change Presentation Issues for WebVigiL, ER Workshops October 2003: 273-284.

On Modelling Cooperative Retrieval Using an Ontology-Based Query Refinement Process
Nenad Stojanovic1 and Ljiljana Stojanovic2
1 Institute AIFB, Research Group Knowledge Management, University of Karlsruhe, 76128 Karlsruhe, Germany [emailprotected]
2 FZI - Research Center for Information Technologies at the University of Karlsruhe, 76128 Karlsruhe, Germany [emailprotected]

Abstract. In this paper we present an approach for the interactive refinement of ontology-based queries. The approach is based on generating a lattice of refinements, which enables a step-by-step tailoring of a query to the current information needs of a user. These needs are implicitly elicited by analysing the user's behaviour during the searching process. The gap between a user's need and his query is quantified by measuring several types of query ambiguities, which are used for ranking the refinements. The main advantage of the approach is its more cooperative support in the refinement process: by exploiting the ontology background, the approach supports finding "similar" results and enables efficient relaxation of failing queries.

1 Introduction

Although a lot of research has been dedicated to improving the cooperativeness of the information access process [1], almost all of it has focused on resolving the problem of an empty answer set. Indeed, whether due to false presuppositions concerning the content of the knowledge base, which lead to the stonewalling behaviour of the retrieval system, or due to misconceptions (concerning the schema of the domain), which cause mismatches between a user's view of the world and the concrete conceptualisation of the domain, when a query fails it is more cooperative to identify the cause of the failure than just to report the empty answer set. If there is no cause per se for the query's failure, it is then worthwhile to report the part of the query that failed. Further, some types of query generalization [2] or relaxation [3], [4] have been proposed for weakening a user's query in order to allow him to find some relevant results. The growing nature of the web information content implies a user behaviour pattern that should be treated in a more collaborative way by modern retrieval systems: users tend to make short queries which they subsequently refine (expand). Indeed, in order to be sure to get any answer to a query, a user forms as short a query as possible and, depending on the list of answers, tries to narrow his query in several refinement steps. Probably the most expressive examples are product catalogue applications that serve as web interfaces to large product databases. The main problem here is that a user cannot clearly express his need for a product by using only 2-3 terms, i.e. a user's query represents just an approximation of his information need [5]. Therefore, a user tries in several refinement steps to filter the list of retrieved products, so that only the products which are most relevant for his information need remain. P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 434–449, 2004. © Springer-Verlag Berlin Heidelberg 2004


Unfortunately, most retrieval systems do not provide cooperative support in the query refinement process, so that a user is "forced" to change his query on his own in order to find the most suitable results. Indeed, although in an interactive query refinement process [6] a user is provided with a list of terms that appear frequently in the retrieved documents, a more semantic analysis of the relationships between these terms is missing. For example, if a user made a query for a metallic car, then refinements that include the value of the car's colour can be treated as more relevant than refinements regarding the speed of the car, since the feature metallic is strongly related to colour. At least, such reasoning can be expected from a human shop assistant. Obviously, if a retrieval system has more information about the model of the underlying product data, then a more cooperative (human-like) retrieval process can be created. In our previous work we developed a query refinement process, called the Librarian Agent Query Refinement process, that uses an ontology for modelling an information repository [7],[8]. That process is based on incrementally and interactively tailoring a query to the current information need of a user, whereas that need is discovered implicitly by analysing the user's behaviour during the search process. The gap between the user's query and his information need is defined as the query ambiguity, and it is measured by several ambiguity parameters that take into account the used ontology as well as the content of the underlying information repository. In order to provide a user with suitable candidates for the refinement of his query, we calculate the so-called Neighbourhood of that query. It contains the query's direct neighbours in the lattice of queries defined by considering the inclusion relation between query results. In this paper we extend this work by involving more user-related information in the query refinement phase of the process. In that way our approach ensures continual adaptation of the retrieval system to the changing preferences of users. Due to the reluctance of users to give explicit information about the quality of the retrieval process, we base our work on implicit user feedback, a very popular information retrieval technique for gathering user preferences [9]. From a user's point of view, our approach provides a more cooperative retrieval process. In each refinement step a user is provided with a complete but minimal set of refinements, which enables him to develop/express his information need in a step-by-step fashion. Secondly, although all users' interactions are anonymous, we personalize the searching process and achieve so-called ephemeral personalization by implicitly discovering a user's need. The next benefit is the possibility to anticipate which alternative resources can be interesting for the user. Finally, this principle enables coping with user requests that cannot be fulfilled in the given repository (i.e. requests that return zero results), a hard problem for existing information retrieval approaches. The paper is organised as follows: in Section 2 we present the extended Librarian Agent Query Refinement process and discuss its cooperative nature. Section 3 provides related work and Section 4 contains concluding remarks.

2 Librarian Agent Query Refinement Process

The goal of the Librarian Agent Query Refinement process [8] is to enable a user to efficiently find results relevant for his information need in an ontology-based information repository, even if some of the problems sketched in the previous section appear in the searching process.


These problems lead to misinterpretations of a user's need in his query, so that either many irrelevant results and/or only a few relevant results are retrieved. In the Librarian Agent Query Refinement process, potential ambiguities (i.e. misinterpretations) of the initial query are first discovered and assessed (cf. the so-called Ambiguity-Discovery phase). Next, these ambiguities are interpreted with regard to the user's information need, in order to estimate the effects of an ambiguity on the fulfilment of the user's goals (cf. the so-called Ambiguity-Interpretation phase). Finally, the recommendations for refining the given query are ranked according to their relevance for fulfilling the user's information need and according to their potential to disambiguate the meaning of the query (cf. the so-called Query-Refinement phase). In that way, the user is provided with a list of relevant query refinements, ordered according to their capability to decrease the number of irrelevant results and/or to increase the number of relevant results. In the next three subsections we explain these three phases further, whereas the first phase is just sketched here, since its complete description is given in [8]. In order to present the approach in a more illustrative way, we refer to examples based on the ontology presented in Fig. 1. Table 1 represents a small product catalog, indexed/annotated with this ontology. Each row represents the features assigned to a product (a car); e.g. product P8 is a cabriolet, its colour is green metallic and it has an automatic gear-changing system. The features are organised in an isA hierarchy; for example, the feature (concept) "BlueColor" has two specializations, "DarkBlue" and "WhiteBlue", which means that a dark or white blue car is also a blue colour car.

Fig. 1. The car-feature ontology used throughout the paper

2.1 Phase 1: Ambiguity Discovery

We define query ambiguity as an indicator of the gap between the user's information need and the query that results from that need. Since we have found two main factors that cause the ambiguity of a query, the vocabulary (ontology) and the information repository, we define two types of ambiguity that can arise in interpreting a query: (i) the semantic ambiguity, as a characteristic of the used ontology, and (ii) the content-related ambiguity, as a characteristic of the repository. In the next two subsections we give more details on them.


2.1.1 Semantic Ambiguity

The goal of an ontology-based query is to retrieve the set of all instances which fulfil all constraints given in that query. In such a logic query the constraints are applied to the query variables. For example, in a query containing the constraint hascolorvalue(x, BlueColour), x is a query variable and hascolorvalue(x, BlueColour) is a query constraint. The stronger these constraints are (assuming that all of them correspond to the user's need), the more relevant the retrieved instances are for the user's information need. Since an instance in an ontology is described through (i) the concept it belongs to and (ii) its relations to other instances, we see two factors which determine the semantic ambiguity of a query variable:
- the concept hierarchy: how general is the concept the variable belongs to;
- the relation instantiation: how descriptive/strong are the constraints applied to that variable.
Consequently, we define the following two parameters in order to estimate these values.

Definition 1: Variable Generality. VariableGenerality(X) = Subconcepts(Type(X)) + 1, where Type(X) is the concept the variable X belongs to and Subconcepts(C) is the number of subconcepts of the concept C.

Definition 2: Variable Ambiguity. VariableAmbiguity(X, Q) is defined in terms of the following sets: Relation(C), the set of all relations defined for the concept C in the ontology; AssignedRelations(C, Q), the set of all relations from Relation(C) which appear in the query Q; and AssignedConstraints(X, Q), the set of all constraints related to the variable X that appear in the query Q.

The total ambiguity of a variable is calculated as the product of these two parameters, in order to model uniformly the directly proportional effect of both parameters on the ambiguity. Note that the second parameter is typically less than 1. We thus define the ambiguity of a variable as Ambiguity(X, Q) = VariableGenerality(X) · VariableAmbiguity(X, Q).
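The sketch below illustrates these parameters in Python. VariableGenerality follows Definition 1 directly; because the exact formula of Definition 2 is not legible in this excerpt, the variable_ambiguity shown here (the share of the concept's relations left unconstrained by the query) is only an illustrative stand-in, and the small isA fragment beyond BlueColor's two specializations is an assumption about Fig. 1.

```python
# Sketch of the ambiguity parameters (Definition 1 plus an assumed stand-in
# for Definition 2); not the authors' exact formulas.
def subconcept_count(isa, concept):
    """Number of (transitive) subconcepts of `concept` in the isA hierarchy."""
    direct = isa.get(concept, [])
    return len(direct) + sum(subconcept_count(isa, c) for c in direct)

def variable_generality(isa, concept):
    return subconcept_count(isa, concept) + 1          # Definition 1

def variable_ambiguity(relations_of_concept, assigned_relations):
    """Assumed: fraction of the concept's relations not used in the query."""
    unused = set(relations_of_concept) - set(assigned_relations)
    return len(unused) / max(len(relations_of_concept), 1)

def ambiguity(isa, concept, relations_of_concept, assigned_relations):
    return (variable_generality(isa, concept) *
            variable_ambiguity(relations_of_concept, assigned_relations))

# Assumed fragment of the car-feature hierarchy of Fig. 1:
isa = {"Colour": ["BlueColor"], "BlueColor": ["DarkBlue", "WhiteBlue"]}
print(variable_generality(isa, "BlueColor"))   # 2 subconcepts + 1 = 3
```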


Finally, the Semantic Ambiguity for the query Q is calculated as follows:

where Var(Q) represents the set of variables that appear in the query Q. By analysing these ambiguity parameters it is possible to discover which of the query variables introduces the highest ambiguity into a query; consequently, this variable should be refined in the query refinement phase.
2.1.2 Content-Related Ambiguity
An ontology defines just a model of how the entities of a real domain should be structured. If a part of that model is not instantiated in the given domain, then that part of the model cannot be used for calculating ambiguity. Therefore, we should use the content of the information repository to prune the results of the ontology-related analyses of a user's query.
2.1.2.1 Query Neighbourhood
We introduce here the notation for the ontology-related entities that are used in the rest of this subsection. Q(O) is a query defined against the ontology O; the setting in this paper encompasses positive conjunctive queries, but the approach can easily be extended to queries that include negation and disjunction. We also use the set of all possible elementary queries (queries that contain only one constraint) for an ontology O. KB(O) is the set of all relation instances (facts) which can be proven in the given ontology O; it is called the knowledge base. A(Q(O)) is the set of answers (in the logical sense) for the query Q regarding the ontology O.
Definition 3: Ontology-based information repository. An ontology-based information repository IR is the structure (R, O, ann), where R is a set of elements that are called resources; O is an ontology, which defines the vocabulary used for annotating these resources (we say that the repository is annotated with the ontology O and a knowledge base KB(O)); and ann is a binary relation between the set of resources and the set of facts from the knowledge base KB(O). We write this to mean that a fact is assigned to the resource r (i.e. a resource r is annotated with that fact).
Definition 4: Resources-Attributes group (user's request). A Resources-Attributes group in an IR=(R, O, ann) is a tuple

where

is called a set of resources and is called a set of attributes. It follows that {r : ann(r,i)}, i.e. this is the set of resources which are annotated with all attributes of the query Q'.


Definition 5: Structural equivalence (=) between two users' requests is defined by: It means that two users' requests are structurally equivalent if their sets of result resources are equivalent.
Definition 6: Cluster of users' queries (in the rest of the text: Query cluster) is a set of all structurally equivalent users' requests, where is called the set of (attribute set) and contains the union of attributes of all requests that are equivalent. For a user's request it is calculated in the following manner. It holds: is called the set of the query. Formally:

(resource set) and is equal to the

The Query cluster which contains all existing resources in IR (i.e. a cluster for which is called the root cluster. The set of all Query clusters is denoted by
Definition 7: Structural subsumption (parent-child relation)

(...,group="A", Matches in Group A); (null,Brazil,); and (null,Brazil,Brazil vs Scotland).

Structure Matching. This subsystem discovers all structure mappings between elements in the XML and HTML DOMs. We adopt two constraints used in the GLUE system [4] as a guide to determine whether two nodes are structurally matched: Neighbourhood Constraint: "two nodes match if nodes in their neighbourhood also match", where the neighbourhood is defined to be the children. Union Constraint: "if every child of node A matches node B, then node A also matches node B". Note that there could be a range of possible matching cases, depending on the completeness and precision of the match. In the ideal case, all components of the structures in the two nodes fully match. Alternatively, only some of the components are matched (a partial structural match). In the case of partial structure matching between two nodes, there are some extra nodes, i.e. children of the first node that do not match any children of the second node, and/or vice versa. Since extra nodes do not have any match in the other document, they are ignored in the structure matching process. Therefore, the above constraints need to be modified to construct a definition of structure matching which accommodates both partial and full structure matching:


Neighbourhood Constraint: "XML node X structurally matches HTML node H if H is not an extra HTML node and every non-extra child of H either text matches or structurally matches a non-extra descendant of X". Union Constraint: "X structurally matches H if every non-extra child of H either text matches or structurally matches X". As stated in the above constraints, we need to examine the children of the two nodes being compared in order to determine if a structure matching exists. Therefore, structure matching is implemented using a bottom-up approach that visits each node in the HTML DOM in post-order and searches for a matching node in the entire XML DOM. If the list of substring mappings is still empty after the structure matching process finishes, we add a mapping from the XML root element to the HTML body element, if it exists, or to the HTML root element otherwise. Revisiting the Soccer example (Fig. 2), some of the discovered structure mappings are (null,match,tr) (neighbourhood constraint), (null,match,table) (union constraint) and (null,soccer,body).

Sequence Checking. Up to this point, the mappings generated by the text matching and structure matching subsystems are limited to 1-1 mappings. In cases where the XML and HTML documents have a more complex structure, these mappings may not be accurate, and this can affect the quality of the XSLT rules generated from them. Consider the following example: in Fig. 2, the sequence of the children of the XML node soccer is made up of nodes with the same name, match, whereas the sequence of the children of the matching HTML node body follows a specific pattern: it starts with h1 and is followed repetitively by h2 and table. Using only the discovered 1-1 mappings, it is not possible to create an XSLT rule for soccer that resembles this pattern, since match maps only to table according to structure matching. In other words, there will be no template that will generate the HTML node h2. Focusing on the structure mapping (match,table) and the substring mappings {(team[1],h2),(team[2],h2)}, we can see in the DOM trees that the children of match, i.e. team[1] and team[2], are not mapped to descendants of table. Instead, they map to the sibling of table, i.e. h2. Normally, we expect that the descendants of match map only to descendants of table, so that the notion of 1-1 mapping is kept. In this case, there is an intuition that match should not only map to table, but also to h2. In fact, match should map to the concatenation of nodes h2 and table, so that the sequence of the children of body is preserved when generating the XSLT rule. This is called a 1-m mapping, where an XML node maps to the concatenation of several HTML nodes. The 1-m mapping (match,h2 ++ table) can be found by examining the subelement sequence of soccer and the subelement sequence of body described above. Note that the subelement sequence of a node can be represented using a regular expression, which is a combination of symbols representing each subelement and metacharacters. To obtain this regular expression, XTRACT [5], a system for inferring the DTD of an element, is employed. In our example, the regular expression representing the subelement sequence of soccer is match*, whereas


the one representing the subelement sequence of body is h1 (h2 table)*. We then check whether the elements in the first sequence conform to the elements in the second sequence, as follows. According to the substring mapping (soccer,group,h1), the element h1 conforms to an attribute of soccer; we therefore ignore it and remove it from the second sequence. Comparing the two sequences, we can see that the element match should conform to the elements (h2, table), since the sequence of match elements corresponds directly to the sequence (h2 table), i.e. they are both repetitive patterns (denoted by *). However, the element match conforms only to the element table, as indicated by the structure mapping (match,table). The verification therefore fails, which indicates that the structure mapping (match,table) is not accurate. Consequently, based on the two sequences, we deduce the accurate 1-m mapping (null,match,h2 ++ table). The main objective of the sequence checking subsystem is to discover 1-m mappings using the technique of comparing two sequences described above.

XSLT Stylesheet Generation. This subsystem constructs a template rule for each mapping discovered in the lists of exact, structure and 1-m mappings, and puts them together to compose an XSLT stylesheet. We do not consider the substring mappings here because, in substring mappings, it is possible to have a situation where the text value of the HTML node is a concatenation of text values from two or more XML nodes; hence, it is impossible to create template rules for those XML nodes. Moreover, the HTML text value may contain substrings that do not have a matching XML text value (termed extra strings). Considering these situations, we implement a procedure that generates a template for each distinct HTML node in the substring mapping list. The XSLT stylesheet generation process begins by generating the list of substring rules. We then construct a stylesheet by creating the <xsl:stylesheet> root element and subsequently filling it with template rules for the 1-m mappings, the structure mappings and the exact mappings. The template rules for the 1-m mappings have to be constructed first, since within that process they may invalidate several mappings in the other lists, and thus the template rules for those omitted mappings do not get used. In each mapping list, a template rule is constructed for each distinct mapping, to avoid having conflicting template rules. In the next three subsections, we give more detail on the XSLT generation process. Discussion of 1-m mappings is left out due to space constraints.

Substring Rule Generation. The substring rule generator creates a template from an XML node or a set of XML nodes to each distinct HTML node present in the substring mapping list. The result of this subsystem is a list of substring rules SUB_RULES, where each element of the list is a tuple (html_node, rule). Due to space constraints, we omit the detailed description of our algorithm for generating the substring rule itself. The following example illustrates how substring rule generation works. Consider the following substring mappings discovered in the Soccer example (Fig. 2): (null,Scotland,Brazil vs Scotland) and (null,


Brazil,Brazil vs Scotland). The HTML string is "Brazil vs Scotland", while the set of XML strings is {"Brazil","Scotland"}. By replacing the parts of the HTML string that appear in the set of XML strings with the corresponding XSLT instruction, the substring rule is: <xsl:value-of select="team[1]"/> vs <xsl:value-of select="team[2]"/>, where "vs" is an extra string.

Constructing a Template Rule for an Exact Mapping. Each template rule begins with an XSLT <xsl:template> element and ends by closing that element. For a mapping, the pattern of the corresponding template rule is given by the XML component of the mapping, and the only XSLT instruction used in the template is <xsl:value-of>. In this procedure, we only construct a template rule when the mapping is between an XML ELEMENT_NODE and an HTML node or a concatenation of HTML nodes. The reason that we ignore mappings involving XML ATTRIBUTE_NODEs is that the templates for these mappings are generated directly within the construction of the template rules for structure mappings and 1-m mappings. In text matching, there could be mappings from an XML node to a concatenation of HTML nodes; hence, we need to create a template for each HTML node in the concatenation. For example, the template rule for the exact mapping (null,line,text()++br) is:
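The example rule referred to above appears as a figure in the original and is not reproduced here; the sketch below is only an illustration of the kind of rule such an exact mapping could produce, not XSLTGen's exact output.

```python
# Hypothetical illustration: building an <xsl:template> for an exact mapping
# from an XML element to a concatenation of HTML nodes (e.g. text() ++ br).
def exact_mapping_rule(xml_element, html_nodes):
    """Return the template rule text for xml_element -> html_nodes."""
    body = []
    for h in html_nodes:
        if h == "text()":
            body.append('  <xsl:value-of select="."/>')  # copy the element's text
        else:
            body.append('  <' + h + '/>')                # emit the HTML node literally
    return ('<xsl:template match="' + xml_element + '">\n'
            + "\n".join(body)
            + '\n</xsl:template>')

print(exact_mapping_rule("line", ["text()", "br"]))
# <xsl:template match="line">
#   <xsl:value-of select="."/>
#   <br/>
# </xsl:template>
```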

Constructing a Template Rule for a Structure Mapping. Recall that in structure matching, one of the mappings must be the mapping whose XML component is the root of the XML document; we call this the special mapping. The template for the special mapping begins with copying the root of the HTML document and its subtree, excluding the HTML component of the mapping and its subtree. The next step in constructing the template for this mapping follows the steps performed for the other mappings. For any structure mapping, the opening tag of the HTML component is created, then a template for each child of the XML component is created, and finally the tag is closed. For example, suppose there is a structure mapping (null,match,table) discovered in Soccer (Fig. 2), and suppose we have the exact mappings (null,date,td[1]), (null,team[1],td[2]) and (null,team[2],td[3]). The template rule representing this structure mapping is:
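The rule itself is shown as a figure in the original and is not reproduced here; the following hedged sketch illustrates the shape such a rule could take, where the inclusion of the intermediate tr element is our assumption.

```python
# Hypothetical reconstruction of the omitted figure: a template rule for the
# structure mapping (null,match,table) combined with the exact mappings
# (date,td[1]), (team[1],td[2]) and (team[2],td[3]). XSLTGen's actual output
# is not reproduced in the text; this only illustrates its likely shape.
structure_rule = """\
<xsl:template match="match">
  <table>
    <tr>
      <td><xsl:value-of select="date"/></td>
      <td><xsl:value-of select="team[1]"/></td>
      <td><xsl:value-of select="team[2]"/></td>
    </tr>
  </table>
</xsl:template>"""
print(structure_rule)
```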

Refining the XSLT Stylesheet. In some cases, the (new) HTML document obtained by applying the generated XSLT stylesheet to the XML document may not be accurate, i.e. there are differences between this (new) HTML document and the original (user-defined) HTML document. By examining those differences, we can improve the accuracy of the XSLT stylesheets generated. This


step is applicable when we have a set of complete and accurate mappings between the XML and HTML documents, but the generated XSLT stylesheet is erroneous. If the discovered mappings themselves are incorrect or incomplete, then this refinement step is not effective and it is better to address the problem by improving the matching techniques. An indicator that we have complete and accurate mappings is that each element in the new HTML document corresponds exactly to the element in the original HTML document at the same depth. One possible factor that can cause the generated XSLT stylesheet to be inaccurate is the wrong ordering of XSLT instructions within a template. This situation typically occurs when we have XML nodes with the same name but a different order or sequence of children. Therefore, the main objective of the refinement step is to fix the order of the XSLT instructions within the template matches of the generated XSLT stylesheet, so that the resulting HTML document is closer to or exactly the same as the original HTML document. A naive approach to the above problem is to use brute force and attempt all possible orderings of instructions within templates until the correct one is found (i.e. there exist no differences between the new HTML and the original HTML). However, this approach is prohibitively costly. Therefore, we adopt a heuristic approach, which begins by examining the differences between the original HTML document and the one produced by the generated XSLT stylesheet. We employ a change-detection algorithm [2], which produces a sequence of edit operations needed to transform the original HTML document into the new HTML document. The types of edit operations returned are insert, delete, change, and move. To carry out the refinement, the edit operations that we focus on are the move operations, since we want to swap around the XSLT instructions in a template match to get the correct order. In order for this to work, we require that there are no missing XSLT instructions for any template match in the XSLT stylesheet. After examining all move operations, this procedure is started over using the fixed XSLT stylesheet. This repetition stops when no move operations are found in an iteration, or when the number of move operations found in an iteration is greater than the number found in the previous iteration. The second condition is there to prevent the possibility of fixing the stylesheet incorrectly: we want the number of move operations to decrease in each iteration until it reaches zero.
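As a rough sketch of this iteration (ours, not the paper's implementation), where apply_stylesheet, count_moves and fix_order_using_moves stand in for the XSLT processor, the change-detection algorithm of [2] and the reordering of instructions within a template match:

```python
# Hedged sketch of the refinement loop described above. The three callables
# are assumptions standing in for components of XSLTGen, not its actual API.
def refine(stylesheet, xml_doc, original_html,
           apply_stylesheet, count_moves, fix_order_using_moves):
    prev_moves = None
    while True:
        new_html = apply_stylesheet(stylesheet, xml_doc)
        moves = count_moves(original_html, new_html)   # move edit operations only
        if moves == 0:
            break                                      # new HTML matches the original
        if prev_moves is not None and moves > prev_moves:
            break                                      # guard against fixing incorrectly
        stylesheet = fix_order_using_moves(stylesheet, original_html, new_html)
        prev_moves = moves
    return stylesheet
```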

3 Empirical Evaluation

We have conducted experiments to study and measure the performance of XSLTGen. To give the reader some idea of how our system performs, we evaluated XSLTGen on four examples taken from a popular XSLT book3 and a real-life dataset taken from MSN Messenger chat history. These datasets exhibit a wide variety of characteristics, ranging from 10 to 244 element nodes. Originally, they were pairs of (XML document, XSLT stylesheet). To get the HTML document associated with each dataset, we applied the original XSLT stylesheet to the XML

3 http://www.wrox.com/books/0764543814.shtml


document using the Xalan4 XSLT processor. We then manually determined the correct mappings between the XML and HTML DOMs in each dataset. For each dataset, we applied XSLTGen to find the mappings between the elements in the XML and HTML DOMs and to generate the corresponding XSLT stylesheet. We then measured the matching accuracy, i.e. the percentage of the manually determined mappings that XSLTGen discovered, and the quality of the XSLT stylesheet inferred by XSLTGen. To evaluate the quality of the XSLT stylesheet generated by XSLTGen for each dataset, we applied the generated XSLT stylesheet back to the XML document using Xalan and then compared the resulting HTML with the original HTML document using HTMLDiff5. HTMLDiff is a tool for analysing changes made between two revisions of the same file. It is commonly used for analysing HTML and XML documents. The differences may be viewed visually in a browser, or be analysed at the source level. The results for matching accuracy are impressive. XSLTGen achieves high matching accuracy across all five datasets. Exact mappings reach 100% accuracy in four out of five datasets. In the dataset Chat Log, exact mappings reach 86% accuracy. This is caused by the undiscovered mappings from XML ATTRIBUTE_NODEs to HTML ATTRIBUTE_NODEs, which violate our assumption in Section 2.2 that the value of an HTML ATTRIBUTE_NODE is usually specific to the display of the HTML document in Web browsers and is not generated from text within the XML document. Substring mappings achieve 100% accuracy in the datasets Itinerary and Soccer. In contrast, substring mappings achieve 0% accuracy in the dataset Poem. This poor performance is caused by incorrectly classifying substring mappings as exact mappings during the text matching process. In the datasets Books and Chat Log, substring mappings do not exist. Structure mappings achieve perfect accuracy in all datasets except Poem. In the dataset Poem, structure mappings achieve 80% accuracy because an XML node is incorrectly matched with an HTML TEXT_NODE in text matching, while it should be matched with another HTML node in structure matching. Following the success of the other mappings, 1-m mappings achieve 100% accuracy in the datasets Itinerary and Soccer. In the datasets Books, Poem and Chat Log, there are no 1-m mappings. These results indicate that in most of these cases the XSLTGen system is capable of discovering complete and accurate mappings. The results returned by HTMLDiff are also impressive. The new HTML documents have a very high percentage of correct nodes. In the datasets Itinerary and Soccer, the HTML documents being compared are identical, which is shown by the achievement of 100% in all types of nodes. In the dataset Poem, the two HTML documents have exactly the same appearance in Web browsers, but according to HTMLDiff, there are some missing whitespaces in each line within the paragraphs of the new HTML document. That is why the percentage of correct TEXT_NODEs in this dataset is very low (14%). The reason for this low percentage is that in the text matching subsystem, we remove the leading and trailing whitespaces of a string before the matching is done. The improvement

4 http://xml.apache.org/xalan-j/index.html
5 http://www.componentsoftware.com/products/HTMLDiff/


stage does not fix the stylesheet since there are no move operations. In the dataset Books, the difference occurs in the first column of the table. In the original HTML document, the first column is the sequence of numbers 1, 2, 3 and 4, whereas in the new HTML document the first column is a sequence of 1s. The numbers 1, 2, 3 and 4 in the original HTML document are represented using four extra nodes. However, our template rule constructor assumes that all extra nodes that are cousins (their parents are siblings and have the same node name) have the same structure and values. Since the four extra nodes have different text values in this dataset, the percentage of correct TEXT_NODEs in the new HTML document is slightly affected (86%). Lastly, the differences between the original and the new HTML documents in the dataset Chat Log are caused by the undiscovered mappings mentioned in the previous paragraph. Because of this, it is not possible to fix the XSLT stylesheet. However, the percentage of correct ATTRIBUTE_NODEs is still acceptable (75%). We have tested XSLTGen on many other examples and the results are very similar to those obtained in this experiment. However, there are some problems that prevent XSLTGen from obtaining even higher matching accuracy. First, in a few cases, XSLTGen is not able to discover some mappings between XML ATTRIBUTE_NODEs and HTML ATTRIBUTE_NODEs because these mappings violate our assumption stated in Section 2.2. This problem can be alleviated by considering HTML ATTRIBUTE_NODEs in the matching process. Undiscovered mappings are also caused by incorrectly matching some nodes, which is the second problem faced in the matching process. Incorrect matchings typically occur when an XML or an HTML TEXT_NODE has some ELEMENT_NODE siblings. In some cases, these nodes should be matched during the text matching process, while in other cases they should be matched in structure matching. Here, the challenge lies in developing matching techniques that are able to determine whether a TEXT_NODE should be matched during text matching or structure matching. The third problem concerns incorrectly classified mappings. This problem only occurs between a substring mapping and an exact mapping, when the compared strings have some leading and trailing whitespaces. Determining whether whitespaces should be kept or removed is a difficult choice. Besides this, as the theme of our text matching subsystem is text-based matching (matching two strings), the performance of the matching process decreases if the supplied documents contain mainly numerical data. In this case, the mappings discovered, especially substring mappings, are often inaccurate and conflicting, i.e. more than one HTML node is matched with an XML node. Finally, the current version of XSLTGen does not support the capability to automatically generate XSLT stylesheets with complex functions (e.g. sorting). This is a very challenging task and an interesting direction for future work.

4 Related Work

There is little work in the literature on automatic XSLT stylesheet generation. The only prior work of which we are aware is XSLbyDemo [10], a system


that generates an XSLT stylesheet by example. In this system, the process of generating an XSLT stylesheet begins with transforming the XML document into an initial HTML page, produced using a manually created XSLT stylesheet that takes into account the DTD of the XML document. The user then modifies the initial HTML page using a WYSIWYG editor and their actions are recorded in an operation history. Based on the user's operation history, a new stylesheet is generated. Obviously, this system is not automatic, since the user is directly involved at some stages of the XSLT generation process. Hence, it is not comparable to our fully automatic XSLTGen system. Specifically, our approach differs from XSLbyDemo in three key ways: (i) our algorithm produces a stylesheet that transforms an XML document into an HTML document, while XSLbyDemo generates transformations from an initial HTML document to its modified HTML document; (ii) our generated XSLT can be applied directly to other XML documents from the same document class, whereas using XSLbyDemo, the other XML documents have to be converted to their initial HTML pages before the generated stylesheet can be applied; (iii) finally, our users do not have to be familiar with a WYSIWYG editor or with providing structural information through editing actions. The only thing that they need to possess is knowledge of a basic HTML tool. In the process of generating XSLT, semantic mappings need to be found. There are a number of algorithms available for tree matching. Work done in [12,13] on the tree distance problem or tree-to-tree correction problem, and in [2], known as the change-detection algorithm, compares and discovers the sequence of edit operations needed to transform a given source tree into a given result tree. These algorithms are mainly based on structure matching, and their input comprises two labelled trees of the same type, i.e. two HTML trees or two XML trees. The text matching involved is very simple and limited since it compares only the labels of the trees. Clearly, these algorithms do not accommodate our needs, since we require an algorithm that matches an XML tree with an HTML tree. However, these algorithms are certainly useful in our refinement stage since within that subsystem we are comparing two HTML documents. In the field of semantic mapping, a significant amount of work has focused on schema matching (refer to [11] for a survey). Schema matching is similar to our matching problem in the sense that two different schemas are compared, which have different sets of element names and data instances. However, the two schemas being compared are mostly from the same domain and therefore their element names are different but comparable. Besides using structure matching, most schema mapping systems rely on element name matchers to match schemas. The TransSCM system [9] matches schemas based on the structure and names of the SGML tags extracted from DTD files using the concept of labelled graphs. The Artemis system [1] measures the similarity of element names, data types and structure to match schemas. In XSLTGen, it is impossible to compare the element names since XML and HTML have completely different tag names. XMapper [7] is another system for finding semantic mappings between structured documents within a given domain, particularly XML sources. This system


uses an inductive machine learning approach to improve the accuracy of mappings for XML data sources whose data types are either identical or very similar, and whose tag names are significantly different. In essence, this system is suitable for our matching process in XSLTGen since the tag names of XML and HTML documents are completely different. However, this system requires the user to select one matching tag between the two documents, which violates our principal intention of creating a fully automatic system. Recent work in the area of ontology matching also focuses on the problem of finding semantic mappings between two ontologies. One ontology matching system that we are aware of is the GLUE system [4]. GLUE also employs machine learning techniques to semi-automatically create such semantic mappings. Given two ontologies, for each node in one ontology the purpose is to find the most similar node in the other ontology using the notions of similarity measures and relaxation labelling. Similar to our matching process, the bases used in the similarity measure and relaxation labelling are data values and the structure of the ontologies, respectively. However, GLUE is only capable of finding 1-1 mappings, whereas our XSLTGen matching process is able to discover not only 1-1 mappings but also 1-m and sometimes m-1 mappings (in substring mappings). The main difference between mapping in XSLTGen and other mapping systems is that in XSLTGen we believe that mappings exist between the elements in the XML and HTML documents, since the HTML document is derived from the XML document by the user, whereas in other systems the mappings may not exist. Moreover, the mappings generated by the matching process in XSLTGen are used to generate code (an XSLT stylesheet), and that is why the mappings found have to be accurate and complete, while in schema matching and ontology matching the purpose is only to find the most similar nodes between the two sources, without further processing of the results. To accommodate the XSLT stylesheet generation, XSLTGen is capable of finding 1-1 mappings, 1-m mappings and sometimes m-1 mappings, whereas the other mapping systems focus only on discovering 1-1 mappings. Besides this, the matching subsystem in XSLTGen has the advantage of having very similar and related data sources, since the HTML data is derived from the XML data; hence, the data can be used as the primary basis for finding the mappings. In other systems, the data instances in the two sources are completely different; the only association that they have is that the sources come from the same domain. Following this argument, XSLTGen discovers the mappings between two different types of document, i.e. an XML and an HTML document, whereas the other systems compare two documents of the same type. Finally, another important aspect which distinguishes XSLTGen from several other systems is that the process of discovering the mappings which will then be used to generate the XSLT stylesheet is completely automatic.

5 Conclusion

With the upsurge in data exchange and publishing on the Web, conversion of data from its stored representation (XML) to its publishing format (HTML)


is increasingly important. XSLT plays a prominent role in transforming XML documents into HTML documents. However, it is difficult for users to learn. We have devised XSLTGen, a system for automatically generating an XSLT stylesheet, given an XML document and its corresponding HTML document. This is useful for helping users to learn XSLT. The main strengths of the generated XSLT stylesheets are accuracy and reusability. We have described how text matching, structure matching and sequence checking enable XSLTGen to discover not only 1-1 semantic mappings between the elements in the XML document and those in the HTML document, but also 1-m and sometimes m-1 mappings. We have also described a fully automatic XSLT generation system that generates XSLT rules based on the mappings found. Our experiments showed that XSLTGen can achieve high matching accuracy and produce high quality stylesheets.

References
1. S. Bergamaschi, S. Castano, S.D.C.D. Vimercati, and M. Vincini. An Intelligent Approach to Information Integration. In Proceedings of the 1st International Conference on Formal Ontology in Information Systems, pages 253–267, Trento, Italy, June 1998.
2. S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. In Proceedings of the 1996 International Conference on Management of Data, pages 493–504, Montreal, Canada, June 1996.
3. J. Clark. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, November 1999. http://www.w3.org/TR/xslt.
4. A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proceedings of the 11th International Conference on World Wide Web, pages 662–673, Honolulu, USA, May 2002.
5. M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 7(1):23–56, January 2003.
6. M. Kay. XSLT Programmer's Reference. Wrox Press Ltd., 2000.
7. L. Kurgan, W. Swiercz, and K.J. Cios. Semantic Mapping of XML Tags using Inductive Machine Learning. In Proceedings of the 2002 International Conference on Machine Learning and Applications, pages 99–109, Las Vegas, USA, June 2002.
8. M. Leventhal. XSL Considered Harmful. http://www.xml.com/pub/a/1999/05/xsl/xslconsidered_1.html, 1999.
9. T. Milo and S. Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 122–133, New York, USA, August 1998.
10. K. Ono, T. Koyanagi, M. Abe, and M. Hori. XSLT Stylesheet Generation by Example with WYSIWYG Editing. In Proceedings of the 2002 International Symposium on Applications and the Internet, Nara, Japan, March 2002.
11. E. Rahm and P.A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, December 2001.
12. S.M. Selkow. The Tree-to-Tree Editing Problem. Information Processing Letters, 6(6):184–186, December 1977.
13. K.C. Tai. The Tree-to-Tree Correction Problem. Journal of the ACM, 26(3):422–433, July 1979.

Efficient Recursive XML Query Processing in Relational Database Systems

Sandeep Prakash1, Sourav S. Bhowmick1, and Sanjay Madria2

1 School of Computer Engineering, Nanyang Technological University, Singapore
2 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409

Abstract. There is growing evidence that schema-conscious approaches are a better option than schema-oblivious techniques as far as XML query performance in a relational environment is concerned. However, the issue of recursive XML queries for such approaches has not been dealt with satisfactorily. In this paper we argue that it is possible to design a schema-oblivious approach that outperforms schema-conscious approaches for certain types of recursive queries. To that end, we propose a novel schema-oblivious approach called SUCXENT++ that outperforms existing schema-oblivious approaches such as XParent by up to 15 times and schema-conscious approaches (Shared-Inlining) by up to 3 times for recursive query execution. Our approach has up to 2 times smaller storage requirements than existing schema-oblivious approaches and about 10% smaller than schema-conscious techniques. In addition, existing schema-oblivious approaches are hampered by poor query plans generated by the relational query optimizer. We propose optimizations in the XML-query-to-SQL translation process that generate queries with more optimal query plans.

1 Introduction
Recursive XML queries are considered to be quite significant in the context of XML query processing [3], and yet this issue has not been addressed satisfactorily in the existing literature. Recursive XML queries are XML queries that contain the descendant axis (//). The use of '//' is quite common in XML queries due to the semi-structured nature of XML data [3]. For example, consider the XML document in Figure 2. The element item could occur either under europe or africa. Consider the scenario where a user needs to retrieve all item elements. The user will have to execute the path expression Q = /site//item. Another scenario could be that the document structure is not completely known to the user, except that each item has a name and a price. Suppose the user needs to


find out the price of the item with name "Gold Ignot"; Q = //item[name="Gold Ignot"]/price will be the corresponding path expression. Efficient execution of XML queries, recursive or otherwise, is largely determined by the underlying storage approach. There has been a substantial research effort in storing and processing XML data using existing relational databases [1,6,2]. These approaches can be broadly classified as: (a) Schema-conscious approaches: these methods first create a relational schema based on the DTD of the XML documents. An example of such an approach is the inlining approach [5]. (b) Schema-oblivious approaches: these methods maintain a fixed schema which is used to store XML documents irrespective of their DTD. Examples of schema-oblivious approaches are the Edge approach [1], XRel [7] and XParent [2]. Schema-oblivious approaches have obvious advantages, such as the ability to handle XML schema changes without modifying the relational schema and a uniform query translation approach. Schema-conscious approaches, on the other hand, have the advantage of more efficient query processing [6]. Also, no special relational schema needs to be designed for schema-conscious approaches, as it can be generated on the fly based on the DTD of the XML document(s). In this paper, we present an efficient approach to processing recursive XML queries using a schema-oblivious approach. At this point, one might question the justification of this work for two reasons. First, this issue may have already been addressed. Surprisingly, this is not the case, as highlighted in [3]. Second, a growing body of work suggests that schema-conscious approaches perform better than schema-oblivious approaches. In fact, Tian et al. have demonstrated in [6] that schema-conscious approaches generally perform substantially better in terms of query processing and storage size. However, the Edge approach [1] was used as the representative schema-oblivious approach for comparison. Although the Edge approach is a pioneering relational approach, we argue that it is not a good representative of the schema-oblivious approach as far as query processing is concerned. In fact, XParent [2] and XRel [7] have been shown to outperform the Edge approach by up to 20 times, with XParent outperforming XRel [2]. However, this does not mean that XParent outperforms schema-conscious approaches; in fact, as we will show in Section 6, schema-conscious approaches still outperform XParent. Hence, it may seem that schema-conscious approaches generally outperform schema-oblivious approaches in terms of query processing. In this paper we argue that it is indeed possible to design a schema-oblivious approach that can outperform schema-conscious approaches for certain types of recursive queries. To justify our claim, we propose a novel schema-oblivious approach, called SUCXENT++ (Schema Unconscious XML Enabled System, pronounced "succinct++"), and investigate the performance of recursive XML queries. We only store the leaf nodes and the associated paths, together with two additional attributes for efficient query processing (details follow in Section 3). SUCXENT++ outperforms existing schema-oblivious techniques, such as XParent, by up to 15 times and Shared-Inlining, a schema-conscious approach, by up to 3 times for recursive queries with the characteristics described in Section 6. In addition,


Fig. 1. Sample DTD.

Fig. 2. Sample XML document.

SUCXENT++ can reconstruct shredded documents up to 2 times faster than Shared-Inlining. The main reasons SUCXENT++ performs better than existing approaches are: 1) a significantly lower storage size and, consequently, a lower I/O cost associated with query processing; 2) fewer joins in the corresponding SQL queries; and 3) additional optimizations, discussed in Section 5, that are made to improve the query plan generated by the relational query optimizer. In summary, the main contributions of this paper are: (1) a novel schema-oblivious approach whose storage size depends only on the number of leaf nodes in the document; (2) optimizations to improve the query plan generated by the relational query optimizer. Traditional schema-oblivious approaches


have been hampered by the poor query plan selection of the underlying relational query optimizer [6,8]. (3) To the best of our knowledge, this is the first attempt to show that it is indeed possible to design a schema-oblivious approach that can outperform schema-conscious approaches as far as the execution of certain types of recursive XML queries is concerned.

2 Related Work

All existing schema-oblivious approaches store, at the very least, every node of the XML document. The Edge approach [1] essentially captures the edge information of the tree that represents the XML document. However, resolving ancestor-descendant relationships requires the traversal of all the edges from the ancestor to the descendant (or vice versa). The system proposed by Zhang et al. in [8] labels each node with its preorder and postorder traversal numbers. Ancestor-descendant relationships can then be resolved in constant time using the property preorder(ancestor) < preorder(descendant) and postorder(ancestor) > postorder(descendant). It still results in as many joins as there are path separators. To solve the problem of multiple joins, XRel [7] stores the path of each node in the document. The resolution of path expressions then only requires the paths (which can be represented as strings) to be matched using string matching operators. However, the XRel approach still makes use of the containment property mentioned above to resolve ancestor-descendant relationships. It involves joins with inequality (< or >) predicates, which have been shown to be quite expensive due to the manner in which an RDBMS processes joins [8]. In fact, special algorithms such as the multi-predicate merge sort join algorithm [8] have been proposed to optimize these operations. However, to the best of our knowledge there is no off-the-shelf RDBMS that implements these algorithms. XParent [2] avoids this problem by using an Ancestor table that stores all the ancestors of a particular node in a single table; it then replaces the inequality joins with equi-joins over this set of ancestors. However, this approach results in an explosion in the database size as compared to the original document. The number of relational joins is also quite substantial: XParent requires a join between the LabelPath, DataPath, Element and Ancestor tables for each path in the query expression. These joins are quite expensive, especially when the Ancestor table is involved, as it can be quite large in size. SUCXENT++ is different from existing approaches in that it only stores leaf nodes and their associated paths. We store two additional attributes, called BranchOrder and BranchOrderSum, for each leaf node; these capture the relationship between leaf nodes. Essentially, they allow the determination of the common nodes between the paths of any two leaf nodes in constant time. This results in a substantial reduction in storage size and query processing time. In addition, we propose optimizations that enable the underlying relational query optimizer to generate near-optimal query plans for our approach, resulting in a substantial performance improvement. Our studies indicate that these optimizations can be applied to other schema-oblivious approaches as well.
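As a small illustration of this interval-labelling idea (our sketch, not taken from [8]), the following code labels a tiny tree and checks the containment property; node names are made unique purely for the example.

```python
# Each node gets its preorder and postorder numbers; u is an ancestor of v
# iff pre(u) < pre(v) and post(u) > post(v). Node names are simplified to be
# unique so they can serve as dictionary keys.
def label(tree, pre=None, post=None, counter=None):
    """tree: (name, [children]). Returns dicts of preorder/postorder numbers."""
    if pre is None:
        pre, post, counter = {}, {}, {"pre": 0, "post": 0}
    name, children = tree
    pre[name] = counter["pre"]; counter["pre"] += 1
    for child in children:
        label(child, pre, post, counter)
    post[name] = counter["post"]; counter["post"] += 1
    return pre, post

def is_ancestor(u, v, pre, post):
    return pre[u] < pre[v] and post[u] > post[v]

doc = ("site", [("regions", [("europe", [("item1", [])]),
                             ("africa", [("item2", [])])])])
pre, post = label(doc)
print(is_ancestor("regions", "item2", pre, post))  # True
print(is_ancestor("europe", "item2", pre, post))   # False
```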


Fig. 3. XParent schema.


Fig. 4. SUCXENT++ schema.

Fig. 5. SUCXENT++: XML data in RDBMS.

Schema-oblivious approaches are not influenced by recursion in the schema. However, the Edge approach uses recursive SQL queries based on the SQL99 with construct to evaluate recursive XML queries, while XParent and XRel handle recursive queries like any other query. Unlike these schema-oblivious approaches, schema-conscious strategies have to treat recursion in both the schema and queries as special cases. In [3], the authors propose a generic algorithm to translate recursive XML queries for schema-conscious approaches using the SQL99 with construct. However, no performance evaluation of the resulting SQL queries is presented, and it is assumed that schema-conscious approaches will outperform schema-oblivious approaches. SUCXENT++ also treats recursive XML queries like other queries. It also implements optimizations to generate SQL translations of recursive XML queries that enable the relational query optimizer to produce better query plans, resulting in significant performance gains.

3 Storing XML Data

In this section, we first discuss the SUCXENT++ schema. This will be followed by a formal algorithm to reconstruct XML documents from their relational form. The document in Figure 2 is used as a running example.

3.1 SUCXENT++ Schema

The schema is shown in Figure 4 and the shredded document in Figure 5. The semantics of the schema is as follows. The Document table is used for storing the names of the documents in the database; each document has a unique id recorded in DocID. The Path table is used to record the paths of all the leaf nodes. For example, the path of the first leaf node name in Figure 2 is /site/regions/europe/item/name. This table maintains path ids, relative path expressions and their lengths, recorded as instances of PathID, PathExp and Length respectively. This is to reduce the storage size, so that we only need to store the path id in the PathValue table. The Length attribute is useful for resolving recursive queries. PathValue stores only the leaf nodes. The DocID attribute indicates which XML document a particular leaf node belongs to. The PathID attribute maintains the id of the path of a particular leaf node as stored in Path. LeafOrder records the node order of the leaf nodes in an XML tree. For example, when the sample XML document is parsed, the leaf node name with value "Gold Ignot" is encountered as the first leaf node; therefore, it is assigned a LeafOrder value of 1. BranchOrder of a leaf node is the level at which it intersects the preceding leaf node, i.e. the level of the lowest common ancestor of the two leaf nodes under consideration. Consider the leaf node with LeafOrder=2 in Figure 2. This leaf node intersects the leaf node with LeafOrder=1 at the node item, which is at level 4. So the BranchOrder value for this node is 4. Similarly, the node name with value "Item2" has BranchOrder=2 (intersecting the node to its left at regions). PathValue stores the textual content of the leaf nodes in the column LeafValue. The BranchOrder attribute in this table is useful for reconstructing XML documents from their shredded relational format, as discussed in Section 3.2. The significance of DocumentRValue and BranchOrderSum in PathValue is elaborated in Section 4, and CPathId in Path is discussed in Section 5. For the remainder of the paper, we will refer to LeafOrder and BranchOrder as Order information.
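As a small illustration (ours, not part of the paper), LeafOrder and BranchOrder can be derived from the root-to-leaf paths of consecutive leaf nodes:

```python
# Illustrative sketch: BranchOrder is the length of the common prefix with the
# preceding leaf's path, i.e. the level of their lowest common ancestor; the
# first leaf gets BranchOrder 0. Documents are assumed to be visited in
# document order.
def order_information(leaf_paths):
    rows, prev = [], []
    for leaf_order, path in enumerate(leaf_paths, start=1):
        nodes = path.strip("/").split("/")
        branch = 0
        for a, b in zip(prev, nodes):
            if a != b:
                break
            branch += 1
        rows.append((leaf_order, branch, path))
        prev = nodes
    return rows

leaves = ["/site/regions/europe/item/name",
          "/site/regions/europe/item/price",
          "/site/regions/africa/item/name"]
for leaf_order, branch_order, path in order_information(leaves):
    print(leaf_order, branch_order, path)
# 1 0 /site/regions/europe/item/name
# 2 4 /site/regions/europe/item/price
# 3 2 /site/regions/africa/item/name
```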

3.2 Extraction of XML Documents

The algorithm for reconstruction is presented in Figure 6. The input to the algorithm is a list of leaf nodes arranged in ascending LeafOrder. Each leaf node path is first split into its constituent nodes (lines 5 to 7). If the document construction has not yet started (line 10), then the first node obtained by splitting the first leaf node path is made the root (lines 11 to 15). When the next leaf node is processed, we only need to look at the nodes after the BranchOrder of that node, as the nodes up to this level have already been added to the document


Fig. 6. Extraction algorithm.

(lines 20 to 22). The remaining nodes are then added to the document (lines 27 to 32). Document extraction is complete once all the leaf nodes have been processed. In addition to reconstructing the whole document, this algorithm can be used to construct a document fragment given a partial list of consecutive leaf nodes.
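Since Figure 6 itself is not reproduced here, the following sketch (ours, not the paper's pseudocode) captures the gist of the reconstruction; the input is the list of leaf paths, BranchOrder values and leaf values sorted by LeafOrder, and the sample values are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative sketch of the reconstruction described above: pop the stack of
# open elements back to the BranchOrder level, then open the remaining nodes
# of the leaf's path and attach the leaf value. Not the paper's exact algorithm.
def reconstruct(leaves):
    """leaves: list of (path, branch_order, value) sorted by LeafOrder."""
    root, stack = None, []           # stack[k] holds the open element at level k+1
    for path, branch_order, value in leaves:
        tags = path.strip("/").split("/")
        if root is None:
            root = ET.Element(tags[0])
            stack = [root]
            start = 1
        else:
            del stack[branch_order:]  # levels up to branch_order are already built
            start = branch_order
        for tag in tags[start:]:
            stack.append(ET.SubElement(stack[-1], tag))
        stack[-1].text = value
    return root

doc = reconstruct([("/site/regions/europe/item/name", 0, "Gold Ignot"),
                   ("/site/regions/europe/item/price", 4, "120"),
                   ("/site/regions/africa/item/name", 2, "Item2")])
print(ET.tostring(doc).decode())
```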

4 Recursive Query Processing

Consider the recursive query XQuery 1 in Figure 7. A tree representation of the query is shown in Figure 8. This query returns those price leaf nodes that intersect the constraint-satisfying text leaf node at item. Consider how XParent resolves this query. The schema for XParent is shown in Figure 3. XParent evaluates this query by locating leaf nodes from the Data table that satisfy the constraint on text. This involves a join between the LabelPath and Data tables to satisfy the path constraint /site/regions/africa/item//text and a predicate on Data to satisfy the value constraint. Next, the LabelPath and Data tables are joined again to obtain those leaf nodes that satisfy /site/regions/africa/item/price. These two result sets are joined using the Ancestor table to find nodes that have a common ancestor at level 4 (at item). Thus, the final SQL query involves five joins: two between LabelPath and Data, two between Data and Ancestor, and one between two Ancestor tables (SQL query translation details for XParent can be found in [2]). These joins can be quite expensive due to the large size of Ancestor. XRel follows a similar approach to

Fig. 7. Running example.

Fig. 8. Query Tree.


resolving path expressions, except that it uses the ancestor-descendant containment property instead of an Ancestor table. This produces inequality joins, resulting in performance worse than XParent. A detailed evaluation of XRel vs. XParent can be found in [2].

4.1 The SUCXENT++ Approach

In order to reduce the I/O cost involved in query evaluation, SUCXENT++ only stores the leaf nodes of a document. However, the attributes discussed so far are insufficient for query processing; the schema needs to be extended as follows. An attribute BranchOrderSum is assigned to each leaf node. In addition, we store an attribute RValue in the DocumentRValue table for each level in the document. Essentially, these allow the determination of the common nodes between the paths of any two leaf nodes in constant time. This results in a substantial reduction in storage size and query processing time. Given an XML document with maximum depth D, the RValue and BranchOrderSum assignment is done as follows. (1) RValue is assigned recursively based on the equation: where (a) is the maximum number of consecutive leaf nodes with (b) (2) Let us denote the BranchOrder of a node with LeafOrder as Then, the BranchOrderSum of this node is We illustrate the above attributes with an example. Consider the document in Figure 2. For simplicity, ignore the parlist element. Then the depth of the document is 6. The maximum number of consecutive leaf nodes with BranchOrder 5 is 1. The maximum number of consecutive leaf nodes with BranchOrder 4 is 3 (e.g., price, text, keyword under the first item element). The BranchOrderSum of the first leaf node is 0. Since the BranchOrder of the second leaf node is 4, the BranchOrderSum of the second leaf node is 3. The values for the complete document are shown in DocumentRValue and PathValue of Figure 5. Lemma 1. If then nodes with LeafOrders and intersect at a level greater than That is, where is the level at which nodes with leaf orders and intersect. The proof of the above lemma is not presented here due to space constraints. The attributes RValue and BranchOrderSum allow the determination of the intersection level between any two leaf nodes in more or less constant time, whereas in XParent it depends on the size of the Ancestor and Data tables, as a join between these tables is required to determine the ancestor node at a particular level. This reduces the query processing time drastically. Since this is achieved without storing separate ancestor information, the storage requirements are also reduced significantly. We will now discuss how these attributes are useful in query processing. Consider XQuery 1. The BranchOrderSum value for the first constraint-satisfying text node is 6. The BranchOrderSum value for the first price node is 3. Also,


10. Using the property proven above, we conclude that these two nodes have ancestors till a since Since item is at level 4 in both cases, it is clear that they have a common item node and, therefore, satisfy the query. Similarly, we can conclude that the first text node and the item node with name Item3 intersect at a level > 1 (since and ), and therefore do not form part of the query result.

4.2 SQL Translation

We have implemented an algorithm to translate XQuery queries to SQL in SUCXENT++. Due to space constraints, we discuss the translation procedure informally. Consider the recursive query of Figure 7 (XQuery 1) and its corresponding SQL translation (SQL 1). The translation can be explained as follows. (1) Lines 5, 7 and 8 translate the part of the query that seeks an entry with contains(text,"Gold Ignot"). Note that we store only the leaf nodes, their textual content and path id in the PathValue table; the actual path expression corresponding to a leaf node is stored in the Path table. Therefore, we need to join the two to obtain leaf nodes that correspond to the path /site/regions/africa/item//text and contain the phrase "Gold Ignot". Notice that the corresponding SQL translation has a LIKE clause to resolve the // relationship; this is how recursive queries are handled in SUCXENT++. (2) Lines 6 and 7 do the same for the extraction of leaf nodes that correspond to the path /site/regions/africa/item/price. (3) Line 9 ensures that the leaf nodes extracted in lines 5 to 8 belong to the same document. (4) Line 10 ensures that the two sets of leaf nodes intersect at least at level 4. The reason a level-4 ancestor is needed is that the two paths in the query intersect at level 4. It calculates the absolute value of the difference between the BranchOrderSum values and ensures that it is below the RValue for level 4. (5) Line 1 returns the properties of the leaf nodes corresponding to the price element. These properties are needed to construct the corresponding XML fragment based on the algorithm in Figure 6. Say the return clause in Figure 7 was $b. Then, line 6 in the translation would change to p2.PathExp LIKE '/site/regions/africa/item%' to extract all leaf nodes that have paths beginning with $b. This way, elements and their children can be retrieved. Compared to XParent, SUCXENT++ uses only the PathValue, Path and DocumentRValue tables to evaluate a query. The sizes of the PathValue and Path tables are the same as those of the Data and LabelPath tables in XParent. DocumentRValue has only as many rows as the depth of the document, in contrast to the Ancestor table in XParent, which stores the ancestor list of every node in the document. This results in substantially better query performance in addition to a much smaller storage size.
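Since the text of SQL 1 in Figure 7 is not reproduced here, the following hedged sketch illustrates the shape of the translation described in steps (1)-(5); table and column names follow Section 3.1 where they are given, while the level column of DocumentRValue and the exact LIKE pattern used to resolve '//' are assumptions.

```python
# Hedged reconstruction of the shape of SQL 1; not the paper's exact query.
SQL1_SKETCH = """
SELECT v2.DocID, v2.LeafOrder, v2.BranchOrder,
       v2.BranchOrderSum, v2.PathID, v2.LeafValue              -- cf. line 1
FROM   PathValue v1, Path p1, PathValue v2, Path p2, DocumentRValue r
WHERE  p1.PathExp LIKE '/site/regions/africa/item%text'        -- cf. line 5 ('//')
  AND  p2.PathExp = '/site/regions/africa/item/price'          -- cf. line 6
  AND  v1.PathID = p1.PathID AND v2.PathID = p2.PathID         -- cf. line 7
  AND  v1.LeafValue LIKE '%Gold Ignot%'                        -- cf. line 8
  AND  v1.DocID = v2.DocID                                     -- cf. line 9
  AND  r.DocID = v1.DocID AND r.Level = 4                      -- cf. line 10:
  AND  ABS(v1.BranchOrderSum - v2.BranchOrderSum) < r.RValue   --   intersect at item
"""
print(SQL1_SKETCH)
```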

5 Optimizations

A preliminary performance evaluation using the above translation procedure yielded some interesting results. We checked the query plans generated by the


Fig. 9. Initial query plan.

Fig. 10. Path optimization.

Fig. 11. Multiple-queries optimization.

query optimizer and noticed that the join between the Path and PathValue tables took a significant portion of the query processing time. This was because for most of the queries this join was being performed last. For example, in SQL 1 of Figure 7 the joins in lines 8 to 10 were evaluated first and only then was the join between Path and PathValue tables performed. The initial query plan is shown in Figure 9. We have not shown the DocumentRValue table in the plan, even though the query optimizer includes it, as it does not influence the optimization. The two Hash-Joins (labelled 1 and 2) in this plan are both very expensive. The first takes the PathValue table (with alias v2) as one of its inputs. The second join takes the result of this join as one of its inputs. Both these inputs are quite substantial in size resulting in very expensive join operations. In order to improve the above query plans we propose three optimizations that are discussed below. Optimization for Simple Path Expressions. The join expression v1.PathId = p1.Id and p1.PathExp = path is replaced with v1.PathId = n where n is the PathId value corresponding to path in the table Path. Similarly, v1.PathId = p1.Id and p1.PathExp LIKE path% is replaced with v1.PathId >= n and v1.PathId = 1 and p.CPathId , =, [][] <selectClause> :: = SELECT <pageList> :: = query [, query...] :: = FROM <matrixIdentifier> FROM <matrixList> :: = WHERE LINKWT integer :: = GROUP BY <pageList> <pageList> :: = pageIdentifier [, pageIdentifier...] <matrixList> :: = matrixIdentifier [, matrixIdentifier...] There are four main clauses in a query expression: the select, from, condition, and group . Among them the select and the from clauses are compulsory, while the condition and the group clauses are optional. Similar to SQL, WUML is a simple declarative language but is powerful enough to express query on the log information stored as navigation matrices. We execute a WUML expression by translating it into a sequence of NLA operations using Algorithm 2. Suppose is a set of Web pages in the select clause, is a set of matrices in the


Fig. 1. WUML query tree



Fig. 2. Optimized WUML query tree


Suppose the select clause specifies a set of Web pages, the from clause a set of navigation matrices, and the group clause a set of Web pages, together with two non-negative integer parameters. Note that the input WUML query expression is assumed to be syntactically valid. Fig. 1 depicts a query tree which illustrates the basic idea. Now, we present a set of examples which illustrate the usage of WUML expressions and their translation into corresponding sequences of NLA operations. Let M and the other matrices mentioned below be navigation matrices. (1) We want to know how frequently a given pair of pages was visited. WUML expression: SELECT the pages of interest FROM the SUM of the matrices. (2) We want to find the essential difference in preferences between two groups of users, considering only those links having weight > 3. WUML expression: SELECT the pages FROM the DIFF of the two matrices WHERE LINKWT > 3. (3) We want to get the navigation details of an information unit [9] consisting of three pages; we may gain insight into whether it is better to combine these three Web pages or not, so we consider them as a group. WUML expression: SELECT the pages FROM SUM GROUP BY the three pages. (4) We want to know whether some pages were visited by users after 3 clicks; if they were seldom visited, or visited only late in a user session, we may decide to remove or update them to make them more popular. WUML expression: SELECT the pages FROM POWER 3 M. (5) Finally, we want to get specific information about a particular set of Web pages. WUML expression: SELECT the pages FROM the INTERSECT of the matrices WHERE LINKWT > 6. In each case the WUML expression is translated into a corresponding NLA operation.
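For concreteness, one such query is spelled out below with hypothetical identifiers; the page names p1 and p2, the matrix names M1 and M2, and the exact surface syntax of the from clause are assumptions, not taken from the paper.

```python
# Hypothetical WUML query (identifiers and exact syntax are placeholders):
# how frequently were pages p1 and p2 visited, according to the combined
# usage recorded in navigation matrices M1 and M2?
EXAMPLE_WUML = "SELECT p1, p2 FROM SUM M1, M2"

# Informally, SUM merges the two navigation matrices element-wise and the
# select clause projects the result onto the rows and columns of p1 and p2.
print(EXAMPLE_WUML)
```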


Let us again consider the preceding query. We will see in Sect. 6 that the running time of the NLA operators is proportional to the number of non-zero elements in the matrix being operated on. Therefore, the optimal plan is to execute first those NLA operators that minimize the number of non-zero elements in the matrix. For the sake of efficiency, the projection should be executed as early as possible, which yields a better NLA execution plan. We now summarize some optimization rules, as depicted in Fig. 2. First, projection should be done as early as possible, since it can eliminate some non-zero elements. Note that projection is not distributive under difference and power. Second, since selection is not distributive under some binary operators such as difference, we do not change its execution order. Finally, grouping creates a view different from the underlying Web topology; therefore, it should be done as the last step, except for operators that take another navigation matrix whose structure is the same as the grouped one. Note that these rules are simple heuristics for ordering NLA operations. A more systematic way to generate an optimized execution plan for a given WUML expression remains to be found.
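As a rough illustration of the projection rule, the sketch below models a navigation matrix as a dictionary of non-zero link weights and checks that projecting before a SUM gives the same result as projecting afterwards while feeding smaller operands to the operator; the representation and function names are ours, not the paper's.

```python
# Minimal sketch of the "project as early as possible" heuristic.
# A navigation matrix is modelled as {(from_page, to_page): weight} (assumption).

def project(matrix, pages):
    """Keep only links whose endpoints both lie in the selected pages."""
    return {(u, v): w for (u, v), w in matrix.items() if u in pages and v in pages}

def add(m1, m2):
    """Element-wise SUM of two navigation matrices."""
    out = dict(m1)
    for key, w in m2.items():
        out[key] = out.get(key, 0) + w
    return out

m1 = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 5}
m2 = {("a", "b"): 1, ("d", "a"): 4}
pages = {"a", "b"}

# Same result either way, but the pushed-down plan feeds smaller operands to add().
assert project(add(m1, m2), pages) == add(project(m1, pages), project(m2, pages))
```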

5

Storage Schemes for Navigation Matrices

As the navigation matrices generated from the Web log files are usually sparse, the storage scheme of a matrix greatly affects the performance of WUML. In this section we introduce three storage schemes, COO, CSR, and CSC, and study their impact on the NLA operations. In the literature, techniques for storing sparse matrices have been intensively studied [3,8]. In our WUML environment, we store the navigation matrix as three separate parts: the first row (i.e., the weights of the links starting from S), the last column (i.e., the weights of the links ending in F), and the square matrix without the rows and columns of S and F. We employ two vectors, each consisting of an array of the non-zero values together with their corresponding indices, to store the first row and the last column. Tables 2 and 3 show examples using the matrix in Table 1. For the third part, we implement the storage schemes proposed in [8] and illustrate them using the matrix in Table 1. The Coordinate (COO) storage scheme is the most straightforward structure for representing a sparse matrix. As illustrated in Table 4, it records each non-zero entry together with its column and row index in three arrays, in row-first order.


Similar to COO, the Compressed Sparse Row (CSR) storage scheme also consists of three arrays. It differs from COO in its Compressed Row array, which stores the location of the first non-zero entry of each row. Table 5 shows the structure of CSR. The Compressed Sparse Column (CSC) storage scheme, shown in Table 6, is similar to CSR. It has three arrays: a Nonzero array holding the non-zero values in column-first order, a Compressed Column array holding the location of the first non-zero entry of each column, and a Row array for the row indices. CSC is CSR applied to the transposed matrix. There are also other sparse matrix storage schemes, such as Compressed Diagonal Storage (CDS) and Jagged Diagonal Storage (JDS) [12]; however, they are intended for banded sparse matrices. In practice, the navigation matrix is generally not banded, so these schemes are not studied in our experiments.
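To make the three layouts concrete, the sketch below builds the COO, CSR and CSC array triples for a small matrix; the array names mirror the description above, while the example matrix and the exact array conventions (e.g., the trailing sentinel in the compressed arrays) are our assumptions, since the paper's Table 1 is not reproduced here.

```python
# Build COO, CSR and CSC array triples for a tiny sparse matrix (example data is ours).
dense = [
    [0, 2, 0],
    [1, 0, 3],
    [0, 0, 4],
]

# COO: value, row index, column index for every non-zero entry (row-first order).
coo_val, coo_row, coo_col = [], [], []
for i, row in enumerate(dense):
    for j, v in enumerate(row):
        if v != 0:
            coo_val.append(v); coo_row.append(i); coo_col.append(j)

# CSR: values and column indices in row-first order, plus the offset of the
# first non-zero of each row (with a final sentinel), instead of per-entry rows.
csr_val, csr_col, csr_rowptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            csr_val.append(v); csr_col.append(j)
    csr_rowptr.append(len(csr_val))

# CSC is the same idea applied to the transposed matrix (column-first order).
transposed = [list(col) for col in zip(*dense)]
csc_val, csc_row, csc_colptr = [], [], [0]
for col in transposed:
    for i, v in enumerate(col):
        if v != 0:
            csc_val.append(v); csc_row.append(i)
    csc_colptr.append(len(csc_val))

print(coo_val, coo_row, coo_col)     # [2, 1, 3, 4] [0, 1, 1, 2] [1, 0, 2, 2]
print(csr_val, csr_col, csr_rowptr)  # [2, 1, 3, 4] [1, 0, 2, 2] [0, 1, 3, 4]
print(csc_val, csc_row, csc_colptr)  # [1, 2, 3, 4] [1, 0, 1, 2] [0, 1, 2, 4]
```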

6

Experimental Results and Analysis

We carried out a set of experiments to compare the performance of the three storage schemes introduced in Sect. 5. We also study the usability and efficiency of WUML on different data sets. The data sets we used are synthetic Web logs over different Web topologies, generated by the log generator designed in [10]. The parameters used to generate the log files are described in Table 7. Among these four parameters, PageNum and MeanLink depend on the underlying Web topology while the other two do not. The experiments were run on a machine with a 2.5 GHz Pentium 4 CPU and 1 GB of RAM.

6.1

Construction Time of Storage Schemes

We chose three data sets, each described by the parameters LogSize, UserNum, PageNum and MeanLink. We then construct the three storage schemes from the log files generated for these data sets.


Our measurement of the system response time includes both I/O processing time and CPU processing time. As shown in Fig. 3, the response time grows significantly as the parameters increase. Since most of the time is consumed in reading the log files, the construction time for a given data set varies only slightly among the three storage schemes. It still takes more time to construct COO than the other two schemes, since COO has no compressed array. CSC needs more time than CSR because the storage order in CSC is column-first, while the log file is read in row-first order.

Fig. 3. Construction Time

6.2

Processing Time of Binary Operators

Fig. 4. Running Four Operators

We present the CPU processing time results of four binary operators: sum, union, difference and intersection. Each time we tune one of the four parameters to see how the processing time changes under the COO, CSR and CSC storage schemes. For each parameter, we carry out experiments on ten different sets of Web logs. We first compare the processing time of each single operator under different storage schemes, and then present the processing time of each storage scheme under different operators. Tuning LogSize. We set UserNum and PageNum to 3000 and MeanLink to 5. The results are shown in Fig. 5. When LogSize increases, the processing time of each operator on every storage scheme also increases. The reason is that the number of non-zero elements in the navigation matrix grows with LogSize, and therefore the operations need more time. Tuning PageNum. We set LogSize to 5000, UserNum to 3000, and MeanLink to 5. The results are presented in Fig. 7. With the growth of PageNum, the CPU time for each operator on a given storage scheme grows quickly. This is because PageNum is a significant parameter when constructing the navigation matrix: the more pages in the Web site, the larger the dimension of the navigation matrix, and consequently the more time needed to construct it. Tuning UserNum. Figure 6 shows the results when LogSize is 5000, PageNum is 3000 and MeanLink is 5. The processing time remains almost unchanged when UserNum grows. The main reason is that, although different users may behave differently when traversing the Web site, the number of non-zero elements in the navigation matrix stays roughly the same due to the fixed LogSize.


Fig. 5. The CPU time by tuning LogSize

Tuning MeanLink. We use the log files with a LogSize of 5000, UserNum of 3000 and PageNum of 3000. The results are shown in Fig. 8, which indicates that the processing time decreases as MeanLink increases. Note that for sum, COO always outperforms the others, while CSR and CSC perform almost the same (see Figs. 5(a), 6(a), 7(a) and 8(a)). A similar phenomenon can be observed in Figs. 5(d), 6(d), 7(d) and 8(d) for intersection. As shown in Figs. 5(b), 6(b), 7(b) and 8(b), the processing time for union shows no significant difference across the three storage schemes. Finally, from Figs. 5(c), 6(c), 7(c) and 8(c), CSR and CSC perform much better than COO for difference. Note also that, as shown in Fig. 4, difference requires the most processing time, and sum the least. The Web logs used here have a LogSize of 5000, UserNum of 1000, PageNum of 3000 and MeanLink of 5. The reason for this result is as follows. As mentioned earlier, we do not need to check the balance of Web pages or the validity of the navigation matrix for sum; therefore, it takes the least time. For union, we only need to check the balance of Web pages without checking the validity of the output matrix. But for difference and intersection, we have to check both page balance and matrix validity, which is rather time-consuming. Intersection nevertheless does not need much time, since there are very few non-zero elements in the output matrix.


Fig. 6. The CPU time by tuning UserNum

6.3

Performance of Unary Operators

Power. We use log files with a LogSize of 5000, UserNum of 3000, MeanLink of 5, and PageNum of 100, 500 and 1000. Each matrix is raised to the power of 2. As shown in Fig. 9, COO performs much worse than CSR and CSC. We also see that power is a rather time-consuming operator. Projection and Selection. Since projection and selection are commutative, we study the time cost by swapping them on the navigation matrix with a LogSize of 5000, PageNum of 5000, UserNum of 3000 and MeanLink of 5. As shown in Fig. 10, doing projection before selection is more efficient than doing selection and then projection. According to this result, we can do some optimization when interpreting queries. Moreover, COO outperforms CSR and CSC. From the experimental results shown above, we make the following observations. First, from the construction point of view, CSR is the best. Second, COO is the best for sum and intersection. Third, CSR and CSC perform almost the same for difference and power, and greatly outperform COO. Finally, COO, CSR and CSC perform the same for union. Taking these observations into consideration, CSR is the best choice for our WUML expressions. Although COO performs better for sum and intersection, it needs an intolerable amount of time for difference. Although the performance of CSC matches that of CSR with respect to the operations, CSC needs more time to be constructed.


Fig. 7. The CPU time by tuning PageNum

We also observe that the time growth of each operator is linear in the growth of the parameters, which indicates that the usability and scalability of WUML are acceptable in practice.

7

Concluding Remarks

We presented NLA, which consists of a set of operators on navigation matrices, and proposed an efficient (in both space and time) algorithm, VALID, to ensure the validity of the output matrix of the NLA operators. Based on NLA, we developed a query language, WUML, and studied the mapping between WUML statements and NLA expressions. To choose an efficient storage scheme for the sparse navigation matrix, we carried out a set of experiments on different synthetic Web log data sets, generated by tuning parameters such as the number of pages, the mean number of links and the number of users. The experimental results on the three storage schemes COO, CSC and CSR show that the CSR scheme is relatively efficient for NLA. As future work, we plan to develop a full-fledged WUML system to perform both analysis and mining of real Web log data sets. We are also studying a more complete set of heuristic optimization rules for the NLA operators in order to generate better execution plans for input WUML expressions.


Fig. 8. The CPU time by tuning MeanLink

Fig. 9. Power (in log scale)

Fig. 10. Projection and Selection

References 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, 1994. 2. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of ICDE, 1995. 3. Nawaaz Ahmed et al. A framework for sparse matrix code synthesis from high-level specifications. In Proc. of the 2000 ACM/IEEE Conf. on Supercomputing, 2000. 4. M. Baglioni et al. Preprocessing and mining web log data for web personalization. The 8th Italian Conf. on AI, 2829:237–249, September 2003.


5. B. Berendt and M. Spiliopoulou. Analyzing navigation behavior in web sites integrating multiple information systems. The VLDB Journal, 9(1), 2000. 6. L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the world wide web. Journal of Artificial Intelligence Research, 27(6):1065–1073, 1995. 7. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. ICTAI, 1997. 8. N. Goharian, A. Jain, and Q. Sun. Comparative analysis of sparse matrix algorithms for information retrieval. Journal of Sys., Cyb. and Inf., 1(1), 2003. 9. Wen-Syan Li et al. Retrieving and organizing web pages by information unit. In Proc. of WWW, pages 230–244, 2001. 10. W. Lou. loggen: A generic random log generator: Design and implementation. Technical report, CS Department, HKUST, December 2001. 11. Wilfred Ng. Capturing the semantics of web log data by navigation matrices. Semantic Issues in E-Commerce Systems. Kluwer Academic, pages 155–170, 2003. 12. Y. Saad. Krylov subspace methods on supercomputers. SIAM J. of Sci. Stat. Comput., 10:1200–1232, 1989. 13. M. Spiliopoulou and L.C. Faulstich. WUM: A web utilization miner. In Proc. of EDBT Workshop, WebDB98, 1998. 14. J. Srivastava et al. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.

An Agent-Based Approach for Interleaved Composition and Execution of Web Services Xiaocong Fan, Karthikeyan Umapathy, John Yen, and Sandeep Purao School of Information Sciences and Technology The Pennsylvania State University University Park, PA 16802 {zfan,kxu110,jyen,spurao}@ist.psu.edu

Abstract. The emerging paradigm of web services promises to bring to distributed computing the same flexibility that the web has brought to the publication and search of information contained in documents. This new paradigm puts severe demands on the composition and execution of workflows that must survive and respond to changes in the computing and business environments. Workflows facilitated by web services must, therefore, allow dynamic composition in ways that cannot be predicted in advance. Utilizing the notions of shared mental models and proactive information exchange in agent teamwork research, we propose a solution that interleaves planning and execution in a distributed manner. This paper proposes a generic model, gives the mappings of terminology between Web services and team-based agents, describes a comprehensive architecture for realizing the approach, and demonstrates its usefulness with the help of an example. A key benefit of the approach is the proactive handling of failures that may be encountered during the execution of complex web services.

1

Introduction

The mandate for effective composition of web services comes from the need to support complex business processes. Web services allow a more granular specification of the tasks contained in workflows, and suggest the possibility of gracefully accommodating short-term trading relationships, which can be as brief as a single business transaction [1]. Facilitating such workflows requires dynamic composition of complex web services that must be monitored for successful execution. Drawing on research in workflow management systems [2], the realization of complex web services can be characterized by the following elements: (a) creation of an execution order of operations from the short-listed Web services; (b) enacting the execution of the services in the sequenced order; and (c) administering and monitoring the execution process. The web services composition problem has therefore been recognized to include both coordinating the sequence of service executions and managing the execution of the services as a unit [3]. Much current work in web service composition continues to focus on the first ingredient, i.e., the discovery of appropriate services and planning for the sequencing


and invocation of these web services [4]. Effective web services composition, however, must encompass concerns beyond the planning stage, including the ability to handle errors and exceptions that may arise in a distributed environment. Monitoring of the execution and exception handling for the web services must, therefore, be part of an effective strategy for web service composition [5]. One approach to realizing this strategy is to incorporate research on intelligent agents, in particular team-based agents [6]. The natural match between web services and intelligent agents - as modularized intelligence - has been alluded to by several researchers [7]. The objective of the research reported in this paper is to go a step further: to develop an approach for interleaved composition and execution of web services by incorporating ideas from research on team-based agents. In particular, an agent architecture is proposed such that agents can respond to environmental changes and adjust their behaviors proactively to achieve better QoS (quality of service).

2 Prior Work

2.1 Composition of Web Services

Web services are loosely coupled, dynamically locatable software components that provide a common, platform-independent framework simplifying heterogeneous application integration. Web services use a service-oriented architecture (SOA) and communicate over the web using XML messages. The standard technologies for implementing SOA operations include the Web Services Description Language (WSDL), Universal Description, Discovery and Integration (UDDI), the Simple Object Access Protocol (SOAP) [8], and the Business Process Execution Language for Web Services (BPEL4WS). Such function-oriented approaches have provided guidelines for planning web service compositions [4,9]. However, the technology to compose web services has not kept pace with the rapid growth and volatility of available opportunities [10]. While the composition of web services requires considerable effort, its benefit can be short-lived and may only support short-term partnerships that are formed during execution and disbanded on completion [10]. Web services composition can be conceived as a two-phase procedure involving planning and execution [11]. The planning phase includes determining the series of operations needed to accomplish the desired goals from the user query, customizing services, scheduling the execution of the composed services, and constructing a concrete and unambiguously defined composition of services ready to be executed. The execution phase involves collaborating with other services to attain the goals of the composed service. The overall process has been classified along several dimensions; the dimension most relevant to our discussion is pre-compiled vs. dynamic composition [12]. Compared with the pre-compilation approach, dynamic composition can better exploit the present state of services, provide runtime optimizations, and respond to changes in the business environment. On the other hand, dynamic composition of web services is a particularly difficult problem because of the continued need to provide high availability, reliability, and scalability in the face of the high degrees of autonomy and heterogeneity with which services are deployed and managed on the web [3].


The use of intelligent agents has been suggested to handle these challenges.

2.2

Intelligent Agents for Web Service Composition

There is increasing recognition that web services and intelligent agents represent a natural match. It has been argued that both represent a form of “modularized intelligence” [7]. The analogy has been carried further to articulate the ultimate challenge as the creation of effective frameworks, standards and software for automating web service discovery, execution, composition and interoperation on the web [13]. Following the discussion of web service composition above, the role of intelligent agents may be identified as on-demand planning and proactively responding to changes in the environment. In particular, planning techniques have been applied to web services composition. Kay et al. [14] describe the ATL Postmaster system, which uses agent-based collaboration for service composition. A drawback of the system is that the ATL Postmaster is not fault-tolerant: if a node fails, the agents residing in it are destroyed and state information is lost. Maamar et al. [15] propose a framework based on software agents for web services composition, but fail to tie their framework to web services standards; it is not clear how their framework would function with BPEL4WS and other web services standards, or how it would handle exceptions. Srivastava and Koehler [4], while discussing the use of planning approaches for web services composition, indicate that planning alone is not sufficient; useful solutions must consider failure handling as well as composition with multiple partners. Effective web service composition thus requires expertise regarding the available services as well as process decomposition knowledge. A particular flavor of intelligent agents, called team-based agents, allows this expertise to be distributed, making them a better fit for web services composition. Team-based agents are a special kind of intelligent agents with distributed expertise (knowledge) that emphasize cooperativeness and proactiveness in pursuing their common goals. Several computational models of teamwork have been proposed, including [16], STEAM [17] and CAST [6]. These models allow multiple agents to collaboratively solve complex problems (e.g., planning, task execution). In web services composition, team-based agents facilitate a distributed approach to dynamic composition, which can be scalable, can facilitate learning about specific types of services across multiple compositions, and allows proactive failure handling. In particular, the CAST (Collaborative Agents for Simulating Teamwork) architecture [6] offers a feasible solution for dynamic web services composition. Two key features of CAST are that (1) CAST agents work collaboratively using a shared mental model of the changing environment, and (2) CAST agents proactively inform each other of changes that they perceive, to handle any exceptions that arise in achieving a team goal. By collaboratively monitoring the progress of a shared process, a team of CAST agents can not only initiate helping behaviors proactively but can also adjust their own behaviors to the dynamically changing environment.


In the rest of this paper, we first propose a generic team-based agent framework for dynamic web-service composition, and then extend the existing CAST agent architecture to realize the framework.

3

A Methodology for Interleaved Composition and Execution

We illustrate the proposed methodology with the help of an example that demonstrates how team-based agents may help with dynamic web services composition. The example concerns dynamic service outsourcing in a virtual software development organization called ‘VOSoft’. VOSoft possesses expertise in designing and developing software packages for customers from a diversity of domains. It usually employs one of two development methodologies (or software processes): the prototype-based approach (Mp) is preferred for software systems composed of tightly coupled modules (integration problems are revealed earlier), and the unit-based approach (Mu) is preferred for software systems composed of loosely coupled modules (more efficient due to parallel tasks). Suppose a customer “WSClient” engages VOSoft to develop CAD design software for metal casting patterns. The software is required to (a) read AutoCAD drawings automatically, (b) develop designs for metal casting patterns, and (c) maintain all the designs and user details in a database. Based on its expertise, VOSoft designs the software as being composed of three modules: a database management system (DMS), CAD, and pattern design. Assume VOSoft’s core competency is developing the application logic required for designing metal casting patterns, but it cannot develop the CAD software and the database module. Hence, VOSoft needs to compose a process in which the DMS and CAD modules are outsourced to competent service providers. In this scenario, several possible exceptions may be envisioned; we list three below to illustrate their nature and source. First, non-performance by a service provider will result in a service failure exception, which may be resolved by locating another service to perform the task. Second, module integration exceptions may be raised if two modules cannot interact with each other; this may be resolved by adding tasks to develop common APIs for the two modules. Third, the customer may change or add new functionality, which may necessitate changing the entire process. It is clear that both internal changes (capabilities of web services) and external changes (objectives being pursued) can influence the planning and execution of composite web services in such scenarios. This calls for an approach that can monitor service execution and proactively handle service failures.

3.1

Composing with Team-Based Agents

A team-based agent A is defined in terms of (a) a set of capabilities (service names), (b) a list of service providers under its management,


and (c) an acquaintance model (a set of agents known to A, together with their respective capabilities). An agent in our framework may play multiple roles. First, every agent is a Web-service manager. An agent A knows which of its managed providers can offer a service S, or at least knows how to find a provider for S (e.g., by searching the UDDI registry) if none of its known providers is capable of performing the service. Such services are primitive to agent A in the sense that it can directly delegate them to appropriate service providers. Second, an agent becomes a service composer upon being requested to provide a complex service. An agent is responsible for composing a process using known services when it receives a user request that falls beyond its own capabilities. In such situations, its set of acquaintances forms a community of contacts available to agent A. The acquaintance model is dynamically modified based on the agent’s collaboration with other agents (e.g., assigning credit to those with successful collaborations). This additional, local knowledge supplements the global knowledge about publicly advertised web services (say, on the UDDI registry). Third, an agent becomes a team leader when it tries to form a team to honor a complex service.

Fig. 1. Team formation and Collaboration

3.2

Responding to Request for a Complex Service

An agent, upon receiving a complex service request, initiates a team formation process: (1) The agent (say, C) adopts “offering service S” as its persistent goal. (2) If S is within C’s capabilities, agent C simply delegates S to a competent provider (or first finds a service provider, if no provider known to C is competent).


(3) If S is beyond C’s capabilities (i.e., agent C cannot directly serve S), then C tries to compose a process (say, H) using its expertise and the services it knows of (i.e., it considers its own capabilities and the capabilities of the agents in its acquaintance model), and then starts to form a team: i. Agent C identifies candidate teammates by examining the agents in its acquaintance model that have the capability to contribute to the process (i.e., that can offer some of the services used in H). ii. Agent C chooses willing and competent agents from these candidates (e.g., using the contract-net protocol [18]) as teammates, and shares the process H with them with a view to jointly working on H as a team.

(4) If the previous step fails, then agent C either fails to honor the external request (and is penalized), or, if possible, proactively discovers a different agent (either through its acquaintance model or through UDDI) and delegates S to it; a sketch of this formation procedure follows the example below. [Example] Suppose agent VOSoft composes a top-level process as shown in Figure 1(a). In the process, the “contract” service is followed by a choice point, where VOSoft needs to decide which methodology (Mp or Mu) to choose. If Mu is chosen, then the services DMS-WS, CAD-WS and Pattern-WS are required; if Mp is chosen, then the services need to be refined further so that interactions between service providers can take place frequently during development, avoiding potential integration problems at later stages. Now, suppose VOSoft chooses branch Mu and manages to form a team including agents T1, T3 and VOSoft itself to collaboratively satisfy the user’s request. It is possible that agent T4 was asked but declined to join the team for some reason (e.g., lack of interest or capacity). After the team is formed, each agent’s responsibility is determined and mutually known. As the team leader, agent VOSoft is responsible for coordinating the others’ behavior towards their common goal and for making decisions at critical points (e.g., adjusting the process if a service fails). Agent T1 is responsible for service DMS-WS, and agent T3 is responsible for service CAD-WS. As service managers, T1 and T3 are responsible for choosing appropriate service providers for DMS-WS and CAD-WS, respectively.
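The sketch below restates this four-step formation procedure in Python under a simplified data model of our own; the class Agent, its fields, and the helper names compose_process and contract_net_select are assumptions, and the composition and selection steps are left as stubs rather than the paper's algorithms.

```python
# Hedged sketch of the team-formation procedure (steps 1-4 above).
# Data model and helper names are assumptions, not the paper's API.
class Agent:
    def __init__(self, name, capabilities, providers, acquaintances):
        self.name = name
        self.capabilities = set(capabilities)   # service names it can manage directly
        self.providers = providers              # service name -> provider id
        self.acquaintances = acquaintances      # list of other Agent objects

    def handle_request(self, service):
        # (1) adopt "offering service" as a persistent goal (kept implicit here)
        # (2) a primitive service is delegated to a known provider
        if service in self.capabilities:
            return ("delegate", self.providers.get(service, "discover-via-UDDI"))
        # (3) otherwise compose a process H and form a team around it
        process = self.compose_process(service)            # stub
        if process is not None:
            candidates = [a for a in self.acquaintances
                          if a.capabilities & set(process)]
            team = self.contract_net_select(candidates)     # stub, cf. [18]
            if team:
                return ("team", [a.name for a in team], process)
        # (4) fall back: try to delegate the whole service to another agent, or fail
        for a in self.acquaintances:
            if service in a.capabilities:
                return ("delegate-to-agent", a.name)
        return ("fail", service)

    def compose_process(self, service):
        # Placeholder: a real composer would plan a process from known services.
        return None

    def contract_net_select(self, candidates):
        # Placeholder: a real implementation would run the contract-net protocol.
        return candidates
```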

3.3

Collaborating in Service Execution

The sharing of high-level process enables agents in a team to perform proactive teamwork behaviors during service execution. Proactive Service Discovery: Knowing the joint responsibility of the team and individual responsibility of team members, one agent can help another find web services. For example, in Figure 1(b), agent T1 is responsible for contributing service D-design. Agent T3, who happened to identify a service provider for service D-design while interacting with the external world, can proactively inform T1 about the provider. This can not only improve T1’s competency regarding service D-design, but also can enhance T3’s credibility in T1’s acquaintance model.


Proactive Service Delegation: An agent can proactively contract out services to competent teammates. For example, suppose branch Mu is selected and service CAD-WS is a complex service for T3, who has composed a process for CAD-WS as shown in Figure 1(b). Even though T3 can perform C-design and C-code, services C-test and C-debug are beyond its capabilities. In order to provide the committed service CAD-WS, T3 can proactively form another team and delegate these services to the recruited agents (i.e., T6). It might be argued that agent VOSoft could have generated a high-level process with a more detailed decomposition, say, with the sub-process generated by T3 for CAD-WS embedded (in place of CAD-WS) as part of the process generated by VOSoft. In that case, T6 would have been recruited as VOSoft’s teammate and no delegation would be needed. However, the ability to derive a process at all decomposition levels is too stringent a requirement to place on any single agent. One benefit of using agent teams is that an agent can leverage the knowledge (expertise) distributed among team members even though each of them has only limited resources. Proactive Information Delivery: Proactive information delivery can occur in various situations. (i) A complex process may have critical choice points where several branches are specified, but which one will be selected depends on the known state of the external environment; teammates can therefore proactively inform the team leader about those changes in state that are relevant to its decision-making. (ii) Upon a decision being made, the other teammates are informed of it so that they can better anticipate potential collaboration needs. (iii) A web service (say, the service Test in branch Mu) may fail for many reasons; the responsible agent can proactively report the service failure to the leader so that the leader can decide how to respond: choose an alternative branch (say, Mp), or request the responsible agent to re-attempt the service with another provider.

4

The CAST-WS Architecture

We have designed a team-based agent architecture CAST-WS (Collaborative Agents for Simulating Teamwork among Web Services) to realize our methodology (see Figure 2). In the following, we describe the components of CAST-WS and explain their relationships.

4.1

Core Representation Decisions

The core representation decisions that drive the architecture involve mapping concepts from team-based agents to the composition and execution of complex web services, with an underlying representation that may be common to both domains. Such a representation is found in Petri nets [19]. The CAST architecture utilizes hierarchical predicate-transition nets as the underlying representation for specifying plans created and shared among agents. In the web service domain, BPEL4WS, the dominant standard for specifying compositions, can also be interpreted based on a broad reading of the Petri net formalism.


Fig. 2. The CAST-WS Architecture

Another key building block for realizing complex web services, protocols for conversations among web services [20], uses state-based representations that may be mapped to Petri-net based models for specifying conversation states and their evolution. As a conceptual model, therefore, a control-oriented representation of workflows, complex web services, and conversations can share the Petri-net structure, with the semantics provided by each of the domains. The mapping between team-based agents and complex web services is summarized in Table 1 below.

Following this mapping, we have devised the following components of the architecture: service planning (i.e., composing complex web services), team coordination (i.e., knowledge sharing among web services), and execution (i.e., realizing the execution of complex web services).

4.2

WS-Planning Component

The planning component is responsible for composing services and forming teams. It includes three modules. The service discovery module is used by the service planner to look up required services in the UDDI registry.


The team formation module, together with the acquaintance model, is used to find team agents who can support the required services. A web service composition starts from a user’s request. The agent that receives the request is the composer agent, which is in charge of fulfilling the request. Upon receiving a request, the composer agent turns the request into its persistent goal and invokes its service planner module to generate a business process for it. The architecture uses hierarchical predicate-transition nets (PrT nets) to represent and monitor business processes. A PrT net consists of the following components: (1) a set P of token places for controlling the process execution; (2) a set T of transitions, each of which represents either an abstraction of a sub PrT net (i.e., an invocation of some sub-plan) or an operation (e.g., a primitive web service); a transition is associated with preconditions (predicates), which specify the conditions for continuing the process; (3) a set of arcs over P × T that describes the order of execution that the team will follow; and (4) a labeling function on arcs, whose labels are tuples of agents and bindings for variables.

The services used by the service planner for composing a process come from two sources: the UDDI directory and the acquaintance model. Assume that from the requested service we can derive a set of expected effects, which become the goals to be achieved by the CAST agents. Given any set of goals G, a partial order (binary relation) can be defined over G; with respect to this order, each goal has a pre-set and a post-set, two goals are independent if neither is ordered before the other, and, informally, a goal is indetachable from G if it cannot be separated from the remaining goals under the order. The following algorithm is used by the service planner to generate a Petri-net process for a given goal (service request).
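As a rough illustration (not the paper's algorithm), the sketch below derives a coarse sequential/parallel process structure from a set of goals and a precedence relation; the goal representation, the function name plan, and the output format are assumptions introduced for the example.

```python
# Hedged sketch: derive a coarse process structure from goals and a precedence
# relation. This is not the paper's algorithm; it only illustrates the idea of
# turning a partially ordered goal set into sequential and parallel blocks.

def plan(goals, precedes):
    """goals: iterable of goal names; precedes: set of (g1, g2) meaning g1 before g2."""
    remaining = set(goals)
    blocks = []
    while remaining:
        # Goals with no unsatisfied predecessor can run in parallel (an AND-split).
        ready = {g for g in remaining
                 if not any((h, g) in precedes for h in remaining if h != g)}
        if not ready:
            raise ValueError("precedence relation is cyclic")
        blocks.append(("parallel", sorted(ready)) if len(ready) > 1
                      else ("task", next(iter(ready))))
        remaining -= ready
    return ("sequence", blocks)

# Example: DMS and CAD development can proceed in parallel before integration.
print(plan(["contract", "DMS-WS", "CAD-WS", "integrate"],
           {("contract", "DMS-WS"), ("contract", "CAD-WS"),
            ("DMS-WS", "integrate"), ("CAD-WS", "integrate")}))
```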


4.3


The Team Coordination Component

The team coordination component is used to coordinate with other agents during service execution. It includes an inference engine with a built-in knowledge base, a process shared by all team members, a PrT interpreter, a plan adjustor, and an inter-agent coordination module. The knowledge base holds the (accumulated) expertise needed for service composition. The inter-agent coordination module, embedded with team coordination strategies and conversation policies [21], is used for behavior collaboration among teammates. Here we mainly focus on the process interpreter and the plan adjustor. Each agent in a team uses its PrT net interpreter to interpret the business process generated by its service planner, monitor the progress of the shared process, and take its turn to perform the tasks assigned to it. If an assigned task is a primitive web service, the agent invokes the service through its BPEL4WS process controller. If a task is assigned to multiple agents, the responsible agents coordinate their behavior (e.g., so as not to compete for common resources) through the inter-agent coordination module. If an agent faces an unassigned task, it evaluates the constraints associated with the task and tries to find a competent teammate for it. If an assigned task is a complex service (i.e., further decomposition is required) that is beyond its capabilities, the agent treats it as an internal request, composes a sub-process for the task, and forms another team to solve it. The plan adjustor uses the knowledge base and the inference engine to adjust and repair the process whenever an exception or a need for change arises. The algorithm used by the plan adjustor builds on the failure handling policy implemented in CAST. Due to the hierarchical organization of the team process, each CAST agent maintains a stack of active processes and sub-processes. A sub-process returns control to its parent process when its execution is completed.


Failure handling is interleaved with (abstract) service execution: execute a service; check its termination conditions; handle failures; and propagate failures to the parent process if needed. The algorithm distinguishes four termination modes resulting from a service execution. The first (return 0) indicates that the service completed successfully. The second (return 1) indicates that the process terminated abnormally but the expected effects of the service have already been achieved “magically” (e.g., by proactive help from teammates). The third (return 2) indicates that the process is not completed and is likely at an impasse; in this case, if the current service is just one alternative of a choice point, another alternative can be selected to re-attempt the service, otherwise the failure is propagated to the upper level. The fourth (return 3) indicates that the process terminated because the service has become irrelevant; this may happen if the goal or context changes, and in this case the irrelevance is propagated to the parent service, which checks its own relevance.
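A hedged sketch of a plan adjustor that follows the four termination modes above is given below; it is not the paper's algorithm, and the nested task/sequence/choice process representation and the function names are our assumptions.

```python
# Termination modes, as described above (names are ours).
DONE, ACHIEVED, IMPASSE, IRRELEVANT = 0, 1, 2, 3

def adjust(node, execute, still_relevant):
    """Execute a (sub-)process node and apply the four-way failure-handling policy.

    node: ("task", service) | ("sequence", [children]) | ("choice", [alternatives])
    execute(service) -> one of DONE, ACHIEVED, IMPASSE, IRRELEVANT
    still_relevant(node) -> bool, re-checked when irrelevance is propagated up
    """
    kind = node[0]
    if kind == "task":
        return execute(node[1])
    if kind == "sequence":
        for child in node[1]:
            result = adjust(child, execute, still_relevant)
            if result == IRRELEVANT and not still_relevant(node):
                return IRRELEVANT        # parent itself is no longer relevant
            if result == IMPASSE:
                return IMPASSE           # propagate the failure to the upper level
            # DONE, ACHIEVED, or a skipped irrelevant child: continue the sequence.
        return DONE
    if kind == "choice":
        for alternative in node[1]:
            result = adjust(alternative, execute, still_relevant)
            if result in (DONE, ACHIEVED):
                return DONE
            # On IMPASSE or IRRELEVANT, try the next alternative instead of failing.
        return IMPASSE
    raise ValueError("unknown node kind: %r" % kind)
```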

4.4

The WS-Execution Component

A service manager agent executes primitive services (or a process of primitive services) through the WS-Execution component, which consists of a commitment manager, a capability manager, a BPEL4WS process controller, an active process, and a failure detector. The capability manager maps services to known service providers. The commitment manager schedules the services assigned to the agent in an appropriate order. An agent ultimately needs to delegate the contracted services to appropriate service providers. The process controller generates a BPEL4WS process based on the WSDL of the selected service providers and the sequence indicated in the PrT process. The failure detector identifies execution failures by checking the termination conditions associated with the services.


Fig. 3. The relations between generated team process and other modules

If a termination condition has been reached, the failure detector throws an error and the plan adjustor module is invoked. If it is a service failure, the plan adjustor simply asks the agent to choose another service provider and re-attempt the service; if it is a process failure (unexpected changes make the process unworkable), the plan adjustor backtracks the PrT process, tries to find another (sub-)process that would satisfy the task, and uses it to fix the one that failed.

4.5

The Example Revisited

Figure 3 shows how web service composition for VOSoft may be performed with interleaved planning and execution. The figure shows the core (hierarchical) Petri net representation used by the CAST architecture, and the manner in which each of the modules in the architecture uses this representation. Due to the dynamic nature of the process, it is not feasible to show all possible paths that the execution may take. Instead, we show one plausible path, indicating the responsibilities of each of the modules in the architecture, such as planning, team formation, undertaking execution, sensing changes in the internal or external environment (which may lead to exceptions), proactive information sharing, and how these allow adapting the process to changes in the environment (proactive exception handling). The result is an interleaved process that includes planning and execution. The figure also shows the mapping to elements of the web service technology stack (e.g., the BPEL4WS specification), which allows the proposed architecture to be used with current proposals from W3C.

5

Discussion

As business processes, specified as workflows and executed with web services, need to be adaptive and flexible, approaches are needed to facilitate this evolution. The methodology and architecture we have outlined address this concern by pushing the burden of ensuring this flexibility onto the web services participating in the process. To achieve this, we have adapted and extended research in the area of team-based agents. A key consequence of this choice is that our approach allows interleaving of execution with planning, providing several distinct advantages over current web service composition approaches for facilitating adaptive workflows. First, it supports an adaptive process that is suitable for the highly dynamic and distributed manner in which web services are deployed and used. The specification of a joint goal allows each team member to contribute relevant information to the composer agent, who can make decisions at critical choice points. Second, it elicits a hierarchical methodology for process management in which a service composer can compose a process at a coarse level appropriate to its capability and knowledge, leaving further decomposition to competent teammates. Third, it interleaves planning with execution, providing a natural vehicle for implementing adaptive workflows. Our work in this direction so far has provided us with the fundamental insight that further progress in effective and efficient web service composition can be made by better understanding how distributed and partial knowledge about the availability and capabilities of web services, and the environment in which they are expected to operate, can be shared among the team of agents that must collaborate to perform the composed web service. Our current work involves extending these ideas to address these opportunities and concerns, and reflecting the outcomes in the ongoing implementation.

References 1. Heuvel, v.d., Maamar, Z.: Moving toward a framework to compose intelligent web services. Communications of the ACM 46 (2003) 103–109 2. Allen, R.: Workflow: An introduction. In Fisher, L., ed.: The Workflow Handbook 2001. Workflow Management Coalition (2001) 15–38 3. Pires, P., Benevides, M., Mattoso, M.: Building reliable web services compositions. In: Web, Web-Services, and Database Systems 2002. LNCS-2593. Springer (2003) 59–72 4. Koehler, J., Srivastava, B.: Web service composition: Current solutions and open problems. In: ICAPS 2003 Workshop on Planning for Web Services. (2003) 28–35 5. Oberleitner, J., Dustdar, S.: Workflow-based composition and testing of combined e-services and components. Technical Report TUV-1841-2003-25, Vienna University of Technology, Austria (2003)


6. Yen, J., Yin, J., Ioerger, T., Miller, M., Xu, D., Volz, R.: CAST: Collaborative agents for simulating teamworks. In: Proceedings of IJCAI’2001. (2001) 1135–1142 7. Bernard, B.: Agents in the world of active web-services. Digital Cities (2001) 343– 356 8. Manes, A.T.: Web services: A manager’s guide. Addison-Wesley Information Technology Series (2003) 47–82 9. Casati, F., Shan, M.C.: Dynamic and adaptive composition of e-services. Information Systems 26 (2001) 143–163 10. Sheng, Q., Benatallah, B., Dumas, M., Mak, E.: SELF-SERV: A platform for rapid composition of web services in a peer-to-peer environment. In: Demo Session of the 28th Intl. Conf. on Very Large Databases. (2002) 11. McIlraith, S., Son, T.C.: Adopting Golog for composition of semantic web services. In: Proceedings of the International Conference on knowledge representation and Reasoning (KR2002). (2002) 482–493 12. Chakraborty, D., Joshi, A.: Dynamic service composition: State-of-the-art and research directions. Technical Report TR-CS-01-19, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore, USA (2001) 13. Ermolayev, V.: Towards cooperative distributed service composition on the semantic web. Talks at Informatics Colloquium (2003) 14. Kay, J., Etzl, J., Rao, G., Thies, J.: The ATL postmaster: a system for agent collaboration and information dissemination. In: Proceedings of the second international conference on Autonomous agents, ACM (1998) 15. Maamar, Z., Sheng, Q., Benatallah, B.: Interleaving web services composition and execution using software agents and delegation. In: AAMAS’03 Workshop on web Services and Agent-based Engineering. (2003) 16. Jennings, N.R.: Controlling cooperative problem solving in industrial multi-agent systems using joint intentions. Artificial Intelligence 75 (1995) 195–240 17. Tambe, M.: Towards flexible teamwork. Journal of Artificial Intelligence Research 7 (1997) 83–124 18. Smith, R.G.: The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers 29 (1980) 1104–1113 19. van der Aalst, W., vanHee, K.: Workflow Management: Models, Methods, and Systems. MIT Press (2002) 20. Hanson, J.E., Nandi, P., Kumaran, S.: Conversation support for business process integration. In: Proc. of the IEEE International Enterprise Distributed Object Computing Conference. (2002) 65–74 21. Umapathy, K., Purao, S., Sugumaran, V.: Facilitating conversations among web services as speech-act based discourses. In: Proceedings of the Workshop on Information Technologies and Systems. (2003) 85–90

A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows

San-Yih Hwang 1,2, Haojun Wang 2, Jaideep Srivastava 2, and Raymond A. Paul 3

1 Department of Information Management, National Sun Yat-sen University, Kaohsiung 80424, Taiwan
2 Department of Computer Science, University of Minnesota, Minneapolis 55455, USA {haojun,srivasta}@cs.umn.edu
3 Department of Defense, United States

Abstract. Web services promise to become a key enabling technology for B2B e-commerce. Several languages have been proposed to compose Web services into workflows. The QoS of Web services-based workflows may play an essential role in choosing constituent Web services and determining service level agreements with their users. In this paper, we identify a set of QoS metrics in the context of Web services and propose a unified probabilistic model for describing the QoS values of (atomic or composite) Web services. In our model, each QoS measure of a Web service is regarded as a discrete random variable with a probability mass function (PMF). We describe a computation framework to derive the QoS values of a Web services-based workflow. Two algorithms are proposed to reduce the sample space size when combining PMFs. The experimental results show that our computation framework is efficient and results in PMFs that are very close to the real model.

1 Introduction

Web services have become a de facto standard for achieving interoperability among business applications over the Internet. In a nutshell, a Web service can be regarded as an abstract data type that comprises a set of operations and data (or message types). Requests to and responses from Web service operations are transmitted through SOAP (Simple Object Access Protocol), which provides XML-based message delivery over an HTTP connection. The existing SOAP protocol uses synchronous RPC for invoking operations in Web services. However, in response to an increasing need to facilitate long-running activities, new proposals have been made to extend SOAP to allow asynchronous message exchange (i.e., requests and responses are not synchronous). One notable proposal is ASAP (Asynchronous Service Access Protocol) [1], which allows the execution of long-running Web service operations,

San-Yih Hwang was supported in part by Fulbright Scholarship. Haojun Wang was supported in part by the NSF under grant ISS-0308264.



and also non-blocking Web services invocation, in a less reliable environment (e.g., wireless networks). In the following discussion, we use the term Web service to refer to an atomic activity, which may encompass either a single Web service operation (in the case of asynchronous Web services) or a pair of invoke/respond operations (in the case of synchronous Web services), and the term WS-workflow to refer to a workflow composed of a set of Web service invocations threaded into a directed graph. Several languages have been proposed to compose Web services into workflows. Notable examples include WSFL (Web Service Flow Language) [13] and XLANG (Web Services for Business Process Design) [16]. The ideas of WSFL and XLANG have converged and been superseded by the BPEL4WS (Business Process Execution Language for Web Services) specification [2]. Such Web services-based workflows may subsequently become (composite) Web services themselves, thereby enabling nested Web services workflows (WS-workflows). While the syntactic description of Web services can be specified through WSDL (Web Service Description Language), their semantics and quality of service (QoS) are left unspecified. The concept of QoS has been introduced and extensively studied in computer networks, multimedia systems, and real-time systems. QoS was mainly considered as an overload management problem that measures non-functional aspects of the target system, such as timeliness (e.g., message delay ratio) and completeness (e.g., message drop percentage). More recently, the concept of QoS is finding its way into application specification, especially in describing the level of service provided by a server. Typical QoS metrics at the application level include throughput, response time, cost, reliability, fidelity, etc. [12]. Some work has been devoted to the specification and estimation of workflow QoS [3, 7]. However, previous work in workflow QoS estimation either focused on the static case (e.g., computing the average or worst-case QoS values) or relied on simulation to compute workflow QoS in a broader context. While the former has limited applicability, the latter requires substantial computation before reaching stable results. In this paper, we propose a probability-based QoS model for Web services and WS-workflows that allows for efficient and accurate QoS estimation. Such an estimation serves as the basis for dealing with the Web services selection problem [11] and the service level agreement (SLA) specification problem [6]. The main contributions of our research are: 1. We identify a set of QoS metrics tailored for Web services and WS-workflows and give an anatomy of these metrics. 2. We propose a probability-based WS-workflow QoS model and its computation framework. This computation framework can be used to compute the QoS of a complete or partial WS-workflow. 3. We explore alternative algorithms for computing the probability distribution functions of WS-workflow QoS. The efficiency and accuracy of these algorithms are compared.

This paper is organized as follows. In Section 2 we define the QoS model in the context of WS-workflows. In Section 3 we present the QoS computation framework for WS-workflows. In Section 4 we describe algorithms for efficiently computing the


QoS values of a WS-workflow. Section 5 presents preliminary results of our performance evaluation. Section 6 reviews related work. Finally, Section 7 concludes this paper and identifies directions for future research.

2 QoS Model for Web Services

2.1 Web Services QoS Metrics

Many workflow-related QoS metrics have been proposed in the literature [3, 7, 9, 11, 12]. Typical categories of QoS metrics include performance (e.g., response time and throughput), resources (e.g., cost, memory/cpu/bandwidth consumption), dependability (e.g., reliability, availability, and time to repair), fidelity, transactional properties (e.g., ACID properties and commit protocols), and security (e.g., confidentiality, non-repudiation, and encryption). Some of the proposed metrics are related to the system capacity for executing a WS-workflow. For example, metrics used to measure the power of servers, such as throughput, memory/cpu/bandwidth consumption, time to repair (TTR), and availability, fall into the category called system-level QoS. However, the capacities of servers for executing Web services (e.g., manpower for manual activities and computing power for automatic activities) are unlikely to be revealed due to autonomy considerations, and may change over time without notification. These metrics might be useful in some workflow contexts, such as intra-organizational workflows (for determining the amount of resources to spend on executing workflows). For inter-organizational workflows, where a needed Web service may be controlled by another organization, QoS metrics in this category generally cannot be measured, and they are thus excluded from further discussion. Another group of QoS metrics requires all instances of the same Web service to share the same values. In this case, it is better to view these metrics as service classes rather than quality of service. Metrics of service class include those categorized as transactional properties and security. In this paper we focus on those WS-workflow QoS metrics that measure a WS-workflow instance and whose values may change across instances. These metrics, called instance-level QoS metrics, include response time, cost, reliability, and fidelity rating. Note that cost is a complicated metric and could be a function of the service class and/or other QoS values. For example, a Web service instance that imposes weaker security requirements or incurs a longer execution time might be entitled to a lower cost. Some services may adopt a different pricing scheme that charges based on factors other than usage (e.g., a membership fee or monthly fee). In this paper, we consider the pay-per-service pricing scheme, which allows us to include cost as an instance-level QoS metric. In summary, our work considers four metrics: response time (i.e., the time elapsed from the submission of a request to the receipt of the response), reliability (i.e., the probability that the service can be successfully completed), fidelity (i.e., reputation rating) and cost (i.e., the amount of money paid for executing an activity), which are equally applicable to both atomic Web services and WS-workflows (also called composite Web services). These QoS metrics are defined such that different instances of the same Web service may have different QoS values.


2.2 Probabilistic Modeling of Web Services QoS

We use a probability model for describing Web service QoS. In particular, we use a probability mass function (PMF) on a finite scalar domain as the QoS probability model. In other words, each QoS metric of a Web service is viewed as a discrete random variable, and the PMF indicates the probability that the QoS metric assumes a particular value. For example, the fidelity F of an example Web service with five grades (1-5) may have a PMF assigning a probability to each grade. Note that it is natural to describe reliability, fidelity rating and cost as random variables and to model them as PMFs with domains being {0 (fail), 1 (success)}, a set of distinct ratings, and a set of possible costs respectively. However, it is less intuitive to use a PMF for describing response time, whose domain is inherently continuous. By viewing response time at a coarser granularity, it is possible to model response time as a discrete random variable. Specifically, we partition the range of response time into a finite sequence of sub-intervals and use a representative number (e.g., the mean) to indicate each sub-interval. For example, suppose that the probabilities of a Web service being completed in one day, two to four days, and five to seven days are 0.2, 0.6, and 0.2, respectively. The PMF of its response time X is then P(X=1) = 0.2, P(X=3) = 0.6 (3 is the mean of [2, 4]), and P(X=6) = 0.2 (6 is the mean of [5, 7]). As expected, finer granularity on response time will yield more accurate estimation with higher overhead in representation and computation. We explore these tradeoffs in our experiments.
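To make the representation concrete, here is a minimal sketch (ours, not the authors' implementation; the names are illustrative) of a QoS metric stored as a finite PMF, using the discretised response-time example above.

```python
# A QoS metric modelled as a finite PMF: a mapping from scalar values to
# probabilities that sum to 1. Response time uses interval representatives.
response_time = {1: 0.2, 3: 0.6, 6: 0.2}   # days; 3 and 6 are interval means

def check(pmf, tol=1e-9):
    """Sanity check: probabilities are non-negative and sum to 1."""
    assert all(p >= 0 for p in pmf.values())
    assert abs(sum(pmf.values()) - 1.0) < tol

def mean(pmf):
    """Expected value of the metric (useful when a single number is reported)."""
    return sum(value * prob for value, prob in pmf.items())

check(response_time)
print(mean(response_time))   # 3.2 days on average
```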

2.3 WS-Workflow Composition

For an atomic Web service, its QoS PMFs can be derived from its past records of invocations. For a newly developed WS-workflow that is composed of a set of atomic Web services, we need a way to determine its QoS PMFs. Different workflow composition languages may provide different constructs for specifying the control flow among constituent activities (e.g., see [14, 15] for a comparison of the expressive powers of various workflow and Web services composition languages). Kiepuszewski et al. [8] define a structured workflow model that consists of only four constructs: sequential, or-split/or-join, and-split/and-join, and loop, which allows for recursive construction of larger workflows. Although it has been shown that this structured workflow model is unable to model arbitrary workflows [8], it is nevertheless powerful enough to describe many real-world workflows. In fact, some commercial workflow systems support only structured workflows, such as SAP R/3 and FileNet Visual WorkFlo. In this paper, as an initial step of the study, we focus our attention on structured workflows. To distinguish between exclusive or and (multiple choice) or, which is crucial in deriving WS-workflow QoS, we extend the structured workflow model to include five constructs:


1. sequential: a sequence of activities.
2. parallel (and-split/and-join): multiple activities that can be concurrently executed and merged with synchronization.
3. conditional (exclusive split/exclusive join): multiple activities among which only one activity can be executed.
4. fault-tolerant (and-split/exclusive join): multiple activities that can be concurrently executed but merged without synchronization.
5. loop: a block of activities guarded by a condition "LC". Here we adopt the while loop in the following discussion.

3 Computing QoS Values of WS Compositions

We now describe how to compute the WS-workflow QoS values for each composition construct introduced earlier. We identify five basic operations for manipulating random variables, namely (i) addition, (ii) multiplication, (iii) maximum, (iv) minimum, and (v) conditional selection. Each of these operations takes as input a number of random variables characterized by PMFs and produces a random variable characterized by another PMF. The first four operations are quite straightforward, and their detailed descriptions are omitted here due to space limitations. For their formal definitions, interested readers are referred to [5]. The conditional selection, denoted CS, is defined as follows1. Let X1, ..., Xn be n random variables, with pi being the probability that Xi is selected by the conditional selection operation CS. Note that the selection of any random variable is exclusive, i.e., exactly one of them is selected. The result of CS((X1, p1), ..., (Xn, pn)) is a new random variable Z. Specifically, the PMF of Z is the mixture P(Z = z) = p1·P(X1 = z) + ... + pn·P(Xn = z).
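The following sketch (our illustration in Python, not the authors' code) implements these operations on dict-based PMFs as used in the previous sketch: a generic combination of two independent metrics under a binary operation (addition, multiplication, maximum or minimum), and the conditional selection as the mixture defined above.

```python
from collections import defaultdict
from operator import add

def combine(op, x, y):
    """PMF of op(X, Y) for independent X, Y given as dicts value -> probability.
    op may be addition, multiplication, max or min."""
    z = defaultdict(float)
    for xv, xp in x.items():
        for yv, yp in y.items():
            z[op(xv, yv)] += xp * yp
    return dict(z)

def conditional_selection(branches):
    """branches: list of (pmf, selection probability); exactly one branch is
    selected, so the result is the probability mixture of the branch PMFs."""
    z = defaultdict(float)
    for pmf, p in branches:
        for v, vp in pmf.items():
            z[v] += p * vp
    return dict(z)

# Example: response times of two activities combined sequentially, in parallel,
# and as an exclusive (conditional) choice taken with probabilities 0.4 / 0.6.
t1, t2 = {1: 0.5, 2: 0.5}, {2: 0.3, 4: 0.7}
seq_time = combine(add, t1, t2)    # sum of the two times
par_time = combine(max, t1, t2)    # the slower of the two branches
cond_time = conditional_selection([(t1, 0.4), (t2, 0.6)])
```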

For each activity a, we consider four QoS metrics, namely response time, cost, reliability, and fidelity, denoted T(a), C(a), R(a), and F(a) respectively2. A WS-workflow w is composed of activities using some composition construct, and the QoS values of w under the various composition constructs are shown in Table 1. We assume that the fidelity of w using sequential or parallel composition is a weighted sum of the fidelities of its constituent activities. The fidelity weight of each activity can be either manually assigned by the designer, or automatically derived from past history, e.g. by using linear regression. For the conditional construct, exactly one activity will be selected at run time; thus, the fidelity of w is the conditional selection of the fidelities of its constituent activities with the associated probabilities. For the fault-tolerant construct, the fidelity of the activity that is the first to complete becomes the fidelity of w; the fidelity of w is therefore the conditional selection over the constituent fidelities, each weighted by the probability that its activity completes first.

1 Be careful not to confuse the conditional selection with the weighted sum. The weighted sum results in a random variable whose domain may not be the union of the domains of the constituent activities. While the weighted sum is used for computing the average value of a set of scalar values, it should not be used to compute the PMF resulting from the conditional selection of a set of random variables.
2 Note that each QoS metric of an activity is NOT a scalar value but a discrete random variable characterized by a PMF.

A loop construct is defined as a repetition of a block guarded by a condition "LC", i.e., the block is repetitively executed until the condition "LC" no longer holds. Cardoso et al. assumed a geometric distribution on the number of iterations [3]. However, the memoryless property of the geometric distribution fails to capture a common phenomenon, namely that a repeated execution of a block usually has a better chance to exit the loop. Gillmann et al. [7] assumed the number of iterations to be uniformly distributed, which again may not hold in many applications. In this paper, rather than assuming a particular distribution, we simply regard the number of iterations as a PMF with a finite scalar domain. Consider a loop structure L defined on a block a, and let the PMF of its number of iterations range over 0, 1, ..., c, where c is the maximum number of iterations. Let T(a), C(a), R(a), F(a) denote the PMFs of the response time, cost, reliability, and fidelity of a respectively. If a is executed l times, the response time is the sum of l independent copies of T(a). The response time of L is therefore the conditional selection over these l-fold sums, with the probabilities given by the iteration-count PMF. Similar arguments can be applied to the computation of cost and reliability. Regarding fidelity, when a is executed at least once, the fidelity of the loop structure is, in our view, determined simply by the last execution of a. The fidelity of L is therefore the conditional selection between the fidelity of a (taken with the probability of executing at least one iteration) and the fidelity of a non-executed block (taken with the remaining probability).
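Reusing combine and conditional_selection from the sketch above, the loop rule for response time can be written as a conditional selection over the l-fold sums of T(a), one branch per possible iteration count. This is our hedged reading of the construction, not the authors' code.

```python
from operator import add

def repeat_sum(pmf, l):
    """PMF of the sum of l independent executions of the block (l = 0 gives 0)."""
    result = {0: 1.0}
    for _ in range(l):
        result = combine(add, result, pmf)
    return result

def loop_response_time(t_block, iterations_pmf):
    """Response time of a loop: select the l-fold sum of the block's response
    time with the probability that the loop body runs exactly l times."""
    branches = [(repeat_sum(t_block, l), p) for l, p in iterations_pmf.items()]
    return conditional_selection(branches)

# Example: the block takes 1 or 2 days; the loop runs 1, 2 or 3 times.
print(loop_response_time({1: 0.7, 2: 0.3}, {1: 0.5, 2: 0.3, 3: 0.2}))
```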

4 Efficient Computation of WS-Workflow QoS

4.1 High Level Algorithm

A structured WS-workflow can be recursively constructed by using the five basic constructs. Figure 1 shows an example WS-workflow, namely PC order fulfillment. This WS-workflow is designed to tailor-make and to deliver personal computers at a customer's request. At the highest level, the WS-workflow is a sequential construct that consists of Parts procurement, Assembly, Test, Adjustment, Shipping, and Customer notification.

Fig. 1. An example WS-workflow PC order fulfillment

Parts procurement is a parallel construct that comprises CPU procurement, HDD procurement, and CD-ROM procurement. CPU procurement in turn is a conditional construct composed of Intel CPU procurement and AMD CPU procurement. Adjustment is a loop construct on Fix&Test, which is iteratively executed until the quality of the PC is ensured. Customer notification is a fault-tolerant construct that consists of Email notification and Phone notification. The success of either notification marks the completion of the entire WS-workflow.

The QoS of the entire WS-workflow can be recursively computed; the pseudocode is listed in Figure 2. Note that SequentialQoS(A.activities), ParallelQoS(A.activities), ConditionalQoS(A.activities), FaultTolerantQoS(A.activities), and LoopQoS(A.activities) are used to compute the four QoS metric values for the sequential, parallel, conditional, fault-tolerant, and loop constructs respectively. Their pseudo codes are quite clear from our discussion in Section 3 and are omitted here for brevity.

Fig. 2. Pseudo code for computing QoS of a WS-workflow
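Since the pseudocode of Figure 2 is not reproduced here, the sketch below gives one possible reading of the recursive computation, restricted to the response-time metric and reusing the PMF helpers from the earlier sketches; the tree encoding and field names are ours, not the paper's.

```python
from operator import add

def reduce_pmfs(op, pmfs):
    """Fold a list of PMFs with a pairwise combination operation."""
    result = pmfs[0]
    for pmf in pmfs[1:]:
        result = combine(op, result, pmf)
    return result

def response_time(node):
    """Recursively compute the response-time PMF of a structured WS-workflow
    given as a tree of the five composition constructs."""
    if node["kind"] == "atomic":
        return node["T"]                        # PMF derived from past invocations
    child_times = [response_time(c) for c in node["children"]]
    if node["kind"] == "sequential":            # activities run one after another
        return reduce_pmfs(add, child_times)
    if node["kind"] == "parallel":              # the join waits for the slowest branch
        return reduce_pmfs(max, child_times)
    if node["kind"] == "fault_tolerant":        # the first completion ends the construct
        return reduce_pmfs(min, child_times)
    if node["kind"] == "conditional":           # exactly one branch is executed
        return conditional_selection(list(zip(child_times, node["probs"])))
    if node["kind"] == "loop":                  # see the loop sketch above
        return loop_response_time(child_times[0], node["iterations"])
    raise ValueError("unknown construct: " + node["kind"])
```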

4.2 Sample Space Reduction

When combining PMFs of discrete random variables with respect to a given operation, the sample space size of the resultant random variable may become huge. Consider adding k discrete random variables, each having n elements in their respective domains: the sample space size of the resultant random variable is, in the worst case, of the order of n^k. In order to keep the domain of a PMF after each operation at a reasonable size, we propose to group the elements in the sample space. In other words, several consecutive scalar values in the sample space are represented by a single value, and the aggregated probability is computed. The problem is formally described below.

Let the domain of X be the ordered values x1 < x2 < ... < xs, with PMF P(X = xi) = pi. We call another random variable Y an aggregate random variable of X if there exists a partition of x1, ..., xs into consecutive subsequences such that the domain of Y consists of one representative value per subsequence and the PMF of Y assigns to each representative the total probability of its subsequence. The aggregate error of Y with respect to X, denoted aggregate_error(Y, X), is the mean square error of representing each xi by the representative of its subsequence, weighted by pi.

Aggregate Random Variable Problem. Given a random variable X of domain size s and a desired domain size m, the problem is to find an aggregate random variable Y of domain size m such that its aggregate error with respect to X is minimized.

Dynamic Programming Method. An optimal solution to this problem can be obtained by formulating it as a dynamic program. Let e(i, j, k) be the optimal aggregate error of partitioning xi, ..., xj into k subsequences. The recurrence chooses the last subsequence so as to minimize the sum of the optimal error of the remaining prefix (partitioned into k-1 subsequences) and the error of the last subsequence, where error(i, j) denotes the aggregated error introduced when representing xi, ..., xj by a single value. The time and space complexities of the resulting dynamic programming algorithm are polynomial in s and m.
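Since the recurrence and the error formula are only sketched above, here is one concrete dynamic program of that shape, under our assumption that each group is represented by its probability-weighted mean (the representative that minimises the group's mean square error); it returns the minimal aggregate error of splitting s sorted values into m groups.

```python
def group_error(xs, ps, i, j):
    """Probability-weighted squared error of representing xs[i..j] (inclusive)
    by their probability-weighted mean."""
    mass = sum(ps[i:j + 1])
    rep = sum(x * p for x, p in zip(xs[i:j + 1], ps[i:j + 1])) / mass
    return sum(p * (x - rep) ** 2 for x, p in zip(xs[i:j + 1], ps[i:j + 1]))

def min_aggregate_error(xs, ps, m):
    """e[j][k]: least aggregate error of splitting the first j values into k
    consecutive groups; the answer is e[s][m]."""
    s = len(xs)
    INF = float("inf")
    e = [[INF] * (m + 1) for _ in range(s + 1)]
    e[0][0] = 0.0
    for j in range(1, s + 1):
        for k in range(1, min(m, j) + 1):
            for i in range(k, j + 1):           # the last group is xs[i-1 .. j-1]
                cand = e[i - 1][k - 1] + group_error(xs, ps, i - 1, j - 1)
                if cand < e[j][k]:
                    e[j][k] = cand
    return e[s][m]

# Example: reduce a 5-value PMF to 2 representative values.
print(min_aggregate_error([1, 2, 3, 10, 11], [0.1, 0.2, 0.3, 0.2, 0.2], 2))
```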

Greedy Method. To reduce the computation overhead, we propose a heuristic method for solving this problem. The idea is to continuously merge the adjacent pair of samples that gives the least error until a reasonable sample space size is reached. When an adjacent pair is merged, a new element x' is created to replace the pair; the error of merging an adjacent pair, denoted pair_error(), is computed analogously to error(i, j) for the two merged samples. We can use a priority queue to store the errors of merging adjacent pairs. In each iteration, we perform the following steps:
1. Extract the adjacent pair with the least pair_error() value from the priority queue.
2. Replace the pair by the new value x' in the domain of X.
3. Recompute the pair errors of x' with its left and right neighbours (where these exist) and update the priority queue.

greater = lambda x.lambda y.if (x > y) then x else y
lesser = lambda x.lambda y.if (x < y) then x else y

The function flatmap applies a list-valued function f to each member of a list xs and is defined in terms of fold:

flatmap f xs = fold f (++) [] xs

flatmap can in turn be used to define selection, projection and join operators and, more generally, comprehensions. For example, the following comprehension iterates through a list of students and returns those students who are not members of staff:

[x | x <- <<student>>; not (member <<staff>> x)]

and it translates into:

flatmap (lambda x.if (not (member <<staff>> x)) then [x] else []) <<student>>

Grouping operators are also definable in terms of fold. In particular, the operator group takes as an argument a list of pairs xs and groups them on their first component, while gc aggFun xs groups a list of pairs xs on their first component and then applies the aggregation function aggFun to the second component.3

We refer the reader to [8] for details of IQL


There are several algebraic properties of IQL’s operators that we can use in order to incrementally compute materialised data and to reason about IQL expressions, specifically for the purposes of this paper in a schema/data evolution context (note that the algebraic properties of fold below apply to all the operators defined in terms of fold): (a) e ++[] = []++ e = e, e -- [] = e, [] -- e = [], distinct [] = sort [] = [] for any list-valued expression e. Since Void represents a construct for which no data is obtainable from a data source, it has the semantics of the empty list, and thus the above equivalences also hold if Void is substituted for []. (b) fold f op e [] = fold f op e Void = e, for any f, op, e (c) fold f op e (b1 ++ b2) = (fold f op e b1) op (fold f op e b2) for any f, op, e, b1, b2. Thus, we can always incrementally compute the value of fold-based functions if collections expand. (d) fold f op e (b1 -- b2) = (fold f op e b1) op’ (fold f op e b2) provided there is an operator op’ which is the inverse of op i.e. such that (a op b) op’ b = a for all a,b. For example, if op = + then op’ = -, and thus we can always incrementally compute the value of aggregation functions such as count, sum and avg if collections contract. Note that this is not possible for min and max since lesser and greater have no inverses. Although IQL is list-based, if the ordering of elements within lists is ignored then its operators are faithful to the expected bag semantics, and within AutoMed we generally do assume bag semantics. Under this assumption, (xs ++ ys) -- ys = xs for all xs,ys and thus we can incrementally compute the value of flatmap and all its derivative operators if collections contract4.
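The following small sketch (written in Python rather than IQL, so the function names are ours) illustrates how properties (c) and (d) are used: when a collection grows, only the delta is folded and combined with the old result; when it shrinks, an inverse operator undoes the contribution of the removed elements, which works for count/sum/avg but not for max/min.

```python
from functools import reduce
from operator import add, sub

def fold(f, op, e, xs):
    """fold f op e xs = f(x1) op f(x2) op ... op f(xn), with e for the empty list."""
    return reduce(op, (f(x) for x in xs), e)

def on_insert(old_value, f, op, e, delta):
    """Property (c): fold over (b1 ++ b2) = (fold over b1) op (fold over b2)."""
    return op(old_value, fold(f, op, e, delta))

def on_delete(old_value, f, op, op_inverse, e, delta):
    """Property (d): usable only when op has an inverse op'."""
    return op_inverse(old_value, fold(f, op, e, delta))

one = lambda x: 1                                   # f for the count aggregate
count = fold(one, add, 0, [10, 20, 30])             # 3
count = on_insert(count, one, add, 0, [40, 50])     # 5, without refolding the list
count = on_delete(count, one, add, sub, 0, [10])    # 4, because - is the inverse of +
```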

4 The distinct operator can also be used to obtain set semantics, if needed.

2.2 An Example

We will use schemas expressed in a simple relational data model and a simple XML data model to illustrate our techniques. However, we stress that these techniques are applicable to schemas defined in any data modelling language that has been specified within AutoMed's Model Definitions Repository.

In the simple relational model, there are two kinds of schema construct: Rel and Att. The extent of a Rel construct <<R>> is the projection of the relation R onto its primary key attributes. The extent of each Att construct <<R, a>>, where a is an attribute (key or non-key) of R, is the projection of relation R onto a. For example, the schema of table MAtab in Figure 2 consists of a Rel construct <<MAtab>> and four Att constructs <<MAtab, Dept>>, <<MAtab, CID>>, <<MAtab, SID>> and <<MAtab, Mark>>. We refer the reader to [15] for an encoding of a richer relational data model, including the modelling of constraints.

In the simple XML data model, there are three kinds of schema construct: Element, Attribute and NestSet. The extent of an Element construct consists of all the elements with the given tag in the XML document; the extent of each Attribute construct consists of all pairs of an element and an attribute value such that the element has the given tag and carries an attribute with that value; and the extent of each NestSet construct consists of all pairs of elements such that the first element has the parent tag and the second is one of its child elements with the nested tag. We refer the reader to [21] for an encoding of a richer model for XML data sources, called XMLDSS, which also captures the ordering of children elements under parent elements and cardinality constraints. That paper gives an algorithm for generating the XMLDSS schema of an XML document. That paper also discusses a unique naming scheme for Element constructs so as to handle instances of the same element tag occurring at multiple positions in the XMLDSS tree.

Figure 2 illustrates the integration of three data sources, which respectively store students' marks for the three departments MA, IS and CS.

Fig. 2. An example integration

The database for department MA has one table of students' marks for each course, where the relation name is the course ID. The database for department IS is an XML file containing information about course IDs, course names, student IDs and students' marks. The database for department CS has one table containing one row per student, giving the student's ID, name, and mark for the courses CSC01, CSC02 and CSC03. Each data source has an associated materialised conformed database. Finally, the global database GD contains one table CourseSum(Dept,CID,Total,Avg) which gives the total and average mark for each course of each department. Note that the virtual union schema US (not shown) combines all the information from all the conformed schemas and consists of a virtual table Details(Dept,CID,SID,CName,SName,Mark). The following transformation pathways express the schema transformation and integration processes in this example. Due to space limitations, we have not given the remaining steps for deleting/contracting the constructs in the source schema of each pathway (note that this ‘growing’ and ‘shrinking’ of schemas is characteristic of AutoMed schema transformation pathways):

The removal of the other two tables is similar.

3 Expressing Schema and Data Model Evolution

In a heterogeneous data warehousing environment, it is possible for either a data source schema or the integrated database schema to evolve. This schema evolution may be a change in the schema, or a change in the data model in which the schema is expressed, or both. AutoMed transformations can be used to express the schema evolution in all three cases:

(a) Consider first a schema S expressed in some modelling language. We can express the evolution of S to a new schema, also expressed in that modelling language, as a series of primitive transformations that rename, add, extend, delete or contract constructs of the modelling language. For example, suppose that the relational schema in the above example evolves so that its three tables become a single table with an extra column for the course ID. This evolution is captured by a pathway which is identical to the pathway given above. This kind of transformation, capturing well-known equivalences between schemas, can be defined in AutoMed by means of a parametrised transformation template which is schema- and data-independent. When invoked with specific schema constructs and their extents, a template generates the appropriate sequence of primitive transformations within the Schemas & Transformations Repository – see [5] for details.

(b) Consider now a schema S expressed in one modelling language which evolves into an equivalent schema expressed in another modelling language. We can express this translation by a series of add steps that define the constructs of the new schema in the new modelling language in terms of the constructs of S in the original modelling language. At this stage, we have an intermediate schema that contains the constructs of both schemas. We then specify a series of delete steps that remove the constructs of the original schema (the queries within these transformations indicate that these are now redundant constructs since they can be derived from the new constructs). For example, suppose that the XML schema in the above example evolves into an equivalent relational schema consisting of a single table with one column per attribute. This evolution is captured by a pathway which is identical to the pathway given above. Again, such generic inter-model translations between one data model and another can be defined in AutoMed by means of transformation templates.

(c) Consider finally an evolution which is both a change in the schema and a change in the data model. This can be expressed by a combination of (a) and (b) above: either (a) followed by (b), or (b) followed by (a), or indeed by interleaving the two processes.
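As a purely illustrative encoding (this is not AutoMed's API; the step format and names are hypothetical), a pathway can be held as a list of primitive steps, each optionally carrying the query that derives the affected construct, and every step has an automatically derivable reverse — which is what makes the uniform treatment of cases (a)-(c) possible.

```python
# A primitive step: (operation, construct, query). The query derives the
# construct's extent for add/delete; it may be None for extend/contract/rename.
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend", "rename": "rename"}

def reverse_step(step):
    op, construct, query = step
    return (INVERSE[op], construct, query)      # a real rename would also swap names

def reverse_pathway(pathway):
    """The reverse pathway, available automatically for any pathway."""
    return [reverse_step(s) for s in reversed(pathway)]

# A hypothetical intra-model evolution (case (a)): add a merged table derived
# from existing ones, then delete one of the originals as now redundant.
evolution = [
    ("add", "MergedMarks", "union of the per-course mark tables"),
    ("delete", "CourseTable1", "rows of MergedMarks for that course"),
]
print(reverse_pathway(evolution))
```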

4 Handling Schema Evolution

In this section we consider how the general integration network illustrated in Figure 1 can be evolved in the face of evolution of a local schema or the warehouse schema. We have seen in the previous section how AutoMed transformations can be used to express the schema evolution if either the schema or the data model changes, or both. We can therefore treat schema and data model change in a uniform way for the purposes of handling schema evolution: both are expressed as a sequence of AutoMed primitive transformations, in the first case staying within the original data model, and in the second case transforming the original schema in the original data model into a new schema in a new data model.

In this section we describe the actions that are taken in order to evolve the integration network of Figure 1 if the global schema GS evolves (Section 4.1) or if a local schema evolves (Section 4.2). Given an evolution pathway from a schema S to a new schema, in both cases each successive primitive transformation within the pathway is treated one at a time. Thus, we describe in Sections 4.1 and 4.2 the actions that are taken if the evolution consists of just one primitive transformation. If it is a composite transformation, then it is handled as a sequence of primitive transformations. Our discussion below assumes that the primitive transformation being handled is adding, removing or renaming a construct of S that has an underlying data extent. We do not discuss the addition or removal of constraints here as these do not impact on the materialised data, and we make the assumption that any constraints in the pathway have been verified as being valid.

4.1 Evolution of the Global Schema

Suppose the global schema GS evolves by means of a primitive transformation into a new global schema. This is expressed by the new step being appended to the pathway of Figure 1. The new global schema has its own associated extension; GS is now an intermediate schema in the extended pathway and it no longer has an extension associated with it. The primitive transformation may be a rename, add, extend, delete or contract transformation, and the following actions are taken in each case:
1. If it is a rename then there is nothing further to do. GS is semantically equivalent to the new global schema, and the new global database is identical to GD except that the extent of the renamed construct now appears under its new name.
2. If it is an add then there is nothing further to do at the schema level: GS is semantically equivalent to the new global schema. However, the new construct must now be populated, and this is achieved by evaluating the query supplied with the add step over GD.
3. If it is an extend then the new construct is populated by an empty extent. This new construct may subsequently be populated by an expansion in a data source (see Section 4.2).
4. If it is a delete or contract then the extent of the construct must be removed from GD in order to create the new global database (it is assumed that this is a legal deletion/contraction, e.g. if we wanted to delete/contract a table from a relational schema, then first the constraints and then the columns would be deleted/contracted and lastly the table itself; such syntactic correctness of transformation pathways is automatically verified by AutoMed). It may now be possible to simplify the transformation network, in that if the pathway contains a matching add or extend transformation then both this and the new transformation can be removed from the pathway. This is purely an optimization – it does not change the meaning of a pathway, nor its effect on view generation and query/data translation. We refer the reader to [19] for details of the algorithms that simplify AutoMed transformation pathways.

In cases 2 and 3 above, the new construct will automatically be propagated into the schema DMS of any data mart derived from GS. To prevent this, a contract transformation for the new construct can be prefixed to the pathway from GS to DMS. Alternatively, the new construct can be propagated to DMS if so desired, and materialised there. In cases 1 and 4 above, the change in GS and GD may impact on the data marts derived from GS, and we discuss this in Section 4.3.
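The case analysis above can be condensed into a small dispatch routine; this is our sketch of the bookkeeping only, with evaluate and simplify left as placeholders for IQL query evaluation over GD and for the pathway simplification of [19].

```python
def evaluate(query, gd):
    """Placeholder: evaluate an IQL query over the materialised global database."""
    return []

def simplify(pathway, step):
    """Placeholder: cancel a matching add/extend against the new delete/contract."""
    pass

def handle_global_step(step, gd, pathway):
    """gd maps construct names to materialised extents; step = (op, construct, arg)."""
    op, construct, arg = step
    if op == "rename":
        gd[arg] = gd.pop(construct)           # case 1: same data under the new name
    elif op == "add":
        gd[construct] = evaluate(arg, gd)     # case 2: populate by evaluating the query
    elif op == "extend":
        gd[construct] = []                    # case 3: empty until a source expands
    elif op in ("delete", "contract"):
        gd.pop(construct, None)               # case 4: drop the extent ...
        simplify(pathway, step)               # ... and optionally simplify the pathway
    return gd
```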

4.2 Evolution of a Local Schema

Suppose a local schema evolves by means of a primitive transformation into As discussed in Section 2, there is automatically available a reverse transformation from to and hence a pathway from to The new local schema is and its associated extension is is now just an intermediate schema in the extended pathway and it no longer has an associated extension. may be a rename, add, delete, extend or contract transformation. In 1–5 below we see what further actions are taken in each case for evolving the integration network and the downstream materialised data as necessary. We first introduce some necessary terminology: If is a pathway and is a construct in S, we denote by the constructs of which are directly or indirectly dependent on either because itself appears in or because a construct of is created by a transformation add within where the query directly or indirectly references The set can be straight-forwardly computed by traversing and inspecting the query associated with each add transformation within in. 1. If is rename then schema is semantically equivalent to The new transformation pathway is The new local database is identical to except that the extent of in is now the extent of in then has evolved to contain a new construct whose 2. If is add extent is equivalent to the expression over the other constructs of The new transformation pathway is this means that has evolved to not include a construct 3. If is delete whose extent is derivable from the expression over the other constructs of and the new local database no longer contains an extent for The new transformation pathway is

In the above three cases, schema is semantically equivalent to and nothing further needs to be done to any of the transformation pathways, schemas or databases and GD. This may not be the case if is a contract or extend transformation, which we consider next. 4. If is extend then there will be a new construct available from that was not available before. That is, has evolved to contain the new construct whose extent is not derivable from the other constructs of If we left the transformation pathway as it is, this would result in a pathway from to which would immediately drop the new construct from the integration network. That is, is consistent but it does not utilize the new data.

However, recall that we said earlier that we assume no contract steps in the pathways from local schemas to their union schemas, and that all the data in should be available to the integration network. In order to achieve this, there are four cases to consider:


appears in and has the same semantics as the newly added in Since cannot be derived from the original there must be a transformation extend in We remove from the new contract c step and this matching extend step. This propagates into and we populate its extent in the materialised database by replicating its extent from (b) does not appear in but it can be derived from by means of some transformation T. In this case, we remove from the first contract c step, so that is now present in and in We populate the extent of in by replicating its extent from To repair the other pathways and schemas for we append T to the end of each As a result, the new construct now appears in all the union schemas. To add the extent of this new construct to each materialised database for we compute it from the extents of the other constructs in using the queries within successive add steps in T. We finally append the necessary new id steps between pairs of union schemas to assert the semantic equivalence of the construct within them. (c) does not appear in and cannot be derived from In this case, we again remove from the first contract c step so that is now present in schema To repair the other pathways and schemas for we append an extend step to the end of each As a result, the new construct now appears in all the conformed schemas The construct may need further translation into the data model of the union schemas and this is done by appending the necessary sequence, T, of add/delete/rename steps to all the pathways We compute the extent of within the database from its extent within using the queries within successive add steps in T. We finally append the necessary new id steps between pairs of union schemas to assert the semantic equivalence of the new construct(s) within them. (d) appears in but has different semantics to the newly added in In this case, we rename in to a new construct The situation reverts to adding a new construct to and one of (a)-(c) above applies. (a)

We note that determining whether can or cannot be derived from the existing constructs of the union schemas in (a)–(d) above requires domain or expert human knowledge. Thereafter, the remaining actions are fully automatic. In cases (a) and (b), there is new data added to one or more of the conformed databases which needs to be propagated to GD. This is done by computing and using the algebraic equivalences of Section 2.1 to propagate changes in the extent of to each of its descendant constructs gc in GS. Using these equivalences, we can in most cases incrementally recompute the extent of gc. If at any stage in there is a transformation add where no equivalence can be applied, then we have to recompute the whole extent of


In cases (b) and (c), there is a new schema construct appearing in the This construct will automatically appear in the schema GS. If this is not desired, a transformation contract can be prefixed to 5. If is contract then the construct in will no longer be available from That is, has evolved so as to not include a construct whose extent is not derivable from the other constructs of The new local database no longer contains an extent for The new transformation pathway is Since the extent of is now Void, the materialised data in and GD must be modified so as to remove any data derived from the old extent of In order to repair we compute For each construct uc in we compute its new extent and replace its old extent in by the new extent. Again, the algebraic properties of IQL queries discussed in Section 2.1 can be used to propagate the new Void extent of construct in to each of its descendant constructs uc in Using these equivalences, we can in most cases incrementally recompute the extent of uc as we traverse the pathway In order to repair GD, we similarly propagate changes in the extent of each uc along the pathway Finally, it may also be necessary to amend the transformation pathways if there are one or more constructs in GD which now will always have an empty extent as a result of this contraction of For any construct uc in US whose extent has become empty, we examine all pathways If all these pathways contain an extend uc transformation, or if using the equivalences of Section 2.1 we can deduce from them that the extent of uc will always be empty, then we can suffix a contract gc step to for every gc in and then handle this case as paragraph 4 in Section 4.1.

4.3 Evolution of Downstream Data Marts

We have discussed how evolutions to the global schema or to a source schema are handled. One remaining question is how to handle the impact of a change to the data warehouse schema, and possibly its data, on any data marts that have been derived from it. In [7] we discuss how it is possible to express the derivation of a data mart from a data warehouse by means of an AutoMed transformation pathway. Such a pathway expresses the relationship of a data mart schema DMS to the warehouse schema GS. As such, this scenario can be regarded as a special case of the general integration scenario of Figure 1, where GS now plays the role of the single source schema, the underlying materialised databases and GD collectively play the role of the data associated with this source schema, and DMS plays the role of the global schema. Therefore, the same techniques as discussed in Sections 4.1 and 4.2 can be applied.

5 Concluding Remarks

In this paper we have described how the AutoMed heterogeneous data integration toolkit can be used to handle the problem of schema evolution in heterogeneous data warehousing environments so that the previous transformation, integration and data materialisation effort can be reused. Our algorithms are mainly automatic, except for the aspects that require domain or expert human knowledge regarding the semantics of new schema constructs. We have shown how AutoMed transformations can be used to express schema evolution within the same data model, or a change in the data model, or both, whereas other schema evolution literature has focussed on just one data model. Schema evolution within the relational data model has been discussed in previous work such as [11,12,18]. The approach in [18] uses a first-order schema in which all values in a schema of interest to a user are modelled as data, and other schemas can be expressed as a query over this first-order schema. The approach in [12] uses the notation of a flat scheme, and gives four operators UNITE, FOLD, UNFOLD and SPLIT to perform relational schema evolution using the SchemaSQL language. In contrast, with AutoMed the process of schema evolution is expressed using a simple set of primitive schema transformations augmented with a functional query language, both of which are applicable to multiple data models. Our approach is complementary to work on mapping composition, e.g. [20, 14], in that in our case the new mappings are a composition of the original transformation pathway and the transformation pathway which expresses the schema evolution. Thus, the new mappings are, by definition, correct. There are two aspects to our approach: (i) handling the transformation pathways and (ii) handling the queries within them. In this paper we have in particular assumed that the queries are expressed in IQL. However, the AutoMed toolkit allows any query language syntax to be used within primitive transformations, and therefore this aspect of our approach could be extended to other query languages. Materialised data warehouse views need to be maintained when the data sources change, and much previous work has addressed this problem at the data level. However, as we have discussed in this paper, materialised data warehouse views may also need to be modified if there is an evolution of a data source schema. Incremental maintenance of schema-restructuring views within the relational data model is discussed in [10], whereas our approach can handle this problem in a heterogeneous data warehousing environment with multiple data models and changes in data models. Our previous work [7] has discussed how AutoMed transformation pathways can also be used for incrementally maintaining materialised views at the data level. For future work, we are implementing our approach and evaluating it in the context of biological data warehousing.


References 1. J. Andany, M. Léonard, and C. Palisser. Management of schema evolution in databases. In Proc. VLDB’91, pages 161–170. Morgan Kaufmann, 1991. 2. Z. Bellahsene. View mechanism for schema evolution in object-oriented DBMS. In Proc. BNCOD’96, LNCS 1094. Springer, 1996. 3. B. Benatallah. A unified framework for supporting dynamic schema evolution in object databases. In Proc. ER’99, LNCS 1728. Springer, 1999. 4. M. Blaschka, C. Sapia, and G. Höfling. On schema evolution in multidimensional databases. In Proc. DaWaK’99, LNCS 1767. Springer, 1999. 5. M. Boyd, S. Kittivoravitkul, C. Lazanitis, P.J. McBrien, and N. Rizopoulos. AutoMed: A BAV data integration system for heterogeneous data sources. In Proc. CAiSE’04, 2004. 6. P. Buneman et al. Comprehension syntax. SIGMOD Record, 23(1):87–96, 1994. 7. H. Fan and A. Poulovassilis. Using AutoMed metadata in data warehousing environments. In Proc. DOLAP’03, pages 86–93. ACM Press, 2003. 8. E. Jasper, A. Poulovassilis, and L. Zamboulis. Processing IQL queries and migrating data in the AutoMed toolkit. Technical Report 20, Automed Project, 2003. 9. E. Jasper, N. Tong, P. McBrien, and A. Poulovassilis. View generation and optimisation in the AutoMed data integration framework. In Proc. 6th Baltic Conference on Databases and Information Systems, 2004. 10. A. Koeller and E. A. Rundensteiner. Incremental maintenance of schemarestructuring views. In Proc. EDBT’02, LNCS 2287. Springer, 2002. 11. L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. On the logical foundations of schema integration and evolution in heterogeneous database systems. In Proc. DOOD’93, LNCS 760. Springer, 1993. 12. L. V. S. Lakshmanan, F. Sadri, and S. N. Subramanian. On efficiently implementing SchemaSQL on an SQL database system. In Proc. VLDB’99, pages 471–482. Morgan Kaufmann, 1999. 13. M. Lenzerini. Data integration: A theoretical perspective. In Proc. PODS’02, 2002. 14. Jayant Madhavan and Alon Y. Halevy. Composing mappings among data sources. In Proc. VLDB’03. Morgan Kaufmann, 2003. 15. P. McBrien and A. Poulovassilis. A uniform approach to inter-model transformations. In Proc. CAiSE’99, LNCS 1626, pages 333–348. Springer, 1999. 16. P. McBrien and A. Poulovassilis. Schema evolution in heterogeneous database architectures, a schema transformation approach. In Proc. CAiSE’02, LNCS 2348, pages 484–499. Springer, 2002. 17. P. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation rules. In Proc. ICDE’03, pages 227–238, 2003. 18. Renée J. Miller. Using schematically heterogeneous structures. In Proc. ACM SIGMOD’98, pages 189–200. ACM Press, 1998. 19. N. Tong. Database schema transformation optimisation techniques for the AutoMed system. In Proc. BNCOD’03, LNCS 2712. Springer, 2003. 20. Yannis Velegrakis, Renée J. Miller, and Lucian Popa. Mapping adaptation under evolving schemas. In Proc. VLDB’03. Morgan Kaufmann, 2003. 21. L. Zamboulis. XML data integration by graph restrucring. In Proc. BNCOD’04, LNCS 3112. Springer, 2004.

Metaprogramming for Relational Databases Jernej Kovse, Christian Weber, and Theo Härder Department of Computer Science Kaiserslautern University of Technology P.O. Box 3049, D-67653 Kaiserslautern, Germany {kovse,c_weber,haerder}@informatik.uni-kl.de

Abstract. For systems that share enough structural and functional commonalities, reuse in schema development and data manipulation can be achieved by defining problem-oriented languages. Such languages are often called domain-specific, because they introduce powerful abstractions meaningful only within the domain of observed systems. In order to use domain-specific languages for database applications, a mapping to SQL is required. In this paper, we deal with metaprogramming concepts required for easy definition of such mappings. Using an example domain-specific language, we provide an evaluation of mapping performance.

1 Introduction

A large variety of approaches use SQL as a language for interacting with the database, but at the same time provide a separate problem-oriented language for developing database schemas and formulating queries. A translator maps a statement in such a problem-oriented language to a series of SQL statements that get executed by the DBMS. An example of such a system is Preference SQL, described by Kießling and Köstler [8]. Preference SQL is an SQL extension that provides a set of language constructs which support easy use of soft preferences. This kind of preference is useful when searching for products and services in diverse e-commerce applications where a set of strictly observed hard constraints usually results in an empty result set, although products that approximately match the user's demands do exist. The supported constructs include approximation (clauses AROUND and BETWEEN), minimization/maximization (clauses LOWEST, HIGHEST), favorites and dislikes (clauses POS, NEG), pareto accumulation (clause AND), and cascading of preferences (clause CASCADE) (see [8] for examples).

In general, problem-oriented programming languages are also called domain-specific languages (DSLs), because they prove useful when developing and using systems from a predefined domain. The systems in a domain will exhibit a range of similar structural and functional features (see [4,5] for details), making it possible to describe them (and, in our case, query their data) using higher-level programming constructs. In turn, these constructs carry semantics meaningful only within this domain. As the activity of using these constructs is referred to as programming, defining such constructs and their mappings to languages that can be compiled or interpreted to allow their execution is referred to as metaprogramming. This paper focuses on the application of metaprogramming for relational databases. In particular, we are interested in concepts that guide the implementation of fast mappings of custom languages, used for developing database schemas and manipulating data, onto SQL-DDL and SQL-DML.

The paper is structured as follows. First, in Sect. 2, we further motivate the need for DSLs for data management. An overview of related work is given by Sect. 3. Our system prototype (DSL-DA – domain-specific languages for database applications) that supports the presented ideas is outlined in Sect. 4. A detailed performance evaluation of a DSL for the example product line will be presented in Sect. 5. Sect. 6 gives a detailed overview of metaprogramming concepts. Finally, in Sect. 7, we summarize our results and give some ideas for future work related to our approach.

2 Domain-Specific Languages

The idea of DSLs is tightly related to domain engineering. According to Czarnecki and Eisenecker [5], domain engineering deals with collecting, organizing, and storing past experience in building systems in the form of reusable assets. In general, we can expect that a given asset can be reused in a new system in case this system possesses some structural and functional similarity to previous systems. Indeed, systems that share enough common properties are said to constitute a system family (a more market-oriented term for a system family is a software product-line). Examples of software product-lines are extensively outlined by Clements and Northrop [4] and include satellite controllers, internal combustion engine controllers, and systems for displaying and tracing stock-market data. Further examples of more data-centric product lines include CRM and ERP systems. Our example product line for versioning systems will be introduced in Sect. 4.

Three approaches can be applied to allow the reuse of "assets" when developing database schemas for systems in a data-intensive product line.

Components: Schema components can be used to group larger reusable parts of a database schema to be used in diverse systems afterwards (see Thalheim [16] for an extensive overview of this approach). Generally, the modularity of the system specification (which components are to be used) directly corresponds to the modularity of the resulting implementation, because a component does not influence the internal implementation of other components. This kind of specification transformation towards the implementation is referred to as a vertical transformation or forward refinement [5].

Frameworks: Much like software frameworks in general (see, for example, Apache Struts [1] or IBM San Francisco [2]), schema frameworks rely on the user to extend them with system-specific parts. This step is called framework instantiation and requires certain knowledge of how the missing parts will be called by the framework. Most often, this is achieved by extending superclasses defined by the framework or implementing call-back methods which will be invoked by mechanisms such as reflection. In a DBMS, application logic otherwise captured by such methods can be defined by means of constraints, trigger conditions and actions, and stored procedures. A detailed


overview of schema frameworks is given by Mahnke [9]. Being more flexible than components, frameworks generally require more expertise from the user. Moreover, for performance reasons, most DBMSs refrain from offering dynamic invocation possibilities through method overloading or reflection (otherwise supported in common OO programming languages). For this reason, schema frameworks are difficult to implement without middleware acting as a mediator for such calls.

Generators: Schema generators are, in our opinion, the most advanced approach to reuse and are the central topic of this paper. A schema generator acts much like a compiler: it transforms a high-level specification of the system to a schema definition, possibly equipped with constraints, triggers, and stored procedures. In general, the modularity of the specification does not have to be preserved. Two modular parts of the specification can be interwoven to obtain a single modular part in the schema (these transformations are called horizontal transformations; in case the obtained part in the schema is also refined, for example, columns not explicitly defined in the specification are added to a table, this is called an oblique transformation, i.e., a combination of a horizontal and a vertical transformation). It is important to note that there is no special "magic" associated with schema generators that allows them to obtain a ready-to-use schema out of a short specification. By narrowing the domain of systems, it is possible to introduce very powerful language abstractions that are used at the specification level. Due to similarities between systems, these abstractions aggregate a lot of semantics that is dispersed across many schema elements. Because defining this semantics in SQL-DDL proves labour-intensive, we rather choose to define a special domain-specific DDL (DS-DDL) for specifying the schema at a higher level of abstraction and implement the corresponding mapping to SQL-DDL. The mapping represents the "reusable asset" and can be used with any schema definition in this DS-DDL. The data manipulation part complementary to DS-DDL is called DS-DML and allows the use of domain-specific query and update statements in application programs. Defining custom DS-DDLs and their mappings to SQL-DDL as well as fast translation of DS-DML statements is the topic we explore in this paper.

3 Related Work

Generators are the central idea of the OMG’s Model Driven Architecture (MDA) [13] which proposes the specification of systems using standardized modeling languages (UML) and automatic generation of implementations from models. However, even OMG notices the need of supporting custom domain-specific modeling languages. As noted by Frankel [6], this can be done in three different ways: Completely new modeling languages: A new DSL can be obtained by defining a new MOF-based metamodel. Heavyweight language extensions: A new DSL can be obtained by extending the elements of a standardized metamodel (e.g., the UML Metamodel). Lightweight language extensions: A new DSL can be obtained by defining new language abstractions using the language itself. In UML, this possibility is supported by UML Profiles.


The research area that deals with developing custom (domain-specific) software engineering methodologies well suited for particular systems is called computer-aided method engineering (CAME) [14]. CAME tools allow the user to describe their own modeling method and afterwards generate a CASE tool that supports this method. For an example of a tool supporting this approach, see MetaEdit+ [11].

The idea of a rapid definition of domain-specific programming languages and their mapping to a platform where they can be executed is materialized in Simonyi's work on Intentional Programming (IP) [5,15]. IP introduces an IDE based on active libraries that are used to import language abstractions (also called intentions) into this environment. Programs in the environment are represented as source graphs in which each node possesses a special pointer to a corresponding abstraction. The abstractions define extension methods, which are metaprograms that specify the behavior of nodes. The following are the most important extension methods in IP.

Rendering and type-in methods. Because it is cumbersome to edit the source graph directly, rendering methods are used to visualize the source graph in an editable notation. Type-in methods convert the code typed in this notation back to the source graph. This is especially convenient when different notations prove useful for a single source graph.

Refactoring methods. These methods are used to restructure the source graph by factoring out repeating code parts to improve reuse.

Reduction methods. The most important component of IP, these methods reduce the source graph to a graph of low-level abstractions (also called reduced code or R-code) that represent programs executable on a given platform. Different reduction methods can be used to obtain the R-code for different platforms.

How does this work relate to our problem? As in IP, we want to support a custom definition of abstractions that form both a custom DS-DDL and a custom DS-DML. We want to support the rendering of source graphs for DS-DDL and DS-DML statements to (possibly diverse) domain-specific textual representations. Most importantly, we want to support the reduction of these graphs to graphs representing SQL statements that can be executed by a particular DBMS.

4 DSL-DA System

In our DSL-DA system, the user starts by defining a domain-specific (DS) metamodel for the DS-DDL that describes the language abstractions that can appear in the source graph (the language used for defining metamodels is a simplified variant of the MOF Model). We used the system to fully implement a DSL for the example product line of versioning systems, which we also use in the next section for the evaluation of our approach. In this product line, each system is used to store and version objects (of some object type) and relationships (of some relationship type). Thus individual systems differ in their type definitions (also called information models [3]) as well as in other features illustrated in the DS-DDL metamodel in Fig. 1 and explained below.


Fig. 1. DS-DDL metamodel for the example product line

Object types can be versioned or unversioned. The number of direct successors to a version can be limited to some number (maxSuccessors) for a given versioned object type. Relationship types connect to object types using either non-floating or floating relationship ends. A non-floating relationship end connects directly to a particular version as if this version were a regular object. On the other hand, a floating relationship end maintains a user-managed subset of all object versions for each connected object. Such subsets are called candidate version collections (CVC) and prove useful for managing configurations. In unfiltered navigation from some origin object, all versions contained in every connected CVC will be returned. In filtered navigation, a version preselected for each CVC (also called the pinned version) will be returned. In case there is no pinned version, we return the latest version from the CVC. Workspace objects act as containers for other objects. However, only one version of a contained object can be present in the workspace at a time. In this way, workspaces allow a version-free view to the contents of a versioning system. When executed within a workspace, filtered navigation returns versions from the CVC that are connected to this workspace and ignores the pin setting of the CVC. Operations create object, copy, delete, create successor, attach/detach (connects/ disconnects an object to/from a workspace), freeze, and checkout/checkin (locks/ unlocks the object) can propagate across relationships. A model expressed using the DS-DDL metamodel from Fig. 1 will represent a source graph for a particular DS-DDL schema definition used to describe a given versioning system. To work with these models (manipulate the graph nodes), DSL-DA uses the DS-DDL metamodel to generate a schema editor that displays the graphs in a tree-like form (see the left-hand side of Fig. 2). A more convenient graphical notation of a source graph for our example versioning system that we will use for the evaluation in the next section is illustrated in Fig. 3. The metamodel classes define rendering and type-in methods that render the source graph to a textual representation and allow its editing (right-hand side of Fig. 2). More importantly, the metamodel classes define reduction methods that will reduce the


source graph to its representation in SQL-DDL.

Fig. 2. DS-DDL schema development with the generated editor
Fig. 3. Example DS-DDL schema used in performance evaluation

In analogy with the domain-specific level of the editor, the obtained SQL-DDL schema is also represented as a source graph; the classes used for this graph are the classes defined by the package Relational of the OMG's Common Warehouse Metamodel (CWM) [12]. The rendering methods of these classes are customizable so that, by rendering the SQL-DDL source graphs, SQL-DDL schemas in the SQL dialects of diverse DBMS vendors can be obtained.

Once an SQL-DDL schema is installed in a database, how do we handle statements in DS-DML (three examples of such statements are given by Table 1)? As for the DS-DDL, there is a complementary DS-DML metamodel that describes the language abstractions of the supported DS-DML statements. This metamodel can be simply defined by first coming up with an EBNF for the DS-DML and afterwards translating the EBNF symbols to class definitions in a straightforward fashion. The EBNF of our DS-DML for the sample product line for versioning systems is available through [17]. DS-DML statements can then be represented as source graphs, where each node in the graph is an instance of some class from the DS-DML metamodel. Again, metamodel classes define reduction methods that reduce the corresponding DS-DML source graph to an SQL-DML source graph, out of which SQL-DML statements can be obtained through rendering.

DS-DML is used by an application programmer to embed domain-specific queries and data manipulation statements in the application code. In certain cases, the general structure of a DS-DML statement will be known at the time the application is written and the parameters of the statement will only need to be filled with user-provided values at run time. Since these parameters do not influence the reduction, the reduction from DS-DML to SQL-DML can take place using a precompiler. Sometimes, however, especially in the case of Web applications, the structure of the DS-DML query will depend on the user's search criteria and other preferences and is thus not known at compile time. The solution in this case is to wrap the native DBMS driver into a domain-specific driver that performs the reduction at run time, passes the SQL-DML statements to the native driver, and restructures the result sets before returning them to the user, if necessary. To handle both cases, where the query structure is known at compile time and where it is not, DSL-DA can generate both the precompiler and the domain-specific driver from the DS-DML metamodel, its reduction methods, and its rendering methods for SQL-DML. We assumed the worst-case scenario in which all SQL-DML statements need to be reduced at run time for our evaluation in the next section, in order to examine the effect of run-time reduction in detail.
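As a flavour of what a reduction method produces, the toy function below (our sketch in Python, not the generated Java driver) renders the SQL-DDL object table for one DS-DDL object type, using the objectId/versionId/globalId identifier columns described in the evaluation; the node layout and the example object type are hypothetical.

```python
def reduce_object_type(node):
    """Toy DS-DDL -> SQL-DDL reduction: one object table per object type."""
    columns = ["objectId  INTEGER NOT NULL",        # shared by all versions of an object
               "versionId INTEGER NOT NULL",        # identifies a version in the tree
               "globalId  VARCHAR(40) PRIMARY KEY"] # combination of objectId and versionId
    columns += [f"{name} {sql_type}" for name, sql_type in node["attributes"]]
    return "CREATE TABLE " + node["name"] + " (\n  " + ",\n  ".join(columns) + "\n)"

# Hypothetical versioned object type from a DS-DDL schema.
document = {"name": "Document",
            "attributes": [("title", "VARCHAR(100)"), ("state", "INTEGER")]}
print(reduce_object_type(document))
```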


5 Evaluation of the Example Product Line The purpose of the evaluation presented in this section is to demonstrate the following. Even for structurally complex DS-DML statements, the reduction process carried out at run time represents a very small proportion of costs needed to carry out the SQL-DML statements obtained by reduction. DS-DDL schemas that have been reduced to SQL-DDL with certain optimizations in mind imply reduction that is more difficult to implement. Somewhat surprisingly, this does not necessarily mean that such reduction will also take more processing time. Optimization considerations can significantly contribute to a faster execution of DS-DML statements once reduced to SQL-DML. To demonstrate both points, we implemented four very different variants of both DSDDL and DS-DML reduction methods for the example product line. The DS-DDL schema from Fig. 3 has thus been reduced to four different SQL-DDL schemas. In all four variants, object types from Fig. 3 are mapped to tables (called object tables) with the specified attributes. An object version is then represented as a tuple in this table. The identifiers in each object table include an objectId (all versions of a particular object, i.e., all versions within the same version tree, possess the same objectId), a versionId (identifies a particular version within the version tree) and a globalId, which is a combination of an objectId and a versionId. The four reductions differ in the following way. Variant 1: Store all relationships, regardless of relationship type, using a single “generic” table. For a particular relationship, store the origin globalId, objectId, versionId and the target rolename, globalId, objectId, and versionId as columns. Use an additional column as a flag denoting whether the target version is pinned. Variant 2: Use separate tables for every relationship type. In case a relationship type defines no floating ends or two floating ends, this relationship type can be represented by a single table. In case only one relationship end is floating, such relationship type requires two tables, one for each direction of navigation. Variant 3: Improve Variant 2 by considering maximal multiplicity of 1 on nonfloating ends. For such ends, the globalId of the connected target object is stored as a column in the object table of the origin object. Variant 4: Improve Variant 3 by considering maximal multiplicity of 1 of floating ends. For such ends, the globalIds of the pinned version and the latest version of the CVC for the target object can be stored as columns in the object table of the origin object. Our benchmark, consisting of 141,775 DS-DML statements was then run using four different domain-specific drivers corresponding to four different variants of reduction. To eliminate the need of fetching metadata from the database, we assumed that, once defined, the DS-DDL schema does not change, so each driver accessed the DS-DDL schema defined in Fig. 3 directly in the main memory. The overall time for executing a DS-DML statement is defined as where is the required DS-DML parsing time, the time required for reduction, the time required for rendering all resulting SQL-DML statements, and the time used to carry out these statements. Note that is independent of the variant, so we were mainly interested in the remaining three times as well as the overall time. The average and


Fig. 4. Execution times for the category of select statements

Fig. 5. Execution times for the category of create relationship statements

Fig. 6. Overhead due to DS-DML parsing, reduction and rendering

The average values for the category of select statements are illustrated in Fig. 4. This category included queries over versioned data within and outside workspaces that contained up to four navigation steps. As evident from Fig. 4, Variant 4 demonstrates very good performance and also allows the fastest reduction. On the other hand, due to the materialization of the globalIds of pinned and latest versions for CVCs in Variant 4, Variant 2 proves faster for relationship manipulation (i.e., the creation and deletion of relationships). The values for the category of create relationship statements are illustrated in Fig. 5. Most importantly, the overhead time required by the domain-specific driver proves to be only a small portion of the overall execution time. As illustrated in Fig. 6, when using Variant 4, this portion is lowest (0.8%) for the category of select statements and highest (9.9%) for merge statements.


Fig. 7. Properties of reduction methods

When merging two versions (denoted as the primary and the secondary version), their attribute values have to be compared to their so-called base (latest common) version in the version graph to decide which values should be used for the result of the merge. This comparison, which is performed in the driver, accounts for a large share of the overhead (9.1% of the overall time). Note that the time spent carrying out the SQL-DML statements is the minimal time an application spends executing SQL-DML statements in any case (with or without DS-DML available) to provide the user with equivalent results: even without DS-DML, the programmer would have to implement data flows to connect sequences of SQL-DML statements to perform a given operation (in our evaluation, we treat these data flows as part of the SQL-DML execution time). How difficult is it to implement the DS-DML reduction methods? To estimate this aspect, we applied measures such as the count of expressions, statements, conditional statements, and loops, as well as McCabe’s cyclomatic complexity [10] and the Halstead effort [7], to our Java implementation of the reduction methods. The summarized results obtained using these measures are illustrated in Fig. 7. All measures except the count of loops confirm an increasing difficulty of implementing the reduction (e.g., the Halstead effort almost doubles from Variant 1 to Variant 4). Is there a correlation between the Halstead effort for writing a method and the reduction and rendering times? We try to answer this question in Fig. 8. Somewhat surprisingly, a statement whose reduction is more difficult to implement will sometimes also reduce faster (i.e., an increase in Halstead effort does not necessarily imply an increase in reduction time); this is most evident for the category of select statements. The explanation is that even though the developer has to consider a large variety of different reductions for a complex variant (e.g., Variant 4), once the driver has found the right reduction (see Sect. 6), the reduction can proceed even faster than for a variant with fewer optimization considerations (e.g., Variant 1). For all categories in Fig. 8, a decreasing trend of the measured times can be observed.


Fig. 8. Correlation of the reduction and rendering times to the Halstead effort

However, in categories that manipulate the state of the CVC (note that operations from the copy object category propagate across relationships and thus manipulate CVCs), the impedance due to materializing the pin setting and the latest version comes into effect and often results in only minor differences in the measured values among Variants 2–4.

6 Metaprogramming Concepts

Writing metacode is different from and more difficult than writing ordinary code, because the programmer has to consider a large variety of cases that may occur depending on the form of the statement and the properties defined in the DS-DDL schema. Our key idea for developing reduction methods is so-called reduction polymorphism. In OO programming languages, polymorphism supports dynamic selection of the “right” method depending on the type of object held by a reference (since the type is not known until run time, this is usually called late binding). In this way, it is possible to avoid cluttering the code with conditional statements (explicit type checking by the programmer). In a similar way, we use reduction polymorphism to avoid the explicit use of conditional statements in metacode. This means that for an incoming DS-DML statement, the domain-specific driver will execute reduction methods that (a) match the syntactic structure of the statement and (b) apply to the specifics of the DS-DDL schema constructs used in the statement. We illustrate both concepts using a practical example. Suppose the following DS-DML statement.


Using our DS-DDL schema from Fig. 3 and reduction Variant 4, the statement gets reduced to the following SQL-DML statement (OT denotes object table, ATT the attachment relationship table, F a floating end, and NF a non-floating end).

First, any SELECT statement will match a very generic reduction method that inserts SELECT and FROM clauses into the SQL-DML source graph. A reduction method on the projection clause reduces it to a projection of the identifiers (globalId, objectId, and versionId), the user-defined attributes, and the flag denoting whether the version is frozen. Note that because the maximal multiplicity of the end causedBy pointing from Cost to Task is 1, the table CostOT also contains the materialization of a pinned or latest version of some task, but the column for this materialization is left out of the projection, because it is irrelevant for the user. Next, a reduction method is invoked on the DS-DML FROM clause, which itself calls reduction methods on two DS-DML subnodes, one for each navigation step. Thus, the reduction of Offer-contains->Task results in the conditions in lines 5–6 and the reduction of Task-ratedCosts->Cost results in the conditions in lines 7–8. The reductions carried out in this example rely on two mechanisms, DS-DDL schema divergence and source-graph divergence. DS-DDL schema divergence is applied in the following way. The relationship type used in the first navigation step defines only one floating end, while the one used in the second navigation step defines both ends as floating. Thus, in the reduction of the DS-DDL, we had to map the first relationship type to two distinct tables (because relationships with only one floating end are not necessarily symmetric). Therefore, the choice of the table we use (isPartOfF_containsNF) is based on the direction of navigation. The situation would be different still if the multiplicity defined for the non-floating end were 1; then we would have to use a foreign-key column in the object table. Another important situation where schema divergence is used in our example product line is operation propagation. To deal with DS-DDL schema divergence, each reduction method for a given node comes with a set of preconditions related to the DS-DDL schema that have to be satisfied for the method to execute. Source-graph divergence is applied in the following way. In filtered navigation within a workspace, we have to use the table causedByF_ratedCostsF to arrive at costs. The obtained versions are further filtered in lines 9, 11, and 13 to arrive only at costs attached to the workspace with globalId 435532. The situation would be different outside a workspace, where another table, which stores the materialized globalIds of versions of costs that are either pinned or latest in the corresponding CVC, would have to be used for the join.


Thus, the reduction of the second navigation step depends on whether the clause USE WORKSPACE is used. To deal with source-graph divergence, each reduction method for a given node comes with a set of preconditions related to the node’s neighborhood in the source graph that have to be satisfied for the method to execute. Due to source-graph divergence, line 3 of the DS-DML statement gets reduced to lines 9–15 of the SQL-DML statement. Obviously, it is a good choice for the developer to shift decisions due to divergence into many “very specialized” reduction methods that can be reused in diverse superordinated methods and thus abstract from both types of divergence. In this way, the subordinated methods can be invoked by the developer using generic calls, and the driver itself selects the matching method. Four different APIs are available to the developer within a reduction method:
Source tree traversal. This API is used to explicitly traverse the neighboring nodes to make reduction decisions not automatically captured by source-graph polymorphism. The API is automatically generated from the DS-DML metamodel.
DS-DDL schema traversal. This API is used to explicitly query the DS-DDL schema to make reduction decisions not automatically captured by DS-DDL schema polymorphism. The API is automatically generated from the DS-DDL metamodel.
SQL-DML API. This API is used to manipulate the SQL-DML source graphs.
Reduction API. This API is used for the explicit invocation of reduction methods on subordinated nodes in the DS-DML source graph.
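As a rough illustration of how such precondition-guarded methods could be organized, the following sketch — with invented interfaces that merely stand in for the four APIs above, not the actual DSL-DA interfaces — shows a dispatcher that selects the first reduction method whose DS-DDL schema precondition and source-graph precondition both hold.

```java
import java.util.List;

// Hypothetical supporting types (stand-ins for the four APIs; names invented).
interface DsDmlNode { List<DsDmlNode> neighbors(); String kind(); }                 // source tree traversal
interface DsDdlSchema { boolean isFloatingEnd(String relType, String role); }       // DS-DDL schema traversal
interface SqlDmlGraph { void addJoinCondition(String table, String condition); }    // SQL-DML API
interface ReductionContext { void reduce(DsDmlNode subNode, SqlDmlGraph target); }  // reduction API

// A reduction method is applicable only if both preconditions hold: one on the
// DS-DDL schema (schema divergence) and one on the node's neighborhood in the
// DS-DML source graph (source-graph divergence).
abstract class ReductionMethod {
    abstract boolean schemaPrecondition(DsDmlNode node, DsDdlSchema schema);
    abstract boolean sourceGraphPrecondition(DsDmlNode node);
    abstract void reduce(DsDmlNode node, DsDdlSchema schema, SqlDmlGraph target, ReductionContext ctx);
}

// "Reduction polymorphism": instead of explicit conditionals in one large method,
// the driver picks the first registered method whose preconditions match.
class ReductionDispatcher {
    private final List<ReductionMethod> methods;

    ReductionDispatcher(List<ReductionMethod> methods) { this.methods = methods; }

    void dispatch(DsDmlNode node, DsDdlSchema schema, SqlDmlGraph target, ReductionContext ctx) {
        for (ReductionMethod m : methods) {
            if (m.schemaPrecondition(node, schema) && m.sourceGraphPrecondition(node)) {
                m.reduce(node, schema, target, ctx);
                return;
            }
        }
        throw new IllegalStateException("No reduction method matches node of kind " + node.kind());
    }
}
```

In this style, the divergence-related decisions live in small, reusable precondition checks rather than in one large method, which is the essence of the reduction polymorphism described above.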

7 Conclusion and Future Work

In this paper, we examined custom schema development and data manipulation languages that facilitate increased reuse within database-oriented software product lines. Our empirical evaluation, based on an example product line for versioning systems, shows that the portion of time required for mapping domain-specific statements to SQL at run time is below 9.9%. For this reason, we claim that domain-specific languages introduce great benefits in terms of raising the abstraction level in schema development and data queries at practically no cost. There is a range of topics we want to focus on in our future work. Is there a way to make DS-DMLs even faster? Complex reduction methods can clearly benefit from the following ideas. Source graphs typically consist of an unusually large number of objects that have to be created at run time, so the approach could benefit from instance pools to minimize object-creation overhead. Caching of SQL-DML source graphs can be applied to reuse them when reducing upcoming statements. Would it be possible to use parameterized stored procedures to answer DS-DML statements? This would make the reduction of DS-DML statements simpler, because a statement could be reduced to a single stored-procedure call. On the other hand, it would make the reduction of the DS-DDL schema more complex, because stored procedures capable of answering the queries have to be prepared. We assume this approach is especially useful when many SQL-DML statements are needed to execute a DS-DML statement.


Implementing a stored procedure for a sequence of statements avoids excessive communication (a) between the domain-specific driver and the native driver and (b) between the native driver and the database. In a number of cases where a sequence of SQL-DML statements is produced as a result of reduction, these statements need not necessarily be executed sequentially. Thus, developers of reduction methods should be given the possibility to explicitly mark situations where the driver could take advantage of parallel execution. In addition, dealing with DS-DDL schemas raises two important questions. DS-DDL schema evolution: supplementary approaches are clearly required to deal with modifications to a DS-DDL schema that imply a number of changes to existing SQL-DDL constructs. Product-line mining: many companies develop and market a number of systems that were implemented independently despite their structural and functional similarities, i.e., without proper product-line support. Existing schemas for these systems could be mined to extract common domain-specific abstractions and possible reductions, which can afterwards be used in the future development of new systems.

References
1. Apache Jakarta Project: Struts, available as: http://jakarta.apache.org/struts/
2. Ben-Natan, R., Sasson, O.: IBM San Francisco Developer's Guide, McGraw-Hill, 1999
3. Bernstein, P.A.: Repositories and Object-Oriented Databases, in: SIGMOD Record 27:1 (1998), 34-46
4. Clements, P., Northrop, L.: Software Product Lines, Addison-Wesley, 2001
5. Czarnecki, K., Eisenecker, U.W.: Generative Programming: Methods, Tools, and Applications, Addison-Wesley, 2000
6. Frankel, D.S.: Model Driven Architecture: Applying MDA to Enterprise Computing, Wiley Publishing, 2003
7. Halstead, M.H.: Elements of Software Science, Elsevier, 1977
8. Kießling, W., Köstler, G.: Preference SQL – Design, Implementation, Experiences, in: Proc. VLDB 2002, Hong Kong, Aug. 2002, 990-1001
9. Mahnke, W.: Towards a Modular, Object-Relational Schema Design, in: Proc. CAiSE 2002 Doctoral Consortium, Toronto, May 2002, 61-71
10. McCabe, T.J.: A Complexity Measure, in: IEEE Transactions on Software Engineering 2:4 (1976), 308-320
11. MetaCase: MetaEdit+ Product Website, available as: http://www.metacase.com/mep/
12. OMG: Common Warehouse Metamodel (CWM) Specification, Vol. 1, Oct. 2001
13. OMG: Model Driven Architecture (MDA) – A Technical Perspective, July 2001
14. Saeki, M.: Toward Automated Method Engineering: Supporting Method Assembly in CAME, presentation at EMSISE'03 workshop, Geneva, Sept. 2003
15. Simonyi, C.: The Death of Computer Languages, the Birth of Intentional Programming, Tech. Report MSR-TR-95-52, Microsoft Research, Sept. 1995
16. Thalheim, B.: Component Construction of Database Schemes, in: Proc. ER 2002, Tampere, Oct. 2002, 20-34
17. Weber, C., Kovse, J.: A Domain-Specific Language for Versioning, Jan. 2004, available as: http://wwwdvs.informatik.uni-kl.de/agdbis/staff/Kovse/DSVers/DSVers.pdf

Incremental Navigation: Providing Simple and Generic Access to Heterogeneous Structures*

Shawn Bowers1 and Lois Delcambre2

1 San Diego Supercomputer Center at UCSD, La Jolla CA 92093, USA
2 OGI School of Science and Engineering at OHSU, Beaverton OR 97006, USA

Abstract. We present an approach to support incremental navigation of structured information, where the structure is introduced by the data model and schema (if present) of a data source. Simple browsing through data values and their connections is an effective way for a user or an automated system to access and explore information. We use our previously defined Uni-Level Description (ULD) to represent an information source explicitly by capturing the source’s data model, schema (if present), and data values. We define generic operators for incremental navigation that use the ULD directly along with techniques for specifying how a given representation scheme can be navigated. Because our navigation is based on the ULD, the operations can easily move from data to schema to data model and back, supporting a wide range of applications for exploring and integrating data. Further, because the ULD can express a broad range of data models, our navigation operators are applicable, without modification, across the corresponding model or schema. In general, we believe that information sources may usefully support various styles of navigation, depending on the type of user and the user’s desired task.

1 Introduction

With the WWW at our fingertips, we have grown accustomed to easily using unstructured and loosely-structured information of various kinds, from all over the world. With a web browser it is very easy to: (1) view information (typically presented in HTML), and (2) download information for viewing or manipulating in tools available on our desktops (e.g., Word, PowerPoint, or Adobe Acrobat files). In our work, we are focused on providing similar access to structured (and semi-structured) information, in which data conforms to the structures of a representation scheme or data model. There is a large and growing number of structural representation schemes being used today including the relational, E-R, object-oriented, XML, RDF, and Topic Map models along with special-purpose representations, e.g., for exchanging scientific data. Each representation scheme is typically characterized by its choice of constructs for representing data and schema, allowing data engineers to select the representation best suited for their needs. However, there are few tools that allow data stored in different representations to be viewed and accessed in a standard way, with a consistent interface.

* This work supported in part by NSF grants EIA 9983518 and ITR 0225674.



The goal of this work is to provide generic access to structured information, much like a web browser provides generic access to viewable information. We are particularly interested in browsing a data source where a user can select an individual item, select a path that leads from the item, follow the path to a new item, and so on, incrementally through the source. The need for incremental navigation is motivated by the following uses. First, we believe that simple browsing tools provide people with a powerful and easy way to access data in a structured information source. Second, generic access to heterogeneous information sources supports tools that can be broadly used in the process of data integration [8,10]. Once an information source has been identified, its contents can be examined (by a person or an agent) to determine if and how it should be combined (or integrated) with other sources. In this paper, we describe a generic set of incremental-navigation operators that are implemented against our Uni-Level Description (ULD) framework [4,6]. We consider both a low-level approach for creating detailed and complete specifications as well as a simple, high-level approach for defining specifications. The high-level approach exploits the rich structural descriptions offered by the ULD to automatically generate the corresponding detailed specifications for navigating information sources. Thus, our high-level specification language allows a user to easily define and experiment with various navigation styles for a given data model or representation scheme. The rest of this paper is organized as follows. In Section 2 we describe motivating examples and Section 3 briefly presents the Uni-Level Description. In Section 4, we define the incremental navigation operators and discuss approaches to specifying their implementation. Related work is presented in Section 5 and in Section 6 we discuss future work.

2 Motivating Examples

When an information agent discovers a new source (e.g., see Figure 1) it may wish to know: (1) what data model is used (is it an RDF, XML, Topic Map, or relational source?), (2) (assuming RDF) whether any classes are defined for the source (what is the source schema?), (3) which properties are defined for a given class (what properties does the film class have?), (4) which objects exist for the class (what are the instances of the film class?) and (5) what kinds of values exist for a given property of a particular object of the class (what actor objects are involved in this film object?). This example assumes the agent (or user) understands the data model of the source. For example, if the data model used was XML (e.g., see Figure 2) instead of RDF, the agent could have started navigation by asking for all of the available element types (rather than RDF classes). We call this approach data-model-aware navigation, in which the constructs of the data model can be used to guide navigation. In contrast, we also propose a form of browsing where the user or agent need not have any awareness of the data-model structures used in a data source. The user or agent is able to navigate through the data and schema directly. As an example (again using Figure 1), the user or agent might ask for: (1) the kind of information the source contains, which in our example would include “films,” “actors,” and “awards,” etc., (2) (assuming the crawler is interested in films) the things that describe films, which


Fig. 1. An example of an RDF schema and instance.

Fig. 2. An example XML DTD (left) and instance document (right).

would include “titles” and relationships to awards and actors, (3) the available films in the source, and (4) the actors of a particular film, which is obtained by stepping across the “involved” link for the film in question. We call this form of browsing simple navigation.

3 The Uni-Level Description

The Uni-Level Description (ULD) is both a meta-data-model (i.e., capable of describing data models) and a distinct representation scheme: it can directly represent both schema and instance information expressed in terms of data-model constructs. Figure 3 shows how the ULD represents information, where a portion of an object-oriented data model is described. The ULD is a flat representation in that all information stored in the ULD is uniformly accessible (e.g., within a single query) using the logic-based operations described in Table 1. Information stored in the ULD is logically divided into three layers, denoted meta-data-model, data model, and schema and data instances. The ULD meta-data-model, shown as the top level in Figure 3, consists of construct types that denote structural primitives. The middle level uses the structural primitives to define both data and schema constructs, possibly with conformance relationships between them. Constructs are necessarily instances of construct types, represented with ct-inst instance-of links.


Fig. 3. The ULD meta-data-model architecture.

Similarly, every item introduced in the bottom layer, denoting actual data or schema items, is necessarily an instance of a construct in the middle layer, represented with c-inst instance-of links. An item in the bottom layer can be an instance of another item in the bottom layer, represented with d-inst instance-of links, as allowed by the conformance relationships specified in the middle layer. For example, in Figure 3, the class and object constructs are related through a conformance link, labeled conf, and their corresponding construct instances in the bottom layer, i.e., actor and the object with the name ‘Robert De Niro’ are related through a data instance-of link, labeled d-inst. The ULD offers flexibility through the conf and d-inst relationships. For example, an XML element that does not have an associated element type can be represented in the ULD; the element would simply not have a d-inst link to any XML element type. The ULD represents an information source as a configuration containing the constructs of a data model, the construct instances (both schema and data) of a source, and the associated conformance and instance-of relationships. A configuration can be viewed as an instantiation of Figure 3. Each configuration uses a finite set of identifiers to denote construct types, constructs, and construct instances as well as a finite set of ct-inst, c-inst, conf, and d-inst facts. We note that a configuration can be implemented as a logical view over an information source, and is not necessarily “materialized.” The ULD meta-data-model contains primitive structures for tuples, i.e., sets of name-value pairs; set, list, and bag collections; atomics, for scalar values such as strings and integers; and unions, for representing non-structural, generalization relationships


Fig. 4. The XML with DTD data model.

among constructs. The construct-type identifiers for these structures are denoted struct-ct, set-ct, list-ct, bag-ct, atomic-ct, and union-ct, respectively. Figures 4, 5, and 6 give example descriptions of simplified versions of XML with DTDs, RDF with RDF Schema, and sample schema and data (from Figure 1) for the RDF model, respectively. We note that there are potentially many ways to describe a data model in the ULD, and these examples show only one choice of representation. The XML data model shown in Figure 4 includes constructs for element types, attribute types, elements, attributes, content models, and content, where element types contain attribute types and content specifications, elements can optionally conform to element types, and attributes can optionally conform to attribute types. We simplify content models to sets of element types for which a conforming element must have at least one subelement for each corresponding type. The RDF data model with RDF Schema (RDFS) of Figure 5 includes constructs for classes, properties, resources, and triples. A triple in RDF contains a subject, predicate, and object, where a predicate can be an arbitrary resource, including a defined property. In RDFS, rdf:type, rdfs:subClassOf, and rdfs:subPropertyOf are considered special RDF properties for denoting instance and specialization relationships. However, we model these properties using conformance and explicit constructs. For example, a subclass relationship is represented by instantiating a subClassOf construct as opposed to using the special rdfs:subClassOf RDF property. A ULD query is expressed as a Datalog program [1] and is executed against a configuration. As an example, the first query below finds all available class names within an RDF configuration. Note that upper-case terms denote variables and lower-case terms denote constants. The rule is read as “If C is an RDF class and the label of C is X, then X is a classname.” The second query returns the property names of all classes in an RDF configuration. This query, like the first, is expressed solely against the schema of the source. The third query below is expressed directly against data, and returns the URI of all RDF resources used as a property in at least one triple, where the resource may or may not be associated with schema.

1 We use uldValue and uldValuetype as special constructs to denote scalar values and value types [4, 6]. Also, uldString and uldURI are default atomic constructs provided by the ULD.
2 This ULD representation of RDF allows properties and isa relationships to be decoupled (compared with RDF itself). This approach does not limit the expressibility of RDF: partial, optional, and multiple levels of schema are still possible.
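To give a concrete feel for how such queries read a configuration, the following minimal sketch — written in Java rather than Datalog, with invented identifiers — stores c-inst and d-inst facts and collects class labels the way the first RDF query described above does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a ULD configuration: c-inst (construct instance-of) and d-inst
// (data instance-of) facts, plus labels for items that have them. All identifiers
// below are invented for illustration only.
class UldConfigurationSketch {
    final List<String[]> cInst = new ArrayList<>();     // pairs {instanceId, constructId}
    final List<String[]> dInst = new ArrayList<>();     // pairs {dataId, schemaItemId}
    final Map<String, String> labels = new HashMap<>(); // id -> label, when defined

    // Mirrors the first query described in the text:
    // "If C is an RDF class and the label of C is X, then X is a classname."
    List<String> classNames(String classConstructId) {
        List<String> names = new ArrayList<>();
        for (String[] fact : cInst) {
            String instance = fact[0], construct = fact[1];
            if (construct.equals(classConstructId) && labels.containsKey(instance)) {
                names.add(labels.get(instance));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        UldConfigurationSketch config = new UldConfigurationSketch();
        // A fragment in the spirit of Figure 6: the RDF class "actor" and one resource.
        config.cInst.add(new String[] {"c_actor", "class"});  // actor is an instance of the class construct
        config.labels.put("c_actor", "actor");
        config.cInst.add(new String[] {"r1", "resource"});    // a resource instance ...
        config.dInst.add(new String[] {"r1", "c_actor"});     // ... whose d-inst link points to actor
        System.out.println(config.classNames("class"));       // prints [actor]
    }
}
```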


Fig. 5. The RDF with RDF Schema data model.

Fig. 6. Portion of schema and data for RDF(S).

The following three queries are similar to the previous three, but are expressed against an XML configuration. The first query finds the names of all available element types in the source, the second finds, for each element-type name, its corresponding attribute-definition names, and the last finds all available attribute names as a data query.

Finally, the following query returns all constructs that serve as struct-ct schema constructs and their component selectors. This query is solely expressed against the data-model constructs.

4 Navigation Operators

The ULD presents a complete, highly detailed description of a data source, with interconnected model, schema, and data information. In the ULD, each construct type,


construct, and instance is represented by an id and every id, in turn, has an associated value. The value can be either an atomic value (such as a literal in RDF) or a structured value (such as a set or bag of ids). We view navigation as a process of traversing a graph consisting of locations (nodes) and links (bi-directional edges), superimposed over a ULD source file. A location is either a construct type, construct, or instance in the ULD; thus a location is anything with an id. A link is a (simple or compound) path, from one location to another, through the connections in the ULD. A navigation binding consists of an implementation of the following functions. For a binding, we assume a finite set of location names and a finite set of link names, both consisting of atomic string values. Navigation consists of moving from one location name to another. The binding should include only those locations that are meaningful to the intended user community, with appropriate links.
Starting Points. The operator sloc returns all available entry points into an information source. We require the result of sloc to be a set of locations (as opposed to links).
Links. The operator links returns all out-bound links available from a particular location. For some locations, there may not be any links available, i.e., the links operator may return the empty set.
Following Links. The operator follow returns the set of locations that are at the end of a given link. We use the follow operator to prepare to move to a new location from our current location. Given the set of locations returned by the follow operator, the user or agent directing the navigation can choose one as the new location.
Types. The operator types returns the (possibly empty) set of types for a given location. We use the types operator to obtain locations that represent the schema for a data item. A particular location may have zero, one, or many associated types.
Extents. The operator extent returns the (possibly empty) set of instances for a given location. The extent operator computes the inverse of the types operator.
As a simple example, the following (partial) navigation binding can be defined for the data shown in Figure 6.

We express the navigation functions in Datalog using the predicates described below. An operator binding is a set of ULD queries, where the head of each query is a navigation operation (expressed as a predicate). Thus, operator bindings are defined as global-as-view mappings from the ULD (typically, over any configuration of a data model) to the navigation operations. We propose two ways to specify a navigation binding: as a set of low-level ULD queries and as a high-level specification that is used to automatically generate the appropriate navigation bindings. The navigation operations are expressed with the following predicates:
sloc(X), where X represents a starting location;
links(X, L), where L is a link from location X;
follow(X, L, Y), where Y is a location found by following link L from location X;
types(X, T), where X is a location of type T;
extent(T, X), where X is in the extent of T.


To illustrate, the following low-level binding queries present a view of an RDF configuration in which we only allow navigation from data items with corresponding schema. Thus, this example supports simple browsing. The starting locations are classes, and the available links from a class are its associated properties and its associated instances. The definition uses an additional intensional predicate subClassClosure for computing the transitive closure of the RDF subclass relationship.

In general, with low-level binding queries a user can specify detailed and exact descriptions of the navigation operations for data sources. To specify higher-level bindings, a user selects certain constructs as locations and certain other constructs as links. Using this specification, the navigation operators are automatically computed by traversing the appropriate instances of locations and links in the configuration. Figure 7 shows an example of a high-level binding definition for RDF, where RDF classes, resources, and literals are considered sources for locations and RDF properties and triples are considered sources for links (Figure 10 shows a similar binding for XML, which we discuss later). We define a high-level binding specification as a tuple (L, N, S, F). The disjoint sets L and N consist of construct identifiers such that L is the set of constructs used as locations and N is the set of constructs used as links. The set S gives the entry points of the binding. Finally, the set F contains link definitions (described below). Each construct in L and N has an associated naming definition that describes how to compute the name of an instance of the construct. The name would typically be viewed by the user during navigation. The naming definitions serve to map location and


Fig. 7. A high-level binding for simple navigation of RDF.

link instances to appropriate string values. For example, in Figure 7, RDF classes and properties are named by their labels, resources are named by their URI values, a literal value is used directly as its name, and the name of a triple is the name of its associated predicate value. The incremental operators in a high-level binding specification are computed automatically by traversing connected instances. We define the following generic rules to compute when two instances are connected. (Note that these connected rules only perform single-step traversal, and can be extended to allow an arbitrary number of steps, which we discuss at the end of this section.)

A connected formula is true when there is a structural connection between two instances. Note that the rules above do not consider the case when two items are linked by a d-inst relationship; that relationship is instead used directly by the types and extent operators, whereas connections are used by the links and follow operators. Each link definition in F describes the behavior of one link construct: it names the link construct, the two location constructs whose instances the link construct can connect (so that we can traverse from instances of the first to instances of the second via an instance of the link construct), and two selector expressions that further restrict how instances of the link construct are used to connect those instances. For example, in Figure 7, the first link definition is for the triple construct: it says that we can follow a resource instance to a literal instance if they are connected by a triple, and that triples link resources and literals through the triple’s hasSubj and hasObj selectors, respectively. We define the linkSource and linkTarget clauses as follows. Given a link definition and a connection from a location instance to a link instance, linkSource holds if the link instance is an instance of the link construct named in the definition, the location instance is an instance of the definition’s first location construct, and the two are connected according to the first selector expression.


Fig. 8. Datalog rules to compute links and locations.

Similarly, linkTarget holds if the link instance is an instance of the link construct, the target instance is an instance of the definition’s second location construct, and the two are connected according to the second selector expression. Given the above definitions, we automatically compute each navigation operator using the Datalog queries in Figure 8. We assume each operator is represented as an intensional predicate (as before) and that the binding specification B = (L, N, S, F) is stored as a set of unary extensional predicates (one each for the locations, the links, and the starting points of B). For example, the location predicate binds X to a location in L for binding B. We also assume that the name predicate is stored as an intensional formula (as defined in the binding). The first rule in Figure 8 finds the set of entry points: it obtains a construct in the set of starting locations, finds an instance of the construct, and then computes the name of the instance. The second rule finds the locations with links. For each named instance in the configuration that is connected to another instance, we use the linkSource predicate to check that it is a valid connection, we check that the link instance (represented as the variable Y) is valid, and we then compute the name of the instance. The third rule is similar to the second, except that it additionally uses linkTarget to determine the new location. Finally, the last two rules use the d-inst relationship to find types and extents, respectively. To demonstrate the approach, we use the binding definition of Figure 7 and the sample configuration of Figure 9. This configuration shows part of Figure 1 as a graph whose nodes are construct instances and whose edges are either connections between structures or links. Consider the following series of invocations.
1. sloc = {‘film’, ‘thriller’}. According to the binding definition, the sloc operator returns all the labels of class construct instances. As shown in Figure 9, the only class construct instances are thriller and film.
2. links(‘film’) = {‘title’}. The links operator is computed by considering each connected construct instance of film until it finds a construct instance whose associated construct is in N. As shown, the only such instance is title, which is an RDF property.
3. extent(‘film’) = {‘#m1’}. The extent operator looks for the d-inst links of the given instance. As shown, the only such link for film is to m1.
4. follow(‘#m1’, ‘title’) = {‘The Usual Suspects’}. The follow operator starts in the same way as the links operator by finding instances (whose constructs are in N) that are connected to the given instance. For the given item, the only such instance in Figure 9 is t1. The follow operator then returns the hasObj component of t1, according to the link definition for triple.
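As a rough illustration, the five operators could also be exposed through a small programmatic interface and driven through the same four steps; the interface and names below are hypothetical and do not correspond to the prototype browser’s actual API.

```java
import java.util.Set;

// Hypothetical programmatic view of the five navigation operators.
interface NavigationBinding {
    Set<String> sloc();                                // starting locations (entry points)
    Set<String> links(String location);                // out-bound links from a location
    Set<String> follow(String location, String link);  // locations reached by following a link
    Set<String> types(String location);                // types (schema) of a location
    Set<String> extent(String location);               // instances of a location
}

// Walks the same steps as invocations 1-4 above against the RDF data of Figure 9.
class IncrementalWalk {
    static void browse(NavigationBinding rdf) {
        Set<String> start = rdf.sloc();                   // {"film", "thriller"}
        Set<String> filmLinks = rdf.links("film");        // {"title"}
        Set<String> films = rdf.extent("film");           // {"#m1"}
        Set<String> titles = rdf.follow("#m1", "title");  // {"The Usual Suspects"}
        System.out.println(start + " " + filmLinks + " " + films + " " + titles);
    }
}
```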


Fig. 9. An example of labeled instances for RDF.

Note that for this example, the rules for computing the follow and links operators only consider instances that are directly connected to each other (through the connected predicate). Thus, invoking the operator links(‘thriller’) will not return a result (for our example) because there are no link construct instances directly connected to the associated RDF class. However, the properties of a class in RDF also include the properties of its superclasses. We can include such information by expanding the set of connection rules. One approach is to allow the navigation specifier to add a connected rule specifically for the subclass case. Alternatively, we can extend the connection definition to compute the transitive closure (of connections) using the following rule.

We also allow binding specifications to include multiple-step path expressions in F. For example, we add the following link definitions to the RDF binding specification to correctly support subclasses.

Finally, Figure 10 shows a high-level binding specification for the XML data model of Figure 4. The binding specification assumes the transitive connection relation defined above. Element types, elements, and atomic data serve as locations, with element types as starting locations. Attribute definitions, attributes, content definitions, and content serve as links. We use ‘hasChildType’ and ‘hasChild’ strings as the names of the links for content definitions and element content, respectively. Note the attDef link definition is a special case in which attDef links always lead to an empty set of locations (denoted using the empty set in Figure 10). Also, the ending ‘/’ in a link-definition path denotes traversal into the elements of a collection structure (as opposed to denoting the collection structure itself).

5 Related Work

A number of approaches provide browsing capability for traditional databases. Motro [14] seeks to enable users who are (1) not familiar with the data model of the system, (2)


Fig. 10. A high-level binding for direct navigation of XML.

not familiar with the organization of the database (i.e., the schema), (3) not proficient with the use of the system (i.e., the query language), (4) not sure what data they are looking for (but are looking for something interesting or suitable), and/or (5) not clear how to construct the desired query. As more structured information finds its way on the Web, we believe these issues become more pressing for users as well as for software agents wishing to exploit structured information. Database browsing typically assumes a fixed data model [2, 14, 12, 17, 3, 8, 18] (either relational, E-R, or object-oriented). Only a few systems allow browsing schema and data in isolation [3, 12, 14], where most support browsing data only through schema (i.e., navigating data using items of the schema). Hypertext systems, including those with structured data models, also use browser-based interfaces [15, 13]. These systems, as in database approaches, are developed for a single data model, and support limited browsing styles. The links and locations abstraction used by incremental navigation is similar in spirit to the graph-based model of RDF and RDF Schema. The Object Exchange Model (OEM) – the semi-structured representation of TSIMMIS [11, 16] – is another simple abstraction. Both TSIMMIS and some database browsing systems [2, 9, 14, 8] support user navigation mixed with user queries, which we would like to explore as an extension to our current navigation operators. Finally, Clio [18] provides some support for navigation, specifically to help users build data-transformation queries. Clio supports “data walks,” which display example data involved in each potential join path between two relations, and “data chases,” which display all occurrences of a specific value within a database.

6 Conclusion and Future Work

We believe incremental navigation provides a simple, generic abstraction, consisting of links and locations (and types and extents when applicable), that can be applied over arbitrary data models. More than that, with the high-level binding approach, it becomes


relatively straightforward to specify the links and locations for a data model, thus enabling generic and uniform access to information represented in any underlying data model (described in the ULD). We believe that this approach can be extended beyond navigation to include querying information sources (i.e., querying links and locations) and for specifying high-level mappings between data sources. We have implemented a prototype browser [5] to demonstrate incremental navigation, both for data-model aware and simple navigation of RDF, XML, Topic Map, and relational sources. In addition, some of the ideas of incremental navigation appear in the Superimposed Schematics browser [7], which allows users to incrementally navigate an ER schema and data source. Based on these experiments, we believe incremental navigation is viable, and helps reduce the work required to develop such browsing tools. For future work, we intend to investigate whether additional ULD information can be used, such as data-model constraints, to help validate and generate operator bindings. We are also interested in defining a language to express path-based queries over the links and locations abstraction offered by incremental navigation. One issue is to determine whether algorithms and optimizations can be defined to efficiently compute (i.e., unfold) the binding-specification rules to answer such path queries. Finally, we believe that the incremental-navigation operators can be easily expressed as a standard web-service interface (where information sources have corresponding web-service implementations), providing generic, web-based access to heterogeneous information.

References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley Publishing Company, 1995.
2. B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, and P. Sudarshan. BANKS: Browsing and keyword searching in relational databases. In Proceedings of the Twenty-Eighth Very Large Data Bases (VLDB) Conference, 2002.
3. R. Agrawal, N. Gehani, and J. Srinivasan. OdeView: The graphical interface to Ode. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 34–43, 1990.
4. S. Bowers. The Uni-Level Description: A Uniform Framework for Managing Structural Heterogeneity. PhD thesis, OGI School of Science and Engineering, OHSU, December 2003.
5. S. Bowers and L. Delcambre. JustBrowsing: A generic API for exploring information. In Demo Session at the 21st International Conference on Conceptual Modeling (ER), 2002.
6. S. Bowers and L. Delcambre. The uni-level description: A uniform framework for representing information in multiple data models. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER), volume 2813 of Lecture Notes in Computer Science, pages 45–58. Springer-Verlag, 2003.
7. S. Bowers, L. Delcambre, and D. Maier. Superimposed schematics: Introducing E-R structure for in-situ information selections. In Proceedings of the 21st International Conference on Conceptual Modeling (ER), volume 2503 of Lecture Notes in Computer Science, pages 90–104. Springer-Verlag, 2002.
8. M. J. Carey, L. M. Haas, V. Maganty, and J. H. Williams. PESTO: An integrated query/browser for object databases. In Proceedings of 22nd International Conference on Very Large Data Bases (VLDB), pages 203–214. Morgan Kaufmann, 1996.
9. T. Catarci, G. Santucci, and J. Cardiff. Graphical interaction with heterogeneous databases. The VLDB Journal, 6(2):97–120, 1997.


10. W. W. Cohen. Some practical observations on integration of web information. In Informal Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), pages 55–60, 1999.
11. J. Hammer, H. Garcia-Molina, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom. Information translation, mediation, and mosaic-based browsing in the TSIMMIS system. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, page 483. ACM Press, 1995.
12. M. Kuntz and R. Melchart. Ergonomic schema design and browsing with more semantics in the Pasta-3 interface for E-R DBMSs. In Proceedings of the Eighth International Conference on Entity-Relationship Approach, pages 419–433, 1989.
13. C. C. Marshall, F. M. Shipman III, and J. H. Coombs. VIKI: Spatial hypertext supporting emergent structure. In European Conference on Hypertext Technology (ECHT), pages 13–23. ACM Press, 1994.
14. A. Motro. BAROQUE: A browser for relational databases. ACM Transactions on Office Information Systems, 4(2):164–181, 1986.
15. J. Nanard and M. Nanard. Should anchors be typed too? An experiment with MacWeb. In Proceedings of Hypertext, pages 51–62, 1993.
16. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of the Eleventh International Conference on Data Engineering, pages 251–260. IEEE Computer Society, 1995.
17. T. Rogers and R. Cattell. Entity-Relationship databases user interfaces. In Sixth International Conference on Entity-Relationship Approach, pages 353–365, 1997.
18. L. L. Yan, R. J. Miller, L. M. Haas, and R. Fagin. Data-driven understanding and refinement of schema mappings. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. ACM Press, 2001.

Agent Patterns for Ambient Intelligence

Paolo Bresciani, Loris Penserini, Paolo Busetta, and Tsvi Kuflik

ITC-irst, via Sommarive 18, I-38050 Trento-Povo, Italy
{bresciani,penserini,busetta,kuflik}@itc.it

Abstract. The realization of complex distributed applications, required in areas such as e-Business, e-Government, and ambient intelligence, calls for new development paradigms, such as the Service Oriented Computing approach, which accommodates dynamic and adaptive interaction schemata carried out at a peer-to-peer level. Multi Agent Systems offer natural architectural solutions to several requirements imposed by such an adaptive approach. This work discusses the limitations of common agent patterns, typically adopted in distributed information systems design, when applied to service oriented computing, and introduces two novel agent patterns, which we call the Service Oriented Organization and the Implicit Organization Broker agent pattern, respectively. Some design aspects of the Implicit Organization Broker agent pattern are also presented. The limitations and the proposed solutions are demonstrated through the development of a multi-agent system that implements a pervasive museum visitors' guide. Some of its architecture and design features serve as a reference scenario for demonstrating both the limitations of current methods and the contribution of the newly proposed agent patterns and the associated communication framework.

1 Introduction

Complex distributed applications emerging in areas such as e-Business, e-Government, and the so-called ambient intelligence (i.e., “intelligent” pervasive computing [7]) need to adopt forms of group communication that are deeply different from classical client-server and Web-based models (see, for instance, [13]). This strongly motivates forms of application-level peer-to-peer interaction, clearly distinct from the request/response style commonly used to access distributed services such as, e.g., Web Services adopting SOAP, XML, and RPC as communication protocols [6, 12]. So-called service oriented computing (SOC) is the paradigm that accommodates the above-mentioned more dynamic and adaptive interaction schemata. Service-oriented computing is applicable to ambient intelligence as a way to access environmental services, e.g., accessing sensors or actuators close to a user. Multi Agent Systems (MAS) naturally accommodate the SOC paradigm. In fact, each service can be seen as an autonomous agent (or an aggregation of autonomous agents), possibly without global visibility and control over the global system, and characterized by unpredictable/intermittent connections with other agents of the system. However, we argue that some domain specificities – such as the necessity to continuously monitor the environment to understand the context and adapt to the user's needs, and the speed


at which clients and service providers come and go within a physical environment populated with mobile devices – impose new challenging system architecture requirements that are not satisfied by traditional agent patterns proposed for request/response interactions. Moreover, in ambient intelligence applications, we often need to deal effectively with service composition based on dynamic agreements among autonomous peers, because groups of peers collaborate at different levels and times during the service-providing life-cycle, analogously to the product life-cycle process introduced by Virtual Enterprise scenarios [14]. These group communication styles should be used as architectural alternatives or extensions to middle agents (e.g., matchmakers and brokers), simplifying the application logic and moving context-specific decision-making from high-level applications or intermediate agents down to the agents called to achieve a goal. For these reasons, in this paper, we propose novel agent patterns which allow for dynamic, collective, and collaborative reconfiguration of service-providing schemata. To illustrate our approach, we use the notion of service oriented organization. We call a service oriented organization (SOO) a set of autonomous software agents that, in a given location at a given time, coordinate in order to provide a service; in other words, an SOO is a team of agents whose goal is to deliver a service to its clients. Examples of SOOs are not restricted to Web Services and ambient intelligence; for instance, they include virtual enterprises or organizations [14,8], the names of which reflect the application area in which they have been adopted, i.e., e-Business. This paper focuses on a special type of SOO that we call the implicit organization broker (IOB), since it exploits a form of group communication called channelled multicast [3] to avoid explicit team formation and to agree dynamically on the service composition. We will compare the SOO and IOB to traditional agent patterns based on brokers or matchmakers. As a reference example to illustrate our ideas, we adopt an application scenario from Peach [15], an ongoing project for the development of an interactive museum guide. Using the Peach system, users can request information about exhibits; this information may be provided by a variety of information sources and media types (museum server, online remote servers, video, etc.). We also adopt the Tropos software design methodology [5,2] to illustrate and compare the different agent patterns. Tropos adopts high-level requirements engineering concepts founded on notions such as actor, agent, role, position, goal, softgoal, task, resource, and belief, and different kinds of social dependency between actors [5,2,11]. Therefore, Tropos allows for a modeling level more abstract than other current methodologies such as UML and AUML [1]. Such properties fit well with our major interest, which is in modeling the environmental constraints that affect and characterize agents' roles and their intentional and social relationships, rather than in implementation and/or technological issues. Section 2 briefly recalls some background notions on Tropos, on service oriented computing, and on agent patterns. Section 3 describes and discusses an excerpt of the Peach project, adopted as a reference case to illustrate our arguments. Section 5.1 tries to overcome some limitations of traditional patterns by proposing two new agent patterns: the Service Oriented Organization and the Implicit Organization Broker.
Section 5.2 aims at justifying group communication as fundamental to effectively deal with the proposed patterns, providing a rationale view and describing dynamic aspects. Some conclusions are given in Section 6.


2 Background

Tropos. The Tropos methodology [2,5] adopts ideas from Multi Agent Systems technologies and concepts from requirements engineering; it builds on an organizational modeling framework for early requirements analysis [18], founded on notions such as actor, agent, role, goal, softgoal, task, resource, and different kinds of social dependency between actors. Actors represent any active entity, either individual or collective, and either human or artificial. Thus, an actor may represent a person, a social group (e.g., an enterprise or a department), or an artificial system such as an interactive museum guide or each of its components (both hardware and software) at different levels of granularity. Actors may be further specialized as roles or agents. An agent represents a physical (human, hardware, or software) instance of an actor that performs the assigned activities. A role, instead, represents a specific function that, in different circumstances, may be executed by different agents – we say that the agent plays the role. Actors (agents and roles) are used in Tropos to describe different social dependency and interaction models. In particular, Actor Diagrams (see Figures 1, 3, and 4) describe the network of social dependencies among actors. An Actor Diagram is a graph, where each node may represent either an actor, a goal, a softgoal, a task, or a resource. Links between nodes may be used to form paths of the form depender – dependum – dependee, where the depender and the dependee are actors, and the dependum is either a goal, a softgoal, a task, or a resource. Each path between two actors indicates that one actor depends on the other for something (represented by the dependum) so that the former may attain some goal/softgoal/task/resource. In other terms, a dependency describes a sort of “agreement” between two actors (the depender and the dependee) in order to attain the dependum. The depender is the depending actor, and the dependee the actor who is depended upon. The type of the dependum describes the nature of the dependency. Goal dependencies are used to represent delegation of responsibility for fulfilling a goal; softgoal dependencies are similar to goal dependencies, but their fulfillment cannot be defined precisely (for instance, the appreciation is subjective, or the fulfillment can occur only to a given extent); task dependencies are used in situations where the dependee is required to perform a given activity; and resource dependencies require the dependee to provide a resource to the depender. As exemplified in Figure 1, actors are represented as circles; dependums – goals, softgoals, tasks, and resources – are represented as ovals, clouds, hexagons, and rectangles, respectively. Goals and softgoals introduced with Actor Diagrams can be further detailed and analyzed by means of the so-called Goal Diagrams [2], in which the rationale of each (soft)goal is described in terms of goal decompositions, means-end analysis, and the like, as, e.g., in Figure 5. Tropos spans four phases of Requirements Engineering and Software Engineering activities [5,2]: Early Requirements Analysis, Late Requirements Analysis, Architectural Design, and Detailed Design. Its key premise is that agents and goals can be used as fundamental concepts for all the phases of the software development life cycle. Actor and Goal Diagrams are adopted from Early Requirements Analysis to Architectural Design. Here, we use them to describe the agent patterns we are interested in.

We do not adopt any graphical distinction between agents and roles: when needed, we clarify it in the text.
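To make the dependency structure concrete, the sketch below (our own illustration, not part of Tropos or of the Peach implementation; all class and value names are ours) shows one way to encode actors, dependums, and depender → dependum → dependee links as plain data, roughly mirroring a fragment of the Actor Diagram of Figure 1a.

```python
from dataclasses import dataclass
from enum import Enum

class DependumType(Enum):
    GOAL = "goal"          # drawn as an oval
    SOFTGOAL = "softgoal"  # drawn as a cloud
    TASK = "task"          # drawn as a hexagon
    RESOURCE = "resource"  # drawn as a rectangle

@dataclass(frozen=True)
class Actor:
    name: str              # an agent or a role; no graphical distinction here either

@dataclass(frozen=True)
class Dependency:
    depender: Actor        # the depending actor
    dependum: str          # what the depender needs
    kind: DependumType     # nature of the dependency
    dependee: Actor        # the actor depended upon

# A fragment of the matchmaker Actor Diagram (Figure 1a), encoded as data.
consumer, matchmaker, provider = Actor("Consumer"), Actor("Matchmaker"), Actor("Provider")
diagram = [
    Dependency(consumer, "locate good provider", DependumType.GOAL, matchmaker),
    Dependency(matchmaker, "advertise service", DependumType.GOAL, provider),
]
for d in diagram:
    print(f"{d.depender.name} --[{d.kind.value}: {d.dependum}]--> {d.dependee.name}")
```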


Service-oriented computing. Service Oriented Computing (SOC) [12] provides a general, unifying paradigm for diverse computing environments such as grids, peer-to-peer networks, and ubiquitous and pervasive computing. A service encapsulates a component made available on a network by a provider. The interaction between a client and a service normally follows a straightforward request/response style, possibly asynchronous; this is the case with Web Services, which adopt SOAP, XML, and RPC [6,12] as communication protocols. Two or more services can be aggregated to offer a single, more complex service or even a complete business process; this aggregation process is called service composition. As already noted, MAS naturally accommodate the SOC paradigm. Since each agent in a MAS may act either as an arbiter or as an intermediary for the service requested by the user, two common agent patterns that appear appropriate are the matchmaker and the broker (see, e.g., [10,16]).

Agent patterns for SOC. To accommodate the different settings and agents that can be involved, and the different roles that – from time to time – can be played by each agent, a pattern-based approach can be adopted for describing and designing MAS architectures for SOC systems. An agent pattern describes a problem commonly found in MAS design and prescribes a flexible solution for it, so as to ease the reuse of that solution [11,17,9]. The literature on Tropos adopts ideas from social patterns [5,11] to focus on social and intentional aspects that are recurrent in multi-agent or cooperative systems. Here, we adopt Actor and Goal Diagrams to characterize MAS design patterns, focusing on how the goals assigned to each agent (strictly speaking, following the Tropos terminology, we should speak about roles, but we drop this distinction here to ease the reading of the diagrams) are fulfilled [2,11], rather than on how agents communicate with each other. In the spirit of Tropos, which stresses the importance of analyzing each problem at a high level of abstraction so that the complexity of the system components can be reduced and managed at design time, we aim at enhancing the reuse of design experience and knowledge through the adoption of agent patterns. In our context, such patterns have to cope with the important issue of locating information/service providers, which is an architectural requirement. Indeed, as also investigated in [13], this requirement strongly affects coordination issues in decentralized (pure) peer-to-peer scenarios. Thus, to support the peer-to-peer scenario, the matchmaker agent pattern (see Figure 1a) plays a central role in providing the whole system with searching and matching capabilities (see, e.g., [16]). At the same time, the focus on the service-providing process life cycle puts the consumer at the center: when the consumer demands novel services, the system architecture should provide them without overwhelming her with additional interactions. Moreover, in a decentralized scenario several local failures may happen when trying to locate new services; hence, a large number of interactions may be needed before the right provider is reached. Reducing this interaction complexity decreases the customer overload. Such a requirement calls for a broker pattern as well, detailed in Figure 1b (see, e.g., [10]).


Fig. 1. a) Matchmaker agent pattern; b) Broker agent pattern, depicted by means of the Tropos Actor Diagrams.

The Tropos diagram of Figure 1a shows that each time a user's information/service request arrives (in the context of our simplified Peach example, described below, the Consumer is the role played by the software agent acting as the interface for the human user, that is, the User Assistant), Consumer depends on Matchmaker to locate good provider. Figure 1b, instead, shows that Consumer depends on Broker for forward requested service, that is, Broker plays an intermediary role between Provider and Consumer. In both cases, Broker and Matchmaker depend on Provider to advertise service(s). In other words, the skill of both patterns consists of mediating between consumers and providers to establish synergic collaborations that satisfy global goals. In particular, Matchmaker lets Consumer interact directly with Provider, while Broker handles all the interactions between Consumer and Provider.
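As an illustration only (not code from the Peach system; class and method names are our own), the following sketch contrasts the two interaction styles just described: a matchmaker returns a reference to a suitable provider, which the consumer then calls directly, while a broker forwards the request and relays the reply.

```python
class Provider:
    def __init__(self, name, services):
        self.name, self.services = name, set(services)

    def handle(self, request):
        return f"{self.name} serves '{request}'"

class Matchmaker:
    """Keeps a registry of advertised services and returns a matching provider."""
    def __init__(self):
        self.registry = []          # providers that advertised their services

    def advertise(self, provider):
        self.registry.append(provider)

    def locate(self, request):
        return next(p for p in self.registry if request in p.services)

class Broker(Matchmaker):
    """Like a matchmaker, but also handles the interaction on the consumer's behalf."""
    def forward(self, request):
        return self.locate(request).handle(request)

# Matchmaker style: the consumer talks to the provider itself.
mm, composer = Matchmaker(), Provider("PresentationComposer", {"presentation"})
mm.advertise(composer)
print(mm.locate("presentation").handle("presentation"))

# Broker style: the consumer only ever talks to the broker.
br = Broker(); br.advertise(composer)
print(br.forward("presentation"))
```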

3 A Reference Scenario

The Peach project [15] focuses on the development of a mobile museum visiting guide system. The whole system is a MAS, developed following the Tropos methodology. Indeed, agents perform their actions while situated in a particular environment that they can sense and affect. More specifically, in the typical Peach museum visiting scenario, a user (the visitor) is provided with several environmental interaction devices. The most evident to her is a personal hand-held mobile I/O device, namely a PDA. Other devices include: i) passive localization hot-spots, based on triangulation of signals coming from the PDA; ii) (pro)active stationary displays of different sizes and with different audio output quality. Depending on their dimensions, the displays may be used to deliver visual/audio information (images and/or motion pictures, possibly with audio comments; text) to a single user at a time or to a group of users. Given this environment, let us start from the following possible user–system interaction scenario:

Example 1 (explicit communication). A museum visitor requests some information during her tour by using her mobile device. To deliver on such a goal, the PDA contains an agent (the User Assistant) which, on behalf of the user, sends a presentation request to the museum's central system.


Fig. 2. Overview of the actor interactions.

Here, three system actors take the responsibility of generating a presentation: the Presentation Composer, the User Modeler, and the Information Mediator. Still using Tropos, we can proceed to detailed design and model the communication dimension of the system actors. To this end, Tropos adopts AUML [1] interaction diagrams. The communication diagram of Figure 2 presents the sequence of events from the time a request for a presentation is issued until the presentation is delivered to the user. The User Assistant, the Presentation Composer, and the User Modeler are generic roles that may be played by different software agents: there may be several different information mediators (for video, audio, text, pictures, animation, local and remote information sources, and more), there may be several user assistants with different capabilities (hand-held devices, desktop stations, wall-mounted large plasma screens, and more), and there may also be several different user modelers implementing various techniques to build users' profiles. In any case, here we are not interested in the specific agents implementing such functionalities (i.e., playing the assigned role), but in the roles themselves. In fact, they – i.e., the User Assistants, the Presentation Composer, and the Information Mediator – form ad hoc service-oriented organizations (SOOs) in order to achieve the service goal. Each SOO is characterized by members that collaborate at different levels and times during the service-providing process life cycle. After the goal is satisfied, the organization is dissolved and a new one will be formed – possibly including different agents, provided they play the listed roles – to serve a new request.

3.1 Discussion

The previous section motivates the need for agent patterns to effectively deal with distributed computing issues (see, e.g., [11,17,16,10]). Nevertheless, if we proceed by adopting traditional agent patterns, such as the matchmaker and broker introduced in Section 2, we would probably fail to capture a few interesting and vital architectural requirements that arise from our ambient intelligence scenario, especially if we want to fully exploit the flexibility – in terms of self-organizing presentation delivery channels – that can be provided.


In particular, to motivate our assertion, let us consider the following new scenario:

Example 2 (implicit communication). Let us assume that, while walking around, the user approaches some presentation devices that are more comfortable and suitable for handling the presentation than the mobile device (User Assistant), e.g., in terms of pixel resolution and audio quality. We may then assume that the User Assistant is autonomously capable of exploiting its intelligent behavior by negotiating the most convenient presentation on behalf of its human owner. Let us also assume that there are several different Presentation Composers for each single device (capable of generating video, text, animated explanations, audio, etc.) and that each Presentation Composer relies on different Information Mediators to provide the information required for presentation generation. Moreover, we may also assume that each Presentation Composer is able to proactively propose its best services (in terms of available or conveniently producible presentations) to the User Assistant, possibly through some mediation interface. Finally, we expect all the services (negotiated or proposed) to be "dynamically validated", that is, since the environment and the user location change quickly, only the currently appropriate services are considered.

Such a scenario calls for architectural flexibility in terms of dynamic group reconfiguration to support the involvement of SOOs. Traditional approaches allow for intentional relationships and request/response communication protocols among single agents only, and not among groups of agents [9–11,17]. More specifically, we may assume that the User Assistant starts an interaction session that triggers the involvement of a group of system actors, all with the ability of Presentation Composer, which in turn triggers the involvement of a group of system actors, all with the ability of Information Mediator. Each Presentation Composer, in turn, relies on the User Modeler to know the user profile and correctly build a user-tailored presentation. Therefore, such an architecture has to adopt group communication in order to support an 'intelligent' pervasive computing model among the users' assistant devices and the system actors acting as information/service providers. To cope with these new challenges, we can imagine that the system agents exploit a form of 'implicit communication', through which they autonomously build up SOOs in order to satisfy a request as well as they can at that time. This is not possible by means of traditional approaches that adopt simple request/response-based communication styles (e.g., [16]). In fact, as shown in Figure 1, classical matchmaker and broker approaches assume an advertise service dependency (e.g., based on a preliminary registration phase) that forces the system actors to rely on a centralized computing model.

4 Agent Patterns-Based Detailed Design

The discussion above highlights the limits of traditional patterns when applied to our ambient intelligence pervasive computing scenario; hence the necessity of characterizing our system architecture by means of new agent patterns.

4.1 The Service Oriented Organization

In distributed computing, and especially in 'intelligent' pervasive-computing-based scenarios, each time an information consumer explicitly or implicitly issues a specific service request, a searching capability is inherently needed in order to locate the service provider.


Fig. 3. Actor Diagram for the Service Oriented Organization pattern.

In particular, in our scenario, when the User Assistant is looking for a Presentation Composer in order to ask for a personalized presentation, a matchmaker (e.g., the one presented in Section 2) or facilitator architecture is required [10,11,13,16]. As previously discussed, however, the matchmaker pattern illustrated in Section 2 does not completely fit the requirements of our pervasive computing scenario (Example 2). Here, we define a new agent pattern – the Service Oriented Organization pattern – illustrated in Figure 3, which extends and adapts the matchmaker pattern of Figure 1a. The actor Matchmaker is replaced by Organization Matchmaker, which is further decomposed into two component system actors: Service Oriented Organization and Initiator. The dependencies between Consumer and Organization Matchmaker (or, more specifically, Initiator) and between Consumer and Provider(s) are as before. The main difference is that there is no longer an advertise service goal dependency between Organization Matchmaker and Provider(s). In fact, our scenario calls for dynamic group reconfiguration, which cannot be provided on the basis of a pre-declared and centrally recorded set of service capabilities, as foreseen in the classical matchmaker approach. The solution we propose, instead, is based on a proactive and, especially, dynamic capability of service proposal, driven by the actual, current requests or needs for services. In particular, our low-level communication infrastructure is based on group communication and has been designed to support channelled multicast [3], a form of group communication that allows messages addressed to a single agent or a group of agents (Provider(s)) to be received by everybody tuned on the channel – the agent "introspection" capability described in Section 5. Thus, Provider(s) now depend on Organization Matchmaker or, more specifically, on Providers Organizer to have a call for service. Since each SOO member adopts an IP channelled multicast approach that allows it to overhear on channels (see Section 5 for details), the organizer simply sends its service request message on a specific channel and waits for provider offers (channels are classified by topic, and each provider is free to overhear on its preferred channels according to its main interests and capabilities). On the basis of such calls, Provider(s) may notify their current service availability. Thus, the Providers Organizer depends on Provider(s) for propose service and, vice versa, Provider(s) depend on the Providers Organizer for the final agreement on service provision (goal agree service).


Moreover, in an 'intelligent' pervasive-computing-based scenario, system awareness makes it possible to proactively propose services to the consumer without any explicit service request. Thus, Initiator acts as the interface towards Consumer. It is able to interpret Consumer's requests and, especially, to proactively propose services that were not explicitly requested, on the basis of Consumer's profile and previous interaction history (for example, every system actor, through environmental sensors, can perceive and profile users during their visits across the museum media services, as in the scenario of Example 2). To this end, Initiator depends on Providers Organizer to get new acquaintances about Provider(s) and their services, while Providers Organizer depends on Initiator to formulate request. In this way, we can drop the dependency provide service description between Matchmaker and Consumer, which is instead present in the traditional matchmaker pattern. Finally, Initiator requires that Provider(s) timely notify service status in order to propose only active services.
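The exchange behind the SOO pattern can be summarized as a small message protocol: a call for service broadcast on a topic channel, service proposals from overhearing providers, and a final agreement. The sketch below is our own illustration of that flow (the performative names, message fields, and agent names are invented for the example; they are not taken from Peach or LoudVoice).

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    performative: str          # e.g. "call-for-service", "propose-service", "agree-service"
    sender: str
    content: dict = field(default_factory=dict)

class Channel:
    """A topic channel: every subscribed actor overhears every message."""
    def __init__(self, topic):
        self.topic, self.subscribers = topic, []

    def subscribe(self, actor):
        self.subscribers.append(actor)

    def post(self, msg):
        return [reply for actor in self.subscribers
                for reply in [actor.overhear(msg)] if reply is not None]

class ProviderAgent:
    def __init__(self, name, offered):
        self.name, self.offered = name, offered

    def overhear(self, msg):
        if msg.performative == "call-for-service" and msg.content["service"] in self.offered:
            return Message("propose-service", self.name, {"service": msg.content["service"]})
        return None

# Providers Organizer side: broadcast a call, collect proposals, agree with one provider.
presentations = Channel("presentations")
for p in (ProviderAgent("VideoComposer", {"video"}), ProviderAgent("AudioComposer", {"audio"})):
    presentations.subscribe(p)
proposals = presentations.post(Message("call-for-service", "ProvidersOrganizer", {"service": "video"}))
chosen = proposals[0]
print(Message("agree-service", "ProvidersOrganizer", {"with": chosen.sender}))
```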

4.2 The Implicit Organization Broker

As observed in Section 1, ambient intelligence environments are often characterized by intermittent communication channels. This problem is even more relevant when proactive broadcasting is adopted, as in the scenario suggested by Example 2. In this case, communications to/from the User Assistant need to be reduced to a minimum. To this end, we propose to exploit the implicit communication paradigm through the adoption of an Implicit Organization Broker (IOB) agent pattern, inspired by the implicit organizations introduced in [4]. That is, we define the IOB as a SOO formed by all the agents tuned on the same channel to play the same role (i.e., having the same communication API) and willing to coordinate their actions. The term 'implicit' highlights the fact that there is no group formation phase – since joining an organization is just a matter of tuning in on a channel – and no name for the organization – since the role and the channel uniquely identify it. Its members play the same role, but they may do so in different ways; redundancy (as in fault-tolerant and load-balanced systems) is just a particular case in which agents happen to be perfectly interchangeable. In particular, we can consider implicit organizations playing a kind of broker role. In other terms, each time the system perceives the visitor's information needs, the system actors set up a SOO (as described in Section 4.1) which, in addition to the matchmaking capabilities already presented, can also manage the whole service I/O process; that is, the SOO is able to autonomously and proactively cope with the whole service-providing process life cycle. Such a system ability enhances ambient intelligence awareness, a system requirement that cannot be captured by adopting traditional agent patterns [10,11]. Figure 4 introduces the IOB pattern as a refinement/adaptation of the SOO pattern introduced in Section 4.1. Provider(s) are now part of the organization itself, which plays the role of an Organization Broker. Thus, the latter includes both Providers Organizer and Provider(s) (see the inside of the dashed-line rectangle). It is worth noticing that the IOB members are characterized by the same (required) skill (see Section 5). The differences between the two traditional agent patterns of Figure 1 are naturally reflected also between the two patterns illustrated in Figures 3 and 4.


Fig. 4. Actor Diagram for the Implicit Organization Broker (IOB) pattern.

In particular, Figure 3 captures intentional aspects of more general group communication scenarios, i.e., generic SOOs, whereas Figure 4 provides a level of pattern-based detailed design focusing on a special kind of SOO, tailor-made for ambient intelligence scenarios. In other words, Figure 3 does not assume the strictly 'intelligent' pervasive computing scenario that, on the contrary, characterizes our IOB of Figure 4. It is also worth noticing that the IOB pattern merges into the Initiator role both the Consumer and Initiator roles of the SOO pattern. As already said, this is a consequence of the fact that, in ambient intelligence, some system actors concurrently play the consumer and initiator roles, which enhances the autonomy and proactivity of the system. Moreover, similarly to what happens in Figure 1b between the Consumer and the Broker, in Figure 4 the Initiator depends on the Organization Broker – or, more specifically, on the Providers Organizer – to forward requested service, in order to avoid overloading the User Assistant with messages and interactions. At the same time, the IOB pattern allows the Initiator to increase its acquaintances, enabling more precise service requests during future interactions, as already foreseen for the generic Service Oriented Organization pattern.

5 Supporting Implicit Organization Brokers The two agent patterns Service Oriented Organization and the Implicit Organization Broker presented so forth have been experimented within the Peach project to build an interactive, pervasive, museum guide. As mentioned, our patterns require a group communication infrastructure. To this end, we adopt the LoudVoice [4] experimental communication infrastructure based on channelled multicast and developed at our institute. Specifically, LoudVoice uses the fast but inherently unreliable IP multicast – which is not a major limitation in our domain, since the communication media in use are unreliable by their own nature. However, we had to deal with message losses and temporary network partitions by carefully crafting protocols and using time–based mechanisms to ensure consistency of mutual beliefs within organizations.
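For readers unfamiliar with IP multicast, the snippet below shows the basic mechanics of joining a multicast group and posting a datagram on it with Python's standard socket library. It illustrates only the transport idea underlying channelled multicast, not LoudVoice's actual API or message format; the group address and port are arbitrary values chosen for the example.

```python
import socket
import struct

GROUP, PORT = "239.1.2.3", 5007   # arbitrary multicast group and port for the example

# Listener: join the multicast group so that every message on the "channel" is overheard.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", PORT))
membership = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
listener.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

# Sender: any agent can post on the channel; delivery is unreliable (UDP), as noted above.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sender.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
sender.sendto(b"call-for-service: presentation", (GROUP, PORT))

data, addr = listener.recvfrom(1024)   # every member of the group receives the same datagram
print(addr, data)
```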


Fig. 5. Goal Diagram for an agent’s role characterization by means of its capabilities.

5.1 Agent Roles Characterization

Analyzing an agent role means identifying and characterizing its main capabilities (e.g., internal and external services) required to achieve the intentional dependencies already identified by the agent pattern analysis of Section 4. Note that a capability (or skill) is not necessarily justified by external requests (like a service); it can be an internal agent characteristic, required to enhance the agent's autonomous and proactive behavior. To deal with the rationale aspects of an agent at design time, that is, to look inside an agent and understand how it exploits its capabilities, we adopt the Goal Modeling Activity of Tropos [2]. In Figure 5, we apply the means-end and AND/OR decomposition reasoning techniques of the Goal Modeling Activity [2,5,18]. Means-end analysis aims at identifying goals, tasks, and resources that provide means for achieving a given goal. AND/OR decomposition analysis combines AND and OR decompositions of a root goal into subgoals, modeling a finer goal structure. Notice that we have modeled every agent capability as a goal to be achieved. For the sake of brevity, here we consider only the IOB pattern. According to Figures 5 and 4, each time Initiator formulates a request, Providers Organizer achieves its main goal cope with the request (i.e., the goal that Providers Organizer internally adopts to satisfy Initiator's request) by relying on its three principal skills: define providers, deal with fipa-acl performatives, and support organizational communication. Success of the principal goal depends on the satisfaction of all three subgoals (i.e., an AND decomposition). For the sake of simplicity, Figure 5 does not consider Initiator and its intentional relationships. An adequate organizational communication infrastructure is used to enhance the system actors' autonomous and proactive behavior by means of group communication based on channelled multicast [3] (see goal provide channelled multicast), which allows messages to be exchanged over open channels identified by topic of conversation. A proper structuring of conversations among agents thus allows every listener to capture its partners' intentions without any explicit request, thanks to its agent introspection skill (see goal allow agents introspection).
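The AND/OR goal decomposition described here can be captured with a tiny evaluator, shown below purely as an illustration (the leaf goal names are those of Figure 5; the data structure and evaluation logic are our own sketch, not part of Tropos).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Goal:
    name: str
    decomposition: str = "AND"          # "AND" or "OR" over the subgoals
    subgoals: List["Goal"] = field(default_factory=list)

    def satisfied(self, achieved_leaves):
        """A leaf goal is satisfied if achieved; otherwise combine subgoals per AND/OR."""
        if not self.subgoals:
            return self.name in achieved_leaves
        results = [g.satisfied(achieved_leaves) for g in self.subgoals]
        return all(results) if self.decomposition == "AND" else any(results)

# Root goal of Providers Organizer, AND-decomposed into its three principal skills.
cope = Goal("cope with the request", "AND", [
    Goal("define providers"),
    Goal("deal with fipa-acl performatives"),
    Goal("support organizational communication"),
])

print(cope.satisfied({"define providers", "deal with fipa-acl performatives"}))   # False
print(cope.satisfied({"define providers", "deal with fipa-acl performatives",
                      "support organizational communication"}))                    # True
```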


Fig. 6. Interaction of organizations.

Indeed, each actor is able to overhear every ongoing interaction on specific channels; hence, it can choose the best role to play in order to satisfy a goal, provide a resource, or perform a task, without any external decision control, but only according to its internal beliefs and desires. Exploiting the provide channelled multicast ability, each actor can decide by itself which channels to listen to, by means of a subscription phase (represented by the tasks discover channels and maintain a channel list). This communication mechanism supports group communication well for service-oriented and implicit organizations composed of members with the same interests or skills. Such organizations assist the User Assistant, sparing it from having to know how to interact directly with the museum multimedia services.

5.2 Group Communication: Dynamics

As described earlier, the museum visitor guide system is composed of several different types of agents. Most components are formed not by individual agents but by groups of coordinated agents, as presented in Example 2 of Section 3. Modeling this example requires the representation of implicit organizations, which cannot be done with a regular AUML communication diagram, as presented, e.g., in [2,5]. Therefore, in Figure 6 we propose a new type of diagram that deals with the group communication features required by the scenario introduced with Example 2. Here, the shaded rectangles and the dashed lines below them represent the implicit organizations, and the gray rectangles represent the communication internal to implicit organizations. Requests sent to an organization are represented as arrows terminating in a dot at the border of the organization, and the organization's reply is represented by an arrow starting from a dot on the organization border.


We assume an asynchronous, message-based communication model. In the example diagram of Figure 6, the request for a presentation is initiated by a certain User Assistant on behalf of a specific user. The request is addressed to the implicit organization of Presentation Composers. Presentation composers have different capabilities and require different resources; hence, every presentation composer requests user information on the user model and presentation data (availability, constraints, etc.) from the implicit organization of Information Mediators. In turn, the implicit organization of information mediators holds an internal conversation: each member suggests the service it can provide, and the "best" service is selected and returned, as a group decision, to the requesting presentation composer. At this stage, the presentation composers request additional information from the implicit organization of User Assistants, regarding the availability of assistants capable of showing the presentation being planned. When all the information has been received, the implicit organization of presentation composers can reason and decide on the best presentation to prepare. This is then sent by the composers, as a group response, to the selected (user) assistant.
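A minimal sketch of this "internal conversation followed by a group decision" step is given below; it is our own illustration (the scoring scheme, member names, and request strings are invented), not the decision logic actually used in Peach.

```python
# Each member of an implicit organization proposes the service it can provide,
# together with a self-assessed score; the group answer is the best proposal.

def group_decision(members, request):
    proposals = [m(request) for m in members]                  # internal conversation
    proposals = [p for p in proposals if p is not None]
    return max(proposals, key=lambda p: p["score"]) if proposals else None

# Hypothetical information mediators with different strengths.
def video_mediator(request):
    if request == "presentation data":
        return {"member": "VideoMediator", "service": "video clip", "score": 0.9}
    return None

def text_mediator(request):
    if request == "presentation data":
        return {"member": "TextMediator", "service": "text panel", "score": 0.6}
    return None

best = group_decision([video_mediator, text_mediator], "presentation data")
print(best)   # the group returns the video clip proposal as its decision
```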

6 Conclusions

Ambient intelligence scenarios characterized by service-oriented organizations, in which groups of agents collaborate at different levels and times during the service-providing life cycle, generate new software architectural requirements that traditional agent patterns cannot satisfy. For example, 'intelligent' pervasive computing and peer-to-peer computing models naturally support group communication for ambient intelligence, but they also call for architectural flexibility in terms of dynamic group reconfiguration. Traditional request/response communication protocols are not appropriate for coping with service negotiation and aggregation that must be 'dynamically validated', since environmental conditions and the user location change quickly. For these reasons, we have proposed two new agent patterns (Service Oriented Organization and Implicit Organization Broker) and compared them with traditional patterns [10,11]. Specifically, we adopt the agent-oriented software development methodology Tropos [2,5] to effectively capture the new requirements. For example, using Tropos, we can keep the agent conversation and social levels independent of complex coordination activities, thanks to an inherently pure peer-to-peer computing model [13]. This way of modeling has been conceived to enrich the detailed design phase of the Tropos methodology with new capabilities, more oriented towards sophisticated software agents, which require advanced modeling mechanisms to better capture group communication, goals, and negotiations. Thus, we have been able to capture important aspects of ambient intelligence requirements and to build new agent patterns that are more flexible and reusable than the traditional ones.

References

1. B. Bauer, J. P. Muller, and J. Odell. Agent UML: A formalism for specifying multiagent software systems. International Journal of Software Engineering and Knowledge Engineering, 11(3):1–24, 2001.


2. P. Bresciani, P. Giorgini, F. Giunchiglia, J. Mylopoulos, and A. Perini. TROPOS: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems (JAAMAS), 8(3):203–236, May 2004.
3. P. Busetta, A. Donà, and M. Nori. Channeled multicast for group communications. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1280–1287. ACM Press, 2002.
4. P. Busetta, M. Merzi, S. Rossi, and F. Legras. Intra-role coordination using group communication: A preliminary report. In F. Dignum, editor, Advances in Agent Communication, LNAI. Springer-Verlag (to appear), 2003.
5. J. Castro, M. Kolp, and J. Mylopoulos. Towards requirements-driven information systems engineering: The Tropos project. Information Systems, 27:365–389, 2002.
6. F. Curbera, R. Khalaf, N. Mukhi, S. Tai, and S. Weerawarana. The next step in web services. Commun. ACM, 46(10):29–34, 2003.
7. K. Ducatel, M. Bogdanowicz, F. Scapolo, J. Leijten, and J.-C. Burgelman. Scenarios for ambient intelligence in 2010. Technical report, Information Society Technologies Programme of the European Union Commission (IST), Feb. 2001. http://www.cordis.lu/ist/.
8. U. J. Franke. Managing Virtual Web Organizations in the 21st Century: Issues and Challenges. Idea Group Publishing, Pennsylvania, 2001.
9. S. Hayden, C. Carrick, and Q. Yang. Architectural design patterns for multiagent coordination. In Proc. of the 3rd Int. Conf. on Agent Systems (Agents'99), 1999.
10. M. Klusch and K. Sycara. Brokering and matchmaking for coordination of agent societies: A survey. In A. Omicini, F. Zambonelli, M. Klusch, and R. Tolksdorf, editors, Coordination of Internet Agents: Models, Technologies, and Applications, pages 197–224. Springer-Verlag, Mar. 2001.
11. M. Kolp, P. Giorgini, and J. Mylopoulos. A goal-based organizational perspective on multi-agent architectures. In Proceedings of the Eighth International Workshop on Agent Theories, Architectures, and Languages (ATAL-2001), 2001.
12. M. P. Papazoglou and D. Georgakopoulos. Introduction to the special section on Service Oriented Computing. Commun. ACM, 46(10):24–28, 2003.
13. L. Penserini, L. Liu, J. Mylopoulos, M. Panti, and L. Spalazzi. Cooperation strategies for agent-based P2P systems. Web Intelligence and Agent Systems: An International Journal, 1(1):3–21, IOS Press, 2003.
14. L. Penserini, L. Spalazzi, and M. Panti. A P2P-based infrastructure for virtual-enterprise's supply-chain management. In Proc. of the Sixth Int. Conference on Enterprise Information Systems (ICEIS-04), vol. 4, pages 316–321. INSTICC, 2004.
15. O. Stock and M. Zancanaro. Intelligent interactive information presentation for cultural tourism. In Proc. of the International CLASS Workshop on Natural Intelligent and Effective Interaction in Multimodal Dialogue Systems, Copenhagen, Denmark, 28–29 June 2002.
16. K. Sycara, S. Widoff, M. Klusch, and J. Lu. LARKS: Dynamic matchmaking among heterogeneous software agents in cyberspace. Autonomous Agents and Multi-Agent Systems, 5(2):173–203, 2002.
17. Y. Tahara, A. Ohsuga, and S. Honiden. Agent system development method based on agent patterns. In Proc. of the 21st Int. Conf. on Software Engineering (ICSE'99). IEEE Computer Society Press, 1999.
18. E. Yu. Modeling Strategic Relationships for Process Reengineering. PhD thesis, Department of Computer Science, University of Toronto, Toronto, Canada, 1995.

Modeling the Semantics of 3D Protein Structures

Sudha Ram and Wei Wei

Department of Management Information Systems, Eller College of Management, The University of Arizona, Tucson, AZ 85721, USA
{ram,wwei}@eller.arizona.edu

Abstract. The post Human Genome Project era calls for reliable, integrated, flexible, and convenient data management techniques to facilitate research activities. Querying biological data that is large in volume and complex in structure, such as 3D protein structures, requires expressive models that explicitly support and capture the semantics of the complex data. Protein 3D structure search and comparison not only enable us to predict unknown structures, but can also reveal distant evolutionary relationships that are otherwise undetectable, and perhaps suggest unsuspected functional properties. In this work, we model 3D protein structures by adding spatial semantics and constructs to represent contributing forces, such as hydrogen bonds, and high-level structures, such as protein secondary structures. This paper contributes to modeling the special characteristics of life science data and develops methods to meet the novel challenges posed by such data.

1 Introduction

The Human Genome Project and its concomitant research have provided the scientific community with data that is increasing in volume and complexity. It has generated a precious information pool that can be used to support the best interests of human beings. To exploit this information, we need new ways to manage, integrate, and present the data so that complex questions can be answered effectively. To do so, we need expressive data models that can capture the semantics of the wide variety of biological data. Proteins are large biological molecules with complex structure, and they constitute much of the bulk of living organisms. In order to understand the life processes of an organism, it is necessary to first know the functions of its proteins. Since the function of a protein in a given environment is determined by its structure, we need to know the structure of the molecule to fully understand its function. The success of the Human Genome Project generated multiple protein databases, including protein sequence databases and protein 3D structure databases [8]. Each 3D structure stored in these databases is determined either by experimental methods such as X-ray crystallography and Nuclear Magnetic Resonance or by computational chemistry [22]. Researchers need to search these databases for specific structures or compare structures with each other to find similarities. Similar sequences can result in similar 3D structures, and similar structures perform similar functions; therefore, protein structure similarities may be predicted based on sequence similarities. More importantly, searching for similar protein structures can help us find homologs that sequence searches cannot discover, and homologs often conserve structure more strongly than sequence.


We can also explore protein evolution, because similar protein folds can be used to support different functions [13]. Meanwhile, if we can identify conserved core elements of protein folding, this information can be used to model related proteins of unknown structure. Being able to determine, search, and compare 3D protein structures, rather than just comparing sequences, is thus becoming very important in the life sciences. However, structure comparisons are currently a big challenge. We address this issue in our research by developing a semantic model, which we believe can facilitate the development of techniques and operators for 3D structure search and comparison. In this paper, we extend our previous work on semantic modeling of DNA sequences and primary protein sequences to three-dimensional (3D) protein structures, whose semantics are novel, using extended Entity-Relationship modeling [10]. When studying proteins, scientists investigate not only the amino acid subunits that form a protein and their order, but also how the sequences fold into 3D structures in certain ways due to chemical forces. Currently, protein structure data is stored in plain text files that record the three-dimensional coordinates of each non-hydrogen atom as well as a small part of the substructures. Data in this text file format does not capture biological meaning, so comparison and search over structure data have to be done using visualization tools and additional software tools implementing various algorithms. We define the semantics of the primary, secondary, tertiary, and quaternary structures of proteins by describing their components, chemical bonding forces, and spatial arrangement. To model the 3D structure of a protein and its formation, we need to explicitly represent the spatial arrangement of each component in addition to its sequential order, along with its associated biological information. Our semantic model captures the semantics of protein data and specifies it using an annotation-based approach that captures all of this semantic information in a straightforward way. The rest of the paper is organized as follows. Section 2 provides a brief background on proteins and their various levels of structure, and describes the semantics of such structures. In Section 3 we describe entity classes and new constructs to represent the semantics of protein structures and develop annotations to capture their spatial arrangement and biological characteristics; we also briefly review related research and justify why it is necessary to develop new semantic constructs to model protein structures. In Section 4 we describe the utility of our semantic model, demonstrate its application, and point out the extensibility of the model to other similar fields. We conclude with a discussion of future research directions in Section 5.

2 Background

2.1 Protein Structures

Proteins are the most important macromolecules in the factory of living cells, performing various biological tasks. Basically, a protein is composed of varying numbers of 20 kinds of amino acids, also known as subunits or residues (see Figure 1). These residues are arranged in a specific order or sequence; each amino acid is denoted by a letter of the English alphabet [6]. Multiple amino acids bond together through a condensation reaction to form amino bonds (see Figure 2), which connect subunits into a protein sequence.


A protein sequence can range from 10 to 1000 amino acids (residues). Actual proteins, however, are not linear sequences; rather, they are 3D structures. Knowing this 3D structure is key to understanding a protein's functions and to using it to improve human lives.

Fig. 1. General Structure of Amino Acids. A different side chain R determines the type of amino acid. Each amino acid contains an amino group and a carboxylate group. The hydrogen atom on the amino group reacts with the hydroxyl on the carboxylate group through condensation to form amino bonds.

Fig. 2. Structure of Amino Bonds. Two amino acid residues (subunits) are shown.

The general principle that protein sequences follow to fold into 3D structures is that "The three-dimensional structure of a native protein in its normal physiological milieu is the one in which the Gibbs free energy of the whole system is lowest; that is, the native conformation is determined by the totality of inter-atomic interactions and hence by the amino acid sequence, in a given environment" [11]. The following are descriptions of the four levels of protein structure [6].

Primary Structure. The primary structure of a protein refers to the exact sequence of amino acids in the protein. Hence the primary structure is linear in nature; it says nothing about the spatial arrangement of the amino acids or their atoms. It merely shows the specific amino acids used to compose the protein and their linear order.

Fig. 3. Primary Protein Structure

Secondary Structure. The covalently linked amino acids are further organized into regularly repeating patterns, such as helices and sheets, along with other less common structures [16]. A hydrogen bond is the cause of the secondary structure. More specifically, it is the spatial interaction between a hydrogen atom in an N-H group and a nearby, highly electronegative carbonyl oxygen, as shown in Figure 4. Each atom participating in the formation of a hydrogen bond belongs to a different residue, and the distance between these residues determines the possible category of the secondary structure. The secondary structure is the base on which tertiary and quaternary structures are formed.

Fig. 4. Formation of a Hydrogen Bond


3D structure search and comparison start from the secondary structure level [19]. One protein sequence, or chain, can contain multiple different secondary structures, each formed on a segment of the primary sequence. Several adjacent secondary structures form a motif, and motifs can group to form domains.

Tertiary Structure. Several motifs typically combine to form a compact globular structure referred to as a domain. The tertiary structure describes the way motifs are arranged into domain structures and the way a single polypeptide chain folds into one or several domains. The side chain interactions contributing to the formation of domains or tertiary structures include: (a) hydrogen bonds between polar side chain groups; (b) hydrophobic interactions among non-polar groups; (c) salt bridges between acidic and basic side chains; and (d) disulfide bridges. Here, each "chain" is a protein sequence recorded in protein sequence databases.

Quaternary Structure. For proteins with more than one chain, interactions can occur between the chains themselves. The forces contributing to the quaternary structure are of the same kinds as those for tertiary structures, but they act between chains, not within chains.

To summarize, various inter- and intra-molecular forces work together to determine the lowest-energy, most stable structure of a protein. The structure determines how various biological tasks are performed. It is therefore important to represent the semantics of these structures so they can be queried easily.

2.2 Current Protein Structure Databases and Usage

The Protein Data Bank, PDB (http://www.rcsb.org/pdb/), is the only worldwide archive of experimentally determined (using X-ray Crystallography and Nuclear Magnetic Resonance techniques) three-dimensional structures of proteins [2]. It is operated by the Research Collaboratory for Structural Bioinformatics (RCSB). At the time of writing, 26,485 structures are stored in the PDB, with the number and complexity of the structures increasing rapidly. The data stored in the PDB consists of a HEADER followed by the data, in two major categories. The first category includes the identifier of each protein, the experiment that determined its structure, the authors, keywords, references, etc. The other, more important, category is the x, y, and z coordinates of each non-hydrogen atom (heavy atom) in the structure. This format records the protein sequence and the composition of its secondary structure. It does not record the tertiary and quaternary structures, which, as stated earlier, are very important parts of the protein structure. The core of the protein data lies in its coordinate data, i.e., its spatial arrangement. However, spatial coordinates by themselves depict nothing more than the shape and provide no biological value unless they can be related to other information, such as the shape, the strength (or energy), and the length of the chemical bonds between the various subunits of a protein. This is very important in structural genomics [12] and is used to connect spatial data with whole-genome data and to relate it to various biological activities. Researchers use experimental methods or computational calculation to determine or predict the structure of proteins, and submit their data to the PDB [5, 23].


The PDB then processes the data based on the mmCIF (Macromolecular Crystallographic Information File) standard dictionary, an ontology of 1700 terms that defines the macromolecular structure and the crystallographic experiment [4], and stores the data in the core database. Structure data stored in flat files poses challenges to effective and efficient data retrieval and analysis. Because data in flat file formats is not machine-processable, specialized structure search and comparison software tools are required to access and analyze the data. Each of the available tools has its own interface supporting slightly different invocation, processing, and data semantics. This makes using the tools difficult, because the input must be guaranteed to be in the correct format with the correct semantics, and the tools must be invoked in specific, non-standard ways [7]. Primary protein structure (protein sequence) search and comparison tools use the BLAST (Basic Local Alignment Search Tool) algorithm [1], while 3D structure search and comparison tools use broadly similar algorithms [15, 20] based on VAST (Vector Alignment Search Tool) [18]. Meanwhile, the coexistence of multiple search tools makes one-stop search impossible. Inefficient structure search and comparison has become the bottleneck for high-throughput experiments. Our research focuses on how to represent 3D structure semantics in a conceptual model, which can be used for seamless interoperation of multiple resources [21]. More importantly, operators [9] can be developed based on the semantic model to facilitate querying and analysis of structure data. Additional software tools would then become unnecessary, and the bottlenecks could be eliminated. In this paper, we propose new constructs to explicitly represent the semantics of protein structures. We believe our model will help scientists understand and detect the link between protein structures and their biological functions. Our proposed model provides formally defined constructs to represent the semantics of protein structure data; the spatial elements are represented using annotations. Our semantic model will also aid the development of tools to process ad hoc queries about protein structures, and it can be used as a canonical model to virtually unify protein structure data [3, 23].
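As a concrete illustration of the flat-file format discussed above, the snippet below extracts the coordinate fields of PDB ATOM records. It is a simplified sketch written by us, assuming the standard fixed-column layout of ATOM lines and ignoring the many other record types; it is not a tool associated with this paper.

```python
def parse_atom_records(pdb_lines):
    """Extract serial, atom name, residue name/number, and x, y, z from ATOM lines."""
    atoms = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            atoms.append({
                "serial": int(line[6:11]),       # fixed PDB columns 7-11
                "name": line[12:16].strip(),     # columns 13-16
                "res_name": line[17:20].strip(), # columns 18-20
                "res_seq": int(line[22:26]),     # columns 23-26
                "x": float(line[30:38]),         # columns 31-38
                "y": float(line[38:46]),         # columns 39-46
                "z": float(line[46:54]),         # columns 47-54
            })
    return atoms

sample = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
]
for atom in parse_atom_records(sample):
    print(atom["serial"], atom["name"], atom["res_name"], (atom["x"], atom["y"], atom["z"]))
```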

3 Proposed Semantic Model

3.1 Entity Classes

Atoms. This entity class is used to model the chemical atoms (C, H, O, N, and heteroatoms such as S) in the protein structure, each of them identified uniquely by an ID. Each atom in the structure can be represented by a 3-tuple A <as, an, ty>. The element as represents the atom's serial number, with as ∈ AS, where AS is the collection of all atom serial numbers. The element an is the atom name, with an ∈ AN, where AN = {C, H, O, N, S}. The last element ty is the atom type, with ty ∈ TY, where each element of TY is a representative category of atoms in an amino acid: the α-carbon, the nitrogen atom N, the carboxylate carbon, the carboxylate oxygen, and the side-chain, amino, and α hydrogens; S is the sulfur atom, which contributes to structure formation such as disulfide bridges.


Residues. Each residue (amino acid subunit) is an aggregate of a set of component atoms and a set of intra-residue bonds. Each residue can be represented as a 4-tuple consisting of a residue serial number rs, a residue name rn, the set of its component atoms, and the set of its internal bonds. The serial number satisfies 1 ≤ rs ≤ SL, where SL is the length of the protein sequence, i.e., the total number of residues. The residue name satisfies rn ∈ AA, where AA is the set of all types of amino acid residues, AA = {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}. The third element is the set of atoms in the residue, identified by their atom serial numbers, and the fourth is the set of internal bonds in the residue, where each bond links an atom i to an atom j. BL and BE are used to record the bond length and bond energy. This structural data is determined by experimental or computational methods.

Primary Structure. The primary structure of a protein is the sequence of amino acids comprising the protein. It describes which amino acid subunits are used, and in which order, to form a specific protein sequence. It can be represented as a 3-tuple PS <psid, pn, psl>, where psid is the unique identifier of each protein sequence record in the database, pn is the biological name of the protein, and psl is the length of the protein sequence. Here psid ∈ PSID, where PSID is the collection of all available identifiers for protein sequences, and pn ∈ PN, where PN is the collection of the names of all proteins.

Segment. A complete protein sequence can be fragmented into several segments, and each segment folds to form certain higher-level structures. A segment is defined as seg <segid>, with seg ∈ SEG, where SEG is the set of all segments fragmented from a single protein sequence. More information about each segment is represented by the Fragment relationship between Primary Structure and Segment.

Forces. This entity class represents the four chemical forces that contribute to the formation of secondary structures, namely hydrogen bonds, disulfide bridges, salt bridges, and hydrophobic interactions. FORCES is a superclass with four subclasses, each representing one of the four types of chemical forces. The hydrogen bond is the focus of our study, but by capturing the other three types of forces in our model we allow for future expansion.

Hydrogenbonds. A hydrogen bond is modeled as an entity class because it is the main cause of the secondary structures according to which protein sequences fold into specific spatial arrangements. HYDROGENBONDS as a superclass has two subclasses – BACKBONE and SIDECHAIN. BACKBONE represents hydrogen bonds formed by backbone hydrogen atoms, and SIDECHAIN represents hydrogen bonds formed by non-backbone hydrogen atoms. We focus on backbone hydrogen bonds in this paper; side-chain hydrogen bonds are included in the model for future investigation.


Each hydrogen bond in the structure can be depicted as a 3-tuple whose first element records the hydrogen bond formed between a hydrogen atom from an amino group and an oxygen atom from a carboxylate group, where i is the serial number of the residue that donates the hydrogen atom and j is the serial number of the residue that donates the oxygen atom, with |i − j| ≥ 3, because the two residues have to be at least 3 units away from each other to form a hydrogen bond. Different distances between the amino acids give rise to different forms of secondary structure [16]. For each protein structure, the hydrogen bonds formed within the structure constitute a set HB.
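The entity classes just defined map naturally onto simple record types. The sketch below is our own illustrative encoding (the class and field names are ours, and tuple elements whose original symbols are not reproduced here are given descriptive names); it is not part of the proposed model's formal notation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

AA = set("ARNDCEQGHILKMFPSTWYV")      # the 20 amino acid residue types

@dataclass
class Atom:
    serial: int                        # as: atom serial number
    name: str                          # an: one of C, H, O, N, S
    category: str                      # ty: representative category within the amino acid

@dataclass
class Residue:
    serial: int                        # rs: 1 <= rs <= sequence length SL
    name: str                          # rn: one-letter code, must be in AA
    atoms: Set[int] = field(default_factory=set)                           # atom serial numbers
    bonds: Dict[Tuple[int, int], Tuple[float, float]] = field(default_factory=dict)
    # bonds maps (atom i, atom j) -> (bond length BL, bond energy BE)

@dataclass
class HydrogenBond:
    donor_residue: int                 # i: residue donating the amino hydrogen
    acceptor_residue: int              # j: residue donating the carboxylate oxygen

    def __post_init__(self):
        # residues must be at least 3 units apart to form a hydrogen bond
        assert abs(self.donor_residue - self.acceptor_residue) >= 3

r = Residue(serial=1, name="A")
assert r.name in AA
print(HydrogenBond(donor_residue=10, acceptor_residue=14))   # consistent with a 4-turn
```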

Fig. 5. Semantic model of protein 3D structure

Secondary Structure. A secondary structure is formed by hydrogen bonding: multiple hydrogen bonds make a segment of the protein sequence fold in specific ways. Therefore, each secondary structure corresponds to a set of hydrogen bonds.


The type of hydrogen bonds in the set determines the type of the secondary structure, i.e., helix or sheet. The contributing hydrogen bond set is defined over the pairs (i, j) of residues joined by hydrogen bonds; if a hydrogen bond connects residue i to residue j = i + n, the resulting structure is called an n-turn. Multiple adjacent turns form a helix, and overlapping minimal helices can form longer helices. Other types of structures can also be defined using i and j. The geometry of a secondary structure can be represented as a vector, which is the basis for VAST. Each vector is represented as a 4-tuple <VL, VS, VE, VMP>: VL is the vector length in angstroms, VS(x,y,z) are the internal coordinates of the starting point of the vector, VE(x,y,z) are the internal coordinates of the ending point of the vector, and VMP(x,y,z) are the internal coordinates of the middle point of the vector. These values of a secondary structure element (SSE) need to be defined because structure comparison is based on vector geometry. Within each protein structure, all SSEs are grouped together into a set SSE.

Fig. 6. Secondary Structure Entity Class

Fig. 7. Spatial-Aggregate Relationship

Motif. Secondary structure elements usually arrange themselves into simple motifs. Motifs are formed by packing side chains from adjacent secondary structures, such as helices and sheets, that are close to each other. Therefore, a motif is a cluster of SSEs, and we define it as a set of SSEs from the structure. The relative spatial relationship between pairs of SSEs can be defined by six geometrical variables (see Figure 12).

Domain and Quaternary Structure. This entity class represents a compact globular structure composed of several motifs and is accordingly defined as a set of motifs. When the protein of interest has more than one polypeptide chain, the interactions among the chains fold them further to form quaternary structures.

3.2 Relationships

Spatial-Aggregate. This construct (see Figure 7) is defined to capture the spatial arrangement of each atom in an amino acid to form the protein's 3D structure. It is similar to the normal aggregate in existing spatial semantic models, except that in this case we want to represent the x, y, and z coordinates of each atom. Each atom can be depicted as a point. The annotation //P(deg)/P(deg)/P(deg) indicates that the position of this point is measured using x, y, and z coordinates in degrees, and //T(c) is another dimension denoting the temperature (Celsius) at the time the structure was determined, because temperature changes affect the activities of atoms, and relative positions change depending on the temperature.


The relationship can be represented more concretely as a tuple spa, with spa ∈ SPA, where SPA is the complete set of spatial aggregate relationships within each structure. The constraint states which atom composes which residue at what position, where x, y, and z are the coordinates of each atom.

Sequential Aggregate. This is a concept borrowed from our previous work on modeling DNA sequences and primary protein sequences. This construct represents the fact that multiple amino acids/residues are bonded together in a specific linear order to form a sequence. A tuple sea, with sea ∈ SEA, represents which residue is at which position of which protein sequence; SL represents the sequence length, and X is an integer representing the position of this residue in the sequence, where the position number has to be less than or equal to the length of the sequence (X ≤ SL).

Fig. 8. Sequential-Aggregate Relationship

Fig. 9. Fragment Construct

Fragment. This relationship can be considered the exact opposite of the aggregation relationship. The "O" in this relationship represents the fact that segments can overlap, or that one segment can contain another. The formal definition of this relationship is frag <psid, segid, sp, ep>, with frag ∈ FRAG. It states which segment is fragmented from which protein sequence, starting and ending at which points. The length of the segment can easily be derived by subtracting the starting point from the ending point. A complete protein sequence contains several segments at different levels, where each segment contributes to a higher-level structure. For example, a segment of size 4 can form a 4-turn, several adjacent 4-turns can group together to form a helix, and helices can further group together to form motifs.

Spatial-Bonding. This relationship is used to describe how atoms in the structure form the forces that contribute to the formation of secondary structure. A1 and A2 are the two atoms participating in the force. For hydrogen bonds, it specifies which two residues contribute to the bond. This is specified as //HB(Hi, Oj)//BL(Å)//BE(kcal/mol), where BL records the bond length and BE the bond energy of the chemical force. From the values of i and j, the type of fold (e.g., helix or sheet) can be determined.
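As a small illustration of the Fragment relationship (our own sketch; the names are ours), the helper below derives a segment's length and checks whether two segments of the same sequence overlap, exactly as described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    psid: str      # protein sequence identifier
    segid: int     # segment identifier
    sp: int        # starting position within the sequence
    ep: int        # ending position within the sequence

    def length(self):
        return self.ep - self.sp            # length derived from start and end points

    def overlaps(self, other):
        # segments can overlap or contain one another, but only within the same sequence
        return self.psid == other.psid and self.sp <= other.ep and other.sp <= self.ep

turn = Fragment("1SHA", segid=1, sp=10, ep=14)       # a segment of size 4 (a 4-turn)
helix = Fragment("1SHA", segid=2, sp=10, ep=25)      # a longer segment containing it
print(turn.length(), helix.length(), turn.overlaps(helix))   # 4 15 True
```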

Fig. 10. Spatial-Bonding Relationship


Fig. 11. Spatial Composition Relationship

Spatial Composition. This is a complex relationship that captures all the geometrical data required to compare 3D structures based on the arrangement of secondary structures. As mentioned earlier, each secondary structure can be geometrically represented by a vector. Consider the two vectors Vm and Vn in Figure 12 as an example. Each of them has its own length. Since the mid-point of each vector is recorded, the distance VD(m, n) between the two vectors can be measured as the distance between their mid-points. Together with the two bond angles and the dihedral angle DA(m, n) between the two vectors, these variables strictly define how two vectors, and hence two secondary structure elements, are arranged spatially. If a pair of SSEs (SSEP) in another protein structure displays the same geometrical arrangement, structural similarity may be inferred. With further analysis, the SSE pairs can be enlarged to accommodate more secondary structure elements and infer higher-level structure similarities. We can think of SSEPs as SSE groups of the finest granularity.

Fig. 12. Internal coordinates that represent the spatial arrangement of a pair of vectors
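To make the geometrical variables concrete, the sketch below computes a vector's length, the mid-point distance VD(m, n), the angle each vector makes with the line joining the mid-points, and the dihedral angle DA(m, n) from raw SSE start/end coordinates. It is our own illustration of the quantities described above; the helper names and the exact angle conventions are assumptions, not the paper's or VAST's definitions.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a):   return math.sqrt(dot(a, a))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def midpoint(s, e): return tuple((x + y) / 2 for x, y in zip(s, e))
def angle(a, b):   return math.degrees(math.acos(dot(a, b) / (norm(a) * norm(b))))

def sse_geometry(vs_m, ve_m, vs_n, ve_n):
    """Geometrical variables for a pair of SSE vectors given start/end coordinates."""
    vm, vn = sub(ve_m, vs_m), sub(ve_n, vs_n)
    mid_m, mid_n = midpoint(vs_m, ve_m), midpoint(vs_n, ve_n)
    joining = sub(mid_n, mid_m)                         # line connecting the two mid-points
    vd = norm(joining)                                  # VD(m, n): mid-point distance
    ba1, ba2 = angle(vm, joining), angle(vn, joining)   # the two bond angles (our convention)
    da = angle(cross(vm, joining), cross(vn, joining))  # DA(m, n): dihedral about the joining line
    return {"VL_m": norm(vm), "VL_n": norm(vn), "VD": vd, "BA1": ba1, "BA2": ba2, "DA": da}

# Two hypothetical helix axes, each given by start and end coordinates in angstroms.
print(sse_geometry((0, 0, 0), (10, 0, 0), (0, 5, 2), (10, 6, 8)))
```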

4 Contributions and Utility of Our Semantic Model

We believe our model makes significant contributions because it brings out the hidden meaning behind the spatial arrangement of protein structures. This information is even more powerful when it is associated with the structure and formation of the bonding forces. Our model can facilitate easier data retrieval and analysis by the scientific community in the following ways.

4.1 Database Integration

Being able to use structure data effectively and efficiently is at the core of structural genomics [22], especially when there are a number of resources using different data models. Ontologies are being developed in the field of bioinformatics to support data integration [14]. Most of this work is aimed at describing DNA sequences, with a particular focus on their function and how they react with other biological entities. Not much research has been directed at representing the structural complexity of biological data. Our model can serve as a canonical model for unifying different sources of protein structure data with all the semantics captured. This would save a lot of time and human resources in data curation and enable one-stop shopping.


4.2 Revolutionizing Structure Search and Comparison

Current protein structure data is stored in flat files that cannot be easily queried or examined. Our model can represent the semantics of the data and facilitate the development of user-friendly tools to browse, search, and extract protein structures. For instance, a model such as ours can be used to ask queries of the following type: "For a specific protein with ID 1SHA, find all of its motifs, and retrieve the structural composition of each motif." Protein structure prediction can be improved, perhaps even sped up, and also made easier by deploying our model. Instead of building various software tools and developing algorithms to compare and search structures on top of flat file data, we can explore the possibility of extending an object-relational DBMS with specialized operators targeted at structure data. This approach is currently being explored in the latest Oracle 10g database management system, which embeds data mining functionality for classification, prediction and association mining. One of its new features is that it incorporates the BLAST algorithm, supporting sequence matching and alignment searches. But there is an obvious difference between the Oracle approach and the work presented in this paper. In Oracle 10g, the sequence data is still recorded as a text field; the BLAST operator is merely an interface on top of flat files that does not capture the semantics of the sequence data. As an application of our proposed model, we can implement VAST or similar algorithms in the DBMS based on our semantic model. At the same time, new structure search and comparison operators can also be developed to extend SQL (Structured Query Language). For example, using our semantic model we capture the spatial composition of each secondary structure element in a protein. We can now define operators to compare different SSEs based on their vector variables. Such a comparison operator can include thresholds for determining similarity: if the vector length, vector angle, vector distance and vector dihedral angle comparison results are all within the corresponding thresholds, then the SSEPs can be determined to be similar. Once such an operator is defined and implemented, we can extend it further to easily compare two 3D structures, find out which structures two proteins have in common, and thereby determine the overall similarity between the proteins. This will allow scientific users to easily query the data without having to learn advanced SQL operators and procedural languages.
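To make the threshold idea concrete, the sketch below shows what such a similarity test could look like outside the DBMS. The threshold values and attribute names are illustrative assumptions only; an actual SQL operator extension would have to be calibrated against real structure data.

```python
from dataclasses import dataclass

@dataclass
class SSEPair:
    """Geometric descriptors of a pair of secondary structure elements."""
    len_m: float   # length of the first SSE vector
    len_n: float   # length of the second SSE vector
    vd: float      # VD(m, n): mid-point distance
    angle: float   # inter-vector angle (degrees)
    da: float      # DA(m, n): dihedral angle (degrees)

# Illustrative thresholds; real values would have to be calibrated.
THRESHOLDS = {"length": 1.5, "vd": 2.0, "angle": 15.0, "da": 20.0}

def ssep_similar(a: SSEPair, b: SSEPair, t=THRESHOLDS) -> bool:
    """Two SSE pairs are considered similar if every geometric
    difference falls within its corresponding threshold."""
    return (abs(a.len_m - b.len_m) <= t["length"]
            and abs(a.len_n - b.len_n) <= t["length"]
            and abs(a.vd - b.vd) <= t["vd"]
            and abs(a.angle - b.angle) <= t["angle"]
            and abs(a.da - b.da) <= t["da"])

# Example: compare one SSE pair from protein 1SHA with one from another protein.
p1 = SSEPair(len_m=10.2, len_n=8.7, vd=6.1, angle=42.0, da=65.0)
p2 = SSEPair(len_m=10.8, len_n=8.1, vd=6.9, angle=50.0, da=72.0)
print(ssep_similar(p1, p2))  # True with the illustrative thresholds
```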

4.3 Utility in Other Fields

Besides its application in bioinformatics, our semantic model can benefit related fields such as Chemoinformatics [17]. Chemoinformatics is concerned with the application of computational methods to tackle chemical problems, with particular emphasis on the manipulation of chemical structure information. It is therefore essential in Chemoinformatics that researchers have effective and efficient approaches to store, search, and compare large quantities of complicated 2D or 3D structures of various chemical compounds. We are planning to extend our model for this field.


5 Future Research

The number of protein structures that are determined or calculated is increasing rapidly. Searching for and comparing structures are currently very complicated tasks and have become major obstacles to the development of structural genomics. Our work focuses on the semantics of bioinformatics data in order to understand and model 3-D protein structures. With this model, we hope to pave the way for standard and useful software tools that perform protein structure search more effectively and efficiently. We are continuing to extend this model to make it more complete. For example, we are elaborating on the chemical forces beyond hydrogen bonds that contribute to the formation of secondary structures. Other semantics are also being explored, including the effects of solvent molecules. Based on our semantic model, a relational schema is being developed, and we are proposing new structure comparison and search operators. Even though implementation is not the focus of our research, we will be developing a prototype system as a proof of concept.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215. 1990, 403-410.
2. Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I. and Bourne, P. The Protein Data Bank. Nucleic Acids Research, 28 (1). 2000, 235-242.
3. Bhat, T.N., Bourne, P., Feng, Z., Gilliland, G., Jain, S., Ravichandran, V., Schneider, B., Schneider, K., Thanki, N., Weissig, H., Westbrook, J. and Berman, H. The PDB data uniformity project. Nucleic Acids Research, 29 (1). 2001, 214-218.
4. Bourne, P., Berman, H., McMahon, B., Watenpaugh, K., Weissig, H. and Fitzgerald, P. The macromolecular CIF dictionary. Methods in Enzymology, 227. 1997, 571-590.
5. Bourne, P.E., Addess, K., Bluhm, W., Chen, L., Deshpande, N., Feng, Z., Fieri, W., Green, R., Merino-Ott, J., Townsend-Merino, W., Weissig, H., Westbrook, J. and Berman, H. The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Research, 32 (Database Issue). 2004, D223-D225.
6. Branden, C. and Tooze, J. Introduction to Protein Structure. Garland Publishing, 1999.
7. Buttler, D., Coleman, M., Critchlow, T., Fileto, R., Han, W., Liu, L., Pu, C., Rocco, D. and Xiong, L. Querying Multiple Bioinformatics Information Sources: Can Semantic Web Research Help? SIGMOD Record, 31 (4). 2002.
8. Chen, J., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S. and Hurwitz, D.I. MMDB: Entrez's 3D-structure database. Nucleic Acids Research, 31 (1). 2003, 474-477.
9. Chen, J.Y. and Carlis, J.V. Similar_Join: Extending DBMS with a Bio-specific Operator. In Proceedings of the 2003 ACM Symposium on Applied Computing, (Melbourne, FL, USA, 2003), 109-114.
10. Chen, P.P.-S. The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems, 1 (1). 1976, 9-36.
11. Epstein, C.J., Goldberger, R.F. and Anfinsen, C.B. Cold Spring Harbor Symp. Quant. Biol., 28. 1963, 439.
12. Gerstein, M. Integrative database analysis in structural genomics. Nature Structural Biology, Structural Genomics Supplement. 2000.


13. Gibrat, J.-F., Madej, T. and Bryant, S.H. Surprising similarities in structure comparison. Current Opinion in Structural Biology, 6. 1996, 377-385.
14. Greer, D., Westbrook, J. and Bourne, P. An ontology driven architecture for derived representations of macromolecular structure. Bioinformatics, 18 (9). 2002, 1280-1281.
15. Holm, L. and Sander, C. 3-D Lookup: Fast Protein Structure Database Searches at 90% Reliability. In Third International Conference on Intelligent Systems for Molecular Biology, (Robinson College, Cambridge, England, 1995), AAAI Press, 179-187.
16. Kabsch, W. and Sander, C. Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers, 22. 1983, 2577-2637.
17. Leach, A. and Gillet, V. An Introduction to Chemoinformatics. Kluwer Academic Publishers, 2003.
18. Madej, T., Gibrat, J.-F. and Bryant, S.H. Threading a Database of Protein Cores. Proteins: Structure, Function, and Genetics, 23. 1995, 356-369.
19. Mizuguchi, K. and Go, N. Comparison of spatial arrangements of secondary structural elements in proteins. Protein Engineering, 8 (4). 1995, 353-362.
20. Murthy, M.R.N. A fast method of comparing protein structures. FEBS Letters, 168 (1). 1984, 97-102.
21. Stone, J., Wu, X. and Greenblatt, M. A Semantic Network for Modeling Biological Knowledge in Multiple Databases. University of Vermont Computer Science Technical Report, 2003.
22. Westbrook, J., Feng, Z., Chen, L., Yang, H. and Berman, H. The Protein Data Bank and Structural Genomics. Nucleic Acids Research, 31 (1). 2003, 489-491.
23. Westbrook, J., Feng, Z., Jain, S., Bhat, T.N., Thanki, N., Ravichandran, V., Gilliland, G., Bluhm, W., Weissig, H., Greer, D., Bourne, P. and Berman, H. The Protein Data Bank: unifying the archive. Nucleic Acids Research, 30 (1). 2002, 245-248.

Risk-Driven Conceptual Modeling of Outsourcing Decisions*

Pascal van Eck¹, Roel Wieringa¹, and Jaap Gordijn²

¹ Department of Computer Science, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{vaneck,roelw}@cs.utwente.nl
² Department of Computer Science, Vrije Universiteit Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands

Abstract. In the current networked world, outsourcing of information technology or even of entire business processes is often a prominent design alternative. In the general case, outsourcing is the distribution of economically viable activities over a collection of networked organizations. To evaluate outsourcing decision alternatives, we need to make a conceptual model of each of them. However, in an outsourcing situation, many actors are involved that are reluctant to spend too many resources on exploring alternatives that are not known to be cost-effective. Moreover, the particular risks involved in a specific outsourcing decision have to be identified as early as possible to focus the decision-making process. In this paper, we present a risk-driven approach to conceptual modeling of outsourcing decision alternatives, in which we model just enough of each alternative to be able to make the decision. We illustrate our approach with an example.

1 Introduction

Current network technology reduces the cost of outsourcing automated tasks to such an extent that it is often cheaper to outsource the automated task than to perform it in-house. Automation decisions thereby become outsourcing decisions. In the simplest case, outsourcing is the delegation of value activities from one organization to another, but in the general case, it is the distribution of a set of value activities over a collection of networked organizations. Organizations involved in negotiating about outsourcing must know as early as possible whether some allocation of activities to organizations is profitable. This precludes them from costly modeling of the functionality, data, behavior, communication structure and quality attributes of each possible alternative. To reduce this cost, a just-enough approach to conceptual modeling of possible solutions is needed, which allows a selection among alternatives without elaborate conceptual models of each. In this paper, we present a risk-driven approach to conceptual modeling of alternatives in outsourcing decisions.

* This work is part of the Freeband A-MUSE project. Freeband (http://www.freeband.nl) is sponsored by the Dutch government under contract BSIK 03025.

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 709–723, 2004. © Springer-Verlag Berlin Heidelberg 2004


The main advantage of our approach is that it provides, with relatively little effort, a problem structuring of the outsourcing decision. The approach itself involves only very simple diagramming techniques, just enough to capture the structure of the problem while simple enough to present to stakeholders who do not have a background in conceptual modelling. The approach helps in identifying the parts of the problem for which we need to develop more detailed conceptual models using well-known techniques such as those provided by the UML. We illustrate our approach with a case study that is introduced in Section 2. The first step in our approach is based on the e3-value method, which is presented in Section 3. The e3-value method, which was developed earlier by the third author, can be used to model and evaluate innovative business models for networked businesses (in this paper: outsourcing options) from the perspective of the value created by each participant in a networked business. The method is based on accepted conceptual notions from marketing, business science and axiology. It deliberately limits the number of modeling constructs such that the method is easy to learn and apply in practice. Using e3-value, outsourcing decision alternatives for the case study are developed in Section 4. In Section 5, we present a systematic approach to identifying the risks associated with each option, which is applied to our case study and discussed in Section 6. Section 7 concludes the paper.

2 The NGO Example

We illustrate our approach with a real-life example of a collection of European Non-Governmental Organizations (NGOs) in the domain of international voluntary service. Each NGO sends out volunteers from its own country to projects offered by NGOs in other countries (as well as to its own projects) and accepts volunteers from other countries into its own projects. The purpose is to create possibilities for learning from other cultures and to help in local social development. The NGOs maintain contact with each other about projects offered and about volunteers, and there is a supranational umbrella organization that loosely coordinates the work of the (independent) NGOs. Some of the NGOs receive government subsidies; most do not. In the projects offered, only work is done that cannot be performed commercially. Each NGO has a web site, a general ledger system for the financial administration, a simple workflow management system (WFM) to manage the workflow for matching each volunteer to a project, a project database of running projects, and a customer relationship management system (CRM) to manage information about volunteers who have shown interest in voluntary service. Since the NGOs vary widely in age, size and level of professionalism, and since they are independent, the implementations of these systems also vary widely and do not provide compatible interfaces. Recently, an application service provider (ASP) has offered to handle the WFM/CRM systems of all NGOs. The question to be solved is how this can be done such that the ASP makes money, while the NGOs are better off in terms of costs, quality, or both, and the risks associated with the chosen outsourcing solution are manageable.

3 The e3-value Method

The e3-value methodology is specifically targeted at the design of business networks, as found, for example, in e-commerce and e-business.


Business networks jointly produce, distribute and consume things of economic value. The rapid spread of business networks, and of large enterprises that organize themselves as networks of profit-and-loss-responsible business units, is enabled by the capability to interconnect the information systems of various businesses and business units. In all cases, the trigger for an application of e3-value is the networking opportunities perceived to be offered by information and communication technology (ICT). e3-value is then used to explore whether the networking idea can really be made profitable for all actors involved. We do so by thoroughly conceptualizing and analyzing such a networked idea, to increase shared understanding of the idea by all stakeholders involved. The results of an e3-value track are sufficiently clear to start requirements engineering for software systems.

In the following, we will indicate networks of businesses and networks of business units by the blanket term networked enterprises. We will also call the software systems that support business processes business systems. Examples of business systems are information systems, workflow management systems and enterprise-specific application software. Before the requirements on the information technology used by networked enterprises can be understood, the goals of the network itself need to be understood. More precisely, before specifying the business systems and the communications between them, it is important to understand how the various enterprises in the network create, distribute and consume objects of economic value.

The e3-value method has been developed in a number of action research projects as a method to determine the economic structure of a networked enterprise. These are real-life projects in which the researcher uses the technique together with business partners, followed by a reflection on and improvement of the technique. For the business partners, these projects are not research but commercial projects for which they pay for the results. The researcher has the dual aim of doing a job for the business and learning something from doing so.

We illustrate the concepts of e3-value using Fig. 1, which shows a value model of the current network of NGOs.

Actor. An actor is perceived by its environment as an independent economic (and often also legal) entity. An actor intends to make a profit or to provide a non-profit service.

Fig. 1. Value Model of the current NGO network.


In a sound, sustainable business model, each actor should be capable of creating net value. Commercial actors should be able to make a profit, and non-profit actors should be able to create a value that in monetary terms exceeds the costs of producing it, in order to be sustainable. Each NGO, the umbrella organization, each project and each volunteer in our example is an actor. Although this example is about non-profit organizations, the arguments for entering the network of cooperating NGOs are stated in terms of value added and costs saved by the members of this cooperation. This makes e3-value a useful technique to solve the business problem of whether a cooperation can be organized in such a way that value is added for all concerned.

Value Object. Actors exchange value objects, which are services, products, money, or even consumer experiences. A value object is of value to at least one actor. In Fig. 1, Assigned volunteer and Assigned project are value objects.

Value Port. An actor uses a value port to show to its environment that it wants to provide or request value objects. A value port has a direction, namely outbound (e.g. a service provision) or inbound (e.g. a service consumption). A value port is represented by a small arrowhead that indicates its direction.

Value Transfer. A value transfer connects two equidirectional value ports of different actors with each other. It represents one or more potential trades of value objects between these value ports. A value transfer is represented by a line connecting two value ports. Note that a value transfer may be implemented by a complex business interaction containing data transmissions in both directions [1]. The direction of a value transfer is precisely that: the direction in which value is transferred, not the direction of the data communications underlying this transfer.

Value Exchange. Value transfers come in economically reciprocal pairs, which are called value exchanges. This models 'one good turn deserves another': you offer something to someone else only if you get adequate compensation for it.

Value Interface. A value interface consists of ingoing and outgoing ports of an actor. The grouping of ingoing and outgoing ports models economic reciprocity: an object is delivered via one port, and another object is expected in return. An actor has one or more value interfaces, each modelling different objects offered and the reciprocal objects requested in return. The exchange of value objects across one value interface is atomic. A value interface is represented by an ellipsed rectangle.

Market Segment. A market segment is a set of actors that, for one or more of their value interfaces, ascribe value to objects in the same way from an economic perspective. Naturally, this is a simplification of the real world, but choosing the right simplifications is exactly what modeling is about. A market segment is represented by a stack of actor symbols. NGOs is an example of such a market segment.

With the concepts introduced so far, we can describe who exchanges values with whom. If we include the end consumer as one business actor, we would like to show all value exchanges triggered by the occurrence of one end-consumer need. This considerably enhances a shared understanding of the networked enterprise idea by all stakeholders. In addition, to assess the profitability of the networked enterprise, we would like to do profitability computations. But to do that, we must count the number of value exchanges triggered by one consumer need.


To create an end-consumer need and do profitability computations, we include in the value model a representation of dependency paths between value exchanges. A dependency path connects value interfaces in an actor and represents triggering relations between these interfaces. A dependency path has a direction. It consists of dependency nodes and connections.

Dependency Node. A dependency node is a stimulus (represented by a bullet), an AND-fork or AND-join (short line), an OR-fork or OR-join (triangle), or an end node (bull's eye). As explained below, a stimulus represents a trigger for the exchange of economic value objects, and an end node represents a model boundary.

Dependency Connection. A dependency connection connects dependency nodes and value interfaces. It is represented by a link.

Dependency Path. A dependency path is a set of connected dependency nodes and connections with the same direction, leading from one value interface to other value interfaces or end nodes of the same actor. The meaning of the path is that if a value exchange occurs across a value interface I, then the value interfaces pointed to by the path that starts at interface I are triggered according to the AND/OR logic of the dependency path. If a branch of the path points to an end node, this means "don't care".

Dependency paths allow one to reason about a network as follows: when an end consumer generates a stimulus, this triggers a number of value interfaces of the consumer, as indicated by the dependency path starting from the triggering bullet inside the consumer. These value interfaces are connected to value interfaces of other actors by value exchanges, and so these other value interfaces are triggered too. This in turn triggers more value interfaces as indicated by dependency paths inside those actors, and so on. Our value model now represents two kinds of coordination requirements: value exchanges represent the need to coordinate two actors in their exchange of a value object, and dependency paths indicate the need for internal coordination in an actor. When an actor exchanges value across one interface, it must exchange value across all value interfaces connected to this interface. This allows us to trace the value activities and value exchanges in the network triggered by a consumer need, and it also allows us to estimate the profitability of responding to this need in this way for each actor. For each actor we can compute the net value of the value objects flowing in and those flowing out according to the dependency path. The concept of a dependency path is reminiscent of that of use case maps [2], but it has a different meaning. A use case map represents a sequential scenario. Dependency paths represent coordination of value interfaces, and dependency paths in different actors may not have an obvious temporal ordering among each other, even if triggered by the same stimulus.
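As an illustration of this reasoning, the following deliberately simplified sketch traces which value interfaces are triggered by one stimulus, alternating between value exchanges (across actors) and dependency paths (inside an actor). AND/OR forks, reciprocity and the actual profitability figures are omitted, and the actor and interface names are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ValueInterface:
    actor: str
    name: str

@dataclass
class ValueModel:
    # Value exchanges connect value interfaces of *different* actors.
    exchanges: list = field(default_factory=list)     # pairs of ValueInterface
    # Dependency paths connect value interfaces inside the *same* actor
    # (AND/OR forks are left out of this simplified sketch).
    dependencies: dict = field(default_factory=dict)  # ValueInterface -> [ValueInterface]

    def triggered_by(self, stimulus: ValueInterface):
        """All value interfaces triggered by one consumer stimulus."""
        seen, frontier = set(), [stimulus]
        while frontier:
            vi = frontier.pop()
            if vi in seen:
                continue
            seen.add(vi)
            # Across actors: a triggered interface triggers its exchange partner.
            frontier += [b for a, b in self.exchanges if a == vi]
            # Inside an actor: follow the dependency path to further interfaces.
            frontier += self.dependencies.get(vi, [])
        return seen

need   = ValueInterface("Volunteer", "obtain project for fee")
assign = ValueInterface("NGO", "assign project")
source = ValueInterface("NGO", "obtain project")
other  = ValueInterface("Other NGO", "offer project")

model = ValueModel(exchanges=[(need, assign), (source, other)],
                   dependencies={assign: [source]})
print(len(model.triggered_by(need)))  # 4: the stimulus propagates through the network
```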

4 Example: Outsourcing Options for the NGOs

4.1 Current Value Model

In order to explore possibilities for outsourcing, we first discuss the current value model of the NGOs as presented in Fig. 1. The diagram shows the NGO market segment twice, because we want to show that there is interaction between NGOs. An NGO serves two types of actors: volunteers and projects.


The task of an NGO is to match a volunteer to a project. If a match is successful, the project obtains a volunteer, and the volunteer obtains a project. Both the volunteer and the project pay a fee for this service. Volunteers need a project to work for; projects need volunteers. These needs are shown in Fig. 1 by stimuli. The match itself is represented as an AND-join. Following the paths connected to the join, it can be seen that for a match, a volunteer and a project are needed. These volunteers and projects can be obtained from the NGO's own customer base, or from other NGOs, as represented by OR-joins. Note that Fig. 1 shows only part of the dependency path. Specifically, we represent that for matching purposes the rightmost NGO uses volunteers and projects from its own base or from other NGOs. However, the leftmost NGOs also do matching; the paths associated with these matchings are not presented. We skip the profitability estimations for this example, because they play no role in the following argument. The interested reader can find examples in earlier publications [3, 4]. The e3-value method includes tools to check the well-formedness of models and to perform profitability analysis.

4.2 Option (1): ICT Outsourcing

A main concern for the NGOs and the umbrella organization is to have cost-effective ICT support for their processes, while preserving or improving the quality of service offered to volunteers. Specifically, the NGOs have indicated that the different WFM and CRM systems present in the NGOs are candidates for cost-cutting operations. We saw in our current problem analysis that each NGO exploits its own WFM and CRM. One option for cost-cutting is therefore to replace all these WFM and CRM systems by one system, to be used by all NGOs. This system can be placed at the umbrella organization, which then acts as an Application Service Provider (ASP). This means that NGOs connect to the Internet and use the WFM and CRM system owned by the umbrella organization. To keep costs low, NGOs use a browser to interact with the WFM and CRM system of the umbrella organization. This leads to the value model in Fig. 2. The exchanges introduced in Fig. 1 remain intact, and the umbrella organization acting as ASP is introduced in the value model. In the value model we see that the ASP offers a matching service, i.e. the ASP offers an information system with the same main functionality as the old WFM and CRM applications. Each NGO still has to perform the matching function itself (using the information system offered by the ASP). Thus, this is a case of IT outsourcing but not of business process outsourcing (BPO). This implies that, from a value perspective, each NGO interacts exactly as in Fig. 1.

4.3 Option (2): Business Process Outsourcing

A second option is to outsource the matching function itself to the umbrella organization (business process outsourcing, which includes ICT outsourcing). Fig. 3 shows the value model for this option. The matching is now done for all NGOs using the same base of volunteers and projects. This allows for global matching, rather than local matching for each NGO separately. In this solution, there is a drastic change in the value exchanges: each NGO pays the umbrella organization for a match. The role of an NGO is no longer so much the matching itself, but attracting volunteers and projects in its specific region.


Fig. 2. Value Model for an ICT outsourcing solution.

So, the exchanges between NGOs disappear. They now exchange value objects using the umbrella organization as an intermediary.

5 Concerns and Risks

In order to implement a value model, we need to model business processes, the information manipulated by these processes, and other aspects of the technology support for the model. To prevent us from spending a lot of time on models that will not be used after an outsourcing option is chosen, we identify current business goals that will be used to discriminate between the different options. The goals are identified by listing current business issues, as illustrated in Table 1. This table is explained later. Each outsourcing option will be evaluated with respect to these goals. Furthermore, we will use a concern matrix that lists all relevant system aspects that we might possibly want to model, and sets these off against the major cost factor of each option, namely maintenance. Table 2 shows a concern matrix, which we will discuss later. We use the concern matrix to identify the risks associated with each option, where a risk is the likelihood of a bad consequence combined with the severity of that consequence. Each cell in the concern matrix is evaluated by asking (i) what the risk is that this option cannot be realized, and (ii) what the risk is that the option under consideration will not achieve the business goals in this area. The concern matrix allows us to reduce conceptual modeling costs in two ways. First, it prevents us from modeling in detail options that will not be chosen, and second, for the chosen option it points us to aspects that need not be modelled because no risk is associated with them. The use of the issue/goal list and of the concern matrix are two tools in our method engineering approach to conceptual modeling. They allow us to select the system aspects for which we will make conceptual models.


Fig. 3. Value Model for a business process outsourcing solution.

We now explain the two dimensions of the concern matrix in more detail. The horizontal dimension of the matrix distinguishes five general aspects of any system. The universality of these aspects is motivated extensively in earlier publications [5, 6]. The relevance of these aspects follows from the fact that any specification of a system to be outsourced must specify these aspects. Note that in this paper, a system equals an outsourcing option, i.e. outsourced ICT and/or business processes together with their context in an organization. We now briefly explain the system aspects:
- The services (or functions) provided by the system;
- The data (or information) processed and provided by the system;
- The behaviour of the system: the temporal order of interactions during delivery of these services;
- The communication channels through which the system interacts with other systems during service delivery;
- The composition of the system in terms of subsystems.
Our earlier publications [5, 6] distinguish a sixth aspect: the non-functional or quality aspect. In this paper, this aspect consists of attributes of the other aspects.

The vertical dimension of our concern matrix consists of several types of maintenance. Maintenance in this paper is defined as all activities that need to be performed to manage, control and maintain the ICT systems and procedures of an organization. Maintenance in this sense is also called IT service management. We need to consider maintenance because it embodies most of the entire system costs, and therefore contains most of the risk of an outsourcing option. By the same token, design and implementation of ICT systems (i.e., the work done by software engineers) is only a small part of the entire system cost and therefore contains only a small part of the risk of a design alternative.


In addition, in the context of outsourcing, design and implementation are even less relevant: if existing ICT systems are outsourced, design and implementation have already been completed; if new business processes are outsourced, design and implementation of the ICT systems that support these processes are the responsibility of the organization to which these processes are outsourced.

The maintenance dimension distinguishes three kinds of maintenance, namely functional, application and infrastructure maintenance [7], explained next.

Functional maintenance consists both of maintenance of the set of services that an information system provides and of supporting users in getting the most out of the set of services offered. This involves providing some form of helpdesk and user handholding, but also personnel and procedures to collect user requirements and turn them into a specification of required services. Functional maintenance is a responsibility of the user organization and is often performed by non-IT personnel. Some of the users are partially freed from their normal duties and are instead given the task of helping other users, performing acceptance tests, etc.

Application maintenance consists of maintenance of the software that implements an information system (as well as, to a lesser extent, user support, e.g. providing a third-line helpdesk). Application maintenance is carried out mostly by IT personnel, specifically programmers. Tasks include fixing bugs, implementing new functions, and version and release management. ASL (Application Services Library) is a standard process model for application maintenance [8].

Infrastructure maintenance comprises all tasks needed to provide the computer and networking infrastructure needed for the information systems of an organisation to run: configuration management, capacity management, and incident management (including user support). ITIL (IT Infrastructure Library) is a standard process model for infrastructure maintenance [9].

The maintenance dimension covers maintenance aspects of the outsourcing options, not of maintenance itself. This means, for instance, that the behaviour aspect is concerned with the processes an outsourced system executes and not with the processes needed for maintenance, such as those described by ASL and ITIL.
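A minimal sketch of how the resulting concern matrix could be recorded, with one risk note per cell; the example entries paraphrase concerns discussed in Section 6 and do not reproduce Table 2.

```python
# The five system aspects (columns) and three maintenance types (rows)
# of the concern matrix, as introduced above.
ASPECTS = ["services", "data", "behaviour", "communication", "composition"]
MAINTENANCE = ["functional", "application", "infrastructure"]

# One optional risk note per cell; empty cells carry no identified risk and
# therefore need no detailed conceptual model.
concern_matrix = {(m, a): None for m in MAINTENANCE for a in ASPECTS}

concern_matrix[("functional", "services")] = (
    "Who provides first-line user support: the ASP helpdesk or a power user per NGO?")
concern_matrix[("application", "communication")] = (
    "Interfaces A-D must be redesigned; cross-organizational EAI adds risk.")

def cells_needing_models(matrix):
    """Only cells with an identified risk warrant further conceptual modeling."""
    return [cell for cell, risk in matrix.items() if risk]

print(cells_needing_models(concern_matrix))
```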


6 Example: NGO Outsourcing Concerns

Table 1 lists the goals with respect to which we will decide what concerns us in the outsourcing options. For each cell in the matrix, we ask whether it will help bring us closer to the goals, and what the risk is that it takes us farther away from them. The resulting concern matrix is shown in Table 2. We now discuss the columns of the matrix.

6.1 The Behaviour Aspect

The current business processes operating in the NGOs are these:
Core processes
- acquisition of own projects
- acquisition of own volunteers
- matching: placement of incoming volunteers in own projects, and placement of own volunteers in projects of other NGOs
- volunteer preparation (training)
Management of the network of NGOs
Entry and exit of an NGO in the NGO network
Financial processes
ICT support processes
ICT management processes
HRM processes
Controlling processes
- policy making
- quality control
- incident response

At this moment, there is no need to elaborate this simple conceptual model of business processes, because we can already see what the issue is. Option (1), the ASP solution, has no impact on the business processes. However, in option (2), the BPO solution, one of the core processes (matching) no longer has to be executed. We can now ask the NGOs to decide whether this is good (more time to focus on project and volunteer acquisition and preparation) or bad (loss of strategic advantage). Looking at our list of goals (Table 1), we see that a second question to the NGOs is whether Options 1 and 2 facilitate a possible future European consolidation.


6.2 The Communication Aspect

Fig. 4 shows the currently available business systems in each NGO. The figure shows a number of business systems in NGO1. Each system consists of people and technology, such as software, paper, telephones, fax, etc. Each NGO has systems with the same functionality, but different NGOs may use different people-technology combinations to implement this functionality. The diagram also shows a number of communication channels between these systems. We have labeled them to be able to refer to them below. Each communication channel is a means to share information between two actors. The meaning of the diagram is that each channel is reliable and instantaneous; if we wanted to include an unreliable communication channel with delays, we would include the channel as a system in the diagram and connect it with lines to the systems communicating through it; the remaining lines would then represent reliable, instantaneous communication channels. The WFM system of NGO1 communicates with the WFM systems of all other NGOs through channel E. Not shown is the fact that the communication between the WFMs of different NGOs is currently done mostly by telephone, fax, email and paper mail. Fig. 4 also shows the context of NGO1, which consists of volunteers and projects (and other NGOs).

Option (1), the ASP solution, impacts the technology situation, as shown in the context diagram of Fig. 5(a). From an ICT perspective, there is now only one WFM/CRM application (instead of many different ones), but there are still as many instances of it as there were applications in the old situation; only now they are provided by one party, and they are all exactly the same. By doing so, the umbrella organization can exploit economies of scale and thus provide a more cost-effective ICT service to the NGOs. Interface E is now simplified because it is an interface between different instances of the same system. However, the other interfaces need to be redesigned, as the WFM/CRM application offered by the ASP is most probably different from the one each NGO used before. This means that either the ASP or each NGO has to manage integration middleware. Cross-organizational integration of enterprise applications is a relatively new phenomenon and is known to be complicated. Thus, the need for this technology adds considerable risk to outsourcing. The NGOs use the WFM/CRM application exactly as before; their business processes do not have to change.

Fig. 4. Communication diagram of current situation.


Fig. 5. Communication diagrams of software systems in the ASP and the outsourcing solutions.

From an ICT perspective, option (2), the business process outsourcing solution, has one matching (WFM/CRM) system for all NGOs (Fig. 5(b)). Interface E now disappears. However, as in the ICT outsourcing option, the other interfaces have to be adapted.

6.3 The Services Aspect

Functional support/maintenance. The question in both outsourcing options is how user support ('handholding') is organised. One possibility is that the ASP provides a first-line helpdesk, either as part of a package deal (fixed price) or billed per incident. In this case, each NGO has to ask itself whether it thinks the ASP knows enough about this NGO to actually be able to understand user questions and respond to them in a helpful way (language is an issue here as well). For the ASP, this helpdesk is a new value object that can be added to the value models in Fig. 2 and Fig. 3 to obtain a more complete model. If the ASP does not offer a first-line helpdesk, each NGO has to appoint someone (most probably a 'power user') to provide support for the other users. This person is then supported by a second-line helpdesk provided by the ASP.

Application support/maintenance. In application maintenance, a distinction is often made between corrective maintenance (fixing bugs, no new functions) and adaptive maintenance (implementing new user needs by adapting or building new functions). Corrective maintenance is equivalent to fixing bugs. It can be expected that in both options the ASP is responsible for this. The NGO needs to convince itself that the ASP is up to this task.

Adaptive maintenance. It can be expected that each NGO needs new functions from time to time.


The ASP may provide a service that consists of building new functions into the application provided, for instance billed by the hour or at a fixed, pre-negotiated price (this is a value object that the ASP may offer). The ASP may also use a collaborative, open-source-based method. Alternatively, the ASP may not offer adaptive maintenance at all. In that case, all added functionality has to be implemented by the NGOs themselves outside of the application offered, which has implications for interfaces A–D.

Infrastructure support/maintenance. Infrastructure support/maintenance does not change significantly for the NGOs if the CRM/WFM application is outsourced to the ASP. Each NGO still needs to provide a local infrastructure to its personnel consisting of workstations, a local area network, operating systems and personal productivity software. If the CRM/WFM application is outsourced, the NGO no longer needs to provide, e.g., a database or application server (assuming it was only used for the CRM/WFM application), but maintenance of the Internet uplink becomes more important as it is a single point of failure: if it is unavailable, the outsourced application cannot be used.

6.4 The Data Aspect

Functional support/maintenance. The issue here is ad-hoc queries. From time to time, each NGO may want to do a one-time analysis of its data. (A realistic example is checking whether the NGO qualifies for a certain form of subsidy, e.g. related to the average age of its volunteers.) Strictly speaking, this belongs to the function aspect, as it requires a new function. In practice, however, it is not possible to treat a one-time analysis as a new function: there is not enough time to wait for a new release. The ASP may offer a kind of extended database administration service that can run ad-hoc queries, or provide data-level access to the data sets to the NGO, which would require a new interface next to A–D.

Application support/maintenance. Information aspect, corrective maintenance: it is widely known that each and every data set sooner or later gets polluted with incorrect data. It may be the case that the application offered by the ASP provides a set of functions that enables end users to always manipulate all data, no matter what has happened to it. It is perhaps more realistic to assume that every now and then a database administrator is needed to correct things at the database level, either because no function is available for certain corrective actions, or because it is more efficient (bulk updates). The ASP may offer a database administrator, either as part of a package deal (fixed price) or billed by the hour, or it may decide to offer access at the database level to the NGOs, or both. For each NGO, this means that it has to decide whether to perform database maintenance itself or to buy it from the ASP.

Information aspect, adaptive maintenance: this refers to changing the database schema and most often also requires adapting existing functions to do something useful with the new schema. Therefore, the same considerations hold as for the function aspect, adaptive maintenance.


Infrastructure support/maintenance. The NGO is able to save some costs as a database management system for the WFM and CRM applications is no longer needed.

6.5 Discussion

The concern matrix identifies issues to be taken into consideration when choosing between options. It also identifies the aspects of the chosen option to be elaborated in conceptual models. So it saves us work in two ways: (1) it prevents us from detailed conceptual modeling when choosing between options, and (2) it prevents us from modeling all aspects of an option once it is chosen. Identification of possible outsourcing options (Section 4) and the risks associated with them (Section 6) enabled the NGO to focus its internal decision process as well as its discussions with the ASP provider. So far, none of the options has been found to be unacceptable. The next step for the NGO is to further elaborate the options by designing high-level models of support processes and estimating the costs associated with them. Moreover, the NGO needs to look deeper into the interfaces needed with the outsourced application. This will involve modelling the data exchanged to get an idea of the effort needed to design these interfaces. The ASP may, based on discussions with the NGOs, further extend its offerings, which can be modelled with additional models.

7 Conclusions

We presented an approach to quickly identify alternatives for outsourcing decisions and the risks associated with them, using a few simple diagramming techniques (value models, a bulleted list as a process model, and communication diagrams). The main value of this approach is that it provides, with relatively little effort, insight into the structure of the outsourcing problem at hand. This insight is needed to identify the parts of the problem that warrant more detailed conceptual modelling efforts using well-known techniques such as entity-relationship modelling. The problem structure also quickly reveals enterprise application integration (EAI) problems introduced by outsourcing. We plan to further develop our value-based approach to the design and analysis of e-business systems. This involves, for instance, systematic ways of deriving business processes from a value model [1]. Furthermore, we plan to investigate the relation between our approach and Quality Function Deployment (QFD/House of Quality) [10]. QFD provides a systematic way to compare alternative solutions to a design problem with respect to quality attributes. We think that our approach can be used to identify which quality attributes are important in a given outsourcing problem, as well as to identify alternative solutions. In this way, QFD may be usable as an extension to our approach.

References

1. van Eck, P., Gordijn, J., Wieringa, R.: Value-based design of collaboration processes for e-commerce. In Yuan, S.T., Liu, J., eds.: Proceedings 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04), IEEE Press (2004) 349–358


2. Buhr, R.J.A.: Use case maps as architectural entities for complex systems. IEEE Transactions on Software Engineering 24 (1998) 1131–1155
3. Gordijn, J., Akkermans, J.M.: Designing and evaluating e-Business models. IEEE Intelligent Systems - Intelligent e-Business 16 (2001) 11–17
4. Gordijn, J., Akkermans, J.: Value-based requirements engineering: Exploring innovative e-commerce ideas. Requirements Engineering Journal 8 (2003) 114–134
5. Wieringa, R.: A survey of structured and object-oriented software specification methods and techniques. ACM Computing Surveys 30 (1998) 459–527
6. Wieringa, R.J.: Design Methods for Reactive Systems: Yourdon, Statemate, and the UML. Morgan Kaufmann (2003)
7. Looijen, M.: Information Systems, Management, Control and Maintenance. Ten Hagen & Stam (1998)
8. van der Pols, R.: ASL, a Framework for Application Management. Van Haren Publishing (2004)
9. Office of Government Commerce: ITIL Service Support. The Stationery Office (2000)
10. Herzwurm, G., Schockert, S., Pietsch, W.: QFD for customer-focused requirements engineering. In Wieringa, R., Chang, C., Sikkel, K., eds.: 11th IEEE International Requirements Engineering Conference (RE'03), IEEE Computer Society Press (2003) 330–340

A Pattern and Dependency Based Approach to the Design of Process Models

Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, and Petia Wohed

Department of Computer and System Sciences, Stockholm University/Royal Institute of Technology, Forum 100, SE-164 40 Kista, Sweden
{maria,prasad,pajo,petia}@dsv.su.se

Abstract. In this paper an approach for building process models for e-commerce is proposed. It is based on the assumption that the process modeling task can be methodologically supported by a designer's assistant. Such a foundation provides justifications, expressible in business terms, for design decisions made in process modeling, thereby facilitating communication between systems designers and business users. Two techniques are utilized in the designer's assistant, namely process patterns and action dependencies. A process pattern is a generic template for a set of interrelated activities between two agents, while an action dependency expresses a sequential relationship between two activities.

1 Introduction

Conceptual models have become important tools for designing and managing complex, distributed and heterogeneous systems, e.g. in e-business and e-commerce [2, 17]. In e-commerce it is possible to identify two basic types of conceptual models: business models and process models. A business model focuses on the what in an e-commerce system, identifying agents, resources, and exchanges of resources between agents. Thus, a business model provides a high-level view of the activities taking place in e-commerce. A process model, on the other hand, focuses on the how in an e-commerce system, specifying operational and procedural aspects of business communication. The process model moves into a more detailed view of the choreography of the activities carried out by agents. A business model has a clearly declarative form and is expressed in terms that can be easily understood by business users. Therefore, business models function well for supporting communication between systems designers and business users. In contrast, a process model has a more procedural form and is at least partially expressed in terms, like sequence flows and gateways, that are not immediately familiar to business users. Furthermore, it is often difficult to understand why a process model has been designed in a certain way and what consequences alternative designs would have. In order to overcome these limitations, we believe that process models should be complemented by and based on a more declarative foundation. Such a foundation would provide justifications, expressible in business terms, for design decisions made in process modeling, thereby facilitating communication between systems designers and business users. In this paper, we propose a designer's assistant that provides a declarative foundation for process modeling and suggests a method for gathering domain knowledge. The work reported in this paper extends the work of [1] and [10] in that we propose two instruments for a declarative foundation of process models:

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 724–739, 2004. © Springer-Verlag Berlin Heidelberg 2004


process patterns and action dependencies. A process pattern is a generic template for a set of interrelated activities between two agents, while an action dependency expresses a sequential relationship between two actions. The rest of the paper is organized as follows. Section 2 presents the notions of business models and process models. Section 3 introduces process patterns and makes a distinction between transaction patterns and collaboration patterns. Section 4 discusses action dependencies. Section 5 proposes a designer's assistant that supports a designer in the construction of a process model. Section 6 concludes the paper and gives suggestions for further research.

2 Business Models and Process Models

To illustrate business and process models, a small running case is introduced. It is a simplified version of the Drop-Dead Order business case described in [8]. In this business scenario, a Customer requests an amount of fried chicken from a Distributor. The Distributor then requests formal offers from a Chicken Supplier and a Carrier. Furthermore, the Distributor requests a down payment from the Customer before accepting the offers from the Chicken Supplier and the Carrier. When the Customer completes the down payment to the Distributor, the Distributor accepts the offer from the Chicken Supplier, also by paying a down payment, as well as the offer from the Carrier. When the Chicken Supplier has provided the fried chicken and the Carrier has delivered them to the Customer, the Distributor has thereby fulfilled the Customer's order. After that, the Customer settles the final payment to the Distributor. Finally, the Distributor settles the Chicken Supplier's final payment and the payment for the Carrier.

2.1 Business Models

As a foundation for business models, we will use the REA ontology [13], which has been widely used for business modeling in e-commerce [17]. The REA framework is based on three main components: Resources, Economic Events, and Agents (see Fig. 1)¹. An Agent is a person or organization that is capable of controlling Resources and interacting with other Agents. A Resource is a commodity, e.g. goods or services, that is viewed as being valuable by Agents. An Economic Event is the transfer of control of a Resource from one Agent to another. Each Economic Event has a counterpart, i.e. another Economic Event that is performed in return, realizing an exchange. For instance, the counterpart of a delivery of goods may be the payment for the same goods. This connection between Economic Events is modeled through the relationship Duality. Furthermore, a Commitment is a promise to execute a future Economic Event, for example fulfilling an order by making a delivery. The Duality between Economic Events is inherited by the Commitments, where it is represented by the association Reciprocal. In order to represent collections of related Commitments, the concept of Contract is used. A Contract is an aggregation of two or more reciprocal Commitments.

¹ Due to space restrictions and for the purpose of readability, we use abbreviated forms of the terms in the original REA ontology. This is done by dropping the term 'Economic' from Economic Contract, Economic Commitment, Economic Resource, and Economic Agent.


An example of a Contract is a purchase order composed of one or several order lines, each one representing two Commitments (the goods to be delivered and the money to be paid for the goods, respectively).

Fig. 1. REA basis for business models

Fig. 2. Business Model for the Fried Chicken Business Case

A business model based on REA will consist of instances of the classes Resource, Economic Event and Agent as well as the associations between these. The business model for the running case described above can be visualized as in Fig. 2. Here, arrows represent Economic Events labeled with relevant Resources. The transfer of resource control from one Agent to another is represented by the direction of arrows. Ellipses represent relationships between Economic Events belonging to the same Duality.
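To illustrate the structure of such a model, the sketch below encodes the core REA concepts for the running case as plain records. The attribute names are our own, and Commitments and Contracts are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str

@dataclass
class EconomicEvent:
    resource: str     # the Resource whose control is transferred
    provider: Agent   # the Agent giving up control
    receiver: Agent   # the Agent receiving control

@dataclass
class Duality:
    """Pairs an Economic Event with the counterpart performed in return."""
    event: EconomicEvent
    counterpart: EconomicEvent

customer    = Agent("Customer")
distributor = Agent("Distributor")

delivery = EconomicEvent("fried chicken", provider=distributor, receiver=customer)
payment  = EconomicEvent("money", provider=customer, receiver=distributor)

order_duality = Duality(event=delivery, counterpart=payment)
```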

2.2 Process Models

The notation we will use for process models is BPMN [4], a standard developed by the Business Process Management Initiative (BPMI) [3]. The goal of BPMN is to be an easily comprehensible notation for a wide spectrum of stakeholders, ranging from business domain experts to technical developers. A feature of BPMN is that BPMN specifications can be readily mapped to executable XML languages for process specification such as BPEL4WS [2]. In this paper, a selected set of core elements from BPMN has been used. These elements are Activities, Events, Gateways, Sequence flows, Message flows, Pools and Lanes. Activity is a generic term for work that an Agent can perform. In a BPMN Business Process Diagram (abbreviated BPMN diagram), an Activity is represented by a rounded rectangle. Events, represented as circles, are something that "happens" during the course of a business process. There are three types of Events: Start, End and Intermediate Events. Activities and Events are connected via Sequence Flows that show the order in which Activities are performed in a process. Gateways are used to control the sequence flows by determining branching, forking, merging, and joining of paths. In this paper we restrict our attention to XOR and AND branching, graphically depicted as a diamond with an 'X' or a '+', respectively. Lanes and Pools are graphical constructs for separating different sets of Activities from each other.

A Pattern and Dependency Based Approach to the Design of Process Models

727

Lane is a sub-partition within a Pool used to organize and categorize Activities. Message flows depicted as dotted lines are used for communication between Activities in different Pools. (An example of them appear later in Fig. 12) An example of a BPMN diagram is shown in Fig. 3. The diagram shows a single Business Transaction in one pool with three lanes. A Business Transaction is a unit of work through which information and signals are exchanged (in agreed format, sequence and time interval) between two Agents [17]. A Business Transaction consists of two Activities, one Requesting Activity where one Agent initiates the Business Transaction and one Responding Activity where another Agent responds to the Requesting Activity. (See Fig. 4)

Fig. 3. Example of a BPMN diagram

Several Business Transactions between two Agents can be combined into one binary Business Collaboration. It often turns out to be fruitful to base binary Business Collaborations on Dualities, i.e. one Business Collaboration contains all the Business Transactions related to one Duality. This gives a starting point for constructing a process model from a business model: each Duality in the business model gives rise to one binary Business Collaboration, graphically depicted as a BPMN diagram in a Pool. In this way, a process model is constructed as a set of interrelated Business Collaborations. Furthermore, a binary Business Collaboration can naturally be divided into a number of phases. Dietz [6] distinguishes three phases: the Ordering phase, in which an Agent requests some Resource from another Agent who, in turn, promises to fulfill the request; the Execution phase, in which the Agents perform Activities in order to fulfill their promises; and the Result phase, in which an Agent declares a transfer of Resource control to be finished, followed by the acceptance or rejection by the other Agent. The ISO OPEN-EDI initiative [15] identifies five phases: Planning, Identification, Negotiation, Actualization and Post-Actualization. In this paper, we use only two phases: a Contract Negotiation phase, in which contracts are proposed and accepted, and an Execution phase, in which transfers of Resources between Agents occur and are acknowledged. In the next section, we discuss how a binary Business Collaboration can be constructed utilizing patterns for these phases.

3 Generic Process Patterns
Designing and creating business and process models is a complicated and time-consuming task, especially if one starts from scratch for every new model. A good design practice for overcoming these difficulties is, therefore, to reuse already proven solutions. A pattern is a description of a problem, its solution, when to apply the solution, and when and how to apply the solution in new contexts [11]. The significance of a pattern in e-commerce is to serve as a predefined template that encodes business rules and business structure according to well-established best practices. In this paper such patterns are expressed as BPMN diagrams. They differ from the workflow patterns of [18], [16], [19] by focusing primarily on communicative aspects, while control-flow mechanisms are covered only at a basic level. In the following subsections, a framework for analyzing and creating transaction and collaboration patterns is proposed. We hypothesize that most process models for e-commerce applications can be expressed as a combination of a small number of these patterns.

3.1 Modeling Business Transactions
When a transaction occurs, it typically gives rise to effects, i.e. Business Entities such as Economic Events, Contracts, and Commitments are affected (created, deleted, cancelled, fulfilled). Furthermore, the execution of a transaction may cause the desired effect to come into existence immediately, or only indirectly, depending on the intentions of the interacting Agents. For example, the intention of an Agent in a transaction may be to propose a Contract, to request a Contract, or to accept a Contract. In all three cases the business entity is the same (a Contract), but the intention of the Agent differs.

Fig. 4. Business Transaction analysis

Fig. 4 builds on REA and suggests a set of Business Intentions, Effects, and Entities. These notions are utilized in defining transaction patterns and transaction pattern instances as follows.
Definition: A transaction pattern (TP) is a BPMN diagram with two Activities, one Requesting Activity and one Responding Activity. Every Activity has a label of the form <Intention, Effect, Business Entity>, where Intention ∈ {Request, Propose, Declare, Accept, Reject, Acknowledge}, Effect ∈ {create, delete, cancel}, and Business Entity ∈ {aContract, anEconomicEvent, aCommitment}. All End Events are labeled according to the Intention and Business Entity of the Activity prior to the sequence flow leading to the End Event.
Intuitively, the components of an activity label mean the following: Business Entity tells what kind of object the Activity may affect. Effect tells what kind of action is to be applied to the Business Entity – create, delete, or cancel. Intention specifies what intention the business partner has towards the Effect on the Business Entity.
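A minimal sketch of this label structure is given below; the Python encoding is our own, and only the value sets come from the definition.

from dataclasses import dataclass

INTENTIONS = {"Request", "Propose", "Declare", "Accept", "Reject", "Acknowledge"}
EFFECTS = {"create", "delete", "cancel"}
BUSINESS_ENTITIES = {"aContract", "anEconomicEvent", "aCommitment"}

@dataclass(frozen=True)
class ActivityLabel:
    """An <Intention, Effect, Business Entity> label as used in transaction patterns."""
    intention: str
    effect: str
    entity: str

    def __post_init__(self):
        assert self.intention in INTENTIONS
        assert self.effect in EFFECTS
        assert self.entity in BUSINESS_ENTITIES

# The requesting activity of the Contract-Offer TP carries this label:
offer = ActivityLabel("Propose", "create", "aContract")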


The meanings of the intentions listed above are as follows:
Propose – someone offers to create, delete or cancel a Business Entity.
Request – someone requests other Agents to propose to create, delete or cancel a Business Entity.
Declare – someone unilaterally declares a Business Entity created, deleted or cancelled.
Accept/Reject – someone answers a previously given proposal.
Acknowledge – someone acknowledges the reception of a message.
Definition: A pattern instance of a transaction pattern is a BPMN diagram derived from the pattern by renaming its Activities, replacing each occurrence of aContract in an activity label with the name of a specific Contract, replacing each occurrence of anEconomicEvent with the name of a specific Economic Event, and replacing each occurrence of aCommitment with the name of a specific Commitment.
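The instantiation step of this definition can be sketched as follows (a self-contained illustration in our own notation; the concrete contract name "ChickenPurchaseOrder" is hypothetical).

def instantiate(label, bindings):
    """label is an <Intention, Effect, Business Entity> triple; bindings maps the
    generic entity names (aContract, anEconomicEvent, aCommitment) to concrete names."""
    intention, effect, entity = label
    return (intention, effect, bindings.get(entity, entity))

# Instance of the Contract-Offer requesting activity for a specific purchase order:
print(instantiate(("Propose", "create", "aContract"),
                  {"aContract": "ChickenPurchaseOrder"}))
# -> ('Propose', 'create', 'ChickenPurchaseOrder')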

3.2 Transaction Patterns (TPs)
In the following sections, three basic Contract Negotiation TPs and two Execution TPs are suggested based on the framework described above.
3.2.1 Contract Negotiation TPs
The Contract-Offer TP models one Agent proposing an offer (<Propose, create, aContract>) to another Agent, who acknowledges receiving the offer. The acceptance or rejection of an offer is modeled in the Contract-Accept/Reject TP; see Fig. 5.

Fig. 5. TPs for Contract Negotiation: Contract-Offer and Contract-Accept/Reject

Fig. 6. TP for Contract Negotiation: Contract-Request

Fig. 6 models the Contract-Request case, where an Agent requests other Agents to make an offer for aContract on certain Resources.


3.2.2 Execution TPs
We introduce two Execution TPs (see Fig. 7) that specify the execution of an Economic Event, i.e. the transfer of Resource control from one Agent to another. An example is a Chicken Distributor selling Chickens (a Resource) for $3 (another Resource).

Fig. 7. TPs for Execution: Economic Event Offer and Economic Event Accept

3.3 Assembling Transaction Patterns into Collaboration Patterns
An issue is how to combine the transaction patterns described in the previous section, i.e. how to create larger sequences of patterns. For this purpose, collaboration patterns define the orchestration of Activities by assembling a set of transaction patterns and/or more basic collaboration patterns, based on rules for transitioning from one transaction/collaboration to another. To hide the complexity when TPs are combined into arbitrarily large collaboration patterns, we use a layered approach in which the TPs constitute activities in the BPMN diagram of the collaboration patterns.
Definition: A collaboration pattern (CP) is a BPMN diagram whose activities consist of transaction and/or collaboration patterns. A CP has exactly two End Events, representing success or failure of the collaboration, respectively. All End Events are labeled according to the Intention and Business Entity of the Activity prior to the sequence flow that leads to the End Event.
3.3.1 Contract Negotiation CPs
The Contract Establishment CP (see Fig. 8) is assembled from the Contract-Offer and Contract-Accept/Reject TPs. An example scenario is a Chicken Distributor proposing an offer to a customer on certain terms. The contract is formed (or rejected) by the customer's acceptance or rejection of the proposed offer.

Fig. 8. Contract Establishment CP

The two recursive paths taken when a contract offer/request has been rejected have a natural correspondence in the business negotiation concepts 'Counter Offer' and 'Bidding' (or 'Auctioning'), respectively. 'Counter Offer' refers to the switch of roles between Agents, i.e. when the responding Agent has rejected the requesting Agent's offer, the former makes an offer of her own. 'Bidding' is modeled via the other Sequence Flow from the gateway, i.e. when the responding Agent has turned down a contract offer, the requesting Agent immediately initiates a new Business Transaction with a new (changed) offer for a Contract. The Contract-Proposal collaboration pattern (Fig. 9²) is assembled from the Contract-Request TP and the Contract-Establishment CP defined above.

² When a CP is composed of other CPs, no lanes can be shown, as the Requesting and Responding Activities are already encapsulated.

Fig. 9. Contract Propose CP

3.3.2 Execution CP
The Execution collaboration pattern specifies the relevant TPs and the rules for sequencing among them within the completion of an Economic Event. The pattern is assembled from the Economic Event Offer and Economic Event Accept TPs.

Fig. 10. Execution CP

4 Action Dependencies
The process patterns introduced in the previous section provide a basis for a partial ordering of the activities taking place in a business process, in particular the ordering based on contract negotiation and execution. We will refer to the activities involved in the different phases of a process as contract negotiation activities or execution activities, respectively. However, the ordering derived from the process patterns only provides a starting point for designing complete process models, i.e. it needs to be complemented by additional interrelationships among the activities. These interrelationships should have a clear business motivation, i.e. every interrelationship between two activities should be explainable and motivated in business terms. We suggest formalizing this idea of business motivation by introducing the notion of action dependencies. An action dependency is a pair of actions (either economic events or activities), where the second action for some reason is dependent on the first one. We identify the following four kinds of action dependencies.
Flow dependencies. A flow dependency [12] is a relationship between two Economic Events, which expresses that the Resources obtained by the first Economic Event are required as input to the second Economic Event. An example is a retailer who has to obtain a product from an importer before delivering it to a customer. Formally, a flow dependency is a pair <A, B>, where A and B are Economic Events from different Dualities.
Trust dependencies. A trust dependency is a relationship between two Economic Events within the same Duality, which expresses that the first Economic Event has to be carried out before the other one as a consequence of low trust between the Agents. Informally, a trust dependency states that one Agent wants to see the other Agent do her work before doing his own work. An example is a car dealer who requires a down payment from a customer before delivering a car. Formally, a trust dependency is a pair <A, B>, where A and B are Economic Events from the same Duality.
Control dependencies. A control dependency is a relationship between an execution Activity and a contract negotiation Activity. A control dependency occurs when one Agent wants information about another Agent before establishing a Contract with that Agent. A typical example is a company making a credit check on a potential customer (i.e. an exchange of the Resources information and money in two directions). Formally, a control dependency is a pair <A, B>, where A is an execution Activity, B is a contract negotiation Activity, and A and B belong to different Dualities.
Negotiation dependencies. A negotiation dependency is a relationship between Activities in the contract negotiation phase from different Dualities. A negotiation dependency expresses that an Agent is not prepared to establish a contract with another Agent before she has established another contract with a third Agent. One reason for this could be that an Agent wants to ensure that certain Resources can be procured before entering into a Contract where these Resources are required. Another reason could be that an Agent does not want to procure certain Resources before there is a Contract for an Economic Event where these Resources are required. Formally, a negotiation dependency is a pair <A, B>, where A and B are contract negotiation Activities in different Dualities.
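The four dependency kinds can be read as typed pairs with simple well-formedness conditions. The following sketch (our own encoding, not the paper's notation) checks those conditions; the action and Duality names are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    duality: str   # the Duality the action belongs to
    kind: str      # "execution" or "negotiation"

def is_valid_dependency(dep_kind, a, b):
    if dep_kind == "flow":         # Economic Events from different Dualities
        return a.kind == b.kind == "execution" and a.duality != b.duality
    if dep_kind == "trust":        # Economic Events from the same Duality
        return a.kind == b.kind == "execution" and a.duality == b.duality
    if dep_kind == "control":      # execution Activity before a negotiation Activity, different Dualities
        return a.kind == "execution" and b.kind == "negotiation" and a.duality != b.duality
    if dep_kind == "negotiation":  # negotiation Activities in different Dualities
        return a.kind == b.kind == "negotiation" and a.duality != b.duality
    return False

# Retailer example: obtaining the product (importer Duality) must precede
# delivering it to the customer (customer Duality) -> a flow dependency.
obtain = Action("Obtain product", "importer-retailer", "execution")
deliver = Action("Deliver product", "retailer-customer", "execution")
assert is_valid_dependency("flow", obtain, deliver)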

5 A Designer's Assistant
In this section, we show how a process model can be designed based on process patterns and action dependencies. Designing a process model is not a trivial task but requires a large number of design decisions. In order to support a designer in this task, we propose an automated designer's assistant that guides the designer through the task by means of a sequence of questions, divided into four steps, followed by a fifth step in which the process model is generated based on the answers to the questions in steps 1-4 (see Fig. 11).
Step 1, during which information is gathered about the Agents involved in the business process, the Resources exchanged between them, and the Economic Events through which these Resources are exchanged. The result of this step is a business model.
Step 2, during which information about the (partial) order between the Economic Events is gathered. The result of this step is an ordering of the Activities in the Execution phase of a process model.


Step 3, during which information about existing negotiation dependencies is gathered. The result of this step is an ordering of the Activities in the Negotiation phase.
Step 4, during which inter-phase and inter-pool dependencies are established. The result of this step is an ordering of Activities that crosses the Negotiation and Execution phases.
Step 5, during which a set of production rules is applied to the results of the previous steps in order to generate a process model.
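The production rules of Step 5 are not shown in this excerpt; as a rough sketch under that caveat, the dependency pairs gathered in Steps 2-4 can be treated as precedence edges and linearized, which is the kind of ordering Step 5 turns into sequence flows. The activity names below loosely follow the running case.

from graphlib import TopologicalSorter

dependencies = [
    ("Deliver Chicken (Supp->Dist)", "Deliver Chicken (Dist->Cust)"),  # flow dependency
    ("Pay Chicken (Cust->Dist)", "Deliver Chicken (Dist->Cust)"),      # trust dependency
]

graph = {}
for before, after in dependencies:
    graph.setdefault(after, set()).add(before)   # 'after' depends on 'before'
    graph.setdefault(before, set())

# A linear extension of the partial order induced by the dependencies:
print(list(TopologicalSorter(graph).static_order()))
# e.g. ['Deliver Chicken (Supp->Dist)', 'Pay Chicken (Cust->Dist)', 'Deliver Chicken (Dist->Cust)']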

Fig. 11. Steps of the Designer's Assistant

5.1 Step 1 – Business Model
In order to produce a business model, the following four questions need to be answered. Answers according to the running case are given after every question.
1. Who are the Agents? Answers: Customer (Cust), Distributor (Dist), Chicken Supplier (Supp), Carrier (Carr)
2. What are the Resources? Answers: Money, Chicken, Delivery
3. What are the Economic Events? Specify them by filling in the following table.

4. Group the Economic Events into Dualities by filling in the following table.


The answers to these four questions provide sufficient information to produce the business model shown in Fig. 2.

5.2 Step 2 – Execution Phase Order
Having identified the Economic Events, the designer is prompted to determine the dependency orders. In this step only flow and trust dependencies are considered.
5. Specify Flow and Trust Dependencies by filling in the table below (where the row and column headings are the Economic Events identified in question 4). If an Economic Event (in row i) precedes an Economic Event (in column j), put a mark in the corresponding cell.

A Model Driven Approach for XML Database Development
Belén Vela, César J. Acuña, and Esperanza Marcos

The XML Schema is represented by means of a stereotyped UML package that includes all the components of the XML Schema. The name of the schema will be the name of the package. The attributes of the XML Schema will be tagged values of the package. The XML elements are represented by stereotyped classes named as the value of the name attribute of the element. The attributes of the element will be tagged values of the class. The appearance order of the element in the XML Schema, prefixed with the order number of the element to which it belongs, will also be a tagged value of the class and will be represented next to the name of the class. The XML attributes are represented by means of UML attributes of the class that represents the XML element to which the XML attributes belong. The base type of an XML attribute will be represented as the data type of the corresponding UML attribute. The constraints to be satisfied by the attribute (required, optional) and the default or fixed value will be represented as tagged values. A compositor composition is a special kind of composition stereotyped with the kind of compositor: «sequence», «choice» or «all». It can only be used to join an element (composite) with the elements that compose the father element (parts). The compositors can be used to represent nameless XML complexTypes. The XML complexTypes are represented as classes stereotyped with «complexType», if they are named; in this case, the complexType will be related by means of a «uses» association with the element, complexType or simpleType that uses it. If the complexType has no name, it will be represented implicitly by the compositor that composes the complexType. The XML simpleType is a type that has no subelements or attributes. The simpleTypes are represented as classes stereotyped with «simpleType», named as the element that contains them and related to that father element by means of a stereotyped composition. The XML complexContent is a subclass of the complexType that it defines. The complexContent types are represented as stereotyped classes, which must be


related by an inheritance association with the elements or complexTypes that the complexContent redefines. The XML simpleContent is a subclass of a complexType or a simpleType. The simpleContent types are represented as stereotyped classes that are related by an inheritance association to the father type (simple or complex type) that is redefined by the simpleContent type. A «uses» association is a special kind of unidirectional stereotyped association that joins a named complexType with the element or type (simple or complex) that uses it. A «uses» association can also be used to join two elements by means of a ref attribute in one of the elements. The direction of the association is represented by an arrow at the end of the element that is used by the one containing the corresponding ref element. A REF element will be represented by means of an attribute stereotyped with «ref» and represents a link to another element. A REF attribute can only refer to a defined element and is associated with the referred element by means of a «uses» association. In figure 3 the metamodel of the UML extension for XML Schemas is shown.

Fig. 3. Metamodel of the UML extension for XML Schemas

3.3 Mappings to Obtain the Data PSM from the Data PIM
In this section we describe the mappings defined to build the data PSM from the data PIM. There exist other works [9] in which rules are defined to obtain XML Schemas from a UML class diagram but, to our knowledge, none of these proposals gives specific guidelines for the design of XML DBs. We start from the data PIM, represented with a UML class diagram, and obtain the data PSM in XML, also represented in extended UML, by applying the following mapping rules:


The complete conceptual data model is transformed, at the PSM level, into an XML Schema named 'Data PSM' that includes all the components of the data PIM. It will be represented with a stereotyped UML package named 'Data PSM'; this package will include the components of the data PSM.
Transformation of UML classes. We can split the UML classes into different groups: subclasses of a generalization, classes that represent parts of a composition, and, finally, the rest of the classes. The first and second groups of UML classes (subclasses and parts) are transformed into named complexTypes. The classes of the third group are transformed into elements named after the class. Abstract classes are mapped into abstract elements. The complexTypes generated when transforming UML classes of the first and second groups are represented in extended UML by means of classes stereotyped with «complexType» and named with the name of the subclass or part plus "_type". The elements generated when transforming UML classes of the third group are represented in extended UML by a class stereotyped with «element» and named after the class of the data PIM from which it comes.
Transformation of UML attributes. UML attributes of the classes can be transformed in two different ways: into XML attributes or into XML elements. A straightforward mapping is to transform the UML attributes into XML attributes of the element that represents the class. However, attributes cannot be used this way if they are not single-valued, as, for example, multivalued or composed attributes. Moreover, attributes are usually used to describe the content of the elements and not to be visualized. For these reasons, and since UML attributes are represented as classes in the UML metamodel, we propose to transform the UML attributes of a class by means of a complexType that includes the UML attributes of the class as subelements. This complexType will be represented in extended UML with a composition stereotyped with «sequence», which includes all the attributes as subelements represented by classes stereotyped with «element». The attributes of the class can be transformed according to their type:
A mandatory attribute will be represented with a minimum multiplicity of one at the composition, whereas an optional attribute will be represented with a minimum multiplicity of zero.
A multivalued attribute will be represented with a maximum multiplicity of N on the part side of the composition.
A composed attribute will be represented by an element that is related to its composing attributes by means of a complexType.
An enumerated attribute will be represented by a simpleType composition with the stereotyped restriction Enumeration.
A choice attribute will be represented by a simpleType composition with the stereotyped restriction Choice.
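As an illustration of the class and attribute rules above (a sketch of the intended XML Schema output, not the authors' tool), the following Python function emits the fragment for a class of the third group, turning each UML attribute into a subelement of the element's anonymous complexType. The attribute names used for Result are hypothetical.

def class_to_xsd(class_name, attributes):
    """Third-group rule: the class becomes an element whose (anonymous)
    complexType holds one subelement per UML attribute."""
    subelements = "\n".join(
        f'      <xs:element name="{a}" type="xs:{t}"/>' for a, t in attributes.items()
    )
    return (f'<xs:element name="{class_name}">\n'
            f'  <xs:complexType>\n'
            f'    <xs:sequence>\n{subelements}\n'
            f'    </xs:sequence>\n'
            f'  </xs:complexType>\n'
            f'</xs:element>')

# Hypothetical attributes of the Result class from the case study:
print(class_to_xsd("Result", {"result_id": "string", "impressions": "string"}))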


Transformation of associations. There are two main aspects to take into account when transforming UML associations into an XML Schema. The first one concerns the direction in which the associations are implemented, that is, whether they are unidirectional or bidirectional. The second one concerns the way in which the associations are mapped using XML Schema constructions.
With regard to the first aspect, UML associations can be represented in an XML Schema either as unidirectional or as bidirectional associations. A unidirectional association can be traversed in only one direction, whereas a bidirectional one can be traversed in both directions. If we know that queries require data in both directions of the association, it is recommended to implement it as a bidirectional association, thereby improving response times. However, we have to take into account that bidirectional associations are not maintained by the system, so consistency has to be guaranteed manually. Therefore bidirectional associations, despite improving response times in some cases, have a higher maintenance cost. The navigability (if it is represented) in a UML diagram shows the direction in which the association should be implemented.
With regard to the second aspect, the way in which an association is mapped to XML Schema is a crucial issue, and there are different ways of transforming UML associations into XML Schema associations within an XML document, each with its advantages and disadvantages. Some criteria for selecting the best alternative are related to the kind of information, the desired level of redundancy, etc.; in [9] a study of the different alternatives is made. We propose to model the associations by adding association elements within the XML elements that represent the classes implicated in the association. Next, we show how to map the associations in a unidirectional way using ref elements.
One-to-One. A one-to-one association will be mapped by creating an association subelement of one of the elements that represent the classes implicated in the association. The subelement will be named with the association name and will include a complexType with an element of ref type that references the other element implicated in the association. If the minimum multiplicity is one, the attribute minOccurs will be one, which is the default value and can be omitted; otherwise it will be zero. As the maximum multiplicity is one, the attribute maxOccurs has to be one, which is the default value too and can be omitted.
One-to-Many. A one-to-many association will be transformed in a unidirectional way by creating an association subelement, named as the association, within the element that represents the class with multiplicity N, including a complexType with a subelement of ref type that references the other element implicated in the association (a sketch of this rule is given after these rules). If the minimum multiplicity is one, the attribute minOccurs will be one, which is the default value and can be omitted; otherwise it will be zero. As the maximum multiplicity is one in this direction, the attribute maxOccurs has to be one, which is the default value and can also be omitted.
Many-to-Many. Following the same reasoning as in the previous case, a many-to-many association will be transformed by defining an association element within one of the elements, including a complexType with a sequence of


reference elements to the collection of elements implicated in the association. If the minimum multiplicity is one, the attribute minOccurs will be one, which is the default value; otherwise it will be zero. As the maximum multiplicity is N in this direction, the attribute maxOccurs has to be N.
Transformation of aggregations. An aggregation will be mapped by creating a subelement, named as the aggregation, of the element that represents the aggregate. If the aggregation has no name, the name will be "is_aggregated_of". This (aggregate) element will include a complexType with an element of ref type that references the parts of the aggregation. If the maximum multiplicity is N, the complexType will include a sequence of references. If the minimum multiplicity is one, the attribute minOccurs will be one, which is the default value; otherwise it will be zero. It will be represented in extended UML by means of a stereotyped aggregation.
Transformation of compositions. A composition of classes in the UML class diagram will be mapped by including the parts of the composition as subelements of the element that represents the composite. The subelements will be of the complexType defined to represent the parts of the composition. It will be represented in extended UML by including the part in the stereotyped composition of the composite, together with a «uses» association that relates the part with the corresponding complexType of the part.
Transformation of generalizations. A generalization of classes in the UML class diagram will be mapped by including the superclass as an element with a choice complexType that includes the subclasses of the generalization as subelements, typed by their complexTypes. It will be represented in extended UML by means of a composition stereotyped with «choice». The proposed mapping rules to obtain the data PSM are summarized in Table 1.
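The sketch below illustrates the one-to-many rule referenced above (our own helper, not part of the proposal); it produces the association subelement that is placed inside the element on the N side. The Study/Result/originate names come from the case study in Section 4.

def one_to_many_assoc(assoc_name, target_element, min_mult_one):
    """One-to-many rule: an association subelement, named after the association,
    wraps a ref-style element pointing at the '1' side; it is then placed inside
    the element that represents the class with multiplicity N."""
    min_occurs = "" if min_mult_one else ' minOccurs="0"'   # minOccurs="1" is the default
    return (f'<xs:element name="{assoc_name}"{min_occurs}>\n'
            f'  <xs:complexType>\n'
            f'    <xs:sequence>\n'
            f'      <xs:element ref="{target_element}"/>\n'
            f'    </xs:sequence>\n'
            f'  </xs:complexType>\n'
            f'</xs:element>')

# The 'originate' association between Study (1) and Result (N), to be nested in Result:
print(one_to_many_assoc("originate", "Study", min_mult_one=True))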

4 A Case Study
The tasks, models and mappings proposed in MIDAS are being defined by means of different case studies. In this paper we present part of the case study of a WIS for medical image management. This WIS is based on DICOM (Digital Imaging and Communications in Medicine) [1], the most widely accepted standard for medical image exchange. The case study presented here focuses only on the development of the XML DB; we show how to build it starting from a conceptual data model. Section 4.1 presents the data PIM, and section 4.2 presents the data PSM, showing how to apply the proposed mappings to obtain it. Finally, section 4.3 shows the XML database implementation in Oracle's XML DB.

4.1 Data PIM
For the sake of brevity, we present only a reduced part of the data PIM obtained in the analysis activity of the MIDAS/ST step. This partial data PIM is based on the


information model defined in the DICOM standard. As we can see in figure 4, Patients can make one or more Visits. Each Visit can derive into one or more Studies. A Study is formed by several Study Components, which can belong simultaneously to different Studies. A Study Component references several Series, each of which is a set of Images. There are different kinds of Series, such as Image, Raw Data, etc. A Result is obtained from a Study and is composed of several Interpretations.

Fig. 4. Partial Data PIM in UML


4.2 Data PSM
To obtain the data PSM from the data PIM, we apply the mappings defined in section 3.3 and use the extended UML notation summarized in section 3.2 to represent the resulting XML Schema.
Transformation of classes: In order to transform the UML classes, we split them into two groups as follows. The first group is formed by those classes that are part classes in a composition, such as the Interpretation class, as well as those classes that are subclasses in a generalization, such as the Image and Raw Data classes. These classes are transformed into XML complexTypes, and the attributes of these classes are mapped into subelements of the defined complexTypes. Figure 5 depicts part of the data PSM represented in extended UML. The part marked with a solid line shows the transformation of the Interpretation class by means of a complexType named Interpretation_type. The other group is formed by the rest of the classes. Each of these classes is mapped into an XML element, and its attributes are mapped into subelements of the XML element that represents the class, related by means of a composition stereotyped with «sequence». Figure 5 shows the transformation of the class Result, which is marked with a dashed line.

Fig. 5. Partial Data PSM in extended UML

Transformation of associations: In figure 5 the transformation of the one-to-many association between the Study and Result classes is marked with a dotted line. For space reasons the Study class is not completely represented in this figure, and the representation of its UML attributes is omitted. As the mapping rules indicate, the one-to-many association between these classes is transformed by adding a subelement to the element that represents the class with maximum multiplicity N. In this case, we add the subelement originate, named as the association, to the element Result; this subelement


has a reference to the element that represents the class with maximum multiplicity one. The subelement originate will be related to the Study element by means of a «uses» association. Moreover, figure 5 depicts the transformation of the composition between the Result and Interpretation classes. This association is mapped by adding a subelement, named as the part class Interpretation, to the element that represents the composite class Result. The type of the Interpretation element is the complexType Interpretation_type defined when mapping the Interpretation class. Additionally, figure 6 shows the XML Schema code generated from the UML diagram depicted in figure 5.

Fig. 6. XML Schema code

In order to transform the disjoint and incomplete generalization association between the Series, Image and Raw Data classes, an element for each subclass has to be added into the XML choice complexType, within the element that represents the superclass. That is to say, the Image and Raw_data elements have to be included into the Series element. The complexTypes of the added elements are Image_type and Raw_data_type, respectively. These types were created when transforming the Image and Raw Data UML classes. Figure 7 shows the resulting transformation in extended UML and figure 8 shows the corresponding XML Schema code generated. In both figures, the subelements of Image_type and Raw_data_type were omitted for space reasons.

Fig. 7. Partial Data PSM in extended UML (generalization transformation)

Fig. 8. XML Schema code


4.3 Database Implementation in Oracle XML DB
The XML Schema obtained in the previous section was implemented using Oracle's XML DB. Based on the study of different XML DB solutions made in [20] and on the previous experience of our research group, we have chosen Oracle to validate our proposal and to carry out the implementation of the XML DB. However, as we propose to use the standard XML Schema as the data storage model in XML, the approach is applicable to any DBMS that supports XML Schema. The way in which Oracle stores the XML data compliant with the defined XML Schema is shown in figure 9. We use the UML extension for OR DB design proposed in [13] to represent it. For space reasons, figure 9 shows only the part corresponding to the XML Schema depicted in figure 5.

Fig. 9. Implementation in Oracle’s XML DB

5 Conclusions and Future Work
Nowadays there exist different solutions for the storage of XML data but, in spite of several existing works along this line, there is no methodology for the systematic design of XML databases. In this paper we have described a model driven approach for the development of XML DBs in the framework of MIDAS, a model driven methodology for the development of WIS based on MDA. Specifically, we have focused on the content aspect of the structural dimension of MIDAS, which corresponds to the traditional concept of a DB. There exist different ways of developing a DB. In this paper we have proposed a development process for XML DBs, where the data PIM is the conceptual data model (UML class diagram) and the data PSM is the XML Schema model. Both of them are represented in UML; therefore we have also summarized the UML extension to represent XML Schemas. Moreover, we have defined the mappings to transform the data PIM into the data PSM, which will be the XML database schema. We have developed different case studies to validate our proposal, and in this paper we have shown part of the case study of the development of an XML DB for the management of medical images, stored in Oracle's XML DB. We are working on the implementation of a CASE tool (MIDAS-CASE), which integrates all the techniques proposed in MIDAS for the semiautomatic generation of WIS. The repository of the CASE tool is also being implemented in Oracle's XML


DB, following the approach proposed in this paper. We have already implemented the XML module of the tool, including the parts for XML Schema and WSDL. We are now implementing, on the one hand, the automatic generation of the XML Schema code from the corresponding graphical representation of the data PSM in extended UML and, on the other hand, the semi-automatic transformation from the data PIM to the data PSM to obtain the code of the XML DB.

References
1. ACR-NEMA. The DICOM Standard. Retrieved from: http://medical.nema.org/, 2003.
2. Barbosa, D., Barta, A., Mendelzon, A., Mihaila, G., Rizzolo, F. and Rodriguez-Gianolli, P. ToX - The Toronto XML Engine. International Workshop on Information Integration on the Web, Rio de Janeiro, 2001.
3. Bray, T., Paoli, J., Sperberg-McQueen, C. M. and Maler, E. Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation. Retrieved from: http://www.w3.org/TR/2000/REC-xml-20001006/, 2000.
4. Cáceres, P., Marcos, E. and Vela, B. A MDA-Based Approach for Web Information System Development. Workshop in Software Model Engineering, UML Conference, San Francisco, USA, October 2003.
5. Case, T., Henderson-Sellers, B. and Low, G.C. A generic object-oriented design methodology incorporating database considerations. Annals of Software Engineering, Vol. 2, pp. 524, 1996.
6. Chaudhri, A.B., Rashid, A. and Zicari, R. (Eds.). XML Data Management. Native XML and XML-Enabled Database Systems. Addison Wesley, 2003.
7. eXcelon Corporation. Managing DXE. System Documentation Release 3.5. eXcelon Corporation, Burlington. Retrieved from: www.excelon.corp.com, 2003.
8. IBM Corporation. IBM DB2 Universal Database - XML Extender Administration and Programming, Product Documentation Version 7. IBM Corporation, 2000.
9. Krumbein, T. and Kudrass, T. Rule-Based Generation of XML Schemas from UML Class Diagrams. Berliner XML Tage 2003, Berlin (Germany), 13-15 October 2003. Ed.: R. Tolksdorf and R. Eckstein, 2003.
10. Marcos, E., Cáceres, P. and De Castro, V. From the Use Case Model to the Navigation Model: a Service Oriented Approach. CAISE FORUM '04, Riga (Latvia), 10-11 June 2004. Ed.: J. Grabis, A. Persson and J. Stirna. Proceedings, 2004.
11. Marcos, E., Vela, B. and Cavero, J. M. Extending UML for Object-Relational Database Design. Fourth Int. Conference on the Unified Modeling Language, UML 2001, Toronto (Canadá), LNCS 2185, Springer-Verlag, pp. 225-239, October 2001.
12. Marcos, E., Vela, B., Cáceres, P. and Cavero, J.M. MIDAS/DB: a Methodological Framework for Web Database Design. DASWIS 2001, Yokohama (Japan), November 2001. LNCS 2465, Springer-Verlag, pp. 227-238, September 2002.
13. Marcos, E., Vela, B. and Cavero, J.M. Methodological Approach for Object-Relational Database Design using UML. Journal on Software and Systems Modeling (SoSyM), Springer-Verlag. Ed.: R. France and B. Rumpe. Vol. SoSyM 2, pp. 59-72, 2003.
14. Microsoft Corporation. Microsoft SQL Server - SQLXML 2.0, System Documentation. Microsoft Corporation, 2000.
15. OMG. Model Driven Architecture. Document number ormsc/2001-07-01. Ed.: Miller, J. and Mukerji, J. Retrieved from: http://www.omg.com/mda, 2001.
16. Oracle Corporation. Oracle XML DB. Technical White Paper. Retrieved from: www.otn.com, January 2003.
17. Software AG. Tamino X-Query. System Documentation Version 3.1.1. Software AG, Darmstadt, Germany. Retrieved from: www.softwareag.com, 2001.


18. Vela, B. and Marcos, E. Extending UML to represent XML Schemas. The 15th Conference on Advanced Information Systems Engineering, CAISE'03 FORUM, Klagenfurt/Velden (Austria), 16-20 June 2003. Ed.: J. Eder, T. Welzer. Short Paper Proceedings, 2003.
19. W3C XML Schema Working Group. XML Schema Parts 0-2: [Primer, Structures, Datatypes]. W3C Recommendation. Retrieved from: http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/, 2001.
20. Westermann, U. and Klas, W. An Analysis of XML Database Solutions for the Management of MPEG-7 Media Descriptions. ACM Computing Surveys, Vol. 35 (4), pp. 331-373, December 2003.
21. X-Hive Corporation. X-Hive/DB 2.0 Manual. System Documentation Release 2.0.2. X-Hive Corp., Rotterdam, The Netherlands. Retrieved from: http://www.x-hive.com/, 2002.

On the Updatability of XML Views Published over Relational Data Ling Wang and Elke A. Rundensteiner Department of Computer Science Worcester Polytechnic Institute Worcester, MA 01609 {lingw,rundenst}@cs.wpi.edu

Abstract. Updates over virtual XML views that wrap relational data are not well supported by current XML data management systems. This paper studies the problem of the existence of a correct relational update translation for a given view update. First, we propose a clean extended-source theory to decide whether a translation mapping is correct. Then, to answer the question of the existence of a correct mapping, we classify a view update as either un-translatable, conditionally translatable, or unconditionally translatable under a given update translation policy. We design a graph-based algorithm to classify a given update into one of these three categories based on schema knowledge extracted from the XML view and the relational base. This represents a practical approach that can be applied by any existing view update system, in industry or academia, to analyze the translatability of a given update statement before its translation is attempted.

1 Introduction

Typical XML management systems [5,9,14] support the creation of XML wrapping views and querying against these virtual views to bridge the gap between relational databases and XML applications. Update operations against such wrapper views, however, are not yet well supported. The problem of updating XML views published over relational data comes with new challenges beyond those of updating relational [1,7] or even object-oriented [3] views. The first is updatability: the mismatch between the hierarchical XML view model and the flat relational base model raises the question whether a given view update is even mappable into SQL updates. The second is the translation strategy: assuming the view update is indeed translatable, how to translate the XQuery update statements on the XML view into equivalent tuple-based SQL updates expressed on the relational base. Translation strategies have been explored to some degree in recent work. [11] presents an XQuery update grammar and studies the execution performance of translated updates. However, the assumption made in this work is that the given update is indeed translatable and that in fact it has already been translated into SQL updates over a relational database, which is assumed to be created by a


fixed inline loading strategy [8]. Commercial database systems such as SQL Server 2000 [10], Oracle [2] and DB2 [6] also provide system-specific solutions for restricted update types, again under the assumption that the given updates are always translatable. Our earlier work [12] studies XML view updatability for the "round-trip" case, which is characterized by a pair of invertible, lossless mappings for (1) loading the XML documents into the relational base, and (2) extracting an XML view identical to the original XML document back out of it. We prove that such XML views are always updatable by any update operation valid on the XML view. However, to the best of our knowledge, no result in the literature provides a general method to assess the updatability of an arbitrary XML view published over an existing relational database. This view updatability issue has been a long-standing, difficult problem even in the relational context. Using the concept of a "clean source", Dayal and Bernstein [7] characterize the schema conditions under which a relational view over a single table is updatable. Beyond this result, our current work analyzes the key factors affecting view updatability in the XML context. That is, given an update translation policy, we classify updates over an XML view as un-translatable, conditionally translatable, or unconditionally translatable. As we will show, this classification depends on several features of the XML view and the update statements, including: (a) the granularity of the update on the view side, (b) properties of the view construction, and (c) the types of duplication appearing in the view. By extending the concept of a "clean source" for relational databases [7] into a "clean extended source" for XML, we propose a theory for determining the existence of a correct relational update translation for a given XML view update. We also provide a graph-based algorithm to identify the conditions under which an XML view over a relational database is updatable. The algorithm depends only on the view and database schema knowledge instead of on the actual database content. It rejects un-translatable updates, requests additional conditions for conditionally translatable updates, and passes unconditionally translatable updates on to the later update translation step. The proof of correctness of our algorithm, which utilizes our clean extended-source theory, can be found in our technical report [13]. Section 2 analyzes the factors deciding XML view updatability, which is then formalized in Section 3. In Section 4 we propose the "clean extended-source" theory as the theoretical foundation of our solution. Section 5 describes our graph-based algorithm for detecting update translatability. Section 6 provides conclusions.

2 Factors for XML View Updatability

Using examples, we now illustrate which factors affect view updatability in general, and which features of XML specifically cause new view update translation issues. Recent XML systems [5,9,14] use a default XML view to define the one-to-one XML-to-relational mapping (Fig. 2).


Fig. 1. Relational database

Fig. 2. Default XML view of database shown in Figure 1

A view query (Fig. 3) is defined over it to express user-specific XML wrapper views. User updates over the virtual XML views are expressed in XQuery update syntax [11] (Fig. 4). We consider only insertions and deletions in our discussion; a replacement is treated as a deletion followed by an insertion and is not discussed separately.

2.1 Update Translation Policy

Clearly, the update translation policy chosen for the system is essential for deciding view updatability: an update may be translatable under one policy but not under another. We now enumerate common policies observed in the literature [3,11] and in practice [14].
Policies for update type selection. (1) Same type: the translated update must always have the same update type as the given view update. (2) Mixed type: translated updates with a different type are allowed.
Policies for maintaining referential integrity of the relational database under deletion. (1) Cascade: the directly translated relational updates cascade to update the referencing relations as well. (2) Restrict: the relational update is allowed only when no referencing tuples exist; otherwise, the view update is rejected. (3) Set Null: the relational update is performed as required, while the foreign key is set to NULL in each dangling tuple.
The translatability of a valid view update under a given policy can be classified as unconditionally translatable, conditionally translatable, or un-translatable. A view update is called un-translatable if it cannot be mapped into relational updates without violating some consistency. A view update is unconditionally translatable if such a translation always exists under the given policy. Otherwise, we call it conditionally translatable; that is, under the current update policy, the given update is not translatable unless additional conditions, such as assumptions or user communication, are introduced to make it translatable. When not stated otherwise, throughout the paper we use the most common policy, i.e. same update type and delete cascade. If a different translation policy is used, the discussion can easily be adjusted accordingly.
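For reference, the policy options and translatability classes can be encoded as follows (a sketch in our own notation, not the paper's).

from dataclasses import dataclass
from enum import Enum

class TypePolicy(Enum):
    SAME_TYPE = "same type"
    MIXED_TYPE = "mixed type"

class DeletePolicy(Enum):
    CASCADE = "cascade"
    RESTRICT = "restrict"
    SET_NULL = "set null"

class Translatability(Enum):
    UNCONDITIONAL = "unconditionally translatable"
    CONDITIONAL = "conditionally translatable"
    UNTRANSLATABLE = "un-translatable"

@dataclass(frozen=True)
class Policy:
    update_type: TypePolicy
    referential_integrity: DeletePolicy

# The default policy assumed in the rest of the paper:
DEFAULT_POLICY = Policy(TypePolicy.SAME_TYPE, DeletePolicy.CASCADE)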


Fig. 3. View V1 to V4 defined by XQuery Q1 to Q4 respectively

Also, we do not indicate the order of the translated relational updates; for a given execution strategy, the correct order can easily be decided [1,11,12].

2.2 New Challenges Arising from XML Data Model

Example 1 (View construction consistency). Assume two view updates (Fig. 4) that delete a "book_info" element from V1 and V2 in Fig. 3, respectively.


Fig. 4. Update operations on XML views defined in Fig. 3

Fig. 5. Translating the deletion on V1: (a) the user-expected updated view, (b) the translated update, (c) the updated relational database, (d) the regenerated view

(i) Fig. 5 shows that the update on V1 is unconditionally translatable. The translated relational update sequence in Fig. 5(b) deletes the first book from the "book" relation and its prices from the "price" relation. By reapplying the view query Q1 on the updated database in Fig. 5(c), the updated XML view in Fig. 5(d) equals the user-expected updated view in Fig. 5(a). (ii) Fig. 6 shows that the update on V2 is un-translatable. First, a relational update (Fig. 6(b)) is generated to delete the book (bookid=98001) from the "book" relation. Note the foreign key from the "price" relation to the "book" relation (Fig. 1); a second update operation is therefore generated by the update translator to keep the relational database consistent. The regenerated view in Fig. 6(d) differs from the user-expected updated view in Fig. 6(a), and no other translation that preserves consistency is available either. The existence of a correct translation is thus affected by the view construction consistency property, namely, whether the XML view hierarchy agrees with the hierarchical structure implied by the base relational schema.


Fig. 6. Translating the deletion on V2: (a) the user-expected updated view, (b) the translated update, (c) the updated relational database, (d) the regenerated view

Example 2 (Content duplication). Next we compare the two virtual XQuery views V1 and V3 in Fig. 3. The book (bookid=98003) with two prices is exposed twice in V3, but only once in V1. The corresponding update in Fig. 4 deletes the "book_info" element from amazon, while keeping the one from bookpool. Should we now delete the book tuple underneath? It is unclear. An additional condition, such as an extra translation rule like "no underlying tuple is deleted if it is still referenced by any other part of the view", could make the update translatable by keeping the book tuple untouched. This update is thus called conditionally translatable. This ambiguous content duplication is introduced by the XQuery "FOR" expression. The property could also arise in relational join views.
Example 3 (Structural duplication). Consider Q4 in Fig. 3, where each "bookid" is exposed twice in a single "book_info" element. The update in Fig. 4 that deletes the first price of the specified book is classified as conditionally translatable. Since the primary key "bookid" is touched by the update, we cannot decide whether to delete the book tuple underneath. With an additional condition, such as knowledge of the user's intention, the update becomes translatable. Structural duplication, as illustrated above, is special to XML view updating. While it also exists in the relational context, it would not cause any ambiguity there: the flat relational data model allows only tuple-based view insertion/deletion, so an update touches all, not just some, of the duplicates within a view tuple. Instead of always enforcing an update on the biggest view element "book_info", the flexible hierarchical structure of XML allows a "partial" update on subelements inside it; inconsistency between the duplicated parts can thus occur.
Example 4 (Update granularity). Compared with the failed translation in Example 1, a different update in Fig. 4 on the same view V2 is conditionally translatable: it deletes the whole "price_info" element instead of just the subelement "book_info". The translated relational update sequence is the same


as in Fig. 6(b), and the regenerated view is the same as what the user expects. Due to content duplication, this update is said to be conditionally translatable. The hierarchical structure of XML thus offers an opportunity for different update granularities, an issue that does not arise for relational views.

3 Formalizing the Problem of XML View Updatability

The structure of a relation is described by a relation schema, consisting of the name of the relation, its attribute set, and a set of constraints. A relation R is a finite subset of the product of all the attribute domains. A relational database, denoted as D, is a set of relations. A relational update operation is a deletion, insertion or replacement on a relation R. A sequence of relational update operations is modeled as a function mapping a database state to an updated database state.

An XML view V over a relational database D is defined by a view definition (an XQuery expression in our case). The domain of the view is denoted by dom(V). Let rel be a function that extracts the relations in D referenced by the view definition; see [13] for details. An XML view schema is extracted from both the view definition and the schemas of these relations. A valid view update (e.g., Fig. 4) is an insertion or deletion on V that satisfies all constraints in the view schema.
Definition 1. Given an update translation policy, let D be a relational database and V be a virtual view defined by a view definition. A relational update sequence is a correct translation of a valid view update iff (a) reapplying the view definition to the updated database yields exactly the updated view, and (b) if the view update does not change the view, then the relational update sequence does not change the database.
First, a correct translation means that the "rectangle" rule holds (Fig. 7); intuitively, it implies that the translated relational updates do not cause any view side effects. Second, if an update operation does not affect the view, then it should not affect the relational base either. This guarantees that any modification of the relational base is done solely for the sake of the view. Fig. 8 shows a typical partition of the view update domain. The XML view updatability problem classifies a valid view update as either unconditionally translatable, conditionally translatable, or un-translatable.
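Definition 1 can be paraphrased operationally as the following sketch (our own encoding; the toy database and view functions are hypothetical): a translation is correct when recomputing the view from the updated database reproduces the user-expected view, and a no-op on the view implies a no-op on the database.

def is_correct_translation(view_def, db, view_update, translated_updates):
    """view_def: database state -> view instance;
    view_update: view instance -> user-expected updated view;
    translated_updates: database state -> updated database state."""
    updated_view = view_update(view_def(db))      # what the user expects to see
    updated_db = translated_updates(db)           # effect of the candidate translation
    rectangle_rule = view_def(updated_db) == updated_view
    no_op_respected = (updated_view != view_def(db)) or (updated_db == db)
    return rectangle_rule and no_op_respected

# Toy illustration in the spirit of Example 1(ii): a cascading delete of book 98001
# also removes its price, so the regenerated view loses more than the user deleted.
db = {("book", 98001), ("price", 98001)}
view_def = lambda d: frozenset(d)                                       # identity "view"
delete_book = lambda v: frozenset(x for x in v if x != ("book", 98001))
translation = lambda d: {x for x in d if x[1] != 98001}                 # deletes book and price
print(is_correct_translation(view_def, db, delete_book, translation))   # False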


Fig. 7. Correct translation of view update to relational update

Fig. 8. The partition of the view update domain

4 Theoretical Foundation for XML View Updatability

Dayal and Bernstein [7] show that a correct translation exists in the case of a "clean source", when only considering functional dependencies inside a single relation. In the context of XML views, we now adopt and extend this work to also consider functional dependencies between relations.
Definition 2. Given a relational database D and an XML view V defined over several relations R_1, ..., R_n, let e be a view element of V and let g = {t_1, ..., t_k} be a generator of e, where each t_i is a tuple of one of the relations R_1, ..., R_n. Then each t_i is called a source tuple in D of e. Further, a tuple t is an extended source tuple in D of e iff t is a source tuple of e, or t references, via a foreign key, a tuple t_i that is an extended source tuple of e. The set of all extended source tuples of e is called an extended generator of e.
A source tuple is a relational row used to compute the view element. For instance, in V1 of Fig. 3, the first view element e_1 is the book_info element with bookid=98001. Let B and P denote the book and price relations, respectively; then the generator of e_1 is {t_b, t_p}, where t_b is the book tuple (98001, TCP/IP Illustrated) and t_p is the price tuple (98001, 63.70, www.amazon.com). Let the view element e_2 be the title of e_1. Then the source tuple of e_2 is t_b. Since bookid in P is a foreign key referencing B, we say that t_p is an extended source tuple of e_2 and {t_b, t_p} is an extended generator of e_2.


For example, let Y be the set of book_info view elements in V1 (Fig. 3), whose generators contain the tuples {(98001, TCP/IP Illustrated), (98001, 63.70, www.amazon.com)} and {(98003, Data on the Web), (98003, 56.00, www.amazon.com), (98003, 45.60, www.bookpool.com)}; that is, the set includes all the generators for the view elements in Y. Choose the subsets {(98001, TCP/IP Illustrated)} and {(98003, 56.00, www.amazon.com)}. Then {(98001, TCP/IP Illustrated), (98003, 56.00, www.amazon.com)} is a source of Y, and also an extended source of Y.

Definition 4. Let D be a relational database, let Y be part of a given XML view V, and let S be an extended source in D of Y. S is a clean extended source in D of Y iff there is no view element outside Y of which S is also an extended source; equivalently, S is a clean extended source in D of Y iff S is not an extended source in D of V − Y.

A clean extended source defines a source that is only referenced by the given view element itself. For instance, given the view element in V2 (Fig. 3) representing the book_info element (bookid = 98001), its extended source {(98001, TCP/IP Illustrated), (98001, 63.70, www.amazon.com)} is not a clean extended source since it is also an extended source of the price element. The clean extended source theory below captures the connection between clean extended sources and update translatability (proofs in [13]). It serves as a conservative solution for identifying the (unconditionally) translatable updates.

Theorem 1. Let u be the deletion of a set of view elements Y, and let T be a translation procedure. Then T correctly translates u to D iff T deletes a clean extended source of Y.

By Definition 1, a correct delete translation is one without any view side effect. This is exactly what deleting a clean extended source guarantees by Definition 4. Thus Theorem 1 follows.

Theorem 2. Let u be the insertion of a set of view elements Y into V, and let T be a translation procedure. Then T correctly translates u to D iff (i) T inserts a source tuple of each new view element in Y, and (ii) T does not insert a source tuple of any view element outside Y.

Theorem 2 indicates that a correct insert translation is one without any duplicate insertion (inserting a source of a view element that already exists) and without any extra insertion (inserting a source of some other view element). That is, it inserts a clean extended source for the new view element. Duplicate insertion is not allowed by BCNF, while extra insertion will cause a view side effect. For example, for the insertion in Fig. 4, consider one translation that inserts (98003, Data on the Web) into book together with (98003, 56.00, www.ebay.com) into price, and another that inserts only (98003, 56.00, www.ebay.com) into price. The former is not a correct translation, since it inserts a duplicate source tuple into book, while the latter is a correct translation.

5 Graph-Based Algorithm for Deciding View Updatability

We now propose a graph-based algorithm to identify the factors and their effects on update translatability, based on our clean extended source theory. We assume the relational database is in BCNF and that no cyclic dependency caused by integrity constraints among relations exists. Also, the predicate used in the view query expression is a conjunction of predicates, each of which is either a non-correlation predicate (e.g., $price/website = "www.amazon.com") or an equi-correlation predicate (e.g., $book/bookid = $price/bookid).

5.1 Graphic Representation of XML Views

Two graphs capture the update-related features in the view V and the relational base D. The view relationship graph is a forest representing the hierarchical and cardinality constraints in the XML view schema. An internal node, represented by a triangle, identifies a view element or attribute and is labeled by its name. A leaf node, represented by a small circle, is an atomic type, labeled by both its XPath binding and the name of its corresponding relational column. An edge indicates that one node is the parent of the other in the view hierarchy. Each edge is labeled by the cardinality relationship and condition (if any) between its end nodes. A label "?" means each parent node can have only one child, while the multi-occurrence label shows that multiple children are possible. Figures 9(a) to 9(d) depict the view relationship graphs for V1 to V4, respectively.

Definition 5. The hierarchy implied in the relational model is defined as follows: (1) given a relation schema R with an attribute a, R is called the parent of the attribute a; (2) given two relation schemas R and S with a foreign key constraint on S referencing R, R is the parent of S.

The view trace graph represents the hierarchical and cardinality constraints in the relational schema underlying the XML view. The set of leaf nodes of the view trace graph corresponds to the union of all leaves of the view relationship graphs. In particular, a leaf node labeled by the primary key attribute of a relation is called a key node (depicted by a black circle). An internal node, depicted by a triangle, is labeled by the relation name. Each edge means that one node is the parent of the other by Definition 5. An edge is labeled by its foreign key condition (if it is generated by rule (2) in Definition 5) and by the cardinality relationship between its end nodes. The view trace graphs of V1 to V4 are identical (Fig. 10), since they are all defined over the same attributes of the base relations.

The concept of closure in the two graphs is used to represent the "effect" of an update on the view and on the relational database, respectively. Intuitively, their relationship indicates the updatability of the given view. The closure of a node is defined as follows: (1) if the node is a leaf node, its closure is the node itself; (2) otherwise, its closure is the union of its children's closures, grouped by their hierarchical relationship and marked by their cardinality (for simplicity, the cardinality mark is not shown when it is "?").

Fig. 9. View relationship graphs of V1 to V4, as shown by (a) to (d)

Fig. 10. View trace graph of V1 – V4

Figure 9(a) illustrates such closures for V1. The closure of a node in the view trace graph is defined in the same manner as in the view relationship graph, except for leaf nodes: each leaf node has the same closure as its parent node (Fig. 10 gives examples). This closure definition in the view trace graph is based on the pre-selected update policy of Section 2.1. If a different policy were used, then the definition would need to be adjusted accordingly. For example, if we pick the mixed type, the closure definition becomes "only the key node has the same closure as its parent node, while any other leaf node has itself as its closure". The delete on such non-key leaf nodes can then be translated as a replacement of the corresponding relational column. To shorten the closure notation, the group mark "()" can be eliminated if its cardinality mark is "?", as in Figure 9(c). The closure of a set of nodes N is defined as the combination of the closures of its members under a "union-like" operation


that combines not only the nodes but also their shared occurrences (Fig. 10 gives an example). Two leaf nodes, whether in the view relationship graph or in the view trace graph, are equal if and only if the relational attribute labels in their respective node labels are the same.

5.2 A Graph-Based Algorithm for View Updatability Identification

Definition 6. Two closures match iff their node sets are equal. Further, two closures are equal iff, in addition, their node groups, the cardinality marks of each group, and the conditions on each edge are all the same.

For two closures to match means that the same view schema nodes are included, while equality indicates that the same instances of XML view elements will be included. For example, the closure of the root in Fig. 9(c) and the corresponding closure in Fig. 10 match: both closures include the same XML view schema nodes book.bookid, book.title, price.amount, and price.website. In Figure 9(a), however, the closure of the root and the corresponding closure in Figure 10 are equal, namely {book.bookid, book.title, (price.amount, price.website)} with the multi-occurrence mark on the price group. This is because their group partitions (marked by "()"), the cardinality mark of each group, and the conditions on each edge are all the same; both closures touch exactly the same XML view-element instances.

Theorem 3. Let V be a view defined over a relational database D, with view relationship graph and view trace graph as above. Let g1 and g2 be generators of view elements e1 and e2, respectively. Then g1 = g2 implies e1 = e2 iff the respective closures of the view schema nodes of e1 and e2 in the view relationship graph and the view trace graph are equal.

Theorem 3 indicates that two equal generators always produce identical view elements iff the respective closures of the view schema nodes in the two graphs are equal. Theorem 3 now enables us to produce an algorithm for detecting the clean extended sources of a view element based on schema knowledge captured in the two graphs.

Theorem 4. Let V, D, Y be defined as in Theorem 3. Given a view element in Y, there is a clean extended source of this element in D iff the closure of its schema node in the view relationship graph has an equal closure in the view trace graph.

Theorem 4 indicates that a given view element has a clean extended source iff the closure of its schema node in the view relationship graph has an equal closure in the view trace graph. As indicated by Theorems 1 and 2, the existence of a clean extended source for a given XML view element implies that the update touching this element is unconditionally translatable. The following observation thus serves as a general methodology for view updatability determination.

Observation 1. Let D, V, Y be defined as in Theorem 3. (1) Updates that touch Y are unconditionally translatable iff the closure of Y in the view relationship graph has an equal closure in the view trace graph. (2) Updates that touch Y are conditionally translatable iff the closure of Y in the view relationship graph has a matching, but not equal, closure in the view trace graph. (3) Otherwise, updates on Y are un-translatable.


However, searching all node closures in the view trace graph to find one equal to the closure of a given view element is expensive. According to the generation rules of the two graphs, the nodes in the closure of a view element also serve as leaf nodes in the view trace graph. We thus propose to start searching from the leaf nodes within the closure, thereby reducing the search space. Observation 2 uses the following definition to determine the translatability of a given view update.

Definition 7. Let n be a node in the view relationship graph with closure C, and let C' be the closure in the view trace graph of the leaf nodes of C. We say n is a clean node iff C and C' are equal, a consistent node iff C and C' match, and an inconsistent node otherwise.

For a node to be inconsistent means that the effect of an update on the view (the node closure in the view relationship graph) is different from the effect on the relational side (the node closure in the view trace graph) under the selected policy, which fixes the closure definition in the view trace graph. Such an update is thus un-translatable. A clean node is guaranteed to be safely updatable without any view side effects. A dirty consistent node, however, needs an additional condition to be updatable. For example, Fig. 9(a) contains a clean node, while Fig. 9(b) contains both an inconsistent node and a dirty consistent node.

Observation 2. An update on a clean node is unconditionally translatable, on a consistent node it is conditionally translatable, while on an inconsistent node it is un-translatable.

Algorithm 1 shows our optimized update translatability checking algorithm based on Observation 2. It first identifies the node being deleted or inserted. Then, using Definition 7, the procedure classifyNode determines the type of the node to be updated. Thereafter, the given view update can be classified as un-translatable, conditionally translatable, or unconditionally translatable by Observation 2. Using this optimized update translatability checking algorithm, a concrete case study on the translatability of deletes and inserts is also provided in [13].

6 Conclusion

In this paper, we have identified the factors determining view updatability in general and in the context of XQuery views in particular. The extended clean-source theory for determining translation correctness is presented. A graph-based algorithm has also been presented to identify the conditions under which a correct translation of a given view update exists. Our solution is general. It could be used by an update translation system such as [4] to identify translatable updates before translation is attempted. This way we would guarantee that only a "well-behaved" view update is passed down to the next translation step. [4] assumes the view is always well-formed, that is, joins are through keys and foreign keys, and nesting is controlled to agree with the integrity constraints and to avoid duplication. The update over such a view is thus always translatable. Our work is orthogonal to this work by addressing new challenges related to deciding whether a translation exists when conflicts are possible, that is, when a view cannot be guaranteed to be well-formed (as assumed in this prior work). Our view updatability checking solution is based on schema reasoning, and thus utilizes only knowledge of the view and database schemas and constraints. Note that the translated updates might still conflict with the actual base data. For example, an update inserting a book (bookid = 98002) into V1 is said to be unconditionally translatable by our schema check procedure, while conflicts with the base data in Fig. 1 may still arise. Depending on the selected update translation policy, the translated update can then be either rejected or executed by replacing the existing tuple with the newly inserted tuple. This run-time updatability issue can only be resolved at execution time by examining the actual data in the database.

References 1. A. M. Keller. The Role of Semantics in Translating View Updates. IEEE Transactions on Computers, 19(1):63–73, 1986. 2. S. Banerjee, V. Krishnamurthy, M. Krishnaprasad, and R. Murthy. Oracle8i - The XML Enabled Data Management System. In ICDE, pages 561–568, 2000. 3. T. Barsalou, N. Siambela, A. M. Keller, and G. Wiederhold. Updating Relational Databases through Object-Based Views. In SIGMOD, pages 248–257, 1991. 4. V. P. Braganholo, S. B. Davidson, and C. A. Heuser. On the Updatability of XML Views over Relational Databases. In WEBDB, pages 31–36, 2003. 5. M. J. Carey, J. Kiernan, J.Shanmugasundaram, E. J. Shekita, and S. N. Subramanian. XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents. In The VLDB Journal, pages 646–648, 2000. 6. J. M. Cheng and J. Xu. XML and DB2. In ICDE, pages 569–573, 2000. 7. U. Dayal and P. A. Bernstein. On the Correct Translation of Update Operations on Relational Views. In ACM Transactions on Database Systems, volume 7(3), pages 381–416, Sept 1982. 8. J. Shanmugasundaram et al. Relational Databases for Querying XML Documents: Limitations and Opportunities. In VLDB, pages 302–314, September 1999.


9. M. Fernandez et al. SilkRoute: A Framework for Publishing Relational Data in XML. ACM Transactions on Database Systems, 27(4):438–493, 2002. 10. M. Rys. Bringing the Internet to Your Database: Using SQL Server 2000 and XML to Build Loosely-Coupled Systems. In VLDB, pages 465–472, 2001. 11. I. Tatarinov, Z. G. Ives, A. Y. Halevy, and D. S. Weld. Updating XML. In SIGMOD, pages 413–424, May 2001. 12. L. Wang, M. Mulchandani, and E. A. Rundensteiner. Updating XQuery Views Published over Relational Data: A Round-trip Case Study. In XML Database Symposium (VLDB Workshop), pages 223–237, 2003. 13. L. Wang and E. A. Rundensteiner. Updating XML Views Published Over Relational Databases: Towards the Existence of a Correct Update Mapping. Technical Report WPI-CS-TR-04-19, Computer Science Department, WPI, 2004. 14. X. Zhang, K. Dimitrova, L. Wang, M. EL-Sayed, B. Murphy, L. Ding, and E. A. Rundensteiner. RainbowII: Multi-XQuery Optimization Using Materialized XML Views. In Demo Session Proceedings of SIGMOD, page 671, 2003.

XBiT: An XML-Based Bitemporal Data Model Fusheng Wang and Carlo Zaniolo Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA {wangfsh,zaniolo}@cs.ucla.edu

Abstract. Past research work on modeling and managing temporal information has, so far, failed to elicit support in commercial database systems. The increasing popularity of XML offers a unique opportunity to change this situation, inasmuch as XML and XQuery support temporal information much better than relational tables and SQL. This is the important conclusion claimed in this paper where we show that valid-time, transaction-time, and bitemporal databases can be naturally viewed in XML using temporally-grouped data models. Then, we show that complex historical queries, that would be very difficult to express in SQL on relational tables, can now be easily expressed in standard XQuery on such XML-based representations. We first discuss the management of transaction-time and valid-time histories and then extend our approach to bitemporal histories. The approach can be generalized naturally to support the temporal management of arbitrary XML documents and queries on their version history.

1 Introduction

While users' demand for temporal database applications is only increasing with time [1], database vendors are not moving forward in supporting the management and querying of temporal information. Given the remarkable research efforts that have been spent on these problems [2], the lack of viable solutions must be attributed, at least in part, to the technical difficulties of introducing temporal extensions into the relational data model and query languages. In the meantime, database researchers, vendors and SQL standardization groups are working feverishly to extend SQL with XML publishing capabilities [4] and to support languages such as XQuery [5] on the XML-published views of the relational database [6]. In this context, XML and XQuery can respectively be viewed as a new powerful data model and query language, thus inviting the natural question of whether they can provide a better basis for representing and querying temporal database information. In this paper, we answer this critical question by showing that transaction-time, valid-time and bitemporal database histories can be effectively represented in XML and queried using XQuery without requiring extensions of current standards. This breakthrough over the relational data model and query languages is made possible by (i) the ability of XML to support a temporally grouped model, which is long recognized as natural and expressive [7,8] but could not be implemented well in the flat structure of the relational data model [9], and (ii) the greater expressive power and native extensibility of XQuery (which is Turing-complete [10]) over SQL. Furthermore, these benefits are not restricted to XML-published databases; indeed these temporal representations and queries can be naturally extended to arbitrary XML documents and used, e.g., to support temporal extensions for database systems featuring native support for XML and XQuery, and to preserve the version history of XML documents in archives [11] and web warehouses [12].

In this paper, we build on and extend techniques described in previous papers. In particular, support for transaction time was discussed in [13], and techniques for managing document versions were discussed in [12]. However, the focus of this paper is supporting valid-time and bitemporal databases, which pose new complexity and were not discussed in those papers.

The paper is organized as follows. After a discussion of related work in the next section, we study an example of temporal relations modeled with a temporal ER model. In Section 4 we show that the valid-time history of a relational database can be represented as XML and queried with XQuery. Section 5 briefly reviews how to model transaction-time history with XML. In Section 6, we focus on an XML-based bitemporal data model to represent the bitemporal history of a relational database, and show that complex bitemporal queries can be expressed with XQuery based on this model and that database updates can also be supported. Section 7 concludes the paper.

2 Related Work

Temporal ER Modeling. There has been much interesting work on ER-based temporal modeling of information systems at the conceptual level. For instance, ER models have been supported in commercial products for database schema designs, and more than 10 temporal enhanced ER models have been proposed in the research community [14]. As discussed in the survey by Gregersen and Jensen [14], there are two major approaches of extensions to ER model for temporal support, devising new notational shorthands, or altering the semantics of the current ER model constructs. The recent TIMEER model [15] is based on an ontological foundation and supports an array of properties. Among the temporal ER models, the Temporal EER Model (TEER) [16] extends the temporal semantics into the existing EER modeling constructs. Temporal Databases. A body of previous work on temporal data models and query languages include [17–20]; thus the design space for the relational data model has been exhaustively explored [2, 21]. Clifford et al. [9] classified them as two main categories: temporally ungrouped and temporally grouped data models. Temporally grouped data model is also referred to as non-first-normalform model or attribute time stamping, in which the domain of each attribute is extended to include the temporal dimension [8], e.g., Gadia’s temporal data


model [22]. It is shown that the temporally grouped representation has more expressive power and is more natural since it is history-oriented [9]. TSQL2 [23] tries to reconcile the two approaches [9] within the severe limitations of the relational tables. Our approach is based on a temporally grouped data model, which dovetails perfectly with the hierarchical structure of XML documents. The lack of temporal support in commercial DBMS can be attributed to the limitations of SQL, the engineering complexity, and the difficulty to implement it incrementally [24]. Publishing Relational Databases in XML. There is much current interest in publishing relational databases in XML. A middleware-based approach is used in SilkRoute [25] and XPERANTO [6]. For instance, XPERANTO can build a default view on the whole relational database, and new XML views and queries upon XML views can then be defined using XQuery. XQuery statements are then translated into SQL and executed on the RDBMS engine. SQL/XML is emerging as a new SQL standard supported by several DBMS vendors [4, 26], to extend RDBMS with XML support. Time in XML. Some interesting research work has recently focused on the problem of representing historical information in XML. In [27] an annotationbased object model is proposed to manage historical semistructured data, and a special Chorel language is used to query changes. In [28] a new markup tag for XML/HTML documents is proposed to support valid time on the Web, thus temporal visualization can be implemented on web browsers with XSL. In [29], a dimension-based method is proposed to manage changes in XML documents, however how to support queries is not discussed. In [30], a data model is proposed for temporal XML documents. However, since a valid interval is represented as a mixed string, queries have to be supported by extending DOM APIs or XPath. Similarly, in [31, 32], extensions of XPath is needed to support temporal semantics. (In our approach, we instead support XPath/XQuery without any extension to XML data models or query languages.) A language is proposed in [33] to extend XQuery for temporal support, which has to provide new constructs for the language. An archiving technique for scientific data using XML was presented in [34], but the issue of temporal queries was not discussed. Both the schema proposed in [34] and our schema are generalizations of SCCS [35].

Fig. 1. TEER Schema of Employees and Departments (with Time Semantics Added)

3 An Example

The Temporal EER Model (TEER) [16] extends the temporal semantics into the existing EER modeling constructs, and works for both valid time and transaction time. TEER model associates each entity with a lifespan, and an attribute’s value history is grouped together, and assigned with a temporal element (a union of valid temporal spans). Each relationship instance is also associated with a temporal element to represent the lifespan. This temporal ER model is believed by the authors to be more natural to manage temporal aspects of data than in a tuple-oriented relational data model [16]. Suppose that we have two relations employees and departments, and each employee has a name, title, salary, and dept (name is the key), and each dept has a name and manager (name is the key). To model the history of the two relations, we use a TEER diagram as shown in Figure 1. (For simplicity, only valid time is considered, and transaction time can be modeled in a similar way.) Figure 1 looks exactly like a normal ER diagram except that the time semantics is added. In this schema, the entity employee (or e ) will have the following temporal attribute values:

Here each attribute value is associated with a valid time lifespan. surrogate is a system-defined identifier, which can be ignored if the key doesn’t change. The following is the list of temporal attribute values of entity dept (or d) :

Similarly, for the instance rb of the relationship belongs_to between employee 'Bob' and dept 'RD', the lifespan is T(rb)=[1995-01-01,now], and for the instance rm of the relationship manages between employee 'Mike' and dept 'RD', the lifespan is T(rm)=[1999-01-01,now]. In the next section, we show that such a temporal ER model can be supported well with XML.

4 Valid Time History in XML

While transaction time identifies when data was recorded in the database, valid time concerns when a fact was true in reality. One major difference is that while transaction time is append-only and cannot be updated, valid time can


Fig. 2. Valid Time History of Employees

Fig. 3. XML Representation of the Valid-time History of Employees(VH-document)
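The VH-document of Fig. 3 is not reproduced in this excerpt. The following fragment is only an illustrative sketch of what such a document could look like, using the element names discussed in the text (employee, name, title, salary, dept) and the vstart/vend attributes described below; the salary figures and interval boundaries are invented for illustration:

<employees>
  <employee vstart="1995-01-01" vend="now">
    <name vstart="1995-01-01" vend="now">Bob</name>
    <title vstart="1995-01-01" vend="1997-12-31">Engineer</title>
    <title vstart="1998-01-01" vend="now">Sr Engineer</title>
    <salary vstart="1995-01-01" vend="1997-12-31">45000</salary>
    <salary vstart="1998-01-01" vend="now">52000</salary>
    <dept vstart="1995-01-01" vend="now">RD</dept>
  </employee>
</employees>

Being an ordinary XML document (and a valid XQuery element constructor), such a fragment can be constructed and queried with plain XQuery, which is what Section 4.1 exploits.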

be updated by users. We show that, with XML, we can model the valid time history naturally. Figure 2 shows a valid time history of employees, where each tuple is timestamped with a valid time interval. This representation assumes valid time homogeneity, and is temporally ungrouped [9]. It has several drawbacks: first, redundancy information is preserved between tuples, e.g., Bob’s department appeared the same but was stored in all the tuples; second, temporal queries need to frequently coalesce tuples, which is a source of complications in temporal query languages. These problems can be overcome using a representation where the timestamped history of each attribute is grouped under the attribute [9]. This produces a hierarchical organization that can be naturally represented by the hierarchical XML view shown in Figure 3 (VH-document). Observe that every element is timestamped using two XML attributes vstart and vend. In the VH-document, each element is timestamped with an inclusive valid time interval (vstart, vend), vend can be set to now to denote the ever-increasing current date, which is internally represented as “9999-12-31” (Section 4.2). Please note that an entity (e.g., employee ‘Bob’) always has a longer or equal lifespan than its children, thus there is a valid time covering constraint that the valid time interval of a parent node always covers that of its child nodes, which is preserved in the update process(Section 4.3). Unlike the relational data model that is almost invariably depicted via tables, XML is not directly associated with a graphical representation. This creates the challenge and the opportunity of devising the graphical representation most conducive for the application at hand—and implementing it using standard XML


Fig. 4. Temporally Grouped Valid Time History of Employees

tools such as XSL [36]. Figure 4 shows a representation of temporally grouped tables that we found effective as user interface (and even more so after contrasting colored backgrounds and other browser-supported embellishments).

4.1 Valid Time Temporal Queries

The data shown in Figure 4 is the actual data stored in the database—with the exception of the special “now” symbol discussed later. Thus a powerful query language such as XQuery can be directly applied to this data model. In terms of data types, XML and XQuery support an adequate set of built-in temporal types, including datetime, date, time, and duration [5]; they also provide a complete set of comparison and casting functions for duration, date and time values, making snapshot and period-based queries convenient to express in XQuery. Furthermore, whenever more complex temporal functions are needed, they can be defined using XQuery functions that provide a native extensibility mechanism for the language. Next we show that we can specify temporal queries with XQuery on the VH-document, such as temporal projection, snapshot queries, temporal slicing, temporal joins, etc. QUERY V1: Temporal projection: retrieve the history of departments where Bob was employed:
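The query listings of this section are not reproduced in this excerpt; the sketches given after each query statement are only illustrative reconstructions, assuming the VH-document structure sketched above, the file names emps.xml and depts.xml used in the text, and the vstart()/vend() helpers of Section 4.2 being in scope. For Query V1, such a formulation might be:

for $e in doc("emps.xml")//employee[name = "Bob"]
return <dept_history name="Bob">{ $e/dept }</dept_history>

Because the department history is already grouped under the employee element, the temporal projection amounts to copying the timestamped dept children.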

QUERY V2: Snapshot: retrieve the managers of each department on 1999-05-01:
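A hedged sketch of Query V2 (not the original listing), assuming depts.xml groups each dept's name and manager histories:

for $d in doc("depts.xml")//dept
for $m in $d/manager
where vstart($m) <= xs:date("1999-05-01") and vend($m) >= xs:date("1999-05-01")
return <manager dept="{ $d/name[1] }">{ $m/text() }</manager>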

Here depts.xml is the VH-document that includes the history of dept names and managers. vstart() and vend() are user-defined functions (expressed in XQuery) that return the starting date and ending date of an element's valid time, respectively; thus the implementation is transparent to users. QUERY V3: Continuous Period: find employees who worked as a manager for more than 5 consecutive years (i.e., 1826 days):
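Again only a sketch (the original listing is not shown); it relies on the fact that subtracting two xs:date values yields an xs:dayTimeDuration, which can be compared against the "P1826D" constant discussed next:

for $d in doc("depts.xml")//dept
for $m in $d/manager
where vend($m) - vstart($m) > xs:dayTimeDuration("P1826D")
return <long_term_manager>{ $m/text() }</long_term_manager>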

Here “P1826D” is a duration constant of 1826 days in XQuery. QUERY V4: Temporal Join: find employees who were making the same salaries on 2001-04-01:
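A sketch of Query V4 (illustrative only); it pairs distinct employees and checks that they have salary versions with equal values overlapping the given date:

for $e1 in doc("emps.xml")//employee
for $e2 in doc("emps.xml")//employee
for $s1 in $e1/salary
for $s2 in $e2/salary
where $e1/name[1] < $e2/name[1]
  and number($s1) = number($s2)
  and vstart($s1) <= xs:date("2001-04-01") and vend($s1) >= xs:date("2001-04-01")
  and vstart($s2) <= xs:date("2001-04-01") and vend($s2) >= xs:date("2001-04-01")
return <same_salary>{ $e1/name[1]/text() }, { $e2/name[1]/text() }</same_salary>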

This query will join emps.xml with itself. It is also easy to support the since and until connectives of first-order temporal logic [18], for example: QUERY V5: A Until B: find the employee who was hired and worked in dept "RD" until Bob was appointed as the manager of the dept:
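A sketch of the until-style Query V5 (illustrative; the original formulation is not shown). It checks that an employee's RD affiliation starts at the hiring date and lasts at least until Bob's appointment as RD manager:

let $appointed := vstart((doc("depts.xml")//dept[name = "RD"]/manager[. = "Bob"])[1])
for $e in doc("emps.xml")//employee
for $d in $e/dept[. = "RD"]
where vstart($d) = vstart($e)                              (: in RD since being hired :)
  and vstart($d) <= $appointed and vend($d) >= $appointed  (: ... until Bob's appointment :)
return $e/name[1]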

4.2 Temporal Operators

In the temporal queries, we used functions such as vstart and vend to shield users from the implementation details of how time is represented. The predefined functions include: timestamp referencing functions, such as vstart and vend; interval comparison functions, such as voverlaps, vprecedes, vcontains, vequals, vmeets, and voverlapinterval; and duration and date/time functions, such as vtimespan and vinterval. For example, vcontains is defined as follows:
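The original function listing is not reproduced here. A minimal sketch of how such an interval-containment predicate could be written in XQuery, assuming elements carry the vstart/vend attributes of the VH-document (the function name is the paper's; the exact signature and the local: prefix are assumptions made for self-containment):

declare function local:vcontains($outer as element(), $inner as element()) as xs:boolean
{
  (: 'now'/'UC' are stored internally as the end-of-time date 9999-12-31, so the casts succeed :)
  xs:date($outer/@vstart) <= xs:date($inner/@vstart)
    and xs:date($outer/@vend) >= xs:date($inner/@vend)
};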


Internally, we use "end-of-time" values to denote the 'now' and 'UC' symbols; for instance, for dates we use "9999-12-31". The user does not access this value directly, but accesses it through built-in functions. For instance, to refer to the ending valid time of a node s, the user uses the function vend(s), which returns s's end if this is different from "9999-12-31", and CURRENT_DATE otherwise. The nodes returned in the output normally use the "9999-12-31" representation used for internal data. However, for data returned to the end-user, two different representations are preferable. One is to return CURRENT_DATE by applying the function rvend(), which recursively replaces all occurrences of "9999-12-31" with the value of CURRENT_DATE. The other is to return a special string, such as now, to be displayed on the end-user screen. These valid-time queries are similar to those over the transaction-time history discussed in [13]. However, unlike transaction-time databases, valid-time databases must also support explicit updates. This was not discussed in [13] and is addressed next.

4.3 Database Modifications

An update task force is currently working on defining standard update constructs for XQuery [37]; moreover, update constructs are already supported in several native XML databases [38]. Our approach to temporal updates consists in supporting the operations of insert, delete, and update via user-defined functions. This approach will preserve the validity of end-user programs in the face of differences between vendors and evolving standards. It also shields the end-users from the complexity of the additional operations required by temporal updates, such as the coalescing of periods, and the propagation of updates to enforce the covering constraints. INSERT. When a new entity is inserted, the new employee element with its children elements is appended in the VH-Document; the vstart attributes are set to the valid starting timestamp, and vend are set to now. Insertion can be done through the user-defined function vinsert($path,$newelement).The new element can be created using the function VNewElement($valueset,$vstart, $vend). For example, the following query inserts Mike as an engineer into RD dept with salary 50K, starting immediately:
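The insertion query itself is not reproduced in this excerpt. A hedged sketch of what such a call could look like, using the vinsert and VNewElement functions named above (the argument conventions, the target path, and the value encoding are assumptions made for illustration):

vinsert("/employees",
        VNewElement(<employee>
                      <name>Mike</name>
                      <title>engineer</title>
                      <salary>50000</salary>
                      <dept>RD</dept>
                    </employee>,
                    current-date(),
                    xs:date("9999-12-31")))

Here the new element is built first, VNewElement stamps it and its children with the valid-time interval, and vinsert appends it to the VH-document at the given path.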

DELETE. There are two types of deletion: deletion without valid time and deletion with valid time. The former assumes a default valid time interval: (current_date, forever), and can be implemented with the user defined function VNodeDelete($path). For deletion with a valid time interval v on node e, there can be three mutually exclusive cases: (i) e is removed if its valid time interval


is contained in v, (ii) the valid time interval of e is extended if the two intervals overlap, but do not contain each other, or (iii) e’s interval is split if it properly contains v. Deletions on a node are then propagated downward to its children to satisfy the covering constraint. Node deletion (with downward propagation) is supported by the function VTimeDelete($path, $vstart, $vend). UPDATE. Updates can be on values or valid time, and coalescing is needed. There are two functions defined: VNodeReplace($path,$newValue), and VTime Replace($path, $vstart, $vend). For value update, propagation is not needed; for valid time update, it is needed to downward update the node’s children’s valid time. If a valid time update on a child node violates the valid time covering constraint, then the update will fail.

5 Viewing Transaction Time History as XML

In [13] we have proposed an approach to represent the transaction-time history of relational databases in XML using a temporally grouped data model. This approach is very effective at supporting complex temporal queries using XQuery [5], without requiring changes in this standard query language. In [13] we used these features to show that the XML-viewed transaction time history (TH-document) can be easily generated from the evolving history of the databases, and implemented by either using native XML databases or, after decomposition into binary relations, by relational databases enhanced with tools such as SQL/XML [4]. We also showed that XQuery without modifications can be used as an effective language for expressing temporal queries. A key issue not addressed in [13] was whether this approach, and its unique practical benefits of only requiring off-the-shelf tools, can be extended to support bitemporal databases. With two dimensions of time, bitemporal databases have much more complexity, e.g., coalescing on two dimensions, explicit update complexity, and support of more complex bitemporal queries. In the next section, we explore how to support a bitemporal data model based on XML.

6 An XML-Based Bitemporal Data Model

6.1 The XBiT Data Model

In practice, temporal applications often involve both transaction time and valid time. We show next that, with XML, we can naturally represent a temporally grouped data model, and provide support for complex bitemporal queries. Bitemporal Grouping. Figure 5 shows a bitemporal history of employees, using a temporally ungrouped representation. Although valid time and transaction time are generally independent, for the sake of illustration, we assume here that employees’ promotions are scheduled and entered in the database four months before they occur. XBiT supports a temporally grouped representation by coalescing attributes’ histories on both transaction time and valid time. Temporal coalescing on two


Fig. 5. Bitemporal History of Employees

temporal dimensions is different from coalescing on just one. On one dimension, coalescing is done when: i) two successive tuples are value equivalent, and ii) the intervals overlap or meet. The two intervals are then merged into maximal intervals. For bitemporal histories, coalescing is done when two tuples are value-equivalent and (i) their valid time intervals are the same and the transaction time intervals meet or overlap; or (ii) the transaction time intervals are the same and the valid time intervals meet or overlap. This operation is repeated until no tuples satisfy these conditions. For example, in Figure 5, to group the history of titles with value 'Sr Engineer' in the last three tuples, i.e., (title, valid_time, transaction_time), the last two transaction time intervals are the same, so they are coalesced as (Sr Engineer, 1998-01-01:now, 1999-09-01:UC). This one again has the same valid time interval as the previous one, (Sr Engineer, 1998-01-01:now, 1997-09-01:1999-08-31), thus finally they are coalesced as (Sr Engineer, 1998-01-01:now, 1997-09-01:UC), as shown in Figure 7.

Data Modeling of Bitemporal History with XML. With temporal grouping, the bitemporal history is represented in XBiT as an XML document (BH-document). This is shown in the example of Figure 6, which is snapshot-equivalent to the example of Figure 5. Each employee entity is represented as an employee element in the BH-document, and table attributes are represented as the employee element's child elements. Each element in the BH-document is assigned two pairs of attributes: tstart and tend to represent the inclusive transaction time interval, and vstart and vend to represent the inclusive valid time interval. Elements corresponding to a table attribute value history are ordered by the starting transaction time tstart. The value of tend can be set to UC (until changed), and vend can be set to now. There is a covering constraint whereby the transaction time interval of a parent node must always cover that of its child nodes, and likewise for valid time intervals. Figure 7 displays the resulting temporally grouped representation, which is appealing to intuition, and also effective at supporting natural language interfaces, as shown by Clifford [7].


Fig. 6. XML Representation of the Bitemporal History of Employees(BH-document)

6.2 Bitemporal Queries with XQuery

The XBiT-based representation can also support powerful temporal queries, expressed in XQuery without requiring the introduction of new constructs in the language. We next show how to express bitemporal queries on employees. QUERY B1: Temporal projection: retrieve the bitemporal salary history of employee “Bob”:
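As with the valid-time queries, the original listings are not reproduced; the sketches below are only illustrative, assuming the BH-document element names of Figure 6 and the tstart()/tend()/vstart()/vend() helpers explained after Query B2. For Query B1:

for $e in doc("emps.xml")//employee[name = "Bob"]
return <salary_history name="Bob">{ $e/salary }</salary_history>

The copied salary elements carry both their (tstart, tend) and (vstart, vend) attributes, i.e., the full bitemporal history.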

This query is exactly the same as query V1, except that it retrieves both transaction time and valid time history of salaries. QUERY B2: Snapshot: according to what was known on 1999-05-01, what was the average salary at that time?
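A sketch of Query B2 (illustrative only), filtering on both time dimensions:

avg(
  for $s in doc("emps.xml")//employee/salary
  where tstart($s) <= xs:date("1999-05-01") and tend($s) >= xs:date("1999-05-01")
    and vstart($s) <= xs:date("1999-05-01") and vend($s) >= xs:date("1999-05-01")
  return number($s)
)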

Here tstart(), tend(), vstart() and vend() are user-defined functions that return the starting date and ending date of an element's transaction time and valid time, respectively. QUERY B3: Diff queries: retrieve employees whose salaries (according to our current information) did not change between 1999-01-01 and 2000-01-01:
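A sketch only (the original listing is not shown); "current information" is approximated by requiring the salary version to be current, i.e., its ending transaction time has not yet been changed:

for $e in doc("emps.xml")//employee
where some $s in $e/salary satisfies
      ( tend($s) >= current-date()                 (: current knowledge (UC) :)
        and vstart($s) <= xs:date("1999-01-01")
        and vend($s) >= xs:date("2000-01-01") )
return $e/name[1]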


Fig. 7. Temporally Grouped Bitemporal History of Employees

This query will take a transaction time snapshot and a valid time slicing of salaries. QUERY B4: Change Detection: find all the updates of employee salaries that were applied retroactively.
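A sketch of Query B4 (illustrative): here a retroactive update is taken to be one recorded after the start of its validity, i.e., its transaction-time start falls after its valid-time start:

for $s in doc("emps.xml")//employee/salary
where tstart($s) > vstart($s)
return $s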

QUERY B5: find the manager for each current employee, as best known now:
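A sketch of Query B5 (illustrative), taking the current snapshot on both dimensions and joining employees with the dept histories in depts.xml:

for $e in doc("emps.xml")//employee
for $d in $e/dept
for $m in doc("depts.xml")//dept[name = $d]/manager
where tend($e) >= current-date()
  and tend($d) >= current-date() and vstart($d) <= current-date() and vend($d) >= current-date()
  and tend($m) >= current-date() and vstart($m) <= current-date() and vend($m) >= current-date()
return <result employee="{ $e/name[1] }" manager="{ $m/text() }"/>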

This query will take the current snapshot on both transaction time and valid time.

6.3 Database Modifications

For valid time databases, both attribute values and attribute valid time can be updated by users, and XBiT must perform some implicit coalescing to support the update process. Note that only elements that are current (ending transaction time as UC) can be modified. A modification combines two processes: explicit modification of valid time and values, and implicit modification of transaction time. Modifications of Transaction Time Databases. Transaction time modifications can also be classified as three types: insert, delete, and update.


INSERT. When a new tuple is inserted, the corresponding new element (e.g., employee 'Bob') and its child elements in the BH-document are timestamped with the starting transaction time as the current date and the ending transaction time as UC. The user-defined function TInsert($node) will insert the node with the transaction time interval (current date, UC).

DELETE. When a tuple is removed, the ending transaction time of the corresponding element and its current children is changed to the current time. This can be done by the function TDelete($node).

UPDATE. Update can be seen as a delete followed by an insert.

Database Modifications in XBiT. Modifications in XBiT can be seen as the combination of modifications on the valid time and transaction time history. XBiT will automatically coalesce on both valid time and transaction time.

INSERT. Insertion is similar to valid time database insertion except that the added element is timestamped with the transaction time interval (current date, UC). This can be done by the function BInsert($path, $newelement), which combines VInsert and TInsert.

DELETE. Deletion is similar to valid time database deletion, except that the function TDelete is called to change tend of the deleted element and its current children to the current date. Node deletion is done through the function BNodeDelete($path), and valid time deletion is done through the function BTimeDelete($path, $vstart, $vend).

UPDATE. Update is also a combination of valid time and transaction time, i.e., deleting the old tuple with tend set to the current date, and inserting the new tuple with the new value and valid time interval, tstart set to the current date and tend set to UC. This is done by the functions BNodeReplace($path, $newValue) and BTimeReplace($path, $vstart, $vend), respectively.

6.4 Temporal Database Implementations

Two basic approaches are possible to manage the three types of H-documents discussed here: one is to use a native XML database, and the other is to use a traditional RDBMS. In [13] we show that a transaction-time TH-document can be stored in an RDBMS, with significant performance advantages for temporal queries over a native XML database. Similarly, the RDBMS-based approach can be applied to the valid-time history and the bitemporal history. First, the BH-document is shredded and stored into H-tables; for example, the employee BH-document in Figure 6 is mapped into attribute history tables, one per timestamped attribute (name, title, salary, and dept).


Since the BH-document and the H-tables have a simple mapping relationship, temporal XQuery queries can be translated into SQL queries based on this mapping, using the techniques discussed in [13].

7 Conclusions

In this paper, we showed that valid-time, transaction-time, and bitemporal databases can be naturally managed in XML using temporally-grouped data models. This approach is similar to the one we proposed for transaction-time databases in [13], but we have here shown that it also supports (i) the temporal EER model [16], and (ii) valid-time and bitemporal databases with the complex temporal update operations they require. Complex historical queries and updates, which would be very difficult to express in SQL on relational tables, can now be easily expressed in XQuery on such XML-based representations. The technique is general and can be applied to historical representations of relational data, XML documents in native XML databases, and version management in archives and web warehouses [12]. It can also be used to support schema evolution queries [39].

Acknowledgments. The authors would like to thank Xin Zhou for his help and comments. This work was supported by the National Historical Publications and Records Commission and a gift by NCR Teradata.

References 1. R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 1999. 2. G. Ozsoyoglu and R.T. Snodgrass. Temporal and Real-Time Databases: A Survey. IEEE Transactions on Knowledge and Data Engineering, 7(4):513–532, 1995. 3. F. Grandi. An Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web. In TimeCenter Technical Report TR-75, 2003. 4. SQL/XML, http://www.sqlx.org. 5. XQuery 1.0: An XML Query Language. http://www.w3.org/XML/Query. 6. M. Carey, J. Kiernan, J. Shanmugasundaram, and et al. XPERANTO: A Middleware for Publishing Object-Relational Data as XML Documents. In VLDB, 2000. 7. J. Clifford. Formal Semantics and Pragmatics for Natural Language Querying. Cambridge University Press, 1990. 8. J. Clifford, A. Croker, and A. Tuzhilin. On completeness of historical relational query languages. ACM Trans. Database Syst., 19(1):64–116, 1994. 9. J. Clifford, A. Croker, F. Grandi, and A. Tuzhilin. On Temporal Grouping. In Recent Advances in Temporal Databases, pages 194–213. Springer Verlag, 1995. 10. S. Kepser. A Proof of the Turing-Completeness of XSLT and XQuery. In Technical report SFB 441, Eberhard Karls Universitat Tubingen, 2002. 11. ICAP: Incorporating Change Management into Archival Processes. http://wis.cs.ucla.edu/projects/icap/. 12. F. Wang and C. Zaniolo. Temporal Queries in XML Document Archives and Web Warehouses. In TIME-ICTL, 2003.


13. F. Wang and C. Zaniolo. Publishing and Querying the Histories of Archived Relational Databases in XML. In WISE, 2003. 14. H. Gregersen and C. S. Jensen. Temporal Entity-Relationship Models - A Survey. Knowledge and Data Engineering, 11(3):464–497, 1999. 15. H. Gregersen and C. Jensen. Conceptual Modeling of Time-varying Information. In TIMECENTER Technical Report TR-35, September 1998., 1998. 16. R. Elmasri and G.T.J.Wuu. A Temporal Model and Query Language for ER Databases. In ICDE, pages 76–83, 1990. 17. R. T. Snodgrass. The TSQL2 Temporal Query Language. Kluwer, 1995. 18. J. Chomicki, D. Toman, and M.H. Böhlen. Querying ATSQL Databases with Temporal Logic. TODS, 26(2): 145–178, June 2001. 19. M. H. Böhlen, J. Chomicki, R. T. Snodgrass, and D. Toman. Querying TSQL2 Databases with Temporal Logic. In EDBT, pages 325–341, 1996. 20. J. Chomicki and D. Toman. Temporal Logic in Information Systems. In Logics for Databases and Information Systems, pages 31–70. Kluwer, 1998. 21. C. S. Jensen and C. E. Dyreson (eds). A Consensus Glossary of Temporal Database Concepts - February 1998 Version. Temporal Databases: Research and Practice, pages 367–405, 1998. 22. S. K. Gadia and C. S. Yeung. A Generalized Model for a Relational Temporal Database. In SIGMOD, 1988. 23. C. Zaniolo, S. Ceri, C.Faloutsos, R.T. Snodgrass, V.S. Subrahmanian, and R. Zicari. Advanced Database Systems. Morgan Kaufmann Publishers, 1997. 24. Adam Bosworth, Michael J. Franklin, and Christian S. Jensen. Querying the Past, the Present, and the Future. In ICDE, 2004. 25. M. Fernandez, W. Tan, and D. Suciu. SilkRoute: Trading Between Relations and XML. In 8th Intl. WWW Conf., 1999. 26. Oracle XML. http://otn.oracle.com/xml/. 27. S.S. Chawathe, S. Abiteboul, and J. Widom. Managing Historical Semistructured Data. Theory and Practice of Object Systems, 24(4): 1–20, 1999. 28. F. Grandi and F. Mandreoli. The Valid Web: An XML/XSL Infrastructure for Temporal Management of Web Documents. In ADVIS, 2000. 29. M. Gergatsoulis and Y. Stavrakas. Representing Changes in XML Documents using Dimensions. In Xsym, 2003. 30. T. Amagasa, M. Yoshikawa, and S. Uemura. A Data Model for Temporal XML Documents. In DEXA, 2000. 31. C.E. Dyreson. Observing Transaction-Time Semantics with TTXPath. In WISE, 2001. 32. S. Zhang and C. Dyreson. Adding valid time to xpath. In DNIS, 2002. 33. D. Gao and R. T. Snodgrass. Temporal Slicing in the Evaluation of XML Queries. In VLDB, 2003. 34. P. Buneman, S. Khanna, K. Tajima, and W. Tan. Archiving scientific data. ACM Trans. Database Syst., 29(1):2–42, 2004. 35. M.J. Rochkind. The Source Code Control System. IEEE Transactions on Software Engineering, SE-1(4):364–370, 1975. 36. The Extensible Stylesheet Language (XSL). http://www.w3.org/Style/XSL/. 37. M. Rys. Proposal for an XML Data Modification Language. In Microsoft Report, 2002. 38. Tamino XML Server. http://www.tamino.com. 39. F. Wang and C. Zaniolo. Representing and Querying the Evolution of Databases and their Schemas in XML. In Intl. Workshop on Web Engineering, SEKE, 2003.

Enterprise Cockpit for Business Operation Management Fabio Casati, Malu Castellanos, and Ming-Chien Shan Hewlett-Packard 1501 Page Mill road Palo Alto, CA, 94304 {fabio.casati,malu.castellanos,ming-chien.shan}@hp.com

The area of business operations monitoring and management is rapidly gaining importance both in industry and in academia. This is demonstrated by the large number of performance reporting tools that have been developed. Such tools essentially leverage system monitoring and data warehousing applications to perform online analysis of business operations and produce fancy charts, from which users can get a feeling of what is happening in the system. While this provides value, there is still a huge gap between what is available today and what users would ideally like to have (1):

- Business analysts tend to think of the way business operations are performed in terms of high-level business processes, that we will call abstract in the following. There is no way today for analysts to draw such abstract processes and use them as a metaphor for analyzing business operations.
- Defining metrics of interest and reporting against these metrics requires a significant coding effort. No system provides, out of the box, the facility for easily defining metrics over process execution data, for providing users with explanations for why a metric has a certain value, and for predicting the future value of a metric.
- There is no automated support for identifying optimal configurations of the business processes to improve critical metrics.
- There is no support for understanding the business impact of system failures.

The Enterprise Cockpit (EC) is an "intelligent" business operation management platform that provides the functionality described above. In addition to providing information and alerts about any business operation supported by an IT infrastructure, EC includes control and optimization features, so that managers can use it to automatically or manually intervene on the enterprise processes and resources, make changes in response to problems, or identify optimizations that can improve business-relevant metrics. In the following, we sketch the proposed solution (2).

The basic layer of EC is the Abstract Process Monitor (APM), which allows users to define abstract processes and link the steps in these processes with events (e.g., access to certain Web pages, invocation of SAP interface methods, etc.) occurring in the underlying IT infrastructure. In addition to monitoring abstract processes, EC leverages other business operation data, managed by means of "traditional" data warehousing techniques and therefore not discussed further here. Once processes have been defined, users can specify metrics or SLAs over them, through the metric/SLA definer. For example, analysts can define a success metric

(1) We name here just a few of the many issues that came out at a requirements gathering workshop held last fall in Palo Alto.

(2) A more detailed paper is available on request.

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 825–827, 2004. © Springer-Verlag Berlin Heidelberg 2004


stating that a payment process is successful if it ends at the “pay invoice” node and is completed within 4 days from the time the invoice has been received. Metrics are defined by means of a simple web-based GUI and by reusing metric templates either built into APM or developed by consultants at solution deployment time. Once metrics have been defined, the metric computation engine takes care of computing their values. In addition, EC computes distributions for both process attributes (such as the duration of each step and of the whole process, or the process arrival rate) and metrics. This is done by the curve fitting module. For example, users can discover that the duration of the check invoice step follows a normal distribution with a given mean and variance, or that the process arrival rate follows an exponential distribution. EC also provides features to help users make the most out of this information and really understand which things go wrong, why, what is their impact on the business, and how to correct problems. One of these features is process analysis, performed by the analysis and prediction engine. This consists in providing users with explanation for why metrics have certain values (e.g., why the cost is high or the success rate is low). To this end, EC integrates algorithms that automatically mine the EC databases and extract decision trees, which have a graphical formalism that makes it easy, even for business users, to examine correlations between metric values and other process attributes or metrics and identify the critical attributes affecting metric deviations from desired values. For example, users can see that unsuccessful processes are often characterized by invoices from a certain supplier arriving on a certain day. The hard challenge here is how to prepare the data (collected among the ocean of information available by the different data logs) to be fed to the mining algorithm, how to do this in an automated fashion (without human supervision), and in a way that works for every process and every metric. We addressed this challenge by confining the problem (we do analysis and prediction over metric data defined over abstract processes), by leveraging the fact that we have a rich, self-describing process and metric metamodel and therefore could write data preparation programs that can gather all the potentially useful process and metric data, and by leveraging experimental knowledge about which process features are most typically correlated with metric values. Another feature, essentially based on the same mining technology, is to provide users with a prediction of the value that a metric will have at the end of a process, or whether an SLA will be violated or not. Predictions are made at the start of the process and are updated as process execution proceeds. To this end, a family of decision (or regression) trees is built for each abstract process. In addition to the predicted value, users are provided with a confidence value that indicates the probability that the prediction will happen.


Metric analysis and predictions are useful tools in their own right, but they leave the burden of optimization to the users. Hence, EC also includes an optimization component, that suggests improvements to the enterprise process based on business goals, expressed in terms of desired metric values defined over abstract processes. This is achieved by leveraging process simulation techniques: Users can state that they want to optimize an abstract process so to minimize or maximize the value of a certain metric. EC will then simulate the execution of several alternative process configurations corresponding to that abstract process (for example, will try to allocate human and automated resources in different ways, while meeting resource constraints defined by the user), will compute metrics out of the simulated data, and will consequently identify the configuration that best meets the user’s goals. EC also optimizes the search among the many possible process configurations, although in the current version we use simple heuristics for this purpose. Finally, we stress that all of the above features are provided in a fully automated fashion, at the click of the mouse. This is in sharp contrast with the way that, for example, data mining or process simulation packages are used today, requiring heavy manual intervention and lengthy consulting efforts.

Modeling Autonomous Catalog for Electronic Commerce Yuan-Chi Chang, Vamsavardhana R. Chillakuru, and Min Wang IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA

The catalog function is an essential feature in B2C and B2B e-commerce. While the catalog is primarily for end users to navigate and search for products of interest, other e-commerce functions such as merchandising, order, inventory and aftermarket constantly refer to information stored in the catalog [1]. The billion-dollar mail order business was created around the catalog long before e-commerce, and more opportunities surface once catalog content previously created on paper is digitized. While the catalog is recognized as a necessity for a successful web store, its content structure varies greatly across industries and also within each industry. Product categories, attributes, measurements, languages, and currency all contribute to the wide variations, which create a difficult dilemma for catalog designers.

We have recently encountered a real business scenario that challenges traditional approaches to modeling and building e-catalogs. We were commissioned to build an in-store shopping solution for branches of retail store chains. The local catalog at a branch is a synchronized copy of selected enterprise catalog content plus branch-specific information, such as item location on the shelf. A key business requirement, which drives up the technical challenge, is that the in-store catalog solution needs to interoperate with the retail chain's legacy enterprise catalog or its catalog software vendor of choice. This requirement reflects the business reality that decisions to pick enterprise software and branch software are usually neither made simultaneously nor coordinated. As we learned that hundreds of enterprise catalog software packages, legacy and recent, are being used in industries such as grocery, clothing, books, office staples and home improvement, our challenge is to create a catalog model that autonomously adapts to the content of the enterprise catalog in any of these industries.

A straightforward answer to the challenge is to build a mapping tool that converts enterprise catalog content to a pre-designed in-store catalog, but this approach is highly undesirable. The difficulty lies in the fact that it is impossible to predict the content to be stored. A simple example to illustrate the difficulty is to look at what is stored in the catalog of Home Depot, a home furnishing retailer, and compare it with what is stored in the catalog of Staples, an office equipment retailer. A kitchen faucet sold at Home Depot has information about its size, weight, material, color, and style. On the other hand, a fax machine sold at Staples carries attributes such as speed, resolution, and tone dialing. These attributes need to be stored in the catalog for retrieval and product comparisons. Without knowing where a catalog will be used, our design obviously cannot pre-set the storage schema for either faucets or fax machines.

Needless to say, there are hundreds of thousands of products whose information needs to be stored in catalogs. Today’s catalog solutions on the market also suffer from over-design, which leads to wasted storage space, over-normalized schemas, and poor performance. Multi-language support, currency locales, geographical labels, and access control are commonly embedded and inseparable from the main catalog functions. Suppose a company only operates stores in California: the additional features turn from highlights into a burden. Further compounding the shortfall of the traditional catalog modeling and mapping approach is the lack of configurability and optimization. Customization made through small delta changes to the catalog data model propagates, in a magnified way, all the way up to the business logic and presentation layers. Furthermore, the vertical schema that stores catalog attributes as name-value pairs distorts database statistics and makes catalog queries hard to optimize [3][4]. We foresee no easy way to continue the traditional methodology toward a satisfactory solution to our problem. In this paper, we propose a set of abstracted catalog semantics to model an autonomous catalog that becomes the in-store catalog solution. An autonomous catalog exhibits two key properties of autonomic computing: self-configuration and self-optimization [2]. It receives definitions of catalog entities from the enterprise catalog to synthesize and create the persistent storage schema and the programming access interface. It buffers objects for cached retrieval and learns from the search history to create indexes for performance. Using the autonomous catalog requires little learning and training, since it morphs into the enterprise catalog’s content structure; changes are reflected instantly in the storage schema and the programmatic interfaces. We model this autonomous catalog by associations of basic categorical entities. A categorical entity is defined as a named grouping of products that share similar attributes. Instances of a categorical entity are physical, procurable products or services. For example, the kitchen faucet may be declared as a categorical entity, and one of its instances is Moen Asceri. A categorical entity may point to one or more other categorical entities to establish parent or child category relationships. Attributes in one categorical entity may be completely different from those in another, and yet in both cases they are efficiently stored in a normalized schema without resorting to the vertical schema. We define five operations on categorical entities: add, update, delete, search, and retrieve. To shield software developers from accessing instances of categorical entities directly, these five catalog operations can only be executed through a programming language interface such as Java. When a new entity is declared by the enterprise catalog as an XML Schema expression, new Java classes and interfaces, following a predefined template of these five operations, are automatically synthesized. For example, the enterprise may declare an entity named ‘Kitchen Faucet’ with five attributes. Our autonomous catalog then creates tables in the database to store instances of faucets and synthesizes a Java class with methods to populate, retrieve, and search the instances by attribute values.

The kitchen faucet may be associated with the plumbing and kitchen categories, and the Java class has methods to support searches from the associated categories. Revisiting the aforementioned catalog features such as multi-language support, we can easily add new attributes describing the kitchen faucet in whatever foreign languages the use cases require; no space is wasted on catalog attributes that are not needed. Another advantage of the autonomous catalog is its ability to capture more sophisticated modeling semantics at runtime, thanks to the flexibility of the programming-language wrapper. For example, in the synthesized Java class, programmatic pointers can reference an external taxonomy or ontology for runtime inferencing. Catalog content linked to a knowledge management system can support more intelligent queries such as ‘which kitchen faucets are recommended for water conservation?’. This takes catalog modeling beyond the conventional entity-relationship diagram. The modeling of the autonomous catalog enables it to re-configure itself while administrators and programmers are shielded from the details of managing the flexible persistent storage. As the Java classes change and evolve to adapt to the enterprise catalog content, one can envision that the business logic invoking these Java classes could be modeled and generated autonomously as well. We are investigating the modeling of merchandising and order tracking to demonstrate the feasibility of autonomous modeling of business logic.
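To illustrate the synthesis step just described, the sketch below shows how a categorical-entity definition might be turned into a normalized storage table plus a small accessor object. It is a simplified, hypothetical rendering of the idea: the paper generates Java classes from XML Schema declarations, whereas here plain Python and an in-memory SQLite database stand in for both.

import sqlite3

def synthesize_entity(conn, name, attributes):
    # Create a table for the categorical entity and return an accessor
    # offering the add/search operations (update, delete and retrieve
    # would follow the same template).
    table = name.lower().replace(" ", "_")
    cols = ", ".join(f"{a} TEXT" for a in attributes)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, {cols})")

    class Accessor:
        def add(self, **values):
            keys = ", ".join(values)
            marks = ", ".join("?" for _ in values)
            conn.execute(f"INSERT INTO {table} ({keys}) VALUES ({marks})",
                         tuple(values.values()))

        def search(self, **criteria):
            where = " AND ".join(f"{k} = ?" for k in criteria)
            cur = conn.execute(f"SELECT * FROM {table} WHERE {where}",
                               tuple(criteria.values()))
            return cur.fetchall()

    return Accessor()

# The enterprise declares a 'Kitchen Faucet' entity with five attributes.
conn = sqlite3.connect(":memory:")
faucets = synthesize_entity(conn, "Kitchen Faucet",
                            ["size", "weight", "material", "color", "style"])
faucets.add(size="8in", weight="3lb", material="brass", color="chrome", style="modern")
print(faucets.search(color="chrome"))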

References 1. S. Danish, “Building database-driven electronic catalogs,” ACM SIGMOD Record, Vol. 27, No. 4, December 1998. 2. J. O. Kephart and D. M. Chess, “The vision of autonomic computing,” IEEE Computer Magazine, January 2003. 3. S. G. Lee, et al., “An experimental evaluation of dynamic electronic catalog models in relational database systems,” 2002 Information Resources Management Association International Conference, Vol. 1, May 2002. 4. M. Wang, Y. C. Chang and S. Padmanabhan, “Supporting efficient parametric search of e-commerce data: a loosely-coupled solution,” 8th International Conference on Extending Database Technology, March 2002.

GiSA: A Grid System for Genome Sequences Assembly* Jun Tang, Dong Huang, Chen Wang, Wei Wang, and Baile Shi Fudan University, China {tangjun,032053004,chenwang,weiwang1,bshi}@fudan.edu.cn

Sequencing genomes is a fundamental aspect of biological research. Shotgun sequencing, since its introduction by Sanger et al. [2], has remained the mainstay of genome sequence assembly. This method randomly obtains sequence reads (e.g., subsequences of about 500 characters) from a genome and then assembles them into contigs based on significant overlap among them. The whole-genome shotgun (WGS) approach generates sequence reads directly from a whole-genome library and uses computational techniques to reassemble them. A variety of assembly programs have been proposed and implemented, including PHRAP [3] (Green 1994), CAP3 [4] (1999), and Celera [5] (2000). Because of their great computational complexity and the increasingly large data sizes, they incur large time and space overheads. PHRAP [3], for instance, which can only run in a stand-alone way, requires many times (usually more than 10 times) as much memory as the size of the original sequence data. In realistic applications, the assembly process can become unacceptably slow due to insufficient memory, even on a mainframe with huge RAM. GiSA (the Grid System for Genome Sequence Assembly) is designed to solve this problem. It is based on Globus Toolkit 3.2. Built on a grid framework, it exploits parallelism and distribution to improve scalability. Its architecture is shown in Figure 1. The approach of GiSA is designed as a recursive procedure containing two steps. The first step partitions the sequence data into several intermediate-sized groups in which sequence reads are relevant and can potentially be assembled together; each group can be processed independently in limited memory. The second step assembles the intermediate results derived from the first step in each round. In this way, we can handle dramatically larger biological sequence data. GiSA is divided into three layers: the client, the PHRAP servers, and the management servers, namely the BLAST [1] Data Server and the Management Data Server (MDS). The client simply sends an assembly request through a Web browser; the MDS receives the request and GiSA starts working on the genome sequence assembly. The PHRAP servers are deployed with a grid environment (Globus Toolkit 3.2 for Linux in our implementation) and grid services for control and communication. * This research is supported in part by the Key Program of the National Natural Science Foundation of China (No. 69933010 and 60303008) and the China National 863 High-Tech Projects (No. 2002AA4Z3430 and 2002AA231041)

Fig. 1. Grid System for Genome Sequences Assembly

Information about the grid services, such as the GSH (Grid Service Handle), is registered in the MDS. Each grid service works as a single thread to provide parallelism. PHRAP is also installed on each PHRAP server to accomplish the assembly task. Each server continuously receives tasks from the Task Queue on the MDS and returns locally processed results. On the MDS, several important programs are deployed as threads. The main control thread manages all the process variables and schedules the other programs. The queue thread constructs and maintains the global Task Queue for workload balancing. The dispatching thread dispatches tasks from the Task Queue to the PHRAP servers, and the results-receiving thread collects the partial results returned from the PHRAP servers. All four threads are carefully designed for synchronization. Genome sequence data are stored in the BLAST Data Server, where BLAST is available for sequence similarity search. The whole procedure works as follows. When the client sends an assembly request through the Web browser, GiSA starts to run recursively. First, the control thread launches the ‘formatdb’ program from the BLAST package to construct the BLAST target database. Then, the queue thread randomly selects an unused sequence, and BLAST uses the sequence as a seed to find sequences that have a promising chance of being joined. These sequences are collected in a file and packed as a Task Element into the Task Queue. The dispatching thread dispatches tasks to each PHRAP server according to its capability; if there is currently no task in the queue, it sleeps for a while.

When a Task Element is dispatched to a PHRAP server, the server receives the task and runs PHRAP to align the sequences. Multiple PHRAP servers work independently and concurrently. Local assembly results are generated in plain file format and transferred back to the MDS. After the MDS gets the results, it updates the server’s capability information for future dispatching decisions and the sequence alignment information for use in the next round. When a certain portion of the sequences has been processed, the next round starts: a new data source file and BLAST database are reconstructed. This procedure continues recursively; it does not cease until no more contigs can be generated, and the results are then returned to the client. Additionally, we designed a Web progress bar in JSP as a user interface to visualize the ongoing progress. GiSA’s architecture and workflow bring clear benefits. The bottleneck of insufficient RAM in a single computer is overcome by partitioning the overall sequence data into smaller clusters, and all the available server resources contribute to accelerating the assembly procedure, which is the common characteristic of grid systems. Moreover, when a server finishes earlier than the others, it immediately gets another assembly task from the Task Queue until the queue is empty; as a result, the computing capacity of each server is fully utilized. In summary, this grid system provides a new solution to large-scale genome sequence assembly and is a meaningful application of grid computing in the area of sequence assembly.
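The dispatch loop described above can be pictured with the following Python sketch. The task queue, the worker threads, and the run_phrap stub are hypothetical stand-ins for the MDS threads and the PHRAP servers; the real system relies on Globus grid services rather than local threads.

import queue
import threading

task_queue = queue.Queue()        # global Task Queue maintained by the MDS
results = []
results_lock = threading.Lock()

def run_phrap(task):
    # Hypothetical stand-in for running PHRAP on one group of related reads.
    return f"contigs for {task}"

def phrap_server(server_id):
    # Each server repeatedly pulls a task, assembles it locally, and
    # returns the partial result to the MDS.
    while True:
        try:
            task = task_queue.get(timeout=1)   # idle servers wait briefly, then stop
        except queue.Empty:
            return
        local_result = run_phrap(task)
        with results_lock:
            results.append((server_id, local_result))
        task_queue.task_done()

# The queue thread would pack BLAST-selected read groups as Task Elements.
for group in ["group-1", "group-2", "group-3", "group-4"]:
    task_queue.put(group)

workers = [threading.Thread(target=phrap_server, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results)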

References 1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol., 215:403-410, 1990. 2. F. Sanger, S. Nicklen, and A.R. Coulson. DNA Sequencing with Chain Terminating Inhibitors. In Proc. Natl. Acad. Sci., 74:5463-5467, 1977. 3. P. Green. PHRAP Documentation, http://www.phrap.org, 1994. 4. X. Huang and A. Madan. CAP3: A DNA Sequence Assembly Program. Genome Res., 9:868-877, 1999. 5. E.W. Myers, G.G. Sutton, A.L. Delcher, I.M. Dew, et al. A Whole-genome Assembly of Drosophila. Science, 287:2196-2204, 2000.

Analytical View of Business Data: An Example Adam Yeh, Jonathan Tang, Youxuan Jin, and Sam Skrivan Microsoft Corporation, One Microsoft Way, Redmond, WA, USA {adamyeh,jontang,yjin,samsk}@microsoft.com

Abstract. This paper describes an example of how the Analytical View (AV) in Microsoft Business Framework (MBF) works. AV consists of three components: the design-time Model Service, the Business Intelligence Entity (BIE) programming model, and the runtime Intelli-Drill for navigation between OLTP and OLAP data sources. Model Service transforms an “object model (transactional view)” into a “multi-dimensional model (analytical view).” It infers dimensionality from the object layer, where richer metadata is stored, eliminating the guesswork that a traditional data warehousing process requires. Model Service also generates BI Entity classes that enable a consistent object-oriented programming model with strong types and rich semantics for OLAP data. Intelli-Drill links together all the information in MBF using metadata, making information navigation in MBF fully discoverable.

1 Introduction The goals of the analytical view [1] are to ensure less contention on the transactional databases, easier access to information, and tighter integration with the application framework’s programming model, such as Microsoft Business Framework (MBF), with a focus on prescriptiveness [2]. Furthermore, we want to unleash the information and data stored in the application through a set of framework-level programming models so they can be fully leveraged for BI, data mining, and information navigation in business applications. In MBF, Entity-Relational Maps (ER-Maps) describe how each field in a business entity (e.g., a “customer name” in the customer entity) originates from a column in a database table (e.g., the CustomerName column in the Customers table). The Model Service infers the respective OLAP cubes from the MBF object models, i.e., business entities in the form of metadata. After this model transformation, a set of classes, namely the Business Intelligence (BI) Entities, is also generated to objectify access to the multi-dimensional data in OLAP cubes. AV automatically infers the corresponding analytical model from the transactional business logic. This process not only enables BI Entities to be generated automatically but also preserves the “transformation” logic to offer the full fidelity of the metadata describing the relationships between business entities and BI Entities. The end result of this process is a technical breakthrough that enables BI Entities to drill back to business entities and navigate among them at design and run time, using metadata. The Intelli-Drill runtime service furthers the idea used by hypermedia [3] for object traversal in an object graph. Figure 1 illustrates the architecture vision for AV.

Fig. 1. Our Architecture Vision for Analytical View

Fig. 2. UML Model for the Example

2 An Example A developer uses MBF to design an application object model using UML (see Figure 2). He maps the entities and relationships to relational objects. He runs the Model Service to infer the dimensional model from the defined object model and the O-R mapping. The translator uses a rules engine to create dimensions and hierarchies. The translator first examines the model and determines the “reachable” objects from all defined measures. A reachable object implies that a path exists to that object, through relationships of the correct cardinality, from the measures. This ensures that the dimensions that are built can “slice” the measures.
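The reachability test can be viewed as a simple graph traversal over the object model, as the Python sketch below illustrates. The class names and the assumption that only many-to-one relationships are followed are illustrative simplifications, not the actual MBF rules engine.

from collections import deque

# Object model as a graph: edges are relationships that may be followed
# from the measure-bearing class toward candidate dimension classes.
relationships = {
    "SalesOrderLine": ["SalesOrder", "Product"],
    "SalesOrder": ["Customer"],
    "Product": ["Category"],
}

def reachable_from(measure_class):
    # Return all classes reachable from the class carrying the measures;
    # only these can become dimensions able to "slice" the measures.
    seen, frontier = set(), deque([measure_class])
    while frontier:
        cls = frontier.popleft()
        for target in relationships.get(cls, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen

print(reachable_from("SalesOrderLine"))
# e.g. {'SalesOrder', 'Product', 'Customer', 'Category'}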

Fig. 3. Inferred Star Schema for the Example

The translation engine then generates a Data Source View [4], which describes the entities as data sources in the OLAP model. Object relationships such as associations and compositions are emulated by foreign keys understood by OLAP. Additional objects, known as the FACT objects, are built by traversing the object tree rooted at each focal point, following foreign keys in the many-to-one direction. Finally, the translation engine builds a dimensional model from the Data Source View. A Sales cube is built with two measure groups [4] and dimensions (Figure 3). The measure groups are derived from objects with decorated measures. The rules engine determines the structure of the dimensions, rolling up some entities into a single dimension and constructing hierarchies with the appropriate levels. The deployment engine of the Model Service deploys the dimensional model on a specified UDM server and generates the BI Entity code for programmatic access. We also introduce the notion of a “Smart Report” to make information more accessible to end users wherever they are in a business application, by leveraging the metadata and the Intelli-Drill runtime services. Figure 4 shows a mockup that illustrates the idea. In a Smart Report, data points are traversable through Intelli-Drill. For example, when a user types information about a customer into a sales order, the user can see the credit rating and payment history of this customer.

Fig. 4. Sample Smart Report

3 Conclusions and Future Work Traditionally, converting an object model into a dimensional model is done manually to re-construct the business logic, which can be lost in the process. Importing data from the object model into the dimensional model also creates a big overhead for the data analysis process. The conversion from an object-oriented model to a dimensional model is a new concept in OLAP. Often, the two models are not related to each other because the people who deal with them have different backgrounds. Our breakthrough automates the conversion process and removes the need to reconstruct the business logic. As such, we provide the lowest-cost entry point for application developers to include BI or data mining functionality in their applications. Most of the work described here has been done and will be part of MBF. We are working diligently to support prescriptive navigation using Intelli-Drill.

References 1. Adam Yeh, Jonathan Tang, Youxuan Jin, Sam Skrivan, Analytical View of Business Data, to be published in ACM SIGKDD 2004 proceedings. 2. Microsoft Business Framework (http://microsoft.sitestrearn.com/PDC2003/DAT/ DAT340_files/Botto_files/DAT340_Brookins.ppt) 3. Alejandra Garrido and Gustavo Rossi, A Framework for Extending Object-Oriented Applications with Hypermedia Functionality (http://www.cs.colorado.edu/~kena/classes/7818/f99/framework.pdf). 4. Business Intelligence and Data Warehousing in SQL Server Yukon (http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/next/ DWSQLSY.asp)

Ontological Approaches to Enterprise Applications* Dongkyu Kim (CoreLogiX, Inc., Seoul, Korea), Yuan-Chi Chang and Juhnyoung Lee (IBM T. J. Watson Research Center, Hawthorne, NY, USA, {jyl,yuanchi}@us.ibm.com), and Sang-goo Lee (Center for e-Business Technology, Seoul National University, Seoul, Korea)

1 Introduction One of the main challenges in building enterprise applications has been to balance general functionality against domain- and scenario-specific customization. The lack of formal ways to extract, distill, and standardize the embedded domain knowledge has been a barrier to minimizing the cost of customization. Using ontologies, as many hope, will give application builders the much-needed methodology and standards to achieve the objective of building flexible enterprise solutions [1,2]. However, even with a rich amount of research and quite a few excellent results on designing and building ontologies [3,4], there are still gaps to be filled for actual deployment of the technology and concepts in a real-life commercial environment. The problems are especially hard in applications that require well-defined semantics in mission-critical operations. In this presentation, we introduce two of our projects in which ontological approaches are used for enterprise applications. Based on these experiences we discuss the challenges in applying ontology-based technologies to solving business applications.

2 Product Ontology In our current project, an ontology system is being built for the Public Procurement Services (PPS) of Korea, which is responsible for procurement for the government and public agencies of the country. The main focus is the development of a system of ontologies representing products and services. This will include the definitions, properties, and relationships of the concepts that are fundamental to products and services. The system will supply tools and operations for managing catalog standards and will serve as a standard reference system for e-catalogs. Strong support for the semantics of product data and processes will allow for dynamic, real-time data integration, and also real-time tracking and configuration of products, despite differing standards and conventions at each stage. * This work has been conducted in part under the Joint Study Agreement between IBM T. J. Watson Research Center, USA, and the Center for e-Business Technology, Seoul National University, Korea. D. Kim and S. Lee’s work was supported by the Ministry of Information & Communications, Korea, under the Information Technology Research Center (ITRC) Support Program.

An important component of the ontology model will be the semantic model for product classification schemes such as UNSPSC (the United Nations Standard Products and Services Code, UNDP, http://www.unspsc.org/), since that alone can be used to enrich the current classification standards to include machine-readable semantic descriptions of products. The model will provide a logical structure in which to express the standards. The added semantics will enhance the accuracy of mappings between classification standards.

3 Performance Monitoring Ontology We extend our discussions to another domain to argue that the issues and principles in the above project are not specific to product information, but rather can be generally applied to other database applications that deal with diverse semantics. In this second project, we attempt to create a worldwide monitoring and collaboration platform for petroleum surveillance engineers of a major oil production and distribution company. The job of a surveillance engineer is to constantly monitor multiple time series sensor data, which takes measurements of production equipment outputs as well as natural environmental factors. An ontology to describe operational data and events in association with oil production is expected to serve as the reference to all operational sensor data and all equipment failure monitors. A performance monitoring ontology primarily serves three objectives in our system. First, the ontology organizes the matrix of sensor data in a semantically meaningful way for engineers to navigate and browse. Second, through ontology, the pattern recognizers can be de-coupled from actual sensor data, which may be added, upgraded, and retired in the lifetime of a well. Third, the ontology helps to link treatment actions to pending failure events. The use of ontology for performance monitoring appears through the working loop of sense, alert, decision, and reaction.

4 Discussions and Conclusion Based on our experiences from the projects described above, we discuss some of the practical issues that hinder the widespread use of ontology-based applications in enterprise settings. We came to realize the lack of modeling methodology, domain user tools, persistent storage, lifecycle management, and access control for the creation, use, and maintenance of ontologies on a large, deployable scale. While our engagement is specific to government procurement and oil production, we believe that this paradigm carries over to similar business applications in other industries. Modeling: The level-of-abstraction problem haunts all aspects of ontology design. Multiple views and taxonomies, often with conflicting semantics, present another challenge for the field engineer. Ontology-DB Integration: The ontology can be modeled as metadata for the database, where the database alone represents the information content of the system and the ontology is a secondary facility. On the other hand, the ontology can be modeled as an integral part of the database, in which case the ontology must be part of all queries and operations.

The trade-offs include implementation complexity, semantic richness, and efficiency. Ontology Lifecycle Management: Populating the ontology is a daunting task that can make or break the project. The job is complicated by multiple formats, semantic mismatches, and errors or dirty data in pre-existing information sources. Change management (versions, mergers, decompositions, etc.) is another complicated issue. Accountability and Control: One of the biggest concerns inhibiting ontology adoption in enterprise applications is the lack of control. When is an ontology complete, in the sense that it holds sufficient content to support all mission-critical operations? Is its behavior and performance predictable? Human Factors: Building and maintaining the ontology requires much more than software engineers. Domain experts must define the concepts and relationships of the domain model. An ontological information model is not a concept easily understood by non-computer/ontology experts; a set of intuitive guidelines must be provided, and easy-to-use tools are also essential. Through this presentation, we wish to share our experiences on these issues and our solutions to some of them. The solutions to these problems are most likely to come as disciplines, guidelines, and tools that implement these guidelines. In our future research, we plan to build a map that links individual ontological requirements to ontology issues, and then to applicable ontology technology.

References 1. D. L. McGuinness: Ontologies Come of Age. In: D. Fensel, et al. (eds.): The Semantic Web: Why, What, and How. MIT Press (2001) 2. N. Guarino: Formal Ontology and Information Systems. Proc. of Formal Ontology in Information Systems, Trento, Italy (1998) 3. P. Spyns, et al.: Data Modeling versus Ontology Engineering. SIGMOD Record, Vol. 31(4), ACM (2002) 4. C.W. Holsapple and K.D. Joshi: A Collaborative Approach to Ontology Design. Comm. of the ACM, Vol. 45(2), ACM (2002)

FASTAXON: A System for FAST (and Faceted) TAXONomy Design Yannis Tzitzikas1, Raimo Launonen1, Mika Hakkarainen1, Pekka Korhonen2, Tero Leppänen2, Esko Simpanen2, Hannu Törnroos2, Pekka Uusitalo2, and Pentti Vänskä2 1 VTT Information Technology, P.O. Box 1201, 02044 VTT, Finland, {Raimo.Launonen,Mika.Hakkarainen}@vtt.fi 2 Helsinki University of Technology, Finland

Building very big taxonomies is a laborious task, vulnerable to errors and to management and scalability deficiencies. FASTAXON is a system for building very big taxonomies in a quick, flexible and scalable manner, based on the faceted classification paradigm [4] and the Compound Term Composition Algebra [5]. Below we sketch the architecture and functioning of this system and report our experiences from using it in real applications. Taxonomies, i.e., hierarchies of names, are probably the oldest and most widely used conceptual modeling tool, still used in Web directories, libraries and the Semantic Web (e.g., see XFML [1]). Moreover, the advantages of the taxonomy-based conceptual modeling approach for building large-scale mediators and P2P systems that support semantic-based retrieval services have been analyzed and reported in [7,6,8]. However, building very big taxonomies is laborious and error-prone. One method for efficiently building a very big taxonomy is to first define a faceted taxonomy (i.e., a set of independently defined taxonomies called facets) like the one presented in Figure 1, and then derive automatically the inferred compound taxonomy, i.e., the taxonomy of all possible compound terms (conjunctions of terms) over the faceted taxonomy. Faceted taxonomies carry a number of well-known advantages over single hierarchies in terms of building and maintaining them, as well as using them in multi-criteria indexing (e.g., see [3]). FASTAXON is a system for building big (compound) taxonomies based on this idea. Using the system, the designer first defines a number of facets and assigns one taxonomy to each of them. After that, the system can generate dynamically (and on the fly) a navigation tree that allows the designer (as well as the object indexer or end user) to browse the set of all possible compound terms. A drawback, however, of faceted taxonomies is the cost of avoiding the invalid (meaningless) compound terms, i.e., those that do not apply to any object in the domain. Consider the faceted taxonomy of Figure 1. Clearly we cannot do any winter sport on the Greek islands (Crete and Cefalonia), as they never have enough snow, and we cannot do any sea sport on Olympus, because Olympus is a mountain. For the sake of this example, let us also suppose that only Cefalonia has a casino. Under this assumption, the partition of the set of compound terms into valid (meaningful) and invalid (meaningless) compound terms is shown in Table 1.

Fig. 1. A faceted taxonomy for indexing hotel Web pages

The availability of such a partition would be very useful during the construction of a materialized faceted taxonomy (i.e., a catalog based on a faceted taxonomy). It could be exploited in the indexing process to prevent indexing errors, i.e., to allow only meaningful compound terms to be assigned to objects. It could also aid the indexer during the indexing process by dynamically generating a single hierarchical navigation tree that allows selecting the desired compound term by browsing only the meaningful compound terms. However, even from this toy example, it is more than obvious that the definition of such a partition would be a formidably laborious task for the designer. FASTAXON allows specifying the meaningful compound terms in a very flexible manner. It is the first system that implements the recently emerged Compound Term Composition Algebra (CTCA) [5]. This allows the designer to use an algebraic expression to specify the valid compound terms, which involves declaring only a small set of valid or invalid compound terms from which other (valid or invalid) compound terms are then inferred. For instance, the partition shown in Table 1 can be defined by a CTCA expression with the following P and N parameters: N = {{Crete, Winter Sports}, {Cefalonia, Winter Sports}}, P = {{Cefalonia, SeaSki, Casino}, {Cefalonia, Windsurfing, Casino}}. Specifically, FASTAXON provides an Expression Builder for formulating CTCA expressions in a flexible, interactive and guided way. Only the expression that defines the desired compound terminology is stored (and not the inferred partition), as an inference mechanism is used to check (in polynomial time) whether a compound term belongs to the compound terminology of the expression. The productivity obtained using FASTAXON is quite impressive: the experimental evaluation so far has shown that in many cases a designer can define from scratch a compound taxonomy of around 1000 indexing terms in a few minutes. FASTAXON has been implemented as a client/server Web-based system written in Java. The server is based on the Apache Web server and the Tomcat application server, and uses MySQL for persistent storage. The user interface is based on DHTML (dynamic HTML), JSP (Java Server Pages) and Java Servlet technologies (J2EE). The client only needs a Web browser that supports JavaScript (e.g., Microsoft Internet Explorer 6). Future extensions include modules for importing and exporting XFML [1] and XFML+CAMEL [2] files. FASTAXON will be published under the VTT Open Source Licence within 2004 (for more see http://fastaxon.erve.vtt.fi/).
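The flavor of the inference over declared compound terms can be conveyed with a deliberately simplified Python sketch: here a compound term is rejected if it covers any declared-invalid combination from N, and accepted otherwise. The real CTCA plus- and minus-product operations [5] also use the positive declarations in P to infer further valid terms, which this sketch omits.

# Declared parameter sets from the hotel example in the text.
N = [{"Crete", "WinterSports"}, {"Cefalonia", "WinterSports"}]                   # declared invalid
P = [{"Cefalonia", "SeaSki", "Casino"}, {"Cefalonia", "Windsurfing", "Casino"}]  # declared valid
# P is listed for completeness; the simplified check below does not use it.

def is_valid(compound_term):
    # Simplified check: a compound term is invalid if it contains a
    # declared-invalid combination; otherwise it is accepted here.
    term = set(compound_term)
    return not any(invalid <= term for invalid in N)

print(is_valid({"Crete", "WinterSports"}))            # False
print(is_valid({"Cefalonia", "SeaSki", "Casino"}))    # True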

References 1. “XFML: exchangeable Faceted Metadata Language”. http://www.xfml.org. 2. “XFML+CAMEL: Compound term composition Algebraically-Motivated Expression Language”. http://www.csi.forth.gr/markup/xfml+camel. 3. Ruben Prieto-Diaz. “Implementing Faceted Classification for Software Reuse”. Communications of the ACM, 34(5):88–97, 1991. 4. S. R. Ranganathan. “The Colon Classification”. In Susan Artandi, editor, Vol. IV of the Rutgers Series on Systems for the Intellectual Organization of Information. New Brunswick, NJ: Graduate School of Library Science, Rutgers University, 1965. 5. Y. Tzitzikas, A. Analyti, N. Spyratos, and P. Constantopoulos. “An Algebraic Approach for Specifying Compound Terms in Faceted Taxonomies”. In Information Modelling and Knowledge Bases XV, Procs of EJC’03, pages 67–87. IOS Press, 2004. 6. Y. Tzitzikas and C. Meghini. “Ostensive Automatic Schema Mapping for Taxonomy-based Peer-to-Peer Systems”. In 7th Int. Workshop on Cooperative Information Agents, CIA-2003, pages 78–92, Helsinki, Finland, August 2003. 7. Y. Tzitzikas, C. Meghini, and N. Spyratos. “Taxonomy-based Conceptual Modeling for Peer-to-Peer Networks”. In Procs of 22nd Int. Conf. on Conceptual Modeling, ER’2003, pages 446–460, Chicago, Illinois, October 2003. 8. Y. Tzitzikas, N. Spyratos, and P. Constantopoulos. “Mediators over Taxonomy-based Information Sources”. VLDB Journal, 2004 (to appear).

CLOVE: A Framework to Design Ontology Views Rosario Uceda-Sosa (IBM T. J. Watson Research Center, Hawthorne, NY 10532, USA), Cindy X. Chen, and Kajal T. Claypool (Department of Computer Science, University of Massachusetts, Lowell, MA 01854, USA, {cchen,kajal}@cs.uml.edu)

1 Introduction The management and exchange of knowledge on the Internet has become the cornerstone of technological and commercial progress. In this fast-paced environment, the competitive advantage belongs to those businesses and individuals that can leverage the unprecedented richness of web information to define business partnerships, to reach potential customers, and to accommodate the needs of these customers promptly and flexibly. The Semantic Web vision is to provide a standard information infrastructure that will enable intelligent applications to automatically or semi-automatically carry out the publication, searching, and integration of information on the Web. This is to be accomplished by semantically annotating data and by using standard inferencing mechanisms on this data. Such annotation would allow applications to understand, say, dates and time intervals regardless of their syntactic representation. For example, in the e-business context, an online catalog application could include the expected delivery date of a product based on the schedules of the supplier, the shipping times of the delivery company, and the address of the customer. The infrastructure envisioned by the Semantic Web would guarantee that this can be done automatically by integrating the information of the online catalog, the supplier, and the delivery company. No changes to the online catalog application would be necessary when suppliers and delivery companies change, and no syntactic mapping of metadata would be necessary between the three data repositories. To accomplish this, two things are necessary: (1) the data structures must be rich enough to represent the complex semantics of products and services and the various ways in which these can be organized; and (2) there must be flexible customization mechanisms that enable multiple customers to view and integrate these products and services with their own categories. Ontologies are the answer to the former; ontology views are the key to the latter. We propose ontology views as a necessary mechanism to support the ubiquitous and collaborative utilization of ontologies. Different agents (human or computational) require different organizations of data and different vocabularies to suit their information-seeking needs, but the lack of flexible tools to customize and evolve ontologies makes it impossible to find and use the right nuggets of information in such environments.

When using an ontology, an agent should be able to introduce new classes using high-level constraints and to define contexts that enable efficient, effective, and secure information searching. In this paper we present a framework that enables users to design customized ontology views, and we show that views are the right mechanism to enhance the usability of ontologies.

2 Ontology Views

Database views and XML views [1–3, 5–7] have been used extensively both to tailor data to specific applications and to limit access to sensitive data. Much like traditional views, it is imperative for ontology views to provide a flexible model that meets the demands of different applications as well as different categories of users. For example, consider an online furniture retailer, OLIE, that wants to take advantage of ontology-based technologies and provide a flexible and extensible information model for its web-based applications. The retailer creates an ontology that describes the furniture inventory, manufacturers and customer transactions. Let us assume that two primary applications use this ontology. The first application, a catalog browsing application, allows customers to browse the furniture catalog and make online purchases, while the second application, a pricing application, allows marketing strategists to define sales promotions and pricing. The information needs of these two applications are very different. For example, customers should not be allowed to access the wholesale price of a furniture piece. Similarly, an analyst is only concerned with attributes of a furniture piece that describe it as a marketable entity, not those that refer to its dimensions, which are primarily of interest to customers. The catalog browsing and the pricing applications need to take these restrictions into consideration when querying and displaying the ontology to their respective users. If the ontology changes, regardless of how powerful the inferencing is, the applications will invariably need to change their queries. This hard-coded approach to accessing ontologies is costly in development time and error prone, and underscores the need for a flexible model for ontology views. In this case, it is desirable to be able to define the Marketing View and Customer View as in the ontology fragment shown in Figure 1. Despite their similarities with relational database views, ontology views also have differentiating characteristics. First, ontology views need to be first-class citizens in the model, with relations and properties just like regular ontology classes. For example, suppose that the pricing analyst wants to define the PreferredCustomer category: a customer with a membership card that offers special prices for furniture and accessories. Now the catalog application needs a PreferredCustomerView, similar to the CustomerView defined in Figure 1, adding the promotional price for card holders. It would also be desirable to define the PreferredCustomerView as a subclass of CustomerView, so that whenever some information is added to or removed from the CustomerView, the changes are automatically reflected in the PreferredCustomerView. Notice that, in this case, we have an inheritance hierarchy within the views, that is, PreferredCustomerView IsA CustomerView, as shown in Figure 2.

Fig. 1. Two Views of a Furniture Piece.

Fig. 2. Inheritance Hierarchy in Ontology Views.

have an inheritance hierarchy within the views, that is PreferredCustomerView IsA CustomerView, as shown in Figure 2. Second, views need to be used as contexts to interpret further queries. For example, suppose that the marketing analyst defines the class SeasonalItems as a set of furniture pieces or accessories that have unusually high volume of sales in a given shopping season, based on previous years sales statistics. The analyst also defines ChristmasItems, SummerItems and FallItems as refinements of SeasonalItems. When a customer queries for information on large oval tablecloths in Christmas, items in the ChristmasItems view should be selected, and the information on each item should be filtered through either the CustomerView or the PreferredCustomerView, depending on the type of customer. It is easy to see that views need to represent structures of the ontology (like CustomerView) as well as new classes defined through constraints, much like OWL [4] class operators. In fact, the views proposed here are extensions to OWL classes and expressions, as discussed in Section 2.1.

2.1 CLOVE – A View Definition Language for OWL

We focus on the systematic description and management of views as first-class objects in ontologies, as described in the scenarios above. To the best of our knowledge, this work is the first of its kind in defining ontology views as first-class objects. In particular, we extend OWL [4], a recently proposed W3C standard, to describe ontologies and their views. OWL allows the definition of classes from other classes through set operations, thereby providing the basic infrastructure for defining simple views, like the SeasonalItems category described above. However, it has limitations. First, even though ontology views can be considered as classes that are derived from the underlying ontology, they can also refer to subnetworks or structures of classes and relations (as in the case of the CustomerView), underscoring the need for a language rich enough to define both types of views. Second, we need to define a set of standard rules that govern the creation and management of these views, as well as their scope and visibility. While the latter is still an open problem, there are some simple mechanisms that allow adequate view definitions. In this paper, we present an overview of a high-level constraint language, CLOVE (Constraint Language for Ontology View Environments), that extends OWL constraints.

We employ CLOVE as the underlying mechanism to support the creation of OWL views. A view in CLOVE is defined by a set of (1) subject clauses; (2) object clauses; and (3) variable definitions. The subject clauses describe the constraints under which the view is valid, as well as the range of instances for which the view is applicable; they are used to check whether the view (if declared active) should be used in the current query. CLOVE does not restrict the number of subjects of a view. For example, the CustomerView defines as subjects all types of customers. It is also possible not to specify the subject of a view by using the keyword ANY, in which case the CLOVE runtime system uses the view to filter all queries when the view is active. A subject is defined through a NavigationExpression, which is described below.

The object clauses are expressions that describe the content of a view, in which the keywords INCLUDE and EXCLUDE indicate whether the classes or instances satisfying the clause are included in or excluded from the view. The NavigationExpression is a Boolean expression of relations or properties that are navigated from the set of currently evaluated classes and instances, or from a variable or name included in the expression; for example, ?object SUBSUMES and IS-A are valid navigation expressions. The ConstraintExpression is an extension of an OWL expression. In its simplest form it is just the name of a class or instance, but it can also describe the content of its data (the WITH CONTENT in Figure 3) or the data type of the properties of a class or instance (with WITH TYPE), among others. In the example below, Customer and MarketableEntity are valid, and very simple, constraint expressions. CLOVE also defines variables that can be used directly in clauses, and it allows users to define their own variables. A variable in CLOVE is preceded by a question mark. In Figure 3, the variable ?object refers to the currently evaluated content of the view. There is also a pre-defined variable, ?subject, that refers to all the currently evaluated subjects of the view. User-defined variables can be used to define scripts or procedures that calculate data from the existing data, like LastYearXmSales in Figure 3, which is evaluated from existing properties of LastYearSales, the November and December sales. The full specification of CLOVE is beyond the scope of this paper, but Figure 3 gives a brief example of the creation of some of the OWL views of the scenarios above using CLOVE. CLOVE allows arbitrary relations among views, in particular inheritance (that is, IsA). CLOVE also allows the dynamic creation of classes to evaluate views (like LastYearXmSales as a refinement of LastYearSales in Figure 3). After defining them, views can be activated or de-activated by their authors or by users with administrative privileges. The runtime system requires that every query is tagged with information about the user, which is associated with a class in the ontology. Queries are evaluated with respect to the currently active views in the order in which they were defined.

Fig. 3. Creating views with CLOVE.

The result is that queries against the ontology are automatically filtered by one or more views, according to the current user context. All users with access to the ontology should be able to create views; this is one of the most important design principles of CLOVE. However, not every view should be used to filter every query, which is why the CLOVE runtime system keeps track of view dependencies and of who created them, with a simple access control system based on user IDs.
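A toy rendering of this query-time filtering in Python is shown below. The representation of a view as a property predicate, and the role and property names, are illustrative assumptions rather than CLOVE's actual runtime structures.

# Each active view is modelled here as a predicate deciding whether a
# property of an instance is visible; views are applied in definition order.
def customer_view(prop):
    return prop != "wholesale_price"                      # EXCLUDE wholesale price

def marketing_view(prop):
    return prop not in {"width", "height", "depth"}       # EXCLUDE physical dimensions

active_views = {"customer": [customer_view], "analyst": [marketing_view]}

sofa = {"name": "Oak Sofa", "retail_price": 900, "wholesale_price": 480,
        "width": 210, "height": 85, "depth": 95}

def query(instance, user_role):
    # Filter a query result through every view active for the querying user.
    visible = dict(instance)
    for view in active_views.get(user_role, []):
        visible = {k: v for k, v in visible.items() if view(k)}
    return visible

print(query(sofa, "customer"))   # wholesale price hidden
print(query(sofa, "analyst"))    # physical dimensions hidden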

3 Conclusions

The Semantic Web brings forth the possibility of heterogeneous ontologies that are universally accessible to arbitrary agents through the Internet. These agents may not only access these ontologies, but also customize their organization and information with their own knowledge and communicate it in turn to their own users. Hence, the ability to create views and contexts on ontologies becomes as crucial as the view mechanism in traditional database technologies, providing a scope and filtering of information necessary to modularize and evolve ontologies.

However, ontology views are not just straightforward extensions of database views. We have designed and implemented a framework that explores the issues of authoring and managing views and their underlying ontologies. Among them, we have focused on the dual nature of views as classes in the ontology and as contexts to interpret new queries. As contextual elements, views are structures of classes, and as classes they have relations to other views and even to other classes. We have also implemented a constraint language, CLOVE, that takes this duality into account and allows users to both create and query views with an easy-to-use, natural interface.

References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley Publishing Company, 1995. 2. J. Gilbert. Supporting user views. Computer Standards and Interfaces, 13:293–296, 1991. 3. G. Gottlob, P. Paolini, and R. Zicari. Properties and update semantics of consistent views. ACM Trans. on Database Systems, vol. 13(4):486–524, Dec. 1988. 4. OWL Web Ontology Language. http://www.w3.org/TR/owl-guide/. 5. A. Rosenthal and E. Sciore. First-Class Views: A Key to User-Centered Computing. SIGMOD Record, 28(3):29–36, May 1999. 6. S. Cluet, P. Veltri, and D. Vodislav. Views in a large scale XML repository. In Proceedings of the International Conference on Very Large Data Bases, pages 271–280, 2001. 7. Y. B. Chen, T. W. Ling, and M. L. Lee. Designing valid XML views. In Proceedings of the International Conference on Conceptual Modeling, pages 463–478, 2002.

Dr. Rosario Uceda-Sosa is a researcher in intelligent information infrastructures, knowledge representation and usability at the IBM T.J. Watson Research Center. She received a BS in Philosophy from the University of Seville (Spain), as well as an MS in Mathematics and a PhD in Computer Science from the University of Michigan. She is currently interested in the usability of ontologies and in ontology query languages. Dr. Cindy Chen is an assistant professor in the Department of Computer Science, University of Massachusetts at Lowell. She received a B.S. degree in Space Physics from Peking University, China, and M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles. Her current research interests include spatio-temporal databases, XML, and data mining. Dr. Kajal T. Claypool is an Assistant Professor in the Department of Computer Science at the University of Massachusetts - Lowell. She received her B.E. degree in Computer Engineering from the Manipal Institute of Technology, India, and her Ph.D. degree in Computer Science from Worcester Polytechnic Institute, Worcester, MA. Her research interests are data integration, focused on XML integration and life science integration, data stream engineering, and software engineering.

iRM: An OMG MOF Based Repository System with Querying Capabilities Ilia Petrov, Stefan Jablonski, Marc Holze, Gabor Nemes, and Marcus Schneider Chair for Database Systems, Department of Computer Science, University of Erlangen-Nürnberg, Martensstrasse 3, Erlangen, D-91058, Germany {petrov,jablonski,holze,nemes,schneider}@cs.fau.de http://www6.informatik.uni-erlangen.de/

Abstract. In this work we present iRM – an OMG MOF-compliant repository system that acts as a custom-defined application or system catalogue. iRM enforces structural integrity using a novel approach and provides declarative querying support. iRM finds use in evolving data-intensive applications and in fields where the integration of heterogeneous models is needed.

1 Introduction Repository systems are “shared databases about engineered artifacts” [2]. They facilitate integration among various tools and applications and are therefore central to an enterprise. Loosely speaking, repository systems are data stores with a customizable system catalogue – a new and distinguishing feature. Repository systems exhibit an architecture comprising several layers of metadata [3], e.g., the repository application’s instance data, the repository application’s model, the meta-model, and the meta-meta-model. In comparison to database systems, repository systems contain an additional metadata layer, allowing for a custom-definable and extensible system catalogue. Preserving consistency between the different layers is a major challenge specific to repository systems. A declarative query language with higher-order capabilities is needed to provide model (schema) independent querying. Treating data and metadata in a uniform manner is a key principle when querying repository objects on different meta-layers. Application areas are domain-driven application engineering, scientific repositories, and data-intensive Web applications. In this demonstration we present an OMG MOF based repository system developed in the frame of the iRM project [1] (Fig. 1).

2 iRM/RMS Repository System and mSQL Query Language Structural consistency is one of the key issues in repository systems and must be enforced automatically by the RMS. It ensures that the structure of the repository objects conforms to their definitions on the upper meta-layer. Without structural integrity the repository data will be inconsistent (no type conformity), which has profound consequences for any repository application, since repository applications rely heavily on reflection. The concept of repository transactions is an integral part of the structural consistency of the repository data. Repository systems must be able to handle concurrent multi-client access, i.e., concurrent atomic sets of operations from multiple repository clients.
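The conformance check at the heart of structural consistency can be pictured with the following minimal Python sketch; the meta-model structures and names are hypothetical and do not reflect the iRM/RMS implementation.

# A meta-class on the upper meta-layer declares which attributes (and
# types) its instances on the layer below must carry.
meta_model = {
    "Table": {"name": str, "columns": list},
}

def conforms(obj, meta_class):
    # A repository object conforms if it has exactly the attributes,
    # with the declared types, required by its meta-class definition.
    spec = meta_model[meta_class]
    return set(obj) == set(spec) and all(isinstance(obj[a], t) for a, t in spec.items())

orders = {"name": "Orders", "columns": ["id", "customer", "total"]}
broken = {"name": "Orders"}                    # missing the 'columns' attribute
print(conforms(orders, "Table"), conforms(broken, "Table"))   # True False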

Implementing isolation requires extending the traditional locking mechanisms, e.g., the multi-granularity locking mechanisms of OODBs [4]. In iRM/RMS we introduce an “instance lattice” in addition to the aggregation and class (type) lattices.

Fig. 1. Logical Architecture of the iRM Project

We introduce mSQL (meta SQL) as a query-language extension of SQL that accounts for the specifics of repository systems. The mSQL syntax is inspired by SchemaSQL [5]. The main value of mSQL lies in its declarative nature, which is especially beneficial in the context of repository systems. Given only programmatic access through the RMS API, a repository application needs additional code to load repository objects; mSQL queries significantly simplify this task and reduce application complexity. mSQL allows model-independent querying: querying attribute values in classes on one meta-layer that are instances of a specified meta-class on the layer above.

3 The Demonstration The demonstration will show the enforcement of structural integrity in iRM. We will consider several cases: (a) creation of new models and import of data; (b) modification of existing meta-metamodels (M2) with existing M1 models and instance data; and (c) concurrent multi-client access. The second and third cases illustrate the main value of structural integrity. We will also showcase the execution of mSQL queries, illustrating the value of mSQL to repository applications, and demonstrate model-independent querying and dynamic schema discovery with mSQL.

References 1. Petrov, I., Jablonski, S.: An OMG MOF based Repository System with Querying Capability – the iRM Project. To appear in Proceedings of iiWAS 2004. 2. Bernstein, P., Dayal, U.: An overview of repository technology. In Proc. of VLDB 1994. 3. Object Management Group: Meta Object Facility Specification Version 1.4. 4. Ozsu, T.: Transaction models and transaction management in object-oriented database management systems. In Advances in Object-Oriented Database Systems, 1994. 5. Lakshmanan, L.V.S., Sadri, F., Subramanian, S.N.: SchemaSQL: An extension to SQL for multidatabase interoperability. TODS, Vol. 26/4, Dec. 2001.

Visual Querying for the Semantic Web Sacha Berger, François Bry, and Christoph Wieser University of Munich, Institute for Informatics http://www.ifi.lmu.de

This paper presents a demonstration of visXcerpt [BBS03,BBSW03], a visual query language for both standard Web and Semantic Web applications. Principles of visXcerpt. The Semantic Web aims at enhancing data and service retrieval on the Web using meta-data and automated reasoning. Meta-data on the Semantic Web is heterogeneous: several formalisms have been proposed, e.g., RDF, Topic Maps and OWL, and some of these formalisms already have a large number of syntactic variants. Like Web data, Web meta-data will be highly distributed. Thus, meta-data retrieval for Semantic Web applications will most likely call for query languages similar to those developed for the standard Web. This paper presents a demonstration of a visual query language for the Web and Semantic Web called visXcerpt. visXcerpt is based on three main principles. First, visXcerpt has been conceived for querying not only Web meta-data, but also all kinds of Web data. The reason is that many Semantic Web applications will most likely refer to both standard Web and Semantic Web data, i.e., to Web data and Web meta-data. Using a single query language well suited to data of both kinds is preferable to using different languages, for it reduces the programming effort, and hence costs, and avoids mismatches resulting from interoperating languages. Second, visXcerpt is a query language capable of inference. The inferences visXcerpt can perform are limited to simple inferences such as those needed in querying database views, in logic programming, and in usual forms of Semantic Web reasoning. Offering both inference and querying in the same language avoids, e.g., the impedance mismatch that commonly arises when querying and inferencing are performed in different processes. Third, visXcerpt has been conceived as a mere Hypertext rendering of a textual query language. This approach to developing a visual language is entirely new and has several advantages. It results in a visual language tightly connected to a textual language, namely the textual language it is a rendering of. This tight connection makes it possible to use both the visual and the textual language in the development of applications. Last but not least, a visual query language conceived as a Hypertext application is especially accessible to Web and Semantic Web application developers. Further principles of visXcerpt are as follows. visXcerpt is rule-based. visXcerpt is referentially transparent and answer-closed. Answers to visXcerpt queries can be arbitrary XML data. visXcerpt uses (like the celebrated visual database query language QBE) patterns for binding variables in query expressions instead of path expressions, as do, e.g., the Web query languages XQuery and XSLT. visXcerpt keeps queries and constructions separated.



Language Visualization as Hypertext Rendering. XML, and hence XML-based modelling languages for the Semantic Web such as RDF, Topic Maps, and OWL, are visualized in visXcerpt as nested, labeled boxes, each box representing an XML element. Graph structures are represented using hyperlinks. Colors are used to convey the nesting depth of XML elements. Since visXcerpt's query and construction patterns can be seen as samples, the same visualization can be used for query and construction patterns. This makes visXcerpt's visualization of queries and answer constructions very close to the visualization of the data the queries and answer constructions refer to. visXcerpt has interactive features that help in quickly understanding large programs: boxes representing XML elements can be folded and unfolded, and semantically related portions of programs (e.g. different occurrences of the same variable) can be highlighted. visXcerpt programs can be composed using a novel copy-and-paste paradigm specifically designed for tree (or term) editing. Patterns are provided as templates to support easy construction of visXcerpt programs without in-depth prior knowledge of visXcerpt's syntax. Today's Web standards together with Web browsers offer an ideal basis for the implementation of a language such as visXcerpt. The visXcerpt prototype demonstrated is implemented using only well-established techniques such as CSS, ECMAScript, and XSL and, of course, the run-time system of the textual query language Xcerpt [SB04] (cf. http://xcerpt.org).

Demonstrated Application. The application used for demonstrating visXcerpt is based on data inspired by "Friend of a Friend" (cf. http://xmlns.com/foaf/0.1/), expressed in various formats, including plain XML and RDF formats. The demonstration illustrates the following aspects of the visual query language visXcerpt. Standard Web and Semantic Web data can be retrieved using the same visual query language. Meta-data formatted in various Semantic Web formats are conveniently retrieved using visXcerpt. visXcerpt queries and answer constructions are expressed using patterns that are intuitive and easy to express (cf. [BBS03,BBSW03] for examples). visXcerpt uses hypertext features such as hyperlinks for following references forward and backward, as well as different renderings (such as hiding and showing of program components or XML elements), to help in screening large programs. Recursive visXcerpt programs are presented and evaluated, demonstrating that visXcerpt gives rise to a rather simple expression of transitive closures of Semantic Web relations and of recursive traversal of nested Web documents. This research has been funded within the 6th Framework Programme project REWERSE, number 506779 (cf. http://www.rewerse.net).

References
[BBS03] S. Berger, F. Bry, and S. Schaffert. A Visual Language for Web Querying and Reasoning. In Workshop on Principles and Practice of Semantic Web Reasoning, LNCS 2901, Springer Verlag, 2003.
[BBSW03] S. Berger, F. Bry, S. Schaffert, and C. Wieser. Xcerpt and visXcerpt: From Pattern-Based to Visual Querying of XML and Semistructured Data. In 29th Intl. Conference on Very Large Data Bases, 2003.
[SB04] S. Schaffert and F. Bry. Querying the Web Reconsidered: A Practical Introduction to Xcerpt. In Extreme Markup Languages, 2004.

Query Refinement by Relevance Feedback in an XML Retrieval System
Hanglin Pan, Anja Theobald, and Ralf Schenkel
Max-Planck-Institute for Computer Science, D-66123 Saarbrücken, Germany
{pan,atb,schenkel}@mpi-sb.mpg.de

1 Introduction
In recent years, ranked retrieval systems for heterogeneous XML data with both structural search conditions and keyword conditions have been developed for digital libraries, federations of scientific data repositories, and, hopefully, portions of the ultimate Web. These systems, such as XXL [2], are based on pre-defined similarity measures for atomic conditions (using index structures on contents, paths, and ontological relationships) and then use rank aggregation techniques to produce ranked result lists. An ontology can play a positive role in term expansion [2], improving average precision and recall in the INEX 2003 benchmark [3]. Because users lack information on the structure and terminology of the underlying diverse data sources, and because the (powerful) query language is complex, users often cannot avoid posing overly broad or overly narrow initial queries, and thus get either too many or too few results. For the user, it is more appropriate and easier to provide relevance judgments on the best results of an initial query execution and then have the query refined, either interactively or automatically by the system. This calls for applying relevance feedback technology in the new area of XML retrieval [1]. The key question is how to generate an appropriately refined query from a user's feedback in order to obtain more relevant results in the top-k result list. Our demonstration will show an approach for extracting user information needs by relevance feedback, maintaining more intelligent personal ontologies, clarifying uncertainties, reweighting atomic conditions, expanding queries, and automatically generating a refined query for the XML retrieval system XXL.

2 Stages of the Retrieval Process
a. Query Decomposition and Weight Initialization: A query is composed of weighted (i.e., differently important) atomic conditions, for example XML element content constraints, XML element name (tag) constraints, path pattern constraints, ontology similarity constraints, variable constraints, search space constraints, and output constraints. In the XXL system, each atomic condition has an initial weight. If some constraints are uncertain, we mark them with the operator '~'. Concrete examples are shown in the poster.
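The decomposition can be pictured with a small Python sketch; the class, the condition kinds, and the example query are invented for illustration and do not reflect the actual XXL data structures.

```python
# Illustrative sketch only (names are invented): a query decomposed into
# weighted atomic conditions, with '~' modeled as an "uncertain" flag.
from dataclasses import dataclass

@dataclass
class AtomicCondition:
    kind: str          # e.g. 'content', 'tag', 'path', 'ontology', ...
    expression: str    # the condition itself
    weight: float      # relative importance, tuned by later feedback
    uncertain: bool    # True if the condition was prefixed with '~'

# A hypothetical query about articles whose title is similar to "XML":
query = [
    AtomicCondition('tag',     'article',       weight=1.0, uncertain=False),
    AtomicCondition('path',    'article/title', weight=0.8, uncertain=False),
    AtomicCondition('content', '~ "XML"',       weight=0.6, uncertain=True),
]

# Normalize the initial weights so they sum to 1 before the first iteration.
total = sum(c.weight for c in query)
for c in query:
    c.weight /= total
print([round(c.weight, 2) for c in query])   # [0.42, 0.33, 0.25]
```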



b. Retrieval with Ontology-Based Similarity Computation: Content-index and path-index structures are pre-computed and used for the relevance-score evaluation of candidate result items. The global ontology index is built beforehand as a table of concepts from WordNet, and frequency-based correlations of concepts are computed statistically using large Web crawls. To enable efficient query refinement in the subsequent feedback iterations, we have a set of strategies for maintaining a query-specific personal ontology that is automatically generated from fragments of the global ontology. This is the source for further query term expansion as well as for ontological similarity computations.
c. Result Navigation and Feedback Capturing: The retrieved ranking list is visualized in a user-friendly way supporting zoom plus focus. Features such as group selection and re-ranking are supported in our system, which can capture richer feedback at various levels, i.e., at the content, path, and overall level.
d. Strategy Selection for Query Reweighting and Query Expansion: The strategy selection module chooses an appropriate rank aggregation function over atomic conditions for overall score computation. After each feedback iteration, tuning functions (e.g., the minimum weight algorithm or the average weight algorithm, as in [4]) are used to derive the relative importance among all atomic conditions and to update the personal ontology [1].
e. Adaptable Query Reformulation: Our system is adaptable through reweighting and expansion techniques. The open architecture allows us to easily add new rank aggregation functions, reweighting strategies, or expansion strategies.
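The following Python sketch shows, in a much simplified form, how per-condition scores can be aggregated with weights and how those weights might be shifted after a feedback iteration. It is not the XXL implementation and does not reproduce the minimum-weight or average-weight algorithms of [4]; the update rule, function names, and numbers are invented for illustration.

```python
# Illustrative sketch only: weighted rank aggregation over atomic-condition
# scores, and a much simplified feedback-driven reweighting step.

def aggregate(scores, weights):
    """Overall score of one result item: weighted average of its
    per-condition scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def reweight(weights, per_condition_scores, judgments, rate=0.2):
    """Shift weight towards conditions that score high on results the user
    marked relevant (judgment +1) and low on irrelevant ones (-1)."""
    new = list(weights)
    for scores, judged in zip(per_condition_scores, judgments):
        for i, s in enumerate(scores):
            new[i] += rate * judged * s
    total = sum(max(w, 0.0) for w in new)
    return [max(w, 0.0) / total for w in new]

weights = [0.4, 0.3, 0.3]                 # one weight per atomic condition
results = [[0.9, 0.2, 0.1],               # per-condition scores of result 1
           [0.1, 0.8, 0.7]]               # ... of result 2
feedback = [+1, -1]                       # user: result 1 relevant, 2 not

weights = reweight(weights, results, feedback)
print([round(w, 2) for w in weights])     # weight shifts towards condition 1
print(round(aggregate([0.9, 0.2, 0.1], weights), 2))
```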

3 Demonstration
The INEX 2003 benchmark [3] consists of a set of content-and-structure queries and content-only queries over 12117 journal articles. Each document in the result set of a query is assigned a relevance assessment score provided by human experts. We run our method on this data set to show the improvement of average precision and recall using relevance feedback with up to four iterations. Our baseline uses only ontology-based expansion [2]. We show a comparison between different strategies for rank aggregation, query reweighting, and expansion. We also show our approach to refining structural XML queries based on relevance feedback.

References
1. Hanglin Pan. Relevance feedback in XML retrieval. In: Proceedings of the ICDE/EDBT Joint Ph.D. Workshop, Boston, pages 193-202, March 2004. To appear in LNCS 3268, Current Trends in Database Technology, Springer, 2004.
2. Ralf Schenkel, Anja Theobald, and Gerhard Weikum. XXL@INEX2003. In: Proceedings of the 2003 INEX Workshop, Dagstuhl Castle, Germany, December 15-17, 2003.
3. Norbert Fuhr, Mounia Lalmas. Initiative for the evaluation of XML retrieval (INEX), 2003. http://inex.is.informatik.uni-duisburg.de:2003/.
4. Michael Ortega-Binderberger, Kaushik Chakrabarti, and Sharad Mehrotra. An approach to integrating query refinement in SQL. In: Proceedings of EDBT 2002, LNCS 2287, Advances in Database Technology, Springer, 2002.

Semantics Modeling for Spatiotemporal Databases
Peiquan Jin, Lihua Yue, and Yuchang Gong
Department of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, P.R. China

1 Introduction
How to model spatiotemporal changes is one of the key issues in research on spatiotemporal databases. Because of the shortcomings of previous spatiotemporal data models [1,2], none of them has been widely accepted so far. This paper investigates the types of spatiotemporal changes and an approach to describing them. The semantics of spatiotemporal changes are studied and a systematic classification of spatiotemporal changes is proposed, based on which a framework for a spatiotemporal semantic model is presented.

2 Semantic Modeling of Spatiotemporal Changes
The framework for modeling spatiotemporal changes is shown in Fig. 1 as an And/Or tree. Spatiotemporal changes are divided into object-level spatiotemporal changes, which result in changes of object identities, and attribute-level spatiotemporal changes, which do not change any object's identity but only the internal attributes of an object. Attribute-level spatiotemporal changes are spatial attribute changes or thematic attribute changes, described by the spatial descriptor and the attribute descriptor respectively, while object-level spatiotemporal changes are discrete identity changes, represented by the history topology. The modeling of spatiotemporal changes shown in Fig. 1 is complete; the proof can be found in [3].

Fig. 1. The framework for modeling spatiotemporal changes



3 A Framework of a Spatiotemporal Semantic Model
The framework of the spatiotemporal semantic model is shown in Fig. 2. The circle notation represents identity-level changes, and the triangle notation represents attribute-level changes. The attribute descriptor describes the time-varying thematic properties of a spatiotemporal object. The spatial descriptor represents its time-varying spatial value. The history topology, which represents identity-level changes, describes the life cycle of spatiotemporal objects, such as splits and merges. Thus a spatiotemporal object can be defined as a quadruple of object identity, spatial descriptor, attribute descriptor, and history topology, i.e., O = <OID, SD, AD, HT>. This structure can represent both spatiotemporal data and spatiotemporal changes: a static state of a spatiotemporal object is determined by feeding a definite time value into SD, AD, and HT, while a dynamic state during a period of time is obtained from SD, AD, and HT over that period.

Fig. 2. The spatiotemporal semantic model
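The quadruple and its snapshot semantics can be sketched in a few lines of Python; the class, the field layout, and the example object below are invented for illustration and are not the data structures of the model itself.

```python
# Illustrative sketch only (names invented): a spatiotemporal object as the
# quadruple O = <OID, SD, AD, HT> with a snapshot operation.
from dataclasses import dataclass, field

@dataclass
class SpatioTemporalObject:
    oid: str
    sd: dict = field(default_factory=dict)   # spatial descriptor: time -> geometry
    ad: dict = field(default_factory=dict)   # attribute descriptor: time -> thematic values
    ht: list = field(default_factory=list)   # history topology: identity-level events

    def _at(self, versions, t):
        """Value of a time-varying descriptor at time t (latest version <= t)."""
        valid = [v for v in sorted(versions) if v <= t]
        return versions[valid[-1]] if valid else None

    def snapshot(self, t):
        """Static state of the object at a definite time value t."""
        return {
            'oid': self.oid,
            'spatial': self._at(self.sd, t),
            'thematic': self._at(self.ad, t),
            'events_so_far': [e for e in self.ht if e[0] <= t],
        }

land = SpatioTemporalObject(
    oid='parcel-17',
    sd={2000: 'POLYGON A', 2003: "POLYGON A'"},
    ad={2000: {'owner': 'Li'}, 2002: {'owner': 'Wang'}},
    ht=[(2003, 'split', ['parcel-17a', 'parcel-17b'])],
)
print(land.snapshot(2002))
```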

References
1. Xiaoyu, W., Xiaofang, Z.: Spatiotemporal data modeling and management: a survey. In Proceedings of the 36th TOOLS Conference, IEEE Press (2000) 202-211
2. Forlizzi, L., Güting, R., Nardelli, E., Schneider, M.: A data model and data structures for moving objects databases. ACM SIGMOD (2000) 319-330
3. Peiquan, J., Lihua, Y., Yuchang, G.: Semantics and modeling of spatiotemporal changes. LNCS 2888, CoopIS/DOA/ODBASE, Springer-Verlag, Berlin (2003) 924-933

Temporal Information Management Using XML
Fusheng Wang, Xin Zhou, and Carlo Zaniolo
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
{wangfsh,xinzhou,zaniolo}@cs.ucla.edu

1 Introduction
A closer integration of XML and database systems is actively pursued by researchers and vendors because of the many practical benefits it offers. Additional benefits can be achieved for temporal information management, an important application area that represents an unsolved challenge for relational databases [1]. Indeed, the XML data model and its query languages support (i) temporally grouped representations, which have long been recognized as a natural data model for historical information [2], and (ii) Turing-complete query languages, such as XQuery [3], in which all the constructs needed for temporal queries can be introduced as user-defined libraries, without requiring extensions to existing standards. By contrast, the flat relational tables of traditional DBMSs are not well suited to temporally grouped representations [4]; moreover, significant extensions are required to support temporal information in SQL and, in the past, such extensions were poorly received by the SQL standard committees. We will show that (i) the XML hierarchical structure can naturally represent the history of databases and XML documents via temporally grouped data models, and (ii) powerful temporal queries can be expressed in XQuery without requiring any extension to current standards. This approach is quite general and, in addition to the evolution history of databases, it can be used to support the version history of XML documents for transaction-time, valid-time, and bitemporal chronicles [5]. We will demo the queries discussed in [5] and show that this approach leads to simple programming environments that are fully integrated with current XML tools and commercial DBMSs.

2 The Systems ArchIS and ICAP

In our demo, we first show that the transaction-time history of relational databases can be effectively published as XML views, where complex temporal queries on the evolution of database relations can be expressed in standard XQuery [6]. We then demonstrate our ArchIS prototype, which supports these queries efficiently on traditional database systems enhanced with SQL/XML [7].



A temporal library of XQuery functions is used to facilitate the writing of the more complex queries and to hide some implementation details (e.g., the internal representation of 'now'). We can thus support the complete gamut of historical queries, including snapshot and time-slicing queries, element-history queries, and since and until queries. These temporal queries in XQuery are then mapped to and executed as equivalent SQL/XML queries on the RDBMS. The next topic in the demo is the application of our temporal representations and queries to XML documents of arbitrary nesting complexity. In the ICAP project [8], we store the version history of documents of public interest in ways that ensure that powerful historical queries can be easily expressed and supported. Examples include successive versions of standards and normative documents, such as the UCLA course catalog [9] and the W3C XLink specification [10], which are issued in XML form. Toward this objective, (i) we use structured diff algorithms [11-14] to compute the validity periods of the elements in the multi-version document, (ii) we use the output generated by the diff algorithm to build a concise history representation of the document using a temporally grouped data model, and (iii) on this representation, we use XQuery, enhanced with the library of temporal functions, to formulate temporal queries on the evolution of the document and its content.

The ICAP system also provides additional version-support services, including the ability to color-mark changes between versions and to annotate the changes with explanations and useful meta-information.
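The temporally grouped representation used above can be illustrated with a small Python sketch that coalesces successive snapshots of one attribute into maximal validity periods and then answers a time-slice query over them. It is only a sketch: the 'now' marker, the function names, and the salary example are invented and do not reflect the actual encoding used by ArchIS or ICAP.

```python
# Illustrative sketch only: building a temporally grouped history of one
# attribute from successive transaction-time snapshots.
NOW = '9999-12-31'   # invented internal representation of 'now'

def grouped_history(snapshots):
    """snapshots: list of (timestamp, value), ordered by time.
    Returns one history entry per maximal period with a constant value,
    i.e. a temporally grouped representation of the attribute."""
    history = []
    for ts, value in snapshots:
        if history and history[-1]['value'] == value:
            continue                       # value unchanged, period extends
        if history:
            history[-1]['end'] = ts        # close the previous period
        history.append({'value': value, 'start': ts, 'end': NOW})
    return history

salary = [('2001-01-01', 60000), ('2002-01-01', 60000), ('2003-06-01', 70000)]
for period in grouped_history(salary):
    print(period)

# A snapshot ("time-slice") query then just selects the period containing t:
def time_slice(history, t):
    return next(p['value'] for p in history if p['start'] <= t < p['end'])

print(time_slice(grouped_history(salary), '2002-07-15'))   # 60000
```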

References
1. G. Ozsoyoglu and R.T. Snodgrass, "Temporal and real-time databases: A survey", in TKDE, 7(4):513-532, 1995.
2. J. Clifford. Formal Semantics and Pragmatics for Natural Language Querying. Cambridge University Press, 1990.
3. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery/.
4. J. Clifford, A. Croker, F. Grandi, and A. Tuzhilin, "On Temporal Grouping", in Proc. of the Intl. Workshop on Temporal Databases, 1995.
5. F. Wang and C. Zaniolo, "XBiT: An XML-based Bitemporal Data Model", in ER 2004.
6. F. Wang and C. Zaniolo, "Publishing and Querying the Histories of Archived Relational Databases in XML", in WISE 2003.
7. SQL/XML. http://www.sqlx.org.
8. UCLA ICAP Project. http://wis.cs.ucla.edu/projects/icap/.
9. UCLA Catalog. http://www.registrar.ucla.edu/catalog/.
10. XML Linking Language (XLink). http://www.w3.org/TR/xlink/.
11. S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom, "Change Detection in Hierarchically Structured Information", in SIGMOD 1996.
12. Microsoft XML Diff. http://apps.gotdotnet.com/xmltools/xmldiff/.
13. Gregory Cobena, Serge Abiteboul, Amelie Marian, "Detecting Changes in XML Documents", in ICDE 2002.
14. Y. Wang, D. J. DeWitt, and J. Cai, "X-Diff: A Fast Change Detection Algorithm for XML Documents", in ICDE 2003.

SVMgr: A Tool for the Management of Schema Versioning
Fabio Grandi
Alma Mater Studiorum – Università di Bologna, Dipartimento di Elettronica, Informatica e Sistemistica, Viale Risorgimento 2, I-40136 Bologna, Italy

1 Overview of the SVMgr Tool

The SVMgr tool is an integrated development environment for the management of a relational database supporting schema versioning, based on the multi-pool implementation solution [2]. In a few words, the multi-pool solution allows the extensional data connected to each schema version (its data pool) to evolve independently of the others. The multi-pool solution is more flexible and potentially more useful for advanced applications, as it allows the coexistence of different full-fledged conceptual viewpoints on the mini-world modeled by the database [5], and it has been partially adopted also by other authors [3]. The multi-pool implementation underlying the SVMgr tool is based on the Logical Storage Model presented in [4] and allows the underlying multi-version database to be implemented on top of MS Access. The software prototype has been written in Java (it is backward compatible with Java version 1.2) and interacts with the underlying database via JDBC/ODBC on an MS Windows platform. In order to show the multi-pool approach in practice and test its potential against applications, SVMgr has been equipped with a multi-schema query interface, initially supporting select-project-join queries written in the Multi-Schema Query Language MSQL [4,5]. Hence, the SVMgr prototype is the first implemented relational database system with schema versioning support that is able to answer multi-schema queries. The MSQL language includes two syntactic extensions for referring to different schema versions: naming qualifiers and extensional qualifiers [4,5]. The former allow users to denote a schema object (e.g. an attribute or relation name) through the name it has in a given schema version: for instance, "[SV1:R]" denotes the relation named R in schema version SV1. The latter allow users to denote object values as stored in different data pools: for instance, "SV2:S" denotes the instance of relation S in the data pool connected to schema version SV2. Recently, the MSQL language has been further extended with grouping and ordering facilities: multi-schema extensions of the SQL GROUP BY, HAVING, and ORDER BY clauses and of the SQL aggregate functions are fully supported by the SVMgr tool in its current release. The SVMgr environment, in its current release (ver. 13.02, as of July 2004), supports four main groups of functions:



Database Content Management functions, which allow users to inspect and modify the contents of the data pools associated with the different schema versions. Integrity checks on primary-key uniqueness (which may be threatened by the execution of schema changes) can also be carried out.
Schema Version Management functions, which allow users to perform schema changes and create a new schema version. Supported schema changes are: add, rename, or drop a table; add, rename, or drop a table column; change the primary key of a table.
Integration Support Tools, which support the integration activities [1] when the underlying database is used as a data source in a heterogeneous environment.
Multi-schema Queries, which allow users to execute multi-schema SPJ queries through an MSQL query-language interface. MSQL queries are translated into standard SQL queries, which are executed via JDBC on the underlying database implementing the Logical Storage Model (a sketch of such a translation is given after this list).
Users of the tool are expected to be database administrators, who have complete control over, and responsibility for, the database schema and contents, including the management of schema versions. Although SVMgr users are expected to have a reasonable knowledge of the main features of schema versioning with the multi-pool implementation solution, the prototype has been equipped with a user-friendly interface and requires only minimal knowledge of the underlying data model beyond intuition. Several correctness checks have also been carefully encoded in all the available user functions to protect the database integrity, as much as possible, from incorrect use.
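As a toy illustration of how MSQL-style qualifiers could be resolved against a schema-version catalog before emitting standard SQL, consider the following Python sketch. It is not the SVMgr translator: the catalog contents, the physical table-naming scheme, and the helper functions are invented, and the semantic distinction between naming and extensional qualifiers is deliberately glossed over.

```python
# Illustrative sketch only: resolving MSQL-style qualifiers such as
# [SV1:R] (naming qualifier) or SV2:S (extensional qualifier) against an
# invented catalog, then emitting plain SQL over the physical tables.
catalog = {
    # (schema_version, logical_name) -> physical table in the data pool
    ('SV1', 'R'): 'pool_sv1_r',
    ('SV2', 'R'): 'pool_sv2_employees',   # R was renamed in SV2
    ('SV2', 'S'): 'pool_sv2_s',
}

def resolve(qualified_name):
    """Map a qualified relation name to the physical table that stores it."""
    sv, name = qualified_name.strip('[]').split(':')
    return catalog[(sv, name)]

def translate(select, from_items, where=''):
    """Build a standard SQL SELECT from MSQL-style qualified relations."""
    tables = ', '.join(f'{resolve(q)} AS {q.strip("[]").split(":")[1]}'
                       for q in from_items)
    sql = f'SELECT {select} FROM {tables}'
    return sql + (f' WHERE {where}' if where else '')

print(translate('R.name, S.salary', ['[SV1:R]', 'SV2:S'], 'R.id = S.id'))
# SELECT R.name, S.salary FROM pool_sv1_r AS R, pool_sv2_s AS S WHERE R.id = S.id
```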

References
1. S. Bergamaschi, S. Castano, A. Ferrara, F. Grandi, F. Guerra, G. Ornetti, M. Vincini, Description of the Methodology for the Integration of Strongly Heterogeneous Sources, Tech. Rep. D1.R6, D2I Project, 2002, http://www.dis.uniroma1.it/~lembo/D2I/Prodotti/index.html.
2. C. De Castro, F. Grandi, and M. R. Scalas. Schema Versioning for Multitemporal Relational Databases. Information Systems, 22(5):249-290, 1997.
3. R. de Matos Galante, A. Bueno da Silva Roma, A. Jantsch, N. Edelweiss, and C. Saraiva dos Santos. Dynamic Schema Evolution Management Using Version in Temporal Object-Oriented Databases. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002), pages 524-533, Aix-en-Provence, France, September 2002. Springer Verlag.
4. F. Grandi, "A Relational Multi-Schema Data Model and Query Language for Full Support of Schema Versioning", Proc. of SEBD 2002, Portoferraio – Isola d'Elba, Italy, pp. 323-336, 2002.
5. F. Grandi, "Boosting the Schema Versioning Potentialities: Querying with MSQL in the Multi-Pool Approach", 2004 (in preparation).

GENNERE: A Generic Epidemiological Network for Nephrology and Rheumatology
Ana Simonet1, Michel Simonet1, Cyr-Gabin Bassolet1, Sylvain Ferriol1, Cédric Gueydan1, Rémi Patriarche1, Haijin Yu2, Ping Hao2, Yi Liu2, Wen Zhang2, Nan Chen2, Michel Forêt5, Philippe Gaudin4, Georges De Moor6, Geert Thienpont6, Mohamed Ben Saïd3, Paul Landais3, and Didier Guillon5
1 Laboratoire TIMC-IMAG, Faculté de Médecine de Grenoble, France, {Ana.Simonet, Michel.Simonet}@imag.fr
2 Rui Jin Hospital, Shanghai, China
3 LBIM, Université Paris 5 Necker, France
4 CHU Grenoble, France
5 AGDUC Grenoble, France
6 RAMIT, Belgium

Abstract. GENNERE is a networked information system designed to meet epidemiological needs. Based on a French experiment in the field of End-Stage Renal Disease (ESRD), it has been designed so that it can be adapted to Chinese medical needs and administrative rules. It has been implemented for nephrology and rheumatology at the Rui Jin hospital in Shanghai, but its design and implementation have been guided by genericity in order to ease its adaptation and extension to other diseases and other countries. The genericity aspects have been considered at the levels of event design, database design and production, and software design. This first experiment in China leads to some conclusions about the adaptability of the system to several diseases and about multilinguality in the interface and in medical terminologies.

1 Introduction
GENNERE, for Generic Epidemiological Network for Nephrology and Rheumatology, is a project supported by the European ASIA-ITC program in 2003-2004. The project arose from a cooperation established between partners of the French MSIS-REIN project [1] and the Rui Jin hospital in Shanghai. The main goal of the project is the setting up of a system for the epidemiological follow-up of chronic diseases, following the MSIS-REIN approach but adapted to China's needs [2]. A secondary objective was to find a methodology for designing and implementing software that supports easy adaptation to other chronic diseases and to other countries. To demonstrate the generic aspects emphasized in the GENNERE program, two medical fields have been considered in this experiment: nephrology (as in MSIS-REIN) and rheumatology. In this demonstration we present the GENNERE systems, with an emphasis on the genericity aspects of their design and implementation.

2 The GENNERE Project
Genericity, which is the major non-medical objective of the GENNERE project, has been considered mainly at two levels: the design process and the software implementation.



From the design point of view, the methodology followed was: 1) to identify the events and functionalities by category of users, 2) to define the abstract ontological model for chronic diseases, 3) to isolate the data specific to a country, and 4) to design the interface blocks.
Events. In GENNERE the events for nephrology are the same as the events taken into account in the French MSIS-REIN project. The events model for rheumatology is very similar to that for nephrology, which demonstrates the possibility of constituting a core of minimal events for this kind of system.
Database design. For database design we used the CASE tool ISIS (Information System Initial Specification) developed at the TIMC laboratory [3]. It enables the designer to work at the conceptual level and to specify behavioral aspects rather than implement them directly at the level of relational tables. Moreover, thanks to its ability to check the consistency of a specification and to automatically generate the database (logical and physical schema), the ISIS system has considerably shortened the cycle of knowledge extraction from medical experts and its validation by users.
Views. To minimize the work required when the database schema is modified, we have been careful to access the database only through views, except for some database updates that were too complex to be supported by the DBMS.
Ontological data model. As in the French model, the core of the conceptual model is centered on three generic concepts: PATIENT, FOLLOW-UP and TREATMENT. However, each concept is derived according to a disease-specific structure. For example, the TREATMENT concept is much more complex for Rheumatoid Arthritis than for ESRD, because this illness has several anatomic localizations and can be treated by several means: medicines, local manipulations, or traditional Chinese medicine.
Country-specific data. The setting up of the GENNERE system has brought to light the data that are specific to a given country, e.g., patient addresses and insurance, which depend strongly on the geographical and administrative organization of the country; another example is traditional Chinese medicine (acupuncture, herbs, massages, etc.). These categories of data are highly country-specific and will have to be studied anew whenever a new country is considered.
Multilingualism. Dealing simultaneously with several languages, while keeping open the possibility of adding a new language, imposes the choice of an encoding system that supports a wide variety of languages [4]. In GENNERE, which must support Chinese, English and French, the choice was UTF-8 Unicode, which also supports most known languages, including other Asian languages.
Metadata. To present multilingual interfaces without duplicating the interface code, we built a specific database for the management of the metadata of the GENNERE system (nephrology and rheumatology). This database contains the description of the database objects: tables, columns, and domain values (for enumerated attributes). It also contains the objects necessary for the various interfaces, e.g., labels used in the graphical user interface. Each object has a unique identifier



and is associated with as many items as there are supported languages: Chinese, English and French at present. Thus, a value or a label can be rapidly retrieved according to the selected language. This method also facilitates the addition of a new language: one must only fill the metadata tables with the corresponding items in that language.
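A label lookup of this kind can be sketched in a few lines of Python over an in-memory SQLite database. The table layout, identifiers, and sample labels below are invented for illustration and do not reproduce the actual GENNERE metadata schema.

```python
# Illustrative sketch only (table layout and identifiers are invented):
# looking up an interface label by metadata-object id and language,
# in the spirit of the GENNERE metadata database (zh/en/fr).
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE metadata_object (obj_id INTEGER PRIMARY KEY, kind TEXT);
CREATE TABLE metadata_label (obj_id INTEGER, lang TEXT, label TEXT,
                             PRIMARY KEY (obj_id, lang));
""")
conn.executemany("INSERT INTO metadata_object VALUES (?, ?)",
                 [(1, 'column'), (2, 'gui_label')])
conn.executemany("INSERT INTO metadata_label VALUES (?, ?, ?)", [
    (1, 'en', 'Patient name'), (1, 'fr', 'Nom du patient'), (1, 'zh', '患者姓名'),
    (2, 'en', 'Follow-up'),    (2, 'fr', 'Suivi'),          (2, 'zh', '随访'),
])

def label(obj_id, lang, fallback='en'):
    """Label of a metadata object in the selected language, falling back
    to a default language if no translation has been entered yet."""
    row = conn.execute("SELECT label FROM metadata_label "
                       "WHERE obj_id = ? AND lang = ?",
                       (obj_id, lang)).fetchone()
    if row is None and lang != fallback:
        return label(obj_id, fallback, fallback)
    return row[0] if row else None

print(label(1, 'zh'))   # 患者姓名
print(label(2, 'de'))   # Follow-up (falls back to English)
```

Adding a new language then amounts to inserting one row per metadata object, with no change to the interface code.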

3 Conclusion
Genericity has been a major concern in the design and the implementation of the GENNERE information system. The project showed us the need for tools supporting cooperative work, especially when the team in charge of conceiving and developing the project is culturally, socially and geographically heterogeneous. This led to several cycles of knowledge extraction and validation in order to reach a consensus. The CASE tool ISIS [3] played a very important role in reaching consensus more rapidly among the partners. For the success of MSIS-REIN the human factor was decisive. Besides technological improvements, didactic efforts have been made, and are still necessary, to help users and to identify impediments to change. This factor is at least as important in China. The GENNERE program, in its final development phase, will be installed at the Rui Jin hospital in Shanghai at the end of 2004 on two servers, one for nephrology and one for rheumatology. During the next phase (2005-2006), data integration from other centres will be carried out and a data warehouse will be implemented, also in a generic way, along with epidemiological and data-presentation tools [5]. A Geographical Information System, currently under implementation in France, seems very promising for supporting public health decisions and will be considered in a later phase [6].

References
1. Landais P., Simonet A., et le groupe de travail SIMS@REIN. SIMS@REIN: Un Système d'Information Multisources pour l'Insuffisance Rénale Terminale. CR Acad Sci (série III, Sciences de la Vie). 2002; 325: 515-528
2. Chen N., et al. Shanghai Cooperation group: The clinical epidemiology of cardiovascular disease in chronic renal failure in Shanghai. Chin. Journal of Nephrology 2001; 17: 91-94
3. Simonet A., Simonet M.: The ISIS Methodology for Object Database Conceptual Modelling. Poster, ER'99: 18th Conference on Conceptual Modeling, Paris, 15-18 Nov. 1999
4. Kumaran A., Haritsa J. R.: On the costs of multilingualism in database systems. Proc. of the VLDB Conference, Berlin, Germany, 2003
5. Simonet A. et al.: Un entrepôt de données pour l'aide à la décision sanitaire en néphrologie. Revue des Sciences et Technologies de l'Information, série Ingénierie des Systèmes d'Information, Vol. 8, No. 1/2003, pp. 75-89, HERMES
6. Simonet M., Toubiana L., Simonet A., Ben Said M., Landais P.: Ontology and Geographical Information System for End-Stage Renal Disease: the SIGNE. Workshop on Fundamental Issues on Geographic and Spatial Ontologies, held with COSIT'03 (Conference on Spatial Information Theory), Ittingen, 23 Sept. 2003

Panel: Beyond Webservices – Conceptual Modelling for Service Oriented Architectures
Peter Fankhauser
Fraunhofer IPSI, 64293 Darmstadt, Germany

Abstract. Webservices are evolving into the paradigm for loosely coupled architectures. The prospect of automatically composing complex processes from simple services is promising. However, a number of open issues remain: Which aspects of service semantics need to be made explicit? Does it suffice to model only data structures and interfaces, or do we also need process descriptions, behavioral semantics, and quality-of-service specifications? How can we deal with heterogeneous service descriptions? Should we use shared ontologies or ad hoc mappings? This panel discusses to what extent established techniques from conceptual modelling can help in describing services so as to enable their discovery, selection, composition, negotiation, and invocation.



