
ONTOLOGY BASED DATA WAREHOUSING FOR IMPROVING TOURISTIC WEB SITES

Alberto Salguero, Francisco Araque, Cecilia Delgado Department of Computer Languages and Systems - University of Granada

C/ Periodista Daniel Saucedo Aranda s/n, 18071, Granada (Andalucía), Spain

ABSTRACT

The World Wide Web (WWW) is continuously evolving and its information is dispersed. It is not always easy for users to find the information they are looking for. By means of a Data Warehouse approach we store and integrate some of the interesting tourist information available on the WWW. This information is used to expand the information of web pages while the user navigates, by means of a Firefox plug-in. The Data Warehouse architecture has been designed using an ontology approach, so the plug-in is able to perform some reasoning about the relevant information to display.

KEYWORDS

Data Warehouse, tourism, ontology, e-business, World Wide Web.

1. INTRODUCTION

There is an increasing number of Web sites that can be queried across the WWW. Such data sources typically support HTML forms-based interfaces, and search engines query collections of suitably indexed data. One drawback of these data sources is that the information is not well structured and is usually volatile. Structured objects have to be extracted from HTML documents that also contain irrelevant data.

One of the main problems of using the Web as a data source is that all web sites are developed and managed independently. Every organization is responsible for defining the schema of its data as well as its representation. This implies the need for a software layer that integrates the data coming from all the web data sources. The ability to integrate data from a wide range of data sources is an important field of research in data engineering. Data integration is a prominent theme in many areas; it enables widely distributed, heterogeneous, dynamic collections of information sources to be accessed and handled.

The Data Warehouse (DW) approach is usually selected in business environments as the best solution to store and integrate all the information coming from independent data sources. The DW architecture is designed to simplify and enhance the querying and analysis of its data. Web information sources usually have their own information delivery schedules (Watanabe et al., 2001). Generally, enterprises and organizations develop systems that continuously poll the sources to enable (near) real-time change capture and loading. This approach is not efficient and can produce overload problems if many sources have to be queried. It is more efficient to poll the web sites only when needed. To address this problem, we propose a system which allows distributed information monitoring of web data sources on the WWW. The approach relies on monitoring information distributed over different resources and alerting the user (in our case the DW refreshment process) when certain conditions regarding this information, expressed as temporal properties, are satisfied.
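The condition-triggered monitoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names are ours, and each source is assumed to declare a minimum update period so that it is polled only when new data may exist, with the refreshment process alerted only when the content actually changed.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class WebSource:
    """A monitored web data source (hypothetical model, names assumed)."""
    name: str
    min_update_period: float          # seconds between possible changes at the source
    fetch: Callable[[], str]          # returns the current content (e.g. an HTML page)
    last_polled: float = field(default=float("-inf"))
    last_content: str = ""

def poll_due_sources(sources: List[WebSource],
                     on_change: Callable[[WebSource], None],
                     now: float) -> None:
    """Poll only the sources whose declared update period has elapsed, and
    alert the refreshment process (on_change) when the content has changed."""
    for src in sources:
        if now - src.last_polled < src.min_update_period:
            continue                  # no new data can exist yet; skip the request
        content = src.fetch()
        src.last_polled = now
        if content != src.last_content:
            src.last_content = content
            on_change(src)            # temporal condition satisfied: trigger refreshment
```

In contrast to continuous polling, a source declaring, say, a 10-second update period is contacted at most once per period, however often the monitor runs.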

We apply this DW approach to retrieve and integrate interesting touristic data from the Web. Tourism is a prominent area in electronic commerce. However, the growth of the on-line tourism market has not been as fast as previously expected (Davidson & Yu, 2005). As pointed out by Lexhagen (2005), tourism businesses should try to develop more value-added services. The goal is to build up strong customer relationships and loyalty, which may encourage continuous buying behavior. Some examples of ICT value-added services that a tourism enterprise can offer are the automatic categorization of user travel preferences in order to match them with travel options (Galindo et al., 2002), search engine interface metaphors for trip planning (Xiang & Fesenmaier, 2005) and semantic brokering systems (Antoniou et al., 2005).

ISBN: 978-972-8924-66-9 © 2008 IADIS

The use of a DW has been proposed previously in the tourism field. In (Haller et al., 2000) the problem of integrating heterogeneous tourism information data sources is addressed using a three-tier architecture. As a step forward, in this paper we describe how new information technology techniques such as data warehousing and ontologies can be used in the electronic commerce tourism industry to create value-added services.

An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. Ontologies allow the use of automatic reasoning methods. The use of a data model based on ontologies is proposed as a common data model (CDM) to deal with the integration of the data source schemas. Although this is not the first time the ontology model has been proposed for this purpose (Skotas & Simitsis, 2006), in this case the work has focused on the integration of spatio-temporal data. Moreover, to our knowledge this is the first time the metadata storage capabilities of some ontology definition languages have been used to improve the design of the DW data refreshment process.

The system proposed in this paper is able to extend the touristic information in web pages by incorporating the knowledge in the DW. To do so we have developed a plug-in for the Firefox browser with access to the data in the DW through an ontology inference engine, allowing the reasoning capabilities of ontologies to be exploited.

The remainder of this paper is organized as follows. In section 2, some basic concepts are revised; in section 3 our architecture is presented; in section 4 the web augmentation process is detailed; finally, section 5 summarizes the conclusions of this paper.

2. PRELIMINARIES

The proposed system basically makes use of two technologies: Data Warehousing and Ontologies. Both are introduced in this section.

2.1 Data Warehouse

Inmon (2002) defined a Data Warehouse as “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process.” A DW is a database that stores a copy of operational data with a structure optimized for query and analysis. Its scope is one of the issues that defines it: the scope of a DW is the entire enterprise. For a more limited scope, a new concept is defined: a Data Mart (DM) is a highly focused DW covering a single department or subject area. DWs and data marts are usually implemented using relational databases (Harinarayan et al., 1996) which define multidimensional structures. A federated database system (FDBS) is formed by different component database systems and provides integrated access to them: they co-operate (inter-operate) with each other to produce consolidated answers to the queries defined over the FDBS. Generally, the FDBS has no data of its own, as the DW has; queries are answered in the FDBS by accessing the component database systems.

We have extended the Sheth & Larson five-level FDBS architecture (Sheth & Larson, 1990), which is very general and encompasses most of the previously existing architectures. In this architecture three types of data models are used: first, each component database can have its own native model; second, a canonical data model (CDM), which is adopted in the FDBS; and third, external schemas can be defined in different user models.

One of the fundamental characteristics of a DW is its temporal dimension, so the schema of the warehouse has to be able to reflect the temporal properties of the data. The mechanisms for extracting this kind of data from operational systems will also be important. In order to carry out the integration process, it is necessary to transfer the data of the data sources, probably specified in different data models, to a common data model, which is then used to design the schema of the warehouse. In our case, we have decided to use an ontological model as the canonical data model.

IADIS International Conference e-Commerce 2008


[Figure 1 shows data sources (operational databases and external sources) feeding an Extract/Transform/Load and Refresh stage into the Data Warehouse and its Data Marts, which are served to OLAP servers, data mining and query/reporting tools; a metadata repository and monitoring & administration tools complete the architecture.]

Figure 1. A generic DW architecture.

2.2 Ontologies

An ontology is the specification of a conceptualisation of a knowledge domain. It is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. Ontologies allow the use of automatic reasoning methods. OWL is a language for defining ontologies. OWL ontologies may be categorized into three species or sub-languages: OWL-Lite, OWL-DL and OWL-Full.

OWL-Lite is the syntactically simplest sub-language. It is intended for situations where only a simple class hierarchy and simple constraints are needed. OWL-Full is the most expressive OWL sub-language, but decidability and computational completeness cannot be guaranteed for it, so it is not always possible to perform automated reasoning on it. OWL-DL is much more expressive than OWL-Lite and allows the use of automated reasoning by restricting some OWL-Full constructions.
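To make the kind of inference at stake concrete, the following is a toy illustration, not a real OWL reasoner: a transitive subsumption check over an explicit subclass graph, the sort of entailment an OWL-DL reasoner computes decidably (the class names are invented for the example).

```python
from typing import Dict, Optional

# Hypothetical touristic class hierarchy (rdfs:subClassOf edges).
SUBCLASS_OF: Dict[str, str] = {
    "Monument": "Attraction",
    "Attraction": "TouristicResource",
    "Hotel": "TouristicResource",
}

def is_subclass(cls: Optional[str], ancestor: str) -> bool:
    """Transitive subclass check: walk the subClassOf chain upwards."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False
```

A real reasoner handles far more (property restrictions, equivalence, consistency checking), but the subsumption closure above is the basic service the plug-in relies on when it maps an instance to all the classes it belongs to.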

We have extended OWL with temporal and spatial elements; we call this extension STOWL. This ontology-oriented model, enhanced with temporal and spatial features, is used as the common data model in the DW, both for the definition of the data warehouse schema and for defining the refreshment of the data warehouse.

This model focuses on the integration of temporal and spatial information. This kind of information has grown in importance in recent years due to the proliferation of GIS-based applications and the global positioning system (GPS). Increasingly, companies rely on such information to develop their business or enhance their productivity. The model is restricted to spatio-temporal information in order to reduce the complexity of the integration possibilities: it is easier to perform the integration if the working environment is considerably reduced. It would be impractical to propose a model that considers all aspects of the real world. Moreover, we believe that this type of information is general enough to solve most of the problems found nowadays, especially touristic ones.
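The paper does not publish STOWL's concrete syntax, so the following sketch only illustrates the kind of annotation such a model attaches to an instance: a valid-time interval plus a location, with a simple temporal-overlap test in the style of Allen's interval relations. All names here are assumptions of ours.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SpatioTemporalFact:
    """Illustrative only: an ontology instance with the valid-time interval
    and location a STOWL-like model would attach (field names assumed)."""
    concept: str                   # STOWL class, e.g. "Monument"
    instance: str                  # e.g. "Alhambra"
    valid_from: str                # ISO-8601 dates: lexicographic == chronological
    valid_to: str
    location: Tuple[float, float]  # (latitude, longitude)

def temporally_overlaps(a: SpatioTemporalFact, b: SpatioTemporalFact) -> bool:
    """True when the two valid-time intervals intersect."""
    return a.valid_from < b.valid_to and b.valid_from < a.valid_to
```

A query such as "events taking place while a monument is open" reduces to this overlap test plus a spatial predicate on the locations.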

3. SYSTEM ARCHITECTURE

Taking (Araque et al., 2006) as a point of departure, we propose the reference architecture in figure 2. In this figure, the data flow as well as the metadata flow are illustrated. The metadata flow represents how all the data that refer to the data (i.e. the schemas of the data sources, the rules for integrating the data…) are propagated through the processing units of the architecture. These processing units have been designed to be independent of each other. As explained in section 2, the architecture extends the Sheth & Larson FDBS base architecture. The components involved are explained in the following.

Native Schema. Initially we have the different web data sources. Each data source has a schema, the data inherent to the source and the metadata about its schema. The metadata include rich temporal information about the source: temporal and spatial data on the schema, metadata about the availability of the source…

Preintegration. In the Preintegration phase, the semantic enrichment of the native schemas of the data sources is carried out by the conversion processor. In addition, the temporal and spatial metadata of each data source are used to enrich its schema with temporal and spatial properties. We obtain the component schema (CS) expressed in the CDM, in our case STOWL (OWL enriched with temporal and spatial elements).


From the CS, expressed in STOWL, the negotiation processor generates the export schemas (ES), also expressed in STOWL. An ES represents the part of a component schema which is available to the DW designer; it is expressed in the same CDM as the component schema. The ES are the parts of the CS considered necessary for integration in the DW; for security or privacy reasons, part of the CS can be hidden. An ES can be seen as an external schema of the CS.

External schema generation involves obtaining all the relationships existing between the classes selected to make up an external schema. We use the method described in (Torres & Samos, 2001) and (Araque & Samos, 1999) for defining external schemas in object-oriented models (ODMG, in fact), adapted to the ontology model. External schemas generated with this process avoid the creation of unnecessary intermediate classes.

Integration. The DW schema corresponds to the integration of multiple ES according to the DW designer’s needs. It is expressed in an enriched CDM (STOWL in our case) so that temporal and spatial concepts can be expressed straightforwardly. This step is carried out by the Schema Integration Processor, which suggests how to integrate the export schemas, helps to solve semantic heterogeneities (out of the scope of this paper), and defines the Extracting, Transforming and Loading (ETL) processes. The DW Processor participates in the definition of the DW schema in order to take into account the characteristics of structuring and storage of the data in the DW.

The integration processor consists of two modules which have been added to the reference architecture in order to carry out the integration of the temporal and spatial properties of the data, considering the extraction method used by each data source: the Temporal and Spatial Integration Processor and the Metadata Refreshment Generator.

The Temporal and Spatial Integration Processor uses the set of semantic relations and the conformed schemas obtained during the similarity detection phase (Oliva & Saltor, 1996), which is part of the schema integration methodology. As a result, we obtain rules describing the integration possibilities existing between the original data from the data sources (minimum resultant granularity…). This information is kept in the warehouse schema using STOWL, in the same way as the data sources were annotated with temporal and spatial metadata.
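One such integration rule can be sketched concretely. Under a simplified reading of the "minimum resultant granularity" idea (our interpretation, not the paper's formal definition), data integrated from two sources can be offered at no finer a temporal granularity than the coarser of the two:

```python
# Granularity levels expressed in seconds (illustrative table).
GRANULARITY_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}

def minimum_resultant_granularity(g1: str, g2: str) -> str:
    """The integrated attribute can be no finer than the coarser source:
    pick the granularity with the larger period."""
    return g1 if GRANULARITY_SECONDS[g1] >= GRANULARITY_SECONDS[g2] else g2
```

For example, integrating an hourly weather feed with a daily hotel-occupancy feed yields, at best, daily integrated data.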

In addition, as a result of the integration process, a set of mapping functions is obtained. These functions identify which attributes of the data source schemas are integrated to obtain each attribute of the DW schema.

The Metadata Refreshment Generator determines the most suitable parameters for carrying out the refreshment of the data in the DW (Araque & Samos, 2003). The DW schema is generated in the resolution phase of the schema integration methodology. It is in this second phase that the DW designer fixes the refreshment parameters, starting from the minimum requirements generated by the temporal integration and stored in the Temporal Metadata warehouse. As a result, the DW schema is obtained along with the refreshment metadata necessary to update it according to the data source extraction method and other temporal and spatial properties of a concrete data source.

Obtaining the DW schema is not a linear process. The Integration and Negotiation Processors need to collaborate in an iterative process in which the participation of the local administrators and the DW administrator is necessary (Oliva & Saltor, 1996). Taking both the minimum requirements needed to carry out the integration of two data items from different data sources (obtained by means of the Temporal Integration module) and the integrated schema (obtained by the resolution module), the refreshment parameters of the data stored in the DW are established.

Data Warehouse Derivation. The DW schema is first obtained in STOWL. This schema is usually transformed into a multidimensional DW schema from which OLAP tools get the data (darkened part of figure 2). In this work, due to the expressivity and reasoning capabilities of the ontology approach, we decided to keep the DW schema in ontological form. The main problem is that it is not efficient to maintain all the data extracted from web data sources as ontology instances. To ensure that queries are answered in a reasonable time, the data have to be reduced considerably. This reduction is performed by the negotiation process and usually implies lowering the granularity level of the data.
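The granularity-lowering step can be illustrated with a minimal sketch (our own example, not the system's code): collapsing timestamped source values into one aggregated value per day, so far fewer ontology instances need to be stored and queried.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def reduce_granularity(readings: List[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate ('YYYY-MM-DDThh:mm', value) readings to daily means.
    One instance per day replaces many per-hour instances (illustrative)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[ts[:10]].append(value)  # group by the 'YYYY-MM-DD' date prefix
    return {day: sum(vs) / len(vs) for day, vs in buckets.items()}
```

Mean is only one choice of aggregate; min/max or last-value policies fit the same shape.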

In the end, a DW is generated containing instances of STOWL classes. It is possible to query these data through an ontology inference engine, which allows automatic reasoning about the query results. We explain how to make use of this capability in section 4.

Data Warehouse Refreshment. After the schema integration, once the DW schema is obtained, it has to be maintained and updated. This function is carried out by the Data Integration Processor.


The set of Data Integration Processors can actually be seen as a single data warehouse refreshment processor. For the sake of simplicity they have been sketched independently in figure 2 but, in fact, they cooperate in the refreshment process to produce the integrated data. Each Data Integration Processor is responsible for the incremental capture of its corresponding data source and for transforming the captured data to solve the semantic heterogeneities, according to the integration rules obtained in the integration phase. Each Data Integration Processor accesses its corresponding data source according to the temporal and spatial requirements obtained in the integration stage. The Data Integration Processors use a parallel, fuzzy data integration algorithm to integrate the data (Araque et al., 2007b).
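For web sources without a change log, incremental capture is commonly done by snapshot differencing; the paper does not fix a single extraction method, so the following is one plausible strategy, sketched with invented names:

```python
from typing import Dict, Set, Tuple

def incremental_capture(previous: Dict[str, str],
                        current: Dict[str, str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare two extraction snapshots (record key -> record value) and
    return the inserted, deleted and updated record keys."""
    inserted = set(current) - set(previous)
    deleted = set(previous) - set(current)
    updated = {k for k in set(current) & set(previous) if current[k] != previous[k]}
    return inserted, deleted, updated
```

Only the records in these three sets need to pass through the transformation and integration rules, rather than the whole extraction.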

Figure 2. System functional architecture.


4. WEB AUGMENTATION

We have developed a plug-in for the Firefox web browser to display the information stored in the DW. It is designed to be as unintrusive as possible. When activated, it marks the words in the current web page that can be extended according to the knowledge in the DW. When the user places the cursor over one of them, a pop-up containing the extended information is displayed. The steps performed in this process are shown graphically in figure 3 and detailed in the following.

First, instead of sending all the words in the web page to the ontology inference engine, the web browser sends a message to retrieve the list of known concepts (1). These known concepts correspond to the instance objects stored in the DW (2). The ontology inference engine acts as the query front-end of the DW and has to be located near it (in our case it is actually located on the same computer). The plug-in has to know the location of the server running the inference engine because the queries are made over the Internet.

Once the list of keywords has been retrieved, a message is sent to the ontology inference engine for each keyword appearing in the document (3). The inference engine first obtains the STOWL classes/concepts the instance belongs to (4). The inference engine has a list of general queries to perform for each class; all the queries associated with the classes the instance belongs to are executed (5). The results of these queries are summarized and returned to the web browser, which creates a pop-up containing this information for the keyword sent.
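The client-side part of this flow can be sketched as follows. This is a schematic rendering of steps (1)-(5), not the plug-in's actual JavaScript: the function names are ours, and `query_engine` stands in for the round trip to the inference engine.

```python
from typing import Callable, Dict, List

def augment_page(page_text: str,
                 known_concepts: List[str],
                 query_engine: Callable[[str], str]) -> Dict[str, str]:
    """Match the DW's known concepts against the page text, then ask the
    inference engine only about the concepts actually found; the returned
    mapping is what the plug-in would render as pop-ups (names assumed)."""
    found = [c for c in known_concepts if c in page_text]
    return {concept: query_engine(concept) for concept in found}
```

Because the concept list is fetched once up front, the per-page cost is a substring scan plus one query per matched keyword, instead of one query per word.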

Figure 3. Augmentation process.

As explained in the introduction, we focus on tourism services. We have developed a basic prototype, using this architecture, which is able to expand web sites with tourist information about Andalucía (a southern region of Spain). Figure 4 shows an example of the use of this application. The results given by the inference engine for the keyword “Alhambra” correspond to the result of the query “Monuments located in streets accessible by public transport from any street adjacent to the Alhambra”. This query is defined in the server for the instances of the class/concept “monument”, to which “Alhambra” belongs. At the moment the prototype does not show information about the results; it only redirects to the results given by a search engine for each of them.
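The class-level query mechanism might look as follows. The paper does not publish its STOWL vocabulary or the prototype's actual queries, so the SPARQL-style property names below are purely illustrative; what matters is the shape: one query template per class, instantiated for a concrete keyword.

```python
# One hypothetical query template per STOWL class. The property names
# (:locatedInStreet, :adjacentTo, ...) are invented for this sketch.
QUERIES_BY_CLASS = {
    "Monument": """
        SELECT ?other WHERE {{
          <{instance}> :locatedInStreet ?s1 .
          ?s1 :adjacentTo ?s2 .
          ?s2 :connectedByPublicTransportTo ?s3 .
          ?other :locatedInStreet ?s3 ;
                 a :Monument .
        }}""",
}

def query_for(instance: str, owl_class: str) -> str:
    """Instantiate the class-level template for a concrete instance,
    e.g. the 'Monument' query for 'Alhambra'."""
    return QUERIES_BY_CLASS[owl_class].format(instance=instance)
```

Since the template is attached to the class rather than the instance, every monument detected on a page reuses the same query, with only the instance name substituted.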


Figure 4. A pop up over the keyword “Alhambra”.

5. CONCLUSION

The information on the Web evolves continuously. We have presented an architecture that is able to record the historical information dispersed across the Web and to present it conveniently while the user navigates through web sites.

The architecture is based on the Data Warehouse approach. Each web site from which information is extracted is treated as a data source. To integrate the information coming from the different, autonomous data sources, all their schemas are translated to a common data model; in this way we can resolve the heterogeneities between them. The common data model we have used is a spatio-temporal extension of OWL, an ontology definition language. This allows the automatic reasoning capability of ontologies to be used to support several processes of the system (the integration of the data source schemas, the design of the refreshment process, and the querying and analysis of the data).

To allow more sophisticated querying and analysis of the data in the DW, we have stored part of the data directly in the form of an ontology. We have also developed a browser plug-in able to make proper use of this information. When the plug-in detects an instance of a class/concept in the web page being visited, a query is sent to the server and a list of results relating to the keyword is returned to the browser and added to the web page using dynamic pop-ups.

ACKNOWLEDGEMENT

This work has been supported by the Research Program under project GR2007/07-2 and by the Spanish Research Program under projects EA-2007-0228 and TIN2005-09098-C05-03.


REFERENCES

Antoniou, G., Skylogiannis, T., Bikakis, A. & Bassiliades, N., 2005. A semantic brokering system for the tourism domain. Information Technology and Tourism, 7(3-4), 183-200.

Araque, F., Samos, J., 1999. External Schemas in Real-Time Object-Oriented Databases. 20th IEEE Real-Time Systems Symposium, WIP Proceedings, pp. 105-109. Phoenix.

Araque, F., Carrasco, R. A., Salguero, A., Delgado, C., Vila, M. A., 2007b. Fuzzy Integration of Web data sources for Data Warehousing. Lecture Notes in Computer Science (Vol. 4739). Springer-Verlag. ISSN: 0302-9743.

Araque, F., Salguero, A., Delgado, C., Samos, J., 2006. Algorithms for integrating temporal properties of data in DW. 8th Int. Conf. on Enterprise Information Systems (ICEIS). Paphos, Cyprus. May.

Araque, F., Samos, J., 2003. Data warehouse refreshment maintaining temporal consistency. 5th Int. Conference on Enterprise Information Systems (ICEIS'03). Angers, France.

Davidson, A. & Yu, Y., 2005. The internet and the occidental tourist: an analysis of Taiwan's Tourism Websites from the perspective of Western tourists. Information Technology and Tourism, 7(2), 91-102.

Galindo, J., Aranda, M. C., Caro, J. L., Guevara, A., Aguayo, A., 2002. Applying fuzzy databases and FSQL to the management of rural accommodation. Tourism Management, 23(6), December, pp. 623-629.

Haller, M., Pröll, B., Retschitzegger, W., Tjoa, A. M., & Wagner, R. R., 2000. Integrating Heterogeneous Tourism Information in TIScover - The MIRO-Web Approach. Proc. of Information and Communication Technologies in Tourism (ENTER 2000). Barcelona.

Harinarayan, V., Rajaraman, A., Ullman, J., 1996. Implementing Data Cubes Efficiently. Proc. of ACM SIGMOD Conference. Montreal.

Inmon, W. H., 2002. Building the Data Warehouse. John Wiley.

Lexhagen, M., 2005. The importance of value-added services to support the customer search and purchase process on travel websites. Information Technology and Tourism, 7(2), 119-135.

Oliva, M., Saltor, F., 1996. A Negotiation Process Approach for Building Federated Databases. Proceedings of the 10th ERCIM Database Research Group Workshop on Heterogeneous Information Management, Prague, pp. 43-49.

Sheth, A., Larson, J., 1990. Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases. ACM Computing Surveys, 22(3).

Skotas, D., Simitsis, A., 2006. Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data. International Journal on Semantic Web and Information Systems, 3(4), pp. 1-24.

Torres, M., Samos, J., 2001. Generation of External Schemas in ODMG Databases. Proceedings of the International Database Engineering and Applications Symposium (IDEAS'2001), IEEE Computer Society Press, pp. 89-98. Grenoble.

Watanabe, Y., Kitagawa, H., Ishikawa, Y., 2001. Integration of Multiple Dissemination-Based Information Sources Using Source Data Arrival Properties. Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan.

Xiang, Z. & Fesenmaier, D., 2005. An analysis of two search engine interface metaphors for trip planning. Information Technology and Tourism, 7(2), 103-117.


