Web data integration

Web data integration (WDI) is the process of aggregating and managing data from different websites into a single, homogeneous workflow. This process includes data access, transformation, mapping, quality assurance and fusion of data. Data that is sourced and structured from websites is referred to as "web data". WDI is an extension and specialization of data integration that views the web as a collection of heterogeneous databases.

Data integration techniques applied to the web form the foundation for businesses that take advantage of the data available on the ever-increasing number of publicly accessible websites. [1] Corporate spending in this area amounted to about USD 2.5bn in 2017, and the market was forecast to reach almost USD 7bn by 2020. [2]

Sources

Web data integration extends and specializes data integration by treating the web as a collection of views of databases that are accessible over web protocols. [3]

Data access and transformation

WDI poses technical challenges beyond those of conventional data integration: web data sources are often unstructured or semi-structured and offer no standard query mechanism, so data access and transformation require additional work.
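
As a minimal sketch of what access and transformation can involve, the following Python example turns a fragment of semi-structured HTML into structured records using only the standard library. The markup, the `name` and `price` class names, and the `ProductParser` and `extract_products` helpers are assumptions made for illustration; real pages vary widely and usually need more robust handling.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects name/price fields from hypothetical product markup."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per <li> element
        self._field = None   # field currently being read, if any
        self._current = {}   # fields gathered for the open <li>

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Text outside a recognised <span> is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.records.append(self._current)
            self._current = {}


def extract_products(html):
    """Access step (parse the markup) followed by a transformation step."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    # Transformation: convert the price text into a number.
    return [
        {"name": r.get("name"), "price": float(r["price"].lstrip("$"))}
        for r in parser.records
        if "price" in r
    ]


# Assumed sample input; not taken from any real website.
SAMPLE = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""

products = extract_products(SAMPLE)
```

Because there is no query interface, the structure has to be inferred from the markup itself, which is why extraction logic of this kind tends to be specific to each source.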

Data quality

Understanding the quality and veracity of data is even more important in WDI than in conventional data integration, because web data is generally of lower quality and less implicitly trusted than data collected from a trusted source. There have been attempts to automate trust ratings for web data. [4]

In conventional data integration, quality assurance can generally take place after data access and transformation. In WDI, quality may need to be monitored while the data is being collected, because re-collecting data costs both time and money.
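
One way to picture monitoring during collection is a check that runs on each record as it arrives and stops the run early when too many records fail. The sketch below is illustrative only: the required fields, the failure-rate threshold, and the `QualityMonitor` class are assumptions, not part of any particular WDI product.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("name", "price")  # hypothetical schema


def check_record(record):
    """Return a list of quality problems found in one record."""
    problems = []
    for key in REQUIRED_FIELDS:
        if record.get(key) in (None, ""):
            problems.append(f"missing {key}")
    price = record.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        problems.append("non-positive price")
    return problems


@dataclass
class QualityMonitor:
    max_failure_rate: float = 0.2  # illustrative threshold
    min_samples: int = 5           # do not judge the source on too few records
    seen: int = 0
    failed: int = 0
    rejected: list = field(default_factory=list)

    def accept(self, record):
        """Check one record as it arrives; raise if the source looks broken."""
        self.seen += 1
        problems = check_record(record)
        if problems:
            self.failed += 1
            self.rejected.append((record, problems))
        if (self.seen >= self.min_samples
                and self.failed / self.seen > self.max_failure_rate):
            raise RuntimeError(
                f"aborting: {self.failed}/{self.seen} records failed checks"
            )
        return not problems


# Assumed sample records standing in for a live collection run.
incoming = [
    {"name": "Widget", "price": 9.99},
    {"name": "", "price": 5.0},
    {"name": "Gadget", "price": 24.5},
]
monitor = QualityMonitor()
clean = [r for r in incoming if monitor.accept(r)]
```

Failing fast in this way means a broken source is noticed after a handful of records rather than after a full and costly collection run.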

Applications

WDI has applications in many fields, including bioinformatics, [5] search engines, [6] price comparison, [7] forensic search, [8] data analysis, business intelligence, e-commerce, [9] healthcare, pharmaceuticals [10] and product development.

Most price comparison engines and recommendation systems use user-generated data to create recommendations for their users. Similarly, healthcare systems use the results of competitions conducted on websites such as Kaggle [11] to assess the accuracy of data and to create user-focused products. IBM has estimated that poor-quality WDI costs companies over $3 trillion [12] in revenue each year.

Related Research Articles

The Semantic Web is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Home automation

Home automation or domotics is building automation for a home, called a smart home or smart house. A home automation system will monitor and/or control home attributes such as lighting, climate, entertainment systems, and appliances. It may also include home security such as access control and alarm systems. When connected with the Internet, home devices are an important constituent of the Internet of Things ("IoT").

Hushmail is a proprietary web-based email service offering PGP-encrypted email and vanity domain service. Hushmail uses OpenPGP standards. If public encryption keys are available to both recipient and sender, Hushmail can convey authenticated, encrypted messages in both directions. For recipients for whom no public key is available, Hushmail allows a message to be encrypted with a password and stored for pickup by the recipient, or the message can be sent in cleartext. In July 2016, the company launched an iOS app that offers end-to-end encryption and full integration with the webmail settings. The company is located in Vancouver, British Columbia, Canada.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Virtuoso Universal Server

Virtuoso Universal Server is a middleware and database engine hybrid that combines the functionality of a traditional relational database management system (RDBMS), object–relational database (ORDBMS), virtual database, RDF, XML, free-text, web application server and file server in a single system. Rather than having dedicated servers for each of these functionality realms, Virtuoso is a "universal server": a single multithreaded server process that implements multiple protocols. The free and open source edition of Virtuoso Universal Server is also known as OpenLink Virtuoso. The software has been developed by OpenLink Software with Kingsley Uyi Idehen and Orri Erling as the chief software architects.

A comparison shopping website, sometimes called a price comparison website, price analysis tool, comparison shopping agent, shopbot, aggregator or comparison shopping engine, is a vertical search engine that shoppers use to filter and compare products based on price, features, reviews and other criteria. Most comparison shopping sites aggregate product listings from many different retailers but do not directly sell products themselves, instead earning money from affiliate marketing agreements. In the United Kingdom, these services made between £780m and £950m in revenue in 2005. E-commerce accounted for an 18.2 percent share of total business turnover in the United Kingdom in 2012. Online sales already account for 13% of the total UK economy, and this share is expected to increase to 15% by 2017. Comparison shopping websites have contributed substantially to the expansion of the e-commerce industry.

An RDF query language is a computer language, specifically a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.

Semantic interoperability is the ability of computer systems to exchange data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data federation between information systems.


This is a comparison of notable free and open-source configuration management software, suitable for tasks like server configuration, orchestration and infrastructure as code typically performed by a system administrator.

The Internet of things (IoT) describes the network of physical objects, or "things", that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the Internet.

Amit Sheth is a computer scientist at the University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 106, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

Qlik [pronounced "klik"] provides a business analytics platform. The SaaS software company was founded in 1993 in Lund, Sweden and is now based in King of Prussia, Pennsylvania, United States. The company's main products are Qlik Sense and Qlik Replicate, both cloud-based software for business intelligence and data integration.

Microsoft Azure, commonly referred to as Azure, is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. It provides software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS) and supports many different programming languages, tools, and frameworks, including both Microsoft-specific and third-party software and systems.

Sebastian Schaffert

Sebastian Schaffert is a software engineer and researcher. He was born in Trostberg, Bavaria, Germany on March 18, 1976 and obtained his doctorate in 2004.

Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, file storage, and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning. Registration requires a credit card or bank account details.

Matrix (protocol)

Matrix is an open standard and communication protocol for real-time communication. It aims to make real-time communication work seamlessly between different service providers, in the way that standard Simple Mail Transfer Protocol email currently does for store-and-forward email service, by allowing users with accounts at one communications service provider to communicate with users of a different service provider via online chat, voice over IP, and videotelephony. It therefore serves a similar purpose to protocols like XMPP, but is not based on any existing communication protocol.

SensorThings API is an Open Geospatial Consortium (OGC) standard providing an open and unified framework to interconnect IoT sensing devices, data, and applications over the Web. It is an open standard addressing the syntactic interoperability and semantic interoperability of the Internet of Things. It complements existing IoT networking protocols such as CoAP, MQTT, HTTP and 6LoWPAN. While those networking protocols address the ability of different IoT systems to exchange information, OGC SensorThings API addresses the ability of different IoT systems to use and understand the exchanged information. As an OGC standard, SensorThings API also allows easy integration into existing Spatial Data Infrastructures or Geographic Information Systems.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, and has since become a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.

Thing Description

The Thing Description (TD) is a royalty-free, open information model with a JSON-based representation format for the Internet of Things (IoT). A TD provides a unified way to describe the capabilities of an IoT device or service with its offered data model and functions, protocol usage, and further metadata. Using Thing Descriptions helps reduce the complexity of integrating IoT devices and their capabilities into IoT applications.

Ontotext GraphDB

Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL and available as a high-availability cluster. Ontotext GraphDB is used in various European research projects.

References

  1. "IE 670 Web Data Integration". www.uni-mannheim.de. 2019-01-24. Retrieved 2019-02-11.
  2. "Opimas: The Web Data Extraction Market". Opimas: We begin with an understanding. Retrieved 2019-02-12.
  3. "Introduction :: Web Data Integration". www.webdataintegration.io. Retrieved 2019-02-14.
  4. Giménez-García, José M.; Thakkar, Harsh; Zimmermann, Antoine (2016). "Assessing Trust with PageRank in the Web of Data". In Sack, Harald; Rizzo, Giuseppe; Steinmetz, Nadine; Mladenić, Dunja; Auer, Sören; Lange, Christoph (eds.). The Semantic Web. Lecture Notes in Computer Science. 9989. Springer International Publishing. pp. 293–307. doi:10.1007/978-3-319-47602-5_45. ISBN 9783319476025.
  5. "Web Data Integration". Database Group Leipzig.
  6. "Web-scale Data Integration - You Can Only Afford to Pay as You Go". www.datascienceassn.org. Retrieved 2019-02-12.
  7. Siegel, Michael D.; Madnick, Stuart E.; Zhu, Hongwei (2008). "Enabling global price comparison through semantic integration of web data". International Journal of Electronic Business. 6 (4): 319. doi:10.1504/IJEB.2008.020672. hdl:1721.1/40084. S2CID 7995576. Retrieved 2019-02-12.
  8. "PwC buys Kusiri, London-based fraud detection start-up". www.consultancy.uk. 2015-10-30. Retrieved 2019-02-12.
  9. Osial, P.; Kauranen, K.; Ahmed, E. (April 2017). "Smartphone recommendation system using web data integration techniques". 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE): 1–5. doi:10.1109/CCECE.2017.7946845. ISBN 978-1-5090-5538-8.
  10. "How Data Integration is Revamping Healthcare and Pharma". Data Integration Info. 2020-04-27. Retrieved 2020-05-04.
  11. "Kaggle: Your Machine Learning and Data Science Community". www.kaggle.com. Retrieved 2020-05-04.
  12. Import.io. "Web Data Integration: Revolutionizing the Way You Work with Web Data". www.import.io. Retrieved 2020-05-04.