Data synchronization

Figure: Synchronization process between a server and two clients

Data synchronization is the process of establishing consistency between source and target data stores, and the continuous harmonization of the data over time. It is fundamental to a wide variety of applications, including file synchronization and mobile device synchronization. Data synchronization can also be useful in encryption for synchronizing public key servers.


Data synchronization is needed to update multiple copies of a set of data and keep them coherent with one another, or to maintain data integrity. [1] For example, database replication is used to keep multiple copies of data synchronized with database servers that store data in different locations.

Examples

Examples include file synchronization tools, mobile device synchronization, database replication, distributed and clustered file systems, and mirroring utilities.

Challenges

Some of the challenges that a user may face in data synchronization include:

Data format complexity

Data formats tend to grow more complex over time as the organization grows and evolves. As a result, it is no longer enough to build simple interfaces between the two applications (source and target); the data must also be transformed while it is passed to the target application. ETL (extract, transform, load) tools can be helpful at this stage for managing data format complexities.
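As a minimal illustration of such a transformation step, the following sketch maps a record from a hypothetical source schema to a hypothetical target schema before it is loaded; the field names and date format are assumptions made for the example, not part of any particular ETL tool.

```python
from datetime import datetime

def transform_record(source: dict) -> dict:
    """Map a record from a hypothetical source schema to a hypothetical target schema."""
    return {
        # Rename the identifier field to match the target application's schema.
        "customer_id": source["cust_no"],
        # Merge two source fields into one canonical field.
        "full_name": f'{source["first_name"]} {source["last_name"]}'.strip(),
        # Convert a source-specific date string into ISO 8601 for the target.
        "created_at": datetime.strptime(source["created"], "%d/%m/%Y").date().isoformat(),
    }

source_row = {"cust_no": 42, "first_name": "Ada", "last_name": "Lovelace", "created": "10/12/1985"}
print(transform_record(source_row))
# {'customer_id': 42, 'full_name': 'Ada Lovelace', 'created_at': '1985-12-10'}
```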

Real-timeliness

In real-time systems, customers want to see the current status of their order in an online shop, the current status of a parcel delivery (real-time parcel tracking), the current balance on their account, and so on. This calls for a real-time system that is itself kept up to date, so that, for example, material can be ordered when the enterprise is running out of stock, and customer orders can be synchronized with the manufacturing process. In practice, there are many examples where real-time processing provides a competitive advantage.

Data security

There are no fixed rules and policies for enforcing data security; they may vary depending on the system being used. Even if security is maintained correctly in the source system that captures the data, the security and information-access privileges must be enforced on the target systems as well to prevent any potential misuse of the information. This is a serious issue, particularly when it comes to handling secret, confidential, and personal information. Because of this sensitivity and confidentiality, the data transfer and all in-between information must be encrypted.
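For example, a record can be encrypted before it leaves the source system and decrypted only on the target system. The following sketch uses symmetric encryption from the third-party Python cryptography package (Fernet) as one possible choice; how the key is shared securely between the two systems is outside the scope of the example.

```python
from cryptography.fernet import Fernet  # third-party package: pip install cryptography

# Assumption for the example: the key has already been shared securely
# between the source and target systems.
key = Fernet.generate_key()
fernet = Fernet(key)

payload = b'{"customer_id": 42, "balance": 1250.00}'

# Source side: encrypt before the record is transferred.
ciphertext = fernet.encrypt(payload)

# Target side: decrypt after transfer, then apply the usual access controls.
assert fernet.decrypt(ciphertext) == payload
```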

Data quality

Data quality is another serious constraint. For better management and to maintain good data quality, the common practice is to store the data at one location and share it among different people and different systems and/or applications at different locations. This helps prevent inconsistencies in the data.

Performance

There are several distinct phases involved in the data synchronization process, typically including extraction of data from the source system, data transfer, data transformation, and loading of the data into the target system.

Each of these steps is critical. In the case of large amounts of data, the synchronization process needs to be carefully planned and executed to avoid any negative impact on performance.
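One common way to limit that impact is to move the data in fixed-size batches rather than one record at a time, so that each phase makes fewer round trips between the source and target systems. The sketch below illustrates the idea; the batch size and the load_batch callback are assumptions for the example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

def batches(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive fixed-size batches from an arbitrarily large stream of rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

def synchronize(rows: Iterable[dict], load_batch: Callable[[List[dict]], None]) -> None:
    """Extract, transform and load rows in batches to limit the load on the target system."""
    for batch in batches(rows, size=1000):
        load_batch(batch)  # e.g. one bulk insert or one network round trip per batch
```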

File-based solutions

Tools are available for file synchronization, version control (CVS, Subversion, etc.), distributed filesystems (Coda, etc.), and mirroring (rsync, etc.); all of these attempt to keep sets of files synchronized. However, only version control and file synchronization tools can deal with modifications to more than one copy of the files.
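As an illustration of the basic idea behind mirroring tools, the following sketch performs a one-way mirror that copies a file only when its content hash differs between source and target; real tools such as rsync are considerably more efficient, transferring only the differing portions of files. [2] The directory layout and hash choice here are assumptions for the example.

```python
import hashlib
import shutil
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def mirror(source_dir: Path, target_dir: Path) -> None:
    """One-way mirror: copy files that are missing or whose content differs on the target."""
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        dst = target_dir / src.relative_to(source_dir)
        if not dst.exists() or file_digest(src) != file_digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy file contents and metadata
```

Because such a mirror only pushes changes in one direction, it cannot reconcile modifications made independently to both copies; that is what distinguishes file synchronization and version control tools from simple mirroring.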

Theoretical models

Several theoretical models of data synchronization exist in the research literature, and the problem is also related to the problem of Slepian–Wolf coding in information theory. The models are classified based on how they consider the data to be synchronized.

Unordered data

The problem of synchronizing unordered data (also known as the set reconciliation problem) is modeled as an attempt to compute the symmetric difference S_A ⊕ S_B = (S_A − S_B) ∪ (S_B − S_A) between two remote sets S_A and S_B of b-bit numbers. [3] Some solutions to this problem are typified by:

Wholesale transfer
In this case all data is transferred to one host for a local comparison.
Timestamp synchronization
In this case all changes to the data are marked with timestamps. Synchronization proceeds by transferring all data with a timestamp later than the previous synchronization (a minimal sketch of this approach is given after this list). [4]
Mathematical synchronization
In this case data are treated as mathematical objects and synchronization corresponds to a mathematical process. [3] [5] [6]
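A minimal sketch of timestamp synchronization, assuming every record carries a last-modified timestamp and each host remembers when it last synchronized; the record layout is an assumption for the example.

```python
from datetime import datetime, timezone

def changed_since(records: list, last_sync: datetime) -> list:
    """Select the records modified after the previous synchronization."""
    return [r for r in records if r["modified_at"] > last_sync]

server_records = [
    {"id": 1, "modified_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "modified_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
last_sync = datetime(2024, 2, 1, tzinfo=timezone.utc)

# Only record 2 has changed since the last synchronization and needs to be transferred.
print(changed_since(server_records, last_sync))
```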

Ordered data

In this case, two remote strings σ_A and σ_B need to be reconciled. Typically, it is assumed that these strings differ by up to a fixed number of edits (i.e. character insertions, deletions, or modifications). Data synchronization is then the process of reducing the edit distance between σ_A and σ_B, down to the ideal distance of zero. This is applied in all filesystem-based synchronizations (where the data is ordered). Many practical applications of this are discussed or referenced above.
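For illustration, the edit distance itself can be computed with a standard dynamic-programming recurrence, as in the sketch below; practical synchronizers do not compare the full strings this way, since the point is to reconcile them with little communication.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # substitute, or match for free
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))    # 3
```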

It is sometimes possible to transform the problem into one of unordered data through a process known as shingling (splitting the strings into overlapping substrings of a fixed length, called shingles). [7]
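A minimal sketch of shingling: each string is split into the set of its overlapping length-k substrings, and the two resulting sets can then be handled with an unordered (set reconciliation) method such as those described above. The shingle length k = 4 is an arbitrary choice for the example.

```python
def shingles(text: str, k: int = 4) -> set:
    """Split a string into the set of its overlapping length-k substrings (shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

a = shingles("data synchronization")
b = shingles("data synchronisation")

# When the strings differ by only a few edits, the symmetric difference of their
# shingle sets is small, so a set reconciliation protocol can be applied to it.
print(a ^ b)
```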

Error handling

In fault-tolerant systems, distributed databases must be able to cope with the loss or corruption of (part of) their data. The first step is usually replication, which involves making multiple copies of the data and keeping them all up to date as changes are made. However, it is then necessary to decide which copy to rely on when loss or corruption of an instance occurs.

The simplest approach is to have a single master instance that is the sole source of truth. Changes to it are replicated to other instances, and one of those instances becomes the new master when the old master fails.
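A minimal, in-memory sketch of this single-master idea, assuming writes go only to the master and are then copied to every replica; real systems must additionally handle replication lag, failure detection, and the promotion itself.

```python
class Replica:
    def __init__(self) -> None:
        self.data = {}  # the replica's local copy of the data

class Master(Replica):
    def __init__(self, replicas) -> None:
        super().__init__()
        self.replicas = replicas

    def write(self, key, value) -> None:
        """Apply a change locally, then replicate it to every follower."""
        self.data[key] = value
        for replica in self.replicas:
            replica.data[key] = value

followers = [Replica(), Replica()]
master = Master(followers)
master.write("order-42", "shipped")

# If the master is lost, any follower already holds the latest value
# and can be promoted to act as the new master.
assert all(f.data["order-42"] == "shipped" for f in followers)
```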

Paxos and Raft are more complex protocols that exist to solve problems with transient effects during failover, such as two instances thinking they are the master at the same time.

Secret sharing is useful if failures of whole nodes are very common. This moves synchronization from an explicit recovery process to being part of each read, where a read of some data requires retrieving encoded data from several different nodes. If corrupt or out-of-date data may be present on some nodes, this approach may also benefit from the use of an error correction code.

Distributed hash tables (DHTs) and blockchains try to solve the problem of synchronization between many nodes (hundreds to billions).

Related Research Articles

In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores. Consistency is different from coherence, which occurs in systems that are cached or cache-less, and is consistency of data with respect to all processors. Coherence deals with maintaining a global order in which writes to a single location or single variable are seen by all processors. Consistency deals with the ordering of operations to multiple locations with respect to all processors.


A vector clock is a data structure used for determining the partial ordering of events in a distributed system and detecting causality violations. Just as in Lamport timestamps, inter-process messages contain the state of the sending process's logical clock. A vector clock of a system of N processes is an array/vector of N logical clocks, one clock per process; a local "largest possible values" copy of the global clock-array is kept in each process.
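A small sketch of that structure, assuming each process increments its own entry on a local event and merges entry-wise maxima when a message is received:

```python
class VectorClock:
    def __init__(self, n: int, pid: int) -> None:
        self.clock = [0] * n   # one logical clock per process
        self.pid = pid         # index of the owning process

    def local_event(self) -> None:
        self.clock[self.pid] += 1

    def send(self) -> list:
        """Tick the local entry and return a copy of the clock to attach to a message."""
        self.local_event()
        return list(self.clock)

    def receive(self, other: list) -> None:
        """Merge a received clock by taking the entry-wise maximum, then tick locally."""
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.local_event()

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
p1.receive(p0.send())
print(p0.clock, p1.clock)   # [1, 0] [1, 1]
```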

Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. It was replaced by Colossus in 2010.

File synchronization in computing is the process of ensuring that computer files in two or more locations are updated via certain rules.

Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventual consistency, also called optimistic replication, is widely deployed in distributed systems and has origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee – most stronger models, like linearizability, are trivially eventually consistent.

The Lamport timestamp algorithm is a simple logical clock algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. The algorithm is named after its creator, Leslie Lamport.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

GPFS is high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes, or a combination of these. It is used by many of the world's largest commercial companies, as well as by some of the supercomputers on the TOP500 list. For example, it is the filesystem of Summit at Oak Ridge National Laboratory, which was the fastest supercomputer in the world in the November 2019 TOP500 list. Summit is a 200-petaflops system composed of more than 9,000 POWER9 processors and 27,000 NVIDIA Volta GPUs. Its storage filesystem, called Alpine, has 250 PB of storage using Spectrum Scale on IBM ESS storage hardware, capable of approximately 2.5 TB/s of sequential I/O and 2.2 TB/s of random I/O.


DRBD is a distributed replicated storage system for the Linux platform. It is implemented as a kernel driver, several userspace management applications, and some shell scripts. DRBD is traditionally used in high availability (HA) computer clusters, but beginning with DRBD version 9, it can also be used to create larger software defined storage pools with a focus on cloud integration.

Gluster Inc. was a software company that provided an open source platform for scale-out public and private cloud storage. The company was privately funded and headquartered in Sunnyvale, California, with an engineering center in Bangalore, India. Gluster was funded by Nexus Venture Partners and Index Ventures. Gluster was acquired by Red Hat on October 7, 2011.

A grid file system is a computer file system whose goal is improved reliability and availability by taking advantage of many smaller file storage areas.

A clustered file system (CFS) is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system. Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance.

Microsoft Sync Framework is a data synchronization platform from Microsoft that can be used to synchronize data across multiple data stores. Sync Framework includes a transport-agnostic architecture, into which data store-specific synchronization providers, modelled on the ADO.NET data provider API, can be plugged in. Sync Framework can be used for offline access to data, by working against a cached set of data and submitting the changes to a master database in a batch, as well as to synchronize changes to a data source across all consumers and peer-to-peer synchronization of multiple data sources. Sync Framework features built-in capabilities for conflict detection – whether data to be changed has already been updated – and can flag them for manual inspection or use defined policies to try to resolve the conflict. Sync Services includes an embedded SQL Server Compact database to store metadata about the synchronization relationships as well as about each sync attempt. The Sync Framework API is surfaced both in managed code, for use with .NET Framework applications, as well as unmanaged code, for use with COM applications. It was scheduled to ship with Visual Studio 2008 in late November 2007.

Optimistic replication, also known as lazy replication, is a strategy for replication, in which replicas are allowed to diverge.

LOCUS is a discontinued distributed operating system developed at UCLA during the 1980s. It was notable for providing an early implementation of the single-system image idea, where a cluster of machines appeared to be one larger machine.


Reo is a domain-specific language for programming and analyzing coordination protocols that compose individual processes into full systems, broadly construed. Examples of classes of systems that can be composed with Reo include component-based systems, service-oriented systems, multithreading systems, biological systems, and cryptographic protocols. Reo has a graphical syntax in which every Reo program, called a connector or circuit, is a labeled directed hypergraph. Such a graph represents the data-flow among the processes in the system. Reo has formal semantics, which stand at the basis of its various formal verification techniques and compilation tools.

A version vector is a mechanism for tracking changes to data in a distributed system, where multiple agents might update the data at different times. The version vector allows the participants to determine if one update preceded another (happened-before), followed it, or if the two updates happened concurrently. In this way, version vectors enable causality tracking among data replicas and are a basic mechanism for optimistic replication. In mathematical terms, the version vector generates a preorder that tracks the events that precede, and may therefore influence, later updates.

Industrial process data validation and reconciliation, or more briefly, process data reconciliation (PDR), is a technology that uses process information and mathematical methods in order to automatically ensure data validation and reconciliation by correcting measurements in industrial processes. The use of PDR allows for extracting accurate and reliable information about the state of industry processes from raw measurement data and produces a single consistent set of data representing the most likely process operation.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

References

  1. Nakatani, Kazuo; Chuang, Ta-Tao; Zhou, Duanning (2006). "Data Synchronization Technology: Standards, Business Values and Implications". Communications of the Association for Information Systems. 17. doi:10.17705/1cais.01744. ISSN 1529-3181.
  2. Tridgell, A. (February 1999). "Efficient Algorithms for Sorting and Synchronization" (PDF). PhD thesis, The Australian National University.
  3. Minsky, Y.; Trachtenberg, A.; Zippel, R. (2003). "Set reconciliation with nearly optimal communication complexity". IEEE Transactions on Information Theory. 49 (9): 2213–2218. CiteSeerX 10.1.1.73.5806. doi:10.1109/TIT.2003.815784.
  4. "Palm developer knowledgebase manuals". Archived from the original on 2002-03-11. Retrieved 2007-01-09.
  5. Trachtenberg, A.; Starobinski, D.; Agarwal, S. (2002). "Fast PDA Synchronization Using Characteristic Polynomial Interpolation" (PDF). IEEE INFOCOM 2002. doi:10.1109/INFCOM.2002.1019402.
  6. Minsky, Y.; Trachtenberg, A. (October 2002). "Scalable set reconciliation". Allerton Conference on Communication, Control, and Computing.
  7. Agarwal, S.; Chauhan, V.; Trachtenberg, A. (November 2006). "Bandwidth efficient string reconciliation using puzzles" (PDF). IEEE Transactions on Parallel and Distributed Systems. 17 (11): 1217–1225. doi:10.1109/TPDS.2006.148. S2CID 4300693. Retrieved 2007-05-23.