Content migration

Content migration is the process of moving information stored on one computer information system (IS) to a new system. The IS may be a Web content management system (CMS), a digital asset management (DAM) system, or a document management system (DMS). The IS may also be based on flat HTML content, including HTML files, Active Server Pages (ASP), JavaServer Pages (JSP), PHP, or content stored in some type of HTML/JavaScript-based system, and the content can be either static or dynamic.

Business drivers

Reasons to consider migrating content

Content Migrations can solve a number of issues ranging from:

Arguments against migrating content

Content migrations entail risks. Some reasons to avoid a migration, such as cost, are obvious, but others are less so. These include corruption in transit and loss of context, particularly for unstructured content, which is typically one of the larger artifacts of a business. There is also the risk that external references are not considered, leaving broken links to content. The size of the data to be migrated makes the process very resource-intensive (source, destination, and temporary storage, network bandwidth, etc.), which means that auditing the migration can also be complex and requires consistency and traceability.
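As a rough illustration of how corruption in transit might be detected and audited, the sketch below builds a SHA-256 checksum manifest for a directory tree and compares the source and target manifests. The function names (`sha256_of`, `build_manifest`, `compare_manifests`) and the assumption that both systems can be exported to a directory of files are hypothetical and are not part of any particular migration tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source: dict, target: dict) -> dict:
    """Report files that are missing from, or differ in, the target."""
    missing = sorted(set(source) - set(target))
    changed = sorted(
        name for name in source.keys() & target.keys()
        if source[name] != target[name]
    )
    return {"missing": missing, "changed": changed}
```

Running `compare_manifests(build_manifest(src_dir), build_manifest(dst_dir))` after a transfer would give a simple, repeatable audit record. Content that is deliberately transformed during migration would need a different comparison.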

Another common issue in content migration is the loss of SEO value and page rank in search engines. Migrating to another location and adopting new software means that website URLs will change as well, so search engines have to adjust even if they are informed about the process.[ citation needed ] In a white paper, Oracle also outlined several issues involving the so-called people perspective. It cited the probability that people involved in the content migration might not have a thorough grasp of the history, structure, and meaning of the source data, or of the new system, which could lead not only to loss of information but also to additional resource costs. [1]
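One widely used mitigation for changed URLs, not described in the sources cited here, is to keep a map from old paths to new paths and answer requests for old paths with permanent (301) redirects. The following is a minimal sketch under that assumption. `build_redirect_map` and `resolve` are hypothetical helper names, and the URLs are placeholders.

```python
from urllib.parse import urlsplit


def build_redirect_map(url_pairs):
    """Return {old_path: new_path} for every page whose path changed."""
    redirects = {}
    for old_url, new_url in url_pairs:
        old_path = urlsplit(old_url).path or "/"
        new_path = urlsplit(new_url).path or "/"
        if old_path != new_path:
            redirects[old_path] = new_path
    return redirects


def resolve(path, redirects):
    """Return (status, location): 301 plus the new path, or 200 if unchanged."""
    if path in redirects:
        return 301, redirects[path]
    return 200, path


pairs = [
    ("https://example.com/about.asp", "https://example.com/about/"),
    ("https://example.com/", "https://example.com/"),
]
redirect_map = build_redirect_map(pairs)
```

In practice the map would usually be exported to the web server's or the new CMS's own redirect configuration rather than evaluated in application code.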

One method for addressing these risks is the use of metadata. Metadata is employed to describe, access, and manage records, serving as the ultimate means by which the integrity, trustworthiness, and authenticity of a record can be proven. [2] The process could, for instance, adopt a two-track framework in which one track deals with the overall content, structure, layout, and vision, while the other focuses on metadata. [3]

Approaches

There are many ways to access the content stored in a CMS. Depending on the vendor, a CMS may offer an application programming interface (API), Web services, or XML exports; content can also be rebuilt record by record with SQL queries or extracted through the web interface.

  1. The API [4] requires a developer to read and understand how to interact with the source CMS's API layer, then develop an application that extracts the content and stores it in a database, an XML file, or an Excel file. Once the content is extracted, the developer must read and understand the target CMS API and develop code to push the content into the new system. The same can be said for Web services.
  2. Most CMSs use a database to store and associate content, so if no API exists the programmer must reverse engineer the table structure. Once the structure is understood, very complex SQL queries are written to pull all the content from multiple tables into an intermediate table or into a comma-separated values (CSV) or XML file. Once the developer has the files or database, the developer must read and understand the target CMS API (or Web services) and develop code to push the content into the new system.
  3. XML export creates XML files of the content stored in a CMS, but after the files are exported they need to be altered to fit the schema of the target CMS. This is typically done by a developer who writes code to perform the transformation.
  4. HTML files, JSP, ASP, PHP, and other application server file formats are the most difficult. The structure of flat HTML files is based on a combination of folder structure, HTML file structure, and image locations. In the early days of content migration, the developer had to use programming languages to parse the HTML files and save them as structured databases, XML, or CSV. Typically Perl, Java, C++, or C# were used because of their regular expression handling capability. JSP, ASP, PHP, ColdFusion, and other application server technologies usually rely on server-side includes, which simplify development but make it very difficult to migrate content because the content is not assembled until the user requests the page in a web browser. This makes it very difficult to look at the files and extract the content from the file structure.
  5. Web scraping allows users to access most of the content directly from the web user interface. Since a web interface is visual (this is the point of a CMS), some web scrapers leverage the UI to extract the content and place it into a structure such as a database, XML, or CSV. All CMSs, DAMs, and DMSs use web interfaces, so extracting the content from one or many source sites is basically the same process. In some cases it is possible to push the content into the new CMS using the web interface, but some CMSs use Java applets or ActiveX controls, which are not supported by most web scrapers. In that case, the developer must read and understand the target CMS API (or Web services) and develop code to push the content into the new system.
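The database route in item 2 can be sketched as follows. This is only an illustration: the `pages` and `page_bodies` tables, their columns, and the helper names are invented for the example, and a real CMS schema would normally be far more involved. An in-memory SQLite database stands in for the source system.

```python
import csv
import io
import sqlite3


def export_pages_to_csv(conn: sqlite3.Connection) -> str:
    """Join the hypothetical pages and page_bodies tables into one CSV string."""
    rows = conn.execute(
        """
        SELECT p.id, p.title, p.url_path, b.html
        FROM pages AS p
        JOIN page_bodies AS b ON b.page_id = p.id
        ORDER BY p.id
        """
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "title", "url_path", "html"])
    writer.writerows(rows)
    return buffer.getvalue()


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in for a reverse-engineered CMS database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, url_path TEXT);
        CREATE TABLE page_bodies (page_id INTEGER, html TEXT);
        INSERT INTO pages VALUES (1, 'Home', '/'), (2, 'About', '/about');
        INSERT INTO page_bodies VALUES (1, '<p>Welcome</p>'), (2, '<p>About us</p>');
        """
    )
    return conn


csv_text = export_pages_to_csv(make_demo_db())
```

The resulting CSV is the kind of intermediate file that would then be fed to code written against the target CMS's API.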

The basic content migration flow

  1. Obtain an inventory of the content.
  2. Obtain an inventory of binary content such as images, PDFs, CSS files, Office documents, Flash, and any other binary objects.
  3. Find any broken links in the content or content resources.
  4. Determine the menu structure of the content.
  5. Find the parent/sibling relationships of the content so that links to other content and resources are not broken when it is moved.
  6. Extract the resources from the pages and store them in a database or file structure. Store the reference in a database or a file.
  7. Extract the HTML content from the site and store it locally.
  8. Upload the resources to the new CMS, using either the API or the web interface, and store the new location in a database or XML file.
  9. Transform the HTML to meet the new CMS's standards and reconnect any resources.
  10. Upload the transformed content into the new system.
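Steps 6 and 9 above (collecting resource references, then reconnecting them to their new locations) might look roughly like the sketch below. It uses Python's standard `html.parser` module. The `ResourceRewriter` class, the `rewrite_resources` helper, and the idea of a simple old-to-new location dictionary are assumptions made for illustration, and only `src` and `href` attributes are handled.

```python
from html import escape
from html.parser import HTMLParser

URL_ATTRS = {"src", "href"}  # attributes treated as resource references


class ResourceRewriter(HTMLParser):
    """Re-emit HTML, recording resource references and mapping them to new locations."""

    def __init__(self, location_map):
        super().__init__(convert_charrefs=False)
        self.location_map = location_map  # {old location: new location}
        self.found = []                   # every src/href value seen
        self.output = []                  # pieces of the rewritten document

    def _render(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if name in URL_ATTRS and value is not None:
                self.found.append(value)
                value = self.location_map.get(value, value)
            if value is None:
                parts.append(name)  # bare attribute such as "disabled"
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.output.append(self._render(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.output.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")

    def handle_comment(self, data):
        self.output.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.output.append(f"<!{decl}>")


def rewrite_resources(html_text, location_map):
    """Return (rewritten HTML, list of resource references found)."""
    parser = ResourceRewriter(location_map)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.output), parser.found


page = '<p>See <a href="/old/report.pdf">report</a> <img src="/img/logo.png"></p>'
new_locations = {
    "/old/report.pdf": "/media/report.pdf",
    "/img/logo.png": "/assets/logo.png",
}
rewritten, references = rewrite_resources(page, new_locations)
```

The `references` list doubles as a per-page resource inventory, and any reference that has no entry in the location map is left unchanged, which makes unmapped or broken links easy to spot.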

Old to new

  1. Remember that the content strategy on your new site can evolve as brand objectives change and as you start to understand how content performs in the new environment. It may be necessary to bring back old content that was not initially migrated, so make sure you archive everything that doesn't make the initial cut.

References

  1. Oracle (October 2011). "Successful Data Migration" (PDF). Oracle. Retrieved September 4, 2018.
  2. TAHO (September 2015). "Information Management Advice 60 Part 5 Successfully manage Information Risks during System Migration" (PDF). Tasmanian Government. Retrieved September 4, 2018.
  3. Sanchez-Alonso, Salvador; Athanasiadis, Ioannis (2010). Metadata and Semantic Research. Berlin: Springer. p. 28. ISBN 9783642165511.
  4. What the Content Migration APIs Are Not