OutWit Hub

OutWit Hub
Developer(s): OutWit Technologies
Operating system: Microsoft Windows, macOS, Linux
Type: Web scraping, download manager
License: Proprietary
Website: outwit.com

OutWit Hub is a Web data extraction application designed to automatically extract information from online or local resources. It recognizes and grabs links, images, documents, contacts, recurring vocabulary and phrases, and RSS feeds, and converts structured and unstructured data into formatted tables that can be exported to spreadsheets or databases. The first version was released in 2010. The current version (9.0) is available for Windows 10 and Windows 11, Linux, and macOS 10.


The program includes a Mozilla-based browser and a sidebar that gives access to a number of views with pre-set extractors. Web pages and textual documents are broken down into their constituents, which are presented as tables in these views. The application can navigate through series of links and sequences of search engine results pages to extract information elements, organize them in tables, and export them to various formats. The predefined extractors collect structured tables, lists, and feeds. Custom scrapers can also be created to extract data from less structured page elements.[1] Regular expressions can be included in scrapers, as well as in other parts of the application, to define variable recognition markers.[2]
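The idea of a regular expression acting as a variable recognition marker, with the results organized as a table and exported, can be illustrated with a minimal Python sketch. This is not OutWit Hub's scraper syntax or implementation; the page text, the `ROW` pattern, and the `scrape` and `to_csv` helpers are hypothetical names chosen for illustration.

```python
import csv
import io
import re

# Hypothetical raw page text; the markup differs from one row to the next.
SOURCE = """
<div>Widget A - price: 12.50 EUR</div>
<p>Widget B (price 7 EUR)</p>
"""

# The regular expression plays the role of a variable marker: a product name,
# then any run of non-digit characters, then a number followed by "EUR".
ROW = re.compile(r"(Widget \w+)\D+?(\d+(?:\.\d+)?) EUR")


def scrape(source):
    """Return (product, price) rows found in the raw text."""
    return [(name, float(price)) for name, price in ROW.findall(source)]


def to_csv(rows):
    """Serialize the rows as CSV text suitable for a spreadsheet import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["product", "price_eur"])
    writer.writerows(rows)
    return buf.getvalue()


rows = scrape(SOURCE)
print(to_csv(rows))
```

Because the pattern tolerates whatever separates the name from the price, the same scraper handles both rows even though their markup is different.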

Although OutWit Hub is presented as a tool for non-technical users, the application does not use the document object model (DOM) structure for its extractions. This prevents visual "point and grab" data scraping and requires users who want to create custom scrapers to define markers in the source code of the page. The advantage of this approach is that it allows a more precise definition of extraction masks than HTML nodes do, and faster execution, as the DOM tree does not need to be rendered by the browser at extraction time.
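Marker-based extraction from page source, as opposed to DOM traversal, can be sketched in a few lines of Python. This is an illustration of the general technique under simple assumptions (literal "before" and "after" markers, a made-up page), not a description of how OutWit Hub is implemented; `extract_between` and `PAGE` are hypothetical names.

```python
import re


def extract_between(source, before, after):
    """Return every text fragment found between two literal markers."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return [m.strip() for m in re.findall(pattern, source, flags=re.DOTALL)]


# Hypothetical page source. No DOM tree is built; only the raw text is scanned.
PAGE = """
<ul>
  <li class="item"><b>Name:</b> Alice <i>(Paris)</i></li>
  <li class="item"><b>Name:</b> Bob <i>(Lyon)</i></li>
</ul>
"""

names = extract_between(PAGE, "<b>Name:</b>", "<i>")
cities = extract_between(PAGE, "<i>(", ")</i>")
rows = list(zip(names, cities))
print(rows)
```

The markers here cut inside a single HTML node (the parentheses within the `<i>` element), which is the kind of precision a node-level selector alone would not give.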

Versions

A limited free version can be downloaded from the publisher's site and shareware download websites. [3]

Features

Advanced features

An Enterprise edition of the application includes advanced extraction and automation features for specific or large-volume extractions, for sending series of automatically generated HTTP or POST queries, and for uploading scraped data to FTP servers.
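Generating a series of queries automatically can be sketched with Python's standard library. The example below only builds the request objects and sends nothing; the base URL, the `q` and `page` parameter names, and the `build_queries` helper are assumptions made for illustration and do not reflect the Enterprise edition's actual interface.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://example.com/search"  # placeholder address, not a real target


def build_queries(term, pages, method="GET"):
    """Build one request per results page, as GET or POST."""
    requests = []
    for page in range(1, pages + 1):
        params = urlencode({"q": term, "page": page})
        if method == "POST":
            # POST: parameters travel in the request body.
            req = Request(BASE, data=params.encode("ascii"), method="POST")
        else:
            # GET: parameters are appended to the URL as a query string.
            req = Request(f"{BASE}?{params}", method="GET")
        requests.append(req)
    return requests


queries = build_queries("web scraping", pages=3)
urls = [q.full_url for q in queries]
for url in urls:
    print(url)
```

Each request could then be passed to `urllib.request.urlopen` to be sent; that step is left out so the sketch runs without network access.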

Browser extensions

Firefox

OutWit Hub was also distributed as a Firefox extension, which has since been discontinued.[4]



References

  1. "Using 'separators and labels' in Outwit Hub pro". Datacrumble. May 2013.
  2. "How-to: Scraping ugly HTML using 'regular expressions' in an OutWit Hub scraper". Online Journalism. Nov 2012.
  3. "How to use OutWit Hub to scrape data for free". Interhacktives. Mar 2014.
  4. "OutWit Hub – Add-ons for Firefox". 15 November 2017. Archived from the original on 15 November 2017.