Poliqarp

Last updated December 02, 2020

Poliqarp is an open-source search engine designed to process text corpora, including the National Corpus of Polish created at the Institute of Computer Science, Polish Academy of Sciences.^[1]^[2]

Features

- Custom query language ^[3]
- Two-level regular expressions:
  - operating at the level of characters in words
  - operating at the level of words in statements/paragraphs
- Good performance
- Compact corpus representation (compared to similar projects)
- Portability across operating systems: Linux/BSD/Win32
- No portability across endianness (the current release works only on little-endian devices)
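
The two levels of regular expressions listed above can be illustrated with a minimal Python sketch. It does not use Poliqarp's own query language or API: the function `match_sequence` and the sample tokens are hypothetical, and only fixed-length word sequences are handled. Each pattern is a character-level regular expression that must match one whole word, and the list of patterns plays the role of a simple word-level expression.

```python
import re

def match_sequence(tokens, word_patterns):
    """Return the start indices where consecutive tokens match the patterns.

    Each entry of word_patterns is a character-level regex applied to one
    whole word; the list as a whole is a fixed-length word-level pattern.
    """
    compiled = [re.compile(p) for p in word_patterns]
    width = len(compiled)
    hits = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits

tokens = "the cat sat on the mat near the cot".split()
# Word 1: exactly "the"; word 2: "c", any single character, then "t".
print(match_sequence(tokens, ["the", "c.t"]))  # [0, 7]
```

A real corpus engine would evaluate such queries against a precomputed index rather than scanning the token list, which is presumably where the compact corpus representation matters.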

Related Research Articles

In computing, endianness is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most significant byte of a word at the smallest memory address and the least significant byte at the largest. A little-endian system, in contrast, stores the least-significant byte at the smallest address. Endianness may also be used to describe the order in which the bits are transmitted over a communication channel, e.g., big-endian in a communications channel transmits the most significant bits first. Bit-endianness is seldom used in other contexts.
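
The byte orders described above can be shown with a short Python sketch using the standard `struct` module. The sample value `0x0A0B0C0D` is an arbitrary choice for illustration. Reading little-endian bytes as if they were big-endian yields a different number, which is why a binary format written in one byte order cannot be read on the other without explicit conversion.

```python
import struct
import sys

value = 0x0A0B0C0D  # an arbitrary 32-bit word used for illustration

big = struct.pack(">I", value)     # big-endian: most significant byte first
little = struct.pack("<I", value)  # little-endian: least significant byte first

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a

# Interpreting the little-endian bytes with the wrong byte order
# produces a different integer.
misread = int.from_bytes(little, "big")
print(hex(misread))  # 0xd0c0b0a

# Byte order of the machine running this script ("little" or "big").
print(sys.byteorder)
```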

OpenStep is a defunct object-oriented application programming interface (API) specification for a legacy object-oriented operating system, with the basic goal of offering a NeXTSTEP-like environment on non-NeXTSTEP operating systems. OpenStep was principally developed by NeXT with Sun Microsystems, to allow advanced application development on Sun's operating systems, specifically Solaris. NeXT produced a version of OpenStep for its own Mach-based Unix, stylized as OPENSTEP, as well as a version for Windows NT. The software libraries that shipped with OPENSTEP are a superset of the original OpenStep specification, including many features from the original NeXTSTEP.

CiteSeer^x is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. CiteSeer-like engines and archives usually harvest documents only from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

Apache Nutch is a highly extensible and scalable open source web crawler software project.

The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search engines. The opposite term is the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with coining the term deep web in 2001 as a search-indexing term.

Stratus VOS is a proprietary operating system running on Stratus Technologies fault-tolerant computer systems. VOS is available on Stratus's ftServer and Continuum platforms. VOS customers use it to support high-volume transaction processing applications which require continuous availability. VOS is notable for being one of the few operating systems which run on fully lockstepped hardware.

GLib is a bundle of three low-level system libraries written in C and developed mainly by GNOME. GLib's code was separated from GTK, so it can be used by software other than GNOME and has been developed in parallel ever since.

Google Developers is Google's site for software development tools and platforms, application programming interfaces (APIs), and technical resources. The site contains documentation on using Google developer tools and APIs, as well as discussion groups and blogs for developers using Google's developer products.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

NeuroLex is a dynamic lexicon of neuroscience concepts. It is structured as a semantic wiki, using Semantic MediaWiki. NeuroLex is supported by the Neuroscience Information Framework project.

The Cafu Engine is a game engine developed by Carsten Fuchs. It is portable across platforms and runs on Windows and Linux, with plans to be adapted to OS X. The engine's source code is freely available under the MIT Licence.

The National Corpus of Polish is the biggest and the most important corpus of the Polish language. A linguistic corpus is a collection of texts where one can find the typical use of a single word or a phrase, as well as their meaning and grammatical function.

Waterfox is an open-source web browser for x64, ARM64, and PPC64LE systems. It is intended to be ethical and maintain support for legacy extensions dropped by Firefox, from which it is forked. There are official releases for Windows, Mac OS, Linux and Android in two versions: Classic and Current.

Falkon is a free and open-source web browser. It is built on Qt WebEngine, which is a wrapper for the Chromium browser core.

The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish. There are also some specialized English corpora, such as American English, British English, and English Fiction.

Translatewiki.net is a web-based translation platform powered by the Translate extension for MediaWiki. It can be used to translate various kinds of texts but is commonly used for creating localisations for software interfaces.

Semantic Scholar is a project developed at the Allen Institute for Artificial Intelligence. Publicly released in November 2015, it is designed to be an AI-backed search engine for academic publications. The project uses a combination of machine learning, natural language processing, and machine vision to add a layer of semantic analysis to the traditional methods of citation analysis, and to extract relevant figures, entities, and venues from papers. In comparison to Google Scholar and PubMed, Semantic Scholar is designed to highlight the most important and influential papers, and to identify the connections between them.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10¹⁰) words per language, which gave rise to the corpus family's name.

References

[1] "Poliqarp search engine for NKJP data". nkjp.pl. Retrieved 1 December 2020.
[2] "Poliqarp 1.1". nlp.ipipan.waw.pl. Retrieved 1 December 2020.
[3] Janus, Daniel; Przepiórkowski, Adam (25 June 2007). "Poliqarp: an open source corpus indexer and search engine with syntactic extensions". Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics. pp. 85–88. doi:10.5555/1557769.1557795. Retrieved 1 December 2020.

