Apache Solr

Solr
Developer(s) Apache Software Foundation
Stable release 9.6.1 [1] / 29 May 2024
Repository Solr Repository
Written in Java
Operating system Cross-platform
Type Search and index API
License Apache License 2.0
Website solr.apache.org

Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features [2] and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. [3] Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr's external configuration allows it to be tailored to many types of applications without Java coding, and it has a plugin architecture to support more advanced customization.
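As a rough illustration of the JSON-over-HTTP interface described above, the sketch below builds a request that would add one document to a collection through Solr's `/update` handler. The host and port (`localhost:8983`), the collection name `articles`, and the document fields are assumptions chosen for the example, and the helper `build_update_request` is hypothetical. The request is only constructed, not sent.

```python
import json
from urllib.request import Request

def build_update_request(base_url, collection, docs):
    """Build (but do not send) a POST request that adds docs to a collection."""
    # commit=true asks Solr to make the documents searchable right away.
    url = f"{base_url}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local server, collection name, and document fields.
request = build_update_request(
    "http://localhost:8983",
    "articles",
    [{"id": "1", "title": "Hello Solr"}],
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would submit the document, provided a Solr server with such a collection is actually running at that address.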

Apache Solr is developed in an open, collaborative manner by the Apache Solr project at the Apache Software Foundation.

History

In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability for the company website. [4]

In January 2006, CNET Networks decided to openly publish the source code by donating it to the Apache Software Foundation. [5] Like any new Apache project, it entered an incubation period that helped solve organizational, legal, and financial issues.

In January 2007, Solr graduated from incubation status into a standalone top-level project (TLP) and grew steadily with accumulated features, thereby attracting users, contributors, and committers. Although quite new as a public project, it powered several high-traffic websites. [6]

In September 2008, Solr 1.3 was released including distributed search capabilities and performance enhancements among many others. [7]

In January 2009, Yonik Seeley, along with Grant Ingersoll and Erik Hatcher, joined Lucidworks (formerly Lucid Imagination), the first company providing commercial support and training for Apache Solr search technologies. [citation needed] Since then, support offerings around Solr have been abundant. [8]

November 2009 saw the release of Solr 1.4. This version introduced enhancements in indexing, searching and faceting, along with many other improvements such as rich document processing (PDF, Word, HTML), search results clustering based on Carrot2, and improved database integration. The release also featured many additional plug-ins. [9]

In March 2010, the Lucene and Solr projects merged. [10] Separate downloads continued, but the products were now jointly developed by a single set of committers.

In 2011, the Solr version number scheme was changed in order to match that of Lucene. After Solr 1.4, the next release of Solr was labeled 3.1, in order to keep Solr and Lucene on the same version number. [11]

In October 2012, Solr version 4.0 was released, including the new SolrCloud feature. [12] 2013 and 2014 saw a number of Solr releases in the 4.x line, steadily growing the feature set and improving reliability.

In February 2015, Solr 5.0 was released, [13] the first release in which Solr was packaged as a standalone application, [14] ending official support for deploying Solr as a WAR file. Solr 5.3 featured a built-in pluggable authentication and authorization framework. [15]

In April 2016, Solr 6.0 was released. [16] It added support for executing Parallel SQL queries across SolrCloud collections, along with StreamExpression support and a new JDBC driver for the SQL interface.

In September 2017, Solr 7.0 was released. [17] Among other things, this release added support for multiple replica types, auto-scaling, and a math engine.

In March 2019, Solr 8.0 was released, including many bug fixes and component updates. [18] Solr nodes could now listen for and serve HTTP/2 requests, and by default internal requests were also sent over HTTP/2. The release also added an admin UI login with support for BasicAuth and Kerberos, and made it possible to plot math expressions in Apache Zeppelin.

In November 2020, Bloomberg donated the Solr Operator to the Lucene/Solr project. The Solr Operator helps deploy and run Solr in Kubernetes.

In February 2021, Solr was established as a separate Apache project (TLP), independent from Lucene.

In May 2022, Solr 9.0 was released [19] as the first release independent from Lucene. It requires Java 11, and its highlights include KNN "neural" search, better modularization, and more security plugins.

Operations

To search a document, Apache Solr performs the following operations in sequence:

  1. Indexing: Solr converts the documents into a machine-readable format.
  2. Querying: Solr interprets the terms of the query asked by the user. These terms can be images or keywords, for example.
  3. Mapping: Solr maps the user query to the documents stored in the index to find the appropriate results.
  4. Ranking: once the engine has found the matching documents, it ranks the results by their relevance.
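The four steps can be sketched with a toy inverted index. This is only an illustration of the general idea, not how Solr or Lucene is implemented: the tokenizer, the function names, the sample documents, and the naive term-frequency score are all assumptions made for the example, and real engines use far more elaborate analysis and relevance models.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    """Querying, mapping and ranking over the toy index."""
    scores = Counter()
    for term in tokenize(query):  # querying: break the query into terms
        # mapping: look up the documents that contain this term
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # ranking input: naive term-frequency score
    # ranking: highest score first, ties broken by document id
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked]

# Hypothetical sample documents.
docs = {
    "a": "Solr is a search server built on Lucene",
    "b": "Lucene is a Java search library; Solr wraps Lucene",
    "c": "Hadoop processes big data",
}
index = build_index(docs)
print(search(index, "lucene search"))  # prints ['b', 'a']
```

Document `b` ranks first because it mentions "lucene" twice, while document `c` contains neither query term and is not returned.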

Community

Both individuals and companies contribute new features and bug fixes to Solr. [20] [21] [22] [23] [24]

Integrating Solr

Solr is bundled as the built-in search in many applications such as content management systems and enterprise content management systems. Hadoop distributions from Cloudera, [25] Hortonworks [26] and MapR all bundle Solr as the search engine for their products marketed for big data. DataStax DSE integrates Solr as a search engine with Cassandra. [27] Solr is supported as an endpoint in various data processing frameworks and enterprise integration frameworks. [citation needed]

Solr exposes industry standard HTTP REST-like APIs with both XML and JSON support, and will integrate with any system or programming language supporting these standards. For ease of use there are also client libraries available for Java, C#, PHP, Python, Ruby and most other popular programming languages. [28]
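Because the API is plain HTTP, a query can be issued without a dedicated client library. The sketch below uses only the Python standard library to build a URL for Solr's `/select` handler with the `q`, `rows` and `wt` parameters, and to read the matching documents from a JSON response. The host, port, collection name `techproducts`, and field name `name` are assumptions for the example, and both helper functions are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_select_url(base_url, collection, query, rows=10):
    """Build a URL for the /select handler that asks for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"

def fetch_docs(url):
    """Send the query and return the list of matching documents."""
    with urlopen(url) as response:
        body = json.load(response)
    # Solr's JSON response nests the results under response -> docs.
    return body["response"]["docs"]

# Assumed local server, collection, and field name.
url = build_select_url("http://localhost:8983", "techproducts", "name:ipod")
print(url)
```

Calling `fetch_docs(url)` would return the documents only if a Solr server with that collection is reachable at the assumed address.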


References

  1. https://solr.apache.org/news.html#apache-solrtm-961-available
  2. "Solr 4 preview: SolrCloud, NoSQL, and more | SearchHub | Lucene/Solr Open Source Search". Archived from the original on 2014-07-06. Retrieved 2014-07-10.
  3. "Apache Solr -". apache.org. Retrieved 16 January 2017.
  4. Thuma, John (2018-08-09). "What is Apache Solr". Medium. Retrieved 2022-10-16.
  5. "[SOLR-1] CNET code contribution - ASF JIRA". apache.org. Retrieved 16 January 2017.
  6. "PublicServers - Solr Wiki". apache.org. Retrieved 16 January 2017.
  7. "Apache Solr -". apache.org. Retrieved 16 January 2017.
  8. "Support - Solr Wiki". apache.org. Retrieved 16 January 2017.
  9. "Apache Solr -". apache.org. Retrieved 16 January 2017.
  10. "[VOTE] merge lucene/solr development (take 3) - Yonik Seeley - org.apache.lucene.general - MarkMail". markmail.org. Archived from the original on 24 April 2021. Retrieved 16 January 2017.
  11. "Solr3.1 - Solr Wiki". wiki.apache.org. 2013-05-16. Retrieved 2013-07-21.
  12. "Apache Lucene". lucene.apache.org. Retrieved 2013-07-21.
  13. "Apache Solr - News". apache.org. Retrieved 16 January 2017.
  14. "[SOLR-6733] Umbrella issue - Solr as a standalone application - ASF JIRA". apache.org. Retrieved 16 January 2017.
  15. "Solr 5.3 Release announcement". lucene.apache.org. Retrieved 2015-09-24.
  16. "Apache Solr - News". apache.org. Retrieved 16 January 2017.
  17. "Apache Solr - News".
  18. "Apache Solr 8.0 Release notes".
  19. "12 May 2022, Apache Solr™ 9.0.0 available".
  20. "Highest Voted 'solr' Questions". stackoverflow.com. Retrieved 16 January 2017.
  21. "Lucene/Solr Revolution 2016". lucenerevolution.org. Archived from the original on 5 September 2017. Retrieved 16 January 2017.
  22. "SFBay Apache Lucene/Solr Meetup". meetup.com. Retrieved 16 January 2017.
  23. "Oslo Solr Community". meetup.com. Retrieved 16 January 2017.
  24. "LinkedIn Solr Group". linkedin.com. Retrieved 16 January 2017.
  25. "Hadoop for Everyone: Inside Cloudera Search - Cloudera Engineering Blog". cloudera.com. 24 June 2013. Retrieved 16 January 2017.
  26. "Bringing Enterprise Search to Enterprise Hadoop - Hortonworks". hortonworks.com. 2 April 2014. Retrieved 16 January 2017.
  27. "DataStax Enterprise: Cassandra with Solr Integration Details". datastax.com. 12 April 2012. Retrieved 6 February 2017.
  28. "IntegratingSolr - Solr Wiki". apache.org. Retrieved 16 January 2017.
