Gadfly (database)

Last updated
Gadfly
Developer(s) Aaron Robert Watters
Initial release1994;30 years ago (1994)
Stable release
1.0 / January 5, 2005;19 years ago (2005-01-05)
Written in Python
Operating system Cross-platform
Type Relational Database Management System
License Python License
Website gadfly.sourceforge.net

Gadfly is a relational database management system written in Python. Gadfly is a collection of Python modules that provides relational database functionality entirely implemented in Python. It supports a subset of the standard RDBMS Structured Query Language (SQL). [1] [2]

Contents

Gadfly runs wherever Python runs and supports client/server on any platform that supports the standard Python socket interface. The file formats used by Gadfly for storage are cross-platform—a gadfly database directory can be moved from Windows 95 to Linux using a binary copying mechanism and gadfly will read and run the database.

It supports persistent databases consisting of a collection of structured tables with indices, and a large subset of SQL for accessing and modifying those tables. It supports a log-based recovery protocol which allows committed operations of a database to be recovered even if the database was not shut down[ clarification needed ] in a proper manner (i.e., in the event of a CPU or software crash, [but not in the event of a disk crash]). It also supports a TCP/IP Client/Server mode where remote clients can access a Gadfly database over a TCP/IP network (such as the Internet) subject to configurable security mechanisms.

Since Gadfly depends intimately on the kwParsing package it is distributed as part of the kwParsing package, under the same copyright.

Gadfly allows Python programs to store, retrieve and query tabular data without having to rely on any external database engine or package. It provides an in-memory relational database style engine for Python programs, complete with a notion of a "committed, recoverable transaction" and "aborts".[ citation needed ]

Usage

The main "gadfly" module attempts to faithfully adhere to Greg Stein's Python Database API, as discussed and certified by the Python DB-SIG.

Concurrent database updates are not supported. The "databases" are currently designed to be written/modified by one process in isolation. Multiple processes can access[ clarification needed ] a Gadfly database when accesses are arbitrated by a TCP/IP Gadfly server process.

Creating a new database

Unlike most Python/database-engine interfaces Gadfly databases must be created using Python. To accomplish this programmers use:

importgadflyconnection=gadfly.gadfly()

with no arguments. Then they startup a database using the startup method:

connection.startup("mydatabase","mydirectory")

Here "mydirectory" must be a directory which exists and which can be written to in order to store the database files. The startup creates some files in "mydirectory". This has the effect of clobbering any existing Gadfly database called "mydatabase" in the directory "mydirectory". Gadfly will not allow a start up the same connection twice, however.

The first "import gadfly" reads in and initializes some rather large data structures used for parsing SQL, and thus may take longer than other module imports.

Within the database the user can create tables, populate them, and commit the result when they are happy:

cursor=connection.cursor()cursor.execute("CREATE TABLE ph (nm VARCHAR, ph VARCHAR)")cursor.execute("INSERT INTO ph(nm, ph) VALUES ('arw', '3367')")cursor.execute("SELECT * FROM ph")forxincursor.fetchall():printx# prints ('arw', '3367')connection.commit()

Reconnecting to an existing database

Once a database exists it can be reconnected to:

importgadflyconnection=gadfly.gadfly("mydatabase","mydirectory")

This will read in the database tables with the most recently committed values. The initialized database is then queried and updated:

cursor=connection.cursor()cursor.execute("UPDATE ph SET nm='aaron' WHERE nm='arw'")cursor.execute("SELECT * FROM ph")forxincursor.fetchall():printx# prints ('aaron', '3367')

If the user does not want to commit updates then the do not execute a commit on the connection object (which writes out the tables). To restore the old values from the existing database they use:

connection.abort()

Updates are only stored upon a connection.commit(). [Actually, if autocheckpoint is disabled, updates are only stored to table files on checkpoint—see the documentation on the recovery mechanism.]

printcursor.pp()

to "pretty print" the result of any evaluation (which might be None for a non-select).

Features

In this version all tables are read into memory upon "connecting" to the database and "touched" tables are written out upon checkpoint. Each table is represented as a separate file in the destination directory, and there is a "data definition" file as well (a list of data definition declarations). During active use a log file appears in the active directory as well, and if the process crashes this log file is used to recover committed operations.

The SELECT statement

At this point Gadfly supports quite a lot of the SQL semantics requested in the ODBC 2.0 specification. SQL statements supported include the SELECT:

SELECT[DISTINCT|ALL]expressionsor*FROMtables[WHEREcondition][GROUPBYgroup-expressions][HAVINGaggregate-condition][union-clause][ORDERBYcolumns]

This statement is quite powerful. It reads as follows:

  1. Make all combinations of rows from the tables (FROM line)
  2. Eliminate those combinations not satisfying condition (WHERE line)
  3. (if GROUP present) form aggregate groups that match on group-expressions
  4. (if HAVING present) eliminate aggregate groups that don't satisfy the aggregate-condition.
  5. compute the columns to keep (SELECT line).
  6. (if union-clause present) combine (union, difference, intersect) the result with the result of another select statement.
  7. if DISTINCT, throw out redundant entries.
  8. (if ORDER present) order the result by the columns (ascending or descending as specified, with precedence as listed).

The actual implementation in gadfly is much better than the intuitive reading, particularly at steps 1 and 2 (which are combined via optimizing transformations and hash join algorithms).

Conditions may include equalities, and inequalities of expressions. Conditions may also be combined using AND, OR, NOT. Expressions include column names, constants, and standard arithmetic operations over them.

Embedded queries supported include subquery expressions, expr IN (subselect), quantified comparisons, and the EXISTS (subselect) predicate.

Aggregate tests and computations can only be applied after the GROUPing and before the columns are selected (steps 3,4,5). Aggregate operations include COUNT(*), COUNT(expression), AVG(expression), SUM(expression), MAX(expression), MIN(expression), and the non-standard MEDIAN(expression). These may be applied to DISTINCT values (throwing out redundancies, as in COUNT(DISTINCT drinker). if no GROUPing is present the aggregate computations apply to the entire result after step 2.

There is much more to know about the SELECT statement. The test suite test/test_gadfly.py gives numerous examples of SELECT statements.

Table creation and "data types"

Create tables using the CREATE TABLE statement:

CREATETABLEname(colnamedatatype[,colnamedatatype...])

Data types currently "supported" are integer, float, and varchar. They are ignored by the implementation, anything that is hashable and marshallable can currently go in any column (but that is likely to change). For example:

CREATETABLEfrequents(drinkerVARCHAR,barVARCHAR,perweekINTEGER)

At present tuples, complexes, or anything else can be put into a column specified as "VARCHAR".

Other supported statements

Gadfly also supports the searched DELETE and UPDATE; INSERT VALUES and INSERT subselect; CREATE/DROP INDEX, and DROP TABLE. These have the informal syntax:

DELETEFROMtableWHEREconditionUPDATEtableSETcol=expr[,col=expr...]WHEREconditionINSERTINTOtable[(column[,column...])]VALUES(value[,value...])INSERTINTOtable[(column[,column...])]subselectCREATE[UNIQUE]INDEXnameONtable(column[,column...])DROPTABLEtableDROPINDEXname

Multiple statements may be executed in one cursor.execute(S) by separating the statements with semicolons in S, for example S might have the string value:

DROPINDEXtdindex;DROPTABLEtemplikes

SQL is case insensitive.

Dynamic values

Expressions also include the special expression '?' (the ODBC-style dynamic expression) as in:

insertstat="INSERT INTO ph(nm,ph) VALUES (?, ?)"cursor.execute(insertstat,('nan',"0356"))cursor.execute(insertstat,('bill',"2356"))cursor.execute(insertstat,('tom',"4356"))

Dynamic values allow the cursor to use the same parsed expression many times for a similar operation. Above the insertstat is parsed and bound to the database only once. Using dynamic attributes should speed up accesses. Thus the above should run much faster than the equivalent:

cursor.execute("INSERT INTO ph(nm,ph) VALUES ('nan', '0356')")cursor.execute("INSERT INTO ph(nm,ph) VALUES ('bill', '2356')")cursor.execute("INSERT INTO ph(nm,ph) VALUES ('tom', '4356')")

Dynamic attributes can appear in other statements containing expressions (such as SELECTs, UPDATEs and DELETEs too).

For SELECT, UPDATE, and DELETE the dynamic expression substitutions must consist of a single tuple, as in:

stat="SELECT * FROM ph WHERE nm=?"cursor.execute(stat,("nan",))...cursor.execute(stat,("bob",))...

Since the dynamic substitution eliminates the need for parsing and binding (expensive operations!) the above should run faster than the equivalent:

cursor.execute("SELECT * FROM ph WHERE nm='nan'")...cursor.execute("SELECT * FROM ph WHERE nm='bob'")...

If several similar queries are repeated multiple times, each query "template string" is associated with a unique cursor object so that each template must be parsed and bound only once. Some relatively complex queries from the test suite run 2 to 3 times faster after they have been parsed and bound, even without the kjbuckets builtin. With kjbuckets the same ran 5 to 10 times faster.

Multiple batch inserts and dynamic values

For the special case of INSERT VALUES a list of substitution tuples allows the query engine to perform the inserts in optimized batch mode. Thus the fastest way to perform the three inserts given earlier is:

data=[('nan',"0356")),('bill',"2356"),('tom',"4356")]stat="INSERT INTO ph(nm,ph) VALUES (?, ?)"cursor.execute(stat,data)

It would be even faster if the cursor had previously executed the stat with different data (since then no parsing or binding would occur).

Introspection

By default a gadfly database automatically includes "introspective" tables which allow a gadfly query to "query the shape of the database"—for example to view table names and names of rows in tables:

>>> g=gadfly()>>> g.startup("dbtest","dbtest")>>> c=g.cursor()>>> c.execute("select * from __table_names__")>>> printc.pp()IS_VIEW | TABLE_NAME=========================1       | __TABLE_NAMES__1       | DUAL1       | __DATADEFS__1       | __COLUMNS__1       | __INDICES__1       | __INDEXCOLS__

Here DUAL is a standard one row/one column test table (from the Oracle tradition) and the other tables provide information about the database schema:

>>> c.execute("create table t1 (a varchar, b varchar)")>>> c.execute("create table t2 (b varchar, c varchar)")>>> c.execute("create unique index t1a on t1(a)")>>> c.execute("create index t1b on t1(b)")>>> c.execute("select * from __table_names__")>>> printc.pp()IS_VIEW | TABLE_NAME=========================0       | T11       | __DATADEFS__1       | __INDICES__0       | T21       | __TABLE_NAMES__1       | __COLUMNS__1       | DUAL1       | __INDEXCOLS__>>> c.execute("select * from __columns__")>>> printc.pp()COLUMN_NAME | TABLE_NAME=============================A           | T1B           | T1NAME        | __DATADEFS__DEFN        | __DATADEFS__INDEX_NAME  | __INDICES__TABLE_NAME  | __INDICES__IS_UNIQUE   | __INDICES__TABLE_NAME  | __TABLE_NAMES__IS_VIEW     | __TABLE_NAMES__B           | T2C           | T2COLUMN1     | DUALTABLE_NAME  | __COLUMNS__COLUMN_NAME | __COLUMNS__INDEX_NAME  | __INDEXCOLS__COLUMN_NAME | __INDEXCOLS__>>> c.execute("select * from __indices__")>>> printc.pp()IS_UNIQUE | TABLE_NAME | INDEX_NAME===================================0         | T1         | T1B1         | T1         | T1A>>> c.execute("select * from __indexcols__")>>> printc.pp()COLUMN_NAME | INDEX_NAME========================B           | T1BA           | T1A>>> c.execute("select * from dual")>>> printc.pp()COLUMN1=======0

Interactive testing

After installation, the created database can be interactively tested from the same directory using the interactive interpreter:

Python 2.1.3 (#1, Apr 30 2002, 19:37:40)[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-96)] on linux2Type "copyright", "credits" or "license" for more information.>>>>>> fromgadflyimportgadfly>>> connection=gadfly("test","dbtest")>>> cursor=connection.cursor()>>> cursor.execute("select * from frequents")>>> cursor.description(('DRINKER', None, None, None, None, None, None), ('PERWEEK', None, None,None, None, None, None), ('BAR', None, None, None, None, None, None))>>> printcursor.pp()DRINKER | PERWEEK | BAR============================adam    | 1       | lolaswoody   | 5       | cheerssam     | 5       | cheersnorm    | 3       | cheerswilt    | 2       | joesnorm    | 1       | joeslola    | 6       | lolasnorm    | 2       | lolaswoody   | 1       | lolaspierre  | 0       | frankies>>>

Architecture

The SQL grammar is described in grammar.py, the binding of the grammar constructs to semantic objects is performed in bindings.py, the semantic objects and their execution strategies is defined in semantics.py. The semantics use a lot of classical and non-classical logic (cylindric logic) as well as optimization heuristics to define a relatively efficient and correct implementation of SQL.

The most basic data structures of the implementation are given in either kjbuckets0.py or the faster kjbucketsmodule.c, which implement the same data type signatures in Python and in a C extension to Python respectively.

The database.py module is a simple wrapper that provides a standard DBAPI interface to the system.

The test suite test/test_gadfly.py attempts to provide a regression test and a demonstration of the system.

The SQL parser also requires the kwParsing parser generation package, which consists of a number of additional python modules.

Tools

With gfplus a gadfly database can be interactively manipulated with SQL commands. The tool works similar to Oracle's SQL*Plus.

Concurrency

Because it lacks true concurrency control and file-system based indexing it is not appropriate for very large multiprocess transaction-based systems.

Two applications may access the same database concurrently. However, changes made by one application may not be seen by the other application until after it restarts. This may be because each application loads the database in-memory at startup only.

Related Research Articles

<span class="mw-page-title-main">PostgreSQL</span> Free and open-source object relational database management system

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. PostgreSQL features transactions with atomicity, consistency, isolation, durability (ACID) properties, automatically updatable views, materialized views, triggers, foreign keys, and stored procedures. It is supported on all major operating systems, including Linux, FreeBSD, OpenBSD, macOS, and Windows, and handles a range of workloads from single machines to data warehouses or web services with many concurrent users.

Structured Query Language (SQL) is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.

A stored procedure is a subroutine available to applications that access a relational database management system (RDBMS). Such procedures are stored in the database data dictionary.

In the context of SQL, data definition or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. DDL statements are similar to a computer programming language for defining data structures, especially database schemas. Common examples of DDL statements include CREATE, ALTER, and DROP.

<span class="mw-page-title-main">Join (SQL)</span> SQL clause

A join clause in the Structured Query Language (SQL) combines columns from one or more tables into a new table. The operation corresponds to a join operation in relational algebra. Informally, a join stitches two tables and puts on the same row records with matching fields : INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

The SQL SELECT statement returns a result set of rows, from one or more tables.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. UDFs are usually written for the requirement of its creator.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

<span class="mw-page-title-main">RDFLib</span> Python library to serialize, parse and process RDF data

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information. This library contains parsers/serializers for almost all of the known RDF serializations, such as RDF/XML, Turtle, N-Triples, & JSON-LD, many of which are now supported in their updated form. The library also contains both in-memory and persistent Graph back-ends for storing RDF information and numerous convenience functions for declaring graph namespaces, lodging SPARQL queries and so on. It is in continuous development with the most recent stable release, rdflib 6.1.1 having been released on 20 December 2021. It was originally created by Daniel Krech with the first release in November, 2002.

An identity column is a column in a database table that is made up of values generated by the database. This is much like an AutoNumber field in Microsoft Access or a sequence in Oracle. Because the concept is so important in database science, many RDBMS systems implement some type of generated key, although each has its own terminology. Today a popular technique for generating identity is to generate a random UUID.

Judoscript is a general purpose programming language designed primarily for scripting tasks on the Java platform. It was conceived and developed by James Jianbo Huang, starting in late 2001. Judoscript was one of the first so-called Java scripting languages; but its most striking characteristics is its audacious multi-domain support philosophy and practice.

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

Microsoft SQL Server is a proprietary relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

Web2py is an open-source web application framework written in the Python programming language. Web2py allows web developers to program dynamic web content using Python. Web2py is designed to help reduce tedious web development tasks, such as developing web forms from scratch, although a web developer may build a form from scratch if required.

Apache Empire-db is a Java library that provides a high level object-oriented API for accessing relational database management systems (RDBMS) through JDBC. Apache Empire-db is open source and provided under the Apache License 2.0 from the Apache Software Foundation.

Java Database Connectivity (JDBC) is an application programming interface (API) for the Java programming language which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. It is part of the Java Standard Edition platform, from Oracle Corporation. It provides methods to query and update data in a database, and is oriented toward relational databases. A JDBC-to-ODBC bridge enables connections to any ODBC-accessible data source in the Java virtual machine (JVM) host environment.

In database management systems (DBMS), a prepared statement, parameterized statement, or parameterized query is a feature where the database pre-compiles SQL code and stores the results, separating it from data. Benefits of prepared statements are:

PL/SQL is Oracle Corporation's procedural extension for SQL and the Oracle relational database. PL/SQL is available in Oracle Database, Times Ten in-memory database, and IBM Db2. Oracle Corporation usually extends PL/SQL functionality with each successive release of the Oracle Database.

The syntax of the SQL programming language is defined and maintained by ISO/IEC SC 32 as part of ISO/IEC 9075. This standard is not freely available. Despite the existence of the standard, SQL code is not completely portable among different database systems without adjustments.

References

  1. "Gadfly—Zope's Integrated Demo Relational Database". InformIT. Retrieved February 19, 2011.
  2. "Gadfly". SourceForge. Retrieved February 19, 2011.