MetaHDBC paper (draft)

September 18, 2008

In May I started a Haskell-cafe discussion in which I proposed using
Template Haskell to do type-safe database access. The idea was well
received and turned into the MetaHDBC library.

Concurrently with development I also wrote a paper describing
MetaHDBC. I have never written a paper before, and I therefore tried
to imitate the best by following Simon PJ’s slides and video about
writing papers.

A draft of the paper can be found here. Your opportunity to
influence the paper is large, as my limited paper-writing experience
means I have few preconceptions about what a good paper looks
like. I would especially like comments about the overall quality of
the paper, whether it can be called scientific, and anything I could
do to improve it. And remember, if commenting, honest is better than
polite.

6 Responses to “MetaHDBC paper (draft)”

  1. Anthony Clayden Says:

    Hi Mads,

    I’m very pleasantly surprised that TH is powerful enough to connect to the database at compile time. It’s then a great idea to ‘pre-validate’ the database access embedded in your program. (So turning SQL from an interpreted language into a compiled language – almost ;-)
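    To make the trick concrete, here is a minimal sketch of compile-time validation via Template Haskell. (The validator is a self-contained stand-in, not MetaHDBC’s actual API; a real one would connect via HDBC/ODBC and ask the DBMS to prepare the statement.)

    {-# LANGUAGE TemplateHaskell #-}
    module CheckedSQL where

    import Language.Haskell.TH (Q, Exp, runIO, stringE)

    -- Stand-in validator; kept trivial so the sketch runs on its own.
    validates :: String -> IO Bool
    validates = return . not . null

    -- Check a SQL string while the module is compiling, and abort
    -- compilation if the check fails.
    checkedSQL :: String -> Q Exp
    checkedSQL sql = do
      ok <- runIO (validates sql)     -- arbitrary IO at compile time
      if ok
        then stringE sql              -- embed the validated string
        else fail ("SQL rejected at compile time: " ++ sql)

    Used as $(checkedSQL "SELECT x FROM foo") from another module, a bad statement becomes a compile error instead of a run-time surprise.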

    I’m afraid, though, that your so-called results show that you understand very little about SQL. (In my following comments, I’m not claiming there’s anything ‘better’ about SQL – just that it is well-established and well-understood.)

    Before you claim “The results are less than encouraging … of inferring types …” [Section 6 of your paper], or “… disappointed with … ability to do type inference” [Section 11], I would look long and hard at the SQL DESCRIBE statement.

    SQL has very powerful means to infer types. And it must do that to validate and optimise queries before executing them.

    Consider: orders of magnitude more lines of SQL code than Haskell are executed in commercial data processing every day; applications written in Java, C, Visual Basic, C#, C++, Perl, Ruby, etc access databases via SQL strings; the SQL must be able to provide consistent, well-typed return data for these applications to operate.

    Anthony

  2. Mads Lindstrøm Says:

    Hi Anthony,

    I am very pleased that you took the time to look at the MetaHDBC paper, and I can see that the paper should explain some things better. You state:

    SQL has very powerful means to infer types. And it must do that to validate and optimise queries before executing them.

    Consider: orders of magnitude more lines of SQL code than Haskell are executed in commercial data processing every day; applications written in Java, C, Visual Basic, C#, C++, Perl, Ruby, etc access databases via SQL strings; the SQL must be able to provide consistent, well-typed return data for these applications to operate.

    The point is not whether database management systems (DBMSs) internally do type inference, but whether they make the results of type inference available to the DBMS user, and whether those results are available before executing a SQL statement. Nonetheless, I must concede that the paper does not make this distinction explicit. Unfortunately, the paper also does not describe the means used for type inference. MetaHDBC asks DBMSs for inferred types through ODBC’s describeParam (and friends).
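    As a sketch of what that “asking” looks like through HDBC (describeResult is HDBC’s actual call; the DSN is hypothetical, and whether a driver answers before execution varies):

    import Database.HDBC (prepare, describeResult, disconnect)
    import Database.HDBC.ODBC (connectODBC)

    -- Ask the DBMS, via the ODBC driver, what it infers about a
    -- statement's result columns, without executing the statement.
    -- Some drivers refuse to answer until after execute; that is
    -- exactly the limitation the paper runs into.
    inferredColumns :: String -> IO ()
    inferredColumns sql = do
      conn <- connectODBC "DSN=testdb"   -- hypothetical data source
      stmt <- prepare conn sql
      cols <- describeResult stmt        -- [(columnName, SqlColDesc)]
      mapM_ print cols
      disconnect conn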

    I do not, however, see your point about “SQL DESCRIBE”. You state:

    Before you claim “The results are less than encouraging … of inferring types …” [Section 6 of your paper], or “… disappointed with … ability to do type inference” [Section 11], I would look long and hard at the SQL DESCRIBE statement.

    And I therefore looked up “SQL DESCRIBE” in MySQL’s manual:

    DESCRIBE provides information about the columns in a table. It is a shortcut for SHOW COLUMNS FROM.

    How would that help infer the type of a SQL statement? And yes, I do know that you could ask most database management systems for the type of a column.

    Greetings,

    Mads Lindstrøm

  3. Anthony Clayden Says:

    Hi Mads,

    Yes, it seems the limitations on getting type info have more to do with ODBC and/or TH’s ability to interface to the internals of the DBMS. I would re-phrase so you don’t suggest DBMSs or SQL can’t do type inference.

    re DESCRIBE, I had in mind the ANSI/ISO 1992 SQL standard, as available in Oracle, SQL Server, Sybase. (I notice Wouter also suggested this in the May discussion.) From your description, it sounds like MySQL doesn’t implement this fully. And the purpose of DESCRIBE is exactly to make type information available without executing a statement.

    DESCRIBE can apply to a SELECT statement (not just a table) — so you can use it to obtain type information about joins and aggregates, as well as the ‘base tables’. It tells you the column types, their lengths, and whether they are nullable. (This goes via SQLPrepare – which I think you’re doing already.)

    DESCRIBE OUTPUT tells you the types of the columns returned from the SELECT. (So in the simple case of a SELECT * FROM a single table, yes, it would be equivalent to SHOW COLUMNS FROM that table.) Have you tried SQLDescribeCol for a SQLPrepared statement? — from the msdn .Net description, it sounds like it should do the trick.

    DESCRIBE INPUT tells you the types of the dynamic parameters the query expects. (I guess same as SQLDescribeParam you mention.)

    I’ve mostly seen DESCRIBE applied to SELECT statements, but from the standard, it seems it can also apply to inserts, updates and deletes, etc.

    I suspect that to get finer-grained info, you need to decompose the SQL statements to identify the table name(s), column list, etc. And a better way round to achieve that would be building up the SQL using combinators, rather than having the end-user programmer provide the SQL as a string and TH try to parse it — I think Bjorn hinted at something along these lines back in May.

    Regards
    Anthony

  4. Mads Lindstrøm Says:

    Hi Anthony,

    You write:

    Yes, it seems the limitations on getting type info have more to do with ODBC and/or TH’s ability to interface to the internals of the DBMS. I would re-phrase so you don’t suggest DBMSs or SQL can’t do type inference.

    Depending on your point of view, it has to do with ODBC or with DBMS vendors’ implementations of ODBC. ODBC does specify describeParam, SQLDescribeCol, SQLNumParams, … but ODBC drivers are allowed to return: “Not implemented”. You can choose to view this as too lax a specification of ODBC, or as a limitation of some ODBC drivers.

    However, I do not follow why you think that Template Haskell (TH) should be particularly suited for interfacing with the internals of a DBMS. The only thing TH has to do with DBMSs is that I chose to build a library that accesses DBMSs using TH. TH can do arbitrary IO at compile time, so obviously it can interface with anything, but it has no particular connection to databases. Maybe I am missing your point?
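    For example, a splice can run any IO during compilation – there is nothing database-specific about the mechanism (the file name here is hypothetical):

    {-# LANGUAGE TemplateHaskell #-}
    import Language.Haskell.TH (runIO, stringE)

    -- Embed the compile-time contents of a file into the program.
    -- The same runIO mechanism could just as well talk to a DBMS.
    banner :: String
    banner = $(runIO (readFile "banner.txt") >>= stringE)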

    Next you state:

    re DESCRIBE, I had in mind the ANSI/ISO 1992 SQL standard, as available in Oracle, SQL-server, Sybase. (I notice Wouter also suggested this in the May discussion.) From your description, it sounds like MySQL doesn’t implement this fully. And the purpose of DESCRIBE is exactly to make type information available without executing a statement.

    In May, Wouter suggested:

    … you really want to compute the *type* of a table based on the *value* of an SQL DESCRIBE.

    Wouter was writing about using SQL DESCRIBE to get the type of a table, not the type of a SQL statement. If you read the paper he refers to in the thread, you will see that he is describing a library similar to HaskellDB (the main difference being that Wouter uses dependent types), and thus he only needs the types of tables.

    And I am afraid SQL DESCRIBE is not going to help me much. Of the four DBMSs I investigated in the paper, only DB2 supports “SQL DESCRIBE”, but it already supports ODBC’s describeParam (and friends), so there is no real gain there. Anyway, thanks for the pointer; there may be other DBMSs that support SQL DESCRIBE but do not support describeParam (and friends).

    Later you write:

    Have you tried SQLDescribeCol for a SQLPrepared statement? — from the msdn .Net description, it sounds like it should do the trick.

    Yes, I am aware of SQLDescribeCol and have used the function since the very beginning of MetaHDBC.

    Finally you write:

    I suspect that to get finer-grained info, you need to decompose the SQL statements to identify the table name(s), column list, etc. And a better way round to achieve that would be building up the SQL using combinators, rather than having the end-user programmer provide the SQL as a string and TH try to parse it — I think Bjorn hinted at something along these lines back in May.

    In the same thread, Bjorn also encouraged me to implement MetaHDBC and package it as a library. Using query combinators, as HaskellDB does, or passing the SQL on to ODBC at compile time involves different trade-offs. Some situations will call for the one solution, others for the other.

    I do not think we need much finer-grained control of type information. The only type information I found lacking in ODBC is singleton results. E.g. if you do SELECT SUM(x) FROM foo, then ODBC will not tell you that you are requesting a singleton result. This is not a limitation of SQL, but a limitation in ODBC. Obviously, knowing whether a result is a singleton is nice, but it is not hugely important. And I suspect that it is not really easy to do with query combinators, as not even HaskellDB does it.
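    To illustrate what singleton detection would buy, here are two hypothetical signatures (bodies elided) for code MetaHDBC could generate if ODBC exposed this information:

    import Database.HDBC (IConnection)

    -- Hypothetical: an aggregate without GROUP BY yields exactly one
    -- row, so the splice could return a scalar (Maybe, since SUM over
    -- no rows is NULL) ...
    sumX :: IConnection conn => conn -> IO (Maybe Integer)
    sumX = undefined        -- SELECT SUM(x) FROM foo

    -- ... while a grouped aggregate yields many rows, hence a list.
    sumXByY :: IConnection conn => conn -> IO [(String, Integer)]
    sumXByY = undefined     -- SELECT y, SUM(x) FROM foo GROUP BY y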

    Greetings,

    Mads Lindstrøm

  5. Anthony Clayden Says:

    Greetings Mads,

    I’m sorry if you thought I was trying to get into a detailed comparison of different ways a DBMS might or might not make typing info available via ODBC.

    I was trying to comment about things you could do to improve the paper, as you asked for. I think I’d better start by explaining why I first responded to you about SQL …

    One of the things SPJ recommends is to “state your contributions”. (He contrasts that with “The purpose of the paper is *not* to describe the Wizwoz system.”)

    From your web page, and the paper’s Abstract, I was hoping your contribution would be around “type-safe database access” and static typing.

    But then your Introduction reads like you’re describing a wizwoz system, specifically to bind SQL statements within a Haskell source, and get them pre-validated at compile time by passing through TH to HDBC to ODBC. At that point I nearly stopped reading your paper, except for two tantalising sentences near the end of the Introduction:
    – “Has type inference of SQL statements”
    – “I explore the ability of popular databases to do type inference”

    Both those sentences refer to Section 6 about type inference (which I read next); then I read your Conclusion.

    So my background as a reader is that I know SQL intimately (because I’m a commercial programmer, and over the years have come to feel strongly there must be something better); I’ve used ODBC to grab data into an application, but not explored its ability to pre-validate SQL; I’ve been watching Haskell for some time (because it looks like it might be one of those better things); I’ve not looked closely at TH; I’ve tried to understand the many attempts there have been to connect Haskell to relational databases, including things like Oleg’s HList stuff for a records calculus, and HaskellDB’s similar attempt. These may be interesting academic exercises, but there’s still no generally accepted approach for Haskell to connect to DBMSs, whereas for most popular languages it’s all in a day’s work. Why?

    The biggest strength of Haskell is its strong type inference at compile time, and you understand that, where your Conclusion says “static implicit typing is the most important [feature of MetaHDBC]”. But the rest of the Conclusion, and most of Section 6, seem to be saying MetaHDBC failed in the attempt because the DBMS “did a poor job of type inference”. I knew it couldn’t be true that the DBMS did poor type inference, and that’s what I first questioned.

    Now that you’ve explained what you actually tried (and I’ve read up a bit about HDBC and ODBC’s API), I think the position is:
    – the SQL standard specifies a means to obtain type information about a SQL statement
    (that is, the DESCRIBE command)
    – ODBC in theory implements this through a family of API calls SQLDescribeXXX
    – in practice, various DBMSs support these APIs to different extents
    – usually it is necessary to execute the SQL statement to obtain reliable descriptions
    (according to the HDBC documentation)
    – clearly executing SQL is unacceptable at compile time
    (because it can slow compilation, and worse, it might update the database)
    – but here’s the idea: if we could grab the column list and table(s) out of a SELECT,
    we could then Execute in TH: SELECT columns FROM tables WHERE FALSE.
    (This should be quick and not cause any damage.)
    Then we’ll get better type info??
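    A sketch of that idea through HDBC’s API (DSN, table and columns are placeholders; WHERE 1=0 is a portable spelling of WHERE FALSE):

    import Database.HDBC (prepare, execute, describeResult, disconnect)
    import Database.HDBC.ODBC (connectODBC)

    -- Execute a SELECT that can match no rows, purely to coax the
    -- driver into describing the result columns. This should be fast
    -- and harmless, since no rows are read or written.
    describeViaEmptySelect :: IO ()
    describeViaEmptySelect = do
      conn <- connectODBC "DSN=testdb"                -- placeholder DSN
      stmt <- prepare conn "SELECT x, y FROM foo WHERE 1=0"
      _    <- execute stmt []                         -- matches nothing
      cols <- describeResult stmt                     -- types, at last?
      mapM_ print cols
      disconnect conn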

    I’d like to respectfully suggest what are (or might be) your contributions. (And I’m drawing heavily on The Power of Pi’s Section 4/page 8):
    – MetaHDBC shows it is possible to bridge the gap from Haskell’s static type system
    to access a DBMS via SQL (usually an interpreted language).
    – This is achieved by pre-validating the SQL at compile time
    using TH’s ability to execute IO.
    – So ensuring all tables and columns used in the program do appear in the database
    (and with corresponding types).
    – In this way, TH avoids the need for a preprocessor to extract type and
    schema information from the database for embedding into the program source.
    – This gives strong guarantees that database access ‘will not go wrong’ at run time
    (providing the database’s schema has not changed).
    – It can also validate against the DBMS’s type inference engine to (for example):
    – confirm the mapping of SQL column types to the Haskell program’s usage;
    – verify the type of the join or cartesian product of two tables.

    MetaHDBC’s proof of concept opens up interesting lines for further exploration:
    – capturing relevant parts of the schema definition at compile time,
    and embedding them in the compiled program,
    to detect before running the program whether the database’s schema has changed
    along the lines of [Swierstra+Oury 2008]’s approach to Relational Algebra.
    – strengthening type information at compile time,
    to circumvent limitations in some DBMSs’ support for the ODBC APIs.

    How to go about exploring that stuff? I’ll take your example of detecting a singleton result from an aggregate SELECT:
    – SELECT SUM(x) FROM foo — returns a singleton
    (BTW are you sure about that? I would expect a single row)
    – SELECT y, SUM(x) FROM foo GROUP BY y — returns multiple rows
    – I need to be able to look ‘inside’ the SELECT to detect this.

    Which brings me back to the combinators approach. SQL is a horrible kludge, hacked together by IBM engineers in the early ’70s to get something up and running. It is not really a programming language, certainly not an algebra that supports equational reasoning, and shouldn’t have a place in the carefully crafted world of functional programming. And what really annoys me is that Ted Codd did all the hard work for us before 1969: Relational Algebra is what we should be embedding in Haskell. OK, there’s no commercial DBMS that supports RA, but we don’t need one, because we can use SQL as the implementation substrate: we can treat RA as a DSEL and, thanks to TH/MetaHDBC, translate it into SQL in a well-structured way and then obtain compile-time static type guarantees.
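    A toy fragment of what such an embedding might look like (names and coverage purely illustrative; a real embedding would also track column types at the type level to get the compile-time guarantees):

    import Data.List (intercalate)

    -- A tiny Relational Algebra as a deep embedding, rendered to SQL.
    data RA
      = Table String          -- a base relation
      | Project [String] RA   -- projection (pi)
      | Restrict String RA    -- restriction (sigma); predicate as raw SQL here

    toSQL :: RA -> String
    toSQL (Table t)      = "SELECT * FROM " ++ t
    toSQL (Project cs r) = "SELECT " ++ intercalate ", " cs
                           ++ " FROM (" ++ toSQL r ++ ") AS s"
    toSQL (Restrict p r) = "SELECT * FROM (" ++ toSQL r ++ ") AS s WHERE " ++ p

    -- e.g. toSQL (Project ["y"] (Restrict "x > 10" (Table "foo")))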

    Mads, I hope this helps you re-orient your paper so that (as SPJ puts it) the contributions drive the whole paper, and readers can see how exciting it might be.

    Regards
    Anthony

  6. Mads Lindstrøm Says:

    Hi Anthony,

    I realize it is quite some time since you posted your comment, but I have been quite busy and have not had time to reply before now. MetaHDBC is, after all, a hobby project, and I work on it when I have the time and inclination to do so.

    You write:

    I was trying to comment about things you could do to improve the paper, as you asked for.

    And I really appreciate it – I really do. Nonetheless, when you write:

    But then your Introduction reads like you’re describing a wizwoz system, specifically to bind SQL statements within a Haskell source, and get them pre-validated at compile time by passing through TH to HDBC to ODBC …

    I am confused about what you are referring to. I do not even mention HDBC in the introduction, and I do not really know what you mean by “to bind SQL statements within…”. It is not really that hard to copy-paste the text you are referring to, and WordPress does support the blockquote tag. That said, I will try my best to make the introduction more readable, especially by skipping some of the details. I will try to put a stronger focus on what MetaHDBC provides, rather than how it provides it. At the end of the same paragraph you also state:

    At that point I nearly stopped reading your paper, except for two tantalising sentences near the end of the Introduction:

    – “Has type inference of SQL statements”
    – “I explore the ability of popular databases to do type inference”

    Yes, this is really the meat of the matter and therefore should not be “hidden” at the bottom of the list of contributions. I will correct this.

    Later you state:

    – usually it is necessary to execute the SQL statement to obtain reliable descriptions
    (according to the HDBC documentation)

    I suspect you refer to this:

    describeResult :: Statement -> IO [(String, SqlColDesc)]

    Obtain information about the columns in the result set. Must be run only after execute. The String in the result set is the column name.

    You should expect this to be returned in the same manner as a result from fetchAllRows’.

    All results should be converted to lowercase for you before you see them.

    Please see caveats under getColumnNames for information on the column name field here.

    Having looked at and modified HDBC to fit the needs of MetaHDBC, I can say that the reason you need to execute SQL statements before obtaining type information has to do with how HDBC is implemented. Note, I am not saying that no DBMSs have the same limitation, just that even if a DBMS can return type results before executing a statement, you cannot receive those type results through HDBC. Therefore, when somebody installs MetaHDBC, he also has to install a modified version of HDBC.

    Your comment made me think, though, that maybe PostgreSQL or MySQL could return type information if I execute the SQL statement before asking for the type information. It is definitely on the do-investigate list.

    Later, you write:

    – but here’s the idea: if we could grab the column list and table(s) out of a SELECT,
    we could then Execute in TH: SELECT columns FROM tables WHERE FALSE.
    (This should be quick and not cause any damage.)

    Then we’ll get better type info??

    Yes, if indeed executing a SQL statement improves the ability of DBMSs to return type information, then this is an interesting option. I will think about it.

    Finally, and here I think we get into a fundamental disagreement, you write:

    Which brings me back to the combinators approach. SQL is a horrible kludge, hacked together by IBM engineers in the early ’70s to get something up and running. It is not really a programming language, certainly not an algebra that supports equational reasoning, and shouldn’t have a place in the carefully crafted world of functional programming. And what really annoys me is that Ted Codd did all the hard work for us before 1969: Relational Algebra is what we should be embedding in Haskell. OK, there’s no commercial DBMS that supports RA, but we don’t need one, because we can use SQL as the implementation substrate: we can treat RA as a DSEL and, thanks to TH/MetaHDBC, translate it into SQL in a well-structured way and then obtain compile-time static type guarantees.

    Even though the beginning of your criticism of SQL is little more than name-calling, you do provide some valid points. Yes, SQL is not true to the relational model, and it does not seem to support equational reasoning. I am not really sure when something qualifies as supporting equational reasoning, but I could not come up with counterexamples, so I will take your word for it. Nonetheless, inventing a new query language versus using ordinary SQL is not a clear-cut decision, as I will explain below.

    First, SQL is not true to the relational model, as it supports features which are non-relational. Nonetheless, you can restrain yourself from using these features, and then you are back to the relational model. Of course this situation is not optimal, but I really cannot see that it is as bad as you seem to indicate.

    Second, MetaHDBC is a thin layer above a DBMS, as compared to implementing relational algebra as a DSEL. Being thin provides several advantages:

    – The connection between the code the programmer writes and what is passed on to the DBMS is obvious, as MetaHDBC passes the SQL on essentially unmodified. This in turn makes it easier to optimize SQL statements by hand; a thicker layer, like a DSEL, complicates hand-written optimizations. And if something goes wrong inside the layer (DSEL or MetaHDBC), it is a lot easier to deal with if the layer is thin than if it is thick.

    – For people who already know SQL, MetaHDBC is going to be easier to learn than a DSEL.

    – It has a simple implementation, as we do not need to invent a new language.

    – It has access to DBMS-specific features, as we pass on the SQL unaltered.

    To be clear, I am not stating that a DSEL will not be the best option in certain situations. And for sure, you can name advantages of the DSEL approach not mentioned here. I am stating, though, that in some situations an approach like MetaHDBC will be the best option, and in other situations a DSEL will be better. It really depends on the developers and the requirements.

    Another reason for not working on a DSEL approach is that the HaskellDB folks are already doing this.

    Regards,

    Mads

