Source code
https://github.com/uber/queryparser
Article
Queryparser, an Open Source Tool for Parsing and Analyzing SQL

Glossary

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
OLAP vs OLTP
    Two kinds of database-backed operations that
    you need to do:
    - OLTP
      You need to use a database as part of
      some business process.
    - OLAP
      You use a database as part of analysis.

Online Analytical Processing
OLAP
    A fancy term used to describe a certain
    class of database applications.

columnar data warehouse
    An alternative approach to OLAP workloads.

data warehouse
enterprise data warehouse
    A system used for reporting and data
    analysis, and is considered a core
    component of business intelligence.

    DWs are central repositories of integrated
    data from one or more disparate sources.

Implementation

  • Deployed in a streaming architecture
1
2
3
"data warehouse" -> "query logs" -> "queryparser" -> "analyse results"
"queryparser" -> "catalog info"
"catalog info" -> "queryparser"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
                                          +---------------------------------------+
                                          v                                       |
+----------------+     +------------+     +--------------+     +-----------------+  |
| data warehouse | --> | query logs | --> | queryparser  | --> | analyse results |  |
+----------------+     +------------+     +--------------+     +-----------------+  |
                                          |                                       |
                                          |                                       |
                                          v                                       |
                                        +--------------+                          |
                                        | catalog info | -------------------------+
                                        +--------------+
Uber’s data warehouse streaming architecture
Feeds all queries through Queryparser.
Boxes denote services and pipes denote data-streams. The catalog info service is responsible for tracking the schemas of the tables in the data warehouse.

Queryparser consumes the real-time stream of queries submitted to the data warehouse, analyzes every single one, and emits the analysis results to a separate stream.

1
2
3
4
5
scoping rules
    The scope of a name (variable names, data
    structure names, procedure names) is the
    part of the program within which the name
    can be used.
  • Individual queries are processed in three steps.
    • Phase 1: Parse
      Transforms the query from a raw string of characters into an AST representation.

    • Phase 2: Resolve.
      Scans the raw AST and applies scoping rules.

      Transforms plain column names by adding the table name, and transforms plain table names by adding the schema name.

      Requires as input the full list of columns in every table and the full list of tables in every schema, otherwise known as “catalog information.”

    • Phase 3: Analyze
      Scans the resolved AST, looking for columns which are compared for equality.

Running the demo

1
cd "$MYGIT/uber/queryparser"; stack ghci

I will automate the starting of the repl and running of the code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
:set prompt "> "
import Demo
demoAllAnalyses "SELECT * FROM foo" -- note that the SELECT * expands to a,b,c
demoAllAnalyses "SELECT * FROM bar" -- and here the SELECT * expands to x,y,z
demoAllAnalyses "SELECT x, count(1) FROM foo JOIN bar ON foo.a = bar.y WHERE z IS NOT NULL GROUP BY 1 ORDER BY 2 DESC, b"
-- let's play with some queries that modify table-data!
demoTableLineage "INSERT INTO foo SELECT * FROM bar"
demoTableLineage "TRUNCATE TABLE foo"
demoTableLineage "ALTER TABLE bar, foo RENAME TO baz, bar"
-- let's explore a few subtler behaviors of the "joins" analysis (admittedly, something of a misnomer)
demoJoins "SELECT * FROM foo JOIN bar ON a=x AND b+c = y+z"
demoJoins "SELECT a FROM foo UNION SELECT x FROM bar"

Do some tcl/expect code generation

1
2
3
4
0</dev/null cmd-real x -sh "$(cmd-real cd "$MYGIT/uber/queryparser"); $(cmd-real stack ghci)" -r "Test.*> "
echo " \\"
q -l | sed -e 's/^/  -s /' -e 's/$/ -c m -e "> " \\/'
echo -n "  -i"

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
x -sh "cd /home/shane/source/git/uber/queryparser; stack ghci" -r "Test.*> " \
-s ":set prompt \"> \"" -c m -e "> " \
-s "import Demo" -c m -e "> " \
-s "demoAllAnalyses \"SELECT * FROM foo\" -- note that the SELECT * expands to a,b,c" -c m -e "> " \
-s "demoAllAnalyses \"SELECT * FROM bar\" -- and here the SELECT * expands to x,y,z" -c m -e "> " \
-s "demoAllAnalyses \"SELECT x, count(1) FROM foo JOIN bar ON foo.a = bar.y WHERE z IS NOT NULL GROUP BY 1 ORDER BY 2 DESC, b\"" -c m -e "> " \
-s "-- let's play with some queries that modify table-data!" -c m -e "> " \
-s "demoTableLineage \"INSERT INTO foo SELECT * FROM bar\"" -c m -e "> " \
-s "demoTableLineage \"TRUNCATE TABLE foo\"" -c m -e "> " \
-s "demoTableLineage \"ALTER TABLE bar, foo RENAME TO baz, bar\"" -c m -e "> " \
-s "-- let's explore a few subtler behaviors of the \"joins\" analysis (admittedly, something of a misnomer)" -c m -e "> " \
-s "demoJoins \"SELECT * FROM foo JOIN bar ON a=x AND b+c = y+z\"" -c m -e "> " \
-s "demoJoins \"SELECT a FROM foo UNION SELECT x FROM bar\"" -c m -e "> " \
-i

Demonstration

asciinema recording

Results

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
import Demo
> import Demo
> demoAllAnalyses "SELECT * FROM foo" -- note that the SELECT * expands to a,b,c
Tables accessed:
    public.foo
Columns accessed by clause:
    public.foo.a        SELECT
    public.foo.b        SELECT
    public.foo.c        SELECT
Joins:
    no joins
Table lineage:
    no tables modified
> demoAllAnalyses "SELECT * FROM bar" -- and here the SELECT * expands to x,y,z
Tables accessed:
    public.bar
Columns accessed by clause:
    public.bar.x        SELECT
    public.bar.y        SELECT
    public.bar.z        SELECT
Joins:
    no joins
Table lineage:
    no tables modified
> demoAllAnalyses "SELECT x, count(1) FROM foo JOIN bar ON foo.a = bar.y WHERE z IS NOT NULL GROUP BY 1 ORDER BY 2 DESC, b"
Tables accessed:
    public.bar
    public.foo
Columns accessed by clause:
    public.bar.x        GROUPBY
    public.bar.x        SELECT
    public.bar.y        JOIN
    public.bar.z        WHERE
    public.foo.a        JOIN
    public.foo.b        ORDER
Joins:
    public.bar.y <-> public.foo.ademoTableLineage "INSERT INTO foo SELECT * FROM bar"

Table lineage:
    no tables modified
> -- let's play with some queries that modify table-data!
> demoTableLineage "INSERT INTO foo SELECT * FROM bar"
public.foo after the query depends on public.bar, public.foo before the query
> demoTableLineage "TRUNCATE TABLE foo"
public.foo no longer has data
> demoTableLineage "ALTER TABLE bar, foo RENAME TO baz, bar"
public.bar after the query depends on public.foo before the query
public.baz after the query depends on public.bar before the query
public.foo no longer has data
> -- let's explore a few subtler behaviors of the "joins" analysis (admittedly, something of a misnomer)
> demoJoins "SELECT * FROM foo JOIN bar ON a=x AND b+c = y+z"
public.bar.x <-> public.foo.a
public.bar.y <-> public.foo.b
public.bar.y <-> public.foo.c
public.bar.z <-> public.foo.b
public.bar.z <-> public.foo.c
> demoJoins "SELECT a FROM foo UNION SELECT x FROM bar"
public.bar.x <-> public.foo.a
>

The Haskell choice

Originally conceived by an Uber engineer who was a Haskell enthusiast, and it quickly gained traction with several other engineers.

Many of Uber engineers learned Haskell specifically to develop in it.

Haskell turned out to be a good choice for prototyping Queryparser for a variety of reasons.

  • Haskell has very mature library support for language parsers.
  • expressive type system was extremely useful for the frequent and extensive refactors of our internal model of a SQL query.
  • leaned heavily on the compiler to guide us through those big, scary refactors.

If we attempted the same using a dynamically- typed language, we would have lost weeks chasing runtime bugs that Haskell’s compiler can quickly flag for us.

The main drawback of writing Queryparser in Haskell

  • not enough developers knew it.

To introduce more of our engineers to Haskell, we started a weekly reading group, which met over lunch to discuss Haskell books and documentation.

Deploying Queryparser

  • Add stack to container dependencies

  • Internally deployed as a Haskell artifact, running behind a Python service wrapper for interoperability with the rest of Uber’s infrastructure.

The Python wrapper acts as a proxy server and simply forwards requests to the Haskell backend server in the same docker container.

The Haskell server consists of a main thread that listens for requests on a UNIX-domain socket; when a new request arrives, the main thread spawns a worker thread to handle the request.

The Python wrapper also handles metric emission on behalf of the Haskell backend.

Metric data is passed via a second UNIX-domain socket, with data flowing in the reverse direction: a daemon-thread in the Python layer listens for metrics from the Haskell layer.

In order to share configuration between the Python and Haskell layers, we implemented a tiny configuration parser in Haskell, which understood Uber’s standard Python convention of layered configuration files.

Finally, to define the service interface, we used Thrift.

This is the standard choice at Uber, and since Thrift has Haskell support, the Haskell server worked out-of-the-box.

Writing the Python code to transparently forward requests required diving into the binary protocol and was the most difficult operations step.

Summary

Queryparser unlocked a diversity of solutions and had some interesting limitations. From its humble origins as a migration tool, it became a vehicle for insight into large-scale data access patterns.

See also

Parsec
https://hackage.haskell.org/package/parsec

Glossary

1
2
3
leaky abstraction
    An abstraction that leaks details that it
    is supposed to abstract away.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Unix domain socket
IPC socket
inter-process communication socket
    A data communications endpoint for
    exchanging data between processes
    executing on the same host operating
    system. Valid socket types in the UNIX
    domain are:
    - SOCK_STREAM (compare to TCP)
      For a stream-oriented socket
    - SOCK_DGRAM (compare to UDP)
      For a datagram-oriented socket that
      preserves message boundaries (as on most
      UNIX implementations, UNIX domain
      datagram sockets are always reliable and
      don't reorder datagrams)
    - SOCK_SEQPACKET (compare to SCTP)
      For a sequenced-packet socket that is
      connection-oriented, preserves message
      boundaries, and delivers messages in the
      order that they were sent.

      The Unix domain socket facility is a
      standard component of POSIX operating
      systems.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
daemon thread
    https://docs.python.org/2/library/threading.html#thread-objects

    A thread can be flagged as a “daemon
    thread”.

    The significance of this flag is that the
    entire Python program exits when only
    daemon threads are left.

    The initial value is inherited from the
    creating thread.

    The flag can be set through the daemon
    property.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Apache Thrift
    [software framework]

    For scalable cross-language services
    development, combines a software stack
    with a code generation engine to build
    services that work efficiently and
    seamlessly between C++, Java, Python, PHP,
    Ruby, Erlang, Perl, Haskell, C#, Cocoa,
    JavaScript, Node.js, Smalltalk, OCaml and
    Delphi and other languages.

    RPC system.