SQL Parser with ANTLR

ANTLR

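To recognize a language is to identify all of its valid sentences, phrases, and subphrases. A program that recognizes a language is called a parser or syntax analyzer.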

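  • A recognizer must identify all the valid sentences, phrases, and subphrases of a language.
  • Programs that recognize languages are called parsers or syntax analyzers.
  • ANTLR grammars are written in their own notation, the ANTLR meta-language.
  • A sentence arrives as an input stream; as when reading, we first look up each word in a dictionary and then work out the structure.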

Phase 1:

  • Lexical analysis (tokenizing): group characters into words or symbols, called tokens.
  • Lexer: the program that tokenizes the input; it also groups related tokens into token classes, or token types (for example INT, ID, FLOAT).
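To make the lexing phase concrete, here is a minimal hand-written sketch in Java. It is not what ANTLR generates; the token types and class names (TokenType, Token, LexerSketch) are made up for illustration, and the language is assumed to contain only identifiers, integers, '=', '+', and ';'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token types for a tiny assignment language (not ANTLR classes). */
enum TokenType { ID, INT, ASSIGN, PLUS, SEMI, EOF }

/** A token: the matched text plus the token type it belongs to. */
final class Token {
    final TokenType type;
    final String text;

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

final class LexerSketch {
    /** Groups characters into tokens and tags each token with its type. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but produces none
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Token(TokenType.ID, input.substring(start, i)));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(TokenType.INT, input.substring(start, i)));
            } else if (c == '=') {
                tokens.add(new Token(TokenType.ASSIGN, "="));
                i++;
            } else if (c == '+') {
                tokens.add(new Token(TokenType.PLUS, "+"));
                i++;
            } else if (c == ';') {
                tokens.add(new Token(TokenType.SEMI, ";"));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at index " + i);
            }
        }
        tokens.add(new Token(TokenType.EOF, "<EOF>"));
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ID(sp), ASSIGN(=), INT(100), SEMI(;), EOF(<EOF>)]
        System.out.println(tokenize("sp = 100;"));
    }
}
```

The parser in the next phase never looks at raw characters; it only sees this list of typed tokens.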

Phase 2:

  • Parsing: the parser consumes these tokens to recognize the sentence structure.
  • Parse tree (syntax tree): ANTLR-generated parsers build this data structure to record how the input was matched against the grammar rules.

basic data flow of a language recognizer

Implementing Parsers

  • Recursive-descent parser: a kind of top-down parser implementation, written as a set of recursive methods, one per grammar rule.

  • The call graph traced out by invoking the methods stat(), assign(), and expr() mirrors the interior nodes of the parse tree.
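The following Java sketch shows how the stat(), assign(), and expr() calls line up with parse tree nodes. It is a hand-written illustration, not ANTLR output: the grammar in the comments, the TreeNode class, and the whitespace-based tokenization are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parse tree node: a rule name or token text, plus children. */
final class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String label) {
        this.label = label;
    }

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }

    /** LISP-style rendering: interior nodes are parenthesized, leaves are bare text. */
    @Override
    public String toString() {
        if (children.isEmpty()) return label;
        StringBuilder sb = new StringBuilder("(").append(label);
        for (TreeNode child : children) sb.append(' ').append(child);
        return sb.append(')').toString();
    }
}

final class RecursiveDescentSketch {
    private final String[] tokens;
    private int pos = 0;

    RecursiveDescentSketch(String input) {
        // Naive tokenization: pad punctuation with spaces, then split on whitespace.
        this.tokens = input.replaceAll("([=;+])", " $1 ").trim().split("\\s+");
    }

    // stat : assign ;
    TreeNode stat() {
        return new TreeNode("stat").add(assign());
    }

    // assign : ID '=' expr ';' ;
    TreeNode assign() {
        TreeNode node = new TreeNode("assign");
        node.add(matchPattern("[A-Za-z_][A-Za-z0-9_]*", "identifier"));
        node.add(match("="));
        node.add(expr());
        node.add(match(";"));
        return node;
    }

    // expr : operand ('+' operand)* ;   where operand is an ID or an INT
    TreeNode expr() {
        TreeNode node = new TreeNode("expr");
        node.add(matchPattern("[A-Za-z_][A-Za-z0-9_]*|\\d+", "operand"));
        while (pos < tokens.length && tokens[pos].equals("+")) {
            node.add(match("+"));
            node.add(matchPattern("[A-Za-z_][A-Za-z0-9_]*|\\d+", "operand"));
        }
        return node;
    }

    private TreeNode match(String expected) {
        if (pos >= tokens.length || !tokens[pos].equals(expected)) {
            throw new IllegalStateException("expected '" + expected + "' at token " + pos);
        }
        return new TreeNode(tokens[pos++]);
    }

    private TreeNode matchPattern(String regex, String what) {
        if (pos >= tokens.length || !tokens[pos].matches(regex)) {
            throw new IllegalStateException("expected " + what + " at token " + pos);
        }
        return new TreeNode(tokens[pos++]);
    }

    public static void main(String[] args) {
        // prints (stat (assign sp = (expr 100) ;))
        System.out.println(new RecursiveDescentSketch("sp = 100;").stat());
    }
}
```

Each method call that returns normally contributes one interior node, so the nesting of calls and the nesting of the printed tree are the same shape.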

charStream -> tokenStream -> parse tree

parse-tree Listeners

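For each grammar, ANTLR generates a subclass of ParseTreeListener with an enter method and an exit method for every rule.
ANTLR also generates a Context object for every rule, which records everything known about the recognition of that rule. ANTLR provides two tree-traversal mechanisms: Listener and Visitor. A Listener is fully automatic: ANTLR drives the depth-first walk and we only handle the events it fires. A Visitor gives us a controllable traversal: we decide for ourselves whether to explicitly call visit on the child nodes.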
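The difference between the two styles can be sketched without the ANTLR runtime. Everything below (Ctx, CtxListener, CtxWalker, PruningVisitor) is a hypothetical stand-in for the generated classes, written only to show who controls the traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a rule context: a rule name plus child contexts. */
final class Ctx {
    final String rule;
    final List<Ctx> children;

    Ctx(String rule, Ctx... children) {
        this.rule = rule;
        this.children = Arrays.asList(children);
    }
}

/** Listener style: callbacks only; the walker owns the traversal. */
interface CtxListener {
    void enter(Ctx ctx);

    void exit(Ctx ctx);
}

final class CtxWalker {
    /** Depth-first walk: fires enter before the children and exit after them. */
    static void walk(CtxListener listener, Ctx ctx) {
        listener.enter(ctx);
        for (Ctx child : ctx.children) walk(listener, child);
        listener.exit(ctx);
    }
}

/** Visitor style: each visit decides whether to descend into the children. */
final class PruningVisitor {
    final List<String> visited = new ArrayList<>();

    void visit(Ctx ctx) {
        visited.add(ctx.rule);
        // The visitor is in charge: here it chooses not to look below "expr".
        if (ctx.rule.equals("expr")) return;
        for (Ctx child : ctx.children) visit(child);
    }
}

final class ListenerVisitorSketch {
    /** Records every enter/exit event the walker fires for the given tree. */
    static List<String> listenerEvents(Ctx root) {
        final List<String> events = new ArrayList<>();
        CtxWalker.walk(new CtxListener() {
            @Override
            public void enter(Ctx ctx) {
                events.add("enter " + ctx.rule);
            }

            @Override
            public void exit(Ctx ctx) {
                events.add("exit " + ctx.rule);
            }
        }, root);
        return events;
    }

    public static void main(String[] args) {
        Ctx tree = new Ctx("stat", new Ctx("assign", new Ctx("expr", new Ctx("term"))));
        System.out.println(listenerEvents(tree)); // every node is entered and exited
        PruningVisitor visitor = new PruningVisitor();
        visitor.visit(tree);
        System.out.println(visitor.visited); // prints [stat, assign, expr]
    }
}
```

The listener cannot skip the term node because the walker never asks it; the visitor skips it simply by not calling visit on the children of expr.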

How Presto uses ANTLR

What is Presto?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

presto-parser

First, look at SqlBase.g4, which defines the tokens and grammar rules of Presto's SQL.

An excerpt:

grammar SqlBase;  
tokens {  
    DELIMITER
}
singleStatement  
    : statement EOF
    ;
singleExpression  
    : expression EOF
    ;
statement  
    : query                                                            #statementDefault
    | USE schema=identifier                                            #use
    | USE catalog=identifier '.' schema=identifier                     #use
    | CREATE TABLE (IF NOT EXISTS)? qualifiedName
        (WITH tableProperties)? AS query
        (WITH (NO)? DATA)?                                             #createTableAsSelect
    | CREATE TABLE (IF NOT EXISTS)? qualifiedName
        '(' tableElement (',' tableElement)* ')'
        (WITH tableProperties)?                                        #createTable
    | DROP TABLE (IF EXISTS)? qualifiedName                            #dropTable
    | INSERT INTO qualifiedName columnAliases? query                   #insertInto
    | DELETE FROM qualifiedName (WHERE booleanExpression)?             #delete
    | ALTER TABLE from=qualifiedName RENAME TO to=qualifiedName        #renameTable
    | ALTER TABLE tableName=qualifiedName
        RENAME COLUMN from=identifier TO to=identifier                 #renameColumn
    | ALTER TABLE tableName=qualifiedName
        ADD COLUMN column=tableElement                                 #addColumn
    | CREATE (OR REPLACE)? VIEW qualifiedName AS query                 #createView
    | DROP VIEW (IF EXISTS)? qualifiedName                             #dropView
    | CALL qualifiedName '(' (callArgument (',' callArgument)*)? ')'   #call
    | GRANT
        (privilege (',' privilege)* | ALL PRIVILEGES)
        ON TABLE? qualifiedName TO grantee=identifier
        (WITH GRANT OPTION)?                                           #grant
    | REVOKE
        (GRANT OPTION FOR)?
        (privilege (',' privilege)* | ALL PRIVILEGES)
        ON TABLE? qualifiedName FROM grantee=identifier                #revoke
    | EXPLAIN ANALYZE?
        ('(' explainOption (',' explainOption)* ')')? statement        #explain

Once the .g4 grammar is defined, ANTLR generates the corresponding lexer and parser from it. The implementation of invokeParser in com.facebook.presto.sql.parser.SqlParser:

private Node invokeParser(String name, String sql, Function<SqlBaseParser, ParserRuleContext> parseFunction)
    {
        try {
            // characters -> tokens -> parser
            SqlBaseLexer lexer = new SqlBaseLexer(new CaseInsensitiveStream(new ANTLRInputStream(sql)));
            CommonTokenStream tokenStream = new CommonTokenStream(lexer);
            SqlBaseParser parser = new SqlBaseParser(tokenStream);
            parser.addParseListener(new PostProcessor());
            // replace ANTLR's default error listeners with Presto's own
            lexer.removeErrorListeners();
            lexer.addErrorListener(ERROR_LISTENER);
            parser.removeErrorListeners();
            parser.addErrorListener(ERROR_LISTENER);
            ParserRuleContext tree;
            try {
                // first, try parsing with the potentially faster SLL mode
                parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
                tree = parseFunction.apply(parser);
            }
            catch (ParseCancellationException ex) {
                // if that fails, parse again with full LL mode
                tokenStream.reset(); // rewind input stream
                parser.reset();
                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                tree = parseFunction.apply(parser);
            }
            // convert the ANTLR parse tree into Presto's AST
            return new AstBuilder().visit(tree);
        }
        catch (StackOverflowError e) {
            throw new ParsingException(name + " is too large (stack overflow while parsing)");
        }
    }
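The try/catch in the middle of invokeParser is a two-stage strategy: attempt the cheaper prediction mode first, and only on failure rewind and parse again with the complete one. The control flow can be sketched in plain Java as below. This is an analogy rather than Presto or ANTLR code: FastPathFailed, fastInt, and generalSum are hypothetical names standing in for ParseCancellationException, the SLL attempt, and the LL attempt.

```java
import java.util.function.Function;

final class TwoStageParseSketch {
    /** Hypothetical signal that the fast strategy gave up. */
    static final class FastPathFailed extends RuntimeException {
        FastPathFailed(String message) {
            super(message);
        }
    }

    /** Tries the fast strategy first and re-parses with the general one only on failure. */
    static <T> T parse(String input, Function<String, T> fast, Function<String, T> general) {
        try {
            return fast.apply(input);
        }
        catch (FastPathFailed e) {
            // The input string is untouched, so the second attempt starts from the beginning.
            return general.apply(input);
        }
    }

    /** Fast strategy: accepts a single integer literal and nothing else. */
    static Integer fastInt(String input) {
        String s = input.trim();
        if (!s.matches("\\d+")) {
            throw new FastPathFailed("not a plain integer: " + s);
        }
        return Integer.parseInt(s);
    }

    /** General strategy: accepts sums such as "1 + 2 + 3". */
    static Integer generalSum(String input) {
        int total = 0;
        for (String part : input.split("\\+")) {
            total += Integer.parseInt(part.trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // fast path succeeds: prints 42
        System.out.println(parse("42", TwoStageParseSketch::fastInt, TwoStageParseSketch::generalSum));
        // fast path fails, general path runs: prints 6
        System.out.println(parse("1 + 2 + 3", TwoStageParseSketch::fastInt, TwoStageParseSketch::generalSum));
    }
}
```

The design pays the cost of a second pass only for inputs the fast strategy cannot handle, which is worthwhile when most inputs succeed on the first attempt.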

invokeParser() returns an abstract syntax tree (AST).
The AstBuilder class defines the visit methods for each kind of Node.