阿里妹导读
本文主要介绍了一些主流的解析器是怎么实现like的语法逻辑,接着作者分析了几种实现方式的优劣,终采用状态机的方式,针对场景一步一步进行性能优化。
提及
近在优化项目的like语法,那既然谈到了SQL,我们不妨来看看一些主流的解析器是怎么实现like的语法逻辑。这里需要提一下主流的两种SQL解析器,它们分别是ANTLR和Calcite。
/** SQL {@code LIKE} function. */
public static boolean like(String s,String pattern){
final String regex = Like.sqlToRegexLike(pattern, null);
return Pattern.matches(regex, s);
}
/** Translates a SQL LIKE pattern to Java regex pattern.*/
static String sqlToRegexLike(String sqlPattern,char escapeChar) {
int i;
final int len = sqlPattern.length();
final StringBuilder javaPattern = new StringBuilder(len + len);
for (i = ; i < len; i++) {
char c = sqlPattern.charAt(i);
if (JAVA_REGEX_SPECIALS.indexOf(c) >= ) {
javaPattern.append('\\');
}
if (c == escapeChar) {
if (i == (sqlPattern.length() - 1)) {
throw invalidEscapeSequence(sqlPattern, i);
}
char nextChar = sqlPattern.charAt(i + 1);
if ((nextChar == '_')
|| (nextChar == '%')
|| (nextChar == escapeChar)) {
javaPattern.append(nextChar);
i++;
} else {
throw invalidEscapeSequence(sqlPattern, i);
}
} else if (c == '_') {
javaPattern.append('.');
} else if (c == '%') {
javaPattern.append("(?s:.*)");
} else {
javaPattern.append(c);
}
}
return javaPattern.toString();
}
...
try {
Pattern pattern = patterns.get(buildKey(right, escTmp), new Callable<Pattern>() {
public Pattern call() throws Exception {
return Pattern.compile(buildPattern(right, escTmp), Pattern.CASE_INSENSITIVE);
}
});
Matcher m = pattern.matcher(left);
return m.matches() ? 1l : 0l;
} catch (ExecutionException e) {
throw new FunctionException(e.getCause());
}
...
到此,综上来看,不少项目是基于正则表达式来完成的,接下来我整理了下我近实现的几种方式:
正则表达式实现
public static boolean like(final String dest, final String pattern) {
String regex = regexParse(pattern);
regex = regex.replace("_",".").replace("%",".*?");
Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
return p.matcher(dest).matches();
}
这种方式在代码层面简单明了,但是性能非常差,多次replace的使用就已经进行了多次遍历,这里有个可以优化的点,对于单个字符做替换可以选择用replaceChars(str, searchChar, replaceChar)这个方案。
贪婪模式
懒惰模式
独占模式
简单算法实现
public static boolean like(final String dest, final String pattern) {
int destPointer = , patternPointer = ;
int destRecall = -1, patternRecall = -1;
final int patternLen = pattern.length();
final int destLen = dest.length();
while( destPointer < destLen) {
......
......
}
while(patternPointer < patternLen && pattern.chatAt(patternPointer) == '%') {
patternPointer++;
}
return patternPointer == patternLen;
}
有个场景我们不得不去考虑,那就是回溯的情况,举个例子:
有限状态机实现
穷举法
查表法
状态模式
public void compile(final String pattern) {
...
LikeStateMachine machine = LikeStateMachine.build(pattern);
...
}
构建的过程就是我们把pattern解析加载的过程,我采用的方式是构建链表的方式。实现就是遍历构建的过程,compile时间复杂度O(n)
回溯场景优化
常用场景优化
public LikeParserResult compile(final String pattern) {
return parseLikeExpress(pattern);
}
....
public boolean match(final String dest, LikeParserResult likeParserResult) {
switch (likeParserResult.getMatchResult()) {
case LikeMatchResult.TYPE.ALLMATCH:
return true;
case LikeMatchResult.TYPE.EQUALS:
return doEquals(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.STARTSWITH:
return doStartsWith(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.ENDSWITH:
return doEndsWith(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.CONTAINS:
return doContain(dest, likeParserResult.getFinalPattern());
default:
//或者别的实现
return likeParserResult.getLikeStateMachine().match(dest);
}
}
上面给出的代码是为了清楚的看到里面运行,终根据自己擅长的代码风格(函数式Function,接口等),还可以进一步优化和精简。
...
public LikeParserResult compile(final String pattern) {
return parseLikeExpress(pattern);
}
public boolean match(final String dest, LikeParserResult likeParserResult) {
return likeParserResult.getMatcher().match(dest);
}
...
public class StartsWithMatcher implements Matcher {
...
public Boolean match(String dest) {
return dest.startsWith(this.pattern);
}
}
...
压测数据对比
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
后