Table of Contents
可以优化的点
- 相同的表达式不要写多遍
- 位运算写的更加明白一点
hive上的优化
with cube
的使用
https://tech.youzan.com/cube/
其实就是把group by列出来的维度的可能组合都group by了一遍
ck上的优化
- 使用with语句避免一个表达式写多遍
- 用hex函数打印数字的16进制形式
FAQ
- 获取array的长度
- hive
size(col)
- ck
- hive
- 获取string的长度
- hive
- ck
length(col)
- 获取连续的区间
用行号和想要连续区间的列相减,然后根据group by的结果再取最大值、最小值就得到了连续区间
这篇csdn博客也可以提供一定的参考。 - 合并区间
SELECT
intervals,
arraySort(x -> x, intervals) sortedIntervals,
/* try to merge each interval with precede ones */
arrayMap((x, index) -> index != 1
? (arrayReduce(
'min',
arrayMap(
i -> sortedIntervals[i + 1].1,
/* get indexes of intervals that can be merged with the current one (index is zero-based) */
arrayFilter(
i -> x.1 <= sortedIntervals[i + 1].2 AND x.2 >= sortedIntervals[i + 1].1,
range(index)))),
arrayReduce(
'max',
arrayMap(
i -> sortedIntervals[i + 1].2,
/* get indexes of intervals that can be merged with the current one (index is zero-based) */
arrayFilter(
i -> x.1 <= sortedIntervals[i + 1].2 AND x.2 >= sortedIntervals[i + 1].1,
range(index)))))
: x,
sortedIntervals,
arrayEnumerate(sortedIntervals)) rawResult,
/* filter out intervals nested to other ones */
arrayFilter(
(x, index) -> index == length(rawResult) OR x.1 != rawResult[index + 1].1,
rawResult,
arrayEnumerate(rawResult)) result
FROM
(
SELECT [(1, 5), (2, 3), (3, 8), (10, 15)] intervals
UNION ALL
SELECT [(2, 4), (1, 3), (3, 6), (12, 14), (7, 7), (13, 16), (9, 9), (8, 9), (10, 15)]
UNION ALL
SELECT [(20, 22), (18, 18), (16, 21), (1, 8), (2, 9), (3, 5), (10, 12), (11, 13), (14, 15)]
UNION ALL
SELECT []
UNION ALL
SELECT [(1, 11)]
)
- 容易忽视的地方
- null值的处理
- 计算avg时,null值的处理
- join表或者lateral view explode或者array join时null值的处理
- null值的处理
近期评论