计算机 · 2021年12月29日 0

write better sql

可以优化的点

  • 相同的表达式不要写多遍
  • 位运算写的更加明白一点

hive上的优化

ck上的优化

  • 使用with语句避免一个表达式写多遍
  • 用hex函数打印数字的16进制形式

FAQ

  • 获取array的长度
    • hive
      size(col)
    • ck
  • 获取string的长度
    • hive
    • ck
      length(col)
  • 获取连续的区间
    用行号和想要连续区间的列相减,然后根据group by的结果再取最大值、最小值就得到了连续区间
    这篇csdn博客也可以提供一定的参考。
  • 合并区间
SELECT
    intervals,

    arraySort(x -> x, intervals) sortedIntervals,

    /* try to merge each interval with precede ones */
    arrayMap((x, index) -> index != 1
        ? (arrayReduce(
            'min', 
            arrayMap(
              i -> sortedIntervals[i + 1].1, 
              /* get indexes of intervals that can be merged with the current one (index is zero-based) */              
              arrayFilter(
                i -> x.1 <= sortedIntervals[i + 1].2 AND x.2 >= sortedIntervals[i + 1].1, 
                range(index)))),
          arrayReduce(
            'max', 
            arrayMap(
              i -> sortedIntervals[i + 1].2,  
              /* get indexes of intervals that can be merged with the current one (index is zero-based) */              
              arrayFilter(
                i -> x.1 <= sortedIntervals[i + 1].2 AND x.2 >= sortedIntervals[i + 1].1, 
                range(index)))))
        : x,
      sortedIntervals, 
      arrayEnumerate(sortedIntervals)) rawResult,

    /* filter out intervals nested to other ones */
    arrayFilter(
      (x, index) -> index == length(rawResult) OR x.1 != rawResult[index + 1].1,
      rawResult, 
      arrayEnumerate(rawResult)) result
FROM
(
    SELECT [(1, 5), (2, 3), (3, 8), (10, 15)] intervals
    UNION ALL
    SELECT [(2, 4), (1, 3), (3, 6), (12, 14), (7, 7), (13, 16), (9, 9), (8, 9), (10, 15)]
    UNION ALL
    SELECT [(20, 22), (18, 18), (16, 21), (1, 8), (2, 9), (3, 5), (10, 12), (11, 13), (14, 15)]
    UNION ALL
    SELECT []
    UNION ALL 
    SELECT [(1, 11)]
)

  • 容易忽视的地方
    • null值的处理
      • 计算avg时,null值的处理
      • join表或者lateral view explode或者array join时null值的处理