本文共 2828 字,大约阅读时间需要 9 分钟。
我正在使用datastax spark集成和spark SQL thrift服务器,它为我提供了一个Hive SQL接口来查询Cassandra中的表.
我的数据库中的表是动态创建的,我想要做的是仅根据表名在表的每列中获取空值的计数.
我可以使用describe database.table获取列名,但在hive SQL中,如何在另一个为所有列计数null的select查询中使用其输出.
更新1:使用Dudu的解决方案回溯
Error running query: TExecuteStatementResp(status=TStatus(errorCode=0,
errorMessage=”org.apache.spark.sql.AnalysisException: Invalid usage of
‘*’ in explode/json_tuple/UDTF;”, sqlState=None,
infoMessages=[“org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException:
Invalid usage of ‘‘ in explode/json_tuple/UDTF;:16:15″,
‘org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258′,
‘org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152’,
‘org.apache.hive.service.cli.operation.Operation:run:Operation.java:257′,
‘org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388’,
‘org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369’,
‘org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262’,
‘org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437’,
‘org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313’,
‘org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298’,
‘org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39’,
‘org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39’,
‘org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56’,
‘org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286’,
‘java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142’,
‘java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617’,
‘java.lang.Thread:run:Thread.java:745’], statusCode=3),
operationHandle=None)
在以下解决方案中,不需要单独处理每个列.
结果是列索引和该列中的空值数.
您可以稍后通过列索引将其连接到从Metastore检索到的信息.
一个限制是包含确切文本null的字符串将被计为空值.
演示
CTE(由mytable定义的mytable)显然可以替换为实际表
with mytable as
(
select stack
(
5
,1 ,1.2 ,date '2017-06-21' ,null
,2 ,2.3 ,null ,null
,3 ,null ,null ,'hello'
,4 ,4.5 ,null ,'world'
,5 ,null ,date '2017-07-22' ,null
) as (id,amt,dt,txt)
)
select pe.pos as col_index
,count(case when pe.val='null' then 1 end) as nulls_count
from mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe
group by pe.pos
;
+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
| 0 | 0 |
| 1 | 2 |
| 2 | 3 |
| 3 | 3 |
+-----------+-------------+
转载地址:http://yhudv.baihongyu.com/