Hive Basics for Beginners

2020-07-01 17:29:09

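If you work in big data analytics, you will inevitably come across the Hadoop ecosystem. Hive is a data warehouse tool built on Hadoop that can be used for extract, transform, load (ETL) work; it provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive maps structured data files to database tables and defines a simple SQL-like query language, called HQL, which lets users who already know SQL query that data.

This article covers the basics of working with Hive.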

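I. Hive Database Operations

The default database, "default"
After starting the CLI with the hive command, if you do not run hive> use <database name>; you are working in the system default database. You can also select it explicitly: hive> use default;
Create a database: hive> create database db_hive_01;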

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT database_comment]
    [LOCATION hdfs_path]

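List all databases: hive> show databases;
Switch to a database: hive> use db_hive_01;
View detailed database information: hive> desc database extended db_hive_01;
Drop a database (cascade also drops the tables it contains): hive> drop database db_hive_01 cascade;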
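
The statements above can be strung together into one short session. This is only a sketch: the database name db_hive_demo and the HDFS path are made up for illustration, and the optional clauses come from the CREATE syntax shown earlier.

```sql
-- Hypothetical database name and HDFS path, chosen for illustration only.
create database if not exists db_hive_demo
comment 'demo database for the examples in this article'
location '/user/hive/warehouse/db_hive_demo.db';

show databases;                        -- the new database should be listed
use db_hive_demo;                      -- make it the current database
desc database extended db_hive_demo;   -- shows comment, location and properties

use default;                           -- switch away before dropping
drop database if exists db_hive_demo cascade;  -- cascade also drops contained tables
```

Using if not exists and if exists makes the script safe to rerun, which is convenient when the statements live in a file run with hive -f.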

II. Hive Table Operations

1. List all tables

show tables;

List all tables in a given database

show tables in db_name;

Regular expressions are supported

show tables '.*s';

2. Create a table

create table test
(id int,
a string
)
row format delimited         -- delimited row format
fields terminated by ','     -- field separator
lines terminated by '\n'     -- line separator
stored as textfile;          -- store as plain text

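3. Alter a table

Add a new column

alter table test add columns (new_col2 int comment 'a comment');

Rename a table

alter table old_name rename to new_name;

4. Drop a table

drop table test;

5. View table information

Show column information

desc test;

Show detailed table information

desc formatted test;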

6. Indexes (note: index support was removed in Hive 3.0, so this applies to earlier versions only)

Create an index

create index index_name
on table base_table_name (col_name, ...)
as 'index.handler.class.name'

Rebuild an index

alter index index_name on table_name [partition(...)] rebuild

For example: alter index index1_index_test on index_test rebuild;

Drop an index

drop index index_name on table_name

List indexes

show index on index_test;

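7. Views

Create a view

create view [if not exists] view_name [(column_name [comment column_comment], ...)]
    [comment view_comment]
    [tblproperties (property_name = property_value, ...)]
    as select ...

Note: a view created this way is purely logical and is not materialized (Hive 3.0 and later add a separate create materialized view statement).

• If no column names are supplied, the view's column names are derived automatically from the defining SELECT expression

• A view's schema is fixed when the view is created, so later changes to the base table (such as an added column) are not reflected in the view; if the base table is dropped or changed incompatibly, queries on the view fail

• Views are read-only and cannot be the target of LOAD/INSERT/ALTER

• Drop a view: drop view view_name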
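
As a small illustration of the points above, the following sketch defines a view over the test table created earlier, queries it, and drops it. The view name test_small_ids and the filter id < 100 are hypothetical.

```sql
-- Assumes the table test(id int, a string) from the "Create a table" example.
-- The view name and the filter are made up for illustration.
create view if not exists test_small_ids (id, a)
comment 'rows of test whose id is below 100'
as select id, a from test where id < 100;

select * from test_small_ids;       -- a view is queried like a table
desc formatted test_small_ids;      -- shows the stored view definition

drop view if exists test_small_ids; -- removes the view only; test is untouched
```

Because the view stores only the query text, dropping it never affects the data in the base table.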

III. Hive Table Partitions

1. List all partitions of a table

show partitions test;

2. Create a partitioned table

create table test
(id int,
a string
)
partitioned by (year int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;

3. Add a partition to an existing table

alter table test add if not exists
partition (year = 2017) location '/hiveuser/hive/warehouse/data_zh.db/data_zh/2017.txt';

4. Drop a partition

alter table test drop if exists partition (year = 2017);

5. Load data into a partitioned table

load data inpath '/data/2017.txt' into table test partition (year = 2017);

6. Import data from an unpartitioned table into a partitioned table

insert overwrite table part_table partition (year, month) select * from no_part_table;

7. Dynamic partition settings

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

-- set hive.enforce.bucketing = true;
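
Putting the settings and the insert from item 6 together, a dynamic partition load might look like the sketch below. The column layout of no_part_table is an assumption made for this example; the source does not define it.

```sql
-- Assumed layout (hypothetical): no_part_table(id int, a string, year int, month int)
create table if not exists part_table (id int, a string)
partitioned by (year int, month int);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow every partition column to be dynamic

-- The dynamic partition columns must come last in the select list,
-- in the same order as in the partition clause.
insert overwrite table part_table partition (year, month)
select id, a, year, month from no_part_table;

show partitions part_table;  -- one entry per distinct (year, month) pair
```

Listing the columns explicitly instead of using select * avoids silently mapping the wrong columns to the partition keys when the source table's column order differs.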

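IV. Deleting and Clearing Hive Data

1. Delete the rows of table1 that do not meet a condition (by overwriting the table); XXXX stands for the condition that the rows to keep must satisfy

insert overwrite table table1
select * from table1 where XXXX;

2. Empty a table (by overwriting it)

insert overwrite table t_table1
select * from t_table1 where 1=0;

3. Truncate a table (note: external tables cannot be truncated)

truncate table table_name;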

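4. Empty a table by deleting its data on HDFS (the table definition remains)

hdfs dfs -rm -r /user/hive/warehouse/test

Note: Hive does not enable transactions by default, so delete and update are not supported by default. To use them, transaction support must be turned on in hive-site.xml.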
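
For orientation, here is a rough sketch of what enabling delete and update involves, assuming Hive 0.14 or later. The table name t_acid is hypothetical, and the two properties are shown as session-level set commands for brevity, whereas the text above configures them in hive-site.xml. A real deployment also needs compaction configured on the metastore side, which is not shown.

```sql
-- Hypothetical table name. Properties are set per session here for brevity;
-- they can equally be placed in hive-site.xml as described above.
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be stored as ORC and flagged as transactional.
create table t_acid (id int, a string)
clustered by (id) into 2 buckets      -- bucketing is required before Hive 3.0
stored as orc
tblproperties ('transactional'='true');

insert into table t_acid values (1, 'x'), (2, 'y');
update t_acid set a = 'z' where id = 1;   -- the bucketing column id cannot be updated
delete from t_acid where id = 2;
```

On tables that are not transactional, the overwrite techniques in items 1 and 2 remain the way to remove rows.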

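V. Querying Hive Tables

◆ Sorting:

1. order by

Global sort; it runs in a single reduce task

2. sort by

Sorts only locally, within each reducer

3. distribute by

distribute by takes the hash code of the specified column modulo the number of reducers and sends each row to the corresponding reducer

4. cluster by

Using distribute by and sort by together on the same column is equivalent to cluster by. However, cluster by does not let you specify asc or desc; it always sorts in ascending order

Note: when distribute by and sort by are used together, distribute by must come before sort by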
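
The four clauses can be compared side by side on the test table from section II. This is only a sketch; the reducer count of 3 is an arbitrary choice that makes the difference between global and per-reducer sorting visible.

```sql
-- Assumes the table test(id int, a string); the reducer count is arbitrary.
set mapreduce.job.reduces=3;

-- Total order over the whole result, produced by a single reducer.
select * from test order by id desc;

-- Each reducer sorts its own rows; the combined output is not globally sorted.
select * from test sort by id desc;

-- Rows with the same value of a go to the same reducer, then are sorted there.
select * from test distribute by a sort by a asc, id desc;

-- Shorthand for: distribute by a sort by a (ascending only).
select * from test cluster by a;
```

When the sort order within each group matters, distribute by plus sort by is the more flexible form, since cluster by offers no control over direction.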

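◆ Joins:

1. join traditionally supports only equality conditions (more complex expressions in the ON clause are supported starting with Hive 2.2.0)

2. More than two tables can be joined

select a.val, b.val, c.val from a join b
on (a.key = b.key1) join c on (c.key = b.key2)

Note: if every table in the join uses the same join key, the join is converted into a single map/reduce job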
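
To make the note concrete: the query in item 2 joins on b.key1 and then on b.key2, so it does not qualify. A variant that does qualify, using the same hypothetical tables a, b, and c, looks like this:

```sql
-- Same hypothetical tables a, b, c as above.
-- Both join conditions use b.key1, so Hive can run the joins in one map/reduce job.
select a.val, b.val, c.val
from a
join b on (a.key = b.key1)
join c on (c.key = b.key1);
```

Prefixing either query with explain shows the stages Hive plans to run, which is a simple way to check how many jobs a join needs.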

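3. left semi join

The restriction of left semi join is that the right-hand table may be referenced only in the ON clause; filtering on it in the WHERE clause, the SELECT clause, or anywhere else is not allowed

select a.key, a.val
from a
where a.key in
(select b.key from b);

can be rewritten as:

select a.key, a.val
from a left semi join b on (a.key = b.key)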
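
The restriction on the right-hand table is easiest to see with a filter. In this sketch the column b.val and the threshold 10 are assumptions added for illustration.

```sql
-- Hypothetical: b has a numeric column val.
-- Allowed: the right-hand table b appears only inside the ON clause.
select a.key, a.val
from a left semi join b on (a.key = b.key and b.val > 10);

-- Not allowed: b is referenced in SELECT and WHERE, so Hive rejects the query.
-- select a.key, b.val
-- from a left semi join b on (a.key = b.key)
-- where b.val > 10;
```

If columns from b are actually needed in the output, an ordinary inner join is the right tool rather than a semi join.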

4. UNION ALL

Combines the results of several select statements; every select must return the same columns (same number and matching types)

select_statement union all select_statement union all select_statement ...
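
A minimal instance of the pattern above, using two hypothetical tables with identical column layouts:

```sql
-- Hypothetical tables: t_2016(id int, name string) and t_2017(id int, name string).
-- The union is wrapped in a subquery, a form that also works on older Hive versions.
select u.id, u.name
from (
  select id, name from t_2016
  union all
  select id, name from t_2017
) u;
```

union all keeps duplicate rows; it simply concatenates the two result sets.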

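VI. Importing and Loading Data into Hive

1. Load a local file into a Hive table

load data local inpath 'path/file' into table table_name;

2. Load an HDFS file into a Hive table

load data inpath 'path/file' into table table_name;

3. Load data and overwrite what is already in the table

load data local inpath 'path/file' overwrite into table table_name;

4. Load through select when creating a table

create table table_name as select * from table_name1;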

5. Point to the data with location when creating a table

create table location_table(
...
)
row format delimited fields terminated by '\t'
location '<path to load from>';

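6. Load with the insert command

Typical use: writing the result of a select query into a staging table (create the table first, then write the data)

insert into: appends to the existing data

insert overwrite: replaces the existing data (the more commonly used form)

insert into table table_name select * from table_name1;

insert overwrite table table_name select * from table_name2;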

7. Insert query results into multiple tables

from source_table src
insert into table dest1_table select src.id, src.name, src.tel where src.id < 100
insert into table dest2_table select src.id, src.name, src.tel where src.id < 100
insert into table dest3_table select src.id, src.name, src.tel where src.id < 100;

8. Export data with a specified delimiter

insert overwrite local directory '/home/hadoop/export_hive'
row format delimited
fields terminated by '\t'
select * from test;

VII. Exporting Hive Data

1. Export to the local file system with insert ... local directory

insert overwrite local directory "path/" select * from table_name;

2. Export to HDFS with insert ... directory

insert overwrite directory "hdfspath/" select * from table_name;

3. Use the hive shell command with a pipe or redirect (hive -f/-e | sed/grep/awk > file)

$ bin/hive -e "select ..." > /home/table.log
