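I recently needed to use Hive. I first tried connecting from Python 3 and wasted two days without getting it to work; if you have managed it, please share your notes with me. I did get it working with Python 2, and here are the steps: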
1. Installation
    ### Install pip ###
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
    python get-pip.py

    # Check the pip version #
    pip -V

    #### Install the system dependency packages ####
    yum install cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi

    # If cyrus-sasl-plain fails to install, download the rpm package and install it by hand:
    # http://download.csdn.net/detail/u012965373/9647909
    rpm -ivh ./xxxxx.rpm

    ### Install the project dependencies ###
    pip install -r requirements.txt
The contents of requirements.txt are as follows:

    pyhs2
    requests
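Before moving on to the code, it can save time to confirm that the packages actually import. The following is a minimal sketch, not part of the original setup; `missing_modules` is a hypothetical helper name, and it only checks the two packages listed in requirements.txt.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The package is not installed (or failed while importing).
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both packages from requirements.txt are importable.
    print(missing_modules(["pyhs2", "requests"]))
```

If pyhs2 shows up in the list, the cyrus-sasl packages from the yum step are the first thing worth rechecking.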
2. Code
    #!/usr/bin/python2.7
    # -*- coding: UTF-8 -*-
    '''
    This code runs in a Python 2.7 environment, which must be installed.
    '''
    import pyhs2
    import sys
    from logger.log import dblog


    class HiveModelClass(object):
        def __init__(self, **connect_info):
            super(HiveModelClass, self).__init__()
            domain = connect_info['domain']  # xx.cn
            host = connect_info['host']
            # Use .get() so a missing or empty key falls back to the default
            # instead of raising KeyError.
            port = connect_info.get('port') or 10000
            user = connect_info['user']
            password = connect_info['password']
            database = connect_info.get('database') or 'default'
            try:
                conn = pyhs2.connect(host=host,
                                     port=port,
                                     authMechanism="LDAP",
                                     user='{0}@{1}'.format(user, domain),
                                     password=password,
                                     database=database)
                self.cursor = conn.cursor()
            except Exception as e:
                dblog.error('Catch exception: [ %s ], file: [ %s ], line: [ %s ].'
                            % (e, __file__, sys._getframe().f_lineno))
                self.cursor = None

        # Query interface
        def _cursorQuery(self, sql):
            try:
                self.cursor.execute(sql)
                return self.cursor.fetchall()
            except Exception as e:
                dblog.error("[ERROR] Query error, Catch exception:[ %s ], file: [ %s ], line: [ %s ]"
                            % (e, __file__, sys._getframe().f_lineno))
                return ''

        # Insert interface
        def _cursorInsert(self, sql):
            try:
                self.cursor.execute(sql)
                return 1
            except Exception as e:
                dblog.error("[ERROR] Insert error, Catch exception:[ %s ], file: [ %s ], line: [ %s ]"
                            % (e, __file__, sys._getframe().f_lineno))
                return -1
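The constructor falls back to port 10000 and the 'default' database when no value is given. Indexing a dict with a key that is absent raises KeyError, so the fallback is safest when it goes through `dict.get`. Here is that logic in isolation, as a small sketch; `normalize_connect_info` is a hypothetical helper, not something the base class defines.

```python
def normalize_connect_info(connect_info):
    """Return a copy of connect_info with port/database defaults filled in."""
    info = dict(connect_info)  # copy, so the caller's dict is left untouched
    # .get() returns None for a missing key; `or` then supplies the default.
    # An empty value (None, 0, '') is treated the same as a missing one.
    info['port'] = info.get('port') or 10000
    info['database'] = info.get('database') or 'default'
    return info


# 'port' and 'database' are both absent here, so both defaults apply.
print(normalize_connect_info({'host': 'x.x.x.x', 'user': 'xxxxxx'}))
```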
With the Hive connection base class in place, you can create a new class that inherits from it, like this:
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    from Model.hiveModelClass import HiveModelClass
    from logger.log import dblog
    import sys


    class DefaultModelClass(HiveModelClass):
        def __init__(self):
            connect_dic = {
                'domain': 'xx.cn',
                'host': 'x.x.x.x',
                'port': 10000,
                'user': 'xxxxxx',
                'password': 'xxxxxx',
            }
            super(DefaultModelClass, self).__init__(**connect_dic)

        def getCountData(self):
            '''
            Get the total number of rows.
            :return: number
            '''
            sql = 'select count(*) from test2'
            try:
                ret = self._cursorQuery(sql)
                # First row of the result set.
                return ret[0]
            except Exception as e:
                dblog.error("Catch exception: [ %s ], file: [ %s ], line: [ %s ]"
                            % (e, __file__, sys._getframe().f_lineno))
                return []
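Then, in the controller, create a DefaultModelClass object and call its getCountData method to get the data back. getCountData is only an example; the other methods are yours to write.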
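Roughly, the controller side could look like the sketch below. The real class needs a live Hive server, so this uses a hypothetical stand-in, `StubModelClass`, that returns canned rows; in an actual controller you would instantiate `DefaultModelClass` instead. `count_action` is also a made-up name, and the row shape (a list of rows, each row a list of columns) is my assumption about what the driver returns.

```python
class StubModelClass(object):
    """Hypothetical stand-in for DefaultModelClass; needs no Hive connection."""

    def _cursorQuery(self, sql):
        # Assumed result shape: a list of rows, each row a list of columns.
        return [[42]]

    def getCountData(self):
        sql = 'select count(*) from test2'
        ret = self._cursorQuery(sql)
        # ret[0] is the first row; the count is that row's only column.
        return ret[0]


def count_action(model):
    """Hypothetical controller function: return the row count as an int."""
    row = model.getCountData()
    # getCountData returns [] on failure, so treat an empty row as 0.
    return int(row[0]) if row else 0


print(count_action(StubModelClass()))  # prints 42
```

Passing the model in as an argument keeps the controller testable without a Hive cluster.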
dblog is the logging module; you can use print instead (replace dblog.error, dblog.info and so on with print).
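If you would rather keep the dblog.error / dblog.info calls unchanged, a stand-in dblog can be built on the standard logging module. This is a minimal sketch; `make_dblog` and the format string are hypothetical choices, not the original project's logger.

```python
import logging
import sys


def make_dblog(name="dblog", stream=None):
    """Build a logger that can stand in for dblog (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # Attach a handler only once, so repeated calls do not duplicate output.
        target = stream if stream is not None else sys.stderr
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


dblog = make_dblog()
dblog.error('Catch exception: [ %s ], file: [ %s ]', 'example error', 'demo.py')
```

Saved as a module that exposes `dblog`, this lets the `from logger.log import dblog` lines keep working as long as the import path matches.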
Hive queries are still quite slow; its strength lies in querying or processing large volumes of data.