2019-06-12

Hadoop Ecosystem


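HBase API Operations

HBase is the Hadoop database: a distributed, scalable big data store.

This post does not cover how to set up HBase; it only covers writing code against the API.

HBase official website: https://hbase.apache.org/

HBase documentation (Chinese translation): http://abloz.com/hbase/book.html

Why use HBase?

Here is the official explanation: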

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

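In short:

Use HBase when you need random, real-time read/write access to large volumes of data. The project aims to host very large tables (billions of rows by millions of columns). It is an open-source, column-oriented, distributed (and highly concurrent) non-relational (NoSQL) database built on Hadoop, with its data stored on HDFS.

This means you can build a large-scale structured storage cluster on inexpensive PC servers. Relational databases struggle to keep up with explosive data growth, but HBase can.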

My development environment

Linux CentOS (3 virtual machines on VMware 10)
Windows 10
IntelliJ IDEA 2019.1.1
JDK 1.8
Maven

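Basic concepts

RowKey: a byte array that acts as the "primary key" of each record in the table and makes fast lookups possible. RowKey design is very important; the pitfalls we ran into when designing RowKeys deserve a closer look later.

Column Family: has a name (a string) and contains one or more related columns.

Column: belongs to a column family and is addressed as familyName:columnName; columns can be added dynamically for each record.

Version Number: of type Long; defaults to the system timestamp and can be set by the user.

Value (Cell): a byte array.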
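To make these concepts concrete, here is a minimal in-memory sketch of the logical layout, using only the JDK. It is not how HBase stores data internally; the class `CellModelSketch` and its methods are made up for illustration. It only shows that a value is addressed by row key, column (family:qualifier) and version, and that the newest version is the one returned by default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of HBase's logical layout; not part of hbase-client.
final class CellModelSketch {
    // rowKey -> "family:qualifier" -> version (newest first) -> value bytes
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    void put(String rowKey, String family, String qualifier, long version, String value) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns =
                rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, byte[]>>());
        NavigableMap<Long, byte[]> versions =
                columns.computeIfAbsent(family + ":" + qualifier,
                        k -> new TreeMap<Long, byte[]>(Comparator.<Long>reverseOrder()));
        versions.put(version, value.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the newest value of a column, or null when the cell does not exist.
    String getLatest(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(rowKey);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, byte[]> versions = columns.get(family + ":" + qualifier);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        // The map is sorted newest-first, so the first entry is the latest version.
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CellModelSketch table = new CellModelSketch();
        table.put("1001", "info", "name", 1L, "zhangsan");
        table.put("1001", "info", "name", 2L, "lisi"); // newer version of the same cell
        table.put("1001", "info", "age", 1L, "18");    // a column added on the fly
        System.out.println(table.getLatest("1001", "info", "name")); // prints lisi
        System.out.println(table.getLatest("1002", "info", "name")); // prints null
    }
}
```

Keeping versions sorted newest-first mirrors the idea that a read without an explicit version returns the most recent value.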

Create the HBase project

Create an ordinary Maven project.

Add the dependencies:

<dependencies>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.3.1</version>
        </dependency>

    </dependencies>

Create the class MyHBaseAPI and write the code below; pay attention to the comments.

package com.ujiuye;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Before;
import org.junit.Test;

import java.util.ArrayList;
import java.util.List;

public class MyHBaseAPI {
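    // Configuration shared by every test to reach HBase
    private Configuration conf = null;

    @Before // Build the configuration object before each test
    public void getConfiguration() throws Exception {
        // Use the factory method provided by HBase to create the configuration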
        conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop101");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
    }

    // Check whether a table exists
    @Test
    public void tableExist() throws Exception {
        // 1. Open a connection and get the Admin client from it;
        //    try-with-resources closes both when the test finishes
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // 2. Ask HBase whether the table exists
            boolean tableExist = admin.tableExists(TableName.valueOf("user"));
            if (tableExist) {
                System.out.println("Table exists");
            } else {
                System.out.println("Table does not exist");
            }
        }
    }

    // Create a table
    @Test
    public void createTable() throws Exception {
        // 1. Open a connection and get the Admin client; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // 2. Create the table descriptor
            HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("user"));
            // 3. Add a column family
            tableDescriptor.addFamily(new HColumnDescriptor("info"));
            // 4. Create the table
            admin.createTable(tableDescriptor);
            System.out.println("Table created");
        }
    }

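    // Delete a table
    @Test
    public void deleteTable() throws Exception {
        // 1. Open a connection and get the Admin client; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName user = TableName.valueOf("user");
            // 2. A table must be disabled before it can be deleted
            admin.disableTable(user);
            // 3. Delete the table
            admin.deleteTable(user);
            System.out.println("Table deleted");
        }
    }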

    // Insert one row into the table
    @Test
    public void insertData() throws Exception {
        // 1. Open a connection and get the table; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // 2. Wrap the data. Convert to byte arrays with HBase's Bytes utility,
            //    not with String.getBytes()
            Put put = new Put(Bytes.toBytes("1001"));
            // 3. Set the column family, column name and value
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            // 4. Write the row to the table
            table.put(put);
        }
    }

    // Insert several rows in one batch
    @Test
    public void insertDatas() throws Exception {
        // Open a connection and get the table; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Put put = new Put(Bytes.toBytes("1001"));
            // Set the column family, column name and value
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));

            Put put2 = new Put(Bytes.toBytes("1002"));
            // Set the column family, column name and value
            put2.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("wangwu"));

            // Collect the Puts in a list and send them together
            List<Put> list = new ArrayList<Put>();
            list.add(put);
            list.add(put2);
            table.put(list);
        }
    }

    // Delete one row / multiple rows
    @Test
    public void deleteData() throws Exception {
        // Open a connection and get the table; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Delete a single row:
            // Delete delete = new Delete(Bytes.toBytes("1002"));
            // table.delete(delete);

            // Batch delete
            Delete delete = new Delete(Bytes.toBytes("1001"));
            Delete delete1 = new Delete(Bytes.toBytes("1002"));
            List<Delete> list = new ArrayList<Delete>();
            list.add(delete);
            list.add(delete1);
            table.delete(list);
        }
    }

    // Read all data in the table
    @Test
    public void getAllData() throws Exception {
        // 1. Open a connection and get the table
        // 2. Build a Scan with no start/stop row (a full table scan) and open a scanner
        Scan scan = new Scan();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                Cell[] rawCells = result.rawCells(); // all cells of one row
                for (Cell cell : rawCells) {
                    String row = Bytes.toString(CellUtil.cloneRow(cell));
                    String cf = Bytes.toString(CellUtil.cloneFamily(cell));
                    String cl = Bytes.toString(CellUtil.cloneQualifier(cell));
                    String va = Bytes.toString(CellUtil.cloneValue(cell));
                    System.out.println(row + "---" + cf + "---" + cl + "---" + va);
                }
            }
        }
    }

    // Read one row, optionally restricted to a column family and column
    @Test
    public void getSomeData() throws Exception {
        // 1. Open a connection and get the table; both are closed automatically
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Get get = new Get(Bytes.toBytes("1001"));
            // Optional: uncomment to return only the info:name column
            // get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
            Result result = table.get(get);
            Cell[] rawCells = result.rawCells(); // all cells returned for this row
            for (Cell cell : rawCells) {
                String row = Bytes.toString(CellUtil.cloneRow(cell));
                String cf = Bytes.toString(CellUtil.cloneFamily(cell));
                String cl = Bytes.toString(CellUtil.cloneQualifier(cell));
                String va = Bytes.toString(CellUtil.cloneValue(cell));
                System.out.println(row + "---" + cf + "---" + cl + "---" + va);
            }
        }
    }
}
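The comment in insertData advises using HBase's Bytes utility instead of String.getBytes(). The reason is that String.getBytes() without an argument uses the platform default charset, which on JDK 1.8 can differ between a Windows development machine and the Linux cluster, while Bytes.toBytes(String) always encodes as UTF-8. The JDK-only sketch below (the class `RowKeyEncodingSketch` is made up for illustration) shows how the same text becomes different bytes under two charsets, which in HBase would mean two different row keys or values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it only uses the JDK, not hbase-client.
final class RowKeyEncodingSketch {
    // Encodes text as UTF-8, the same charset Bytes.toBytes(String) uses.
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u5f20\u4e09"; // two Chinese characters, written as escapes
        byte[] asUtf8 = utf8(name);
        byte[] asUtf16 = name.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(asUtf8.length);                  // prints 6
        System.out.println(asUtf16.length);                 // prints 4
        System.out.println(Arrays.equals(asUtf8, asUtf16)); // prints false
    }
}
```

Pure ASCII keys such as "1001" encode identically in most charsets, so the problem usually only shows up once non-ASCII text is stored.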

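Quick verification with HBase Shell

Here are a few basic HBase Shell operations; without them there is no easy way to verify the code above.

First, start HBase:

[root@hadoop101 hbase]$ bin/start-hbase.sh

Then open the HBase Shell.

In my terminal, pressing Backspace alone does not delete characters in the shell; Ctrl+Backspace is needed instead.

[root@hadoop101 hbase]$ bin/hbase shell

List the tables in the current database: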

hbase(main):002:0> list

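Query table data with scan 'table name'. STARTROW is the row to start from (inclusive) and STOPROW is the row to stop at (exclusive); without STOPROW the scan runs to the end of the table.

hbase(main):008:0> scan 'student'

hbase(main):009:0> scan 'student',{STARTROW => '1001', STOPROW => '1002'}

hbase(main):010:0> scan 'student',{STARTROW => '1001'}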
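The range behaviour of scan can be sketched with a sorted map from the JDK. This is only an analogy under two assumptions: row keys are ASCII strings (so String ordering matches HBase's byte ordering), and the class `ScanRangeSketch` and the rows in it are made up for illustration. It shows that the start row is included, the stop row is excluded, and a missing stop row means "scan to the end".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a table sorted by row key; not part of hbase-client.
final class ScanRangeSketch {
    // Returns the row keys a scan would visit: startRow inclusive, stopRow exclusive.
    // A null stopRow means "scan to the end of the table".
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        NavigableMap<String, String> range = (stopRow == null)
                ? table.tailMap(startRow, true)
                : table.subMap(startRow, true, stopRow, false);
        return new ArrayList<>(range.keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> student = new TreeMap<>();
        student.put("1001", "zhangsan");
        student.put("1002", "wangwu");
        student.put("1003", "lisi"); // extra made-up row to show the open-ended scan
        System.out.println(scan(student, "1001", "1002")); // prints [1001]
        System.out.println(scan(student, "1001", null));   // prints [1001, 1002, 1003]
    }
}
```

Because keys are compared as bytes rather than numbers, a key such as "9" sorts after "1001"; fixed-width keys avoid that surprise.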

To verify the code above, a single scan of the user table is enough.

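Memory tuning

HBase needs a lot of memory while it runs, since tables can be cached in memory, so about 70% of the available memory is typically given to the HBase Java heap. A very large heap is not recommended, however: long GC pauses can leave a RegionServer unavailable for extended periods, and 16 to 48 GB is usually enough. If the framework takes so much memory that the operating system itself runs short, the starved system services will drag the framework down as well.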
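As a rough illustration of that rule of thumb, the sketch below takes 70% of the available memory and caps the result at 48 GB. The class and method names are made up, and treating 48 GB as a hard cap is my own simplification of the "16 to 48 GB is usually enough" guidance, not an HBase setting.

```java
// Hypothetical helper for the sizing rule of thumb; not part of HBase.
final class HeapSizingSketch {
    static final long HEAP_PERCENT = 70; // share of available memory given to the heap
    static final long MAX_HEAP_GB = 48;  // assumed cap, taken from the upper end of 16-48 GB

    // Returns a suggested heap size in GB for a machine with availableGb of free memory.
    static long suggestedHeapGb(long availableGb) {
        long share = availableGb * HEAP_PERCENT / 100; // integer division rounds down
        return Math.min(share, MAX_HEAP_GB);
    }

    public static void main(String[] args) {
        System.out.println(suggestedHeapGb(32));  // prints 22
        System.out.println(suggestedHeapGb(128)); // prints 48
    }
}
```

On a 128 GB machine the 70% share would be 89 GB, so the cap is what keeps GC pauses in check there.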

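HBase's capacity in a commercial project

Per day:

Message volume: more than 6 billion messages sent and received
Nearly 100 billion data reads and writes
About 1.5 million operations per second at peak
Reads account for about 55% of the operations and writes for about 45%
More than 2 PB of data, 6 PB in total once replicas are included
Data grows by roughly 300 GB per month.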
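A little arithmetic puts these figures in relation to each other. The sketch below (class and method names are made up) converts the daily read/write count into an average per-second rate and derives the replica count implied by 2 PB of data occupying 6 PB.

```java
// Hypothetical helper that only reworks the numbers quoted above.
final class DailyLoadSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average operations per second for a given daily total (rounded down).
    static long averageOpsPerSecond(long opsPerDay) {
        return opsPerDay / SECONDS_PER_DAY;
    }

    // Number of stored copies implied by total storage versus raw data size.
    static long copiesPerByte(long totalPb, long rawPb) {
        return totalPb / rawPb;
    }

    public static void main(String[] args) {
        System.out.println(averageOpsPerSecond(100_000_000_000L)); // prints 1157407
        System.out.println(copiesPerByte(6, 2));                   // prints 3
    }
}
```

So the average load is about 1.16 million operations per second, against a peak of about 1.5 million, and the 6 PB total corresponds to three copies of each byte, which matches the default HDFS replication factor of 3.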

I only use this blog as my own notebook; any resemblance to other posts is purely coincidental.
