5分钟Flink1.10 - 自定义Data Source源

2020-07-03 14:52:33

文章内容

自定义Flink Source，案例分别实现了继承于SourceFunction的四个案例，三个完全自定义的Source，另外一个Source为常见的MySQL，通过这几个案例，启发我们进行实际案例的Source研发

代码版本

Flink : 1.10.0
Scala : 2.12.6

官网部分说明

这个是关于Interface中Souce中的信息以及链接，关于SourceFunction的说明，基本使用到的是实现了SourceFunction接口的类

Flink1.10：https://ci.apache.org/projects/flink/flink-docs-stable/api/java/index.html?org/apache/flink/streaming/api/functions/source/SourceFunction.html

ALL Known Implementing Classes 就是SourceFunction以及实现于SourceFunction的各个类

自定义Source中，我们可以使用SourceFunction也可以使用它的实现类，看具体情况

可以通过-非并行Source实现SourceFunction，或者通过实现ParallelSourceFunction接口或为并行源扩展RichParallelSourceFunction来编写自己的自定义源

以下有四个案例，可以根据代码直接进行跑通实现

自定义Source，实现自定义&并行度为1的source
自定义Source，实现一个支持并行度的source
自定义Source，实现一个支持并行度的富类source
自定义Source，实现消费MySQL中的数据

1. 自定义Source，实现自定义&并行度为1的source

自定义source，实现SourceFunction接口，实现一个没有并行度的案例

功能：每隔 1s 进行自增加1

实现的方法：run()，作为数据源，所有数据的产生都在 run() 方法中实现

文件名：MyNoParallelFunction.scala

package com.tech.consumer

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
/**
  * 创建自定义并行度为1的source
  */
class MyNoParallelFunction extends SourceFunction[Long]{
  var count = 0L
  var isRunning = true

  override def run(ctx: SourceContext[Long]): Unit = {
    while( isRunning ) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

Flink main函数中使用

文件名：StreamWithMyNoParallelFunction.scala

package com.tech.consumer

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object StreamWithMyNoParallelFunction {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new MyNoParallelFunction)
    val mapData = stream.map(line => {
      print("received data: " + line)
      line
    })
    mapData.setParallelism(1)

    env.execute("StreamWithMyNoParallelFunction")
  }
}

执行起来，就可以看到数据的打印，就是我们想要得到的数据源不断的产出：

2. 自定义Source，实现一个支持并行度的source

实现ParallelSourceFunction接口

该接口只是个标记接口，用于标识继承该接口的Source都是并行执行的。其直接实现类是RichParallelSourceFunction，它是一个抽象类并继承自 AbstractRichFunction（从名称可以看出，它应该兼具 rich 和 parallel 两个特性，这里的rich体现在它定义了 open 和 close 这两个方法）。

MyParallelFunction.scala

package com.tech.consumer

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

/**
  * 实现一个可以自定义的source
 */
class MyParallelFunction extends ParallelSourceFunction[Long]{
  var count = 0L
  var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while(isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

Flink main函数中使用

文件名：StreamWithMyParallelFunction.scala

package com.tech.consumer

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object StreamWithMyParallelFunction {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new MyParallelFunction)

    stream.print("received data: ")

    env.execute("StreamWithMyParallelFunction")
  }
}

如果使用该自定义Source，如果代码中没有设置并行度，会根据机器性能自动设置并行度。如机器是8核，则打印出来有8个并行度的数据

根据我找出的cpu记录，就是记录着正在运行的程序，以及下面打印出来的数据

3. 自定义Source，实现一个支持并行度的富类source

RichParallelSourceFunction 中的rich体现在额外提供open和close方法

针对source中如果需要获取其他链接资源，那么可以在open方法中获取资源链接，在close中关闭资源链接

文件名：MyRichParallelSourceFunction.scala

package com.tech.consumer

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class MyRichParallelSourceFunction extends RichParallelSourceFunction[Long]{

  var count = 0L
  var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while(isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    // 如果需要获取其他链接资源，那么可以在open方法中获取资源链接
    print("资源链接.. ")
  }

  override def close(): Unit = {
    // 在close中关闭资源链接
    print("资源关闭.. ")
  }
}

文件名：StreamWithMyRichParallelSourceFunction.scala

package com.tech.consumer

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object StreamWithMyRichParallelSourceFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream = env.addSource(new MyRichParallelSourceFunction)
    stream.print("received data")

    env.execute("StreamWithMyRichParallelSourceFunction")
  }
}

从 “资源链接” 可以看到是执行在所有数据流之前的，可以用来定义一些数据源的连接信息，比如说MySQL的连接信息

4. 自定义Source，实现消费MySQL中的数据

这个更加接近实际的案例

4.1 首先添加pom依赖

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.15</version>
</dependency>

4.2 创建mysql表，作为读取的数据源

CREATE TABLE `person` (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  name varchar(260) NOT NULL DEFAULT '' COMMENT '姓名',
  age int(11) unsigned NOT NULL DEFAULT '0' COMMENT '年龄',
  sex tinyint(2) unsigned NOT NULL DEFAULT '2' COMMENT '0:女, 1男',
  email text COMMENT '邮箱',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COMMENT='人员定义';

随后插入一些数据，作为数据源的内容

insert into person values
  (null, 'Johngo12', 12, 1, 'Source01@flink.com'),
  (null, 'Johngo13', 13, , 'Source02@flink.com'),
  (null, 'Johngo14', 14, , 'Source03@flink.com'),
  (null, 'Johngo15', 15, , 'Source04@flink.com'),
  (null, 'Johngo16', 16, 1, 'Source05@flink.com'),
  (null, 'Johngo17', 17, 1, 'Source06@flink.com'),
  (null, 'Johngo18', 18, , 'Source07@flink.com'),
  (null, 'Johngo19', 19, , 'Source08@flink.com'),
  (null, 'Johngo20', 20, 1, 'Source09@flink.com'),
  (null, 'Johngo21', 21, , 'Source10@flink.com');

4.3 保存实体类Person Bean

package com.tech.bean

import scala.beans.BeanProperty

class Person(
              @BeanProperty var id:Int = ,
              @BeanProperty var name:String = "",
              @BeanProperty var age:Int = ,
              @BeanProperty var sex:Int = 2,
              @BeanProperty var email:String = ""
            ) {
}

4.4 创建自定义Source类，继承 RichSourceFunction

文件名：RichSourceFunctionFromMySQL.scala

package com.tech.consumer

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import com.tech.bean.Person
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class RichSourceFunctionFromMySQL extends RichSourceFunction[Person]{

  var isRUNNING: Boolean = true
  var ps: PreparedStatement = null
  var conn: Connection = null

  // 建立连接
  /**
    * 与MySQL建立连接信息
    * @return
    */
  def getConnection():Connection = {
    var conn: Connection = null
    val DB_URL: String = "jdbc:mysql://localhost:3306/flinkData?useUnicode=true&characterEncoding=UTF-8"
    val USER: String = "root"
    val PASS: String = "admin123"

    try{
      Class.forName("com.mysql.cj.jdbc.Driver")
      conn = DriverManager.getConnection(DB_URL, USER, PASS)
    } catch {
      case _: Throwable => println("due to the connect error then exit!")
    }
    conn
  }

  /**
    * open()方法初始化连接信息
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    conn = this.getConnection()
    val sql = "select * from person"
    ps = this.conn.prepareStatement(sql)
  }

  /**
    * main方法中调用run方法获取数据
    * @param ctx
    */
  override def run(ctx: SourceFunction.SourceContext[Person]): Unit = {
    val person:Person = new Person()
    val resSet:ResultSet = ps.executeQuery()
    while(isRUNNING & resSet.next()) {
      person.setId(resSet.getInt("id"))
      person.setName(resSet.getString("name"))
      person.setAge(resSet.getInt("age"))
      person.setSex(resSet.getInt("sex"))
      person.setEmail(resSet.getString("email"))

      ctx.collect(person)
    }
  }

  override def cancel(): Unit = {
    isRUNNING = false
  }

  /**
    * 关闭连接信息
    */
  override def close(): Unit = {
    if(conn != null) {
      conn.close()

    }
    if(ps != null) {
      ps.close()
    }
  }
}

将上述Source作为数据源，进行消费，当前打印到控制台

文件名：StreamRichSourceFunctionFromMySQL.scala

package com.tech.flink

import com.google.gson.Gson
import com.tech.consumer.RichSourceFunctionFromMySQL
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment


/**
  * 从MySQL中读取数据 & 打印到控制台
  */
object StreamRichSourceFunctionFromMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream = env.addSource(new RichSourceFunctionFromMySQL())
    val personInfo = stream.map(line => {
      new Gson().toJson(line)
    })
    personInfo.print("Data From MySQL ").setParallelism(1)

    env.execute(StreamRichSourceFunctionFromMySQL.getClass.getName)
  }
}

强烈建议使用Google的Json包，fastJSON会出现坑

好了，现在就把刚刚存放到MySQL中的数据读取了出来，MySQL中将近900多条数据，看图

是不是很流畅的将MySQL作为数据源进行了Mysql中数据的消费，后面有机会可以做更多的自定义数据源，也可以在实例工作中按照上述思路进行开发。

文章内容

继承上一篇Source源是MySQL的思路，本文想要想要将数据Sink到MySQL

那咱们本文的基本思路是，先把数据生产至Kafka，然后将Kafka中的数据Sink到MySQL，这么一条流下来，不断的往Kafka生产数据，不断的往MySQL插入数据

代码版本

Flink : 1.10.0
Scala : 2.12.6

下面图中是Flink1.10.0版本官网给出的可以sink的组件，大家可以自寻查看

1. 准备pom依赖：

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.15</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.10_2.12</artifactId>
    <version>1.10.0</version>
</dependency>

2. 创建MySQL表：

CREATE TABLE `person` (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  name varchar(260) NOT NULL DEFAULT '' COMMENT '姓名',
  age int(11) unsigned NOT NULL DEFAULT '0' COMMENT '年龄',
  sex tinyint(2) unsigned NOT NULL DEFAULT '2' COMMENT '0:女, 1男',
  email text COMMENT '邮箱',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COMMENT='人员定义';

3. 准备Person bean

package com.tech.bean

import scala.beans.BeanProperty

class Person() {
  @BeanProperty var id:Int = 0
  @BeanProperty var name:String = _
  @BeanProperty var age:Int = 0
  @BeanProperty var sex:Int = 2
  @BeanProperty var email:String = _
}

4. 工具类 - 向Kafka生产数据

package com.tech.util

import java.util.Properties

import com.google.gson.Gson
import com.tech.bean.Person
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

/**
  * 创建 topic:
  * kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person
  *
  * 消费数据:
  * kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person --from-beginning
  */
object ProduceToKafkaUtil {
  final val broker_list: String = "localhost:9092"
  final val topic = "person"

  def produceMessageToKafka(): Unit = {
    val writeProps = new Properties()
    writeProps.setProperty("bootstrap.servers", broker_list)
    writeProps.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    writeProps.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](writeProps)

    for (i <- 1 to 10000) {
      val person: Person = new Person()
      person.setId(i)
      person.setName("Johngo" + i)
      person.setAge(10 + i)
      person.setSex(i%2)
      person.setEmail("Johngo" + i + "@flink.com")
      val record = new ProducerRecord[String, String](topic, null, null,  new Gson().toJson(person))
      producer.send(record)
      println("SendMessageToKafka: " + new Gson().toJson(person))
      Thread.sleep(3000)
    }
    producer.flush()
  }

  def main(args: Array[String]): Unit = {
    this.produceMessageToKafka()
  }
}

可以通过下面kafka语句消费进行测试是否写入了kafka

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person --from-beginning

将程序启动，终端消费数据看看

说明可以往Kafka生产数据了

5. SinkToMySQL - 自定义Sink到MySQL

继承RichSinkFunction，进行自定义Sink的开发

文件名：RichSinkFunctionToMySQL.scala

package com.tech.sink

import java.sql.{Connection, DriverManager, PreparedStatement}

import com.tech.bean.Person
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class RichSinkFunctionToMySQL extends RichSinkFunction[Person]{
  var isRUNNING: Boolean = true
  var ps: PreparedStatement = null
  var conn: Connection = null

  // 建立连接
  def getConnection():Connection = {
    var conn: Connection = null
    val DB_URL: String = "jdbc:mysql://localhost:3306/flinkData?useUnicode=true&characterEncoding=UTF-8"
    val USER: String = "root"
    val PASS: String = "admin123"

    try{
      Class.forName("com.mysql.jdbc.Driver")
      conn = DriverManager.getConnection(DB_URL, USER, PASS)
    } catch {
      case _: Throwable => println("due to the connect error then exit!")
    }
    conn
  }

  /**
    * open()初始化建立和 MySQL 的连接
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    conn = getConnection()
    val sql: String = "insert into Person(id, name, password, age) values(?, ?, ?, ?);"
    ps = this.conn.prepareStatement(sql)
  }

  /**
    * 组装数据，进行数据的插入操作
    * 对每条数据的插入都要调用invoke()方法
    *
    * @param value
    */
    override def invoke(value: Person): Unit = {
      ps.setInt(1, value.getId())
      ps.setString(2, value.getName())
      ps.setInt(3, value.getAge())
      ps.setInt(4, value.getSex())
      ps.setString(5, value.getEmail())
      ps.executeUpdate()
    }

  override def close(): Unit = {
    if (conn != null) {
      conn.close()
    }
    if(ps != null) {
      ps.close()
    }
  }
}

6. Flink程序调用起来

文件名：FromKafkaToMySQL.scala

package com.tech.flink

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import com.google.gson.Gson
import com.tech.bean.Person
import com.tech.sink.RichSinkFunctionToMySQL
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010

object FromKafkaToMySQL {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())

    val brokerList: String = "localhost:9092"
    val topic = "person"
    val gs: Gson = new Gson

    val props = new Properties()
    props.setProperty("bootstrap.servers", brokerList)
    props.setProperty("group.id", "person")
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")


    val consumer = new FlinkKafkaConsumer010[String](topic, new SimpleStringSchema(), props)
    val stream = env.addSource(consumer).map(
        line => {
        print("line" + line + "\n")
        JSON.parseObject(line, classOf[Person])
    })

    stream.addSink(new RichSinkFunctionToMySQL())

    env.execute("FromKafkaToMySQL")
  }
}