What this article covers
Custom Flink Sources: four examples built on SourceFunction and its subtypes, three fully hand-written sources plus one that reads from MySQL. Together they should give you a template for developing sources for your own use cases.
Versions
Flink: 1.10.0
Scala: 2.12.6
Notes from the official documentation
The official Javadoc describes the SourceFunction interface; in practice a custom source is a class that implements this interface, either directly or through one of its subtypes.
The "All Known Implementing Classes" section of that Javadoc lists SourceFunction together with the classes that implement it.
For a custom source we can implement SourceFunction itself or use one of those subtypes, depending on the situation: write a non-parallel source by implementing SourceFunction, or a parallel source by implementing ParallelSourceFunction or by extending RichParallelSourceFunction.
The four examples below can be run as-is from the code:
- Custom Source: a source with parallelism 1
- Custom Source: a source that supports parallelism
- Custom Source: a rich source that supports parallelism
- Custom Source: a source that consumes data from MySQL
1. Custom Source: a source with parallelism 1
Implement the SourceFunction interface to build a non-parallel source.
Behavior: emit a counter that increments by 1 every second.
Key method: run(). A source produces all of its data inside run().
File: MyNoParallelFunction.scala
package com.tech.consumer
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
/**
 * A custom source with parallelism 1 (non-parallel).
 */
class MyNoParallelFunction extends SourceFunction[Long] {
  var count = 0L
  var isRunning = true

  override def run(ctx: SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
Using it in a Flink job
File: StreamWithMyNoParallelFunction.scala
package com.tech.consumer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
object StreamWithMyNoParallelFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new MyNoParallelFunction)
    val mapData = stream.map(line => {
      println("received data: " + line)
      line
    })
    mapData.setParallelism(1)
    env.execute("StreamWithMyNoParallelFunction")
  }
}
Run it and the counter values are printed continuously, which is exactly the kind of steadily producing source we wanted.
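With the map parallelism set to 1, the console output should look roughly like this, one new value per second:
received data: 0
received data: 1
received data: 2
...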
2. Custom Source: a source that supports parallelism
Implement the ParallelSourceFunction interface.
ParallelSourceFunction is only a marker interface: it tells Flink that sources implementing it may run with a parallelism greater than 1. Its direct implementing class is RichParallelSourceFunction, an abstract class that extends AbstractRichFunction; as the name suggests it is both rich and parallel, where "rich" means it additionally provides the open and close lifecycle methods.
File: MyParallelFunction.scala
package com.tech.consumer
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
/**
 * A custom source that can run with parallelism greater than 1.
 */
class MyParallelFunction extends ParallelSourceFunction[Long] {
  var count = 0L
  var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
Using it in a Flink job
File: StreamWithMyParallelFunction.scala
package com.tech.consumer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
object StreamWithMyParallelFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new MyParallelFunction)
    stream.print("received data: ")
    env.execute("StreamWithMyParallelFunction")
  }
}
If this source is used without setting a parallelism in the code, Flink falls back to the default parallelism, which in a local environment equals the number of available CPU cores. On an 8-core machine the printed output therefore comes from 8 parallel source subtasks.
You can verify this by comparing the machine's core count with the subtask prefixes that show up in the printed output.
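If you want a fixed number of source subtasks regardless of the machine, set the parallelism on the source explicitly. A minimal sketch (the value 2 is just an example):
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Run exactly 2 copies of the parallel source instead of one per CPU core.
val stream = env.addSource(new MyParallelFunction).setParallelism(2)
stream.print("received data: ")
env.execute("StreamWithMyParallelFunction")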
3. Custom Source: a rich source that supports parallelism
The "rich" in RichParallelSourceFunction means it additionally provides the open and close methods.
If the source needs external resources (for example a database connection), acquire them in open() and release them in close().
File: MyRichParallelSourceFunction.scala
package com.tech.consumer
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
class MyRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
  var count = 0L
  var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    // Acquire external resources (e.g. a database connection) here.
    print("opening resource.. ")
  }

  override def close(): Unit = {
    // Release the resources acquired in open().
    print("closing resource.. ")
  }
}
File: StreamWithMyRichParallelSourceFunction.scala
package com.tech.consumer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
object StreamWithMyRichParallelSourceFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new MyRichParallelSourceFunction)
    stream.print("received data")
    env.execute("StreamWithMyRichParallelSourceFunction")
  }
}
The "opening resource.." message is printed before any records flow, which shows that open() runs before the data stream starts; this is the place to set up connection information for the data source, such as MySQL connection details.
4. Custom Source: a source that consumes data from MySQL
This example is closer to a real-world use case.
4.1 Add the pom dependency
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.15</version>
</dependency>
4.2 Create the MySQL table that serves as the data source
CREATE TABLE `person` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(260) NOT NULL DEFAULT '' COMMENT 'name',
  `age` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'age',
  `sex` tinyint(2) unsigned NOT NULL DEFAULT '2' COMMENT '0: female, 1: male',
  `email` text COMMENT 'email',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COMMENT='person table';
Then insert some rows so the source has something to read:
insert into person values
(null, 'Johngo12', 12, 1, 'Source01@flink.com'),
(null, 'Johngo13', 13, 0, 'Source02@flink.com'),
(null, 'Johngo14', 14, 0, 'Source03@flink.com'),
(null, 'Johngo15', 15, 0, 'Source04@flink.com'),
(null, 'Johngo16', 16, 1, 'Source05@flink.com'),
(null, 'Johngo17', 17, 1, 'Source06@flink.com'),
(null, 'Johngo18', 18, 0, 'Source07@flink.com'),
(null, 'Johngo19', 19, 0, 'Source08@flink.com'),
(null, 'Johngo20', 20, 1, 'Source09@flink.com'),
(null, 'Johngo21', 21, 0, 'Source10@flink.com');
4.3 The Person bean
package com.tech.bean
import scala.beans.BeanProperty
class Person(
  @BeanProperty var id: Int = 0,
  @BeanProperty var name: String = "",
  @BeanProperty var age: Int = 0,
  @BeanProperty var sex: Int = 2,
  @BeanProperty var email: String = ""
)
4.4 Create the custom source class by extending RichSourceFunction
File: RichSourceFunctionFromMySQL.scala
package com.tech.consumer
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import com.tech.bean.Person
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
class RichSourceFunctionFromMySQL extends RichSourceFunction[Person] {
  var isRUNNING: Boolean = true
  var ps: PreparedStatement = null
  var conn: Connection = null

  /**
   * Open a JDBC connection to MySQL.
   * @return the connection, or null if connecting fails
   */
  def getConnection(): Connection = {
    var conn: Connection = null
    val DB_URL: String = "jdbc:mysql://localhost:3306/flinkData?useUnicode=true&characterEncoding=UTF-8"
    val USER: String = "root"
    val PASS: String = "admin123"
    try {
      Class.forName("com.mysql.cj.jdbc.Driver")
      conn = DriverManager.getConnection(DB_URL, USER, PASS)
    } catch {
      case _: Throwable => println("due to the connect error then exit!")
    }
    conn
  }

  /**
   * open() initializes the connection and prepares the query.
   * @param parameters
   */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    conn = this.getConnection()
    val sql = "select * from person"
    ps = this.conn.prepareStatement(sql)
  }

  /**
   * run() is invoked by the Flink runtime and emits the rows read from MySQL.
   * @param ctx
   */
  override def run(ctx: SourceFunction.SourceContext[Person]): Unit = {
    val resSet: ResultSet = ps.executeQuery()
    while (isRUNNING && resSet.next()) {
      // Build a fresh Person for every row instead of reusing one mutable object.
      val person: Person = new Person()
      person.setId(resSet.getInt("id"))
      person.setName(resSet.getString("name"))
      person.setAge(resSet.getInt("age"))
      person.setSex(resSet.getInt("sex"))
      person.setEmail(resSet.getString("email"))
      ctx.collect(person)
    }
  }

  override def cancel(): Unit = {
    isRUNNING = false
  }

  /**
   * Release the statement and the connection.
   */
  override def close(): Unit = {
    if (ps != null) {
      ps.close()
    }
    if (conn != null) {
      conn.close()
    }
  }
}
Now consume from this source; for the moment we simply print the records to the console.
File: StreamRichSourceFunctionFromMySQL.scala
package com.tech.flink
import com.google.gson.Gson
import com.tech.consumer.RichSourceFunctionFromMySQL
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
/**
 * Read data from MySQL and print it to the console.
 */
object StreamRichSourceFunctionFromMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.addSource(new RichSourceFunctionFromMySQL())
    val personInfo = stream.map(line => {
      new Gson().toJson(line)
    })
    personInfo.print("Data From MySQL ").setParallelism(1)
    env.execute(StreamRichSourceFunctionFromMySQL.getClass.getName)
  }
}
For the JSON step I strongly recommend Google's Gson package; in my experience fastJSON runs into problems here.
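As a minimal standalone sketch of how Gson round-trips the Person bean from section 4.3 (assuming the Gson dependency is on the classpath; the object name and field values here are made up for illustration):
import com.google.gson.Gson
import com.tech.bean.Person

object GsonRoundTripExample {
  def main(args: Array[String]): Unit = {
    val gson = new Gson()
    val person = new Person(1, "Johngo12", 12, 1, "Source01@flink.com")
    // Bean -> JSON string
    val json = gson.toJson(person)
    println(json)
    // JSON string -> bean
    val parsed = gson.fromJson(json, classOf[Person])
    println(parsed.getName())
  }
}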
With that, the data we just stored in MySQL is read back out; my test table holds close to 900 rows.
Using MySQL as a data source really is that straightforward. With the same approach you can build further custom sources, and you can follow the same pattern when developing sources for real projects.
What this article covers
Following on from the previous article, where MySQL served as the source, this article sinks data to MySQL.
The basic flow: produce data into Kafka, then sink the Kafka data into MySQL, so that records keep flowing into Kafka and keep being inserted into MySQL.
Versions
Flink: 1.10.0
Scala: 2.12.6
The Flink 1.10.0 documentation lists the connectors that Flink can sink to; have a look at the official site for the full list.
1. Add the pom dependencies:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.15</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.10_2.12</artifactId>
<version>1.10.0</version>
</dependency>
2. Create the MySQL table:
CREATE TABLE `person` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(260) NOT NULL DEFAULT '' COMMENT 'name',
  `age` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'age',
  `sex` tinyint(2) unsigned NOT NULL DEFAULT '2' COMMENT '0: female, 1: male',
  `email` text COMMENT 'email',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COMMENT='person table';
3. Prepare the Person bean
package com.tech.bean
import scala.beans.BeanProperty
class Person() {
  @BeanProperty var id: Int = 0
  @BeanProperty var name: String = _
  @BeanProperty var age: Int = 0
  @BeanProperty var sex: Int = 2
  @BeanProperty var email: String = _
}
4. Utility class that produces data into Kafka
package com.tech.util
import java.util.Properties
import com.google.gson.Gson
import com.tech.bean.Person
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
/**
 * Create the topic:
 * kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person
 *
 * Consume the data:
 * kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person --from-beginning
 */
object ProduceToKafkaUtil {
  final val broker_list: String = "localhost:9092"
  final val topic = "person"

  def produceMessageToKafka(): Unit = {
    val writeProps = new Properties()
    writeProps.setProperty("bootstrap.servers", broker_list)
    writeProps.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    writeProps.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](writeProps)
    for (i <- 1 to 10000) {
      val person: Person = new Person()
      person.setId(i)
      person.setName("Johngo" + i)
      person.setAge(10 + i)
      person.setSex(i % 2)
      person.setEmail("Johngo" + i + "@flink.com")
      val record = new ProducerRecord[String, String](topic, null, null, new Gson().toJson(person))
      producer.send(record)
      println("SendMessageToKafka: " + new Gson().toJson(person))
      Thread.sleep(3000)
    }
    producer.flush()
  }

  def main(args: Array[String]): Unit = {
    this.produceMessageToKafka()
  }
}
You can check whether the data actually reached Kafka with the console consumer:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person --from-beginning
Start the producer program and watch the consumer in a terminal.
The messages show up, so we can produce data into Kafka.
5. SinkToMySQL: a custom sink to MySQL
Extend RichSinkFunction to implement the custom sink.
File: RichSinkFunctionToMySQL.scala
package com.tech.sink
import java.sql.{Connection, DriverManager, PreparedStatement}
import com.tech.bean.Person
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
class RichSinkFunctionToMySQL extends RichSinkFunction[Person] {
  var ps: PreparedStatement = null
  var conn: Connection = null

  // Open a JDBC connection to MySQL.
  def getConnection(): Connection = {
    var conn: Connection = null
    val DB_URL: String = "jdbc:mysql://localhost:3306/flinkData?useUnicode=true&characterEncoding=UTF-8"
    val USER: String = "root"
    val PASS: String = "admin123"
    try {
      Class.forName("com.mysql.cj.jdbc.Driver")
      conn = DriverManager.getConnection(DB_URL, USER, PASS)
    } catch {
      case _: Throwable => println("due to the connect error then exit!")
    }
    conn
  }

  /**
   * open() initializes the MySQL connection and prepares the insert statement.
   * @param parameters
   */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    conn = getConnection()
    val sql: String = "insert into person(id, name, age, sex, email) values(?, ?, ?, ?, ?);"
    ps = this.conn.prepareStatement(sql)
  }

  /**
   * invoke() is called once for every record: bind the fields and run the insert.
   *
   * @param value
   */
  override def invoke(value: Person): Unit = {
    ps.setInt(1, value.getId())
    ps.setString(2, value.getName())
    ps.setInt(3, value.getAge())
    ps.setInt(4, value.getSex())
    ps.setString(5, value.getEmail())
    ps.executeUpdate()
  }

  override def close(): Unit = {
    if (ps != null) {
      ps.close()
    }
    if (conn != null) {
      conn.close()
    }
  }
}
6. Wire everything up in a Flink job
File: FromKafkaToMySQL.scala
package com.tech.flink
import java.util.Properties

import com.google.gson.Gson
import com.tech.bean.Person
import com.tech.sink.RichSinkFunctionToMySQL
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010

object FromKafkaToMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())
    val brokerList: String = "localhost:9092"
    val topic = "person"

    val props = new Properties()
    props.setProperty("bootstrap.servers", brokerList)
    props.setProperty("group.id", "person")
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new FlinkKafkaConsumer010[String](topic, new SimpleStringSchema(), props)
    val stream = env.addSource(consumer).map(
      line => {
        println("line: " + line)
        // Parse the JSON written by ProduceToKafkaUtil back into a Person (Gson, as recommended earlier).
        new Gson().fromJson(line, classOf[Person])
      })
    stream.addSink(new RichSinkFunctionToMySQL())
    env.execute("FromKafkaToMySQL")
  }
}
The records are printed to the console.
Then check the data in MySQL.
The rows show up in MySQL as well, which completes the sink-to-MySQL pipeline.