I have a use case where I need to read a BigQuery table into a Dataflow pipeline, iterate over the rows of that PCollection to construct a graph data structure, and then pass that graph as a side input to later transform steps that take both the graph and another BigQuery table PCollection as inputs. Below is what I have right now:
Pipeline pipeline = Pipeline.create(options);
// Read the first table from BigQuery
PCollection<TableRow> bqTable = pipeline.apply("ReadFooBQTable", BigQueryIO.Read.from("Table"));
// Build the "graph" data structure from the rows of bqTable -- this is the part I still need to figure out.
// Then pass the graph as a side input to the downstream transform.
// pCol below is the PCollection read from the second BigQuery table.
pCol.apply("Process", ParDo.withSideInputs(graph).of(new BlueKai.ProcessBatch(graph)))
    .apply("Write", Write.to(new DecoratedFileSink<String>(standardBucket, "csv",
        TextIO.DEFAULT_TEXT_CODER, null, null, WriterOutputGzipDecoratorFactory.getInstance()))
        .withNumShards(numChunks));
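One direction I have been considering, though I am not sure it is correct, is to expose the first table as a View.asList() side input and build the graph lazily inside the DoFn, roughly as sketched below. This assumes the graph table fits in worker memory; the graphRowsView and processed names, the adjacency-map representation of the graph, the "src"/"dst"/"id" column names, and the placeholder output line are all made up and would be replaced by my real logic.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.View;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

// Expose every row of the graph table to the DoFn as a single side input.
final PCollectionView<List<TableRow>> graphRowsView =
    bqTable.apply("GraphRowsAsView", View.<TableRow>asList());

PCollection<String> processed = pCol.apply("Process",
    ParDo.withSideInputs(graphRowsView).of(new DoFn<TableRow, String>() {
      // Adjacency map built lazily on each worker from the side-input rows.
      private transient Map<String, List<String>> graph;

      @Override
      public void processElement(ProcessContext c) {
        if (graph == null) {
          graph = new HashMap<>();
          for (TableRow row : c.sideInput(graphRowsView)) {
            String src = (String) row.get("src"); // placeholder column names
            String dst = (String) row.get("dst");
            List<String> neighbors = graph.get(src);
            if (neighbors == null) {
              neighbors = new ArrayList<>();
              graph.put(src, neighbors);
            }
            neighbors.add(dst);
          }
        }
        // Placeholder: combine c.element() with the graph to produce one CSV line.
        c.output(String.valueOf(c.element().get("id")));
      }
    }));

The processed PCollection would then feed into the same Write step shown above. Is this a reasonable way to turn a PCollection into a graph side input, or is there a better pattern for this?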