In PEGA, it is usually easy to guess the purpose of a rule from its name. A Data Transform, as the name suggests, is used to transform data. Similarly, a Data Flow is used to define the flow of data from a source to a destination.
What is Data Flow?
Data flow is a rule type in PEGA. It is suited to situations where the volume of data is huge and performance is a key concern. Data flow is widely used in the PEGA Marketing framework, where customer records might run into millions. It has a lot of in-built configuration to optimize performance during execution.
How to configure Data Flow?
Data flow can be found under the Data Model category in Records Explorer.
Like other rules, its new rule form prompts for Name, Class, Ruleset and ruleset version (RSV).
Data flow looks similar to our process flow in terms of configuration. It has a lot of in-built shapes which are used to build the flow.
Different shapes and their usage in Data Flow
It’s mandatory to have a source and a destination for each data flow. There should be only one source shape, whereas there can be more than one destination shape.
Source
Compose
Merge
Data Transform
Convert
Filter
Text Analyzer
Strategy
Destination
What is Data Set?
Data set is a rule type in PEGA which can be a representation of a table or any other source mentioned in the below screenshot. It can help us view the stored information at design time.
For demo purposes, a data set is created to represent the customer data type. We can pull the list of records in the customer data type by just running the data set stand-alone.
Below is an example run of the data set with the operation set to “Browse”. Other operations are not covered in this post and are left for exploration.
A Data set can be executed from an activity using the DataSet-Execute method.
Let’s understand the usage of Data flow with a real-world scenario.
Business Scenario
Let’s consider a scenario where records from an external database table need to be copied to a PEGA database table. Each record should be updated or inserted depending on whether it already exists in the destination. The external table might have millions of records, and it is up to us to define the best strategy to implement this.
There are many ways to implement this. A few approaches are listed below:
- Configure ETL jobs to copy data on a daily basis.
- Write an activity to browse the records and add logic to insert them into the PEGA DB.
- Configure a stored procedure to do the same.
Let’s discuss how best we can implement the same using Data Flow. To start with, let’s break down the implementation into small steps:
- Establish connection to the external database table
- Configure logic to browse records from external table
- Write logic to check if record exists then update, else insert
- Save and commit the changes to PEGA DB table
Establish connection to the external database table
Since we are on Personal Edition, we have created an internal data type with 1 lakh (100,000) records to stand in for the external table. These records will be copied to another table of the same structure. Steps 2, 3 and 4 can be easily implemented with the below data flow configuration.
We have loaded 1 lakh sample records into our customer table [source] to explore data flow usage and to compare its performance with an activity-based implementation.
Configure logic to browse records from external table
Browsing is handled by configuring a source data set. When the data flow is executed, the configured data set browses through the records, and the flow is executed for each record. The Convert shape is used to change the page class from the External Customer context to the Internal Customer context. Any additional properties required can also be mapped using the Convert shape.
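The record-by-record behaviour described above can be pictured with a small Python sketch. This is only an analogy and not Pega code: the `ExternalCustomer` and `InternalCustomer` classes, their fields, and the `browse`, `convert` and `run_flow` functions are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ExternalCustomer:  # hypothetical stand-in for the source page class
    cust_id: str
    full_name: str


@dataclass
class InternalCustomer:  # hypothetical stand-in for the destination page class
    customer_id: str
    name: str


def browse(source: Iterable[ExternalCustomer]) -> Iterator[ExternalCustomer]:
    """Plays the role of the source data set: hands over one record at a time."""
    for record in source:
        yield record


def convert(record: ExternalCustomer) -> InternalCustomer:
    """Plays the role of the Convert shape: changes the class and maps properties."""
    return InternalCustomer(customer_id=record.cust_id, name=record.full_name)


def run_flow(source: Iterable[ExternalCustomer], destination: List[InternalCustomer]) -> int:
    """Runs the flow once per browsed record and returns how many were processed."""
    processed = 0
    for record in browse(source):
        destination.append(convert(record))
        processed += 1
    return processed


source_rows = [ExternalCustomer("C-1", "Asha"), ExternalCustomer("C-2", "Ravi")]
destination_rows: List[InternalCustomer] = []
count = run_flow(source_rows, destination_rows)
```

The point of the sketch is that each record passes through the shapes in turn, so the whole source never has to be held in memory at once.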
Write logic to check if record exists then update, else insert
When implemented using an activity, we have to manually add logic to check whether the record exists in the DB. It becomes much simpler with a data set as the destination. In the configuration panel of the destination data set, an option called Insert new and overwrite existing records automatically takes care of the update/insert.
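To see why the single option is simpler than a manual existence check, here is a minimal Python sketch that uses SQLite as a stand-in database. It does not show how Pega implements the option internally: the `customer` table, its columns and both helper functions are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES ('C-1', 'Old Name')")


def save_with_manual_check(conn, customer_id, name):
    """Activity-style logic: look the record up first, then update or insert."""
    exists = conn.execute(
        "SELECT 1 FROM customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if exists:
        conn.execute(
            "UPDATE customer SET name = ? WHERE customer_id = ?", (name, customer_id)
        )
    else:
        conn.execute(
            "INSERT INTO customer (customer_id, name) VALUES (?, ?)", (customer_id, name)
        )


def save_with_upsert(conn, customer_id, name):
    """Single statement that inserts new keys and overwrites existing ones."""
    conn.execute(
        "INSERT OR REPLACE INTO customer (customer_id, name) VALUES (?, ?)",
        (customer_id, name),
    )


save_with_manual_check(conn, "C-1", "Updated by check")      # key exists: update
save_with_manual_check(conn, "C-2", "Inserted by check")     # key missing: insert
save_with_upsert(conn, "C-2", "Overwritten by upsert")       # key exists: overwrite
save_with_upsert(conn, "C-3", "Inserted by upsert")          # key missing: insert

rows = conn.execute(
    "SELECT customer_id, name FROM customer ORDER BY customer_id"
).fetchall()
```

Both helpers leave the table in a consistent state, but the second one needs no lookup logic of our own, which is the convenience the destination option gives us.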
Save and commit the changes to PEGA DB table
When executed, the destination data set saves the changes directly to the DB. No explicit commit is needed, similar to the RDB methods in PEGA. Records saved through a data set operation are not eligible for declarative trigger operations, and the changes cannot be rolled back once saved.
Executing the configured data flow
To execute a data flow, an active node should be configured for Data flow processing.
From Designer studio navigate to
Decisioning-> Infrastructure-> Services-> Data Flow
We can run a data flow stand-alone using the Run option from the Actions menu, or from an activity using the “DataFlow-Execute” method.
We can use the operation to start, stop, or get the progress of a running data flow using the run ID. Each data flow run is created as an instance of the work class “Pega-DM-DDF-Work”, and the run ID provided here is used as the primary key.
We can view the run statistics by navigating to,
Decisioning-> Decisions-> Data Flow-> Batch Processing
It took just 40 seconds to process one lakh (100,000) records, with an average speed of 468 µs/rec (2,136 rec/s).
We can further optimize the performance of the data flow by analyzing the statistics of the component that takes the most time. In our scenario, the destination data set took the most time to execute, which can be optimized by performing a vacuum. Please refer to our previous post for detailed information on the vacuum process.
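The two speed figures are the same measurement expressed in different units. The short Python sketch below shows the conversion, using only the numbers quoted above; the function names are our own. The duration it computes is a simple single-stream estimate and is not the elapsed time shown on the run statistics screen.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def records_per_second(avg_us_per_record: float) -> float:
    """Convert an average per-record time in microseconds to a throughput."""
    return MICROSECONDS_PER_SECOND / avg_us_per_record


def estimated_seconds(record_count: int, avg_us_per_record: float) -> float:
    """Estimate total time if records were processed one after another at that average."""
    return record_count * avg_us_per_record / MICROSECONDS_PER_SECOND


rate = records_per_second(468)              # about 2,136 records per second
duration = estimated_seconds(100_000, 468)  # about 46.8 seconds for one lakh records
```

The estimate lands in the same range as the reported run time, which is a quick sanity check when reading the statistics of your own runs.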
Data flow and data sets can be used to efficiently handle large sets of data.
Nice blog
Thanks @Rayudu Addagarla.
Subscribe and be part of our OneStopPEGA family 🙂
Shared a good topic. Nice blog.
Thanks @Sindhu
Very good and precise
Thanks @Narasimha
The data set or data flow called through an Activity is giving a Null Pointer Exception. I have tested this in Personal Edition in 7.x and 8.x.
@Vamsi: Please send your implementation details, error screenshots and log files to “contact@onestoppega.com” and we will get back to you with the root cause analysis.
FYI, all the POCs for our articles are done on PE, so we are pretty sure it should work fine for you as well!
Is there any performance improvement over Obj-Browse or a report definition? Can you please explain how?
A data set can be a replacement for Report Definition or Obj-Browse only when we need to fetch all records from a table. We can’t fetch values by passing filter criteria in a Data Set.
Let’s say you run a report definition and loop through the results to execute complex logic; that’s where Data Set and Data Flow make a difference in terms of performance.
A Data Set gives the option to specify a partition key, and child requestors are created for the distinct values of the identified partition keys. When a data set with a partition key is executed, the system automatically creates the child requestors and balances the load across them.
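The idea of splitting work by partition key can be sketched in plain Python. This is only an analogy: worker threads stand in for child requestors, the `region` key and the sample records are made up, and how Pega actually schedules and balances the partitions is not shown here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample records; "region" plays the role of the partition key.
records = [
    {"customer_id": "C-1", "region": "North"},
    {"customer_id": "C-2", "region": "South"},
    {"customer_id": "C-3", "region": "North"},
    {"customer_id": "C-4", "region": "East"},
]


def partition_by(rows, key):
    """Group records by the distinct values of the partition key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def process_partition(partition_key, rows):
    """Stand-in for the per-record work done on one partition."""
    return partition_key, len(rows)


partitions = partition_by(records, "region")

# One worker per partition, so partitions are processed side by side.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    futures = [pool.submit(process_partition, k, rows) for k, rows in partitions.items()]
    processed = dict(future.result() for future in futures)
```

A key with evenly spread values gives each worker a similar share of records, which is why the choice of partition key matters for performance.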
So, I can’t use it when I need to query with a where clause?
Yeah… The whole benefit of Data Set and Data Flow can be seen when you want to play with huge transactional data.
Thanks for the post. Very easy to understand. Can you please post more details on the dataflows
Sure we will post more on Data flow with real-time scenarios @saiNath
Happy Learning from OSP 😊
Hi Team,
Nice information. Can you clarify how to handle exceptions in a data flow?
Let’s say some records fail to commit during the data flow execution. How do we handle this?
Thanks
Thank you so much @Venkat.
Sure we will post an article on exception handling in Data Flow.
Happy learning from OSP 😊
Nice blog. We are expecting more posts on data flow components. Thanks.
Thank you so much. Sure, we will post more on data flows, since Pega has started replacing most of its core features with data flows in recent releases.
Happy Learning from OSP 😊
Nice Article.
One question :
I have a requirement where I need to update 10K existing work object records in the work table, and the data to update is different for each work object.
Can we use data flows for this purpose?
Yes, data flow can be used to update inflight cases.
Let us know if you need any additional help on configuring data flow for this requirement.
Good Explanation, Great Job Guys…..Keep doing it
All the best for your future posts
Thank you so much @Neeraj, for being a well wisher.
Happy Learning from OSP😊
You mentioned this line: “Records when saved through data set operation will not be eligible for declarative trigger operations”. But once the record is in the Pega DB, why isn’t it eligible for declarative processing?
Declarative processing in Pega happens only during Obj-* methods. When we use a data set, it directly performs an RDB-Save, which does not execute any declarative rules.
First and foremost, I appreciate your efforts to describe data flows in such a readable way!
I have a question though:
1) I wanted to filter some records in a data flow using the Filter shape, and I tried to pass the parameter from the activity that calls the data flow.
I was not able to do it.
Do you have any insight into this issue? I am exploring this in 8.4.
Thanks.
Thank you so much for your appreciation.
We don’t think it will work. A data flow triggered from an activity runs as a different requestor, and hence the parameter page will not be accessible.
If you could briefly describe your scenario, we might be able to shed some light on the design.
Thanks for the reply, indeed agree with your comments.
Use case:
Expose a web service which should return retention offers.
Solution:
1) Created a predictive model for churn.
2) Created propositions for churn and loyal.
3) Created a strategy which runs the predictive model and returns suitable propositions based on the churn result.
4) Created a data flow that refers to the strategy.
5) This data flow is called in the service activity of the REST service, which takes the customer ID as the request parameter.
6) Unable to cascade this parameter to the data flow and then to the strategy.
Real-time containers would be the ideal solution, but I tried a traditional approach and was not able to solve it.
Thanks
In that case, have a data flow with the Abstract page as a source. Set the Customer ID into the abstract page and invoke the strategy from the data flow. In that case, we believe Strategy should run on the primary page (abstract page) which has the customer ID value.
Let us know if that works.
Hi,
Nice to see the topic on Data flow.
We are uploading a driving license using the pyAttachContent action and setting the whole data to a property, and then we use a Parse Delimited rule to convert it to a value list and then to a page list. With this, we hardcode the mapping of values to the UI. (We have implemented this step.)
Here we need to convert the page list data to a CSV file and create a text analyzer rule to upload the CSV file, read the data, and map it to Pega properties like Name, Issue date, DOB, Expiry date, etc. (Need assistance to develop this.)
Thanks
Luqmane
Nice tutorial, by the way. Whether it is small or big doesn’t matter, you are doing a great job.
Thank you so much @Kishore
Happy Learning from OSP 🙂
Such a great job OSP, kudos for your articles. They are very easy to digest; I couldn’t understand much from PDN, but I could from here.
I have a funny question: how do you understand the concepts at such a low level, when I can’t see the same things explained anywhere in the Pega knowledge base or in Pega Academy courses? How can I also learn to grasp things like you do :)?
Thank you so much @Durga
It’s simple. We sit, analyze and work things out. It’s a bit of a time-consuming process, but we always do it.
Good explanation, but a few things are not explained, such as when to use Abstract in the source and destination, or when to use a data flow versus an activity. Please update this blog with these points. Then it will become a full package, with no need to visit PDN for those explanations.
Thank you so much @Arun
Sure we’ll keep the post updated with your points. Happy Learning from OSP 🙂
Very rare and useful information, thanks.
In the source data set, how can we define a where clause for matching rows in the destination data set? For example, the pxInsIndexedKey column in my first table holds the value found in the pzInKey column of the second table.
Such great simple explanation. Thank you very much.
You said this feature is rarely used. I’m thinking it could be used to send/receive Kafka messages (based on the options available on the Data Set). Is this correct? If so, would it not be much easier than using a Kafka connector?
Hi.
Could you please explain how to add various shapes, like decision, utility and subprocess shapes, in a defined case type?
How do we reprocess failed records of these data flows (let’s say in a real-time run)?
Hi OSP Team,
How do we “PAUSE” a data flow instead of “STOP” it? We are seeing an issue where we Stopped the Data flow and lost messages which got posted during the stopped time frame. In the data flow landing page, we can just see ‘Stop’ which completely stops the data flow.
Appreciate your help on this!
Hi,
Can we automate this transfer? That is, here I need to click on the data flow to execute it, or make an explicit call via an activity. Is there any way that, when I add a record to the source table, it is automatically inserted/updated in the target table?
Real-time data flows are executed automatically when your source has a message to process, for example with a Stream data set or a Kafka data set.
But if your source is a physical table, then you will have to invoke the data flow.
Hi,
I have one question. Here we see one source data set and one target data set in the data flow. Is it possible to store into multiple target data sets via a data flow?
That is, I’m using a stream data set and a real-time data flow, and I’m going to expose the data flow through Service REST, where the user provides a complex request JSON with many details. I want to store all these details in multiple tables.
Can a real-time data flow and stream data set make this possible, or do I need to prepare a Service REST and a service activity to fulfil it?
You can still handle it using Data Flow.
Data flow can have multiple destinations. Just see the bottom of your destination shape and you will find a link to add a new destination.
Another point: to optimize the fetch process from a Data Set, I think the Partition Key always plays a great role as well. Nice detailed blog.