What is Data Flow? How to use Data Flow in PEGA Platform?

Summary:

This article discusses a rarely used rule type in PEGA, “Data Flow”, covering its configuration and usage.

Generally in PEGA it’s quite easy to guess the purpose of a rule just by seeing its name. A Data Transform, as the name suggests, is used to transform data. Similarly, a Data Flow is used to define the flow of data from a source to a destination.

What is Data Flow?

Data flow is a rule type in PEGA. It is designed for scenarios where the volume of data is huge and performance is critical. Data flows are widely used in the PEGA Marketing framework, where customer records can run into the millions. The data flow rule has many built-in configurations to optimize performance during execution.

How to configure Data Flow?

Data flow can be found under the Data Model category in Records Explorer.

Similar to other rules, the new rule form prompts for Name, Class, Ruleset and Ruleset Version.

A data flow looks similar to a process flow in terms of configuration. It has a number of built-in shapes that are used to build the flow.

Different shapes and their usage in a Data Flow

Every data flow must have a source and a destination. There can be only one source shape, whereas there can be more than one destination shape.

Source


The source of a data flow can be, for example, a data set, a report definition, another data flow, or an abstract input. Let’s explore each source in detail in later posts.

Compose

The Compose shape is similar to the Page-Copy method in an activity: it browses a set of records from a secondary source based on the match criteria and copies them to the property defined in the Compose shape.
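
Purely as an outside-Pega analogy (the Customer and Order classes below are hypothetical stand-ins, not Pega rules), the Compose behaviour can be sketched like this:

```java
import java.util.List;
import java.util.stream.Collectors;

// Rough analogy of the Compose shape: for each primary record, the
// secondary records that satisfy the match criteria are copied into
// an embedded list property.
class ComposeSketch {
    record Order(String customerId, double amount) {}

    static class Customer {
        final String id;
        List<Order> orders; // the target property configured in the Compose shape
        Customer(String id) { this.id = id; }
    }

    static void compose(List<Customer> primary, List<Order> secondary) {
        for (Customer c : primary) {
            c.orders = secondary.stream()
                    .filter(o -> o.customerId().equals(c.id)) // match criteria
                    .collect(Collectors.toList());
        }
    }
}
```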

Merge

The Merge shape is similar to the Page-Merge-Into method in an activity. It merges details between two pages of the same class, with an option to specify which side takes precedence in case of conflicts.
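
In plain Java terms (maps standing in for clipboard pages; this is an analogy, not Pega API code), the precedence option behaves roughly like this:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging two pages of the same class, modelled as property maps,
// with an explicit precedence choice for conflicting keys.
class MergeSketch {
    static Map<String, Object> merge(Map<String, Object> target,
                                     Map<String, Object> source,
                                     boolean sourceWins) {
        Map<String, Object> result = new HashMap<>(target);
        if (sourceWins) {
            result.putAll(source);               // source values overwrite conflicts
        } else {
            source.forEach(result::putIfAbsent); // existing target values are kept
        }
        return result;
    }
}
```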

Data Transform

A data transform can be invoked from a data flow using the Data Transform shape. An activity cannot be invoked directly from a data flow; to invoke one, call a function that wraps the activity from inside the data transform.

Convert

The Convert shape is similar to the Page-Change-Class method in an activity. It converts a page of one class to another, with an option for property mapping in the configuration panel.
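
Conceptually (the ExternalCustomer and InternalCustomer record types below are hypothetical), the Convert shape is a class change plus property mapping:

```java
// Analogy of the Convert shape: a page of the external class is rebuilt
// as the internal class. Both record types here are hypothetical.
class ConvertSketch {
    record ExternalCustomer(String custId, String fullName, String city) {}
    record InternalCustomer(String customerId, String name, String city) {}

    static InternalCustomer convert(ExternalCustomer ext) {
        // Equivalent of the property mapping in the Convert shape's panel.
        return new InternalCustomer(ext.custId(), ext.fullName(), ext.city());
    }
}
```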

Filter

The Filter shape can be used to conditionally skip the processing of records.
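
In effect, it applies a when-style condition per record, much like this sketch with a hypothetical condition:

```java
import java.util.List;
import java.util.stream.Collectors;

// Analogy of the Filter shape: records failing the condition are skipped,
// and only the remaining records continue to the next shape.
class FilterSketch {
    static List<String> keepActive(List<String> customerStatuses) {
        return customerStatuses.stream()
                .filter("Active"::equals) // hypothetical filter condition
                .collect(Collectors.toList());
    }
}
```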

Text Analyzer

The Text Analyzer shape is mostly used with the Decisioning package along with the NLP ruleset. It delivers text analytics capabilities to users through data flows. We will discuss the Text Analyzer shape in detail in later posts.

Strategy

Strategy is a Decisioning component in PEGA; we will discuss it in detail in later posts.

Destination


The destination is where PEGA writes the data to. It can be, for example, a data set, another data flow, an activity, or a case.

What is Data Set?

Data set is a rule type in PEGA that represents a source of records, such as a database table, a decision data store, a stream, or a file. It also lets us view the stored information at design time.

For demo purposes, a data set is created to represent the Customer data type. We can pull the list of records in the Customer data type just by running the data set stand-alone.

Below is an example run of the data set with the operation set to “Browse”. Other operations are not covered in this post and are left for exploration.

A Data set can be executed from an activity using the DataSet-Execute method.
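
To make the Browse operation concrete: against a database-table data set it conceptually behaves like the JDBC loop below. This is only an illustration of the semantics: the connection URL, credentials, and table name are hypothetical placeholders, and in PEGA the operation is configured on an activity step, not hand-written in Java.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Conceptual JDBC equivalent of a Browse run on a database-table data set.
// The URL, credentials, and table name are hypothetical placeholders.
class BrowseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/pega", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT customer_id, name FROM customer")) {
            while (rs.next()) { // one record (clipboard page) per row
                System.out.println(rs.getString("customer_id") + " " + rs.getString("name"));
            }
        }
    }
}
```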

Let’s understand the usage of data flows with a real-world scenario.

Business Scenario

Consider a scenario where records from an external database table need to be copied to a PEGA database table. Each record should be updated or inserted based on whether it already exists in the destination. The external table might have millions of records, and it’s up to us to define the best strategy to implement this.

There are many ways to implement this; a few approaches are listed below:

  • Configure ETL jobs to copy the data on a daily basis.
  • Write an activity to browse the records and logic to insert them into the PEGA DB.
  • Configure a stored procedure to do the same.

Let’s discuss how best we can implement this using a data flow. To start, let’s break the implementation down into small steps:

  1. Establish a connection to the external database table
  2. Configure logic to browse records from the external table
  3. Write logic to update the record if it exists, else insert it
  4. Save and commit the changes to the PEGA DB table

Establish a connection to the external database table

Since we are working on a Personal Edition, we have created an internal data type with 100,000 records, which will be copied to another table of the same structure. Steps 2, 3 and 4 can be easily implemented with a simple data flow configuration: a source data set, a Convert shape, and a destination data set.


It is mandatory to create the data flow in the class context of the source.

We have loaded 100,000 sample records into our customer table (the source) to explore data flow usage and to compare its performance against an activity-based implementation.

Configure logic to browse records from the external table

Browsing is handled by the source data set. When the data flow is executed, the configured data set browses through the list of records, and the flow is executed for each record. The Convert shape is used to change the page class from the External Customer context to the Internal Customer context. Any additional properties required can also be mapped in the Convert shape.
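
Putting these stages together, the per-record pipeline that this data flow runs is conceptually a browse, convert, save loop, as in the sketch below (all types and methods are hypothetical stand-ins, not Pega APIs):

```java
import java.util.List;

// Conceptual per-record pipeline of the configured data flow:
// source data set browse -> Convert shape -> destination data set save.
class PipelineSketch {
    record ExternalCustomer(String custId, String fullName) {}
    record InternalCustomer(String customerId, String name) {}

    static void run(List<ExternalCustomer> sourceRecords) {
        for (ExternalCustomer ext : sourceRecords) {                    // browse
            InternalCustomer in =
                    new InternalCustomer(ext.custId(), ext.fullName()); // convert
            save(in);                                                   // destination
        }
    }

    static void save(InternalCustomer c) {
        // Stand-in for the destination data set's save operation.
        System.out.println("Saved " + c.customerId());
    }
}
```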

Write logic to update the record if it exists, else insert it

When implemented using an activity, we have to manually add logic to check whether the record exists in the DB. It becomes much simpler with a data set as the destination: in the configuration panel of the destination data set, the option Insert new and overwrite existing records automatically takes care of the update/insert.
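
For comparison, the manual activity-style upsert logic that this option replaces would look roughly like the JDBC sketch below (table and column names are hypothetical placeholders; on PostgreSQL, the same effect could also be achieved with a single INSERT ... ON CONFLICT statement):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Manual "update if exists, else insert" logic that the destination data
// set's "Insert new and overwrite existing records" option makes unnecessary.
// Table and column names are hypothetical placeholders.
class UpsertSketch {
    static void upsert(Connection con, String customerId, String name) throws Exception {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT 1 FROM customer WHERE customer_id = ?")) {
            check.setString(1, customerId);
            try (ResultSet rs = check.executeQuery()) {
                String sql = rs.next()
                        ? "UPDATE customer SET name = ? WHERE customer_id = ?"
                        : "INSERT INTO customer (name, customer_id) VALUES (?, ?)";
                try (PreparedStatement write = con.prepareStatement(sql)) {
                    write.setString(1, name);
                    write.setString(2, customerId);
                    write.executeUpdate();
                }
            }
        }
    }
}
```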

Save and commit the changes to the PEGA DB table

When executed, the destination data set saves the changes directly to the DB; no explicit commit is needed. This is similar to the RDB methods in PEGA. Note that records saved through a data set operation are not eligible for declarative trigger processing, and it is not possible to roll back the changes once saved.

Executing the configured data flow

To execute a data flow, an active node must be configured for data flow processing.

From Designer Studio, navigate to:

Decisioning -> Infrastructure -> Services -> Data Flow

If no node is listed, click the “Add Node” button to add one from the existing list. Make sure the node status is Normal after adding it; if the node is not enabled, the data flow will fail during execution.

We can run a data flow stand-alone using the Run option from the Actions menu, or from an activity using the “DataFlow-Execute” method.

Using the run ID, we can start, stop, or get the progress of a running data flow. Each data flow run is created as an instance of the work class “Pega-DM-DDF-Work”, and the run ID provided here is used as its primary key.

We can view the run statistics by navigating to:

Decisioning -> Decisions -> Data Flow -> Batch Processing


Open the data flow execution work object to see the detailed statistics of its execution.

It took just about 40 seconds to process 100,000 records, at an average speed of 468 µs/record (2,136 records/s).
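
As a quick sanity check on those figures: one record per 468 µs works out to 1,000,000 / 468 ≈ 2,136 records per second, so the two reported numbers are consistent with each other.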

We can further optimize the data flow’s performance by analyzing the component statistics to see which component takes the most time. In our scenario, the destination data set took the most time to execute, which can be improved by performing a vacuum. Please refer to our previous post for detailed information on the vacuum process.

Data flows and data sets can be used to handle large sets of data efficiently.

Written by OSP Editorial Team

Join the discussion

Feel free to post your questions about this topic here, if any. We will definitely get back to you ASAP!
If you have any off-topic questions, let's discuss them at the OSP Forum.

46 comments
  • The data set or data flow called through an activity is giving a Null Pointer Exception. I have tested this in Personal Edition on 7.x and 8.x.

    • @Vamsi: Please send your implementation details, error screenshots and log files to “contact@onestoppega.com” and we will get back to you with the root cause analysis.

      FYI, all our POCs for this article are from PE, and we are pretty sure it should work fine for you as well!

    • Data set can be a replacement for a report definition or Obj-Browse only when we need to fetch all records from a table. We can’t fetch values by passing filter criteria in a data set.

      Let’s say you run a report definition and loop through the results to execute complex logic; that’s where data sets and data flows make a difference in terms of performance.

      Data Set gives the option to specify a partition key, and a child requestor gets created for each distinct value of the identified partition key. When a data set with a partition key is executed, the system automatically creates the child requestors and balances the load across them.

  • Hi Team,

    Nice information. Can you clarify how to handle exceptions in the data flow?
    Let's say some records fail to commit during the data flow execution; how do we handle this?

    Thanks

  • Nice Article.

    One question:

    I have a requirement where I need to update 10K existing work object records in the work table, where the data to update is different for each work object.

    Can we use data flows for this purpose?

  • You mentioned the line “records saved through a data set operation are not eligible for declarative trigger processing”, but once the record is in the Pega DB, why isn’t it eligible for declarative processing?

  • First and foremost, I appreciate your efforts to describe data flows in a quite readable way!

    I have a question though:
    1) I wanted to filter some records in a data flow using the Filter shape, and I tried to pass the parameter from the calling activity of the data flow, but I was not able to do it.
    Do you have any insight into this issue? I am exploring in 8.4.

    Thanks.

    • Thank you so much for your appreciation.

      We don’t think it will work. A data flow triggered from an activity runs as a different requestor, and hence the parameter page will not be accessible.

      If you could briefly describe your scenario, we might shed some light on the design.

      • Thanks for the reply; indeed, I agree with your comments.
        Use case:
        Expose a web service which should return retention offers.

        Solution:
        1) Created a predictive model for churn.
        2) Created propositions for churn and loyal.
        3) Created a strategy which runs the predictive model and returns suitable propositions based on the churn result.
        4) Created a data flow to refer to the strategy.
        5) This data flow is called in the service activity of the REST service, which takes the customer ID as a request parameter.
        6) Unable to cascade this parameter to the data flow and then to the strategy.

        Real-time containers would be the ideal solution, but I tried a traditional approach and was not able to solve it.

        Thanks

        • In that case, have a data flow with an Abstract page as the source. Set the customer ID on the abstract page and invoke the strategy from the data flow. That way, we believe the strategy should run on the primary page (the abstract page), which has the customer ID value.

          Let us know if that works.

  • Hi,

    Nice to see the topic on Data flow.
    We are uploading a driving license using the pyAttachContent action, setting that whole data to a property, and then using a Parse Delimited rule to convert it to a value list and then to a page list. Using this, we have hardcoded the mapping of values to the UI. (We have implemented this step.)

    Now we need to convert the page list data to a CSV file and create a Text Analyzer rule to upload the CSV file, read the data, and map it to Pega properties like Name, Issue Date, DOB, Expiry Date, etc. (We need assistance to develop this.)

    Thanks
    Luqmane

  • Such a great job, OSP, kudos for your articles. Everything is very easily digestible; I couldn’t understand much from the PDN, but I could from here.
    I have a funny question: how do you guys understand the concepts at such a low level, when I can’t see the same things explained anywhere in the Pega knowledge base or in Pega Academy courses? How can I also learn to grasp things the way you do? :)

  • Good explanation, but a few things are not explained, like when to use Abstract as the source or destination, or when to choose a data flow over an activity. Please update the blog with these points; then it will be a full package and there will be no need to visit the PDN for those explanations.

  • In the source data set, how can we define a where clause for matching a row in the destination data set? For example, in my first table the pxInsIndexedKey column has the value in the pzInKey column of the 2nd table.

  • Such a great, simple explanation. Thank you very much.
    You said this feature is rarely used. I’m thinking it could be used to send/receive Kafka messages (based on the option available on Data Set). Is this correct? If so, wouldn’t it be much easier than using a Kafka connector?

  • Hi,
    could you please explain how to add various shapes in a defined case type, like Decision, Utility, and Subprocess shapes?

  • Hi OSP Team,

    How do we “PAUSE” a data flow instead of “STOP” it? We are seeing an issue where we stopped the data flow and lost messages that were posted during the stopped time frame. On the data flow landing page, we can only see ‘Stop’, which completely stops the data flow.

    Appreciate your help on this!

  • Hi,
    Can we automate this transfer? I mean, here I need to click on the data flow to execute it, or make an explicit call via an activity. Is there any way that when I add a record to the source table, it automatically inserts/updates records in the target table?

    • Real-time data flows are executed automatically when the source has a message to process, for example a Stream data set or a Kafka data set.

      But if your source reads from a physical table, then you will have to invoke the data flow yourself.

  • Hi,
    I have one question: here we see one source data set and one target data set in the data flow. Is it possible to store into multiple target data sets via a data flow?
    That is, I’m using a Stream data set and a real-time data flow, and I’m going to expose the data flow with a Service REST where the user provides a complex request JSON with many details, and I want to store all those details in multiple tables.
    Can a real-time data flow and a Stream data set make this possible, or do I need to prepare a Service REST with a service rule activity to fulfil it?

    • You can still handle it using a data flow.
      A data flow can have multiple destinations: just look at the bottom of your destination shape and you will find a link to add a new destination.

  • Another point – to optimize the fetch process from a data set, I think the partition key always plays a great role as well. Nice, detailed blog.