What is Data Flow? How to use Data Flow in PEGA Platform?

Summary:

This article discusses a rarely used rule type in PEGA, “Data Flow”, covering its configuration and usage.

Generally in PEGA it’s quite easy to guess the purpose of a rule just by seeing its name. A Data Transform, as the name suggests, is used to transform data. Similarly, a Data Flow is used to define the flow of data from a source to a destination.

What is Data Flow?

Data flow is a rule type in PEGA. It is designed for scenarios where the volume of data is huge and performance is critical. Data flows are widely used in the PEGA Marketing framework, where customer records can run into the millions. The data flow rule has many built-in configurations to optimize performance during execution.

How to configure Data Flow?

Data flow can be found under the Data Model category in Records Explorer.

Similar to other rules, the new rule form prompts for Name, Class, Ruleset and Ruleset Version.

A data flow looks similar to a process flow in terms of configuration. It has a number of built-in shapes that are used to build the flow.

Different shapes and their usage in a Data Flow

Every data flow must have a source and a destination. There can be only one source shape, whereas there can be more than one destination shape.

Source


The source of a data flow can be, for example, a data set, a report definition, another data flow, or an abstract input. Let’s explore each source in detail in later posts.

Compose

The Compose shape is similar to the Page-Copy method in an activity: it browses a set of records from a secondary source based on the match criteria and copies them to the property defined in the Compose shape.
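
Purely as an outside-Pega analogy (the Customer and Order classes below are hypothetical stand-ins, not Pega rules), the Compose behaviour can be sketched like this:

```java
import java.util.List;
import java.util.stream.Collectors;

// Rough analogy of the Compose shape: for each primary record, the
// secondary records that satisfy the match criteria are copied into
// an embedded list property.
class ComposeSketch {
    record Order(String customerId, double amount) {}

    static class Customer {
        final String id;
        List<Order> orders; // the target property configured in the Compose shape
        Customer(String id) { this.id = id; }
    }

    static void compose(List<Customer> primary, List<Order> secondary) {
        for (Customer c : primary) {
            c.orders = secondary.stream()
                    .filter(o -> o.customerId().equals(c.id)) // match criteria
                    .collect(Collectors.toList());
        }
    }
}
```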

Merge

The Merge shape is similar to the Page-Merge-Into method in an activity. It merges details between two pages of the same class, with an option to specify which side takes precedence in case of conflicts.
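
In plain Java terms (maps standing in for clipboard pages; this is an analogy, not Pega API code), the precedence option behaves roughly like this:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging two pages of the same class, modelled as property maps,
// with an explicit precedence choice for conflicting keys.
class MergeSketch {
    static Map<String, Object> merge(Map<String, Object> target,
                                     Map<String, Object> source,
                                     boolean sourceWins) {
        Map<String, Object> result = new HashMap<>(target);
        if (sourceWins) {
            result.putAll(source);               // source values overwrite conflicts
        } else {
            source.forEach(result::putIfAbsent); // existing target values are kept
        }
        return result;
    }
}
```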

Data Transform

A data transform can be invoked from a data flow using the Data Transform shape. An activity cannot be invoked directly from a data flow; to invoke one, call a function that wraps the activity from inside the data transform.

Convert

The Convert shape is similar to the Page-Change-Class method in an activity. It converts a page of one class to another, with an option for property mapping in the configuration panel.
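
Conceptually (the ExternalCustomer and InternalCustomer record types below are hypothetical), the Convert shape is a class change plus property mapping:

```java
// Analogy of the Convert shape: a page of the external class is rebuilt
// as the internal class. Both record types here are hypothetical.
class ConvertSketch {
    record ExternalCustomer(String custId, String fullName, String city) {}
    record InternalCustomer(String customerId, String name, String city) {}

    static InternalCustomer convert(ExternalCustomer ext) {
        // Equivalent of the property mapping in the Convert shape's panel.
        return new InternalCustomer(ext.custId(), ext.fullName(), ext.city());
    }
}
```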

Filter

The Filter shape can be used to conditionally skip the processing of records.
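
In effect, it applies a when-style condition per record, much like this sketch with a hypothetical condition:

```java
import java.util.List;
import java.util.stream.Collectors;

// Analogy of the Filter shape: records failing the condition are skipped,
// and only the remaining records continue to the next shape.
class FilterSketch {
    static List<String> keepActive(List<String> customerStatuses) {
        return customerStatuses.stream()
                .filter("Active"::equals) // hypothetical filter condition
                .collect(Collectors.toList());
    }
}
```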

Text Analyzer

The Text Analyzer shape is mostly used with the Decisioning package along with the NLP ruleset. It delivers text analytics capabilities to users through data flows. We will discuss the Text Analyzer shape in detail in later posts.

Strategy

Strategy is a Decisioning component in PEGA; we will discuss it in detail in later posts.

Destination


The destination is where PEGA writes the data to. It can be, for example, a data set, another data flow, an activity, or a case.

What is Data Set?

Data set is a rule type in PEGA that represents a source of records, such as a database table, a decision data store, a stream, or a file. It also lets us view the stored information at design time.

For demo purposes, a data set is created to represent the Customer data type. We can pull the list of records in the Customer data type just by running the data set stand-alone.

Below is an example run of the data set with the operation set to “Browse”. Other operations are not covered in this post and are left for exploration.

A Data set can be executed from an activity using the DataSet-Execute method.
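
To make the Browse operation concrete: against a database-table data set it conceptually behaves like the JDBC loop below. This is only an illustration of the semantics: the connection URL, credentials, and table name are hypothetical placeholders, and in PEGA the operation is configured on an activity step, not hand-written in Java.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Conceptual JDBC equivalent of a Browse run on a database-table data set.
// The URL, credentials, and table name are hypothetical placeholders.
class BrowseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/pega", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT customer_id, name FROM customer")) {
            while (rs.next()) { // one record (clipboard page) per row
                System.out.println(rs.getString("customer_id") + " " + rs.getString("name"));
            }
        }
    }
}
```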

Let’s understand the usage of data flows with a real-world scenario.

Business Scenario

Consider a scenario where records from an external database table need to be copied to a PEGA database table. Each record should be updated or inserted based on whether it already exists in the destination. The external table might have millions of records, and it’s up to us to define the best strategy to implement this.

There are many ways to implement this; a few approaches are listed below:

  • Configure ETL jobs to copy the data on a daily basis.
  • Write an activity to browse the records and logic to insert them into the PEGA DB.
  • Configure a stored procedure to do the same.

Let’s discuss how best we can implement this using a data flow. To start, let’s break the implementation down into small steps:

  1. Establish a connection to the external database table
  2. Configure logic to browse records from the external table
  3. Write logic to update the record if it exists, else insert it
  4. Save and commit the changes to the PEGA DB table

Establish a connection to the external database table

Since we are working on a Personal Edition, we have created an internal data type with 100,000 records, which will be copied to another table of the same structure. Steps 2, 3 and 4 can be easily implemented with a simple data flow configuration: a source data set, a Convert shape, and a destination data set.


It is mandatory to create the data flow in the class context of the source.

We have loaded 100,000 sample records into our customer table (the source) to explore data flow usage and to compare its performance against an activity-based implementation.

Configure logic to browse records from the external table

Browsing is handled by the source data set. When the data flow is executed, the configured data set browses through the list of records, and the flow is executed for each record. The Convert shape is used to change the page class from the External Customer context to the Internal Customer context. Any additional properties required can also be mapped in the Convert shape.
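
Putting these stages together, the per-record pipeline that this data flow runs is conceptually a browse, convert, save loop, as in the sketch below (all types and methods are hypothetical stand-ins, not Pega APIs):

```java
import java.util.List;

// Conceptual per-record pipeline of the configured data flow:
// source data set browse -> Convert shape -> destination data set save.
class PipelineSketch {
    record ExternalCustomer(String custId, String fullName) {}
    record InternalCustomer(String customerId, String name) {}

    static void run(List<ExternalCustomer> sourceRecords) {
        for (ExternalCustomer ext : sourceRecords) {                    // browse
            InternalCustomer in =
                    new InternalCustomer(ext.custId(), ext.fullName()); // convert
            save(in);                                                   // destination
        }
    }

    static void save(InternalCustomer c) {
        // Stand-in for the destination data set's save operation.
        System.out.println("Saved " + c.customerId());
    }
}
```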

Write logic to update the record if it exists, else insert it

When implemented using an activity, we have to manually add logic to check whether the record exists in the DB. It becomes much simpler with a data set as the destination: in the configuration panel of the destination data set, the option Insert new and overwrite existing records automatically takes care of the update/insert.
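
For comparison, the manual activity-style upsert logic that this option replaces would look roughly like the JDBC sketch below (table and column names are hypothetical placeholders; on PostgreSQL, the same effect could also be achieved with a single INSERT ... ON CONFLICT statement):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Manual "update if exists, else insert" logic that the destination data
// set's "Insert new and overwrite existing records" option makes unnecessary.
// Table and column names are hypothetical placeholders.
class UpsertSketch {
    static void upsert(Connection con, String customerId, String name) throws Exception {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT 1 FROM customer WHERE customer_id = ?")) {
            check.setString(1, customerId);
            try (ResultSet rs = check.executeQuery()) {
                String sql = rs.next()
                        ? "UPDATE customer SET name = ? WHERE customer_id = ?"
                        : "INSERT INTO customer (name, customer_id) VALUES (?, ?)";
                try (PreparedStatement write = con.prepareStatement(sql)) {
                    write.setString(1, name);
                    write.setString(2, customerId);
                    write.executeUpdate();
                }
            }
        }
    }
}
```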

Save and commit the changes to the PEGA DB table

When executed, the destination data set saves the changes directly to the DB; no explicit commit is needed. This is similar to the RDB methods in PEGA. Note that records saved through a data set operation are not eligible for declarative trigger processing, and it is not possible to roll back the changes once saved.

Executing the configured data flow

To execute a data flow, an active node must be configured for data flow processing.

From Designer Studio, navigate to:

Decisioning -> Infrastructure -> Services -> Data Flow

If no node is listed, click the “Add Node” button to add one from the existing list. Make sure the node status is Normal after adding it; if the node is not enabled, the data flow will fail during execution.

We can run a data flow stand-alone using the Run option from the Actions menu, or from an activity using the “DataFlow-Execute” method.

Using the run ID, we can start, stop, or get the progress of a running data flow. Each data flow run is created as an instance of the work class “Pega-DM-DDF-Work”, and the run ID provided here is used as its primary key.

We can view the run statistics by navigating to:

Decisioning -> Decisions -> Data Flow -> Batch Processing


Open the data flow execution work object to see the detailed statistics of its execution.

It took just about 40 seconds to process 100,000 records, at an average speed of 468 µs/record (2,136 records/s).
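
As a quick sanity check on those figures: one record per 468 µs works out to 1,000,000 / 468 ≈ 2,136 records per second, so the two reported numbers are consistent with each other.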

We can further optimize the data flow’s performance by analyzing the component statistics to see which component takes the most time. In our scenario, the destination data set took the most time to execute, which can be improved by performing a vacuum. Please refer to our previous post for detailed information on the vacuum process.

Data flows and data sets can be used to handle large sets of data efficiently.

Written by OSP Editorial Team

Join the discussion

Feel free to post your questions about this topic here, if any. We will definitely get back to you ASAP!
If you have any off-topic questions, let's discuss them at the OSP Forum.

46 comments
  • The data set or data flow called through an activity is giving a Null Pointer Exception. I have tested this in Personal Edition on 7.x and 8.x.

    • @Vamsi: Please send your implementation details, error screenshots and log files to “contact@onestoppega.com” and we will get back to you with the root cause analysis.

      FYI, all our POCs for this article are from PE, and we are pretty sure it should work fine for you as well!

    • Data set can be a replacement for a report definition or Obj-Browse only when we need to fetch all records from a table. We can’t fetch values by passing filter criteria in a data set.

      Let’s say you run a report definition and loop through the results to execute complex logic; that’s where data sets and data flows make a difference in terms of performance.

      Data Set gives the option to specify a partition key, and a child requestor gets created for each distinct value of the identified partition key. When a data set with a partition key is executed, the system automatically creates the child requestors and balances the load across them.

  • Hi Team,

    Nice information. Can you clarify how to handle exceptions in the data flow?
    Let's say some records fail to commit during the data flow execution; how do we handle this?

    Thanks

  • Nice Article.

    One question:

    I have a requirement where I need to update 10K existing work object records in the work table, where the data to update is different for each work object.

    Can we use data flows for this purpose?

  • You mentioned the line “records saved through a data set operation are not eligible for declarative trigger processing”, but once the record is in the Pega DB, why isn’t it eligible for declarative processing?

  • First and foremost, I appreciate your efforts to describe data flows in a quite readable way!

    I have a question though:
    1) I wanted to filter some records in a data flow using the Filter shape, and I tried to pass the parameter from the calling activity of the data flow, but I was not able to do it.
    Do you have any insight into this issue? I am exploring in 8.4.

    Thanks.

    • Thank you so much for your appreciation.

      We don’t think it will work. A data flow triggered from an activity runs as a different requestor, and hence the parameter page will not be accessible.

      If you could briefly describe your scenario, we might shed some light on the design.

      • Thanks for the reply; indeed, I agree with your comments.
        Use case:
        Expose a web service which should return retention offers.

        Solution:
        1) Created a predictive model for churn.
        2) Created propositions for churn and loyal.
        3) Created a strategy which runs the predictive model and returns suitable propositions based on the churn result.
        4) Created a data flow to refer to the strategy.
        5) This data flow is called in the service activity of the REST service, which takes the customer ID as a request parameter.
        6) Unable to cascade this parameter to the data flow and then to the strategy.

        Real-time containers would be the ideal solution, but I tried a traditional approach and was not able to solve it.

        Thanks

        • In that case, have a data flow with an Abstract page as the source. Set the customer ID on the abstract page and invoke the strategy from the data flow. That way, we believe the strategy should run on the primary page (the abstract page), which has the customer ID value.

          Let us know if that works.

  • Hi,

    Nice to see the topic on Data flow.
    We are uploading a driving license using the pyAttachContent action, setting that whole data to a property, and then using a Parse Delimited rule to convert it to a value list and then to a page list. Using this, we have hardcoded the mapping of values to the UI. (We have implemented this step.)

    Now we need to convert the page list data to a CSV file and create a Text Analyzer rule to upload the CSV file, read the data, and map it to Pega properties like Name, Issue Date, DOB, Expiry Date, etc. (We need assistance to develop this.)

    Thanks
    Luqmane

  • Such a great job, OSP, kudos for your articles. Everything is very easily digestible; I couldn’t understand much from the PDN, but I could from here.
    I have a funny question: how do you guys understand the concepts at such a low level, when I can’t see the same things explained anywhere in the Pega knowledge base or in Pega Academy courses? How can I also learn to grasp things the way you do? :)

  • Good explanation, but a few things are not explained, like when to use Abstract as the source or destination, or when to choose a data flow over an activity. Please update the blog with these points; then it will be a full package and there will be no need to visit the PDN for those explanations.

  • In the source data set, how can we define a where clause for matching a row in the destination data set? For example, in my first table the pxInsIndexedKey column has the value in the pzInKey column of the 2nd table.

  • Such a great, simple explanation. Thank you very much.
    You said this feature is rarely used. I’m thinking it could be used to send/receive Kafka messages (based on the option available on Data Set). Is this correct? If so, wouldn’t it be much easier than using a Kafka connector?

  • Hi,
    could you please explain how to add various shapes in a defined case type, like Decision, Utility, and Subprocess shapes?

  • Hi OSP Team,

    How do we “PAUSE” a data flow instead of “STOP” it? We are seeing an issue where we stopped the data flow and lost messages that were posted during the stopped time frame. On the data flow landing page, we can only see ‘Stop’, which completely stops the data flow.

    Appreciate your help on this!

  • Hi,
    Can we automate this transfer? I mean, here I need to click on the data flow to execute it, or make an explicit call via an activity. Is there any way that when I add a record to the source table, it automatically inserts/updates records in the target table?

    • Real-time data flows are executed automatically when the source has a message to process, for example a Stream data set or a Kafka data set.

      But if your source reads from a physical table, then you will have to invoke the data flow yourself.

  • Hi,
    I have one question: here we see one source data set and one target data set in the data flow. Is it possible to store into multiple target data sets via a data flow?
    That is, I’m using a Stream data set and a real-time data flow, and I’m going to expose the data flow with a Service REST where the user provides a complex request JSON with many details, and I want to store all those details in multiple tables.
    Can a real-time data flow and a Stream data set make this possible, or do I need to prepare a Service REST with a service rule activity to fulfil it?

    • You can still handle it using a data flow.
      A data flow can have multiple destinations: just look at the bottom of your destination shape and you will find a link to add a new destination.

  • Another point – to optimize the fetch process from a data set, I think the partition key always plays a great role as well. Nice, detailed blog.