Skip to Content
Technical Articles
Author's profile photo Tim Huse

Data Cleansing and Data Matching with SAP HANA – Smart Data Quality

Data%20Cleansing%20and%20Data%20Matching%20with%20SAP%20HANA%20%u2013%20Smart%20Data%20Quality%20%28Source%3A%20SAP%29

Data Cleansing and Data Matching with SAP HANA – Smart Data Quality (Source: SAP)

Introduction

Today, companies leverage data from a variety of internal and external data sources. In order to integrate these successfully, the data must not only fit semantically, but also have a common key in order to be able to merge them in joins.

This endeavor often fails because the data contains different data qualities (cleansing problem) or because there is no common key, especially when external data sources are involved (matching problem).

SAP HANA Smart Data Quality (integrated with SAP HANA) provides a high-performance, rule-based solution to cleanse and merge the data (e.g. address data) even if no common key exists or to identify duplicates in the data sources. In this blogpost, the data cleansing and data matching functionalities of SAP HANA Smart Data Quality are demonstrated using person data and company data.

The blogpost is structured as follows: First, an example is presented, this includes two data sets with person data and two data sets with company data. Then, data cleansing is introduced and demonstrated using the data sets as examples. In the next step, data matching is shown and demonstrated on the examples. Finally, a conclusion is drawn.

Please note: The following information is presented from my point of view and does not reflect the view of SAP. All data shown is generated sample data.

Prerequisite

To implement the use case shown, a HANA on-premise instance is required. Currently, the matching and cleansing functionality is not available in HANA Cloud. For the example, a HANA database in version 2.0 SPS05 was used.

SAP HANA Smart Data Integration (SDI for short) and Smart Data Quality (SDQ for short) are the data integration and data quality capabilities of the HANA platform. These are embedded in SAP HANA and can be used via an intuitive UI using so-called SDI flow graphs. More information can be found in thisblog post. To use SDI and SDQ, the script server of the HANA database must be activated (see here).

Basic knowledge of building HANA SDI flowgraphs (see here) and of HANA 2.0 (see here) is advantageous for implementing the use case.

Example Use Case

A company wants to combine data from two different data sources. Data source 1 is an ERP system with transactional data of the company, data source 2 is a CRM system with customer relationship data. Both data sources contain data on customer companies (company data) and their contacts (person data). Both data sources have stored the company and personal data with different IDs and also the names and address data are not created identically. Therefore a simple join is not possible.

The example data records of the data sources are presented below:

Example 1 Person Data

MEMBER_NO FIRST_NAME LAST_NAME BIRTH_DATE COUNTRY POST_CODE CITY ADDRESS_LINE SOURCE
1 Ted Mosby 04/11/1983 US 10977 Spring Valley 656 N. Stonybrook Street 1
2 Lilly Aldrin 01/12/1994 US 6095 Windsor 8354 South Drive 1
3 Marshall Eriksen 12/14/1989 US 6095 Windsor 8354 South Drive 1
4 Robin Scherbatsky 03/18/1987 US 20707 Laurel 599 Penn Ave. 1
5 Barney Stinson 03/23/1968 US 1801 Woburn 498 Pawnee Road 1
6 Donna Paulsen 08/04/1993 US 33030 Homestead 8586 S. Winchester Dr. 1
7 Harvey Specter 12/04/1994 US 33030 Homestead 8586 S. Winchester Dr. 1
8 Mike Ross 12/23/1996 US 12203 Albany 8992 East Harvey St. 1
9 Rachel Zane 01/20/1980 US 12203 Albany 8992 East Harvey St. 1
10 Louis Litt 06/24/1985 US 7712 Asbury Park 8701 Bear Hill Street 1
11 Jessica Pearson 09/10/2000 US 18940 Newtown 319 Brown Rd. 1
12 Charlie Harper 11/25/1984 US 7110 Malibu 758 Jones Rd. 1
13 Alan Harper 05/08/1992 US 7110 Malibu 758 Jones Rd. 1
14 Jake Harper 08/05/1980 US 7103 Newark 7709 El Dorado Court 1
15 Walden Schmidt 11/13/1987 US 8080 Sewell 5 Pacific Drive 1
16 Harry Potter 06/16/1981 UK 8701 Lakewood 66 Myrtle Dr. 1
17 Hermione Granger 07/15/1991 UK 46307 Crown Point 874 Rockcrest St. 1
18 Ron Weasley 10/28/1987 UK 60452 Oak Forest 7835 Crescent St. 1
19 Rubeus Hagrid 11/25/1982 UK 30701 Calhoun 37 Woodsman Lane 1
20 Albus Dumbledore 01/11/1968 UK 34231 Sarasota 10 Lilac Court 1

Person Data – Data Source 1

ID FIRST LAST BIRTHDAY COUNTRY ZIP CITY STREET STREETNUMBER SOURCE
932093 Teddy Mosby 04/11/1983 US 10977 Spring Valley North Stonybrook Street 656 2
83294 Lilly Aldrin 01/12/1994 US 6095 Windsor South Driv 8354 2
83209841 Marshall Eriksen 12/14/1989 US 6095 Windsor S. Drive 8354 2
129321 Robin Scherbatsky US 20707 Laurel Penn Avenue 2
219409 Barney Stinson 03/23/1968 US 1801 Woburn Pawnee Rd. 498 2
4329 Donna Paulsen 08/04/1993 US 33030 Homestead South Winchester Drive 8586 2
3432 Michael James Ross 12/23/1996 US 12203 Albany E. Harvey Street 8992 2
82302 Rachel Zane 01/20/1980 US 12203 Albany 8992 2
1290383 Louis Litt 06/24/1985 US 7712 Asbury Park Bear Hill St. 8701 2
3129321 Jessica Pearson 09/10/2000 US 18940 Newtown Brown Rd. 319 2
43213 Charles Harper 11/25/1984 US 7110 Malibu Jones Road 758 2
1243221 Alan Harper 05/08/1992 US 7110 Jones Rd. 758 2
2342322 Jake Harper 08/05/1980 US 7103 Newark El Dorado Court 7709 2
213234 Harry Potter 06/16/1981 UK 8701 Lakewood Myrtle Drive 66 2
234212 Hermione Granger 07/15/1991 UK 46307 Crown Point Rockcrest St. 2
2321321 Ronald Weasley UK 60452 Oak Forest Crescent St. 7835 2
234232 Rubeus Hagrid 11/25/1982 UK 30701 Woodmans Ln. 37 2
219410 Barney Stins 03/23/1968 US 1801 Woburn Pawnee Road 498 2

Person Data – Data Source 2

它可以被识别,记录包含不结盟运动e information (First Name, Last Name), date of birth, as well as address information (country, postal code, city, street, house number). Nevertheless, the records have different IDs (MEMBER_NO, ID). For storing the street and house number, two fields are used in data source 2 (STREET, STREETNUMBER) and one field in source 1 (ADDRESS_LINE).

Example 2 Company Data

COMPANY_ID COMPANY_NAME COUNTRY POST_CODE CITY ADDRESS_LINE SOURCE
1 Tim Hortons CA 10977 Spring Valley 656 N. Stonybrook Street 1
2 Burger King US 6095 Windsor 8354 South Drive 1
3 McDonalds US 6095 Windsor 8354 South Drive 1
4 Kentucky Fried Chicken US 20707 Laurel 599 Penn Ave. 1
5 Pizza Hut US 1801 Woburn 498 Pawnee Road 1

Company Data – Data Source 1

ID COMPANY COUNTRY ZIP LOCATION STREETNAME STREETNUMBER SOURCE
1 Tim Hortons CA 10977 Spring Valley North Stonybrook Street 656 2
2 Burger King Food US 6095 Windsor South Drive 8354 2
3 汉堡王 US Windsor South Drive 8354 2
4 KFC US 20707 Laurel Penn Avenue 599 2
5 PIZZA HUT INC US 1801 Woburn Pawnee Road 2

Company Data – Data Source 2

The two data sets have different IDs (COMPANY_ID, ID). The column names differ in both datasets. For storing the street and house number, two fields are used in data source 2 (STREETNAME, STREETNUMBER) and one field in source 1 (ADDRESS_LINE). In some cases, individual values are missing from the data records.

Step 1 – Data Cleansing

为了能够使用韩亚SDQ清理数据, the “Cleanse” node (from the Data Quality section) is used in HANA Flowgraphs. This node always has exactly one input port and one output port for the data flow. It is used for identifying, parsing, validating and formatting data, these include: Addresses, people, company names, job titles, phone numbers, email addresses.

The node looks like this:

Cleanse%20Node%20in%20HANA%20SDQ%20%28Source%3A%20Own%20Image%29

Cleanse Node in HANA SDQ (Source: Own Image)

The documentation can be foundhere. In addition to the pure rule-based validation,address reference datacan be purchased with additional licenses (per country) to achieve address validation and to usegeocodingand reverse geocoding. In addition, there is the node “DQMm Cleanse” with which data can be cleaned based on a microservice (see here).

Components + Content Types

Acomponentcomprises a category of data to be cleansed and/or matched. Components include Firm, Address, Person, Email, Phone, Custom (custom defined components e.g. matching of proprietary IDs). Each component consists of 1 to n content types.

Acontent type是一个类型的数据中包含一个列的dat吗a source, e.g. city, region, postal code, email address. The mapping of the content types to the source data is done automatically based on the name. The mapping can be adjusted manually if a component is not recognized or is recognized incorrectly.

The following graphic shows how a content type mapping can be adjusted in the “Input Components” tab so that the component (in this case Address) is correctly identified.

Content%20type%20mapping%20conversion%20%28Source%3A%20Own%20Image%29

Content type mapping conversion (Source: Own Image)

Cleanse Settings

For each component there are different special settings that can change the behavior of the cleanse, e.g. Casing, Diacritics, Script Conversion. The cleanse domain and the output format can be freely defined, e.g. based on special country formats.

The Side Effect Data Level can also be configured in the Settings. This allows additional statistics on the cleanse behavior. Depending on the level set, the information is stored in additional tables in the“_SYS_TASK” schema.

The following figure shows how the cleanse settings can be adjusted (using companies, persons and addresses as examples).

Configuration%20of%20the%20Cleanse%20Settings%20%28Source%3A%20Own%20Image%29

Configuration of the Cleanse Settings (Source: Own Image)

Cleansed Output

In the tab “Cleansed Output” you can define which fields are added to the output. Cleansed columns that are used specifically for SDQ matching must be explicitly switched on. Uncleansed source data can also be passed on through the output. Standardized fields get the prefix “STD_” and match fields get the prefix “MATCH_“.

The following graphic shows how the cleaned fields can be added for matching.

Adding%20the%20cleansed%20Matching%20Columns%20%28Source%3A%20Own%20Image%29

Adding the cleansed Matching Columns (Source: Own Image)

Example 1: Cleansing Person Data

The following example shows the cleansing of the person data from our example. The SDI flowgraph for the cleansing of the Company Data Source 1 looks as follows:

Example%20Flowgraph%20Cleansing%20Person%20Data%20%28Source%3A%20Own%20Image%29

Example Flowgraph Cleansing Person Data (Source: Own Image)

The result of the cleansing is shown in the following table (some cleansing columns have been hidden):

MATCH_PERSON STD_PERSON_PRE STD_PERSON_GN_FULL STD_PERSON_FN_FULL MATCH_PERSON_GN MATCH_PERSON_GN_STD MATCH_PERSON_GN_STD2 MATCH_PERSON_FN STD_ADDR_COUNTRY_NAME STD_ADDR_ADDRESS_DELIVERY STD_ADDR_LOCALITY_FULL STD_ADDR_POSTCODE_FULL MATCH_ADDR_COUNTRY MATCH_ADDR_POSTCODE1 MATCH_ADDR_REGION MATCH_ADDR_LOCALITY MATCH_ADDR_PRIM_NAME MATCH_ADDR_PRIM_TYPE MATCH_ADDR_PRIM_DIR MATCH_ADDR_PRIM_NUMBER CLEANSING_SOURCE_1_TABLE_ID CLEANSING_SOURCE_1_ROW_ID
TED MOSBY MR. TED MOSBY TED EDWARD THEODORE MOSBY UNITED STATES 656 N STONYBROOK ST SPRING VALLEY 10977 US 10977 SPRING VALLEY STONYBROOK ST N 656 4 0
LILLY ALDRIN MS. LILLY ALDRIN LILLY LILLIAN ALDRIN UNITED STATES 8354 SOUTH DR WINDSOR 6095 US 6095 WINDSOR SOUTH DR 8354 4 1
MARSHALL ERIKSEN MR. MARSHALL ERIKSEN MARSHALL ERIKSEN UNITED STATES 8354 SOUTH DR WINDSOR 6095 US 6095 WINDSOR SOUTH DR 8354 4 2
ROBIN SCHERBATSKY ROBIN SCHERBATSKY ROBIN ROBBIN ROBINSON SCHERBATSKY UNITED STATES 599 PENN AVE LAUREL 20707 US 20707 LAUREL PENN AVE 599 4 3
BARNEY STINSON MR. BARNEY STINSON BARNEY BARNABAS BERNARD STINSON UNITED STATES 498 PAWNEE RD WOBURN 1801 US 1801 WOBURN PAWNEE RD 498 4 4
DONNA PAULSEN MS. DONNA PAULSEN DONNA PAULSEN UNITED STATES 8586 S WINCHESTER DR HOMESTEAD 33030 US 33030 HOMESTEAD WINCHESTER DR S 8586 4 5
HARVEY SPECTER MR. HARVEY SPECTER HARVEY SPECTER UNITED STATES 8586 S WINCHESTER DR HOMESTEAD 33030 US 33030 HOMESTEAD WINCHESTER DR S 8586 4 6
MIKE ROSS MR. MIKE ROSS MIKE MICHAEL ROSS UNITED STATES 8992 E HARVEY ST ALBANY 12203 US 12203 ALBANY HARVEY ST E 8992 4 7
RACHEL ZANE MS. RACHEL ZANE RACHEL ZANE UNITED STATES 8992 E HARVEY ST ALBANY 12203 US 12203 ALBANY HARVEY ST E 8992 4 8
LOUIS LITT MR. LOUIS LITT LOUIS LITT UNITED STATES 8701 BEAR HILL ST ASBURY PARK 7712 US 7712 ASBURY PARK BEAR HILL ST 8701 4 9
JESSICA PEARSON MS. JESSICA PEARSON JESSICA PEARSON UNITED STATES 319年布朗RD NEWTOWN 18940 US 18940 NEWTOWN BROWN RD 319 4 10
CHARLIE HARPER MR. CHARLIE HARPER CHARLIE CHARLENE CHARLES HARPER UNITED STATES 758 JONES RD MALIBU 7110 US 7110 MALIBU JONES RD 758 4 11
ALAN HARPER MR. ALAN HARPER ALAN HARPER UNITED STATES 758 JONES RD MALIBU 7110 US 7110 MALIBU JONES RD 758 4 12
JAKE HARPER MR. JAKE HARPER JAKE JACOB JAKOB HARPER UNITED STATES 7709 EL DORADO CT NEWARK 7103 US 7103 NEWARK EL DORADO CT 7709 4 13
WALDEN SCHMIDT MR. WALDEN SCHMIDT WALDEN SCHMIDT UNITED STATES 5 PACIFIC DR SEWELL 8080 US 8080 SEWELL PACIFIC DR 5 4 14
HARRY POTTER MR. HARRY POTTER HARRY HAROLD HENRY POTTER UNITED KINGDOM 66 MYRTLE DRIVE LAKEWOOD 8701 GB 8701 LAKEWOOD MYRTLE DR 66 4 15
赫敏GRANGER MS. 赫敏 GRANGER 赫敏 GRANGER UNITED KINGDOM 874 ROCKCREST STREET CROWN POINT 46307 GB 46307 CROWN POINT ROCKCREST ST 874 4 16
RON WEASLEY MR. RON WEASLEY RON RONALD WEASLEY UNITED KINGDOM 7835 CRESCENT STREET OAK FOREST 60452 GB 60452 OAK FOREST CRESCENT ST 7835 4 17
RUBEUS HAGRID RUBEUS HAGRID RUBEUS HAGRID UNITED KINGDOM 37 WOODSMAN LANE CALHOUN 30701 GB 30701 CALHOUN WOODSMAN LA 37 4 18
ALBUS DUMBLEDORE ALBUS DUMBLEDORE ALBUS DUMBLEDORE UNITED KINGDOM 10 LILAC COURT SARASOTA 34231 GB 34231 SARASOTA LILAC CT 10 4 19

Cleansed Person Data (Example Data Source 1)

It can be observed that in the column “STD_PERSON_PRE“名(一些人this has not been identified). In the columns “MATCH_PERSON_GN“, “MATCH_PERSON_GN_STD” and “MATCH_PERSON_GN_STD2” alternative first names are maintained (e.g. “TED”, “EDWARD”, “THEODORE”), which can be used for matching. Furthermore, it can be seen that parts of the street name are stored in the fields “MATCH_ADDR_PRIM_NAME“, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_DIR” (e.g. “South Winchester Drive” becomes “WINCHESTER”, “DR” and “S”). In the field “MATCH_ADDR_COUNTRY” you can see that “UK” is converted to “GB”.

Example 2: Cleansing Company Data

The following example shows the cleansing of the organization data from our example.

Example%20Flowgraph%20Cleansing%20Company%20Data%20%28Source%3A%20Own%20Image%29

Example Flowgraph Cleansing Company Data (Source: Own Image)

After the flowgraph is deployed, a procedure is created with the name of the flowgraph including the suffix “_SP“. This can be called as follows:

CALL "". "_SP"(_ _TAB => ' ‘);

In the SAP WebIDE, information about the started tasks can be obtained in the “Tasks” tab (theTASK_EXECUTION_IDcan also be seen here, this uniquely identifies a run of a task). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the“_SYS_TASK” schema.

Task%20Information%20in%20SAP%20WebIDE%20%28Source%3A%20Own%20Image%29

Task Information in SAP WebIDE (Source: Own Image)

The result of the cleansing is shown in the following table (some cleansing columns have been hidden):

STD_FIRM STD_ADDR_COUNTRY_NAME STD_ADDR_ADDRESS_DELIVERY STD_ADDR_LOCALITY_FULL STD_ADDR_POSTCODE_FULL MATCH_FIRM MATCH_FIRM_STD MATCH_ADDR_COUNTRY MATCH_ADDR_POSTCODE1 MATCH_ADDR_LOCALITY MATCH_ADDR_PRIM_NAME MATCH_ADDR_PRIM_TYPE MATCH_ADDR_PRIM_NUMBER CLEANSING_SOURCE_1_TABLE_ID CLEANSING_SOURCE_1_ROW_ID
TIM HORTONS CANADA 656 N. STONYBROOK ST SPRING VALLEY 10977 TIM HORTONS CA 10977 SPRING VALLEY N STONYBROOK ST 656 4 0
汉堡王 UNITED STATES 8354 SOUTH DR WINDSOR 6095 汉堡王 US 6095 WINDSOR SOUTH DR 8354 4 1
MCDONALD’S UNITED STATES 8354 SOUTH DR WINDSOR 6095 MCDONALDS US 6095 WINDSOR SOUTH DR 8354 4 2
KENTUCKY FRIED CHICKEN UNITED STATES 599 PENN AVE LAUREL 20707 KENTUCKY FRIED CHICKEN US 20707 LAUREL PENN AVE 599 4 3
PIZZA HUT UNITED STATES 498 PAWNEE RD WOBURN 1801 PIZZA HUT US 1801 WOBURN PAWNEE RD 498 4 4

Cleansed Company Data (Example Data Source 1)

You can see that in the column “STD_FIRM” the company names are standardized and in “MATCH_FIRM” the name is stored optimized for matching (“McDonalds” becomes “MCDONALD’S” and “MCDONALDS”). Furthermore you can see that in the fields “MATCH_ADDR_PRIM_NAME“, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_NUMBER” parts of the street name and house number are stored, which in Data Source 1 all come from the field “Address Line” (e.g. “498 Pawnee Road” becomes “PAWNEE”, “RD” and “498”).

Since in this example the Side Effect Data Level was set to “Basic“, the columns “CLEANSING_SOURCE_1_TABLE_ID” and “CLEANSING_SOURCE_1_ROW_ID” are included. By means of these IDs information about the cleansing can be extracted in the“_SYS_TASK” schema, among others in the “CLEANSE_ADDRESS_RECORD_INFO” view.

In the following the information about the cleansing is shown in the side effect tables:

Side%20Effect%20Data%20for%20Cleansing%20%28Source%3A%20Own%20Image%29

Side Effect Data for Cleansing (Source: Own Image)

Step 2 – Data Matching

To be able to match data using HANA SDQ and identify potential duplicates, the “Match” node (from the Data Quality section) is used in HANA Flowgraphs. This node always has 1 or more input ports and exactly one output port for the data flow. This node returns match groups with potentially related entries based on match components (such as address + company name), specified match policies, as well as match settings (e.g. should John Doe also match J. Doe?).

The node looks like this:

Match%20Node%20in%20HANA%20SDQ%20%28Source%3A%20Own%20Image%29

Match Node in HANA SDQ (Source: Own Image)

The documentation can be foundhere. Note: The example is not shown in this blog, but if only one data source is attached to the node, the best record of a match group can be identified as the master record based on settings.

Components + Match Policies

Acomponentcomprises a category of data to be cleansed and/or matched. Components include Firm, Address, Person, Email, Phone, Custom (custom defined components e.g. matching of proprietary IDs). Components can be identified in the match node that have already been cleansed via the “Cleanse” node in SDQ or uncleansed components can be defined.

Amatch policyis a rule that defines when records should match. They consist of at least one component (e.g. Person, Address, Name). Multiple match policies can be defined in descending order of priority.

The following graphic shows how match policies can be created in the “Policies” tab.

Definition%20of%20Match%20Policies%20%28Source%3A%20Own%20Image%29

Definition of Match Policies (Source: Own Image)

Match Settings

For each component there are different special settings that can change the behavior of the matching, for example, whether abbreviations in company names should also match. The match strictness can be set on a 7-level scale for each component.

Additional “Source Settings” can be set to get statistical data per data source and to optimize performance.

TheSide Effect Data Levelcan also be configured in the settings. This allows additional statistics on the matching behavior. Depending on the set level, the information is stored in additional tables in the“_SYS_TASK” schema. For example, starting with the “Basic” level, the match score and the match rule for each record can be taken from the “MATCH_TRACING” view.

The following figure shows how the match settings (using the example of companies, persons and addresses) can be adjusted.

Configuration%20of%20Match%20Settings%20%28Source%3A%20Own%20Image%29

Configuration of Match Settings (Source: Own Image)

Match Output

In the tab “Output” can be defined which fields are added to the match output. Output fields are:

  • GROUP_ID (INTEGER): ID of the match group; non-matching entries don’t have a GROUP_ID
  • CONFLICT_GROUP (NVARCHAR(1)): Conflict Cases (several entries of a group match only indirectly)
  • REVIEW_GROUP (NVARCHAR(1)): Matches within a group with low confidence

Optionally, the following fields can be added depending on Match Settings:

  • MATCH_POLICY (NVARCHAR(50)): Drawn Policy, z.B. “Person, Phone”
  • MATCH_SCORE (INTEGER): Confidence value derived from SDQ
  • ROW_ID (INTEGER)/ TABLE_ID (INTEGER): IDs to identify records in Side Effect Tables

In addition, fields from the input of the match node can be passed through.

The following graphic shows how fields can be defined for the matching output.

Definition%20of%20Match%20Output%20%28Source%3A%20Own%20Image%29

Definition of Match Output (Source: Own Image)

Example 1: Matching Person Data

The following example shows the matching of the person data from our example.

Example%20Flowgraph%20Matching%20Person%20Data%20%28Source%3A%20Own%20Image%29

Example Flowgraph Matching Person Data (Source: Own Image)

After the flowgraph has been deployed, a procedure is created with the name of the flowgraph including the suffix “_SP”. This can be called as follows:

CALL "". "_SP"(_ _TAB => '‘, _ _TAB => '‘,);

In the SAP WebIDE, information on the started tasks can be obtained in the “Tasks” tab (theTASK_EXECUTION_IDcan also be obtained here, this uniquely identifies a run of a task). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the“_SYS_TASK” schema.

Task%20Information%20in%20SAP%20WebIDE%20%28Source%3A%20Own%20Image%29

Task Information in SAP WebIDE (Source: Own Image)

The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):

MEMBER_NO FIRST_NAME LAST_NAME BIRTH_DATE COUNTRY POST_CODE CITY ADDRESS_LINE SOURCE ID FIRST LAST BIRTHDAY COUNTRY ZIP CITY STREET STREETNUMBER SOURCE REVIEW_GROUP CONFLICT_GROUP
1 Ted Mosby 04.11.1983 US 10977 Spring Valley 656 N. Stonybrook Street 1 932093 Teddy Mosby 04.11.1983 US 10977 Spring Valley North Stonybrook Street 656 2 N N
2 Lilly Aldrin 01.12.1994 US 6095 Windsor 8354 South Drive 1 83294 Lilly Aldrin 01.12.1994 US 6095 Windsor South Driv 8354 2 N N
3 Marshall Eriksen 12/14/1989 US 6095 Windsor 8354 South Drive 1
4 Robin Scherbatsky 03/18/1987 US 20707 Laurel 599 Penn Ave. 1
5 Barney Stinson 03/23/1968 US 1801 Woburn 498 Pawnee Road 1 219410 Barney Stins 03/23/1968 US 1801 Woburn Pawnee Road 498 2 R N
5 Barney Stinson 03/23/1968 US 1801 Woburn 498 Pawnee Road 1 219409 Barney Stinson 03/23/1968 US 1801 Woburn Pawnee Rd. 498 2 R N
6 Donna Paulsen 08.04.1993 US 33030 Homestead 8586 S. Winchester Dr. 1 4329 Donna Paulsen 08.04.1993 US 33030 Homestead South Winchester Drive 8586 2 N N
7 Harvey Specter 12.04.1994 US 33030 Homestead 8586 S. Winchester Dr. 1
8 Mike Ross 12/23/1996 US 12203 Albany 8992 East Harvey St. 1 3432 Michael James Ross 12/23/1996 US 12203 Albany E. Harvey Street 8992 2 N N
9 Rachel Zane 01/20/1980 US 12203 Albany 8992 East Harvey St. 1
10 Louis Litt 06/24/1985 US 7712 Asbury Park 8701 Bear Hill Street 1 1290383 Louis Litt 06/24/1985 US 7712 Asbury Park Bear Hill St. 8701 2 N N
11 Jessica Pearson 09.10.2000 US 18940 Newtown 319 Brown Rd. 1 3129321 Jessica Pearson 09.10.2000 US 18940 Newtown Brown Rd. 319 2 N N
12 Charlie Harper 11/25/1984 US 7110 Malibu 758 Jones Rd. 1 43213 Charles Harper 11/25/1984 US 7110 Malibu Jones Road 758 2 N N
13 Alan Harper 05.08.1992 US 7110 Malibu 758 Jones Rd. 1 1243221 Alan Harper 05.08.1992 US 7110 Jones Rd. 758 2 N N
14 Jake Harper 08.05.1980 US 7103 Newark 7709 El Dorado Court 1 2342322 Jake Harper 08.05.1980 US 7103 Newark El Dorado Court 7709 2 N N
15 Walden Schmidt 11/13/1987 US 8080 Sewell 5 Pacific Drive 1
16 Harry Potter 06/16/1981 UK 8701 Lakewood 66 Myrtle Dr. 1 213234 Harry Potter 06/16/1981 UK 8701 Lakewood Myrtle Drive 66 2 N N
17 Hermione Granger 07/15/1991 UK 46307 Crown Point 874 Rockcrest St. 1
18 Ron Weasley 10/28/1987 UK 60452 Oak Forest 7835 Crescent St. 1 2321321 Ronald Weasley UK 60452 Oak Forest Crescent St. 7835 2 N N
19 Rubeus Hagrid 11/25/1982 UK 30701 Calhoun 37 Woodsman Lane 1 234232 Rubeus Hagrid 11/25/1982 UK 30701 Woodmans Ln. 37 2 N N
20 Albus Dumbledore 01.11.1968 UK 34231 Sarasota 10 Lilac Court 1
234212 Hermione Granger 07/15/1991 UK 46307 Crown Point Rockcrest St. 2
82302 Rachel Zane 01/20/1980 US 12203 Albany 8992 2
129321 Robin Scherbatsky US 20707 Laurel Penn Avenue 2
83209841 Marshall Eriksen 12/14/1989 US 6095 Windsor S. Drive 8354 2

Match Output Person Data (Example)

It is apparent that many matches were found. For Member No. 5 from Data Source 1 (Barney Stinson), two entries were identified in Data Source 2. No match was found for Member No. 7, but there was also no matching entry in Data Source 2. Member No. 8 from Data Source 1 found a match, although the first names “Mike” and “Michael James” as well as the address data differed. Member No 18 (Ron Weasley) was able to identify a match even though no date of birth was maintained (because multiple match policies were configured). Several matches could not be identified, among them Robin Scherbatsky and Hermione Granger, because no street numbers were maintained. In this use case, one needs to think about setting the match strictness more loosely.

Since the Side Effect Data Level was turned on to “Basic” in this example, IDs can be used to extract matching information in the “_SYS_TASK” schema, including in the “MATCH_TRACING” view. In the following the information about the matching in the side effect tables is shown:

Side%20Effect%20Data%20for%20Matching%20%28Source%3A%20Own%20Image%29

Side Effect Data for Matching (Source: Own Image)

It is visible that in one case only the policy “Person and Address” were matched (Ron Weasley) otherwise “Person, Address and Date” were identified. Furthermore, it can be seen that in one case a match was found inside one data source (Barney Stinson).

Example 2: Matching Company Data

The following example shows the matching of the organizational data from our example.

Example%20Flowgraph%20Matching%20Company%20Data%20%28Source%3A%20Own%20Image%29

Example Flowgraph Matching Company Data (Source: Own Image)

The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):

COMPANY_ID COMPANY_NAME COUNTRY POST_CODE CITY ADDRESS_LINE SOURCE ID COMPANY COUNTRY ZIP LOCATION STREETNAME STREETNUMBER SOURCE REVIEW_GROUP CONFLICT_GROUP
1 Tim Hortons CA 10977 Spring Valley 656 N. Stonybrook Street 1
2 Burger King US 6095 Windsor 8354 South Drive 1 2 Burger King Food US 6095 Windsor South Drive 8354 2 N N
2 Burger King US 6095 Windsor 8354 South Drive 1 3 汉堡王 US Windsor South Drive 8354 2 N N
3 McDonalds US 6095 Windsor 8354 South Drive 1
4 Kentucky Fried Chicken US 20707 Laurel 599 Penn Ave. 1 4 KFC US 20707 Laurel Penn Avenue 599 2 N N
5 Pizza Hut US 1801 Woburn 498 Pawnee Road 1
1 Tim Hortons CA 10977 Spring Valley North Stonybrook Street 656 2
5 PIZZA HUT INC US 1801 Woburn Pawnee Road 2

Match Output Company Data (Example)

It is evident that no match was identified for Company ID 3 (McDonalds) from data source 1. However, there is also no potential match in data source 2. In addition, it can be seen that “Kentucky Fried Chicken” (Company ID 4 in data source 1) and the abbreviation “KFC” were identified as a match. No matches were identified for the Company ID 1 and 5 (“Tim Hortons” and “Pizza Hut”) entries, although potential matches were present in Data Source 2 with a changed name and address.

Subsequently, the match strictness for the company name was loosened and the flowgraph was redeployed and restarted.

Matching%20for%20company%20names%20is%20set%20more%20loosely%20%28Source%3A%20Own%20Image%29

Matching for company names is set more loosely (Source: Own Image)

The result is now shown in the following.

COMPANY_ID COMPANY_NAME COUNTRY POST_CODE CITY ADDRESS_LINE SOURCE ID COMPANY COUNTRY ZIP LOCATION STREETNAME STREETNUMBER SOURCE REVIEW_GROUP CONFLICT_GROUP
1 Tim Hortons CA 10977 Spring Valley 656 N. Stonybrook Street 1
2 Burger King US 6095 Windsor 8354 South Drive 1 2 Burger King Food US 6095 Windsor South Drive 8354 2 N N
2 Burger King US 6095 Windsor 8354 South Drive 1 3 汉堡王 US Windsor South Drive 8354 2 N N
3 McDonalds US 6095 Windsor 8354 South Drive 1
4 Kentucky Fried Chicken US 20707 Laurel 599 Penn Ave. 1 4 KFC US 20707 Laurel Penn Avenue 599 2 N N
5 Pizza Hut US 1801 Woburn 498 Pawnee Road 1 5 PIZZA HUT INC. US 1801 Woburn Pawnee Road 2 R N
1 Tim Hortons CA 10977 Spring Valley North Stonybrook Street 656 2

Match Output Company Data with looser Firmname Settings (Example)

Company_ID 5 from data source 1 and ID 5 from data source 2 could now be identified as an additional match (“Pizza Hut” and “PIZZA HUT INC.”).

Conclusion

This blog post has demonstrated how cleansing and matching use cases can be implemented with SAP HANA by using the Smart Data Quality capabilities. Within a few implementation steps it is possible to create a proof of concept and already achieve a good matching quality. Smart Data Quality includes several rules for cleaning and validating addresses, e.g. to identify street suffixes.

I would like to emphasize again at this point that the cleansing feature provides added value even detached from matching, to be used in additional use cases. Moreover, one should consider persisting the cleansed data in order not to need a cleansing step every time the matching node is called.

Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post.

Best wishes and a happy holiday season,

Tim

Find more information and related blog posts on thetopic page for Database and Data Management.

Assigned Tags

      2 Comments
      You must beLogged onto comment or reply to a post.
      Author's profile photo RAMACHANDRAN RAKHUNATHAN
      RAMACHANDRAN RAKHUNATHAN

      Tim Huse: Great Blog . Is there an equivalent of Match and Cleanse node in Data Intelligence? (I could think of an ML operator to do a match but something else thats simpler to maintain)

      Author's profile photo Tim Huse
      Tim Huse
      Blog Post Author

      Hi Ramachandran, thanks for your feedback!

      Unfortunately, there is no equivalent operator in Data Intelligence.
      I could see two possibilities with Data Intelligence:

      a) If you have a HANA in place you can develop a flowgraph with cleansing and matching in Smart Data Quality and then call the flowgraph via Data Intelligence withthis operator. SAP Data Services has a similar concept as HANA SDQ (based on the same engine,see this blogpost). If you don't have a HANA, but DI and Data Services in place, you could usethis operatorin DI to trigger a Data Services job. If you just need Cleansing capabilities of SAP DI then check thesedata quality operatorsout.

      b) You would need to use a custom operator (for example Python Operator) and develop your own solution. There are a lot of Data Cleansing Webservices and String Matching libraries available.

      Best wishes
      Tim