Technical Articles

Tim Huse

December 15, 2022 18 minute read

Data Cleansing and Data Matching with SAP HANA – Smart Data Quality

Data Cleansing and Data Matching with SAP HANA – Smart Data Quality (Source: SAP)

Introduction

Today, companies leverage data from a variety of internal and external data sources. To integrate these sources successfully, the data must not only fit semantically but also share a common key so that the records can be merged in joins.

This endeavor often fails because the data differs in quality (cleansing problem) or because there is no common key, especially when external data sources are involved (matching problem).

SAP HANA Smart Data Quality (integrated with SAP HANA) provides a high-performance, rule-based solution to cleanse and merge data (e.g. address data) even if no common key exists, and to identify duplicates in the data sources. In this blog post, the data cleansing and data matching functionalities of SAP HANA Smart Data Quality are demonstrated using person data and company data.

The blog post is structured as follows: First, an example is presented that includes two data sets with person data and two data sets with company data. Then, data cleansing is introduced and demonstrated using these data sets. In the next step, data matching is introduced and demonstrated on the same examples. Finally, a conclusion is drawn.

Please note: The following information is presented from my point of view and does not reflect the view of SAP. All data shown is generated sample data.

Prerequisite

To implement the use case shown, a HANA on-premise instance is required. Currently, the matching and cleansing functionality is not available in HANA Cloud. For the example, a HANA database in version 2.0 SPS05 was used.

SAP HANA Smart Data Integration (SDI for short) and Smart Data Quality (SDQ for short) are the data integration and data quality capabilities of the HANA platform. These are embedded in SAP HANA and can be used via an intuitive UI using so-called SDI flowgraphs. More information can be found in this blog post. To use SDI and SDQ, the script server of the HANA database must be activated (see here).

Basic knowledge of building HANA SDI flowgraphs (see here) and of HANA 2.0 (see here) is advantageous for implementing the use case.

Example Use Case

A company wants to combine data from two different data sources. Data source 1 is an ERP system with transactional data of the company, data source 2 is a CRM system with customer relationship data. Both data sources contain data on customer companies (company data) and their contacts (person data). The two sources store the company and person data under different IDs, and the names and address data were not entered identically. Therefore, a simple join is not possible.

The example data records of the data sources are presented below:

Example 1 Person Data

MEMBER_NO	FIRST_NAME	LAST_NAME	BIRTH_DATE	COUNTRY	POST_CODE	CITY	ADDRESS_LINE	SOURCE
1	Ted	Mosby	04/11/1983	US	10977	Spring Valley	656 N. Stonybrook Street	1
2	Lilly	Aldrin	01/12/1994	US	6095	Windsor	8354 South Drive	1
3	Marshall	Eriksen	12/14/1989	US	6095	Windsor	8354 South Drive	1
4	Robin	Scherbatsky	03/18/1987	US	20707	Laurel	599 Penn Ave.	1
5	Barney	Stinson	03/23/1968	US	1801	Woburn	498 Pawnee Road	1
6	Donna	Paulsen	08/04/1993	US	33030	Homestead	8586 S. Winchester Dr.	1
7	Harvey	Specter	12/04/1994	US	33030	Homestead	8586 S. Winchester Dr.	1
8	Mike	Ross	12/23/1996	US	12203	Albany	8992 East Harvey St.	1
9	Rachel	Zane	01/20/1980	US	12203	Albany	8992 East Harvey St.	1
10	Louis	Litt	06/24/1985	US	7712	Asbury Park	8701 Bear Hill Street	1
11	Jessica	Pearson	09/10/2000	US	18940	Newtown	319 Brown Rd.	1
12	Charlie	Harper	11/25/1984	US	7110	Malibu	758 Jones Rd.	1
13	Alan	Harper	05/08/1992	US	7110	Malibu	758 Jones Rd.	1
14	Jake	Harper	08/05/1980	US	7103	Newark	7709 El Dorado Court	1
15	Walden	Schmidt	11/13/1987	US	8080	Sewell	5 Pacific Drive	1
16	Harry	Potter	06/16/1981	UK	8701	Lakewood	66 Myrtle Dr.	1
17	Hermione	Granger	07/15/1991	UK	46307	Crown Point	874 Rockcrest St.	1
18	Ron	Weasley	10/28/1987	UK	60452	Oak Forest	7835 Crescent St.	1
19	Rubeus	Hagrid	11/25/1982	UK	30701	Calhoun	37 Woodsman Lane	1
20	Albus	Dumbledore	01/11/1968	UK	34231	Sarasota	10 Lilac Court	1

Person Data – Data Source 1

ID	FIRST	LAST	BIRTHDAY	COUNTRY	ZIP	CITY	STREET	STREETNUMBER	SOURCE
932093	Teddy	Mosby	04/11/1983	US	10977	Spring Valley	North Stonybrook Street	656	2
83294	Lilly	Aldrin	01/12/1994	US	6095	Windsor	South Driv	8354	2
83209841	Marshall	Eriksen	12/14/1989	US	6095	Windsor	S. Drive	8354	2
129321	Robin	Scherbatsky		US	20707	Laurel	Penn Avenue		2
219409	Barney	Stinson	03/23/1968	US	1801	Woburn	Pawnee Rd.	498	2
4329	Donna	Paulsen	08/04/1993	US	33030	Homestead	South Winchester Drive	8586	2
3432	Michael James	Ross	12/23/1996	US	12203	Albany	E. Harvey Street	8992	2
82302	Rachel	Zane	01/20/1980	US	12203	Albany		8992	2
1290383	Louis	Litt	06/24/1985	US	7712	Asbury Park	Bear Hill St.	8701	2
3129321	Jessica	Pearson	09/10/2000	US	18940	Newtown	Brown Rd.	319	2
43213	Charles	Harper	11/25/1984	US	7110	Malibu	Jones Road	758	2
1243221	Alan	Harper	05/08/1992	US	7110		Jones Rd.	758	2
2342322	Jake	Harper	08/05/1980	US	7103	Newark	El Dorado Court	7709	2
213234	Harry	Potter	06/16/1981	UK	8701	Lakewood	Myrtle Drive	66	2
234212	Hermione	Granger	07/15/1991	UK	46307	Crown Point	Rockcrest St.		2
2321321	Ronald	Weasley		UK	60452	Oak Forest	Crescent St.	7835	2
234232	Rubeus	Hagrid	11/25/1982	UK	30701		Woodmans Ln.	37	2
219410	Barney	Stins	03/23/1968	US	1801	Woburn	Pawnee Road	498	2

Person Data – Data Source 2

It can be seen that the records contain name information (First Name, Last Name), date of birth, and address information (country, postal code, city, street, house number). Nevertheless, the records have different IDs (MEMBER_NO, ID). For storing the street and house number, two fields are used in data source 2 (STREET, STREETNUMBER) and one field in data source 1 (ADDRESS_LINE).
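To make the problem concrete, here is a minimal Python sketch that tries an exact join on first name, last name and postal code for three records per source, copied from the tables above. The helper name `exact_join` and the choice of join columns are assumptions made for illustration only.

```python
# Three sample records per source, copied from the person tables above.
source_1 = [
    {"MEMBER_NO": 1, "FIRST_NAME": "Ted", "LAST_NAME": "Mosby", "POST_CODE": "10977"},
    {"MEMBER_NO": 2, "FIRST_NAME": "Lilly", "LAST_NAME": "Aldrin", "POST_CODE": "6095"},
    {"MEMBER_NO": 12, "FIRST_NAME": "Charlie", "LAST_NAME": "Harper", "POST_CODE": "7110"},
]
source_2 = [
    {"ID": 932093, "FIRST": "Teddy", "LAST": "Mosby", "ZIP": "10977"},
    {"ID": 83294, "FIRST": "Lilly", "LAST": "Aldrin", "ZIP": "6095"},
    {"ID": 43213, "FIRST": "Charles", "LAST": "Harper", "ZIP": "7110"},
]


def exact_join(left, right):
    """Pair records whose first name, last name and postal code are identical."""
    index = {(r["FIRST"], r["LAST"], r["ZIP"]): r["ID"] for r in right}
    pairs = []
    for row in left:
        key = (row["FIRST_NAME"], row["LAST_NAME"], row["POST_CODE"])
        if key in index:
            pairs.append((row["MEMBER_NO"], index[key]))
    return pairs


pairs = exact_join(source_1, source_2)
print(pairs)
```

Only Lilly Aldrin is paired here. "Ted"/"Teddy" and "Charlie"/"Charles" are missed by the exact comparison, which is the gap that cleansing and matching are meant to close.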

Example 2 Company Data

COMPANY_ID	COMPANY_NAME	COUNTRY	POST_CODE	CITY	ADDRESS_LINE	SOURCE
1	Tim Hortons	CA	10977	Spring Valley	656 N. Stonybrook Street	1
2	Burger King	US	6095	Windsor	8354 South Drive	1
3	McDonalds	US	6095	Windsor	8354 South Drive	1
4	Kentucky Fried Chicken	US	20707	Laurel	599 Penn Ave.	1
5	Pizza Hut	US	1801	Woburn	498 Pawnee Road	1

Company Data – Data Source 1

ID	COMPANY	COUNTRY	ZIP	LOCATION	STREETNAME	STREETNUMBER	SOURCE
1	Tim Hortons	CA	10977	Spring Valley	North Stonybrook Street	656	2
2	Burger King Food	US	6095	Windsor	South Drive	8354	2
3	Burger King	US		Windsor	South Drive	8354	2
4	KFC	US	20707	Laurel	Penn Avenue	599	2
5	PIZZA HUT INC	US	1801	Woburn	Pawnee Road		2

Company Data – Data Source 2

The two data sets have different IDs (COMPANY_ID, ID). The column names differ in both datasets. For storing the street and house number, two fields are used in data source 2 (STREETNAME, STREETNUMBER) and one field in source 1 (ADDRESS_LINE). In some cases, individual values are missing from the data records.

Step 1 – Data Cleansing

To cleanse data with HANA SDQ, the “Cleanse” node (from the Data Quality section) is used in HANA flowgraphs. This node always has exactly one input port and one output port for the data flow. It is used for identifying, parsing, validating and formatting data such as addresses, person names, company names, job titles, phone numbers and email addresses.

The node looks like this:

Cleanse Node in HANA SDQ (Source: Own Image)

The documentation can be found here. In addition to the purely rule-based validation, address reference data can be purchased with additional licenses (per country) to achieve address validation and to use geocoding and reverse geocoding. In addition, there is the “DQMm Cleanse” node, with which data can be cleansed based on a microservice (see here).

Components + Content Types

A content type is the type of data contained in a column of the data source, e.g. city, region, postal code, email address. The mapping of the content types to the source data is done automatically based on the column name. The mapping can be adjusted manually if a component is not recognized or is recognized incorrectly.
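The idea of a name-based mapping can be sketched in a few lines of Python. This is not how SDQ implements it. The keyword table `NAME_HINTS`, the function `guess_content_type` and the content type labels are hypothetical and only illustrate why some columns are recognized automatically while others need a manual assignment.

```python
# Hypothetical keyword table: column-name fragment -> content type label.
NAME_HINTS = [
    ("POST", "Postcode"), ("ZIP", "Postcode"),
    ("CITY", "City"), ("LOCATION", "City"),
    ("STREET", "Address"), ("ADDRESS", "Address"),
    ("COUNTRY", "Country"),
    ("FIRST", "First Name"), ("LAST", "Last Name"),
    ("COMPANY", "Firm"),
]


def guess_content_type(column_name):
    """Return the first content type whose fragment occurs in the column name."""
    upper = column_name.upper()
    for fragment, content_type in NAME_HINTS:
        if fragment in upper:
            return content_type
    return None  # not recognized: would have to be mapped manually


columns = ["FIRST_NAME", "LAST_NAME", "POST_CODE", "CITY", "ADDRESS_LINE", "MEMBER_NO"]
mapping = {column: guess_content_type(column) for column in columns}
print(mapping)
```

In this sketch, `MEMBER_NO` stays unmapped, which corresponds to the case where the mapping has to be adjusted by hand in the “Input Components” tab.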

The following graphic shows how a content type mapping can be adjusted in the “Input Components” tab so that the component (in this case Address) is correctly identified.

Content type mapping conversion (Source: Own Image)

Cleanse Settings

For each component there are different special settings that can change the behavior of the cleanse, e.g. Casing, Diacritics, Script Conversion. The cleanse domain and the output format can be freely defined, e.g. based on special country formats.
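What settings such as casing and diacritics do to a value can be illustrated with plain Python. The function `normalize` and its two flags are assumptions for this sketch and do not correspond to actual SDQ option names.

```python
import unicodedata


def normalize(text, upper=True, strip_diacritics=True):
    """Apply two simple cleanse-style options to a string."""
    if strip_diacritics:
        # Decompose characters, then drop the combining marks (accents, tildes).
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text.upper() if upper else text


print(normalize("José Núñez"))
print(normalize("José Núñez", upper=False))
```

Applying the same normalization to both data sources before comparing them removes one class of purely cosmetic differences.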

The Side Effect Data Level can also be configured in the settings. This provides additional statistics on the cleanse behavior. Depending on the level set, the information is stored in additional tables in the “_SYS_TASK” schema.

The following figure shows how the cleanse settings can be adjusted (using companies, persons and addresses as examples).

Configuration of the Cleanse Settings (Source: Own Image)

Cleansed Output

In the tab “Cleansed Output” you can define which fields are added to the output. Cleansed columns that are used specifically for SDQ matching must be explicitly switched on. Uncleansed source data can also be passed on through the output. Standardized fields get the prefix “STD_” and match fields get the prefix “MATCH_“.

The following graphic shows how the cleaned fields can be added for matching.

Adding the cleansed Matching Columns (Source: Own Image)

Example 1: Cleansing Person Data

The following example shows the cleansing of the person data from our example. The SDI flowgraph for the cleansing of the person data from data source 1 looks as follows:

Example Flowgraph Cleansing Person Data (Source: Own Image)

The result of the cleansing is shown in the following table (some cleansing columns have been hidden):

MATCH_PERSON	STD_PERSON_PRE	STD_PERSON_GN_FULL	STD_PERSON_FN_FULL	MATCH_PERSON_GN	MATCH_PERSON_GN_STD	MATCH_PERSON_GN_STD2	MATCH_PERSON_FN	STD_ADDR_COUNTRY_NAME	STD_ADDR_ADDRESS_DELIVERY	STD_ADDR_LOCALITY_FULL	STD_ADDR_POSTCODE_FULL	MATCH_ADDR_COUNTRY	MATCH_ADDR_POSTCODE1	MATCH_ADDR_REGION	MATCH_ADDR_LOCALITY	MATCH_ADDR_PRIM_NAME	MATCH_ADDR_PRIM_TYPE	MATCH_ADDR_PRIM_DIR	MATCH_ADDR_PRIM_NUMBER	…	CLEANSING_SOURCE_1_TABLE_ID	CLEANSING_SOURCE_1_ROW_ID
TED MOSBY	MR.	TED	MOSBY	TED	EDWARD	THEODORE	MOSBY	UNITED STATES	656 N STONYBROOK ST	SPRING VALLEY	10977				SPRING VALLEY	STONYBROOK			656	…
LILLY ALDRIN	MS.	LILLY	ALDRIN	LILLY	LILLIAN		ALDRIN	UNITED STATES	8354 SOUTH DR	WINDSOR	6095				WINDSOR	SOUTH			8354	…
MARSHALL ERIKSEN	MR.	MARSHALL	ERIKSEN	MARSHALL			ERIKSEN	UNITED STATES	8354 SOUTH DR	WINDSOR	6095				WINDSOR	SOUTH			8354	…
ROBIN SCHERBATSKY		ROBIN	SCHERBATSKY	ROBIN	ROBBIN	ROBINSON	SCHERBATSKY	UNITED STATES	599 PENN AVE	LAUREL	20707				LAUREL	PENN	AVE		599	…
BARNEY STINSON	MR.	BARNEY	STINSON	BARNEY	BARNABAS	BERNARD	STINSON	UNITED STATES	498 PAWNEE RD	WOBURN	1801				WOBURN	PAWNEE			498	…
DONNA PAULSEN	MS.	DONNA	PAULSEN	DONNA			PAULSEN	UNITED STATES	8586 S WINCHESTER DR	HOMESTEAD	33030				HOMESTEAD	WINCHESTER			8586	…
HARVEY SPECTER	MR.	HARVEY	SPECTER	HARVEY			SPECTER	UNITED STATES	8586 S WINCHESTER DR	HOMESTEAD	33030				HOMESTEAD	WINCHESTER			8586	…
MIKE ROSS	MR.	MIKE	ROSS	MIKE	MICHAEL		ROSS	UNITED STATES	8992 E HARVEY ST	ALBANY	12203				ALBANY	HARVEY			8992	…
RACHEL ZANE	MS.	RACHEL	ZANE	RACHEL			ZANE	UNITED STATES	8992 E HARVEY ST	ALBANY	12203				ALBANY	HARVEY			8992	…
LOUIS LITT	MR.	LOUIS	LITT	LOUIS			LITT	UNITED STATES	8701 BEAR HILL ST	ASBURY PARK	7712				ASBURY PARK	BEAR HILL			8701	…
JESSICA PEARSON	MS.	JESSICA	PEARSON	JESSICA			PEARSON	UNITED STATES	319 BROWN RD	NEWTOWN	18940				NEWTOWN	BROWN			319	…
CHARLIE HARPER	MR.	CHARLIE	HARPER	CHARLIE	CHARLENE	CHARLES	HARPER	UNITED STATES	758 JONES RD	MALIBU	7110				MALIBU	JONES			758	…
ALAN HARPER	MR.	ALAN	HARPER	ALAN			HARPER	UNITED STATES	758 JONES RD	MALIBU	7110				MALIBU	JONES			758	…
JAKE HARPER	MR.	JAKE	HARPER	JAKE	JACOB	JAKOB	HARPER	UNITED STATES	7709 EL DORADO CT	NEWARK	7103				NEWARK	EL DORADO			7709	…
WALDEN SCHMIDT	MR.	WALDEN	SCHMIDT	WALDEN			SCHMIDT	UNITED STATES	5 PACIFIC DR	SEWELL	8080				SEWELL	PACIFIC				…
HARRY POTTER	MR.	HARRY	POTTER	HARRY	HAROLD	HENRY	POTTER	UNITED KINGDOM	66 MYRTLE DRIVE	LAKEWOOD	8701				LAKEWOOD	MYRTLE				…
HERMIONE GRANGER	MS.	HERMIONE	GRANGER	HERMIONE			GRANGER	UNITED KINGDOM	874 ROCKCREST STREET	CROWN POINT	46307				CROWN POINT	ROCKCREST			874	…
RON WEASLEY	MR.	RON	WEASLEY	RON	RONALD		WEASLEY	UNITED KINGDOM	7835 CRESCENT STREET	OAK FOREST	60452				OAK FOREST	CRESCENT			7835	…
RUBEUS HAGRID		RUBEUS	HAGRID	RUBEUS			HAGRID	UNITED KINGDOM	37 WOODSMAN LANE	CALHOUN	30701				CALHOUN	WOODSMAN				…
ALBUS DUMBLEDORE		ALBUS	DUMBLEDORE	ALBUS			DUMBLEDORE	UNITED KINGDOM	10 LILAC COURT	SARASOTA	34231				SARASOTA	LILAC				…

Cleansed Person Data (Example Data Source 1)

It can be observed that the column “STD_PERSON_PRE” holds the name prefix, e.g. “MR.” or “MS.” (for some people this has not been identified). In the columns “MATCH_PERSON_GN”, “MATCH_PERSON_GN_STD” and “MATCH_PERSON_GN_STD2”, alternative first names are maintained (e.g. “TED”, “EDWARD”, “THEODORE”), which can be used for matching. Furthermore, it can be seen that parts of the street name are stored in the fields “MATCH_ADDR_PRIM_NAME”, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_DIR” (e.g. “South Winchester Drive” becomes “WINCHESTER”, “DR” and “S”). In the field “MATCH_ADDR_COUNTRY” you can see that “UK” is converted to “GB”.
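The splitting of a street line into number, direction, name and type can be approximated with a small rule-based parser. The following Python sketch is a strongly simplified illustration, not the SDQ rule set. The lookup tables and the function `parse_address_line` are assumptions, and only US-style address lines such as the ones in the sample data are handled.

```python
import re

# Small hand-picked lookup tables; a production rule set is far larger.
DIRECTIONS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
              "E": "E", "EAST": "E", "W": "W", "WEST": "W"}
STREET_TYPES = {"STREET": "ST", "ST": "ST", "DRIVE": "DR", "DR": "DR",
                "ROAD": "RD", "RD": "RD", "AVENUE": "AVE", "AVE": "AVE",
                "COURT": "CT", "CT": "CT", "LANE": "LN", "LN": "LN"}


def parse_address_line(line):
    """Split an address line into house number, direction, street name and type."""
    tokens = re.sub(r"[.,]", "", line).upper().split()
    parts = {"number": "", "direction": "", "name": "", "type": ""}
    if tokens and tokens[0].isdigit():
        parts["number"] = tokens.pop(0)
    if tokens and tokens[-1] in STREET_TYPES:
        parts["type"] = STREET_TYPES[tokens.pop()]
    # A leading direction word counts as a direction only if a name remains,
    # so that "South Drive" keeps "SOUTH" as the street name.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parts["direction"] = DIRECTIONS[tokens.pop(0)]
    parts["name"] = " ".join(tokens)
    return parts


print(parse_address_line("8586 S. Winchester Dr."))
print(parse_address_line("South Winchester Drive"))
```

Both spellings of the Winchester address yield the same name, type and direction, which is why comparing the parsed parts is more robust than comparing the raw strings.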

Example 2: Cleansing Company Data

The following example shows the cleansing of the organization data from our example.

Example Flowgraph Cleansing Company Data (Source: Own Image)

After the flowgraph is deployed, a procedure is created with the name of the flowgraph including the suffix “_SP”. This can be called as follows:

CALL "<schema>"."<flowgraph_name>_SP"(<…>_<…>_TAB => '<…>');

In the SAP WebIDE, information about the started tasks can be obtained in the “Tasks” tab (the TASK_EXECUTION_ID, which uniquely identifies a run of a task, can also be seen here). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the “_SYS_TASK” schema.

Task Information in SAP WebIDE (Source: Own Image)

The result of the cleansing is shown in the following table (some cleansing columns have been hidden):

STD_FIRM	STD_ADDR_COUNTRY_NAME	STD_ADDR_ADDRESS_DELIVERY	STD_ADDR_LOCALITY_FULL	STD_ADDR_POSTCODE_FULL	MATCH_FIRM	MATCH_FIRM_STD	MATCH_ADDR_COUNTRY	MATCH_ADDR_POSTCODE1	MATCH_ADDR_LOCALITY	MATCH_ADDR_PRIM_NAME	MATCH_ADDR_PRIM_TYPE	MATCH_ADDR_PRIM_NUMBER	…	CLEANSING_SOURCE_1_TABLE_ID	CLEANSING_SOURCE_1_ROW_ID
TIM HORTONS	CANADA	656 N. STONYBROOK ST	SPRING VALLEY	10977	TIM HORTONS			10977	SPRING VALLEY	N STONYBROOK		656	…
BURGER KING	UNITED STATES	8354 SOUTH DR	WINDSOR	6095	BURGER KING			6095	WINDSOR	SOUTH		8354	…
MCDONALD’S	UNITED STATES	8354 SOUTH DR	WINDSOR	6095	MCDONALDS			6095	WINDSOR	SOUTH		8354	…
KENTUCKY FRIED CHICKEN	UNITED STATES	599 PENN AVE	LAUREL	20707	KENTUCKY FRIED CHICKEN			20707	LAUREL	PENN	AVE	599	…
PIZZA HUT	UNITED STATES	498 PAWNEE RD	WOBURN	1801	PIZZA HUT			1801	WOBURN	PAWNEE		498	…

Cleansed Company Data (Example Data Source 1)

You can see that the company names are standardized in the column “STD_FIRM” and stored in a form optimized for matching in “MATCH_FIRM” (“McDonalds” becomes “MCDONALD’S” and “MCDONALDS”, respectively). Furthermore, the fields “MATCH_ADDR_PRIM_NAME”, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_NUMBER” hold the parts of the street name and the house number, which in data source 1 all come from the field “ADDRESS_LINE” (e.g. “498 Pawnee Road” becomes “PAWNEE”, “RD” and “498”).
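A match-optimized company name can be thought of as a normalized key. The following Python sketch shows one possible normalization. The suffix list `LEGAL_SUFFIXES` and the function `firm_match_key` are assumptions for illustration, and the sketch deliberately does not resolve abbreviations such as “KFC”.

```python
import re

# Hypothetical list of legal-form suffixes that carry no matching information.
LEGAL_SUFFIXES = {"INC", "LLC", "LTD", "CORP", "CO", "GMBH", "AG"}


def firm_match_key(name):
    """Upper-case a company name, drop punctuation and trailing legal forms."""
    cleaned = re.sub(r"[^\w\s]", "", name.upper())  # removes ' and . among others
    tokens = cleaned.split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(firm_match_key("McDonald's"))
print(firm_match_key("PIZZA HUT INC"))
```

With such a key, “McDonalds” and “McDonald's” as well as “Pizza Hut” and “PIZZA HUT INC” compare as equal, while “Burger King Food” still differs from “Burger King”.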

Since the Side Effect Data Level was set to “Basic” in this example, the columns “CLEANSING_SOURCE_1_TABLE_ID” and “CLEANSING_SOURCE_1_ROW_ID” are included. By means of these IDs, information about the cleansing can be extracted from the “_SYS_TASK” schema, among others from the “CLEANSE_ADDRESS_RECORD_INFO” view.

In the following the information about the cleansing is shown in the side effect tables:

Side Effect Data for Cleansing (Source: Own Image)

Step 2 – Data Matching

To be able to match data using HANA SDQ and identify potential duplicates, the “Match” node (from the Data Quality section) is used in HANA Flowgraphs. This node always has 1 or more input ports and exactly one output port for the data flow. This node returns match groups with potentially related entries based on match components (such as address + company name), specified match policies, as well as match settings (e.g. should John Doe also match J. Doe?).

The node looks like this:

Match Node in HANA SDQ (Source: Own Image)

The documentation can be found here. Note: The example is not shown in this blog, but if only one data source is attached to the node, the best record of a match group can be identified as the master record based on settings.

Components + Match Policies

A component comprises a category of data to be cleansed and/or matched. Components include Firm, Address, Person, Email, Phone and Custom (custom-defined components, e.g. for matching proprietary IDs). The Match node can pick up components that have already been cleansed by the “Cleanse” node in SDQ; alternatively, uncleansed components can be defined.

A match policy is a rule that defines when records should match. It consists of at least one component (e.g. Person, Address, Name). Multiple match policies can be defined in descending order of priority.
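How prioritized policies interact can be sketched as follows. The sketch assumes that the component values have already been standardized and compares them by simple equality, which is a simplification of what a matching engine does. The policy list, the record layout and the function `first_matching_policy` are hypothetical.

```python
# Policies in descending priority; each lists the components that must all agree.
POLICIES = [
    ("Person, Address and Date", ("person", "address", "date")),
    ("Person and Address", ("person", "address")),
]


def first_matching_policy(rec_a, rec_b, policies=POLICIES):
    """Return the name of the highest-priority policy under which two records match."""
    for name, components in policies:
        # A missing (empty) value never satisfies a component.
        if all(rec_a.get(c) and rec_a.get(c) == rec_b.get(c) for c in components):
            return name
    return None


# Assumed, already standardized values for two records of the same person.
with_date = {"person": "RONALD WEASLEY", "address": "7835 CRESCENT ST", "date": "10/28/1987"}
without_date = {"person": "RONALD WEASLEY", "address": "7835 CRESCENT ST", "date": ""}

print(first_matching_policy(with_date, with_date))
print(first_matching_policy(with_date, without_date))
```

The second call falls through to the lower-priority policy because the date of birth is missing in one record, so a record pair can still match when one component is not maintained.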

The following graphic shows how match policies can be created in the “Policies” tab.

Definition of Match Policies (Source: Own Image)

Match Settings

For each component there are different special settings that can change the behavior of the matching, for example, whether abbreviations in company names should also match. The match strictness can be set on a 7-level scale for each component.

Additional “Source Settings” can be set to get statistical data per data source and to optimize performance.

The Side Effect Data Level can also be configured in the settings. This provides additional statistics on the matching behavior. Depending on the level set, the information is stored in additional tables in the “_SYS_TASK” schema. For example, starting with the “Basic” level, the match score and the match rule for each record can be taken from the “MATCH_TRACING” view.

The following figure shows how the match settings (using the example of companies, persons and addresses) can be adjusted.

Configuration of Match Settings (Source: Own Image)

Match Output

The “Output” tab defines which fields are added to the match output. The output fields are:

GROUP_ID (INTEGER): ID of the match group; non-matching entries don’t have a GROUP_ID
CONFLICT_GROUP (NVARCHAR(1)): Conflict Cases (several entries of a group match only indirectly)
REVIEW_GROUP (NVARCHAR(1)): Matches within a group with low confidence
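How pairwise matches turn into match groups, and what an indirect match inside a group means, can be sketched with a small union-find structure in Python. This is an illustration of the concept only, not the SDQ implementation. The function `build_groups` is hypothetical, and the conflict flag is a boolean here rather than an NVARCHAR(1) value.

```python
def build_groups(record_ids, matched_pairs):
    """Group records connected by pairwise matches and flag indirect matches."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    members = {}
    for r in record_ids:
        members.setdefault(find(r), []).append(r)

    direct = {frozenset(pair) for pair in matched_pairs}
    groups = []
    group_id = 0
    for recs in members.values():
        if len(recs) < 2:
            continue  # non-matching records do not get a GROUP_ID
        group_id += 1
        # Conflict: at least one pair in the group is connected only indirectly.
        conflict = any(frozenset((a, b)) not in direct
                       for i, a in enumerate(recs) for b in recs[i + 1:])
        groups.append({"GROUP_ID": group_id, "records": recs, "CONFLICT_GROUP": conflict})
    return groups


groups = build_groups(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(groups)
```

Here A and C end up in the same group only because both match B, so the group is flagged as a conflict, while D receives no group at all.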

Optionally, the following fields can be added depending on Match Settings:

MATCH_POLICY (NVARCHAR(50)): Applied policy, e.g. “Person, Phone”
MATCH_SCORE (INTEGER): Confidence value derived from SDQ
ROW_ID (INTEGER)/ TABLE_ID (INTEGER): IDs to identify records in Side Effect Tables

In addition, fields from the input of the match node can be passed through.

The following graphic shows how fields can be defined for the matching output.

Definition of Match Output (Source: Own Image)

Example 1: Matching Person Data

The following example shows the matching of the person data from our example.

Example Flowgraph Matching Person Data (Source: Own Image)

After the flowgraph has been deployed, a procedure is created with the name of the flowgraph including the suffix “_SP”. This can be called as follows:

CALL "<schema>"."<flowgraph_name>_SP"(<…>_<…>_TAB => '<…>', <…>_<…>_TAB => '<…>');

In the SAP WebIDE, information on the started tasks can be obtained in the “Tasks” tab (the TASK_EXECUTION_ID, which uniquely identifies a run of a task, can also be obtained here). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the “_SYS_TASK” schema.

Task Information in SAP WebIDE (Source: Own Image)

The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):

MEMBER_NO	FIRST_NAME	LAST_NAME	BIRTH_DATE	COUNTRY	POST_CODE	CITY	ADDRESS_LINE	SOURCE	ID	FIRST	LAST	BIRTHDAY	COUNTRY	ZIP	CITY	STREET	STREETNUMBER	SOURCE	REVIEW_GROUP	CONFLICT_GROUP
1	Ted	Mosby	04/11/1983	US	10977	Spring Valley	656 N. Stonybrook Street	1	932093	Teddy	Mosby	04/11/1983	US	10977	Spring Valley	North Stonybrook Street	656	2
2	Lilly	Aldrin	01/12/1994	US	6095	Windsor	8354 South Drive	1	83294	Lilly	Aldrin	01/12/1994	US	6095	Windsor	South Driv	8354	2
3	Marshall	Eriksen	12/14/1989	US	6095	Windsor	8354 South Drive	1
4	Robin	Scherbatsky	03/18/1987	US	20707	Laurel	599 Penn Ave.	1
5	Barney	Stinson	03/23/1968	US	1801	Woburn	498 Pawnee Road	1	219410	Barney	Stins	03/23/1968	US	1801	Woburn	Pawnee Road	498	2
5	Barney	Stinson	03/23/1968	US	1801	Woburn	498 Pawnee Road	1	219409	Barney	Stinson	03/23/1968	US	1801	Woburn	Pawnee Rd.	498	2
6	Donna	Paulsen	08/04/1993	US	33030	Homestead	8586 S. Winchester Dr.	1	4329	Donna	Paulsen	08/04/1993	US	33030	Homestead	South Winchester Drive	8586	2
7	Harvey	Specter	12/04/1994	US	33030	Homestead	8586 S. Winchester Dr.	1
8	Mike	Ross	12/23/1996	US	12203	Albany	8992 East Harvey St.	1	3432	Michael James	Ross	12/23/1996	US	12203	Albany	E. Harvey Street	8992	2
9	Rachel	Zane	01/20/1980	US	12203	Albany	8992 East Harvey St.	1
10	Louis	Litt	06/24/1985	US	7712	Asbury Park	8701 Bear Hill Street	1	1290383	Louis	Litt	06/24/1985	US	7712	Asbury Park	Bear Hill St.	8701	2
11	Jessica	Pearson	09/10/2000	US	18940	Newtown	319 Brown Rd.	1	3129321	Jessica	Pearson	09/10/2000	US	18940	Newtown	Brown Rd.	319	2
12	Charlie	Harper	11/25/1984	US	7110	Malibu	758 Jones Rd.	1	43213	Charles	Harper	11/25/1984	US	7110	Malibu	Jones Road	758	2
13	Alan	Harper	05/08/1992	US	7110	Malibu	758 Jones Rd.	1	1243221	Alan	Harper	05/08/1992	US	7110		Jones Rd.	758	2
14	Jake	Harper	08/05/1980	US	7103	Newark	7709 El Dorado Court	1	2342322	Jake	Harper	08/05/1980	US	7103	Newark	El Dorado Court	7709	2
15	Walden	Schmidt	11/13/1987	US	8080	Sewell	5 Pacific Drive	1
16	Harry	Potter	06/16/1981	UK	8701	Lakewood	66 Myrtle Dr.	1	213234	Harry	Potter	06/16/1981	UK	8701	Lakewood	Myrtle Drive	66	2
17	Hermione	Granger	07/15/1991	UK	46307	Crown Point	874 Rockcrest St.	1
18	Ron	Weasley	10/28/1987	UK	60452	Oak Forest	7835 Crescent St.	1	2321321	Ronald	Weasley		UK	60452	Oak Forest	Crescent St.	7835	2
19	Rubeus	Hagrid	11/25/1982	UK	30701	Calhoun	37 Woodsman Lane	1	234232	Rubeus	Hagrid	11/25/1982	UK	30701		Woodmans Ln.	37	2
20	Albus	Dumbledore	01/11/1968	UK	34231	Sarasota	10 Lilac Court	1
									234212	Hermione	Granger	07/15/1991	UK	46307	Crown Point	Rockcrest St.		2
									82302	Rachel	Zane	01/20/1980	US	12203	Albany		8992	2
									129321	Robin	Scherbatsky		US	20707	Laurel	Penn Avenue		2
									83209841	Marshall	Eriksen	12/14/1989	US	6095	Windsor	S. Drive	8354	2

Match Output Person Data (Example)

Many matches were found. For Member No. 5 from data source 1 (Barney Stinson), two entries were identified in data source 2. No match was found for Member No. 7 (Harvey Specter), but data source 2 contains no corresponding entry either. Member No. 8 from data source 1 (Mike Ross) was matched although the first names “Mike” and “Michael James” as well as the address data differ. Member No. 18 (Ron Weasley) was matched even though no date of birth was maintained in data source 2, because multiple match policies were configured. Several matches could not be identified, among them Robin Scherbatsky and Hermione Granger, because no street numbers were maintained. For this use case, one should consider loosening the match strictness.

Since the Side Effect Data Level was turned on to “Basic” in this example, IDs can be used to extract matching information in the “_SYS_TASK” schema, including in the “MATCH_TRACING” view. In the following the information about the matching in the side effect tables is shown:

Side Effect Data for Matching (Source: Own Image)

It can be seen that in one case only the policy “Person and Address” matched (Ron Weasley); in all other cases the policy “Person, Address and Date” was applied. Furthermore, in one case a match was found within a single data source (Barney Stinson).

Example 2: Matching Company Data

The following example shows the matching of the organizational data from our example.

Example Flowgraph Matching Company Data (Source: Own Image)

The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):

COMPANY_ID	COMPANY_NAME	COUNTRY	POST_CODE	CITY	ADDRESS_LINE	SOURCE	ID	COMPANY	COUNTRY	ZIP	LOCATION	STREETNAME	STREETNUMBER	SOURCE	REVIEW_GROUP	CONFLICT_GROUP
1	Tim Hortons	CA	10977	Spring Valley	656 N. Stonybrook Street	1
2	Burger King	US	6095	Windsor	8354 South Drive	1	2	Burger King Food	US	6095	Windsor	South Drive	8354	2
2	Burger King	US	6095	Windsor	8354 South Drive	1	3	Burger King	US		Windsor	South Drive	8354	2
3	McDonalds	US	6095	Windsor	8354 South Drive	1
4	Kentucky Fried Chicken	US	20707	Laurel	599 Penn Ave.	1	4	KFC	US	20707	Laurel	Penn Avenue	599	2
5	Pizza Hut	US	1801	Woburn	498 Pawnee Road	1
							1	Tim Hortons	CA	10977	Spring Valley	North Stonybrook Street	656	2
							5	PIZZA HUT INC	US	1801	Woburn	Pawnee Road		2

Match Output Company Data (Example)

No match was identified for Company ID 3 (McDonalds) from data source 1; however, data source 2 contains no potential match either. In addition, it can be seen that “Kentucky Fried Chicken” (Company ID 4 in data source 1) and the abbreviation “KFC” were identified as a match. No matches were identified for Company IDs 1 and 5 (“Tim Hortons” and “Pizza Hut”), although potential matches with a changed name and address were present in data source 2.

Subsequently, the match strictness for the company name was loosened and the flowgraph was redeployed and restarted.

Matching for company names is set more loosely (Source: Own Image)

The result is now shown in the following.

COMPANY_ID	COMPANY_NAME	COUNTRY	POST_CODE	CITY	ADDRESS_LINE	SOURCE	ID	COMPANY	COUNTRY	ZIP	LOCATION	STREETNAME	STREETNUMBER	SOURCE	REVIEW_GROUP	CONFLICT_GROUP
1	Tim Hortons	CA	10977	Spring Valley	656 N. Stonybrook Street	1
2	Burger King	US	6095	Windsor	8354 South Drive	1	2	Burger King Food	US	6095	Windsor	South Drive	8354	2
2	Burger King	US	6095	Windsor	8354 South Drive	1	3	Burger King	US		Windsor	South Drive	8354	2
3	McDonalds	US	6095	Windsor	8354 South Drive	1
4	Kentucky Fried Chicken	US	20707	Laurel	599 Penn Ave.	1	4	KFC	US	20707	Laurel	Penn Avenue	599	2
5	Pizza Hut	US	1801	Woburn	498 Pawnee Road	1	5	PIZZA HUT INC.	US	1801	Woburn	Pawnee Road		2
							1	Tim Hortons	CA	10977	Spring Valley	North Stonybrook Street	656	2

Match Output Company Data with looser Firmname Settings (Example)

Company_ID 5 from data source 1 and ID 5 from data source 2 could now be identified as an additional match (“Pizza Hut” and “PIZZA HUT INC.”).
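The effect of loosening the strictness can be imitated with a similarity threshold. The following Python sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. The mapping of a 7-level scale to thresholds in `THRESHOLDS` and the function `names_match` are assumptions for illustration and say nothing about how SDQ scores names internally.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from a 7-level strictness scale to a minimum similarity.
THRESHOLDS = {1: 0.60, 2: 0.70, 3: 0.80, 4: 0.85, 5: 0.90, 6: 0.95, 7: 1.00}


def names_match(a, b, strictness):
    """Compare two names case-insensitively against the threshold of a level."""
    score = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return score >= THRESHOLDS[strictness]


print(names_match("Pizza Hut", "PIZZA HUT INC", strictness=5))
print(names_match("Pizza Hut", "PIZZA HUT INC", strictness=3))
```

In this sketch the pair is rejected at the stricter level and accepted at the looser one. The trade-off is that a looser threshold also lets more false positives through, so the result should be reviewed after each change.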

Conclusion

This blog post has demonstrated how cleansing and matching use cases can be implemented with SAP HANA by using the Smart Data Quality capabilities. Within a few implementation steps it is possible to create a proof of concept and already achieve a good matching quality. Smart Data Quality includes several rules for cleaning and validating addresses, e.g. to identify street suffixes.

I would like to emphasize again at this point that the cleansing feature provides added value even detached from matching, to be used in additional use cases. Moreover, one should consider persisting the cleansed data in order not to need a cleansing step every time the matching node is called.

Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post.

Best wishes and a happy holiday season,

Tim

Find more information and related blog posts on the topic page for Database and Data Management.

2 Comments

RAMACHANDRAN RAKHUNATHAN

December 16, 2022 at 6:35 am

Tim Huse: Great Blog . Is there an equivalent of Match and Cleanse node in Data Intelligence? (I could think of an ML operator to do a match but something else thats simpler to maintain)

Tim Huse

Blog Post Author

December 16, 2022 at 8:27 am

Hi Ramachandran, thanks for your feedback!

Unfortunately, there is no equivalent operator in Data Intelligence.
I could see two possibilities with Data Intelligence:

a) If you have a HANA in place, you can develop a flowgraph with cleansing and matching in Smart Data Quality and then call the flowgraph via Data Intelligence with this operator. SAP Data Services has a similar concept to HANA SDQ (based on the same engine, see this blog post). If you don't have a HANA but have DI and Data Services in place, you could use this operator in DI to trigger a Data Services job. If you just need the cleansing capabilities of SAP DI, then check these data quality operators out.

b) You would need to use a custom operator (for example Python Operator) and develop your own solution. There are a lot of Data Cleansing Webservices and String Matching libraries available.

Best wishes
Tim
