Test Data Manager


Data Generator and duplicate data

  • 1.  Data Generator and duplicate data

    Posted Apr 24, 2020 02:45 PM
    Hi,

    I'm using the Generator in the TDM portal, version 4.9. I want to randomly generate data without any duplicates. The option "On Generated Duplicate" is set to Remove, which does the job. My problem is that I ask for 20K rows, but after the duplicates are removed I end up with fewer than 20K. To be clear: say I ask for 20K and TDM removes 5K duplicates; I end up with 15K generated values instead of my target of 20K.

    Is there a way to get 20K rows without duplicates?

    Thanks for your help


  • 2.  RE: Data Generator and duplicate data

    Broadcom Employee
    Posted Apr 25, 2020 03:30 PM
    What is your primary key and how are you generating data for this key?


  • 3.  RE: Data Generator and duplicate data

    Posted Apr 27, 2020 01:35 PM

    Billy,

    Thanks for your help.

    Here's the Oracle table definition:

    CREATE TABLE SCRAMBLE.TEST_DUPLICATE_REPORT(COL1  VARCHAR2(1000 CHAR));
    CREATE UNIQUE INDEX SCRAMBLE.TEST_DUPLICATE_REPORT_IDX ON SCRAMBLE.TEST_DUPLICATE_REPORT(COL1);
    ALTER TABLE SCRAMBLE.TEST_DUPLICATE_REPORT ADD (
      CONSTRAINT TEST_DUPLICATE_REPORT_IDX
      PRIMARY KEY (COL1)
      USING INDEX SCRAMBLE.TEST_DUPLICATE_REPORT_IDX
      ENABLE VALIDATE);

    I generate 1 column with the following (quite easy) function:
    @randlov(0,@list(Company A,Company B,Company C,Company D,Company E)@)@

    Here's the PUBLISH screen snapshot. I hope it is readable; if not, I can repost it.


    At the end, I have a CSV file with fewer than 5 rows.

    Let me know if you need more detailed information.


  • 4.  RE: Data Generator and duplicate data

    Broadcom Employee
    Posted May 01, 2020 04:47 AM
    Random doesn't guarantee uniqueness. A lot depends on the number of objects in the list that you are randomly extracting from. 5 is a really small number. I randomly selected from a list of 300 and they were unique.

    Maybe we should go back to what you are trying to accomplish. From the above, you want to create 20,000 rows. Let's concentrate on the value (column) that you are getting the duplicates on. When you generate these 20,000 rows, where does the data for this particular column come from?

    Are you using a sqllist from the data painter and extracting the data from another database table? If yes, how many unique items are in this table?

    Are you pulling data from a seedlist?

    regards


  • 5.  RE: Data Generator and duplicate data

    Posted May 01, 2020 08:32 AM
    I understand that random doesn't guarantee uniqueness, and that's the reason why I use the remove duplicates option.
    Of course, the 5 is only a test table to demonstrate what happens when I use the remove duplicates option.

    For bigger numbers, I have two cases:

    1) I have an Oracle table with 49K unique company names. I ask to generate a text report with 37K companies. I was using a randlov-SQLlist function. This produces some duplicates. So I activated the remove duplicates option and things went well (meaning no more duplicates), but the number generated was lower than 37K. So the option removes the duplicates as expected, but I cannot easily reach my target of 37K.

    2) Right now, I have a request to generate a list of first names and last names. I will use 2 seedlists. The pair (first name, last name) has to be unique, but a first name may appear multiple times (and a last name too). I'm working on it right now.

    Again, like I said, the remove duplicates option works well. My concern is that I cannot reach my target number of generated lines.
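
The shortfall in case 1 can be estimated in advance. The sketch below is an editorial illustration, not TDM code: it assumes each random pick is uniform and independent of the others (an assumption about how randlov behaves, not something confirmed in this thread), and the function name `expected_distinct` is made up for the example.

```python
def expected_distinct(pool_size: int, draws: int) -> float:
    """Expected number of distinct values after `draws` uniform,
    independent picks (with replacement) from `pool_size` values.

    Assumption: every value is equally likely on every pick.
    """
    # Probability that one given value is never picked: (1 - 1/N) ** n
    never_picked = (1.0 - 1.0 / pool_size) ** draws
    return pool_size * (1.0 - never_picked)


if __name__ == "__main__":
    # Case 1 from the post: 49K source names, 37K rows requested.
    print(round(expected_distinct(49_000, 37_000)))
    # The 5-value test list with 20K rows requested.
    print(round(expected_distinct(5, 20_000)))
```

Under that assumption, 37K picks from 49K names leave roughly 26,000 distinct rows once duplicates are removed, and the 5-value test list can never yield more than 5.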

    So, my question was: is there a way to remove the duplicates AND reach the number of lines I need?
    Maybe this is the way the remove duplicates option is designed, or maybe I misunderstood something.


  • 6.  RE: Data Generator and duplicate data

    Broadcom Employee
    Posted May 01, 2020 09:01 AM
    I now understand what you are asking. Sorry about the misunderstanding.
    You raise a very good question, and I think that is how it works today.

    I will double check and come back to you.


  • 7.  RE: Data Generator and duplicate data
    Best Answer

    Broadcom Employee
    Posted May 05, 2020 04:32 PM
    Following up on my last update: this is how publish works. Any duplicates are counted as part of the overall row count.
    regards
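
To make the difference concrete, here is a small editorial simulation of the two counting orders. It is a sketch only: `publish_then_dedupe` mimics the behaviour described in this answer, `dedupe_until_quota` is the behaviour the original poster was hoping for, and both names are hypothetical rather than TDM functions.

```python
import random


def publish_then_dedupe(pool, requested, rng):
    """Draw `requested` rows first, then drop duplicates.

    The duplicates count toward the quota, so the result may be short.
    """
    rows = [rng.choice(pool) for _ in range(requested)]
    return list(dict.fromkeys(rows))  # keeps first occurrence, preserves order


def dedupe_until_quota(pool, requested, rng):
    """Hypothetical alternative: keep drawing until `requested`
    distinct rows exist, so duplicates do not consume the quota."""
    if requested > len(set(pool)):
        raise ValueError("pool has fewer distinct values than requested")
    seen = {}
    while len(seen) < requested:
        seen.setdefault(rng.choice(pool), None)
    return list(seen)


if __name__ == "__main__":
    rng = random.Random(2020)
    pool = [f"Company {i}" for i in range(1000)]
    print(len(publish_then_dedupe(pool, 800, rng)))  # usually well below 800
    print(len(dedupe_until_quota(pool, 800, rng)))   # always 800
```

The second loop slows down as the quota approaches the number of distinct values in the pool, because most new picks are then repeats.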


  • 8.  RE: Data Generator and duplicate data

    Posted May 06, 2020 01:41 PM
    Thank you Billy for your help.

    I'll see with the team how we can work with that.

    JF


  • 9.  RE: Data Generator and duplicate data

    Posted Jun 25, 2020 07:51 AM
    Hi Jean,

    If you use RANDLOV to generate even 50% of the records, you will end up with duplicates, so use SEQLOV instead. If you want random order, there is a parameter that shuffles the records once before the run, so each run gives you a different set.

    If you think your source has duplicates, and you also pull data for multiple columns from the same source, I suggest writing a DISTINCT query on the key column, using SEQLOV on it, and then using these values as the reference to generate data for the other columns.

    ------------------------------
    Thanks,
    Rajkumar
    ------------------------------
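
The advice above amounts to sampling without replacement: remove duplicates from the source, shuffle once, then read sequentially. A minimal editorial sketch of that idea follows; `shuffled_sequential` is a hypothetical name and nothing here is TDM syntax.

```python
import random


def shuffled_sequential(values, count, seed=None):
    """Shuffle the distinct values once, then take the first `count`.

    Reading a shuffled list in order can never repeat a value, so the
    requested row count is met exactly whenever the source is big enough.
    """
    distinct = list(dict.fromkeys(values))  # like SELECT DISTINCT, order kept
    if count > len(distinct):
        raise ValueError("source has fewer distinct values than requested")
    rng = random.Random(seed)
    rng.shuffle(distinct)
    return distinct[:count]


if __name__ == "__main__":
    source = ["Company A", "Company B", "Company C", "Company B", "Company D"]
    print(shuffled_sequential(source, 3))
```

Leaving `seed` unset gives a different order on each run, which matches the "different set every time" effect described in the post.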



  • 10.  RE: Data Generator and duplicate data

    Posted Jul 10, 2020 08:33 AM
    Hi,

    Right now I'm using SEQLOV-SQLLIST with the shuffle option (of course, the source table does not have duplicates). That way I get a result similar to RANDLOV, but when I want to avoid duplicates across multiple columns (say I want a unique combination of first_name and last_name), it is more complex to implement.

    My life would be easier if TDM removed the duplicates before the quota calculation :)

    BTW, Sorry for the delay... summer time :)
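
For the unique first_name/last_name combination mentioned in this post, one possible approach outside TDM is to number every possible pair and sample those numbers without replacement. This is an editorial sketch under that assumption; `unique_name_pairs` and the sample names are invented for illustration.

```python
import random


def unique_name_pairs(first_names, last_names, count, seed=None):
    """Return `count` distinct (first, last) pairs.

    Individual first or last names may repeat; only the pair is unique.
    """
    firsts = list(dict.fromkeys(first_names))
    lasts = list(dict.fromkeys(last_names))
    total = len(firsts) * len(lasts)
    if count > total:
        raise ValueError("not enough distinct combinations")
    rng = random.Random(seed)
    # Each integer in range(total) encodes exactly one pair, so sampling
    # integers without replacement yields pairs without duplicates.
    picks = rng.sample(range(total), count)
    return [(firsts[i // len(lasts)], lasts[i % len(lasts)]) for i in picks]


if __name__ == "__main__":
    pairs = unique_name_pairs(["Ann", "Bob", "Cy"], ["Lee", "Roy"], 5)
    for first, last in pairs:
        print(first, last)
```

Sampling from `range(total)` avoids building the full list of combinations in memory, which matters when both seedlists are large.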