Layer7 API Management

  • 1.  Replication Auto-Fail with error:1236

    Posted Sep 18, 2020 02:12 AM
    Hi All,

We have a cluster of 2 nodes, GW1 and GW2, with their databases SSG1 and SSG2 respectively. Our primary database SSG1 shut down automatically due to a technical issue. When we restarted the mysql service for the SSG1 database, we noticed the below error on the secondary database SSG2:

    Last_IO_Errno: 1236

    Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
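    For reference, these fields come from the standard MySQL replication status check, which we ran on the secondary:

    [secondary]
    mysql> SHOW SLAVE STATUS\G
    ...
    (output includes the Last_IO_Errno and Last_IO_Error fields shown above)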


    Based on the tech doc link below, we understand that restarting replication of the secondary database is the correct solution. This is described in the section 'When Secondary Node Slave Fails' at the following link:

    https://techdocs.broadcom.com/us/en/ca-enterprise-software/layer7-api-management/api-gateway/9-4/install-configure-upgrade/configure-a-gateway-cluster/configuring-cluster-database-replication/restart-replication.html

    Please let us know if this is the correct solution. If yes, there is a step in that link, shown below, which is to be performed on the primary node (GW1):

    3. Restart the replication on the Primary database node:
    [primary]
    # ./restart_replication.sh
    Enter hostname or IP for the MASTER: [SET ME] machine.mycompany.com
    Enter replication user: [repluser] 
    repluser
    ........

    As per our understanding, the hostname entered here should be the secondary gateway hostname, GW2, as it is the master of the primary node. Please confirm whether this understanding is correct.

    Regards,
    Rohan

    ------------------------------
    [Technology Architect]
    [Infosys Limited]
    ------------------------------


  • 2.  RE: Replication Auto-Fail with error:1236

    Broadcom Employee
    Posted Sep 18, 2020 09:10 AM
    Hello Rohan,

    This error is common when the database partition gets full and is then fixed without repairing replication. Error 1236 means the slave is asking the master for a binary log file that the master no longer lists in its binary log index (for example, because the binary logs were purged or reset when the master was recovered). Yes, reinitializing replication is the recommended step, and it is done on both the primary and the secondary. The KB below may better state the steps required.

    Reinitialize replication in a multi-node cluster: https://knowledge.broadcom.com/external/article?articleId=44402
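    As a quick health check once replication is reinitialized, you can run the standard MySQL status command on each node; the fields below are standard replication status fields:

    mysql> SHOW SLAVE STATUS\G
    ...
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Last_IO_Errno: 0
    ...

    Both threads should report Yes once replication is re-established.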

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: Replication Auto-Fail with error:1236

    Posted Sep 18, 2020 10:32 AM
    Hi Matthew,

    Thanks for your reply. We don't think that our database got full. Can you provide the commands to check the database size?

    Also, I am getting a bit confused by the below step of the replication procedure:

    3. Restart the replication on the Primary database node:
    [primary]
    # ./restart_replication.sh
    Enter hostname or IP for the MASTER: [SET ME] machine.mycompany.com
    Enter replication user: [repluser] 
    repluser
    ........
    I am running this command on the primary gateway GW1. Here it asks to enter the hostname for the master. Ideally, the master for GW1 will be GW2. So should I put the GW2 hostname here, or GW1 (the primary gateway where I am running this command)?

    Please help me out in this step.

    Regards,
    Rohan

    ------------------------------
    [Technology Architect]
    [Infosys Limited]
    ------------------------------



  • 4.  RE: Replication Auto-Fail with error:1236
    Best Answer

    Broadcom Employee
    Posted Sep 18, 2020 11:47 AM
    Edited by ROHAN SINHA Sep 18, 2020 12:21 PM
    Hello Rohan,

    Please understand that you can get this error for more reasons than the partition being full; that was just a common example.

    The MySQL partition in our OVA/appliance builds should be /var/lib/mysql; you can use commands such as df -h to check the size and space used.
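    For example, assuming the default appliance layout (adjust the path if your build differs):

    # Free space on the partition holding the MySQL data directory
    df -h /var/lib/mysql

    # Approximate size per database in MB, from inside mysql
    mysql> SELECT table_schema,
                  ROUND(SUM(data_length + index_length)/1024/1024, 1) AS size_mb
           FROM information_schema.tables
           GROUP BY table_schema;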

    I would look over the provided KB, as it answers your question; I shared it because I feel it provides more insight into this procedure.

    Here is a quick look at the KB where it addresses this question:

    1. Execute the create_slave.sh script on the primary node: /opt/SecureSpan/Appliance/bin/create_slave.sh
    Provide the fully qualified domain name (FQDN) of the secondary node when prompted.

    2. Execute the create_slave.sh script on the secondary node: /opt/SecureSpan/Appliance/bin/create_slave.sh
    Provide the FQDN of the primary node when prompted.
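    Mapped to your GW1/GW2 naming, the flow is roughly as follows (the FQDNs are placeholders, and the script's exact prompt text may differ):

    [primary GW1]
    # /opt/SecureSpan/Appliance/bin/create_slave.sh
    ... when prompted, enter the secondary's FQDN, e.g. gw2.mycompany.com

    [secondary GW2]
    # /opt/SecureSpan/Appliance/bin/create_slave.sh
    ... when prompted, enter the primary's FQDN, e.g. gw1.mycompany.com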

    The KB provides a bit more detail on these steps.

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 5.  RE: Replication Auto-Fail with error:1236

    Posted Sep 18, 2020 12:21 PM
    Thanks Matthew,

    We followed the steps in the KB link you provided, and after that replication is working. So the issue is resolved now.



    Regards,
    Rohan

    ------------------------------
    [Technology Architect]
    [Infosys Limited]
    ------------------------------