Automic Workload Automation


Download files from a SFTP in parallel

MarcoTizzoni604411


  • 1.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 09:43 AM
    Hello,
    we have to download a couple of thousand files from a remote SFTP server. We would like to parallelize the download across multiple servers to speed up the process.
    I thought a possible implementation could be:
    1. execute an ls command using the RA_FTP agent
    2. parse the ls output to get the list of files into a variable
    3. for each file, trigger a file download
    I could not get past the first point because the ls output is mixed with SSH debug information, and parsing it is hard and too easy to break.
    Any hints on how to overcome the issue or other possible implementations?

    Best,
    Marco


  • 2.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 09:54 AM
    If possible I would ask the provider of the files to send/upload a list of the files under a defined name; this list can then be downloaded and processed further quite easily.


  • 3.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 10:16 AM
    Unfortunately this is not possible.  :( 


  • 4.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 11:31 AM
    Your primary goal here is to speed up the download of numerous files.  Each RA-FTP job has a "number of transfer threads" option.  I suggest you run some performance tests with a larger setting and see whether you can reach your performance goals that way.

    To find this setting, go to the "FTP" tab, and on the "Command Sequence" title bar, go to the upper right corner and find the "job settings" option.

    From the RA-FTP help:
    "Limits parallel processing when wildcards are used. Default value is 2."


  • 5.  Download files from a SFTP in parallel

    Posted Jul 10, 2017 11:04 AM
    Thanks Pete,
    that could be an approach; however, we want to stay flexible and be able to scale if we need to add more hosts in the future.




  • 6.  Download files from a SFTP in parallel

    Posted Jul 17, 2017 12:02 PM
    Hi,

    So you want to spread the action of downloading these files over multiple servers? That would imply that the bottleneck you're trying to address is the I/O on the downloading server itself, and not the actual network link. While possible, this is a rather unusual scenario. Are you sure about this?

    Not to impose, but unless you are, I'd start by identifying the bottleneck. Especially with huge numbers of small files, ftp, scp and to some extent also sftp can be orders of magnitude too slow (not considering the RA agent, just by virtue of the protocols and their file handling). If you're on UNIX or otherwise able (e.g. with a Windows port), and have full SSH access on the remote side, I'd run a benchmark outside of UC4 with something like rsync, or even pipe your files through tar on the remote and local ends. That might already solve much of your problem. That, or look into a potential I/O problem on that current server :)

    If you still find that multiple downloading servers are faster even with the RA agent out of the picture and a well-performing transfer tool in use, are you sure that's not just because you're now using a greater number of TCP connections? If TCP connections are your bottleneck, that could probably be rectified on a single server as well, without spreading out to multiple servers.

    Failing all of that: if you're on UNIX (hint: it would really help a lot to know what OS this is on ;) I could possibly give you some pointers on how to separate the ls output from the SSH debug info (but why is there debug info in the first place?) and split it into parts to be used, if you'd post an example of the listing. But I doubt this alone will help much: you'd end up with a static split based on the number or names of files, which still won't guarantee an even load distribution through to the end. Also, not to bash on UC4, but even if you put that into a variable and have multiple RA agents parse it, I doubt that would be the racecar option.

    I'd personally think instead about putting my file listing (obtained from the "ls") into an SQLite database and having multiple servers each lock one (or more) records, download the respective files, then remove those records from the table. These "worker" scripts, which process the database records and do the actual downloading (a couple of lines of shell script), could then easily be triggered from UC4. A poor man's message queue :) but very scalable.

    Hope this helps.

    edit: there's also https://www.gnu.org/software/parallel, which could be used to parallelize downloads across multiple machines as well. I haven't used it yet, but it reportedly works like xargs, so it should be able to achieve a proper distribution of load over a list of filenames as well.


  • 7.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 04:36 AM
    Hello Carsten,
    thanks for your help. Maybe I can explain the context better, because it's not just about downloading files.
    What we would like to achieve is a pipeline of jobs. Each pipeline processes exactly one file, can be run independently and is idempotent. Such a setup is, I believe, simpler to understand and manage if anything bad happens during processing, which is likely in the beginning since we have thousands of files.

    To complicate things:
    • we have to check the remote site regularly because we do not know exactly when new files will show up, and a push approach cannot be used.
    • we use Windows, i.e. no rsync (I know it can be set up, but it is a mess and we would rather not do it; also, we have a lot of files but they are big, and to complicate things further, the remote site will move each file once it has been downloaded)
    For the moment I have achieved my goals using 2 workflows. The first workflow:
    1. Runs a Windows JOB on the first available host in an agent group. It connects to the remote site, retrieves the list of files, parses it and writes it to a file in a shared location.
    2. Runs a script job. It reads the file (with PREP_PROCESS_FILE), fires up the processing workflow for each entry, and waits until all triggered workflows have finished (a rough sketch is below).
    The second workflow:
    1. Runs a Windows JOB to download a file.
    2. Runs a number of processing steps.
    With this setup:
    • Only one instance of the first workflow is allowed. This avoids the complexity of managing a queue; it does reduce our parallelism, but maybe that is good enough.
    • If we add new servers we just need to add the new agent to the agent group, i.e. we can easily scale.
    • If the first workflow is still running, we know there is some processing going on, and users can check what it is.
    • The parent/child relation between the first workflow and the triggered workflows is lost (at least as far as I know). This means: 1) when all the triggered workflows have finished, we are unable to report their status back to the parent (if any workflow failed, the parent workflow should exit with an error); 2) users are not able to visually follow the parent/child relation in the monitoring view.
    • Each file can be processed independently, since each workflow can be triggered independently.
    • Using WinSCP we are able to resume file transfers (there is no such option with RA_FTP; I guess it is not supported).
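
    As a reference, here is a minimal sketch of what the script job in step 2 of the first workflow roughly does. The object name JOBP.PROCESS_ONE_FILE and the variables &AGENT# and &LIST_FILE# are placeholders rather than our real objects, and handing the file name over to the child workflow, waiting for the children and the error handling are all left out here:

    ! Read the file list written by the Windows job (step 1) and fire one
    ! processing workflow per file.
    :SET &HND# = PREP_PROCESS_FILE(&AGENT#, &LIST_FILE#)
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#)
    :   P "Activating processing workflow for &FILE#"
    !   Activated this way, the child workflow has no parent in the monitoring view.
    :   SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.PROCESS_ONE_FILE")
    :ENDPROCESS
    :CLOSE_PROCESS &HND#
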
    Best,
    Marco


  • 8.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 05:16 AM
    Hi Marco,

    Sounds like you have found your solution then. Thanks for sharing the details.

    Out of curiosity I looked into the resuming of SFTP downloads. The library used in RA_FTP (com.jcraft.jsch) can theoretically resume SFTP downloads just like WinSCP can, but it may be that the RA agent isn't making use of that particular functionality.

    Best,
    Carsten


  • 9.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 06:52 AM
    Do you know if there is a way for a child process to report its status back to the parent even if it is detached, or better, whether the parent/child relation can be preserved as in a workflow, so that all triggered workflows are shown under one parent in the monitoring interface?

    Thanks again,
    Marco


  • 10.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:02 AM
    Hi Marco,

    I'm not quite sure what you mean by that. By "child process" you don't mean an OS process but something in UC4? Unfortunately I don't think I can help you with that. There are things like :PSET for passing variables between child and parent in UC4, but I don't believe one can alter the monitor view in any way.
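
    Just to illustrate what I mean (the variable name is only an example): a task inside a workflow can publish a value to its parent with :PSET, e.g.

    ! In a child task: pass &DOWNLOAD_STATUS# on to the parent (the workflow),
    ! instead of keeping it local to this task as a plain :SET would.
    :PSET &DOWNLOAD_STATUS# = "OK"

    but that only passes data around; it does not change how the monitor displays the tasks.
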

    Best,
    Carsten.


  • 11.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:13 AM
    If I run a workflow (JOBP) then all JOBS in the workflow will appear under the parent JOBP in the monitoring view. Something like this:
    JOBP
     |--> JOBS#1
     |--> JOBS#2
     |--> JOBS#3
     |--> JOBS#4

    In my case I cannot use a JOBP, because the number of JOBS to run depends on the number of files. What I do instead is use a SCRIPT to activate a new JOBP with ACTIVATE_UC_OBJECT. This way the hierarchical view is lost, because the triggered JOBPs have no parent ID.

    Best,
    Marco





  • 12.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:44 AM
    Ah okay, now I understand. However, I still don't think there's any built-in way to display this in the monitor.

    Best,
    Carsten


  • 13.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 08:04 AM
    Marco Tizzoni said:
    In my case I cannot use a JOBP because the number of JOBS to run depends on the number of files.
    What about using a ForEach workflow based on a static VARA or a script array?


  • 14.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:15 AM
    Hi Wolfgang,
    that was the original idea, which I liked a lot because it solves the parent/child issue. However, a ForEach workflow does not support parallelism, which means it will wait for each triggered JOBS to end before triggering the next one. At least that is my understanding; maybe you have a better solution in mind.



  • 15.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:41 AM
    I have in mind a workflow (maybe a script) that creates a so-called "workload VARA" or a script array holding the files to process in configurable chunks of, say, 100.

    You could then, either via script, with a workflow (MODIFY_TASK) or hardcoded (if there are no more 100-file packs, the ForEach workflow ends with STOP, NOMSG), start many ForEach workflows, each with a bunch of 100 files.

    With that you could limit them either via a Queue or a max-parallel condition.

    -- Just an idea in my mind...



  • 16.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:55 AM
    I am not sure I fully get it, but I probably grasp the general idea. I will experiment a bit and see what comes out of it.
    Thanks.
    m-


  • 17.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 04:13 PM
    I played a bit with the logic.
    The tricky part is splitting up the full amount of files (= the list)
    into a useful number of sub-packages.

    Creating an array within a loop is not the best choice for that, I think.
    Possibly the split into some working VARA objects is better;
    with these you can run the ForEach workflow.


  • 18.  Download files from a SFTP in parallel

    Posted Jul 19, 2017 04:20 AM
    Here is a short example (attached) of how I imagined the split into different VARAs.

    just edit/start SCRI.MAIN

    &PART_SIZE# => how many files should be processed within one VARA and ForEach workflow
    &VARA_FOLDER# => folder for the working VARAs in your UI

    VARA.JOBLIST.DYNAMIC contains 25 dummy entries and is the source for the script.

    SCRI.PROCESS_ALL is a dummy for your FTP jobs (I put the working VARA name and the file name into the Archive Key for a better overview).
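
    The core of SCRI.MAIN goes roughly like this simplified sketch (not the exact script from the attachment; the working VARA names VARA.WORK.1, VARA.WORK.2, ... are placeholders and are assumed to exist already, and handing the working VARA name to each ForEach workflow, here called JOBP.FOREACH.PROCESS_ALL, is left out):

    ! Read the full file list from VARA.JOBLIST.DYNAMIC and distribute it into
    ! working VARAs of &PART_SIZE# entries each (VARA.WORK.1, VARA.WORK.2, ...).
    :SET &PART_SIZE# = 100
    :SET &COUNT# = 0
    :SET &VARA_IDX# = 1

    :SET &HND# = PREP_PROCESS_VAR("VARA.JOBLIST.DYNAMIC")
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#, 1)
    :   SET &IDX_STR# = FORMAT(&VARA_IDX#)
    :   SET &TARGET_VARA# = STR_CAT("VARA.WORK.", &IDX_STR#)
    :   PUT_VAR &TARGET_VARA#, &FILE#, &FILE#
    :   SET &COUNT# = &COUNT# + 1
    :   IF &COUNT# = &PART_SIZE#
    !     This working VARA is full: start one ForEach workflow for it and move on.
    :     SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.FOREACH.PROCESS_ALL")
    :     SET &VARA_IDX# = &VARA_IDX# + 1
    :     SET &COUNT# = 0
    :   ENDIF
    :ENDPROCESS
    :CLOSE_PROCESS &HND#

    ! Start one last ForEach workflow for the remaining partial chunk, if any.
    :IF &COUNT# > 0
    :   SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.FOREACH.PROCESS_ALL")
    :ENDIF
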






  • 19.  Download files from a SFTP in parallel

    Posted Aug 03, 2017 01:35 PM
    It looks like I have managed to get a parent workflow and all the triggered workflows activated via ACTIVATE_UC_OBJECT shown in the monitoring view under a common parent task.

    This is what I do:
    1. I build a list of all the files that I need to process (task A).
    2. When task A has finished, an empty script (task B) with a breakpoint set is reached, and another script (task C) is triggered via ACTIVATE_UC_OBJECT.
    3. Task C modifies the parent workflow, adding a new sub-workflow per file and setting the &FILENAME# variable in its input prompt.
    4. The breakpoint is removed.
    This is the relevant part of the script (task C); maybe it will prove useful to somebody else.
    :ON_ERROR ABEND
    !:WAIT 5
    !!! STOP THE WF FOR MODIFICATION
    :SET &RUNID# = GET_UC_OBJECT_NR("WF.FULL_COLLECTION.NEW")
    :P "RunID: &RUNID#"
    :SET &STATUS# = GET_STATISTIC_DETAIL(&RUNID#,STATUS)
    :P "Status: &STATUS#"
    :SET &RETSTOP# = MODIFY_TASK(&RUNID#, STOP_MODIFY)
    :P "Stop ret: &RETSTOP#"
    ! To process a file PREP_PROCESS_FILE needs a fixed host.
    ! If the host is unavailable the job will fail. To fix this we use hostgroups and pick
    ! the first available host.
    :SET &HND# = PREP_PROCESS_AGENTGROUP(&HOST_G#,"*",ALL)
    :PROCESS &HND#
    :   SET &STATUS# = GET_PROCESS_LINE(&HND#,2)
    :   IF &STATUS# = "Y"
    :     SET &AGENT# = GET_PROCESS_LINE(&HND#,1)
    :     PRINT "Agent: &AGENT#, Status: &STATUS#"
    :   ENDIF
    :ENDPROCESS
    :CLOSE_PROCESS &HND#
    :SET &HND# = PREP_PROCESS_FILE(&AGENT#, &TEMP_FILE#)
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#)
    :   PRINT &FILE#
    :   SET &RET# = MODIFY_TASK(&RUNID#, "WF.FETCH_AND_PREPROCESS",, ADD_TASK)
    :   PRINT &RET#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#, "WF.FETCH_AND_PREPROCESS", &RET#, VALUE, "PRPT.WF.FETCH_AND_PREPROCESS.FILENAME", "FILENAME#", &FILE#)
    :   PRINT &MODIFY#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#,, &RET#, ADD_DEPENDENCY, "SCRI.EMPTY",, "ANY_OK")
    :   PRINT &MODIFY#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#, "END",, ADD_DEPENDENCY,, &RET#, "ANY_OK")
    :   PRINT &MODIFY#
    :ENDPROCESS
    :SET &RET# = MODIFY_TASK(&RUNID#, "SCRI.EMPTY",, BREAKPOINT, "NO")
    :SET &RETCOMMIT# = MODIFY_TASK(&RUNID#, COMMIT)
    :SET &RETGO# = MODIFY_TASK(&RUNID#, GO)
    It still needs improvement (error handling is missing, for example). I also have the problem that all workflows are scheduled on the same agent, even though the parent workflow has the "Workflow tasks of the same AgentGroup should use the same Agent" option unchecked. Any hint on that?

    Best,
    Marco


  • 20.  Download files from a SFTP in parallel

    Posted Aug 11, 2017 07:14 AM
    I found a better way to implement this, using Job Groups.
    1. I created a JOBG.
    2. I set the JOBG as the group of the child workflow.
    3. The first task of the parent workflow activates all needed instances of the child workflow (via ACTIVATE_UC_OBJECT). However, since they are assigned to the JOBG, they do not run immediately.
    4. When the first task has completed, control passes to the JOBG, which is activated by the parent workflow itself.
    5. The Job Group then starts all activated instances in parallel.
    With this solution the hierarchical view in the monitoring panel is preserved, i.e. it shows the parent workflow, underneath it the JOBG, and one level below all the triggered instances.
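
    As a minimal sketch of step 3 (the object names JOBG.DOWNLOAD and JOBP.DOWNLOAD_ONE_FILE are just placeholders, and handing the file name over to each instance is omitted here):

    ! First task of the parent workflow: queue one child instance per file.
    ! This line runs once per file, e.g. inside a file-list loop.
    ! JOBP.DOWNLOAD_ONE_FILE has JOBG.DOWNLOAD set as its group, so the instances
    ! activated here wait until the JOBG task runs later in this same parent
    ! workflow, and only then do they all start in parallel.
    :SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.DOWNLOAD_ONE_FILE")
    :P "Queued child instance with RunID &ACT#"
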


  • 21.  Download files from a SFTP in parallel

    Posted Aug 25, 2017 03:05 PM
    That's exactly the solution we implemented last year. The JOBG object is very powerful!!!!


  • 22.  Download files from a SFTP in parallel

    Posted Aug 26, 2017 08:34 AM
    Just for information, the JOBG is/was used intensively by long-time "UC4" users, ever since it became available. Sometimes returning to the basics is a good thing  ;)
    Reminder: using the PASS_VALUES option in ACTIVATE_UC_OBJECT is also useful for transferring parameters to the child process, like a file name retrieved from a variable populated in a previous job.
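
    Something like this, as a sketch (the object and variable names are only examples, and the exact parameter position of PASS_VALUES should be checked against the ACTIVATE_UC_OBJECT documentation for your AE version):

    ! &FILENAME# matches a PromptSet variable of the child workflow;
    ! PASS_VALUES hands the caller's variable values over to the activated object.
    :SET &FILENAME# = "FILE_0001.csv"
    :SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.DOWNLOAD_ONE_FILE",,,,,,, PASS_VALUES)
    :P "Activated child with RunID &ACT#"
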


  • 23.  Download files from a SFTP in parallel

    Posted Aug 29, 2017 06:45 AM
    Yeah, the thing is that I am a new user of AWA, and while the documentation is quite extensive, it does not explain in which context objects/settings should be used, so it is difficult if you do not have somebody pointing you in the right direction or do not know what to look for. Luckily we had a good consultant in for a few hours and he tipped me off.