How to count the number of files found by the ListFiles component

Hi,

I want to count how many files are on HDFS.

If there is no file, I want to send an email saying that the file was not received. If there is a file, I want to execute a jobflow.

If HDFS has files, I can count the number of files found by the ListFiles component.

However, if there are no files on HDFS, the components after the ListFiles component do not seem to execute.

The code for the Denormalizer component is as follows.

//#CTL2
integer fileCount;
// This transformation defines the way in which multiple input records 
// (with the same key) are denormalized into one output record. 

// This function is called for each input record from a group of records
// with the same key.
function integer append() {
	if($in.0.URL != "" && $in.0.isFile && !$in.0.isDirectory && $in.0.size > 0){
		fileCount += 1;
	}
	return OK;
}

// This function is called once after the append() function was called for all records
// of a group of input records defined by the key.
// It creates a single output record for the whole group.
function integer transform() {
	$out.0.count = fileCount;
	return OK;
}

// Called during component initialization.
// function boolean init() {}
function boolean init() {
	fileCount = 0;
	return true;
}

Thanks,
Davis Wang

Hello Davis,
Judging from your screenshots, CloverDX Designer is in fact working as expected in this particular case. The EmailSender component does not send an email because no input records flow into it. No records flow into EmailSender because the ListFiles component does not find any files in the location set in the ‘File URL’ property, so it sends no records on the edge.
Nevertheless, this can be easily overcome by redesigning your jobflow slightly. The attached jobflows show 2 possible approaches. The following points are worth noting:
fileExists_masterJob.jbf

  • Notice the DataGenerator component and SimpleGather component. The idea here is to send a single dummy record alongside the records from ListFiles.

  • This will ensure that even if there are no files to be found by ListFiles, you will still get at least a single record that can be sent to EmailSender using the Condition (or Filter) component, thus sending the email.
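
To illustrate the dummy-record idea, here is a minimal CTL2 sketch of a Denormalizer that counts only the real files. It is not taken from the attached jobflow. It assumes that the dummy record from DataGenerator arrives with a null URL, that all records fall into a single group, and that the field names (URL, isFile, count) match the ones in the question.

```
//#CTL2
// Sketch only: assumes the dummy record has a null URL
// and that all incoming records belong to one group.
integer fileCount = 0;

// Called for each input record, including the dummy one.
function integer append() {
	// Count real files only; the dummy record (null URL) is skipped.
	if (!isnull($in.0.URL) && $in.0.isFile) {
		fileCount = fileCount + 1;
	}
	return OK;
}

// Called once after all records of the group were appended.
function integer transform() {
	// fileCount is still 0 if only the dummy record arrived.
	$out.0.count = fileCount;
	return OK;
}
```

With a count of 0 on the output, the Condition (or Filter) component can route the record to EmailSender; any other value can trigger the jobflow.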

fileExists_masterJob2.jbf

  • You can split the data flow into 2 phases and pass the count from one phase to the other. Note that I am using the dictionary feature for this purpose.

  • In phase 0, ListFiles will send out records based on how many files it manages to find. If no files are found, it will not send any records on the edge (again, this is by design).

  • Using Aggregate, I get the count of those records and write it into the dictionary. Note that if ListFiles does not send any records, no value is written, so the dictionary entry remains null.

  • In phase 1, I read the dictionary value and, using the Condition component, fork the data flow based on whether the value is null or not.
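
As a rough sketch of the phase 1 step, the dictionary value could also be normalized in a Reformat-style transformation before the Condition component. This is a variation on the null check described above, not the content of the attached jobflow. The dictionary entry name fileCount and the output field count are assumptions; the entry would have to be declared in the jobflow's dictionary as an integer.

```
//#CTL2
// Sketch only: "fileCount" is a hypothetical integer dictionary entry
// that phase 0 fills with the record count from Aggregate.
function integer transform() {
	// A null entry means ListFiles sent no records in phase 0,
	// so treat it as zero files.
	$out.0.count = nvl(dictionary.fileCount, 0);
	return OK;
}
```

The Condition component can then test whether count is greater than 0 instead of testing for null, which keeps the "no files" case and the "files found" case on the same field.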

Kind regards,
Vladi

Thanks for the reply Vladi.

The method you provided is very good. I have learnt a lot.

Thanks,
Davis Wang