Reading RSS Feeds

Is it possible to have Clover read the content in an RSS feed? I’m able to parse the feed page and see the recent posts using the XML Reader. However, I would like to pull the links and the content behind the posts. I’m passing these links (each essentially another web page) on to another XML Reader to read the content, but all I get is

java.net.SocketException: Unexpected end of file from server

when it tries to open the page.

Any help would be appreciated.

Thanks

Hello,
please see the attached graph. It reads RSS data from the BBC server, passes the URL of each article to a DataReader, and saves the full article in the output file. Note that you need to increase some default properties to run the graph successfully:

DataParser.FIELD_BUFFER_LENGTH = 262304
Record.MAX_RECORD_SIZE = 524608
DataFormatter.FIELD_BUFFER_LENGTH = 262304
DEFAULT_INTERNAL_IO_BUFFER_SIZE = 262304

Thanks Agata. That worked great. Can you also use an XML Extract component to parse HTML? I’ve been trying without much luck. I was trying to get the text in the tags of the document. After doing some research, I think I’m going down the wrong path here…

Thanks again

Hello,
this is a question about parsing an HTML document, which is not easy. If you know the exact structure of the document, you will probably be able to get just the text content, but I don’t see how to get only the article’s text in our example.

One of my co-workers suggested using a regex to grab the body of the web page so I tried going down this path. I have the content of the web page coming through as a record. I’m then using a reformat component to apply a regex and extract the body of the page. It looks like the regex works in the regex tester but doesn’t seem to work in the CTL2 code using the find() function. When I debug the output of the reformat component, no data is being placed into $0.body.

Am I using the find() function incorrectly? If not, is there a better approach here?


//#CTL2
string strBody = "";

// Transforms input record into output record.
function integer transform() {
	
	foreach (string item : find($0.content,'\<body\>.\</body\>'))
	{
		strBody	= concat(strBody, item);
	}

	$0.content = "none";
	$0.body = strBody;

	return ALL;
}

Hello,
the problem is that CTL regexp doesn’t support flags, so it is impossible to apply the pattern .* to multiline input (https://bug.javlin.eu/browse/CL-1929). As a workaround, you need to use a Java class (attached).
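
The attached class is not reproduced in this thread. As a rough idea of what such a workaround could look like, here is a minimal sketch in plain Java, assuming the goal is simply to let `.` match across line breaks via `Pattern.DOTALL`. The class name `BodyExtractor`, the method `extractBody`, and the pattern itself are hypothetical and chosen only for illustration; wiring the class into a Clover graph is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the class attached to the original post.
public class BodyExtractor {

    // DOTALL makes '.' match line terminators too, so ".*" can span lines.
    // The pattern is an illustrative assumption about what the page looks like.
    private static final Pattern BODY =
            Pattern.compile("<body.*/body>", Pattern.DOTALL);

    // Concatenates every match found in the content; returns "" if none.
    public static String extractBody(String content) {
        StringBuilder body = new StringBuilder();
        Matcher matcher = BODY.matcher(content);
        while (matcher.find()) {
            body.append(matcher.group());
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>";
        System.out.println(extractBody(page));
    }
}
```

Without the `DOTALL` flag the same pattern finds nothing in a multiline page, because `.` stops at each line break. Note that `.*` is greedy, so the match runs from the first `<body` to the last `/body>` in the input.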

Great! Thank you Agata.

At last I realized that CTL regexp does support flags :-). This is covered in the description of the replace function on the String Functions page. So the following CTL code works as well:

//#CTL2

// Transforms input record into output record.
function integer transform() {

	string strBody = "";

	// (?s) is an embedded flag that lets '.' match line breaks as well
	foreach (string item : find($0.content, "(?s)<body.*/body>"))
	{
		strBody = concat(strBody, item);
	}

	$0.content = "none";
	$0.body = strBody;

	return ALL;
}