Is it possible to have Clover read the content in an RSS feed? I’m able to parse the feed page and see the recent posts using the XML Reader. However, I would like to be able to pull the links and the content behind the posts. I’m passing these links (each essentially another web page) on to another XML Reader to read the content, but all I get is
java.net.SocketException: Unexpected end of file from server
Hello,
please see the attached graph. It reads RSS data from the BBC server, passes the URL of each article to a DataReader, and saves the full article in the output file. Note that you need to increase some default properties to run the graph successfully:
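For readers without the attachment, the first step of that flow (pulling each article URL out of the feed) can be sketched in plain Java. This is only an illustration under assumptions, not the contents of the graph: the class name RssLinkExtractor, the method extractItemLinks, and the sample feed are made up for the example, and it assumes a plain RSS 2.0 layout where each item has a link child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Clover: collects article URLs from an RSS 2.0 feed.
class RssLinkExtractor {

    // Returns the trimmed text of the first <link> inside every <item>.
    // The channel-level <link> is skipped because only <item> elements are searched.
    static List<String> extractItemLinks(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
            List<String> links = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList itemLinks = item.getElementsByTagName("link");
                if (itemLinks.getLength() > 0) {
                    links.add(itemLinks.item(0).getTextContent().trim());
                }
            }
            return links;
        } catch (Exception e) {
            // Parsing problems are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not parse RSS feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed; a real run would fetch the feed text first.
        String sample = "<rss version=\"2.0\"><channel>"
                + "<title>Sample</title><link>http://example.com/</link>"
                + "<item><title>First</title><link>http://example.com/first</link></item>"
                + "<item><title>Second</title><link>http://example.com/second</link></item>"
                + "</channel></rss>";
        for (String link : extractItemLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Each returned URL would then be handed to whatever reads the article page, which is the role the second reader plays in the graph.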
Thanks Agata. That worked great. Can you also use an XML Extract component to parse HTML? I’ve been trying without much luck. I was trying to get the text in the tags of the document. After doing some research, I think I’m going down the wrong path here…
Hello,
this is a question about parsing an HTML document, which is not easy. If you know the exact structure of the document, you will probably be able to get the text content only, but I don’t see how to extract just the article’s text in our example.
One of my co-workers suggested using a regex to grab the body of the web page so I tried going down this path. I have the content of the web page coming through as a record. I’m then using a reformat component to apply a regex and extract the body of the page. It looks like the regex works in the regex tester but doesn’t seem to work in the CTL2 code using the find() function. When I debug the output of the reformat component, no data is being placed into $0.body.
Am I using the find() function incorrectly? If not, is there a better approach here?
Hello,
the problem is that CTL regexp doesn’t support flags, so the pattern .* cannot match across line breaks in the multiline input (https://bug.javlin.eu/browse/CL-1929). As a workaround, you need to use a Java class (attached).
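To illustrate the idea behind such a workaround, here is a minimal Java sketch. It is not the attached class: the name BodyExtractor, the method extractBody, and the body pattern itself are assumptions made for the example. The point is only that Pattern.DOTALL lets the dot match line terminators, which a plain .* does not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the class attached to the post.
class BodyExtractor {

    // DOTALL lets '.' match line terminators; CASE_INSENSITIVE also accepts <BODY>.
    // The pattern is an assumed example, not the regex used in the original graph.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between <body ...> and </body>, or null if there is no match.
    static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body class=\"x\">\nline one\nline two\n</body>\n</html>";
        System.out.println(extractBody(html));

        // The same pattern compiled without DOTALL cannot cross the line breaks.
        boolean foundWithoutFlag =
                Pattern.compile("<body[^>]*>(.*?)</body>").matcher(html).find();
        System.out.println(foundWithoutFlag); // prints false
    }
}
```

A regex is still a blunt tool for HTML, so this only suits pages whose structure you know in advance.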
I finally realized that CTL regexp does support flags :-). This is covered in the description of the replace function on the String Functions page, so the following code in CTL works as well:
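As a Java-side illustration of the same idea (this is not the CTL code from the post), flags can be embedded at the start of the pattern text, so no separate flags argument is needed. The sketch assumes that CTL accepts the same inline syntax as java.util.regex; the class name InlineFlagDemo and the body pattern are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; shows inline regex flags in java.util.regex.
class InlineFlagDemo {

    // "(?s)" turns on DOTALL and "(?i)" turns on case-insensitive matching,
    // both written inside the pattern string itself. The pattern is an assumed example.
    static final String BODY_REGEX = "(?si)<body[^>]*>(.*?)</body>";

    // Returns the text between <body ...> and </body>, or null if there is no match.
    static String extractBody(String html) {
        Matcher m = Pattern.compile(BODY_REGEX).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<BODY>\nfirst line\nsecond line\n</BODY>\n</html>";
        System.out.println(extractBody(html));
    }
}
```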