Open source Java library for HTML to text conversion

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but if you scroll down the homepage a bit there's a link to it.

Answer:1

Try HtmlUnit. It even shows the page after processing JavaScript / Ajax.

Answer:2

The bliki engine can do this in two steps. See info.bliki.wiki / Home

  1. How to convert HTML to MediaWiki text -- MediaWiki text is already a fairly plain text format, but you can convert it further
  2. How to convert MediaWiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
// readFile is your own helper: get the HTML content as a string
String sbodyhtml = readFile( infilepath );
// step 1: HTML to wiki text
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
// step 2: wiki text to plain text
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );

Jsoup can do this more simply:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will be no newlines at all.

Answer:3

I use TagSoup. It is available for several languages and does a really good job with HTML found "in the wild". It produces a cleaned-up version of the document, as either HTML or XML, that you can then process with a DOM or SAX parser.
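
As a rough sketch of the "process it with a SAX parser" step: the handler below collects text from SAX events and starts a new line after block-level elements. The class name SaxTextCollector, its method names and the list of block elements are made up for illustration and are not part of TagSoup. The demo uses the JDK's own XML parser, so it only accepts well-formed markup. As far as I know TagSoup's org.ccil.cowan.tagsoup.Parser is an XMLReader as well, so it should be possible to pass it to the same toText method to cope with messy HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects text from SAX events, one line per block element. */
final class SaxTextCollector extends DefaultHandler {
    // Elements that end a line; an illustrative list, not a complete one.
    private static final Set<String> BLOCK_TAGS = new HashSet<>(
            Arrays.asList("p", "div", "br", "li", "tr", "h1", "h2", "h3"));

    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks; just append each one.
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // localName is empty when the reader is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (BLOCK_TAGS.contains(name.toLowerCase(Locale.ROOT))) {
            out.append('\n');
        }
    }

    /** Parses the markup with any XMLReader and returns the collected text. */
    static String toText(XMLReader reader, String markup) throws IOException, SAXException {
        SaxTextCollector handler = new SaxTextCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.out.toString().trim();
    }

    /** Convenience wrapper using the JDK parser (well-formed input only). */
    static String toTextWithJdkParser(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return toText(reader, markup);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>First <b>bold</b> line</p><p>Second</p></body></html>";
        // Prints "First bold line" and "Second" on separate lines.
        System.out.println(toTextWithJdkParser(xhtml));
    }
}
```

Unlike a plain "strip all tags" approach, this keeps one line per paragraph. The JDK parser will reject things like unclosed tags or &nbsp; entities, which is exactly where a tolerant reader such as TagSoup would come in.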

Answer:4
