Monday 6 February 2023

How to ignore special characters in Tesseract OCR using java

 


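In Tesseract you can set TessBaseAPI.VAR_CHAR_WHITELIST or TessBaseAPI.VAR_CHAR_BLACKLIST to make the engine ignore certain special characters. These constants correspond to the Tesseract variables tessedit_char_whitelist and tessedit_char_blacklist.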

The following makes Tesseract recognize only the uppercase letters A-Z and digits:

String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);

The next snippet lets Tesseract recognize everything except the blacklisted characters, here ~ and fl:

String blackList = "~fl";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);
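
If unwanted characters still show up in the recognized text, one fallback is to filter the OCR result string yourself. The sketch below is plain Java post-processing, not a Tesseract feature: the class OcrTextFilter and its methods keepOnly and dropAll are hypothetical names made up for this illustration. It treats each list as a set of single characters, which is an assumption of this sketch.

```java
// Hypothetical helper (not part of Tesseract or tess-two): filters text
// that has already been returned by the OCR engine.
class OcrTextFilter {

    // Keep only the characters that appear in whiteList.
    static String keepOnly(String text, String whiteList) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (whiteList.indexOf(c) >= 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Drop every character that appears in blackList.
    static String dropAll(String text, String blackList) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (blackList.indexOf(c) < 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        String blackList = "~fl";

        // Sample strings stand in for real OCR output.
        System.out.println(keepOnly("INV-2023#07", whiteList)); // prints INV202307
        System.out.println(dropAll("~flat rate", blackList));   // prints at rate
    }
}
```

Note that keepOnly also removes spaces and line breaks unless you add them to the whitelist string.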

 
