I am building an application that extracts text from DOC, DOCX, and PDF files written in Hebrew. How can I extract the text correctly and then apply regular expressions to it? The code must support the UTF-8 charset and both LTR and RTL text direction, and newline characters must be retained in the text.
1 Answer
You need to study RTL a little more.
- PDF is sometimes written in visual mode (characters stored in display order), so you just need to reverse it. Not the whole string, only the Hebrew runs. http://php.net/manual/en/function.hebrevc.php will not help, since it does the opposite: it converts logical text to visual.
- Word and ODT are saved in logical mode, so no reversal is needed.
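
As a rough illustration of "reverse only the Hebrew", here is a minimal Python sketch. It assumes the text has already been extracted from the PDF into a string, and that a "run" is a sequence of Hebrew words joined by spaces or simple punctuation. The names `HEBREW_RUN` and `visual_to_logical` are made up for this example. It works line by line, so newlines are kept. It does not implement the full Unicode bidi algorithm (no bracket mirroring, no reordering of numbers inside a Hebrew sentence), so treat it as a starting point only.

```python
import re

# Hebrew Unicode block (letters, points, punctuation).
HEBREW = r"\u0590-\u05FF"

# Assumption: a run is one or more Hebrew words joined by
# spaces, tabs, or simple punctuation. Newlines are never part of a run.
HEBREW_RUN = re.compile(rf"[{HEBREW}]+(?:[ \t,.'\"-]+[{HEBREW}]+)*")


def visual_to_logical(text: str) -> str:
    """Reverse each Hebrew run in place, leaving all other text untouched."""
    lines = text.split("\n")
    fixed = [
        HEBREW_RUN.sub(lambda m: m.group(0)[::-1], line)
        for line in lines
    ]
    # Re-join with the same separator so line breaks are retained.
    return "\n".join(fixed)
```

Latin text and digits around a run stay where they are, because only the matched Hebrew spans are reversed.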
Arabic and Hebrew are only displayed "in reverse"; they are stored in the same order as English (the first word is first in the file).
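
To see what logical order means in practice, here is a small Python sketch. It assumes the extracted bytes are UTF-8; the sample string and the name `HEBREW_WORD` are only for illustration. Once decoded, index 0 holds the first letter a reader would read, so a regular expression can be applied directly, and splitting on `"\n"` shows the line breaks are still there.

```python
import re

# Sample data standing in for bytes read from an extracted file:
# the Hebrew words "shalom olam", a newline, then an English line.
raw = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd\nHello 2024".encode("utf-8")

# Decode explicitly as UTF-8 before doing any matching.
text = raw.decode("utf-8")

# Logical order: the first code point is the first letter read (shin).
first_letter = text[0]

# Match whole Hebrew words using the Hebrew Unicode block.
HEBREW_WORD = re.compile(r"[\u0590-\u05FF]+")
words = HEBREW_WORD.findall(text)

# Line breaks survive decoding and matching untouched.
lines = text.split("\n")
```

The matches come back in reading order (first word first), regardless of how a terminal or editor chooses to display them.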