Background and Limitations

Tesseract was originally designed to recognize English text only. The engine and its training system have since been modified to handle other languages and UTF-8 characters. Tesseract 3.0 can handle any Unicode character (encoded as UTF-8), but there are limits to the range of languages it handles well, so please read this section before assuming that it will work well on your particular language.

Tesseract 3.01 added support for top-to-bottom languages, and Tesseract 3.02 added Hebrew (right-to-left). Tesseract currently handles scripts like Arabic with an auxiliary engine called cube (included in Tesseract 3.0+).

Tesseract is slower with languages that have large character sets (such as Chinese), but it seems to work OK.

Tesseract needs to know about the different shapes of the same character, which means that different fonts must be separated explicitly. The number of fonts used to be limited to 32, but the limit has been raised to 64. It is set by the constant MAX_NUM_CONFIGS, defined in intproto.h. Note that runtime depends heavily on the number of fonts provided, and training with more than 32 will result in a significant slow-down.
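As a rough illustration of the font limits described above, the following Python sketch checks a list of font names before training. The function name and the two Python constants are hypothetical helpers written for this example; they only mirror the numbers quoted in the text (64 as the value of MAX_NUM_CONFIGS, 32 as the point where slow-down is expected) and do not read anything from the Tesseract sources.

```python
# Values taken from the text above; they are copied here by hand, not read from intproto.h.
MAX_NUM_CONFIGS = 64      # hard limit on the number of fonts
SLOWDOWN_THRESHOLD = 32   # more fonts than this is expected to slow recognition noticeably


def check_font_count(fonts):
    """Return True if the font list is within the limit but likely to cause a slow-down.

    Raises ValueError if the number of distinct fonts exceeds MAX_NUM_CONFIGS.
    """
    count = len(set(fonts))  # duplicates do not count as separate fonts
    if count > MAX_NUM_CONFIGS:
        raise ValueError(
            f"{count} fonts requested, but the limit is {MAX_NUM_CONFIGS}"
        )
    return count > SLOWDOWN_THRESHOLD
```

A call such as `check_font_count(["Arial", "Times New Roman"])` returns `False`, meaning the list is well below the slow-down threshold.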

Any language that uses different punctuation and numbers will be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits. This is to be fixed in 3.0x for x>=2.
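To judge whether a language is likely to be affected by the ASCII assumption, one option is to scan a sample of its text for digits and punctuation outside the ASCII range. The sketch below does this with Python's standard `unicodedata` module. The function name is hypothetical, and the check is only an approximation based on Unicode general categories, not a description of what Tesseract does internally.

```python
import unicodedata


def non_ascii_digits_and_punct(text):
    """Return a sorted list of the non-ASCII digits and punctuation marks in text."""
    found = set()
    for ch in text:
        if ord(ch) < 128:
            continue  # ASCII characters are what the hard-coded algorithms expect
        category = unicodedata.category(ch)
        # "Nd" is a decimal digit; every category starting with "P" is punctuation.
        if category == "Nd" or category.startswith("P"):
            found.add(ch)
    return sorted(found)
```

An empty result suggests the sample uses only ASCII digits and punctuation; a non-empty result lists the characters that may be handled poorly.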

You need to run all commands in the same folder where your input files are located.
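When the commands are driven from a script rather than typed in a shell, the same requirement can be met by setting the working directory of each child process. The following is a minimal Python sketch using the standard `subprocess` module. The helper name `run_in_input_dir` is an assumption made for this example, and the command to run is left to the caller.

```python
import subprocess
from pathlib import Path


def run_in_input_dir(command, input_dir):
    """Run command (a list of arguments) with input_dir as its working directory."""
    input_dir = Path(input_dir).resolve()
    if not input_dir.is_dir():
        raise NotADirectoryError(str(input_dir))
    # cwd makes the child process start in the folder that holds the input files,
    # so relative file names in the command resolve against that folder.
    return subprocess.run(
        command,
        cwd=input_dir,
        check=True,           # raise CalledProcessError on a non-zero exit status
        capture_output=True,  # collect stdout and stderr for inspection
        text=True,
    )
```

The returned object exposes `stdout` and `stderr`, which is convenient for logging the output of each step.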
