Wednesday, March 14, 2007

OCR taxa names from phylogenies

I have been trying out different open-source OCR programs to recognise the taxa names used in phylogenies. Tesseract (newly released by Google) did not fare as well as GOCR. If you train GOCR using a database of character images, it does even better. The raw output from each run on the same figure is shown below.
________________________________
Tesseract
________________________________
M mii
is A
' 53 V S
i
3 7 @1
M `W gj i :[
T * 5?$fA
`Gy i i 8
V ~ 'E S` v~v :
;A bg fh , [
A `i $ > ggt wd
E V 1 Jvh xg E 4 i
A 3 ^ Awbt ) 4 V M
Tvlrjg A
W [ M Y 2 At A] [ vp
> 4 ii ; h i1
% V b a i % i i S
4-E A ) * k i VZ-WVYA S 4`
4 W AQAA A [A~Jw Q QT
VV 42 V V i mvi Q. A A4@ $
@ N [ 2 h Jii@ ' S
,1 > ) ~~%# A -` X * @8
A T V S E -3 A~I: W i
A ` J V V g - M ? A AE
3 `@ J g t. ,4
1 jji WE jzi Q E >
$ A ag H L 1 @ Q w$
vi V BE * V 3 V ` * miji ,)
I V `V as ` 7 ,! I i
5 ) i AVVM *4.3 ~# 4 p u ` 1 ~ V *
i J .41 " gm ^ ` V` U N9 7* 6) *`
2 =
` ~ " ,4 ` ` ` L.;i A>?, ps > ,` i V. g>v..Ve,g $?i[ X ` I j:)4;w . ( 'X A " K 7
4 1 qvb~: JJ S t *%%`~
T - V (A %%^) I As
3 I jwi @ @ A W J [
?g M 6* i Pg`( h b I S W
1 i ; v ^ T [f L AA A
v A H 2 3 Y44@ M V V [ A M { Vi
5 ? El t 52 Wi 1* AL i
* 1^ 4@=b> i-sm ^ 44J
Q *i b g N %A J
4 ; V A C A i ]i T J L ^ A$v J
1 ,V;gt t izm dmgg 4 A 7
T ~ "as A ` . , @ J AVE V7;z`* I?
i i O gm ] is Aig
`- ` V An 7 F x ``^ w ` `An pb _ A ~ e h 7 U ^,
W L - N 4 A #14$A i
M A @ * i
V Z V gi A E T v?)
` pig [A$[ X A 11 44 i L V
~$ Ivd? A ^~hr4 V mg
i $ ihb 4 4 A A p LLAE
& V L Ay M #
N A if '4jv[ 1 x
` V.>gg( A QA VIV \ v4Ji`
J7* **nif if @;ibA Mb 2 W A i & $@^@ Q
% & L% J 4i @ @w@ 4%
1 A x ][A Q ii ] ijN i&i A [
V A " F
i L t WAA! ia G iLAi*
!^=p~ .4 x vvgv V A
I - G 24 @1 V `'A. p m
VAF A i
1 @A[f A 2 X X 4A i @
W gv@ I gpa 'A @ ' E
,>gn I At M i-44 ` BE 4
m m M I V OF J; { V A V qil
)~'; E` > FF a ;; M `* ga ~
LA v 6 @ YAVZ4 A V 3 ) M VY J 8 (
A! i h A 1`3A& X *
HE % At _ ` 7 @ 2 " ' 5 4 "
S 9 4,p; 5
k A V j , i v [ MY >yAA SS 4
V ` ^%jV 4wJv` J 4 4
` Q? W ppgawi V 1 . 1
\ 1 _` 7 2* I-
74 [ ^*^- Av pw
2 ~T$ 'W`` V 1 i i i
Q) H pi In as i
3 4 1 'AY @4 ~ 4, Y
1 V Er A [ M $95 V 8 i `
V ( @4 V * ) jg.: A
V Q 4 s 1 ( vv ;
44 g E 3 iii A] >;1A 9;
C; 4%%3 2 P 4`4A%yV;` 4s
` v4'Y H4:tA? K ; i4 VA 1 VAV A`=$ g `
=?h V) V
4 s iqvv i @M,ib J .l@ A S V?
wbi` i4 $; ~ Q? V t
Aii 6 4 2; EYE V )AGv
A W g 1 ) S HE Jig
1 % AM fA4LN ii O Egg?
V; V @3 ;#
` V I iv V7>:v2 V VA N
* 'id; E ' X? V %.`i:h; 'M( @6i . 7 _ ` t V} .5 '
`V V4 gvgv E1`

M Q`;`A
' 'Apq
~ $$i
___________________________________
GOCR
___________________________________
c(PICTURE)100 C. _asicus (14)
C. nasicus (_8)
C. jo_e_sjs (30)
C. co_fusoy (32)
C. coMfusoy (12)
C. su_cafu_us (_3)
C. su_cafu_us (68)
C. humey__is(_4)
C. vicoyiensis (_5)
C. p_y__lis (26)
C. vicfoyie_sÌs (54)
C. _ongi_e_s (16)
C. longi_e_s (19)
C. ve_osus (82)
C. veMos_s (166)
C. ve_osus (160)
C. ve_osus (149)
C. s__icivoyus
C.pe_lifus (131)
C. pellifus (151)
C. pellifus (192)
C. eleph_s (_18)
C. eleph_s (_13)
C. eleph_s (_16)
C. c___e (204)
C. cR_Re (205)
C. ca__e (208)
C. pyobosci_eus (58)
C. humey__is (56)
C. p_o_osci_eus (22)
C. scufel1_pjs (5)
C. nucu_ (117)
C. nucum (_43)
C. g____ium (_OO)
C. g__n_ium (gg)
Pakjsfan _p. (_1)
C. c__elli_e (49)
C. c__el_i_e (20)
__iica_ sp. (8)
C. p__yhoce_as (1)
C. p_y_hoceyas (16)
_______________________
GOCR using database option
_______________________
c(PICTURE)100 C. nasicus (14)
C. nasicus (28)
C. iowensis (30)
C. confusor (32)
C. coMfusor (12)
C. sulcatulus (23)
C. sulcatulus (68)
C. humeyalis(24)
C. vicoyiensis (25)
C. paydalis (26)
C. vicforiensÌs (54)
C. longidens (16)
C. longidens (19)
C. venosus (82)
C. veMosus (166)
C. venosus (160)
C. venosus (149)
C. salicivoyus
C.pellifus (131)
C. pellitus (151)
C. pellitus (192)
C. elephas (218)
C. elephas (213)
C. elephas (216)
C. cawae (204)
C. cawae (205)
C. cawae (208)
C. pyoboscideus (58)
C. humeyalis (56)
C. proboscideus (22)
C. scufel1aris (5)
C. nucum (117)
C. nucum (243)
C. glandium (200)
C. glandium (99)
Pakistan sp. (21)
C. camelliae (49)
C. camelliae (20)
Afiican sp. (8)
C. pyrrhoceras (1)
C. pyyrhoceras (16)
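
To put a rough number on the difference between the two GOCR runs, one option is to count the characters GOCR gave up on, which it prints as an underscore. The sketch below does that for five lines copied from each run above. The function name `count_unrecognised` is made up for illustration, and the count ignores characters that were misread rather than skipped (for example `f` for `t` in "pellifus"), so it understates the real error rate.

```python
# Five matching lines copied from the two GOCR outputs above.
default_run = [
    "C. nasicus (_8)",
    "C. jo_e_sjs (30)",
    "C. su_cafu_us (_3)",
    "C. eleph_s (_18)",
    "C. g____ium (_OO)",
]
database_run = [
    "C. nasicus (28)",
    "C. iowensis (30)",
    "C. sulcatulus (23)",
    "C. elephas (218)",
    "C. glandium (200)",
]


def count_unrecognised(lines, placeholder="_"):
    """Count placeholder characters, i.e. glyphs the OCR did not recognise.

    Hypothetical helper: it assumes the placeholder never occurs in the
    true text, which holds for these taxon labels.
    """
    return sum(line.count(placeholder) for line in lines)


if __name__ == "__main__":
    print("default run :", count_unrecognised(default_run))
    print("database run:", count_unrecognised(database_run))
```

A fuller comparison would need the correct labels typed in by hand, so that misread characters could be counted as well.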

3 comments:

Donat Agosti said...

I got the figures from your publication (fig. 1, in GIF format) and ran them through ABBYY; it would not work at all. So I wonder what your original source for the OCR was. I am pretty sure, though, that if I printed the phylogenies and then ran OCR on the printouts, I would get a better result.

blOg said...

I forgot to mention that I first extracted the images from the PDF file using Adobe Acrobat (it turns all the images in the document into JPEG format). I then converted the JPEG into PCX using ImageMagick.
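
The conversion and OCR steps could be scripted roughly as below. This is only a sketch under assumptions: it uses ImageMagick's `convert` command and the GOCR options `-i` (input file), `-m 2` (use the character database) and `-p` (database directory), which is how I remember the GOCR manual describing them, so check `gocr -h` on your install. The file name `fig1.jpg` and the function names are hypothetical, and the Acrobat extraction step is manual and not covered.

```python
import subprocess
from pathlib import Path


def convert_cmd(jpeg, pcx):
    """Build the ImageMagick command that converts a JPEG to PCX."""
    return ["convert", str(jpeg), str(pcx)]


def gocr_cmd(image, use_database=False, db_dir="./db/"):
    """Build the GOCR command line.

    Assumption: "-m 2" switches on the character database and "-p"
    names its directory; verify both against your GOCR version.
    """
    cmd = ["gocr"]
    if use_database:
        cmd += ["-m", "2", "-p", db_dir]
    cmd += ["-i", str(image)]
    return cmd


def ocr_figure(jpeg, use_database=False):
    """Convert one extracted figure to PCX and return GOCR's text output."""
    pcx = Path(jpeg).with_suffix(".pcx")
    subprocess.run(convert_cmd(jpeg, pcx), check=True)
    result = subprocess.run(
        gocr_cmd(pcx, use_database),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Needs ImageMagick and GOCR installed, plus a figure named fig1.jpg.
    print(ocr_figure("fig1.jpg", use_database=True))
```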

