No speed advantage when using batches. #58

Open
Dario-Mantegazza opened this issue Apr 15, 2024 · 8 comments

Comments

@Dario-Mantegazza

Dario-Mantegazza commented Apr 15, 2024

I ran some tests using detection + recognition on a set of 30 images and saw no speed improvement when using batches.
So I checked the code and, if I understood it correctly, in your implementation,

tamil_ocr/ocr_tamil/ocr.py

Lines 527 to 536 in 71a91db

        # To handle multiple images
        if isinstance(image,list):
            text_list = []
            if self.detect:
                for img in image:
                    temp = self.read_image_input(img)
                    exported_regions,updated_prediction_result = self.craft_detect(temp)
                    inter_text_list,conf_list = self.text_recognize_batch(exported_regions)
                    final_result = self.output_formatter(inter_text_list,conf_list,updated_prediction_result)
                    text_list.append(final_result)
you split the batch into single images, pass each image to CRAFT, get the bounding boxes (BBs), and pass those to ParSeq.

I'm not an expert in ParSeq, but if it can already handle batches of BBs, why not simply take all the BBs from the whole batch and pass them to ParSeq as a single input?

To recap my suggestion: why don't you do something like the following?

bbs = []
for image in batch:
    bb_preds = craft(image)       # bounding boxes detected in this image
    bbs.append(bb_preds)
texts = parseq_read_batch(bbs)    # one recognition call for the whole batch

This should be faster, as ParSeq is called only once per batch rather than once per image, albeit at a larger memory cost; that can be managed with the batch size parameter.

Obviously even better would be to do something like:

bbs=craft_batch(batch)
texts=parseq_batch(bbs)
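
The first suggestion above can be sketched roughly as follows. This is a minimal illustration, not the ocr_tamil implementation: `recognize_across_images`, `detect`, and `recognize_batch` are hypothetical names standing in for the CRAFT detection step and the ParSeq batch recognizer. The only point is the bookkeeping: flatten the crops of all images, call the recognizer once, then split the results back per image.

```python
from typing import Callable, List, Sequence


def recognize_across_images(
    images: Sequence[object],
    detect: Callable[[object], List[object]],
    recognize_batch: Callable[[List[object]], List[str]],
) -> List[List[str]]:
    """Detect per image, recognize all crops in one call, regroup per image.

    `detect` and `recognize_batch` are hypothetical stand-ins for the
    detector and the batched recognizer; they are not library functions.
    """
    all_crops: List[object] = []
    counts: List[int] = []  # number of crops contributed by each image

    for image in images:
        crops = detect(image)
        all_crops.extend(crops)
        counts.append(len(crops))

    # Single recognition call for the whole batch (skipped if nothing was detected).
    texts = recognize_batch(all_crops) if all_crops else []

    # Split the flat result list back into one list per input image.
    grouped: List[List[str]] = []
    start = 0
    for n in counts:
        grouped.append(texts[start:start + n])
        start += n
    return grouped


if __name__ == "__main__":
    # Toy stand-ins: "images" are strings, "crops" are words.
    fake_detect = str.split
    fake_recognize = lambda crops: [c.upper() for c in crops]
    print(recognize_across_images(["a b", "", "c"], fake_detect, fake_recognize))
```

Keeping the per-image counts is what lets the output formatter still receive results grouped by image.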
@Dario-Mantegazza
Author

Apparently CRAFT can run in batches. From the CRAFT repository:

I think running the inference in parallel is difficult due to the post-processing step, which is performed in CPU unless you use multi-processing technique. However, the batch-processing of deep networks is possible within a memory limit.

clovaai/CRAFT-pytorch#44 (comment)

Other comments in the issue section of CRAFT's GitHub repository also state that batch prediction is feasible.
It would be interesting if the batch functionality of ocr-tamil exploited this.
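
One common way to run a detector on several images in a single forward pass is to pad them to a shared size and stack them into one array. The sketch below shows only that padding step with NumPy; `pad_and_stack` is a hypothetical helper, and whether this fits CRAFT's own preprocessing (resizing, normalization) is an assumption that would need checking against the CRAFT code.

```python
import numpy as np


def pad_and_stack(images):
    """Zero-pad HxWxC images to a common size and stack them as (N, H, W, C).

    Hypothetical helper for illustration. Returns the batch plus the original
    (height, width) of each image so predictions can be cropped back later.
    """
    if not images:
        raise ValueError("need at least one image")

    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    channels = images[0].shape[2]

    batch = np.zeros((len(images), max_h, max_w, channels), dtype=images[0].dtype)
    sizes = []
    for i, img in enumerate(images):
        h, w = img.shape[:2]
        batch[i, :h, :w] = img  # top-left aligned; the rest stays zero
        sizes.append((h, w))
    return batch, sizes


if __name__ == "__main__":
    a = np.ones((2, 3, 3), dtype=np.uint8)
    b = np.ones((4, 2, 3), dtype=np.uint8)
    batch, sizes = pad_and_stack([a, b])
    print(batch.shape, sizes)
```

As the quoted comment notes, the CPU post-processing would still run per image; only the network forward pass is batched.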

@Dario-Mantegazza
Author

Also, I think it would make more sense to decouple the batch size used by ParSeq for text recognition from the tamil-ocr batch size parameter. These should be two separate numbers.
I like this library, please keep working on it :)
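
A rough sketch of what decoupling the two numbers could look like: the recognizer is fed fixed-size chunks of crops, controlled by its own parameter, regardless of how many images were passed in. The names `chunked`, `recognize_in_chunks`, and `rec_batch_size` are hypothetical and not part of ocr_tamil.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def recognize_in_chunks(
    crops: Sequence[T],
    recognize_batch: Callable[[List[T]], List[str]],
    rec_batch_size: int,
) -> List[str]:
    """Run a (hypothetical) batched recognizer over crops, `rec_batch_size` at a time.

    `rec_batch_size` bounds recognizer memory use and is independent of the
    number of images the crops came from.
    """
    texts: List[str] = []
    for chunk in chunked(crops, rec_batch_size):
        texts.extend(recognize_batch(chunk))
    return texts
```

With this split, the image-level batch size only decides how many pages are detected before recognition starts, while `rec_batch_size` can be tuned to the memory available for the recognizer.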

@gnana70
Owner

gnana70 commented Apr 15, 2024

Hi @Dario-Mantegazza, thanks for your feedback. I will try to include batch mode for CRAFT text detection in the coming weeks.

@Dario-Mantegazza
Author

Hi again @gnana70, in the meantime I will make a fork and see if I can implement a temporary workaround. I will keep you posted.
Cheers

@gnana70
Owner

gnana70 commented Apr 16, 2024

Hi @Dario-Mantegazza, thanks for your help. Please share your workaround once it is done.

@Dario-Mantegazza
Author

So I tried to change the code in the simplest, hackiest way, but for now I don't get better performance. I think something is broken in my edited version: while all the models accept batched input, something else curbs the performance gain. I will upload my partially working version to my fork, but due to work deadlines I don't think I can spend more time on this.

@gnana70
Owner

gnana70 commented Apr 16, 2024

@Dario-Mantegazza, no problem. I will investigate and fix it.

@JamesDConley
Contributor

Most of the processing time appears to be spent in the cv2/numpy code that extracts the detected word images from the main image. I swapped this code out for a simple min/max rectangle crop, and the time for the page I was testing went from 360s to under 15s.

For images with larger numbers of bounding boxes, this will be an even more drastic speedup, since it reduces the extraction cost from 1-2 seconds per bounding box to around 1/100000 of a second per bounding box.

The only downside is that this doesn't straighten the text; it just pulls out a bounding box. This works for my use case, though, since I am extracting from documents without any tilted text.

Here are the timings before and after for the portion of the code I was working in:
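
For readers unfamiliar with the idea, a min/max rectangle crop can be sketched as below. This is an illustration only, not the code from the linked repository: `crop_axis_aligned` is a hypothetical helper that assumes each detection is a list of (x, y) corner points and the image is a NumPy array indexed as [row, column].

```python
import numpy as np


def crop_axis_aligned(image: np.ndarray, quad) -> np.ndarray:
    """Crop the axis-aligned rectangle that encloses the points in `quad`.

    `quad` is a sequence of (x, y) points. No rotation or perspective
    correction is applied, so tilted text stays tilted in the crop.
    """
    pts = np.asarray(quad, dtype=float)
    height, width = image.shape[:2]

    # Min/max of the corner coordinates, clamped to the image bounds.
    x0 = int(max(np.floor(pts[:, 0].min()), 0))
    y0 = int(max(np.floor(pts[:, 1].min()), 0))
    x1 = int(min(np.ceil(pts[:, 0].max()), width))
    y1 = int(min(np.ceil(pts[:, 1].max()), height))

    # Plain slicing returns a view, so no pixel data is copied or resampled.
    return image[y0:y1, x0:x1]


if __name__ == "__main__":
    page = np.arange(100).reshape(10, 10)
    box = [[2, 3], [6, 3], [6, 5], [2, 5]]
    print(crop_axis_aligned(page, box).shape)
```

Because the crop is a single slice, its cost does not depend on the size of the page, which is consistent with the per-box speedup described above.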

Before

Timer started!
Read Image took 0.00 seconds (0.00 seconds total)
Timer started!
	Got size took 0.00 seconds (0.00 seconds total)
	Got prediction took 11.34 seconds (11.34 seconds total)
	Transformed bboxes initial took 0.00 seconds (11.34 seconds total)
	Sorted bounding boxes took 0.00 seconds (11.34 seconds total)
	Updated prediction results took 0.00 seconds (11.34 seconds total)
	**Exported file paths took 348.48 seconds** (359.82 seconds total)
	Updated prediction results again took 0.00 seconds (359.82 seconds total)

After

Timer started!
Read Image took 0.00 seconds (0.00 seconds total)
Timer started!
	Got size took 0.00 seconds (0.00 seconds total)
	Got prediction took 11.08 seconds (11.08 seconds total)
	Transformed bboxes initial took 0.00 seconds (11.08 seconds total)
	Sorted bounding boxes took 0.00 seconds (11.08 seconds total)
	Updated prediction results took 0.00 seconds (11.08 seconds total)
	**Exported file paths took 0.01 seconds** (11.09 seconds total)
	Updated prediction results again took 0.00 seconds (11.09 seconds total)

Code is at https://github.com/JamesDConley/faster_tamil_ocr
Got a bit of debugging/testing left to do but I'll likely have a PR tomorrow or the following night.


3 participants