PDF search not identifying hyphenated words #315

The word search function in pdfs no longer identifies words that are hyphenated in the pdf as a result of a line break.

I’ve attached a screenshot in case my written explanation wasn’t clear: I searched “mechanical” but nothing was highlighted even though it’s in the text (circled in red). I think the software picks it up as a completely separate word, i.e. “mechan-ical”.

2 years ago

Thanks, great point Jade! Could you send me that PDF? I’d like to see which PDF readers, if any, are wise to word breaks like this, and whether it’s in fact a regression on our end.

2 years ago
Changed the status to Under Consideration
2 years ago

And I believe you’re correct about the word being seen as literally mechan-ical by our search.

2 years ago

Here’s the link to the above pdf: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8239692/pdf/CNCR-9999-0.pdf

I believe regular downloaded pdfs/ones you open up in a search engine have the ability to search words when they’re hyphenated in the text!

2 years ago

Thanks! I’ve confirmed that Chrome search treats hyphenated line endings as a word break, while Firefox does not. I’ll see what level of control we have for this. I’d prefer to infer “hyphen + end of line = word break” in your search. It’s not a perfect strategy (e.g. when a normally hyphenated phrase occurs at the end of a line), but it must be less faulty than not matching any word break.
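To make the heuristic concrete, here is a minimal Python sketch of the "hyphen at end of line means the word continues" idea. It is not how the product's search is implemented; the function names and the sample lines are made up for illustration, and it assumes the PDF text is already available as a list of lines.

```python
import re


def join_hyphenated_line_breaks(lines):
    """Join extracted text lines, treating a trailing hyphen as a soft word break.

    Hypothetical helper: assumes `lines` is the PDF text already split by line.
    """
    parts = []
    for line in lines:
        stripped = line.strip()
        if len(stripped) > 1 and stripped.endswith("-") and stripped[-2].isalpha():
            # Hyphen directly after a letter at end of line: drop the hyphen
            # and glue this fragment onto the start of the next line.
            parts.append(stripped[:-1])
        else:
            # Ordinary line ending: separate from the next line with a space.
            parts.append(stripped + " ")
    return "".join(parts).strip()


def find_all(text, query):
    """Return the start offsets of every case-insensitive match of `query`."""
    return [m.start() for m in re.finditer(re.escape(query), text, re.IGNORECASE)]


sample_lines = ["the mechan-", "ical properties of", "a well-", "known alloy"]
joined = join_hyphenated_line_breaks(sample_lines)
print(joined)
print(find_all(joined, "mechanical"))
print(find_all(joined, "well-known"))
```

The last search comes back empty, which is the imperfection noted above: a phrase that is normally hyphenated loses its hyphen when it happens to fall at the end of a line.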

2 years ago

The unfortunate reality with Chrome is that part of the reason we moved away from its default PDF viewer was that it wasn’t possible to search only within the PDF embedded on the page (vs. all text on the page). By dropping Chrome’s default behavior and using our own, we lost this handy search functionality.

2 years ago

Ah gotcha, I had no idea this wasn’t a feature across all search engines. Thanks Karl!

2 years ago

A sort of extension on this: the PDF search sometimes fails to find some words altogether. Screenshots attached; it only worked when I typed atrial fibrillation as one word, even though it’s definitely two in the PDF. I tried opening the PDF in Chrome and Firefox separately and the search was fine, so I think it’s something on our end!

It’s the Pang 2021 paper within the HF nest:
https://nested-knowledge.com/gather/extraction_inspector/410

2 years ago

@Karl Holub @Jeffrey Johnson Not sure if you both saw my comment above since I didn’t tag you but I keep noticing the search function isn’t highlighting results when they’re definitely in the text. I think our search function adds and removes spaces in the pdf, which might cause this?
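If extra or missing spaces in the extracted text are indeed the cause (that is only a guess at this point), one workaround is to make the search ignore whitespace entirely. The Python sketch below illustrates that idea only; it is not the product's actual search code, and the function names and sample strings are hypothetical.

```python
import re


def whitespace_tolerant_pattern(query):
    """Build a regex that matches `query` with any whitespace added or removed.

    Hypothetical helper: every non-space character of the query may be
    followed by optional whitespace in the searched text.
    """
    chars = [re.escape(c) for c in query if not c.isspace()]
    if not chars:
        return None  # Nothing to search for.
    return re.compile(r"\s*".join(chars), re.IGNORECASE)


def search(text, query):
    """Return the first matched span of text, or None if there is no match."""
    pattern = whitespace_tolerant_pattern(query)
    if pattern is None:
        return None
    match = pattern.search(text)
    return match.group(0) if match else None


# Extracted text that lost a space still matches the two-word query.
print(search("patients with atrialfibrillation were", "atrial fibrillation"))
# Extracted text that gained a space still matches too.
print(search("history of atrial fibril lation", "atrial fibrillation"))
```

The trade-off is false positives across word boundaries (a search for “therapist” would also match “the rapist”), so this would need a check against real PDFs before being relied on.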

2 years ago

Hi Jade - sorry I never got back to you! We’re actively looking into the spacing issue, which shows up inconsistently across PDFs. It’s a frustrating bug because our software only reproduces this behavior under certain (still not understood) conditions.

2 years ago

The spacing issue specifically is being tracked here: https://nested-knowledge.nolt.io/298

2 years ago

I’m going to let Karl diagnose the problem and answer here, but thank you for the report; that looks incredibly frustrating.

2 years ago

No worries, just wasn’t sure you saw it since I hadn’t tagged you. Thanks for the update!

2 years ago

To boot - sometimes it doesn’t even reproduce for the same PDF!

2 years ago