The word search function in pdfs no longer identifies words that are hyphenated in the pdf as a result of a line break.
I’ve attached a screenshot in case my written explanation wasn’t clear: I searched “mechanical” but nothing was highlighted even though it’s in the text (circled in red). I think the software picks it up as a completely separate word, i.e. “mechan-ical”.
Thanks, great point Jade! Could you send me that PDF? I’d like to see which, if any, PDF readers are wise to word breaks like this, and whether it’s in fact a regression on our end.
And I believe you’re correct about the word being seen as literally “mechan-ical” by our search.
Here’s the link to the above pdf: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8239692/pdf/CNCR-9999-0.pdf
I believe regular downloaded PDFs, or ones you open in a browser, let you search for words even when they’re hyphenated in the text!
Thanks! I’ve confirmed that Chrome search treats hyphenated line endings as a word break, while Firefox does not. I’ll see what level of control we have for this. I’d prefer to infer “-” + end of line = word break in your search. It’s not a perfect strategy (e.g. when a normally hyphenated phrase occurs at the end of a line), but it must be less faulty than not matching any word break.
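For anyone following along, here is a rough Python sketch of the "hyphen + end of line = word break" idea. It is only an illustration under assumptions: it presumes the page text is available as a plain string with line breaks preserved, and the function names are made up for this example. It is not how our search is actually implemented.

```python
import re

# Hypothetical helper: drop a hyphen that sits at the end of a line,
# joining the two halves of the word ("mechan-\nical" -> "mechanical").
def join_hyphenated_line_breaks(text: str) -> str:
    # A hyphen preceded by a word character, then optional spaces/tabs,
    # a line break, optional indentation, and another word character.
    return re.sub(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)", "", text)

# Hypothetical helper: case-insensitive search over the joined text.
# Returned offsets refer to the joined text, not the original string.
def find_all(text: str, query: str) -> list:
    normalized = join_hyphenated_line_breaks(text).lower()
    pattern = re.escape(query.lower())
    return [m.start() for m in re.finditer(pattern, normalized)]

page = "the mechan-\nical properties of the device"
print(find_all(page, "mechanical"))
```

The caveat mentioned above shows up directly: a genuinely hyphenated phrase such as "well-known" that happens to break at the hyphen gets joined into "wellknown", so a search for "well-known" would miss it.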
The unfortunate reality with Chrome is that part of the reason we moved away from its default PDF viewer was that it wasn’t possible to search only within the PDF embedded on the page (vs. all text on the page). By dropping Chrome’s default behavior and using our own, we lost this handy search functionality it offered!
Ah gotcha, I had no idea this wasn’t a feature across all browsers. Thanks Karl!
A sort of extension on this: the PDF search sometimes fails to find some words altogether. Screenshots attached; it only worked when I typed “atrial fibrillation” as one word, even though it’s definitely two in the PDF. I tried opening the PDF in Chrome and Firefox separately and the search was fine, so I think it’s something on our end!
It’s the Pang 2021 paper within the HF nest:
https://nested-knowledge.com/gather/extraction_inspector/410
@Karl Holub @Jeffrey Johnson Not sure if you both saw my comment above since I didn’t tag you but I keep noticing the search function isn’t highlighting results when they’re definitely in the text. I think our search function adds and removes spaces in the pdf, which might cause this?
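If the extracted text really does gain or lose spaces compared with what is shown on screen, one possible workaround is to ignore whitespace when matching. The Python sketch below is only a guess at such a workaround, assuming the extracted text is a plain string; the function name is hypothetical and this is not the fix being tracked for the spacing issue.

```python
from typing import List, Tuple

# Hypothetical helper: match a query while ignoring all whitespace in both
# the text and the query, then map each hit back to offsets in the original.
def find_ignoring_whitespace(text: str, query: str) -> List[Tuple[int, int]]:
    chars = []    # lowercased, whitespace-free copy of the text
    offsets = []  # offsets[k] = index in `text` that produced chars[k]
    for i, ch in enumerate(text):
        if ch.isspace():
            continue
        # lower() can yield more than one character for some letters,
        # so record the source index once per output character.
        for low in ch.lower():
            chars.append(low)
            offsets.append(i)
    squeezed = "".join(chars)
    needle = "".join(query.lower().split())
    if not needle:
        return []
    spans = []
    start = squeezed.find(needle)
    while start != -1:
        last = start + len(needle) - 1
        # Half-open span (start, end) into the original text.
        spans.append((offsets[start], offsets[last] + 1))
        start = squeezed.find(needle, start + 1)
    return spans

extracted = "Patients with atrialfibrillation were included."
for begin, end in find_ignoring_whitespace(extracted, "atrial fibrillation"):
    print(extracted[begin:end])
```

The trade-off is false positives: with spaces ignored, a query can match across a real word boundary, so this would probably need to be a fallback rather than the default behavior.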
Hi Jade - sorry I never got back to you! We’re actively looking into the spacing issue, which is somewhat spotty over PDFs. It’s a frustrating bug because our software is only reproducing this behavior under certain (still not understood) conditions.
The spacing issue specifically is being tracked here: https://nested-knowledge.nolt.io/298
I’m going to let Karl diagnose the problem and answer here, but thank you for the report; that looks incredibly frustrating.
No worries, just wasn’t sure you saw it since I hadn’t tagged you. Thanks for the update!
To boot - sometimes it doesn’t even reproduce for the same PDF!