diacritics

1BIHS
Nov 9, 2021, 11:56 am

A great majority of the books we're adding come from India, which means a lot of original cataloging and diacritic marks over and under the letters. When doing a search in LibraryThing and TinyCat without the diacritics, sometimes the item comes up and sometimes it doesn't -- mostly we're not picking up the item when typing without the diacritics. What library user is going to use or know how to use them in searching a title? In any other library you can type in the title without the marks and still pick up the record. Is there anything we can do to allow this feature in TinyCat and LibraryThing? Thanks.

2Keeline
Nov 9, 2021, 3:00 pm

Although Unicode can store glyphs with diacritical marks, it can be very hard for computers to strip them away for searching purposes. There is a term for this. I think it is called delamination or something similar in spelling but different.

It is generally easier to type the characters with the accents on a Mac than it is on Windows or Linux. However, even if you do, there's no guarantee that the way you entered it will match the search.

There are some algorithms to strip away the most common diacritical marks from characters to facilitate searching. However, for some languages this is much harder and it all comes down to whether the right algorithms are used.
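One widely used approach of this kind can be sketched in Python: decompose each character with Unicode normalization, then drop the combining marks. This is only an illustration, not LibraryThing's code, and the name fold_diacritics is made up for the example. It is meant for Latin-script text such as romanized Sanskrit or Hindi; it would damage text written in Devanagari or similar scripts, where some marks are essential.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as "ā" into "a" + U+0304.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character that has a non-zero combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in NFC form.
    return unicodedata.normalize("NFC", stripped)


print(fold_diacritics("Rājasthānī temple hangings of the Kṛishṇa cult"))
```

Letters that have no Unicode decomposition, such as Ø, Æ or ß, pass through unchanged, and Århus comes out as Arhus rather than Aarhus, so language-specific spellings still need their own rules.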

Since that part of the LT code is unlikely to change, you can make searches more successful by entering the title without the marks as someone might be likely to type it in for searching. You can do this in a couple places and it should work.

One of them is to put the alternate title in parentheses in the title field. Here's a simple example:

Über is my favorite way to get around (Uber is my favorite way to get around)

Another option is to put the alternate version of the title in the Comments or Private Comments field. These should be searched. You might test to see if Private Comments are indeed searched since there could be good reasons to not search them if you are not logged in as the user who can view/edit them.

James

3bnielsen
Edited: Nov 10, 2021, 3:03 am

>2 Keeline: I have the same problem with some of my Danish books. One example is that Aarhus and Århus are the same word. I have a script that checks the Title and Comment field, so if the Title contains Århus, the Comment should contain Aarhus, etc. Same thing with hyphens, i.e. if the Title contains Klima-kampen the Comment should contain Klimakampen.
I do this exactly to be able to find both forms of the words.

So my solution would be to keep the original title and have a script check that if the title contains diacritics the Comment field should contain the title stripped of diacritics. (This works by exporting as a TSV file and having a Perl script or similar go through the TSV file.)
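A check of this kind might look roughly like the following Python sketch (Python instead of Perl, purely for illustration). The column names "Title" and "Comment" are assumptions borrowed from the one-liner later in this thread and may not match the header of an actual export; the function names are hypothetical.

```python
import csv
import io
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks from Latin-script text (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def titles_missing_plain_form(tsv_text, title_col="Title", comment_col="Comment"):
    """Return titles with diacritics whose plain form is not in the comment."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    missing = []
    for row in reader:
        title = row.get(title_col) or ""
        comment = row.get(comment_col) or ""
        plain = fold_diacritics(title)
        has_marks = plain != unicodedata.normalize("NFC", title)
        # Report the title only if it has marks and the comment lacks the plain form.
        if has_marks and plain.casefold() not in comment.casefold():
            missing.append(title)
    return missing
```

Pairs such as Århus/Aarhus or Klima-kampen/Klimakampen are not covered by mark stripping and would need an extra list of substitutions.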

It is a bit of a pain, but something you can do while cataloguing the book for the first time and then forget about. Plus you get to control what's going on which is always nice. I also add the original title for translated works and a transliterated version for Russian titles, etc.

You can also suggest to LT to add this feature to search. They already do "Stemming" to make searching a bit easier. This works best for English and not so well with Danish words, so I often get a few extra hits when I search.
Providing some examples will probably help.

4BIHS
Nov 10, 2021, 8:58 am

Thank you! Your responses have been very helpful.

5timspalding
Nov 10, 2021, 9:12 am

>1 BIHS:

Most common diacritics are handled natively by the engine—resume, résume, rêsume, rësume, etc. But it's downhill from there. Can you give us some specific situations that it doesn't work for?

T

6BIHS
Nov 10, 2021, 2:37 pm

The most common and prevalent are ā and ī. No results come up if these are in the word. For example, "Rājasthānī temple hangings of the Kṛishṇa cult from the Collection of Karl Mann, New York" brings up no results if you type in rajasthani. ś is also used a lot.
Hindi, Sanskrit, and Bengali have many letters with diacritics that we've had trouble with --- ṭ ḍ ū ṁ ṣ ṇ ḥ Ṭ Ś ṛ ṅ

So for now, I'm going back through what I've already catalogued and writing the title without the diacritics in the comments field as James suggested. That seems to be working, and while it's more work, we're able to keep the original title with the diacritics and bring up results without them.

If you're able to incorporate these letters somehow into the search engine, that would be great. Until you're able to do this, at least there's a workaround.
Thank you for asking.

7MarthaJeanne
Nov 10, 2021, 3:24 pm

You should be aware that even if they can be searched, diacritics will have problems in alphabetizing and in author page URLs.

8BIHS
Nov 10, 2021, 3:53 pm

Yes, we're seeing that. Academic libraries are able to handle these issues better, so we'll see as time goes on. We may have to forget about the diacritics altogether, though we really would like to keep them.

9bnielsen
Nov 11, 2021, 1:27 am

>8 BIHS: Some library systems have a way of writing "display 'this' but sort it as 'that'" inside titles etc. This allows you to decide the sort order exactly as you like at the cost of a little more work.
Variations on the LT scheme for chopping off A, The, An, etc. are also common, like an extra character that indicates where the "real" title begins.

10Keeline
Nov 11, 2021, 4:15 pm

>9 bnielsen: , you will find that LibraryThing has a field after the title that indicates the character position where the sorting should begin. It will recognize many words and preset the value. If it gets it wrong, you can adjust it.

James

11MarthaJeanne
Edited: Nov 11, 2021, 4:57 pm

>10 Keeline: I'm sure he knows that (whether or not the OP does). It is a big help for many cases, but it does not help with books that have a diacritic on the first or second letter.

https://www.librarything.com/catalog.php?tag=Birds&view=MarthaJeanne&col...

Look between the As and Bs in the list. Ü in German schemes can be interfiled with U. It can be filed as UE. It can be filed after U. But between A and B is not really where most people would look.
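Two of these filing schemes can be expressed as sort keys. The Python sketch below is an illustration only, with made-up function names; it does not describe how LibraryThing sorts and does not explain the between-A-and-B placement. The third scheme (Ü filed after U) would need a custom alphabet table and is left out.

```python
import unicodedata

# German expansion scheme: umlauts are filed as the vowel followed by "e".
UMLAUT_EXPANSION = str.maketrans({
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00df": "ss",
})


def fold(text: str) -> str:
    """Drop combining marks so that accented letters match their base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_interfiled(title: str) -> str:
    """Sort key that files Ü together with U."""
    return fold(title).casefold()


def key_expanded(title: str) -> str:
    """Sort key that files Ü as UE."""
    # Normalize to NFC first so a decomposed "U" + diaeresis is also expanded.
    composed = unicodedata.normalize("NFC", title)
    return fold(composed.translate(UMLAUT_EXPANSION)).casefold()


titles = ["Zebra", "\u00dcber die Schwalbe", "Udo", "Amsel", "Ufer"]
print(sorted(titles))                      # plain code-point order
print(sorted(titles, key=key_interfiled))  # Ü interfiled with U
print(sorted(titles, key=key_expanded))    # Ü filed as UE
```

The NFC step matters because the same visible Ü can be stored either as one code point or as U plus a combining diaeresis, and the two forms do not compare equal unless they are normalized first.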

12bnielsen
Nov 11, 2021, 5:12 pm

>10 Keeline: Yes. I always set it to (start), so I don't have to worry about LT doing something clever that I don't want. Most of the time I don't mind how LT sorts stuff, since I try to design my searches to only give a few results of which I can pick the interesting ones by eye.

I consider sorting "correctly" a hard problem, that I tend to avoid :-)

And if necessary I'll export the data and write my own sort routine. I'm not kidding. I have a script that extracts series information from the Comment field and sorts by that.

cat /tmp/lt.rdb | perl /tmp/row Comment mat '/, bind 0-9/i' | perl /tmp/column Comment Title | perl /tmp/headchg -del | sed -e 's/\(^\*, bind 0-9\+\).*\t\(.*\))|(.*)/\1\t\2/' | sed -e 's/.*\\(^\+, bind 0-9\+\).*\t\(.*\)/\1\t\2/' | sort -V

giving a list like this:
...
Tempe Brennan, bind 1 Déjà Dead
Tempe Brennan, bind 2 Death du jour
Tempe Brennan, bind 3 Deadly decisions
Tempe Brennan, bind 4 Fatal voyage
Tempe Brennan, bind 5 Grave Secrets
Tempe Brennan, bind 6 Bare bones
Tempe Brennan, bind 7 Monday mourning
Tempe Brennan, bind 8 Cross bones
Tempe Brennan, bind 9 Break no bones
Tempe Brennan, bind 10 Bones to Ashes
Tempe Brennan, bind 11 Devil Bones
Tempe Brennan, bind 11 Devil Bones
Tempe Brennan, bind 12 206 Bones
Tempe Brennan, bind 13 Mortal remains
Tempe Brennan, bind 13 Mortal remains
...

Hmm, I probably have both a hardcover and a paperback version of two of the titles.

13Keeline
Nov 11, 2021, 5:23 pm

>11 MarthaJeanne: , I can guess what happened in the program. The system did not have a hint of how to sort the Ü so it ignored that character and looked at the next one in

Über die Schwalbe

Normally in ASCII character sets the capital letters have a lower number than their lower-case counterparts. However, why the lower-case "b" would sort before a capital "B" is a puzzle unless they have some custom code to deal with some sorting ideal.

A database table can have a character set and language defined. However, for a mixed collection like this, the default may not be very satisfactory.

So, if this can't be resolved at the site programming or the database side, we have to look at the tools available in the system to work out what may produce the desired result. It is like adding a space or a shift-space or an underscore at the beginning of a file name to force it to the top in a sort.

The option of including an unaccented version of the title in parentheses after the displayed title could help with searches. If the sort start-position field I mentioned could be used, it might help with displaying the title in a better position. In your example, something like

Über die Schwalbe (Uber die Schwalbe)

might have a position around 20 or so to catch that "U".
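As a quick check of that estimate, the position of the unaccented "U" can be computed. This Python sketch reports a 0-based index; whether LibraryThing's field counts from 0 or from 1 is not stated in this thread, so treat that as an open assumption. The function name sort_start is hypothetical.

```python
import unicodedata


def sort_start(title: str, marker: str = "(") -> int:
    """Return the 0-based index of the first character after marker."""
    # Normalize so that an accented letter counts as a single character.
    composed = unicodedata.normalize("NFC", title)
    return composed.index(marker) + len(marker)


title = "\u00dcber die Schwalbe (Uber die Schwalbe)"
print(sort_start(title))
```

For this title the function returns 19, which is the 20th character when counting from 1.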

As we long-time LT users recall, content in parentheses or after a colon is not considered by the auto-combination code.

It may be tedious but it is a question of how important the sort in a list view is.

When I do searches in documents with typographer's quotes and apostrophes, I routinely have problems when I don't type in the character just so in my search. It is another facet of the same problem. Computers are only as smart as the coders behind them and some edge cases are tough to work around.

James

14Keeline
Nov 11, 2021, 5:35 pm

>12 bnielsen: , If I am reading that line correctly, you have scripts stored in /tmp that you are piping into? It's a little hard to tell what each one does since they don't seem to be standard ones. I recognize the complexity and can read the regular expressions.

Is the main point of this to sort by the volume number and keep the single-digit volumes at the beginning? If so, I might attempt another approach, perhaps with awk and sort -n. But if it works for you (and you know how it works) then that's all that is important.

For my library I am more likely to do a search than make a list. When I do, say for a series, I will include the volume information with leading zeros as appropriate and perhaps a printing year suffix. Then I will do my sort by year and comment fields. It generally works out for me and works within what is available in LT.

James

15MarthaJeanne
Nov 11, 2021, 5:39 pm

Seeing that German language institutions cannot agree on where to sort it, nor on how to code the character, and that other languages also use it, there is not going to be a 'right' way to do it. That these characters seem to end up after A and after Z is just something to try to remember.

16bnielsen
Nov 11, 2021, 5:47 pm

And things can be even worse. Look for the recent "Trojan source" vulnerability. Or see here for an invisible backdoor in node.js code:
https://certitude.consulting/blog/en/invisible-backdoor/

Exporting my LT catalogue gives some non UTF-8 byte sequences, so maybe that's also possible inside LT?

Anyway, I just wanted to say that there are worse things than diacritics :-)

17bnielsen
Nov 11, 2021, 6:10 pm

>14 Keeline: Yes, I have a rather complicated script that among other things loads up some scripts in /tmp. Some of them allow me to treat a slightly modified version of the TSV export file as a database. /tmp/row selects rows. /tmp/column selects columns.

The idea in the one-liner above is to read lines in the Comment field where I've put information like
Tempe Brennan, bind 7
(bind = Danish word for Volume)
I then use this as the sort key and add the titles of the books.
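The same idea can be sketched in Python without the /tmp helper scripts. This is an illustrative stand-in, not the script described here; the function name series_listing and the input shape (pairs of comment and title) are assumptions.

```python
import re

# Matches lines such as "Tempe Brennan, bind 7" ("bind" is Danish for volume).
SERIES_RE = re.compile(r"^(?P<series>.+), bind (?P<volume>\d+)\s*$", re.IGNORECASE)


def series_listing(rows):
    """Return sorted (series, volume, title) tuples from (comment, title) pairs."""
    found = []
    for comment, title in rows:
        # A comment may hold several lines; check each one separately.
        for line in comment.splitlines():
            match = SERIES_RE.match(line.strip())
            if match:
                found.append((match["series"], int(match["volume"]), title))
    # The volume is an int, so volume 2 sorts before volume 10.
    return sorted(found)


rows = [
    ("Tempe Brennan, bind 10", "Bones to Ashes"),
    ("Tempe Brennan, bind 2", "Death du jour"),
    ("Tempe Brennan, bind 1", "D\u00e9j\u00e0 Dead"),
]
for series, volume, title in series_listing(rows):
    print(f"{series}, bind {volume}\t{title}")
```

Converting the volume to an integer gives the same numeric ordering that sort -V or zero-padded volume numbers provide.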

I mostly use it to see if I "almost" have all the books in a series. And secondarily if I have bought some books that belong to a series and want to read them in the correct order. I'm currently reading some old police procedurals by J. J. Marric and would like to read them in publication order.

I should also mention that using this simple script database was not my first idea but importing a TSV file into sqlite or lobase gave me grief because neither has documented their import functions and their limitations all that well. I.e. is "TAB" two fields separated by a TAB or one field containing a TAB or one field containing "TAB"?

I think my current script converts the TSV file with iconv and forces it to utf-8 (i.e. no illegal byte sequences). It then adds TABs if they are missing and adds an extra header line to the file.
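Those two clean-up steps might look like this in Python. It is a sketch under assumptions, not the script described above: it replaces invalid bytes with U+FFFD instead of calling iconv, and it assumes the first line is the header and that no field contains a line break. Both function names are hypothetical.

```python
def force_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def pad_rows(text: str) -> str:
    """Append TABs so every non-empty line has as many TABs as the header."""
    lines = text.split("\n")
    header_tabs = lines[0].count("\t")
    padded = []
    for line in lines:
        if line:
            # Only short rows are extended; longer rows are left untouched.
            line += "\t" * max(0, header_tabs - line.count("\t"))
        padded.append(line)
    return "\n".join(padded)
```

If the stray bytes turn out to come from a known legacy encoding, decoding those rows with that encoding would preserve the letters instead of replacing them.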

For convenience it then adds several extra columns computed from the original data but that's just frosting the cake :-)

18andyl
Nov 12, 2021, 5:07 am

>15 MarthaJeanne:

Yep. Sorting is just hard. Multilingual sorting is a nightmare.

As well as diacritics, if you were being 100% correct you would have to cope with digraphs which are considered distinct letters. In Welsh 'ng' can be a digraph, a letter in its own right, which should sort between g and h. Similarly 'dd', 'll', 'rh' (which sort after d, l, and r respectively) and a few more. So in 'unrhyw' the 'rh' is a digraph and in 'mawrhad' it is not. The Welsh have probably got used to English people mangling sort orders of digraphs though.
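A naive version of digraph-aware sorting can be sketched in Python. It greedily treats every 'dd', 'll', 'rh', 'ng' and so on as a single letter, so it gets words like 'mawrhad' wrong, exactly as described above; telling the two cases apart needs a dictionary, not an algorithm. The name welsh_key is made up for the example.

```python
# The Welsh alphabet in order, with digraphs as single letters.
WELSH = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i", "j",
    "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th", "u", "w", "y",
]
RANK = {letter: position for position, letter in enumerate(WELSH)}


def welsh_key(word: str) -> list:
    """Sort key that greedily reads digraphs as one letter (naive sketch)."""
    text = word.casefold()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])
            i += 2
        else:
            # Letters outside the Welsh alphabet are placed after all of it.
            key.append(RANK.get(text[i], len(WELSH) + ord(text[i])))
            i += 1
    return key


words = ["mam", "llaw", "lwc", "ddoe", "dydd"]
print(sorted(words))
print(sorted(words, key=welsh_key))
```

With this key 'lwc' files before 'llaw' and 'dydd' before 'ddoe', the reverse of plain code-point order.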

Some languages such as Spanish also had digraphs as letters until their Royal Spanish Academy changed the rules in the 90s.

Another thing to bear in mind is there are hundreds of different languages. Some may contain the same diacritics or same digraphs and choose to sort them in different places. I think the Swedes and Danes put Å in different places in their alphabets.

19MarthaJeanne
Edited: Nov 12, 2021, 5:54 am

>13 Keeline: Nope. Ö and Ü sort between A and B. I also have Ä in there, and À. Some Ö come after Z along with books from non-Latin alphabets. It would, of course, be nice if the Ös all sorted together, and I could go in and edit each Ö so that the underlying code all matches, but I can't be bothered. If I search on Ötzi it finds both books: the one that sorts between A and B, and the one that sorts after Z.