Forum:Semantic Search Bug
Semantic Search Bug
It seems that some pages cannot be individually excluded from SMW queries. That is to say, the condition [[!NAME]] should exclude the page "NAME" from the SMW results, but in some cases it currently does not.
This bug is what is ultimately responsible for certain items displaying Shared BiS when they should say Strict BiS. (The BiS logic is written so that the list of comparable items excludes the item being compared against; this approach prevents a race condition.)
Test case: A query to search for A False-Star of your Own which also excludes that page produces no results:
Query: [[A False-Star of your Own]][[!A False-Star of your Own]]
Results: (none).
Query: [[Wind-Bitten Reins]][[!Wind-Bitten Reins]]
Results: Wind-Bitten Reins.
This is getting a bit beyond my expertise but I wonder if the issue is because an internal ID is getting too high. When I do Debug magic, the filter seems to correspond to a WHERE clause in the SQL which says " (t3.smw_sortkey!='Wind-Bitten Reins' AND t3.smw_id=892794)."
That ID seems to be getting monotonically larger. For the False-Star its value is 5027; for the Overwrought Bathtub (which also exhibits this bug), its ID is 798889. If this hypothesis is correct then the bug would arise for essentially all items created past a certain date. I haven't been systematic about checking, but that seems to match my rough impression.
I know I'm usually the SMW expert in these parts but this is at a lower level than what I know how to deal with. I'm out of my depth for this one.
- PSGarak (talk) 03:27, 14 December 2022 (UTC)
- We’re aware of this issue and I’ve done my bit of digging before, as resident snoop!
- The rising IDs shouldn’t be a problem, it works on “Gant Bowler of Fiduciary Responsibility” which is also one of our newest items.
- My idea was that exclusion via plain page name is a form of undefined behavior. There are no mentions of this method anywhere and the SMW docs specifically instruct exclusion using property values. (Wildcards and comparators in inline queries can be used only on properties of type 'Text'.)
- If you debug a working example, SQL Explain even warns you that “Impossible WHERE noticed after reading const tables”, hence the empty result. Why it “works” and displays the page in some cases is anyone’s guess. If someone’s curious enough, this could be traced further.
- (I did find, however, that this might be connected to banners somehow? The method you mentioned seems to misbehave only on pages containing banners (like Christmas, Hallowmas, etc.) but I have no idea why that could affect SMW stuff. Or maybe that was just a coincidence. It’s unreliable, in any case.)
- We can just use the “Page ID” property for exclusion since it’s unique on every page.
- As for the BiS messages, I was aware of this cause but I’m currently doing a full rewrite of that anyway to account for more stuff so that’ll be fixed soon. CarrONoir (talk) 13:01, 14 December 2022 (UTC)
- Huh, I never knew that ! didn't work on pagename properties. I even found a page stating specifically that it doesn't work, which of course begs the question of why it ever worked at all. That help article mentions RevID, which is probably similar to your Page ID idea.
- Excluding the item itself is still desirable because it's non-deterministic whether a page will be included in a query from itself. But it might be more robust to exclude it in the Lua code rather than the SMW query?
- Good luck with the re-write!
- If the connection to banners is real, I suspect it's because banners set categories, so SMW data is depending on a SMW query. A quick test says that @annotation isn't fixing the problem, but as you say we're kinda in "why does this even work at all" territory so we've abandoned all hope of logic. PSGarak (talk) 16:20, 14 December 2022 (UTC)
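A minimal sketch of the "exclude it in code rather than in the SMW query" idea from the post above. On the wiki this would live in the Lua module; Python is used here only for illustration, and the function name `comparable_items` and the sample titles passed to it are hypothetical, not taken from the actual BiS module.

```python
def comparable_items(query_results, current_page):
    """Return the query results without the page doing the comparison.

    Filtering here means it no longer matters whether the SMW query
    happened to include the page itself.
    """
    return [title for title in query_results if title != current_page]


# Hypothetical result list that happens to include the querying page.
results = ["Wind-Bitten Reins", "Weasel-Infested Velocipede"]
print(comparable_items(results, "Wind-Bitten Reins"))
```

The same call is safe when the page is already absent from the results, so the non-deterministic self-inclusion stops affecting the comparison either way.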
- Curiouser and curiouser: if you do a query like [[~Wind-Bitten Reins]], that does a LIKE comparison on the smw_sortkey column, and that (correctly) returns Wind-Bitten Reins, with t1.smw_sortkey LIKE 'Wind-Bitten Reins' in the WHERE clause.
But if you do [[Wind-Bitten Reins]] [[~Wind-Bitten Reins]], you get no results, with a WHERE clause that includes (t3.smw_sortkey LIKE 'Wind-Bitten Reins' AND t3.smw_id=892794).
And meanwhile, [[Gant Bowler of Fiduciary Responsibility]] [[~Gant Bowler of Fiduciary Responsibility]] returns Gant Bowler of Fiduciary Responsibility, with an equivalent WHERE clause.
So, something is munging the smw_sortkey column but only when it's involved in a JOIN, and only in certain cases. Regardless of being able to negate the page name, that feels like a bug. A query like [[Category:Transport]] [[~W*]] (i.e. "show me every transport that starts with W", which should be valid) returns no results, even though it should have both Wind-Bitten Reins and Weasel-Infested Velocipede (which is another page that suffers from the best-in-slot problem). In fact nothing I've tried other than [[Category:Transport]] [[~*]] seems to match those two, so the smw_sortkey might be appearing as the empty string.
I'm not sure whether it would be useful for the BiS rewrite, but as a general-purpose workaround we could simply set a page name property on all items (or even all pages), which would allow you to do things like [[Category:Transport]] [[Page name::!Wind-Bitten Reins]], as well as things like [[Category:Transport]] [[Page name::~W*]]. Tirerim (talk) 21:20, 14 December 2022 (UTC)
- From what I can tell, [[Wind-Bitten Reins]][[!Wind-Bitten Reins]] doesn't search for a page (named Wind-Bitten Reins) and (not named Wind-Bitten Reins) but rather (THE page named Wind-Bitten Reins by id) and (pages whose sortkey is not Wind-Bitten Reins). Since seasonal and retired items have a non-default sortkey, they show up in the search. [[Wind-Bitten Reins]][[!^]], on the other hand, shows all pages that ARE Wind-Bitten Reins AND do not have ^ as their sortkey (since Wind-Bitten Reins has ^ as its sortkey, it gets excluded). Thorsb (talk) 18:34, 17 December 2022 (UTC)
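The "by id AND sortkey is not the name" reading can be reproduced with a small sketch, assuming a simplified stand-in for SMW's ID table. The table layout, the in-memory SQLite database, and the helper `self_excluding_query` are assumptions made for illustration. The smw_id values, the ^ sortkey and the WHERE shape come from this thread, while the False-Star's sortkey is assumed to default to its page name.

```python
import sqlite3

# Simplified stand-in for SMW's ID table (assumed layout, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE smw_object_ids ("
    "smw_id INTEGER PRIMARY KEY, smw_title TEXT, smw_sortkey TEXT)"
)
conn.executemany(
    "INSERT INTO smw_object_ids VALUES (?, ?, ?)",
    [
        # Seasonal item: sortkey overridden to "^" (per the thread).
        (892794, "Wind-Bitten Reins", "^"),
        # Ordinary item: sortkey assumed to default to the page name.
        (5027, "A False-Star of your Own", "A False-Star of your Own"),
    ],
)


def self_excluding_query(page_id, excluded_text):
    """Mimic [[NAME]][[!TEXT]]: match the page by id, compare TEXT to the sortkey."""
    rows = conn.execute(
        "SELECT smw_title FROM smw_object_ids "
        "WHERE smw_sortkey != ? AND smw_id = ?",
        (excluded_text, page_id),
    ).fetchall()
    return [row[0] for row in rows]


print(self_excluding_query(892794, "Wind-Bitten Reins"))      # sortkey "^" differs from the name
print(self_excluding_query(5027, "A False-Star of your Own")) # sortkey equals the name
print(self_excluding_query(892794, "^"))                      # the [[!^]] case
```

Under these assumptions the first query returns the Reins, while the second and third return nothing, which matches the behaviour reported above.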
- Oh man that's wild, thanks for finding that. I guess when you use a Text-based operator on Pagename it "casts" the pagename to text by using the sortkey.
Would it make sense to change the sortkey from just ^ on its own to using ^ as a prefix? I.e. the page's sortkey would be "^Wind-Bitten Reins". The current implementation of using just ^ means that seasonal content is excluded from all text-based Semantic Search on the Wiki (e.g. search for [[~Wind-*]]). Putting the full name back into the sortkey would leave Category pages as they are, while allowing text-searching of seasonal pages (albeit with extra care for those edge cases). PSGarak (talk) 03:42, 18 December 2022 (UTC)
- Yeah, I think that would make a lot of sense. And maybe for retired pages, too, though I'd want to make sure that there weren't any other implications to doing it for those: there are a lot of things that we do want to exclude retired pages from, but having them otherwise work like non-retired pages would be less confusing. Tirerim (talk) 21:46, 22 December 2022 (UTC)
- I'm still confused as to why the sortkey only gets set to ^ when there is a join in the query, though. Tirerim (talk) 22:11, 22 December 2022 (UTC)
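A rough sketch of what the proposed sortkey change would mean for wildcard searches, assuming a ~ condition is matched against the whole sortkey with * as the wildcard (as the LIKE clauses quoted earlier suggest). The helper `smw_like` is hypothetical and uses Python's `fnmatchcase` as a stand-in for SQL LIKE, so details such as case handling may differ from what SMW actually does.

```python
from fnmatch import fnmatchcase


def smw_like(sortkey, pattern):
    """Stand-in for a ~ condition: the pattern must match the whole sortkey.

    fnmatchcase also gives ? and [...] special meaning, which is close to
    but not necessarily identical to SMW's own wildcard handling.
    """
    return fnmatchcase(sortkey, pattern)


current_sortkey = "^"                    # seasonal pages today
proposed_sortkey = "^Wind-Bitten Reins"  # "^" used as a prefix instead

for pattern in ("Wind-*", "^Wind-*", "*Wind-*"):
    print(
        pattern,
        "| current:", smw_like(current_sortkey, pattern),
        "| proposed:", smw_like(proposed_sortkey, pattern),
    )
```

If the assumption holds, the bare ^ sortkey matches none of these patterns, and the prefixed sortkey matches only when the pattern accounts for the leading ^ (or starts with a wildcard). That would be the "extra care" for seasonal pages mentioned above.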