Are we wasting our time?/Some thoughts...

Fri Sep 24 14:41:59 UTC 2004

There seem to be two issues here, or possibly three if you include 
the legal POV.

Firstly, what people are likely to want from an improved search 
function.

Secondly, whether the way we are working is the best/most efficient 
way of creating the improved search.

(1) What do people want?

IMO, people want the following:

(a) to search for threads on topics/characters
(b) to search for posts by a particular author
(c) to be offered content to browse
(d) to have irrelevant or repetitive content filtered out
(e) to have their attention drawn to outstanding posts

I think it is important to have functions that enable you to look for 
what you want, but also menus to browse, which may prompt a whole 
string of new ideas. People really have no idea what is in the 
archives, and alongside their quest to find everything ever written 
on eg Snape, they should also be offered the chance to find 
collections of posts on a myriad of other topics that they'd never 
even thought of.

Once people find the type of content they are looking for, I cannot 
see why anyone would want to plough through endlessly repetitive, 
incorrect or trivial/mainly OT posts in their attempt to read up on 
past ideas. By the same token, surely anyone would like to have what 
are considered to be FPs flagged up, if only for the opportunity of 
disagreeing with the assessment?

I am stating the blindingly obvious perhaps, but I think it is 
necessary in order to point up the disadvantages of simply text-
searching the complete body of past posts. Anyone who uses Internet 
search engines regularly will be painfully aware of the difference 
between knowing exactly what you are looking for and locating it, and 
searching for relevant hits on questions you are still formulating; 
the rubbish quotient is formidable. As Debbie says (648):

>>>Text searching on a CD will bring up thousands of posts that will be
eliminated from the catalogue. Thus, the catalogue should be a much
better tool for directing readers to posts that are worth reading.
As a result, if we ever want FPs to actually be written, we'll want
a catalogue.<<<

This issue has come up before. David (526) said:

>>>I confess part of my thinking here is that, given we have a database
of the entire list here, why not use it? It would be great to be able
to call up all posts that, say, include the text
string 'Hermione' and are categorised as 'Snape'. This would give a
different (but overlapping) set to those categorised as 'Hermione'
and including the string 'Snape'. What about all posts
categorised 'Snape' but *not* including the text string 'vampire'?
Or all such posts whose author was Pippin, posted between GOF
release and OOP release. Really, I think you are sitting on
something of a goldmine here, and to focus only on the categories is
to miss out.<<<

My point is that it is important to have both the category approach, 
to prompt investigation, plus other types of search functions to 
locate content when you are more sure of what you are looking for. 
However, I am saying that it is not worth our while, or IMO, the 
members' time, to include all past posts in the content to be 
searched. Currently we are rejecting over 60%, and I think that is a 
good thing.
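
To make David's examples concrete, here is a minimal Python sketch of how category codes and plain text searching might be combined over the accepted posts only. Everything in it is an assumption for illustration: the Post record, its field names, the search function and the sample posts are invented, and none of it describes how the catalogue database is actually built. The two dates are the GoF and OoP publication dates.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for one accepted (catalogued) post.
@dataclass
class Post:
    number: int
    author: str
    posted: date
    categories: frozenset  # category codes assigned by the cataloguers
    text: str

def search(posts, category=None, contains=None, excludes=None,
           author=None, start=None, end=None):
    """Return the posts that satisfy every criterion that was supplied."""
    hits = []
    for p in posts:
        body = p.text.lower()
        if category is not None and category not in p.categories:
            continue
        if contains is not None and contains.lower() not in body:
            continue
        if excludes is not None and excludes.lower() in body:
            continue
        if author is not None and p.author != author:
            continue
        if start is not None and p.posted < start:
            continue
        if end is not None and p.posted > end:
            continue
        hits.append(p)
    return hits

GOF_RELEASE = date(2000, 7, 8)
OOP_RELEASE = date(2003, 6, 21)

# Made-up sample data, purely to exercise the function.
POSTS = [
    Post(1, "Pippin", date(2001, 3, 1), frozenset({"Snape"}),
         "Snape protects Hermione in PoA."),
    Post(2, "David", date(2002, 5, 9), frozenset({"Snape"}),
         "Is Snape a vampire?"),
    Post(3, "Pippin", date(2004, 1, 2), frozenset({"Hermione"}),
         "Hermione and Snape compared."),
]

# Categorised 'Snape' and mentioning 'Hermione'
snape_mentions_hermione = search(POSTS, category="Snape", contains="Hermione")
# Categorised 'Snape' but not mentioning 'vampire'
snape_no_vampire = search(POSTS, category="Snape", excludes="vampire")
# Categorised 'Snape', by Pippin, between the GoF and OoP releases
pippin_snape = search(POSTS, category="Snape", author="Pippin",
                      start=GOF_RELEASE, end=OOP_RELEASE)
```

The point of the sketch is only that the text search runs over posts that have already been accepted and coded, so rejected posts never appear in the results.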

Paul has recently put forward some new ideas on how to build this 
part of the catalogue which we are going to discuss with Tim shortly. 
Just as soon as something clear emerges, I'll post it here for 
comment from you all. It is perhaps the second most important part of 
the catalogue, after the initial weeding, sorting and categorising of 
the posts themselves.

Finally, Paul suggests (650) that the practicality of the CD approach 
is severely limited by people's computing power:

>>>Once they get the files,
their PC must be zippy enough to search the files. I imagine only a
tiny percentage of people have the tools to speedily and conveniently
search these files. I know I don't. (Example I know would be X1 or
Lotus Magellan.)<<<

The eventual catalogue presented to the members will be easily 
searchable by anyone who can access the HPfGU website. To aim for 
anything less defeats the object - which is to try and get everyone 
to read, think and add to what has gone before, rather than keep re-
inventing the wheel.

I also really like Tim's idea of having a 'what's hot' section 
eventually, which keeps tabs on the areas attracting most posts at 
any particular time. Ahem, this assumes a catalogue team able to 
continue on and on into the far distant future.
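
As a rough illustration of what a 'what's hot' list could involve, the sketch below simply counts coded posts per category within one period. The function name, the (period, category) pairs and the sample data are all assumptions of mine, not anything Tim has specified.

```python
from collections import Counter

def whats_hot(coded_posts, period, top=3):
    """Rank categories by number of posts in one period.

    coded_posts: iterable of (period, category) pairs, e.g. ("2004-09", "Snape").
    This input shape is hypothetical.
    """
    counts = Counter(cat for when, cat in coded_posts if when == period)
    return counts.most_common(top)

# Made-up sample data
sample = [
    ("2004-09", "Snape"),
    ("2004-09", "Snape"),
    ("2004-09", "Harry"),
    ("2004-08", "Snape"),
]
hot_now = whats_hot(sample, "2004-09")
```

Run over successive periods, the same counts would also show which subjects are growing or fading.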

(2) Our approach - right/wrong?

Dan comments (647):
>>>Carolyn - oh, absolutely, if it's possible. This is what I was doing
on my own as well. I always felt that the catalogue project, if
possible, should take this kind of thing and provide "approaches" to
the text, ways to look at it, suggestions of catagories or kwic
(keyword in context) ways to see it, results from some people's
"regular expression" searches on the material.

What it doesn't do, however, is deal with cataloguing FPs.<<<<

From a past YM conversation with Dan, what I think he is referring to 
here is an approach which takes the existing body of 100,000+ posts, 
and first of all sets out to allocate them to a number of main 
categories. He then suggests that we go over those five main 
categories again using ever finer definitions.

The five main categories he suggested we might use were: reject, 
meta, star (funny, cute etc.), plot (incl characters, WW), outcome 
theory. He felt that reading posts initially to reject or accept 
would be very fast, and would quickly separate out the wheat from the 
chaff. The task of refining the coding on the accepted posts would 
then be a lot quicker.

The kwic (key word in context) approach would be used (I think) to 
determine both the initial main categories and the subsequent sub-
categories. Kwic describes how people quickly decide what gets their 
real attention, and what is just scanned (for example in assessing 
posts to read on the main list). It equates to a list of personal 
buzz words/phrases, in effect.
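
For anyone unfamiliar with the term, a keyword-in-context listing in the usual concordance sense shows each occurrence of a word with a little of the text either side. The Python sketch below is a minimal version under that assumption; it may not be exactly what Dan has in mind, and the function name, the width parameter and the sample sentence are invented for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """List each whole-word occurrence of keyword with `width` characters
    of context on either side (case-insensitive)."""
    pattern = r"\b%s\b" % re.escape(keyword)
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

sample = "Snape is not a vampire. I think Snape is loyal."
for line in kwic(sample, "snape", width=10):
    print(line)
```

Scanning a column of such lines is one quick way to judge which posts deserve a proper read.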

Essentially, he is arguing against building up and refining the list of 
categories from scratch as we have done, based on what we find post-
by-post, and proposing instead a top-down approach based on people's 
collective search preferences. [Correct me if I have misunderstood, 
Dan; I am also not sure how far you are suggesting that any of this 
is automated.]

My reservations about this approach are as follows:
(a) It is just not possible to run through 100,000+ posts any more 
quickly than we are doing. If we are reading them once, and deciding 
whether to accept or reject, in my view it takes very little extra 
time to add the appropriate coding. The ones that take a long time to 
read and code would take just as long by either route.

(b) The resulting main categories and sub-categories from the kwic 
approach are unlikely to be vastly different from those we are 
working with anyway.

(c) The organic, bottom-up approach reflects the way the list 
developed and evolves with the subjects. As we continue, we will be 
able to see which subjects attract the most posts in a highly 
scientific way, and make decisions on which topics to drop through 
lack of interest, and which to fragment further to reflect their 
growing complexity as ideas build on ideas. I think this reflects the 
membership's interests over time as accurately as Dan's proposed 
method.

However, if there is a consensus that we have set ourselves an un-
doable task and we would be better to stop and re-think, and consider 
a different approach, then do make some constructive suggestions, all 
of you.

Carolyn
Masochistically pleased to be able to get back into the catalogue 
again after a few days' absence.