Welcome and questions on Stage One

Mon Feb 23 18:57:42 UTC 2004

Welcome everyone to the new HP catalogue group, and thanks again to 
Kelley for setting this up so quickly for us; I hope it helps with 
communication. 

As we have all got the various background documents, I'll launch 
straightaway into a series of topics that we probably need to discuss 
and agree before we go on. I've divided them up into different posts, 
as it will probably help to deal with them separately. The topics 
are:

1 (this post) stage one of the plan and some technical stuff
2 stage two – reject codes
3 stage two – subject categories

Looking forward to your responses, and to any further questions of your own.

STAGE ONE

We are looking to index approximately 100 000 posts on the main, 
current HPfGU list, plus approx 7000 posts on a pre-Aug 2000 archive 
list. (The early list is the beginning of the group before it 
transferred to Yahoo.)

It's a lot of posts, and the first time-consuming task is to get the 
complete message index in one place for people to work from. In my 
plan, under 'Stage One', I simply suggested cutting and pasting the 
message indexes from the HPfGU site onto Excel worksheets. There are 
two possible ways this can be speeded up.

Firstly, Paul Kippes, the Admin team's technowizard (who maintains 
the back-up archives) could probably cut and paste the whole index in 
one go onto Excel for us. Kelley has just sent him an email to see if 
he would be prepared to give us some help on this project, and this 
is one of the first questions I'd like to put to him. My other 
questions, apart from whether he can do it, are:

-	What is the best size of spreadsheet to work with? (Units of 
500 posts, 1000 posts, or something else? Threads are no respecters 
of arbitrary boundaries, so it has to be easy to follow on from one 
sheet to the next when tracing an argument.)
-	Where should the Excel sheets be based – on our individual 
computers, or on a server somewhere? They certainly need to end up 
in one central place for obvious reasons.
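
To make the sheet-size question a bit more concrete, here is a rough Python sketch (purely illustrative, not something we have agreed to use) of how a fixed sheet size maps message numbers onto sheets, and how to see which sheets a single thread would straddle. The function names and the message numbers in the example are made up for illustration.

```python
def sheet_number(message_number, sheet_size=1000):
    """Return the 1-based sheet a message falls on.

    Assumes messages are numbered from 1 with no gaps, so that with a
    sheet size of 1000, messages 1-1000 are on sheet 1, 1001-2000 on
    sheet 2, and so on.
    """
    return (message_number - 1) // sheet_size + 1


def sheets_for_thread(message_numbers, sheet_size=1000):
    """Return the sorted list of sheets that a thread's posts touch."""
    return sorted({sheet_number(n, sheet_size) for n in message_numbers})


# Hypothetical thread whose posts sit either side of a sheet boundary.
thread = [998, 999, 1003, 1010]
print(sheets_for_thread(thread, sheet_size=1000))
print(sheets_for_thread(thread, sheet_size=500))
```

Whatever size we choose, some threads will cross a boundary, so the practical question is how easy it is to jump from one sheet to the next rather than finding a size that avoids the problem.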

Secondly, there is a rather more complex solution, which would take 
longer to set up but would save a lot of time later. The gist of this 
is that all the old posts, with their existing headings, could be put 
into a database (not Excel), and Carolina could write a link programme 
between this database and the website we use to access it. This would 
enable us not only to work on it simultaneously, but also to overcome 
any computer incompatibility problems that might crop up, especially 
as more people join the project as we go on. My questions relating to 
this solution are:
-	Can the archive posts just be dumped into such a database, 
and would you do it all at once, or in orderly chunks of e.g. 1000 at 
a time?
-	How long would it take?
-	Where would the database be located? (What server, as before 
with the Excel sheets?)
-	Which website would we use to access it, and would there be 
any major problems with multiple people accessing it at any one time?
-	Will it still be possible to speed read the posts one after 
another if we do this? (This is central to rejecting and coding them 
up.)
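
As a rough illustration of what 'dumping in orderly chunks' and 'reading one after another' might look like, here is a minimal Python sketch using SQLite as a stand-in for whatever database is eventually chosen. Everything in it is an assumption for the sake of example: the table layout, the column names (`reject_code` and `category` are placeholders for the stage two codes), and the sample rows. It is not a description of what Carolina would actually build, and a single SQLite file would not by itself answer the multi-user question.

```python
import sqlite3


def create_archive(conn):
    """Create a hypothetical table with one row per archived post."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " number INTEGER PRIMARY KEY,"
        " subject TEXT NOT NULL,"
        " author TEXT,"
        " reject_code TEXT,"   # placeholder for a stage two reject code
        " category TEXT)"      # placeholder for a stage two subject category
    )


def load_in_chunks(conn, rows, chunk_size=1000):
    """Insert (number, subject, author) rows in chunks; return the count."""
    loaded = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        with conn:  # commits the chunk, or rolls it back if an error occurs
            conn.executemany(
                "INSERT INTO posts (number, subject, author) VALUES (?, ?, ?)",
                chunk,
            )
        loaded += len(chunk)
    return loaded


def next_uncoded(conn, after_number=0):
    """Return the next post, in message order, that has no code yet."""
    return conn.execute(
        "SELECT number, subject, author FROM posts"
        " WHERE number > ? AND reject_code IS NULL AND category IS NULL"
        " ORDER BY number LIMIT 1",
        (after_number,),
    ).fetchone()


# Made-up sample data standing in for the real message index.
conn = sqlite3.connect(":memory:")
create_archive(conn)
rows = [(n, f"Subject {n}", "someone") for n in range(1, 2501)]
print(load_in_chunks(conn, rows, chunk_size=1000))
print(next_uncoded(conn))
```

Loading in chunks means a problem part-way through only affects the chunk in progress, and stepping through posts in message order with something like `next_uncoded` is one way the speed reading could be kept.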

The overall advantage of this second solution, although it sounds 
rather complex, is that once you have coded up the posts, you can 
then do much more sophisticated searches for topics than you would 
be able to do in Excel. This will help later, when you want to group 
sets of posts together in threads for further editing (if we ever get 
that far...). It would also become a great resource for HPfGU members 
to search, if it were put up on the main site in a read-only form.

However, if we stick with Excel and solution one, it is possible to 
write search routines that can find strings of words and characters, 
to enable us to group together individual posts belonging to threads.
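
For anyone wondering what such a search routine might amount to, here is a small Python sketch of the idea (the Excel version would be a macro, but the logic is the same). It groups posts by subject line after stripping reply prefixes such as 'Re:', and finds posts whose subject contains a given phrase. The function names and the sample posts are invented for illustration, and it assumes that posts in a thread share a subject line, which will not always hold when people change the heading mid-thread.

```python
import re
from collections import defaultdict

# One or more leading "Re:" / "Fw:" / "Fwd:" prefixes, in any letter case.
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalise_subject(subject):
    """Strip reply prefixes, collapse whitespace and lower-case the subject."""
    stripped = _REPLY_PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_by_subject(posts):
    """Map each normalised subject to the message numbers that share it."""
    threads = defaultdict(list)
    for number, subject in posts:
        threads[normalise_subject(subject)].append(number)
    return dict(threads)


def find_posts(posts, phrase):
    """Return message numbers whose subject contains the phrase."""
    phrase = phrase.lower()
    return [number for number, subject in posts if phrase in subject.lower()]


# Made-up (message number, subject) pairs.
posts = [
    (101, "Snape's motives"),
    (102, "Re: Snape's motives"),
    (105, "RE: Re:  Snape's Motives"),
    (106, "Wand cores"),
]
print(group_by_subject(posts))
print(find_posts(posts, "wand"))
```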

Thoughts please (and Carolina, apologies if I have not explained 
this correctly).

Carolyn