Welcome and questions on Stage One

a_silmariel silmariel at telefonica.net
Mon Feb 23 22:39:52 UTC 2004


If you receive this twice, you know who to blame and it's not Barry
this time.


I'll try to clarify some things. It is not that we have to choose one
approach or the other. We'll use Excel first because it's easier, but any
work done will be put into a MySQL database as soon as it's ready.

And of course, Barry, if you prefer to work directly with the group
database, do so; it provides a nice 'Export Table' function.

No, Excel is not a database; as Barry has pointed out, it is not nearly as
effective at handling data. After all, that's what databases are designed
for.

What I really want is for no one to have to do the same work twice.

STAGE ONE:

We have the first set of information, which from my PoV is the worst of the
work, because it requires reading, rejecting and categorising 100,000 posts
that are kept prisoner online by yahoomort.

What information do we have in the spreadsheet?

1 -- Post information: 

The raw material extracted from the message index, copied and pasted
into a 
spreadsheet. 

2 -- Cataloguing Information:

Codes for rejection and up to six categories. Six categories is an arbitrary
decision. Just using the number of the category is enough; the name can be
included but is not necessary.

3 -- Control information:

It is not in the spreadsheet, but I strongly suggest that we include another
code for the reviewer, from the start.
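As a sketch, here is what one catalogued row could look like in .csv form, assuming a hypothetical column layout: post number, a rejection flag, up to six category numbers and a reviewer code (none of these column names are decided yet).

```python
import csv
import io

# Hypothetical column layout; nothing here is fixed yet.
FIELDS = ["post_no", "rejected", "cat1", "cat2", "cat3",
          "cat4", "cat5", "cat6", "reviewer"]

# One example row: post 12345, not rejected, two categories, reviewer "BC".
row = {"post_no": 12345, "rejected": 0, "cat1": 7, "cat2": 12, "reviewer": "BC"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)  # unused category columns are simply left blank
print(buf.getvalue())
```

Whatever layout we settle on, keeping the category slots as plain numbers makes the file trivial to load into the database later.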

> -     what is the best size of spreadsheet to work with ? (Units of
> 500, 1000 posts or whatever ? Threads are no respecter of arbitrary
> boundaries.. it has to be easy to follow on from one sheet to the
> next when tracing an argument)

The actual size of a manageable file varies; some programs start having
problems with very large documents. From my side it makes no difference:
each spreadsheet can contain a different number of posts.

I say it makes no difference because reliability tests against the MySQL
database are done with millions of records, so from the technical side, all
posts could go in one spreadsheet.

> -     where should the Excel sheets be based - on our individual
> computers, or on a server somewhere ? They certainly need to end up
> in one central place for obvious reasons.
        
Now let's save that Excel (or other spreadsheet) file in .csv format. You
can make it as long as you want; the only restriction is one sheet per
spreadsheet.

Then send me the .csv document. 

I'll put that info into a database, retrieve it in any number of ways, and
produce .csv files that can be opened by spreadsheets.
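The round trip described above (spreadsheet to .csv, .csv into the database, database back out to .csv) can be sketched with Python's standard library. Here sqlite3 stands in for MySQL just to keep the example self-contained, and the table and column names are invented:

```python
import csv
import io
import sqlite3

# A small .csv as it might come out of a spreadsheet (header + data rows).
csv_text = "post_no,subject,rejected\n1,Welcome,0\n2,Spam,1\n"

# Load the .csv into a database table (sqlite3 stands in for MySQL here;
# the SQL would be essentially the same).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_no INTEGER, subject TEXT, rejected INTEGER)")
rows = list(csv.DictReader(io.StringIO(csv_text)))
conn.executemany("INSERT INTO posts VALUES (:post_no, :subject, :rejected)", rows)

# Retrieve it any way we like and write it back out as .csv.
kept = conn.execute("SELECT post_no, subject FROM posts WHERE rejected = 0").fetchall()
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["post_no", "subject"])
writer.writerows(kept)
print(out.getvalue())
```

The point of the sketch is only that the .csv file is the interchange format: the spreadsheet and the database never need to talk to each other directly.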

I'll have a database, and can open it to remote access for those brave
users 
who want to work from a text-only remote console.

STAGE TWO:

Making it possible to work online.

That's what I should have been doing this weekend: planning this in more
detail.

One option is to access the database remotely via a client, but I have not
yet looked for nice graphical utilities that may already exist.

Another option is to access it through forms on a web page.

> [...] database and the website we use to access it, to enable
> us to not only work simultaneously on it, but also to overcome any
> computer incompatibility problems that might crop up, especially as
> more people work on the project as we go on. My questions relating to
> this solution are:

> -     Can the archive posts just be dumped into such a database,
> and would you do it all at once, or in orderly chunks of eg 1000 at a
> time?

> -     Where would the database be located (what server, as before
> with the Excel sheets)

> -     How long would it take?

The database will already be set up. Dump it entirely, send it to another
computer, start a MySQL database (it runs on Linux and Windows and is free
to use, no licenses to pay), grab the dump you made and insert it into the
second database. This can be done in a day.
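That dump-and-reload step might look like this. With MySQL it would be `mysqldump` on one machine fed into `mysql` on the other; as a self-contained sketch, sqlite3's `iterdump` does the same job (the table name and contents are invented):

```python
import sqlite3

# Source database with some catalogued posts.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE posts (post_no INTEGER, subject TEXT)")
src.execute("INSERT INTO posts VALUES (1, 'Welcome')")
src.commit()

# "Dump it entirely": serialise the whole database as SQL statements
# (mysqldump does the same job for MySQL).
dump = "\n".join(src.iterdump())

# On the other computer: start a fresh database and replay the dump.
dst = sqlite3.connect(":memory:")
dst.executescript(dump)
restored = dst.execute("SELECT post_no, subject FROM posts").fetchall()
print(restored)
```

Because the dump is just text, it can be emailed or copied like any other file, which is why moving the whole database between computers is a one-day job.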

> -     Which website would we use to access it and would there be
> any major problems in multiple people accessing it any one time?

I think limiting to 30 users at a time is the standard security measure to
avoid DoS attacks on small servers, but my plans include a very 'light' (in
the sense of the amount of data transmitted) web interface, so we should not
be a problem for the HTTP server.

The database can be on another computer or on the same one.

Basically we need a fixed IP address, a 256 kbps 24h/day connection and a
computer. With that and Apache we are independent. Apache is an HTTP server
that runs on both Windows and Linux and is also free; I can set it up or get
someone to do it.

Barry, which Java version does your Mac run?

STAGE THREE:

Absorbing the full posts into the database, not only the index information.

Honestly, I have no idea how to do it from Yahoo (I have an idea, but it is
surely illegal). I'd be glad to hear what Paul has to say.

Insert that new info in the database.

Absorbing previous groups into the database. This will create a conflict
with post numbers that will somehow have to be solved.

I'm tired, it is late, but I thought I had to try to explain some things.
The key is that no matter how you do it, your work won't be lost: either
through the group database (I will iron my hands a little for not having
proposed it myself) or through a spreadsheet.

I know all of you have questions. Shoot.

Carolina






More information about the HPFGU-Catalogue archive