After turning in a big unit earlier today, I have received 2 smp units on this machine, which suggests that big units are in short supply at the moment.

 

Same here, only smp coming down now.

 

Edit: And they are rather troublesome to deal with. It takes forever before a finished WU can be uploaded, if it goes up at all. Sometimes the server says the WU is faulty. On top of that, the client takes 2 to 3 hours before a new WU comes down. It looks as if either the WU is faulty and gets deleted, or no server is available. I am now close to shutting down the machines that run smp.

 

Edit2: Same problem again. I now have 2 WUs in the queue that will not upload. I have had this before on another machine, where they ended up being deleted. I see on the forum that this has been going on for a while and that it is most likely because the server(s) are overloaded. I assume this happens because there is a shortage of big units of every kind, so everyone is now running standard smp.

 

There is little point in folding WU after WU without getting any of them sent, so I am now stopping the rigs that have this problem until things sort themselves out. The log from one of the rigs looks like this:

 

 

[09:06:48] Completed 495000 out of 500000 steps (99%)

[09:09:17] Completed 500000 out of 500000 steps (100%)

[09:09:18] DynamicWrapper: Finished Work Unit: sleep=10000

[09:09:28]

[09:09:28] Finished Work Unit:

[09:09:28] - Reading up to 5030784 from "work/wudata_06.trr": Read 5030784

[09:09:28] trr file hash check passed.

[09:09:28] - Reading up to 5401708 from "work/wudata_06.xtc": Read 5401708

[09:09:28] xtc file hash check passed.

[09:09:28] edr file hash check passed.

[09:09:28] logfile size: 326344

[09:09:28] Leaving Run

[09:09:33] - Writing 10791240 bytes of core data to disk...

[09:09:34] Done: 10790728 -> 10205249 (compressed to 94.5 percent)

[09:09:34] ... Done.

[09:09:34] - Shutting down core

[09:09:34]

[09:09:34] Folding@home Core Shutdown: FINISHED_UNIT

[09:09:34] CoreStatus = 64 (100)

[09:09:34] Unit 6 finished with 97 percent of time to deadline remaining.

[09:09:34] Updated performance fraction: 0.876799

[09:09:34] Sending work to server

[09:09:34] Project: 7504 (Run 20, Clone 80, Gen 2)

 

 

[09:09:34] + Attempting to send results [November 10 09:09:34 UTC]

[09:09:34] - Reading file work/wuresults_06.dat from core

[09:09:34] (Read 10205761 bytes from disk)

[09:09:34] Connecting to http://128.143.199.97:8080/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:8080)

[09:09:34] + Retrying using alternative port

[09:09:34] Connecting to http://128.143.199.97:80/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:80)

[09:09:34] - Error: Could not transmit unit 06 (completed November 10) to work server.

[09:09:34] - 1 failed uploads of this unit.

[09:09:34] Keeping unit 06 in queue.

[09:09:34] Trying to send all finished work units

[09:09:34] Project: 7504 (Run 2, Clone 170, Gen 0)

 

 

[09:09:34] + Attempting to send results [November 10 09:09:34 UTC]

[09:09:34] - Reading file work/wuresults_05.dat from core

[09:09:34] (Read 10185928 bytes from disk)

[09:09:34] Connecting to http://128.143.199.97:8080/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:8080)

[09:09:34] + Retrying using alternative port

[09:09:34] Connecting to http://128.143.199.97:80/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:80)

[09:09:34] - Error: Could not transmit unit 05 (completed November 10) to work server.

[09:09:34] - 4 failed uploads of this unit.

 

 

[09:09:34] + Attempting to send results [November 10 09:09:34 UTC]

[09:09:34] - Reading file work/wuresults_05.dat from core

[09:09:34] (Read 10185928 bytes from disk)

[09:09:34] Connecting to http://130.237.165.141:8080/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (130.237.165.141:8080)

[09:09:34] + Retrying using alternative port

[09:09:34] Connecting to http://130.237.165.141:80/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (130.237.165.141:80)

[09:09:34] Could not transmit unit 05 to Collection server; keeping in queue.

[09:09:34] Project: 7504 (Run 20, Clone 80, Gen 2)

 

 

[09:09:34] + Attempting to send results [November 10 09:09:34 UTC]

[09:09:34] - Reading file work/wuresults_06.dat from core

[09:09:34] (Read 10205761 bytes from disk)

[09:09:34] Connecting to http://128.143.199.97:8080/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:8080)

[09:09:34] + Retrying using alternative port

[09:09:34] Connecting to http://128.143.199.97:80/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (128.143.199.97:80)

[09:09:34] - Error: Could not transmit unit 06 (completed November 10) to work server.

[09:09:34] - 2 failed uploads of this unit.

 

 

[09:09:34] + Attempting to send results [November 10 09:09:34 UTC]

[09:09:34] - Reading file work/wuresults_06.dat from core

[09:09:34] (Read 10205761 bytes from disk)

[09:09:34] Connecting to http://130.237.165.141:8080/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (130.237.165.141:8080)

[09:09:34] + Retrying using alternative port

[09:09:34] Connecting to http://130.237.165.141:80/

[09:09:34] - Couldn't send HTTP request to server

[09:09:34] + Could not connect to Work Server (results)

[09:09:34] (130.237.165.141:80)

[09:09:34] Could not transmit unit 06 to Collection server; keeping in queue.

[09:09:34] + Sent 0 of 2 completed units to the server

[09:09:34] - Preparing to get new work unit...

[09:09:34] Cleaning up work directory

[09:09:34] + Attempting to get work packet

[09:09:34] Passkey found

[09:09:34] - Will indicate memory of 7980 MB

[09:09:34] - Connecting to assignment server

[09:09:34] Connecting to http://assign.stanford.edu:8080/

[09:09:35] Posted data.

[09:09:35] Initial: 8F80; - Successful: assigned to (128.143.199.97).

[09:09:35] + News From Folding@Home: Welcome to Folding@Home

[09:09:35] Loaded queue successfully.

[09:09:35] Sent data

[09:09:35] Connecting to http://128.143.199.97:8080/

[09:09:37] Posted data.

[09:09:37] Initial: 0000; - Receiving payload (expected size: 2167105)

[09:09:40] - Downloaded at ~705 kB/s

[09:09:40] - Averaged speed for that direction ~340 kB/s

[09:09:40] + Received work.

[09:09:40] Trying to send all finished work units

[09:09:40] Project: 7504 (Run 2, Clone 170, Gen 0)

 

 

[09:09:40] + Attempting to send results [November 10 09:09:40 UTC]

[09:09:40] - Reading file work/wuresults_05.dat from core

[09:09:40] (Read 10185928 bytes from disk)

[09:09:40] Connecting to http://128.143.199.97:8080/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (128.143.199.97:8080)

[09:09:40] + Retrying using alternative port

[09:09:40] Connecting to http://128.143.199.97:80/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (128.143.199.97:80)

[09:09:40] - Error: Could not transmit unit 05 (completed November 10) to work server.

[09:09:40] - 5 failed uploads of this unit.

 

 

[09:09:40] + Attempting to send results [November 10 09:09:40 UTC]

[09:09:40] - Reading file work/wuresults_05.dat from core

[09:09:40] (Read 10185928 bytes from disk)

[09:09:40] Connecting to http://130.237.165.141:8080/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (130.237.165.141:8080)

[09:09:40] + Retrying using alternative port

[09:09:40] Connecting to http://130.237.165.141:80/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (130.237.165.141:80)

[09:09:40] Could not transmit unit 05 to Collection server; keeping in queue.

[09:09:40] Project: 7504 (Run 20, Clone 80, Gen 2)

 

 

[09:09:40] + Attempting to send results [November 10 09:09:40 UTC]

[09:09:40] - Reading file work/wuresults_06.dat from core

[09:09:40] (Read 10205761 bytes from disk)

[09:09:40] Connecting to http://128.143.199.97:8080/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (128.143.199.97:8080)

[09:09:40] + Retrying using alternative port

[09:09:40] Connecting to http://128.143.199.97:80/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (128.143.199.97:80)

[09:09:40] - Error: Could not transmit unit 06 (completed November 10) to work server.

[09:09:40] - 3 failed uploads of this unit.

 

 

[09:09:40] + Attempting to send results [November 10 09:09:40 UTC]

[09:09:40] - Reading file work/wuresults_06.dat from core

[09:09:40] (Read 10205761 bytes from disk)

[09:09:40] Connecting to http://130.237.165.141:8080/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (130.237.165.141:8080)

[09:09:40] + Retrying using alternative port

[09:09:40] Connecting to http://130.237.165.141:80/

[09:09:40] - Couldn't send HTTP request to server

[09:09:40] + Could not connect to Work Server (results)

[09:09:40] (130.237.165.141:80)

[09:09:40] Could not transmit unit 06 to Collection server; keeping in queue.

[09:09:40] + Sent 0 of 2 completed units to the server

[09:09:40] + Closed connections

[09:09:40]

[09:09:40] + Processing work unit

[09:09:40] Core required: FahCore_a3.exe

[09:09:40] Core found.

[09:09:40] Working on queue slot 07 [November 10 09:09:40 UTC]

[09:09:40] + Working ...

[09:09:40] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 07 -np 8 -checkpoint 3 -verbose -lifeline 2675 -version 634'

 

[09:09:40]

[09:09:40] *------------------------------*

[09:09:40] Folding@Home Gromacs SMP Core

[09:09:40] Version 2.27 (Dec. 15, 2010)

[09:09:40]

[09:09:40] Preparing to commence simulation

[09:09:40] - Looking at optimizations...

[09:09:40] - Created dyn

[09:09:40] - Files status OK

[09:09:41] - Expanded 2166593 -> 3127236 (decompressed 144.3 percent)

[09:09:41] Called DecompressByteArray: compressed_data_size=2166593 data_size=3127236, decompressed_data_size=3127236 diff=0

[09:09:41] - Digital signature verified

[09:09:41]

[09:09:41] Project: 7507 (Run 0, Clone 95, Gen 27)

[09:09:41]

[09:09:41] Assembly optimizations on if available.

[09:09:41] Entering M.D.

[09:09:47] Mapping NT from 8 to 8

[09:09:47] Completed 0 out of 500000 steps (0%)

[09:12:46] Completed 5000 out of 500000 steps (1%)

 

 

 

Edit3:

Before I managed to stop the machine, both WUs actually went up, so it looks like things have now sorted themselves out. Posting part of that log as well:

 

[09:45:46] Completed 60000 out of 500000 steps (12%)

[09:47:13] - Autosending finished units... [November 10 09:47:13 UTC]

[09:47:13] Trying to send all finished work units

[09:47:13] Project: 7504 (Run 2, Clone 170, Gen 0)

 

 

[09:47:13] + Attempting to send results [November 10 09:47:13 UTC]

[09:47:13] - Reading file work/wuresults_05.dat from core

[09:47:13] (Read 10185928 bytes from disk)

[09:47:13] Connecting to http://128.143.199.97:8080/

[09:47:48] Posted data.

[09:47:48] Initial: 0000; - Uploaded at ~284 kB/s

[09:47:48] - Averaged speed for that direction ~286 kB/s

[09:47:48] + Results successfully sent

[09:47:48] Thank you for your contribution to Folding@Home.

[09:47:48] + Number of Units Completed: 3

 

[09:47:48] Project: 7504 (Run 20, Clone 80, Gen 2)

 

 

[09:47:48] + Attempting to send results [November 10 09:47:48 UTC]

[09:47:48] - Reading file work/wuresults_06.dat from core

[09:47:48] (Read 10205761 bytes from disk)

[09:47:48] Connecting to http://128.143.199.97:8080/

[09:48:23] Posted data.

[09:48:23] Initial: 0000; - Uploaded at ~284 kB/s

[09:48:23] - Averaged speed for that direction ~286 kB/s

[09:48:23] + Results successfully sent

[09:48:23] Thank you for your contribution to Folding@Home.

[09:48:23] + Number of Units Completed: 4

 

[09:48:23] + Sent 2 of 2 completed units to the server

[09:48:23] - Autosend completed

[09:48:45] Completed 65000 out of 500000 steps (13%)
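For anyone who wants to see at a glance which units are stuck, the log excerpts above can be summarized with a small script. This is only a sketch: it assumes the log text uses the same wording as the lines quoted here ("Reading file work/wuresults_NN.dat", "Error: Could not transmit unit NN", "Results successfully sent"), and the function name `summarize_uploads` is made up for illustration. It counts only work server failures, not the collection server fallback lines.

```python
import re
from collections import defaultdict

# Patterns taken from the log lines quoted in this thread.
READING = re.compile(r"Reading file work/wuresults_(\d+)\.dat")
FAILED = re.compile(r"Error: Could not transmit unit (\d+)")
SENT = re.compile(r"\+ Results successfully sent")


def summarize_uploads(log_text):
    """Return {unit: {"failed": n, "sent": bool}} for the given log text."""
    summary = defaultdict(lambda: {"failed": 0, "sent": False})
    current = None  # unit whose results file was read most recently
    for line in log_text.splitlines():
        match = READING.search(line)
        if match:
            current = match.group(1)
            continue
        match = FAILED.search(line)
        if match:
            summary[match.group(1)]["failed"] += 1
        elif SENT.search(line) and current is not None:
            # The success line does not name the unit, so attribute it
            # to the results file that was read just before it.
            summary[current]["sent"] = True
    return dict(summary)
```

Feeding it the whole log file as one string should give one entry per queue slot, which makes it easier to tell a unit that eventually went up from one that is still waiting.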

 

Edited by -alias-

It is probably Stanford that is having trouble. You can try stopping the client and deleting the work directory along with queue.dat and unitinfo.txt. There is no guarantee that it will work, but it will not make things any worse either.
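The cleanup step could be scripted roughly as below. This is a sketch under assumptions: the client must already be stopped, `client_dir` is whatever directory your client runs from (the path is yours to supply), and the function name `reset_client_state` is hypothetical. The directory name `work` is taken from the paths in the log above. Note that removing the work directory also discards any finished units that are still waiting to upload.

```python
import shutil
from pathlib import Path


def reset_client_state(client_dir):
    """Delete work/, queue.dat and unitinfo.txt under client_dir.

    Returns the names that were actually removed. Run this only
    after the client has been stopped.
    """
    base = Path(client_dir)
    removed = []

    work = base / "work"
    if work.is_dir():
        shutil.rmtree(work)  # removes the directory and everything in it
        removed.append(work.name)

    for name in ("queue.dat", "unitinfo.txt"):
        target = base / name
        if target.is_file():
            target.unlink()
            removed.append(name)

    return removed
```

Anything that does not exist is simply skipped, so running it twice is harmless.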

 

It apparently got a 6900 WU, but then I only got I/O errors, and after that a standard smp (7504) came down, which it is running now.

I am still only getting regular smp on the one 2600K machine that is set up to run bigbeta WUs.

I am getting nothing but FILE_IO_ERROR on one machine now. Is Stanford having trouble, or is something wrong with the machine?

 

I have had much of the same on two machines, but I was not around, so they just kept retrying, and in the end each of them pulled down a P6099. One of them is a 2600K. Otherwise, 6903 and 6904 have come down on the others.

Edited by -alias-

I am seeing it under Linux too. Three minutes ago I sent up a finished 6904 from a 2600K and got a 6901 in return, so now I have two 2600K machines that did not get bigbeta. It will probably sort itself out after a while; we just have to be patient. But I see on the Stanford forum that many people are complaining about exactly this, that a lot of standard smp is going out.


When you start the machine, 2 windows open, and folding starts in one of them. If it does, and you also find a line that says:

 

starting 12 threads

 

then you have probably done most things right.

 

Edit: The bit about 12 threads was slightly off. Check instead that System Monitor reports 12 threads for the CPU.
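If you would rather check from a terminal than from System Monitor, the same number can be read with a couple of lines of Python. This is just a sketch: the function names are made up, and 12 is the expected count for the setup discussed here, so adjust it for your own machine.

```python
import os


def logical_cpu_count():
    """Return the number of logical CPUs the OS reports, or None if unknown."""
    return os.cpu_count()


def matches_expected_threads(expected=12):
    """True if the OS reports exactly the expected number of logical CPUs."""
    return logical_cpu_count() == expected


if __name__ == "__main__":
    print("Logical CPUs reported:", logical_cpu_count())
    print("Matches expected 12:", matches_expected_threads(12))
```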

Edited by ei57

System Monitor does report 12 threads for the CPU here, so it seems to be set up correctly, but it got yet another 7504.

 

3 bigadv and 2 regular smp are running now.


Hi,

 

I see that you are posting these graphical overviews. How do you do that? Is there a separate program you have to run to get that overview?

I have set up FAH to run as a service, so I cannot find anywhere to check progress on the WUs.

 

 

Regards,

OEH
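Even without a monitoring program, progress can be read from the client log, since the core writes lines like "Completed 65000 out of 500000 steps (13%)" as seen earlier in the thread. The sketch below pulls out the most recent such line. It assumes you can read the log file from the directory the service runs in (the file name and location are for you to fill in), and the function name `latest_progress` is hypothetical.

```python
import re

# Matches progress lines in the format shown in the logs in this thread,
# e.g. "[09:48:45] Completed 65000 out of 500000 steps (13%)".
PROGRESS = re.compile(
    r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps\s+\((\d+)%\)"
)


def latest_progress(log_text):
    """Return the last progress entry in log_text as a dict, or None."""
    last = None
    for match in PROGRESS.finditer(log_text):
        last = match
    if last is None:
        return None
    return {
        "time": last.group(1),
        "done": int(last.group(2)),
        "total": int(last.group(3)),
        "percent": int(last.group(4)),
    }
```

Reading the log file into a string and passing it to `latest_progress` gives the timestamp and percentage of the newest checkpoint, which is roughly what the graphical tools display.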

 

 

 

